UM E-Theses Collection (澳門大學電子學位論文庫)

Title

Knowledge discovery in medical diagnosis data

English Abstract

KNOWLEDGE DISCOVERY IN MEDICAL DIAGNOSIS DATA
by Zhang Siqi
Thesis Supervisor: Professor Gong Zhi-Guo
Co-Supervisor: Professor Dong Ming-Chui
E-Commerce Technology

This thesis was conducted under the research project "Network-based Intelligent Home Healthcare System (NIHHS)", which aims to serve as a medical diagnosis assistant for patients at home. Medical knowledge is the kernel of successful medical intelligence. Consequently, autonomously learning and discovering effective medical knowledge via machine learning plays an important role in this research project.

Knowledge discovery in databases (KDD) is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. It can be applied in any field that has accumulated a large amount of data. Traditional Chinese Medicine cases are one of the most important fields in which KDD techniques can be applied. Because the original free-text format of the cases is far from what knowledge discovery algorithms require, data preprocessing is a very important phase of the entire KDD process. By analyzing the characteristics of Traditional Chinese Medicine diagnoses, we designed several preprocessing algorithms for the diagnosis data, implemented tools based on these algorithms, and established an overall preprocessing system for Traditional Chinese Medicine diagnoses. The functions of these algorithms include unifying metric units and representations, erasing redundant attributes, filling in missing values, replacing the original alphabetic descriptions of attributes with numeric ones, and so on.

Data processing leaves a large number of redundancies in the result, which produce numerous, jumbled pieces of information and many unnecessary rules. In this thesis, we apply minimum support, a measure from association rule algorithms, to the decision nodes in every layer to control the number of potential branches. Since the discarded data contain little information, this procedure guarantees that enough information remains in the retained data. After all of the preprocessing, the diagnosis data were transformed from the original free-text descriptions into standard, clear, normalized data formats.

KDD techniques can discover many kinds of knowledge from a database, and diagnosis rules are an important kind. This thesis addresses decision tree discovery algorithms, applies them to the preprocessed Traditional Chinese Medicine diagnosis database, and mines potential diagnosis rules from it. The traditional decision tree has an obvious shortcoming: there is no feedback while the tree is being built, so it cannot re-evaluate or revise its selections. We therefore combine decision tree algorithms with a feedback method to improve the correctness and performance of the system. The Feedback Decision Tree solution includes addressing the limitation of attribute selection, finding an optimal tree, and reducing the cost of batch learning.

Measures of pattern interestingness are essential for the efficient discovery of patterns of value to the given user. Such measures can be used to rank the discovered patterns according to their interestingness. More importantly, such measures can be used to guide and constrain the discovery process, improving search efficiency by pruning away subsets of the pattern space that do not satisfy prespecified interestingness constraints. This also makes the rules more transparent to users.

Finally, all of the work described above constitutes a complete Traditional Chinese Medicine Diagnosis KDD System. We also validate the work with experiments.

Key words: Data Mining, Knowledge Discovery in Database (KDD), Decision Tree, Feedback Decision Tree, heart disease
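The abstract's idea of applying minimum support at each decision node can be illustrated with a minimal sketch. This is not the thesis's implementation: the function name `split_with_min_support`, the attribute names `pulse` and `diagnosis`, the toy records, and the 0.2 threshold are all assumptions made up for illustration. Support is taken here to be the share of all records that fall into a branch.

```python
def split_with_min_support(records, attribute, total, min_support):
    """Split records on `attribute`, keeping only branches whose share
    of all `total` records is at least `min_support` (hypothetical helper)."""
    groups = {}
    for rec in records:
        groups.setdefault(rec[attribute], []).append(rec)
    # Keep a branch only if enough records support it.
    kept = {v: g for v, g in groups.items() if len(g) / total >= min_support}
    # Count the records that fall into discarded (low-support) branches.
    discarded = sum(len(g) for v, g in groups.items() if v not in kept)
    return kept, discarded


# Toy data, invented for this sketch: 5 "rapid", 4 "slow", 1 "irregular".
pairs = [("rapid", "A")] * 5 + [("slow", "B")] * 4 + [("irregular", "A")]
records = [{"pulse": p, "diagnosis": d} for p, d in pairs]

kept, discarded = split_with_min_support(records, "pulse", len(records), 0.2)
print(sorted(kept), discarded)
```

With a threshold of 0.2, the single "irregular" record forms a branch with support 0.1 and is dropped, so the node keeps two branches instead of three. Applying the same check at every layer limits how many branches, and hence how many rules, the tree can produce.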

Chinese Abstract

List of Figures
List of Tables
Glossary
CHAPTER 1 Introduction
  1.1 Background
  1.2 Development of Medical Intelligence Diagnosis
  1.3 Introduction of Data Mining and KDD
    1.3.1 Basic Definition
    1.3.2 The Six-Step DMKD Process Model
    1.3.3 The Data-Mining Step of the KDD Process
  1.4 Works of this Thesis
CHAPTER 2 Data Preprocessing
  2.1 The Characters of Medical Records
  2.2 The Definition of Data Preprocessing
  2.3 Data Preprocessing in Medical Data
    2.3.1 Data Integration
    2.3.2 Makeup the Missing Data
    2.3.3 Discretization
    2.3.4 Data Conversion
    2.3.5 Feature Selection
  2.4 Summary
CHAPTER 3 Decision Tree
  3.1 Data Classification
  3.2 Decision Tree Induction Techniques
  3.3 Multiclass Feedback Decision Tree
    3.3.1 Multiclass as Multiple Two-Class Problems
    3.3.2 Feedback Decision Tree
  3.4 Extracting Classification Rules
  3.5 Reprocessing the Result of Mining
  3.6 Summary
CHAPTER 4 Simulation and Experimental Study
  4.1 Introduction
  4.2 User-System Interaction
  4.3 Data Preprocessing
    4.3.1 Data Integration
    4.3.2 Extracting and Clearing the Database
  4.4 Knowledge Discovery
    4.4.1 Building Decision Tree and Extraction Rules
    4.4.2 Decision Tree with Feature Selection
    4.4.3 Diagnosis Rules with Interestingness Measure
    4.4.4 Evaluations of Feedback DT
  4.5 Summary
CHAPTER 5 Conclusion
Bibliography
Appendix: Publications

Issue date

2005.

Author

Zhang, Si Qi

Faculty

Faculty of Science and Technology

Department

Department of Computer and Information Science

Degree

M.Sc.

Subject

Knowledge acquisition (Expert systems)

Information storage and retrieval systems

Data mining

Medicine -- Databases

Supervisor

Gong, Zhi Guo

Dong, Ming Chui

Location

1/F Zone C

Library URL

991008400359706306