UM E-Theses Collection (澳門大學電子學位論文庫)
- Title
-
Business information extraction from web
- English Abstract
-
University of Macau
Abstract
BUSINESS INFORMATION EXTRACTION FROM WEB
by Lam Man I
Thesis Supervisor: Associate Professor Gong Zhiguo
Master of Science in E-Commerce Technology

With the continuous development of internet technologies, the World Wide Web has become one of the most important information repositories. It has changed the traditional ways of preserving and searching for information. Nowadays, search engines are a very popular means of searching for information on the web. However, a search engine presents only a list of documents rather than specific answers or pieces of knowledge. For this reason, information extraction from the web has been studied extensively. Information in web pages follows no presentation standard and is rarely organized in a consistent format, so extracting appropriate and useful information from web pages is challenging. In addition, as e-commerce becomes an increasingly important channel for business, many web sites have been set up, and the information they contain is a useful source of knowledge that can benefit both businesses and consumers.

Many web extraction systems, called web wrappers, have been developed. They can be grouped into several classes: those based on languages for wrapper development, HTML-aware, natural language processing based, wrapper induction based, modeling based, and ontology based. In this thesis, some existing techniques are investigated, and then our work on web information extraction is presented.

In the initial stage of our research, we applied Hypertext Markup Language (HTML) analysis and Extensible Stylesheet Language Transformations (XSLT) pattern template approaches to extract web information with a specific structure. Although these approaches can guarantee high accuracy, they require knowledge of the HTML structure and of how to create the XSLT pattern.

To overcome these weaknesses, systems based on pattern classification approaches were then developed. After a human training process, extraction rules for the different extraction patterns are generated and represented as regular expressions. The rules are then applied to a variety of web pages to extract information. This type of system performs well when the extraction patterns have a fairly stable structure. To extract information in a more general way, systems based on an extraction field classification approach were developed. We classify the patterns of information into static and non-static structures and use a different technique to extract the relevant information from each. Finally, all the extracted information is packaged into a machine-readable Extensible Markup Language (XML) format or loaded into a relational database.
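The rule-based pipeline described in the abstract (extraction rules expressed as regular expressions, applied to a page, with the results packaged as XML) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the rule set `RULES`, the field names, the sample page `SAMPLE`, and the helper functions are all hypothetical, and the rules are written by hand here, whereas the thesis generates them through a training process.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical extraction rules: field name -> regular expression with one
# capture group. In the thesis such rules come out of a human training
# process; here they are hand-written purely for illustration.
RULES = {
    "name": re.compile(r"<h1[^>]*>(.*?)</h1>", re.S),
    "price": re.compile(r"Price:\s*\$?([0-9]+(?:\.[0-9]{2})?)"),
    "phone": re.compile(r"Tel:\s*([0-9][0-9\- ]{6,}[0-9])"),
}


def extract_fields(html: str) -> dict:
    """Apply every rule to the page and keep the first match of each field."""
    fields = {}
    for field, pattern in RULES.items():
        match = pattern.search(html)
        if match:
            fields[field] = match.group(1).strip()
    return fields


def to_xml(fields: dict, root_tag: str = "record") -> str:
    """Package the extracted fields as a flat XML record."""
    root = ET.Element(root_tag)
    for field, value in fields.items():
        ET.SubElement(root, field).text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample page with a fairly stable structure.
SAMPLE = (
    '<html><body><h1 class="t">Acme Widget</h1>'
    "<p>Price: $19.99</p><p>Tel: 2888-1234</p></body></html>"
)

if __name__ == "__main__":
    print(to_xml(extract_fields(SAMPLE)))
```

Fixed regular expressions of this kind only work while the page layout stays stable, which matches the abstract's remark that rule-based systems perform well on stable extraction patterns and motivates the more general field classification approach.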
- Issue date
-
2008.
- Author
-
Lam Man I (林敏兒)
- Faculty
-
Faculty of Science and Technology
- Department
-
Department of Computer and Information Science
- Degree
-
M.Sc.
- Subject
-
Business -- Data processing
Information storage and retrieval systems
Database management
Data mining
- Supervisor
-
Gong, Zhi Guo
- Location
- 1/F Zone C
- Library URL
- 991003003399706306