University of Macau Library | UML Digital Resources Hub

UM Dissertations & Theses Collection (澳門大學電子學位論文庫)

Title

Business information extraction from web

English Abstract

University of Macau
Abstract
BUSINESS INFORMATION EXTRACTION FROM WEB
by Lam Man I
Thesis Supervisor: Associate Professor Gong Zhiguo
Master of Science in E-Commerce Technology

With the continuous development of internet technologies, the World Wide Web has become one of the most important information repositories. It has altered the traditional ways of preserving and searching for information. Nowadays, search engines are a very popular means of searching for information on the web. However, a search engine only presents a list of documents rather than specific answers or pieces of knowledge. For this reason, information extraction from the web has been studied extensively. Information in web pages follows no presentation standard and is not organized in a consistent format, so extracting appropriate and useful information from web pages is challenging. In addition, as e-commerce becomes an increasingly important way of doing business, many web sites are being set up, and the information they contain is a useful source of knowledge that can benefit both businesses and consumers.

Many web extraction systems, called web wrappers, have been developed. They can be classified into several classes: languages for wrapper development, HTML-aware, natural language processing based, wrapper induction based, modeling based, and ontology based. In this thesis, some existing techniques are investigated, and then our work on web information extraction is presented. In the initial stage of our research, we applied Hypertext Markup Language (HTML) analysis and Extensible Stylesheet Language Transformations (XSLT) pattern template approaches to extract web information with a specific structure. Although these approaches can guarantee high accuracy, they require knowledge of the HTML structure and of how to create the XSLT pattern.

To overcome these weaknesses, systems based on pattern classification approaches were further developed. After a human training process, extraction rules for the different extraction patterns are generated and represented as regular expressions. The rules are then applied to a variety of web pages to extract information. This type of system performs well when the extraction patterns have a rather stable structure. To extract information in a more general way, a system based on an extraction field classification approach was developed. We classified the patterns of information into static and non-static structures, and then used different techniques to extract the relevant information from each. As a final result, all the extracted information is packaged into a machine-readable Extensible Markup Language (XML) format or loaded into a relational database.
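As a rough illustration of the regular-expression approach described in the abstract, the following minimal Python sketch applies a small set of extraction rules to a page and packages the result as XML. It is not the thesis's implementation: the rules are written by hand here rather than generated by a training process, and the field names, patterns, and sample page are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical extraction rules: field name -> regular expression with one
# capture group. In the thesis such rules come out of a training step; here
# they are hand-written purely for illustration.
RULES = {
    "name": re.compile(r'<h1 class="product">\s*(.*?)\s*</h1>', re.S),
    "price": re.compile(r'Price:\s*\$?([0-9]+(?:\.[0-9]{2})?)'),
    "phone": re.compile(r'Tel:\s*([0-9][0-9 \-]{6,}[0-9])'),
}


def extract_fields(html):
    """Apply every rule to the page and keep the first match of each field."""
    fields = {}
    for name, pattern in RULES.items():
        match = pattern.search(html)
        if match:
            fields[name] = match.group(1)
    return fields


def to_xml(fields):
    """Package the extracted fields as a flat XML record."""
    root = ET.Element("record")
    for name, value in fields.items():
        ET.SubElement(root, name).text = value
    return ET.tostring(root, encoding="unicode")


# Made-up sample page, assumed to follow a stable layout.
SAMPLE_PAGE = """
<html><body>
<h1 class="product"> Blue Widget </h1>
<p>Price: $19.90</p>
<p>Tel: 2888 1234</p>
</body></html>
"""

record = extract_fields(SAMPLE_PAGE)
xml_text = to_xml(record)
print(xml_text)
```

Rules of this kind only work while the page layout stays stable, which matches the limitation noted in the abstract: a page that renders the price without the `Price:` label would simply yield no `price` field.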

Issue date

2008.

Author

Lam, Man I (林敏兒)

Faculty

Faculty of Science and Technology

Department

Department of Computer and Information Science

Degree

M.Sc.

Subject

Business -- Data processing

Information storage and retrieval systems

Database management

Data mining

Supervisor

Gong, Zhi Guo

Location

1/F Zone C

Library URL

991003003399706306