
An Approach to the Key Problems of Web Information Extraction Based on Prefix Expression

Author: SunLing
Tutor: ZengQingTian
School: Shandong University of Science and Technology
Course: Applied Computer Technology
Keywords: Web information extraction; prefix expression; wrapper; crawler algorithm; Web noise removal
CLC: TP391.1
Type: Master's thesis
Year: 2010


The rapid development of the World Wide Web has led to a rapid expansion of Web data. Given the massive amount of Web data, the phenomenon of "rich data, poor information" is attracting more and more attention, and information extraction technology has emerged to address it. Current Web information extraction methods target particular sites and generate wrappers manually, so they adapt poorly to changes in Web page structure and are hard to port to other sites. To extract information automatically, this thesis presents a new Web information extraction method based on prefix expressions, which works on Web pages of the same domain, the same level, and the same kind. The main work of this thesis is as follows:

(1) A Web noise removal method based on the comparison of DOM trees is proposed and implemented. First, two randomly chosen pages are compared to find candidate noise nodes. Second, more pages are compared to filter out false noise nodes. Finally, the noise set is identified by checking the location of each noise node. The program can then remove every noise node in a Web page with the help of the noise set, which improves its efficiency and accuracy.

(2) A Web information extraction method based on prefix expressions is proposed and implemented. First, a number of sample pages are selected at random, and a prefix expression queue is generated for each page. Second, the final queue is identified by comparing the weights of the different queues. Finally, information is extracted with the help of the final queue. Obtaining the prefix expression queue requires no user participation, which increases the automation of the program.

The proposed method does not require any prior knowledge of the target pages or their structure, such as page layout, page style, or page subject. It does not require users to provide special training samples or source code annotations, because it selects random samples instead. It does not require any user participation when extracting information. To some extent, these features increase the automation of the program and improve its robustness and extensibility.
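
The noise removal idea in (1) can be illustrated with a minimal Python sketch. It is not the thesis's implementation, which the abstract does not describe in detail. The sketch assumes well-formed HTML, treats a node as a noise candidate when it has the same positional tag path and the same text in two pages, and keeps a candidate only if every further page confirms it. The names `BlockCollector`, `page_blocks`, and `noise_set` are hypothetical, chosen for this example.

```python
from html.parser import HTMLParser

# Elements without a closing tag; ignored so the open-element stack stays balanced.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class BlockCollector(HTMLParser):
    """Collect (path, text) for every element of a well-formed page.

    The path encodes the node's location, e.g. /html[1]/body[1]/div[2].
    """

    def __init__(self):
        super().__init__()
        self.stack = []            # open elements as (path, text_parts)
        self.child_counts = [{}]   # per-depth sibling counters by tag name
        self.blocks = set()

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        counts = self.child_counts[-1]
        counts[tag] = counts.get(tag, 0) + 1
        parent = self.stack[-1][0] if self.stack else ""
        self.stack.append((f"{parent}/{tag}[{counts[tag]}]", []))
        self.child_counts.append({})

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.stack:
            return
        path, parts = self.stack.pop()
        self.child_counts.pop()
        text = " ".join(" ".join(parts).split())  # normalize whitespace
        if text:
            self.blocks.add((path, text))
        if self.stack:
            self.stack[-1][1].append(text)  # a parent's text includes its children's

    def handle_data(self, data):
        if self.stack:
            self.stack[-1][1].append(data)


def page_blocks(html):
    """Return the set of (path, text) blocks found in one page."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    return collector.blocks


def noise_set(pages):
    """Blocks with the same location and text in every sample page."""
    first, second, *rest = pages
    candidates = page_blocks(first) & page_blocks(second)  # candidate noise nodes
    for page in rest:
        candidates &= page_blocks(page)  # drop false noise nodes
    return candidates
```

Under these assumptions, `page_blocks(page) - noise_set(pages)` leaves the blocks that are specific to one page. Exact text matching is a deliberate simplification: real pages would likely need a tolerance for small differences such as dates or counters inside otherwise shared blocks.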

Related Dissertations

  1. The Design and Implementation of Mediator and Wrapper Mechanism in Massive Multi-Database Integration,TP311.13
  2. Studies on Key Planting Skills of Cigar-Wrapper Tobacco,S572
  3. Web Page Attribute Extraction Method Research,TP391.1
  4. Template independent web information extraction,TP393.092
  5. Research on Data Acquisition and Topic Analysis of Online Public Opinion,TP393.09
  6. Research on Crawling Deep Web Information,TP393.09
  7. Design and Implementation of Web Information Extraction Based on DOM,TP393.09
  8. Lightweight Intrusion Detection System Based on Feature Selection,TP393.08
  9. Application Research of Heterogeneous ERP Database Integration in Network Auditing,TP311.52
  10. Research on Web-based Opinion Analysis for Stock Reviews,TP391.1
  11. Technology for Domain-Oriented Automatic Information Extraction from Semi-Structured Web,TP391.1
  12. Research of Enterprise Competitive Intelligence System Based on Data Processing Center,F272
  13. Research and Implementation of an Information Pre-process Platform of Public Opinion,TP393.09
  14. Design and Implementation of Wrapper Generation System Based on Nested-Pattern in Web Pages,TP393.092
  15. Development of human technology - based web content extraction system,TP393.092
  16. Research and Implementation of Web Page Segmentation Algorithm MFPS Based on Multi-Feature,TP393.092
  17. Research on Data Extraction and Schema Recognition on Deep Web,TP393.09
  18. Research and Implementation of Agent Wrapper Model in System Integration,TP311.52
  19. Research on Competitive Information Extraction Based on Web,TP391.1
  20. Research and Application of Information Extraction -based the multidimensional semantics Internet drugs,TP393.09

CLC: Industrial Technology > Automation Technology, Computer Technology > Computing Technology, Computer Technology > Computer Applications > Information Processing > Text Processing