
Design and Implementation of Wrapper Generation System Based on Nested-Pattern in Web Pages

Author: ShenXun
Tutor: SongMaoQiang
School: Beijing University of Posts and Telecommunications
Course: Software Engineering
Keywords: Web Information Extraction; Deep Web; Noise Elimination; Suffix Tree
CLC: TP393.092
Type: Master's thesis
Year: 2010

Abstract


As the Web grows, more and more data becomes available on the Internet, and it is convenient to find the information we are interested in. We can send a query to a search engine to obtain that information, but we then face a huge amount of data. Data on the Internet is displayed as HTML code, which is semi-structured: it is easy for people to read but hard for a computer to process automatically. If we can extract the useful data from web pages and store it in a database, deep analysis becomes much easier. It is therefore important and necessary to extract useful information from web pages, which is the task of Web Information Extraction and Integration. Currently, wrapper generation is widely used to extract information from web pages automatically.

In this paper, we implement a wrapper generation system for Web Information Extraction and Integration. It automatically generates wrappers for web pages that contain nested-structured data. We construct a wrapper in four steps to extract information from Deep Web pages:

1. Pre-process the web pages and eliminate noisy data. We propose a new algorithm called ENDW, based on the "Query Keyword" and DOM trees, which preserves the integrity of the useful data.
2. Construct a suffix tree for a given web page using Ukkonen's algorithm. Suffix trees are used to discover all continuous repeated substrings. We treat the HTML code of a web page as a string: after the page has been processed in step 1, the noise-free HTML code is the input from which the suffix tree is built.
3. Search for all continuous repeated substrings using the suffix tree. In the Deep Web, the data records displayed in a web page are continuous repeated substrings, so the nested structure can be discovered from them. In the next step, the regular expression representing the pattern (structure) of the web page is abstracted from these continuous repeated substrings.
4. Generate a regular expression as the wrapper that represents the structure of the web pages.
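
Steps 3 and 4 can be illustrated with a minimal sketch. It is not the system described in the thesis: the repeat search below is a brute-force scan standing in for the suffix tree built with Ukkonen's algorithm, noise elimination (ENDW) is skipped, and the page is reduced to its tag sequence so that records differing only in text look identical. The names `TagTokenizer`, `tokenize`, `longest_tandem_repeat` and `generalize` are hypothetical and chosen for illustration only.

```python
import re
from html.parser import HTMLParser


class TagTokenizer(HTMLParser):
    """Collects a page's tags as a flat token list, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.tokens.append(f"</{tag}>")


def tokenize(html):
    parser = TagTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def longest_tandem_repeat(tokens):
    """Return (start, unit_len, copies) for the back-to-back repeat that
    covers the most tokens, or None if nothing repeats.

    Brute force (assumption: pages are small); a suffix tree would be used
    to find the same continuous repeats efficiently.
    """
    best, best_cover = None, 0
    n = len(tokens)
    for start in range(n):
        for unit_len in range(1, (n - start) // 2 + 1):
            unit = tokens[start:start + unit_len]
            copies = 1
            # Count how many times the unit follows itself immediately.
            while tokens[start + copies * unit_len:
                         start + (copies + 1) * unit_len] == unit:
                copies += 1
            cover = copies * unit_len
            if copies >= 2 and cover > best_cover:
                best, best_cover = (start, unit_len, copies), cover
    return best


def generalize(tokens):
    """Turn a tag sequence into a regular expression, collapsing each
    continuous repeat into (?:unit)+ and recursing into the unit so that
    nested repeats are collapsed as well."""
    found = longest_tandem_repeat(tokens)
    if found is None:
        return "".join(re.escape(t) for t in tokens)
    start, unit_len, copies = found
    end = start + unit_len * copies
    unit = tokens[start:start + unit_len]
    return (generalize(tokens[:start])
            + "(?:" + generalize(unit) + ")+"
            + generalize(tokens[end:]))


# A made-up result page: two records (rows), each with two fields (cells).
sample = ("<table><tr><td>a</td><td>b</td></tr>"
          "<tr><td>c</td><td>d</td></tr></table>")
wrapper = re.compile(generalize(tokenize(sample)))
```

On Python 3.7 or later, `wrapper.pattern` for this sample should read `<table>(?:<tr>(?:<td></td>)+</tr>)+</table>`: the outer group is the repeated record and the inner group is the repeated field, which is the nested pattern the abstract refers to. The sketch only collapses exact repeats, so records whose tag sequences differ (for example, rows with different numbers of cells) would need an alignment step that is not shown here.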


CLC: Industrial Technology > Automation technology, computer technology > Computing technology, computer technology > Computer applications > Computer network > General issues > The application of computer network > Web browser