
Research on Web Information Extraction Tool

Author: LiangHongWei
Tutor: XuJianChao
School: Changchun University of
Course: Computer Software and Theory
Keywords: HTML Information Extraction DOM NekoHTML Web Page
CLC: TP393.092
Type: Master's thesis
Year: 2011


With the development of technology, computers have become increasingly widespread, and more and more people browse information on the Internet. People now use the Internet in daily life, work, and business, and the Web has become an important way of obtaining information. Web pages contain text, images, video, music, and so on. Different people are interested in different Web content, and the content a reader does not care about is scattered around the content the reader does care about; it distracts attention and makes Web pages inconvenient to read. This paper presents a DOM-based Web information extraction method that filters out the content a user is not interested in and leaves only the content of interest. Rather than searching mechanically for the content of interest, the method deletes the content that is not of interest.
First, working in the Eclipse development environment, we use the open-source HTML parser NekoHTML to parse a Web page into a DOM tree. A depth-first search then recursively traverses every node of the DOM tree and determines whether the node contains information of interest. Nodes that contain information of interest are preserved; nodes that do not are deleted. The extraction algorithm is implemented in the Java programming language, and the graphical user interface is developed with JSP and Servlets. Users choose the content they want through the graphical interface, and the extraction algorithm, based on that choice, deletes the content they are not interested in and returns only the content they want.
The paper first introduces the purpose of studying Web information extraction tools, then analyzes the advantages and disadvantages of 11 types of Web information extraction technology and introduces Web page types and Web page composition. It then introduces the DOM tree and the open-source HTML parser NekoHTML, and finally designs the Web information extraction algorithm and completes the implementation of the Web information extraction tool.
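
The delete-what-is-unwanted traversal described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract does not say how a node is judged to be "of interest", so the sketch assumes the user's choice arrives as a set of unwanted tag names (a hypothetical criterion). To stay self-contained it also uses the JDK's built-in XML parser as a stand-in for NekoHTML, which means the input must be well-formed XHTML. The class name DomPruner and its methods are likewise made up for illustration.

```java
import java.io.StringReader;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class DomPruner {

    // Stand-in for NekoHTML (assumption): the JDK XML parser, which only
    // accepts well-formed markup. It returns a standard W3C DOM Document.
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed", e);
        }
    }

    // Depth-first traversal: an element whose tag is in unwantedTags is
    // removed together with its subtree; every other node is kept and its
    // children are visited recursively.
    static void prune(Node node, Set<String> unwantedTags) {
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // saved before a possible removal
            boolean unwanted = child.getNodeType() == Node.ELEMENT_NODE
                    && unwantedTags.contains(
                            child.getNodeName().toLowerCase(Locale.ROOT));
            if (unwanted) {
                node.removeChild(child);
            } else {
                prune(child, unwantedTags);
            }
            child = next;
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html><body><p>News text</p><img src=\"ad.png\"/>"
                + "<div><script>track()</script><p>More text</p></div></body></html>");
        // Hypothetical user choice: drop images and scripts, keep the rest.
        prune(doc.getDocumentElement(), Set.of("img", "script"));
        System.out.println("img elements left: "
                + doc.getElementsByTagName("img").getLength());
        System.out.println("p elements left: "
                + doc.getElementsByTagName("p").getLength());
    }
}
```

Because prune works only against the org.w3c.dom interfaces, it should in principle accept a DOM tree produced by another parser as well; tag names are lower-cased before comparison since parsers differ in the case they report.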

Related Dissertations

  1. Research on Domain Entity Attribute and Event Extraction Technology,TP391.1
  2. Study on Growth Monitoring Technique Based on Pixel Un-Mixing Method and HJ Remote Sensing Images in Paddy Rice,S511
  3. Land Desertification in Qinghai Lake Landscape Pattern Change,X171
  4. Active faults based radar image information extraction method applied research and demonstration,P542.3
  5. Based on high-resolution remote sensing data mining houses information extraction,TP751
  6. Web Page Attribute Extraction Method Research,TP391.1
  7. High-performed Kernel Classification Methods Based on Multi-kernel Learning,TP391.41
  8. The Design and Development of a Embedded Browser Running on WinCE,TP393.092
  9. Research & Implementation of Web Page Layout in Embedded Browser,TP393.092
  10. The Research for Named Entity Recognition and Relation Extraction in Text,TP391.1
  11. The Mobile Widget Engine Researching and Implementing Based on the Webkit,TP391.3
  12. Reptiles theme for Education News Design and Implementation,TP391.3
  13. Hull section robotic welding path planning and offline programming,TP242
  14. Template independent web information extraction,TP393.092
  15. Internet-based personalized health information customized system build,TP311.52
  16. Personalized Multi-media Resources to A Vertical Search Engine Technology Research,TP391.3
  17. HTML text embedded browser - based design and Implementation,TP393.092
  18. The Research of Fragile Watermarking for Tamper Detection of Web Pages,TP393.092
  19. Design and Realization of a Web Page Gathering System with JavaScript Parsing,TP393.092
  20. The Information Website Development of Huludao Local Taxation Bureau,TP393.092
  21. Research and Application of the Web Information Mining,TP311.13

CLC: > Industrial Technology > Automation technology,computer technology > Computing technology,computer technology > Computer applications > Computer network > General issues > The application of computer network > Web browser