
Design and Implementation of a Customizable Crawler for College Forums

Author: YuanJie
Tutor: XuZuo
School: Huazhong University of Science and Technology
Course: Communication and Information System
Keywords: Incremental Web Crawler; Information Extraction; XQuery Templates; Saxon
CLC: TP393.09
Type: Master's thesis
Year: 2013


As Internet use grows in China, the Internet has become almost a basic necessity of life, like food and clothing, for young college students, who readily accept new things. At the same time, campus forums have become the main platform where students express and exchange their views. To understand the hot topics on campus, it is worthwhile to build a campus network information management system. The forum spider designed and implemented in this thesis is a subsystem of that information management system; its main task is to collect forum data for later analysis.

When crawling forums, traditional general-purpose crawlers encounter a large number of duplicate links, which wastes resources and lowers efficiency. On the other hand, most existing forum crawlers are tailored to specific users and therefore work on only a single forum. This thesis analyzes the structural differences among many forums, studies the features and system architectures of several mainstream crawlers, and proposes an implementation of an incremental web crawler system that can be applied to many forums. After analyzing the system requirements, the thesis designs each sub-module and then details the implementation of each module.

The main work of the thesis includes the following aspects. First, it analyzes the features of many campus forums, extracts their commonalities and differences, and determines a crawling mode for each style of forum. Second, according to the popularity and features of forum sections, it determines an incremental crawling strategy based on the weight of forum sections. Finally, to improve the versatility and flexibility of the crawler, it uses XQuery templates to parse web pages and extract content.

After the crawler was deployed and run, analysis of the test results showed that the spider system ran stably, so the system meets the design requirements and is practically useful.
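
The abstract does not give the details of the weight-based incremental strategy, so the following is only a minimal sketch of one way such a scheduler could look. It assumes that a section's revisit interval is inversely proportional to its weight and that duplicate links are filtered with an in-memory set; the names SectionScheduler, base_interval, add_section, next_section and is_new are hypothetical and do not come from the thesis.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class ScheduledVisit:
    # Ordered by due_time first, then by name so that ties are deterministic.
    due_time: float
    name: str
    weight: float = field(compare=False)


class SectionScheduler:
    """Hypothetical scheduler: heavier (hotter) sections are revisited more often."""

    def __init__(self, base_interval=3600.0):
        # Assumed rule: revisit interval = base_interval / weight (in seconds).
        self.base_interval = base_interval
        self._queue = []
        self._seen_urls = set()

    def add_section(self, name, weight, now=0.0):
        if weight <= 0:
            raise ValueError("weight must be positive")
        heapq.heappush(self._queue, ScheduledVisit(now, name, weight))

    def next_section(self):
        """Pop the section that is due first and reschedule its next visit."""
        visit = heapq.heappop(self._queue)
        next_due = visit.due_time + self.base_interval / visit.weight
        heapq.heappush(
            self._queue, ScheduledVisit(next_due, visit.name, visit.weight)
        )
        return visit.name, visit.due_time

    def is_new(self, url):
        """Return True only the first time a URL is seen (duplicate-link filter)."""
        if url in self._seen_urls:
            return False
        self._seen_urls.add(url)
        return True


if __name__ == "__main__":
    scheduler = SectionScheduler(base_interval=60.0)
    scheduler.add_section("hot", weight=4.0)    # illustrative section names
    scheduler.add_section("quiet", weight=1.0)
    for _ in range(6):
        print(scheduler.next_section())
```

Under these assumptions, a section with four times the weight is visited roughly four times as often, which is the general effect a weight-based incremental strategy aims for. A real crawler would also persist the set of seen URLs between runs.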

Related Dissertations

  1. Study on Growth Monitoring Technique Based on Pixel Un-Mixing Method and HJ Remote Sensing Images in Paddy Rice,S511
  2. Active faults based radar image information extraction method applied research and demonstration,P542.3
  3. Based on high-resolution remote sensing data mining houses information extraction,TP751
  4. Scholar Resume Automatic Generation Based on Text Mining,TP391.1
  5. Object-Based Automatic Extraction of Change Information Based on High-Resolution Remote Sensing Image Research,P237
  6. Comparative Experimentation Study on Information Extraction of Object-Oriented Based on Quick Bird Image,P237
  7. Reptiles theme for Education News Design and Implementation,TP391.3
  8. GPU-based image search Chinese Research on key technologies of the retrieval,TP391.1
  9. Engineering News reported information extraction and applied research,G212
  10. Based on semi- structured text transporter protein substrate information extraction system,Q811.4
  11. Research on Information Extraction and Visualization in Program Comprehension,TP311.1
  12. Object-oriented Information Extraction of woodland,P237
  13. Study on Extraction of Coniferous Forest Information in Southern China,TP79
  14. Study on Information Extraction and the Dynamic Monitoring of Grassland Coverage in Three River Source Area,S812
  15. Design and Implementation of Web Information Extraction Based on DOM,TP393.09
  16. Research on Object-oriented Remote Sensing Image Information Extraction Technology,P237
  17. Research and Implementation of an Information Pre-process Platform of Public Opinion,TP393.09
  18. On Representation of Text Annotation in Database and Its Application,TP311.13
  19. Study on Technologies of Remote Sensing Feature Analysis and Information Extraction of Earthquake Disaster,P237
  20. City names addresses coding Matching,P208
  21. The laser detection automotive active anti-collision technology and systems research,TN247

CLC: > Industrial Technology > Automation technology,computer technology > Computing technology,computer technology > Computer applications > Computer network > General issues > The application of computer network