
Research on the Key Technology of Theme Crawler

Author: HuangZhengDe
Tutor: ZhangWen
School: Harbin Engineering University
Course: Computer Software and Theory
Keywords: Theme crawler; PageRank algorithm; Correlation calculation; URL deduplication
CLC: TP391.3
Type: Master's thesis
Year: 2013

Abstract


Nowadays, information is published and disseminated ever faster because of the rapid development of the Internet. The volume of information on the network has grown so large that information retrieval has become more difficult. Fortunately, users can rely on search engines for rapid retrieval, and search engines have become an everyday tool. The web crawler, one of the important components of a search engine, is mainly responsible for collecting web pages from the Internet. The quality of a search engine's service depends largely on the crawler's performance and on the quality of the pages it collects, so the crawler system is worth studying and improving. In recent years, the sheer size of the network has placed an increasing burden on general-purpose crawlers. A theme crawler, by contrast, targets a specific area and obtains the information users require, and it can achieve higher operating efficiency. Theme crawlers have therefore attracted widespread attention, and work in this area has high research and practical value. This thesis focuses on the technologies and characteristics involved in theme crawlers.
The main work and results are as follows:
(1) An improved PageRank algorithm is implemented. The algorithm partitions the web pages of the Internet into a number of blocks and applies divide-and-conquer: it calculates the PageRank value within each block, then combines these values according to the relative importance weight of each block to obtain the PageRank value of the whole web.
(2) A correlation algorithm is improved. A topic vector of appropriate dimension is established as the basis, each retrieved article is compressed to a vector of the same dimension as the topic reference vector, and a correlation formula is then used to decide whether a crawled page meets the requirements.
(3) A method is given for eliminating duplicate URLs when the crawler has crawled a very large number of pages. An index is built with the MD5 algorithm and organized as a tree structure. The index is kept in memory and the data is stored on the hard disk, which reduces the space complexity.
(4) Using the improved algorithms, a mobile phone theme crawler system is simulated and briefly implemented. With the code and an analysis of the experimental data, the thesis demonstrates the validity and rationality of the theory.
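The abstract does not give the correlation formula or the layout of the URL index, so the following is only a minimal sketch of the ideas in items (2) and (3), not the thesis's implementation. It assumes cosine similarity as the correlation measure and a plain in-memory set of MD5 digests in place of the tree-structured index with on-disk data. The names `relevance`, `SeenUrls`, the example topic weights, and the example URL are all hypothetical.

```python
import hashlib
import math
from collections import Counter


def relevance(page_terms, topic_vector):
    """Cosine similarity between a page and a topic vector (assumed measure).

    The page is first reduced to the topic's dimensions: only terms that
    appear in topic_vector are counted, so both vectors share one basis.
    """
    page_vector = Counter(t for t in page_terms if t in topic_vector)
    dot = sum(topic_vector[t] * c for t, c in page_vector.items())
    page_norm = math.sqrt(sum(c * c for c in page_vector.values()))
    topic_norm = math.sqrt(sum(w * w for w in topic_vector.values()))
    if page_norm == 0 or topic_norm == 0:
        return 0.0  # no overlap with the topic at all
    return dot / (page_norm * topic_norm)


class SeenUrls:
    """Tracks visited URLs by 16-byte MD5 digest instead of the full string.

    Assumption: a set stands in for the tree index described in the
    abstract; nothing is written to disk here.
    """

    def __init__(self):
        self._digests = set()

    def add(self, url):
        """Return True if the URL is new, False if it was already seen."""
        digest = hashlib.md5(url.encode("utf-8")).digest()
        if digest in self._digests:
            return False
        self._digests.add(digest)
        return True


# Hypothetical topic weights for a mobile phone theme.
topic = {"phone": 1.0, "android": 0.8, "battery": 0.5}
score = relevance("new android phone with large battery".split(), topic)

seen = SeenUrls()
first = seen.add("http://example.com/phones")
second = seen.add("http://example.com/phones")
```

In a crawler loop, a page would be kept only if `relevance` exceeds some chosen threshold, and a link would be queued only if `SeenUrls.add` returns True. Storing fixed-size digests rather than full URL strings is what keeps the in-memory index small.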

