
Research and Implementation of De-duplication Technology for Large-scale Short Texts

Author: YangHu
Tutor: YangShuQiang;HanWeiHong
School: National University of Defense Science and Technology
Course: Software Engineering
Keywords: Text De-duplication; Text Mining; ARFA; ARFA-SA
CLC: TP391.1
Type: Master's thesis
Year: 2007

Abstract


With the rapid development of computer science and communication technology, the volume of short texts such as instant messages, BBS posts, newsgroup posts, and e-mail has grown quickly. Although this growth has brought convenience to people's lives, it has also made useful information harder to obtain, because the volume of short texts has outgrown people's ability to manage it. At the same time, useless and harmful information seriously affects the decisions of government departments, companies, and other managers. Research shows that close to half of this massive volume of text messages is repeated information. Through de-duplication, users can not only optimize data storage but also find hot topics for analysis and decision-making.

Automatic de-duplication is a basic technology in text mining. It can be used not only in data preparation, such as data cleaning, data merging, and data exchange, but also in data analysis, such as duplicate record detection. At present, automatic de-duplication mainly comprises field matching techniques and duplicate record detection. Field matching techniques can effectively detect mismatches in a database, for example spelling mistakes, abbreviations, and redundant words. Duplicate record detection places duplicate or identical texts into the same category using machine learning and intelligent methods.

The application of text de-duplication techniques is restricted by two characteristics of short texts: each text is short, and the collection is large. Because feature selection is not effective for short texts, classification and clustering cannot be applied well to de-duplication.

With regard to the application of de-duplication in text mining, and taking user requirements into account, this thesis presents:

1. ARFA, an Association Rule and Feature Code Based Fast Remove Duplication Algorithm. Taking text attributes into account, ARFA implements de-duplication by differentiating texts through association rules and detecting duplicate texts through feature codes. Experiments show that the algorithm performs well, handles large-scale information effectively, and achieves a high compression ratio.
2. ARFA-SA, which implements de-duplication on the basis of ARFA. It rests on the hypothesis that similarity transfers when the similarity between texts exceeds a threshold value. Under this hypothesis, identical or similar texts are put into the same group through similarity computation.
3. The application of ARFA and ARFA-SA. Applying the de-duplication algorithms in a data mining system provides duplicate record detection and data storage optimization. Duplicate record detection identifies users who send group messages, users who receive group messages, and the related short text IDs. Storage optimization removes or merges redundant data.
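
The abstract does not give the details of ARFA or ARFA-SA, so the following Python sketch is only an illustration of the two ideas it names: detecting duplicates through a feature code, and grouping texts under a similarity-transfer hypothesis. Everything concrete here is an assumption rather than the thesis's method: the feature code is taken to be an MD5 hash of the normalized text, similarity is taken to be Jaccard similarity over character bigrams, the threshold of 0.6 is arbitrary, and the association-rule step of ARFA is omitted entirely. All function names are hypothetical.

```python
import hashlib
from itertools import combinations


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(text.lower().split())


def feature_code(text: str) -> str:
    """Hypothetical feature code: MD5 digest of the normalized text."""
    return hashlib.md5(normalize(text).encode("utf-8")).hexdigest()


def bigrams(text: str) -> set:
    """Character bigrams of the normalized text."""
    s = normalize(text)
    return {s[i:i + 2] for i in range(len(s) - 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def group_duplicates(texts, threshold=0.6):
    """Return groups of indices of identical or similar texts."""
    parent = list(range(len(texts)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    # Stage 1: texts with the same feature code are exact duplicates.
    first_seen = {}  # feature code -> index of first text with that code
    for i, text in enumerate(texts):
        code = feature_code(text)
        if code in first_seen:
            union(first_seen[code], i)
        else:
            first_seen[code] = i

    # Stage 2: similarity transfer. If A is similar to B and B is similar
    # to C, union-find places A, B and C in one group even when A and C
    # are not directly above the threshold.
    reps = sorted(first_seen.values())
    grams = {i: bigrams(texts[i]) for i in reps}
    for i, j in combinations(reps, 2):
        if jaccard(grams[i], grams[j]) >= threshold:
            union(i, j)

    groups = {}
    for i in range(len(texts)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


messages = [
    "Big sale today only",
    "big  sale TODAY only",
    "Big sale today only!",
    "Meeting moved to 3pm",
]
print(group_duplicates(messages))
```

Stage 1 costs one hash per text, which is what makes a feature-code pass attractive for large collections. Stage 2 as written compares every pair of remaining texts, so it is quadratic and would need some form of candidate pruning before it could handle the data volumes the abstract describes.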

Related Dissertations

  1. The Study of Topic-Oriented IT News with Search Engine and Web Page Analysing,TP393.092
  2. Web text mining,TP393.09
  3. Research on Sentiment Tendency of Online Public Opinions,G206
  4. The Research on Fuzzy C-means Documents Clustering Based on Ant Colony Optimization,TP391.1
  5. Research on Reminding Algorithms for Creative Design Support Engine,TP391.1
  6. Key Techniques of Text Mining on Criminal Cases,TP391.1
  7. Text Mining and Its Application in Text Retrieval,TP391.3
  8. Biomedical Text Mining and Its Application in Gene Regulatory Information Analysis,R319
  9. The Research of Text Preprocessing Based on Web Mining and Its Application,TP391.1
  10. SOM-based Text Clustering and Its Application to Search Results,TP311.13
  11. Study on Text Categorization Method Based on Support Vector Machine,TP391.1
  12. Internet Public Opinion Discovery and Opinion Mining Technology,TP393.09
  13. The Research and Implementation of the Agricultural Information Acquisition System Based on Web Classification Technology,TP393.09
  14. Research on Evaluation Mechanism for Creative Design Supporting Engine,TP182
  15. The Research and Application of IT Project-Oriented Scope Planning Methods,F270.7
  16. Research of Opinion Mining Model on Web Product Reviews,F49
  17. Research on Web Text Mining,TP391.1
  18. Research on Several Models in Text Classification and Clustering,TP391.1
  19. Usage of Web Data Mining in Public Security Monitor,TP311.13
  20. Study on the System of Chinese Automatic Word Segmentation Based on Text Information of BBS,TP391.1

CLC: Industrial Technology > Automation technology, computer technology > Computing technology, computer technology > Computer applications > Information processing > Text Processing