
Text Categorization Based on Rough Set Theory

Author: XuXin
Advisor: HuangLiCan
School: Zhejiang University of Technology
Course: Signal and Information Processing
Keywords: Text categorization; Rough set theory; Feature weight; Matching rule
CLC: TP18
Type: Master's thesis
Year: 2011

Abstract


In recent years, with the development of the Internet and information technology, the amount of available information has grown continuously, making it increasingly urgent to find effective and convenient ways to manage and access it. Text categorization is a good solution to this problem and is a key topic in areas such as information retrieval and data mining. Many methods have been applied to text categorization, for example KNN, Naïve Bayes, decision trees, and SVM.

Rough set theory, proposed by Pawlak in 1982, is a powerful tool for dealing with imprecise or incomplete information in attribute dependence analysis, knowledge reduction, and decision rule extraction. It has two advantages for text categorization. First, it needs no prior probability information beyond the data set used to solve the problem. Second, it can reduce the dimensionality of the feature vector and yield classification rules in explicit form without reducing categorization accuracy.

Feature weighting is an important problem in text categorization. To compute feature weights, we analyzed the characteristics of rough set theory and TF-IDF and proposed a feature weighting scheme for text categorization based on rough set theory. In rough set theory, approximation quality and approximation accuracy reflect the importance of a feature from a global perspective, so they can be introduced into the weight of a feature word. However, if the weighting formula contained only these two parameters, the information about the feature within an individual text would be ignored. TF-IDF takes into account both the frequency of a feature word and its distribution over the whole example space, so the frequency of the feature is also introduced into the feature weight. The resulting weighting formula combines the advantages of TF-IDF and rough set theory.

In most cases, the rules induced by rough set reduction are inadequate as rules for classifying test texts. Among the several reasons for this, the main one is that test texts vary widely, so it is hard to obtain a comprehensive rule set. After analyzing complete matching and partial matching, we proposed a new partial matching method based on feature weight, which combines the idea of partial matching with feature weights. Experiments show that this method improves both the likelihood of finding a matching rule and the correctness of the basic decision rules.

Finally, we summarize the contributions and shortcomings of this work and outline future research.
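
As a rough illustration of the two ideas summarized in the abstract, the Python sketch below computes the standard rough set measures (approximation accuracy and approximation quality) on a toy decision table, then scores a rule by weighted partial matching. The abstract does not give the thesis's actual weighting or matching formulas, so everything beyond the textbook rough set definitions is an assumption: the toy data, the function names, the use of single-feature approximation quality as the weight (the term-frequency component is left out), and the scoring rule in `partial_match_score` are all hypothetical.

```python
from collections import defaultdict

def partition(rows, attrs):
    """Group row indices into equivalence classes by their values on attrs."""
    blocks = defaultdict(set)
    for i, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].add(i)
    return list(blocks.values())

def lower_upper(rows, attrs, target):
    """Lower and upper approximations of the index set `target` under attrs."""
    lower, upper = set(), set()
    for block in partition(rows, attrs):
        if block <= target:      # block lies entirely inside the target set
            lower |= block
        if block & target:       # block overlaps the target set
            upper |= block
    return lower, upper

def approximation_accuracy(rows, attrs, target):
    """Standard definition: |lower approximation| / |upper approximation|."""
    lower, upper = lower_upper(rows, attrs, target)
    return len(lower) / len(upper) if upper else 1.0

def approximation_quality(rows, attrs, decision):
    """Standard definition: |positive region of the decision| / |universe|."""
    positive = set()
    for decision_class in partition(rows, [decision]):
        positive |= lower_upper(rows, attrs, decision_class)[0]
    return len(positive) / len(rows)

def partial_match_score(rule_conditions, text_features, weights):
    """Hypothetical score: weight of satisfied conditions / total rule weight."""
    total = sum(weights[f] for f in rule_conditions)
    matched = sum(weights[f] for f, v in rule_conditions.items()
                  if text_features.get(f) == v)
    return matched / total if total else 0.0

# Toy decision table: binary word-presence features plus a class label.
rows = [
    {"goal": 1, "vote": 0, "cls": "sport"},
    {"goal": 1, "vote": 0, "cls": "sport"},
    {"goal": 0, "vote": 1, "cls": "politics"},
    {"goal": 0, "vote": 0, "cls": "politics"},
    {"goal": 0, "vote": 0, "cls": "sport"},
]

# Assumption: use each feature's own approximation quality as its global weight.
weights = {f: approximation_quality(rows, [f], "cls") for f in ("goal", "vote")}

# A rule "goal=1 and vote=0 => sport" checked against a text that satisfies
# only the first condition; complete matching would reject this text outright.
rule = {"goal": 1, "vote": 0}
score = partial_match_score(rule, {"goal": 1, "vote": 1}, weights)
```

In this sketch, a text that fails complete matching can still be assigned to the rule with the highest score, and conditions on more discriminative features count for more. Any threshold for accepting a partial match would have to be chosen experimentally.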

Related Dissertations

  1. Application of Information Fusion Technology of Improved D-S Evidence Theory in Fault Diagnosis of Rotating Equipments of Generating Units, TM307.1
  2. Research on Attribute Reduction Based on Equivalence Class with Uncertain Decision Value, TP18
  3. Study on the Decision Tree Classification Algorithm and Its Application Based on Rough Set Theory, TP18
  4. Study on the Engineering Properties of Loess and the Stability of Loess Slope Along a Railway, U212.22
  5. Network public opinion analysis to key technology research and, TP393.09
  6. The Implementation and Research of the Probabilistic Latent Semantic Analysis Model in the Search Engine’s Business Text Classification System, TP391.1
  7. Research on Image Segmentation Based on Rough Set Theory, TP391.41
  8. The Rough Set Method Based on Evidence Theory, TP18
  9. Content-based spam filtering technology research, TP393.098
  10. Internet News Hot Mining System Research and Implementation, TP393.09
  11. Research of Environmental Contamination Accidents Emergency Control System Based on GIS and Spatial Data Mining, TP311.13
  12. Study on Chinese Text Categorization, TP391.1
  13. Commercial Bank Credit Risk Evaluation Based on Neural Network and Evidence Theory, F830.33
  14. Research on Intelligent Control System Based on Rough Set Theory, TP18
  15. Research and Application of News Automatic Classification Technology Based on Support Vector Machines, TP391.1
  16. The application of data mining in the quality management of the H08 small electronic transformer, TP311.13
  17. Research on Web Chinese Text Automatic Categorization Based on RS-SVM, TP391.1
  18. Research and Improvement to Text Classification Algorithm, TP391.1
  19. Design and Implementation of a Lucene Based Intra-site Information Retrieval System for a Journal Site, TP391.3
  20. A Technology of Text Categorization on Imbalanced Datasets, TP391.1
  21. Research and Realization on Correlation Techniques of Topic Search-Specific Engine, TP391.3

CLC: Industrial Technology > Automation technology, computer technology > Automated basic theory > Artificial intelligence theory