Word Sense Disambiguation Corpus Automatic Acquisition

Author: GuoYuHang
Tutor: LiuTing
School: Harbin Institute of Technology
Course: Computer Science and Technology
Keywords: Natural language processing, Word sense disambiguation, Language model, Pointwise mutual information
CLC: TP391.1
Type: Master's thesis
Year: 2008

Abstract


Polysemy, the phenomenon of one word having several senses, creates many difficulties for natural language processing by computer. Many problems in natural language understanding ultimately come down to resolving ambiguous words. In the more than 60 years since the issue was first noted, researchers have proposed a number of approaches to word sense disambiguation (WSD). With the development of large-scale text-processing technology, supervised machine learning methods have come to predominate in WSD tasks because of their high accuracy. However, the success of these methods depends heavily on sufficient training data. Annotating such data is time-consuming and laborious, and it is difficult to keep the annotations consistent. The data sparseness caused by the lack of training data limits the wider adoption of supervised methods.

Some studies have therefore set out to acquire training corpora automatically. Among them, a method that uses synonyms to expand the training corpus has lower resource costs and better extensibility. However, experiments found that the corpus obtained by this method contains too much noise and is highly biased. Focusing on how to obtain effective training corpora automatically, this thesis therefore proposes a two-stage expansion-verification strategy, in which the verification stage removes the noise introduced into the training corpus by the expansion stage. We focus on the verification capability of two approaches, one based on a language model and the other on pointwise mutual information. To provide a baseline for the subsequent experiments, an SVM-based supervised WSD system was developed. Experiments on the SemEval-2007 English lexical sample corpus show that the linear-kernel SVM performs best.
Next, we use the synonyms of the target words in the Senseval-3 Chinese corpus and the SemEval-2007 English corpus to obtain candidate WSD corpora from the Web and from raw corpora. We then filter these candidates using the language model and pointwise mutual information approaches and add the resulting expanded corpora to the supervised system, one approach at a time. The results show that both approaches are able to verify the candidates and improve the final performance of the system. The language model approach improves accuracy on the Senseval-3 Chinese lexical sample corpus from 62% to 63.06%. On the SemEval-2007 English lexical sample corpus, the pointwise mutual information verification approach improves accuracy from 88.19% to 88.46%.
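
The abstract does not say how the pointwise mutual information score is computed or thresholded, so the following Python sketch is only an illustration of the general idea of PMI-based verification, not the thesis's actual method. It assumes sentence-level co-occurrence counts from a raw corpus, uses PMI(x, y) = log( p(x, y) / (p(x) p(y)) ), and keeps a candidate sentence when the average PMI between the target word and its context words reaches a threshold. The function names (build_counts, pmi, verify), the toy corpus, and the threshold of 0.0 are all hypothetical.

```python
import math
from collections import Counter


def build_counts(sentences):
    """Count, per sentence, which words occur and which word pairs co-occur."""
    word_counts = Counter()
    pair_counts = Counter()
    for tokens in sentences:
        vocab = sorted(set(tokens))
        word_counts.update(vocab)
        for i, a in enumerate(vocab):
            for b in vocab[i + 1:]:
                pair_counts[(a, b)] += 1  # key is always in sorted order
    return word_counts, pair_counts, len(sentences)


def pmi(x, y, word_counts, pair_counts, n):
    """Return log(p(x, y) / (p(x) * p(y))), or None if the pair was never seen."""
    key = (x, y) if x < y else (y, x)
    joint = pair_counts.get(key, 0)
    if joint == 0 or word_counts[x] == 0 or word_counts[y] == 0:
        return None
    return math.log((joint * n) / (word_counts[x] * word_counts[y]))


def verify(candidates, target, counts, threshold=0.0):
    """Keep candidates whose context has average PMI >= threshold with the target.

    Assumption: unseen pairs are skipped, and a candidate with no
    co-occurrence evidence at all is rejected.
    """
    word_counts, pair_counts, n = counts
    kept = []
    for tokens in candidates:
        scores = [pmi(target, w, word_counts, pair_counts, n)
                  for w in tokens if w != target]
        scores = [s for s in scores if s is not None]
        if scores and sum(scores) / len(scores) >= threshold:
            kept.append(tokens)
    return kept


# Toy raw corpus (hypothetical data, for illustration only).
raw_corpus = [
    ["bank", "loan", "money"],
    ["bank", "money", "deposit"],
    ["river", "water", "fish"],
    ["river", "water", "boat"],
]
counts = build_counts(raw_corpus)

# Candidates as they might look after a synonym was replaced by the target word.
candidates = [
    ["bank", "money", "loan"],   # context co-occurs with "bank" in the raw corpus
    ["bank", "water", "fish"],   # context never co-occurs with "bank"
]
kept = verify(candidates, "bank", counts)
print(kept)
```

In this sketch the second candidate is dropped because none of its context words co-occur with the target in the raw corpus. A real system would need smoothing, a window-based context, and a threshold tuned on held-out data.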

Related Dissertations

  1. Research on Structure Transition Technology for SMT,TP391.2
  2. Printers based on natural language HCI Research and implementation,TP11
  3. Based on Chinese Wikipedia semantic correlation computation Research and Implementation,TP391.1
  4. The Research of Web-based Community Medical Intelligent Service System,TP311.52
  5. AraOntoLT: A Framework for Ontology Learning from Arabic Text,TP391.1
  6. Probability and statistics based on dictionaries and Chinese word segmentation algorithm,TP391.1
  7. Web Knowledge Service Oriented of Medical Information Classification Approach,TP391.1
  8. Research in Thesaurus-based Ontology Building Method,TP391.1
  9. Research on Transformation from Use Case Diagrams to Sequence Diagrams,TP311.52
  10. Based the AADL model validation and code generation technology,TP311.52
  11. Research of Topic Tracking Based on HowNet and Topic Renewal,TP391.1
  12. Research on Methods of Ontology Modeling Based on MDA,TP182
  13. Study on Meme and Parody of Language,H05
  14. For specific areas of research and application of statistical machine translation,TP391.2
  15. Repeat the question based on extended research,TP391.2
  16. A Model-Driven Approach for Dynamic Web Service Composition,TP393.09
  17. Information Geometry Based Pure High-order Interaction Model and Its Applications,TP391.1
  18. The Study and Analysis of Oracle Bone Inscriptions Based on Statistical Natural Language Processing,TP391.1
  19. Home Academic Information Extraction System,TP393.092
  20. Research of Sentence Similarity Computation Based on Semantic Analysis,TP391.1
