
Study on the Tibetan Word Segmentation and Named Entity Recognition with Conditional Random Fields

Author: LiYaChao
Tutor: YuHongZhi
School: Northwest University for Nationalities
Course: Applied Computer Technology
Keywords: Tibetan word segmentation; named entity recognition; abbreviated word recognition; conditional random fields; maximum entropy
CLC: TP391.1
Type: Master's thesis
Year: 2013


Tibetan word segmentation (TWS) and named entity recognition (NER) are important problems in Tibetan information processing. TWS segments a raw Tibetan sentence into a word sequence, while NER recognizes and classifies the entities in that word sequence. Traditional Tibetan word segmentation is rule-based and performs poorly on unknown words and ambiguity. The research foundation for Tibetan NER is weak and has mainly concentrated on rule-based methods. Existing statistical TWS and NER methods have usually served only as secondary methods; in the last three years, methods based on large-scale corpora and machine learning have begun to be taken seriously.

This thesis systematically studies TWS and NER based on conditional random fields (CRF), and designs and implements a CRF-based Tibetan word segmentation system. It also proposes a method that combines maximum entropy and conditional random fields to identify Tibetan person names. The work includes the following.

First, the thesis proposes a statistical method for Tibetan abbreviated word recognition (AWR) and evaluates it in experiments with CRF; the results indicate that the AWR problem has no significant effect on TWS. Tibetan script is encoded as alphabetic writing and Tibetan words are composed of syllables, so TWS combines continuous syllable sequences into a word sequence. Abbreviated words affect the recognition of syllables and thereby reduce TWS performance. The statistical AWR method treats AWR as a classification problem and solves it with a machine learning classifier. Compared with the rule-based method, this approach does not require vocabulary support and can be combined easily with a statistical segmentation model, which significantly improves Tibetan word segmentation.

Second, the thesis determines a suitable syllable tagging scheme. TWS with syllable tagging treats segmentation as determining the position of each syllable within a word, so the tagging scheme greatly affects segmentation performance. The thesis proposes a four-position tagging scheme, "BMES", which, combined with the AWR model, significantly improves Tibetan word segmentation. In a comparison experiment, this system outperforms the previous systems in the literature.

Third, the thesis systematically studies feature selection and unknown word recognition (UWR) in the CRF system. Selecting appropriate features is the most important step in statistical segmentation, yet there is little literature on feature selection for TWS with CRF, so the thesis systematically studies the effect of different features. Unknown words are a key problem for word segmentation systems, and UWR performance is an important index of such a system. The thesis studies UWR on a single dataset and across datasets, carries out experiments on an open corpus, and compares the results with the performance of Chinese UWR.

Fourth, the thesis proposes a method that combines maximum entropy (ME) and conditional random fields to identify Tibetan named entities; it achieves better performance and balances the precision and recall shortcomings of the two individual models. Because no open Tibetan NER corpus exists, the Tibet Daily corpus was annotated and NER experiments were run with the ME and CRF models separately. To address the problems found in the two models, the combined ME and CRF method was proposed, and it achieved good results.
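
The four-position "BMES" tagging described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation. It assumes the conventional reading of the tags (B = first syllable of a word, M = middle syllable, E = last syllable, S = single-syllable word) and assumes that syllables are delimited by the Tibetan tsheg mark (U+0F0B); neither detail is spelled out in the abstract. The function names split_syllables, words_to_bmes and bmes_to_words are hypothetical, and the sketch ignores abbreviated words, which the thesis handles with a separate AWR model.

```python
# Illustrative sketch only: names and the tsheg-based split are assumptions,
# not details taken from the thesis.

TSHEG = "\u0f0b"  # Tibetan intersyllabic mark (tsheg)


def split_syllables(text):
    """Split a string into syllables on the tsheg mark, dropping empty pieces."""
    return [s for s in text.split(TSHEG) if s]


def words_to_bmes(words):
    """Turn a segmented sentence into (syllable, tag) pairs.

    `words` is a list of words, each word a non-empty list of syllables.
    """
    tagged = []
    for syllables in words:
        if len(syllables) == 1:
            tagged.append((syllables[0], "S"))  # single-syllable word
        else:
            tagged.append((syllables[0], "B"))  # first syllable
            tagged.extend((s, "M") for s in syllables[1:-1])  # middle syllables
            tagged.append((syllables[-1], "E"))  # last syllable
    return tagged


def bmes_to_words(tagged):
    """Rebuild words from (syllable, tag) pairs; a word closes on E or S."""
    words, current = [], []
    for syllable, tag in tagged:
        current.append(syllable)
        if tag in ("E", "S"):
            words.append(current)
            current = []
    if current:  # tolerate a sequence that ends without E or S
        words.append(current)
    return words


# Placeholder syllables stand in for real Tibetan text.
example = [["s1", "s2", "s3"], ["s4"], ["s5", "s6"]]
tags = [tag for _, tag in words_to_bmes(example)]
# tags == ["B", "M", "E", "S", "B", "E"]
```

In a setup like the one the abstract describes, a sequence model such as a CRF would predict one tag per syllable, and a decoder like bmes_to_words would then recover the word sequence from the predicted tags.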

