Statistical Chinese Lexical Analysis and Its Reinforcement Learning Mechanism

Author: JiangWei
Advisor: WangXiaoLong
School: Harbin Institute of Technology
Major: Computer Application Technology
Keywords: Lexical Analysis; Statistical Language Model; Feature Extraction; Artificial Immune System; Reinforcement Learning
CLC: TP391.1
Type: PhD thesis
Year: 2007
Downloads: 449
Citations: 4

Abstract


Lexical analysis (LA) is a foundational task in natural language processing (NLP), and it strongly influences syntactic analysis and the downstream applications built on it. In this dissertation, LA comprises word segmentation, part-of-speech (POS) tagging, and named entity recognition (NER). Because LA is a prerequisite step, early errors in LA cascade through the processing chain and degrade the final performance of applications such as information retrieval, question answering, and machine translation. In addition, the approaches and techniques developed for LA help solve similar tasks, such as Pinyin-to-character conversion, shallow parsing, and biological information processing. Research on LA is therefore both valuable and meaningful.

The main difficulties in improving LA are the ambiguity problem, the sparse data problem, and the independent and identically distributed (i.i.d.) assumption. This dissertation focuses on the LA task and studies it with statistical approaches. In terms of models: 1) for supervised learning, we explore N-gram models, the Maximum Entropy (ME) model, Conditional Random Fields (CRF), and Support Vector Machines (SVM), among others; 2) for unsupervised learning, we build a word vector space. In terms of features, we propose extracting complicated features with rough set theory and extracting named entity features with the trigger pair method. We study LA in depth with these theories and approaches. The dissertation covers the following aspects:

1) Building a Chinese POS tagging model based on CRF. The Hidden Markov Model (HMM) is a generative model, so rich features are not easily added to it. The Maximum Entropy Markov Model (MEMM) is a conditional probabilistic model that easily incorporates rich features, but it suffers from the label bias problem. A CRF uses a single exponential model for the joint probability of the entire label sequence given the observation sequence, so it can incorporate rich features and also overcome the label bias problem.
In addition, we apply trigger pair features to capture long-distance dependencies, and we explore the feedback effect of chunking features on POS tagging. We describe a method for performing sequential tagging with SVM and apply it to Pinyin-to-character conversion. Finally, we describe a multi-model combination method for POS tagging.

2) Chinese NER based on ME. ME is a conditional probabilistic model that easily incorporates rich features. Recent evaluations seem to indicate that linear and log-linear models perform well on the NER task. We propose collecting stable features with the trigger pair method. Furthermore, we explore a feature extension approach that combines a thesaurus with word clusters. Considering the characteristics of Chinese NER, we propose a double-layer mixed model and introduce a domain-extended learning strategy, so that paragraph-level or chapter-level features can be used to improve performance.

3) Applying rough set theory to extract complicated contextual features. These features are difficult to extract from a corpus with existing models, especially from a corpus that contains noisy and inconsistent samples, because the sparse data and noise problems become more serious when complicated features are extracted. With rough sets, complex and long-distance features are collected effectively. These rough rules are then added to the maximum entropy model, which assigns a weight to every feature according to the overall performance of the model. Furthermore, we apply variable precision rough set theory to improve performance when the decision tags are unevenly distributed. Experiments have verified the effectiveness of our approaches.

4) Reinforcement learning (RL) for the LA task.
Corpus-based supervised learning almost always encounters the sparse data problem and relies on the i.i.d. assumption. Because of Zipf’s law, the sparse data problem can hardly be solved by enlarging the corpus. Moreover, the application domain generally differs from the domain of the training corpus, so the i.i.d. assumption is not easily met. In many tasks, these two problems are obstacles to improving supervised learning algorithms. When improvements to a supervised learning system hit a bottleneck, RL is a meaningful research direction. Considering the "local perception property" of real feedback information, we focus on exploring online learning with "local perception". In this dissertation, we build a Chinese person name recognition model based on clonal selection theory, and we build word segmentation, POS tagging, and Pinyin-to-character conversion models based on reinforcement learning.
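
The trigger pair features mentioned in aspects 1) and 2) can be illustrated with a small sketch. The abstract does not state how trigger pairs are selected, so everything here is an assumption made for illustration: pairs are scored by sentence-level pointwise mutual information (PMI), which stands in for whatever criterion the dissertation actually uses, and the function name score_trigger_pairs, the min_count parameter, and the toy corpus are all hypothetical.

```python
import math
from collections import Counter


def score_trigger_pairs(sentences, min_count=2):
    """Score ordered word pairs (trigger, target) by sentence-level PMI.

    A pair is counted once per sentence in which the trigger appears
    somewhere before the target, at any distance. PMI is an assumed
    stand-in criterion, not the dissertation's own.
    """
    n_sentences = len(sentences)
    word_counts = Counter()  # number of sentences containing each word
    pair_counts = Counter()  # number of sentences where trigger precedes target

    for sentence in sentences:
        word_counts.update(set(sentence))
        pairs = set()
        for i, trigger in enumerate(sentence):
            for target in sentence[i + 1:]:
                if trigger != target:
                    pairs.add((trigger, target))
        pair_counts.update(pairs)

    scores = {}
    for (trigger, target), count in pair_counts.items():
        if count < min_count:
            continue  # drop rare pairs, whose estimates are unreliable
        p_pair = count / n_sentences
        p_trigger = word_counts[trigger] / n_sentences
        p_target = word_counts[target] / n_sentences
        scores[(trigger, target)] = math.log(p_pair / (p_trigger * p_target))
    return scores


# Hypothetical toy corpus: each sentence is a list of tokens.
toy_corpus = [
    ["professor", "wang", "teaches", "nlp"],
    ["professor", "li", "teaches", "math"],
    ["the", "cat", "sleeps"],
    ["the", "dog", "sleeps"],
]

scores = score_trigger_pairs(toy_corpus)
for pair, pmi in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(pmi, 3))
```

On this toy corpus only ("professor", "teaches") and ("the", "sleeps") survive the min_count filter, each with a PMI of log 2. Both pairs link words that are not adjacent, which is the kind of long-distance dependency that a fixed N-gram window can miss.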

CLC: Industrial Technology > Automation Technology, Computer Technology > Computing Technology, Computer Technology > Computer Applications > Information Processing > Text Processing