
Research on Chinese Shallow Parsing Based on Statistical Language Model

Author: GaoHong
Tutor: YangYuanSheng;HuangDeGen
School: Dalian University of Technology
Course: Applied Computer Technology
Keywords: Statistical Language Model; Chinese Shallow Parsing; New Word Recognition; Named Entity Recognition; Text Chunking
CLC: TP391.1
Type: PhD thesis
Year: 2007


Natural language parsing is an important and difficult task in natural language processing (NLP). To cope with the difficulty of parsing large-scale real texts, many researchers have tried to divide the full parsing problem into several subproblems, so that the difficulty of full parsing is reduced step by step and parsing efficiency is improved. Shallow parsing was proposed for this purpose: it simplifies sentence structure by dividing text into syntactically related, non-overlapping groups that are simple in structure but significant in content. Shallow parsing, a new technique in NLP, is of great benefit to full parsing. It is also very useful for machine translation and for other NLP tasks that do not require a complete syntactic analysis, such as dictionary compilation, information retrieval, text categorization, summary generation and question answering.

With the wide application of empirical approaches in NLP, statistical language models have become the main technique in all kinds of NLP tasks. This thesis studies Chinese shallow parsing with statistical methods, covering new word recognition, named entity recognition and text chunking.

For new word recognition, a method combining mutual information and string frequency is presented to recognize new words other than named entities. Single characters, single-character words and adjacent multi-character words are the possible components of new words. When the mutual information between two adjacent components is computed, the confidence of each component and its length are taken into account, and string frequency is added into the mutual information. The method achieves good results for new word recognition.

Named entities are an important kind of unknown word. Unknown words cause errors in word segmentation, and those segmentation errors in turn make the recognition of unknown words more difficult. To solve this problem, we present a method that performs named entity recognition simultaneously with Chinese word segmentation, based on a digraph model. Lexical word candidates and named entity candidates are the vertices of the digraph, and an edge indicates that its two end-points are adjacent words. The edge weights are computed with an N-gram model so that the optimal segmentation of the sentence corresponds as closely as possible to the shortest path of the digraph. This method improves the accuracy of named entity recognition.

A Double-Rule AdaBoost (DR-AdaBoost) algorithm is presented and successfully applied to Chinese text chunking. At each round, DR-AdaBoost takes a linear combination of two rules (the optimal rule and the second-optimal rule) as the resulting hypothesis. Experimental results on UCI and CoNLL shared data sets show that DR-AdaBoost converges faster and is more accurate than AdaBoost. DR-AdaBoost also performs better than AdaBoost on the Chinese text chunking task, and it can be used in other NLP tasks and other classification problems.
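
The idea of reading a segmentation off the shortest path of a digraph can be illustrated with a minimal sketch. It is not the thesis's implementation and rests on several assumptions: it uses character positions as vertices and candidate words as edges (a common simplification of the candidate-as-vertex digraph described above), it weights each edge with a unigram cost -log P(word) rather than a higher-order N-gram, and it includes no named entity candidates. The function name `segment`, the parameters `max_len` and `unk_prob`, and the toy probabilities are all hypothetical.

```python
import math


def segment(sentence, word_probs, max_len=4, unk_prob=1e-8):
    """Return the lowest-cost segmentation of `sentence` as a list of words.

    Vertices are character positions 0..n. An edge i -> j stands for the
    candidate word sentence[i:j] and costs -log P(word) (unigram assumption).
    """
    n = len(sentence)
    best = [math.inf] * (n + 1)  # best[j]: cheapest cost to reach position j
    back = [0] * (n + 1)         # back[j]: start position of the last word
    best[0] = 0.0

    # Edges only go forward, so one left-to-right pass finds the shortest path.
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            word = sentence[i:j]
            if word in word_probs:
                prob = word_probs[word]
            elif j - i == 1:
                # Fallback for unseen single characters keeps every
                # position reachable (hypothetical smoothing choice).
                prob = unk_prob
            else:
                continue
            cost = best[i] - math.log(prob)
            if cost < best[j]:
                best[j] = cost
                back[j] = i

    # Follow the back-pointers from the end of the sentence to recover words.
    words = []
    j = n
    while j > 0:
        i = back[j]
        words.append(sentence[i:j])
        j = i
    return words[::-1]


# Toy, made-up probabilities for illustration only.
toy_probs = {"研究": 0.01, "研究生": 0.005, "生命": 0.01, "命": 0.001, "起源": 0.005}
print(segment("研究生命起源", toy_probs))  # ['研究', '生命', '起源']
```

With these toy numbers the path 研究 / 生命 / 起源 is cheaper than 研究生 / 命 / 起源, so the ambiguity is resolved by the path cost alone. In the setting the abstract describes, named entity candidates would compete in the same graph and the edge weights would come from an N-gram model.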
