
Research on a Two-Stage Method for Chinese Named Entity Recognition

Author: HeZuo
Supervisor: DongYuan
School: Beijing University of Posts and Telecommunications
Major: Signal and Information Processing
Keywords: Chinese named entity recognition; Conditional Random Fields; Maximum Entropy Model; two-stage
CLC: TP391.4
Type: Master's thesis
Year: 2008

Abstract


Named Entity Recognition (NER) is a basic and important task in Information Extraction and has long been one of the central issues in natural language processing. The Message Understanding Conference (MUC), sponsored by the U.S. Defense Advanced Research Projects Agency (DARPA), made NER one of its subtasks at MUC-6 in 1995. At the same time, Named Entities (NE) were officially divided into three groups for the first time: 1. entity names (organization names, person names and location names); 2. temporal expressions (date and time); 3. numeric expressions (monetary value and percentage). The subsequent Automatic Content Extraction (ACE) evaluation added new aspects to NE, such as entity mentions and relations between entities.

Since 2003, the Special Interest Group on Chinese Language Processing (SIGHAN) of the Association for Computational Linguistics (ACL) has run bakeoffs on Chinese word segmentation and named entity recognition. Four bakeoffs had been held by 2007. The first two covered only Chinese word segmentation, while the last two also included Chinese NER. The SIGHAN definition of NE covers person names, location names, organization names and, for some corpora, geopolitical names. Participants are required to tag the boundaries and category of each NE in unsegmented text.

Following the NE definition and annotation guidelines of the SIGHAN bakeoff, this thesis presents a two-stage method for Chinese NER, consisting of boundary detection followed by category identification. Each stage uses the machine learning algorithm that suits its characteristics: Conditional Random Fields (CRFs) for boundary detection and the Maximum Entropy model (MaxEnt) for category identification. Compared with the traditional one-stage method, the two-stage method greatly reduces the cost of training the CRF model, while the overall performance remains almost the same.
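One plausible way to see where the saving in CRF training cost comes from is to count labels. The sketch below is an illustration, not the thesis's own analysis: it assumes B/I/E position tags plus O and three categories (PER, LOC, ORG), and the function names are hypothetical. In a linear-chain CRF the forward-backward pass scales with the square of the number of labels, so a boundary-only label set is much cheaper to train than a joint position-and-category label set.

```python
from itertools import product


def one_stage_labels(positions, categories):
    """Joint labels such as B-PER: position and category in a single tag."""
    return ["O"] + [f"{p}-{c}" for p, c in product(positions, categories)]


def two_stage_labels(positions):
    """Boundary-only labels for stage 1; categories are left to stage 2."""
    return ["O"] + list(positions)


# Assumed inventory; the abstract does not list the exact label sets.
positions = ["B", "I", "E"]
categories = ["PER", "LOC", "ORG"]

joint = one_stage_labels(positions, categories)
boundary = two_stage_labels(positions)

# Number of labels in each setting.
print(len(joint), len(boundary))            # 10 4
# Label-pair transitions considered at every position of a sequence.
print(len(joint) ** 2, len(boundary) ** 2)  # 100 16
```

Under these assumptions the CRF considers 16 label pairs per position instead of 100. The actual saving depends on the label sets and feature templates used in the thesis.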
The saving matters because CRF training is very expensive.

The two-stage procedure for Chinese NER is as follows. First, boundary detection is performed. This is a sequence tagging problem, for which CRFs are well suited: they can integrate a large number of features, and they do not suffer from the label bias problem that affects directed graphical models. Second, MaxEnt is used to identify the NE category, because it follows the principle that, when only partial information about the possible outcomes is available, the probabilities should be chosen to maximize the uncertainty about the missing information.

The boundary detection experiments have two highlights: 1. six label sets are compared comprehensively, and the BIOE label set, which marks both the beginning and the end of an NE, performs best; 2. different window sizes in the feature templates are compared, and the conclusion is that the window should be neither too large nor too small. A larger window brings in more features, but it also increases the computational complexity and causes data sparseness, while a smaller window loses important context information.

For category identification, the features are divided into two groups: local features and global features. Local features depend only on the entity itself, whereas global features also take the context of the NE into account. Experiments show that good performance can be reached using local features alone. The reason is that confusion between different kinds of NE is rare, so the information in the NE itself is sufficient to identify its category.

Once the two-stage results are obtained, the one-stage and two-stage methods are compared.
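The BIOE labelling and the window-based feature templates described above can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the function names, the `<PAD>` symbol, the example sentence and the choice to label a single-character entity as B are all assumptions.

```python
def bioe_encode(n_chars, spans):
    """Label each character as B, I, O or E.

    spans: list of (start, end) pairs with end exclusive.
    Assumption: a single-character entity is labelled B.
    """
    labels = ["O"] * n_chars
    for start, end in spans:
        labels[start] = "B"
        for i in range(start + 1, end - 1):
            labels[i] = "I"
        if end - start > 1:
            labels[end - 1] = "E"
    return labels


def bioe_decode(labels):
    """Recover (start, end) spans from a B/I/O/E label sequence."""
    spans, start = [], None
    for i, lab in enumerate(labels):
        if lab == "B":
            if start is not None:      # previous entity had no E tag
                spans.append((start, i))
            start = i
        elif lab in ("E", "O") and start is not None:
            spans.append((start, i + 1 if lab == "E" else i))
            start = None
    if start is not None:              # entity runs to the end of the text
        spans.append((start, len(labels)))
    return spans


def window_features(chars, i, window=2):
    """Character features in a symmetric window around position i."""
    feats = {}
    for off in range(-window, window + 1):
        j = i + off
        feats[f"c[{off}]"] = chars[j] if 0 <= j < len(chars) else "<PAD>"
    return feats


# Hypothetical sentence: a person name at 0..2 and an organization at 3..7.
chars = list("李明在北京大学工作")
labels = bioe_encode(len(chars), [(0, 2), (3, 7)])
print(labels)               # ['B', 'E', 'O', 'B', 'I', 'I', 'E', 'O', 'O']
print(bioe_decode(labels))  # [(0, 2), (3, 7)]
print(len(window_features(chars, 4, window=2)))  # 5
```

Each position yields 2 * window + 1 character features, which shows how a wider window adds context at the price of more, and sparser, features.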
Compared with the one-stage method, the two-stage method reduces time and memory consumption by roughly 80%, while the overall performance remains almost the same. Both methods achieve a competitive overall F-measure, almost as good as the top result in the bakeoff. One-stage training takes more than 20 hours, whereas 3.5 hours is enough for the two-stage method. The one-stage model has about 100 million features and requires 12 GB of memory; the two-stage method involves only 6 million features, and memory use falls to 3.2 GB.

Finally, the advantage of the two-stage method is proved theoretically, and some comments on future work are made.
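The reported figures can be checked with a little arithmetic. The snippet below is only a sanity check on the numbers quoted in the abstract. `reduction` is a hypothetical helper, and 20 hours is taken as the one-stage training time although the abstract says "more than 20 hours", so the time saving is a lower bound.

```python
def reduction(before, after):
    """Percentage reduction when going from `before` to `after`."""
    return 100.0 * (before - after) / before


time_saving = reduction(20.0, 3.5)       # training hours
feature_saving = reduction(100e6, 6e6)   # number of features
memory_saving = reduction(12.0, 3.2)     # memory in GB

print(round(time_saving, 1))     # 82.5
print(round(feature_saving, 1))  # 94.0
print(round(memory_saving, 1))   # 73.3
```

The time and memory savings fall on either side of the "roughly 80%" quoted in the abstract, while the feature count falls by a larger margin.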

Related Dissertations

  1. Research and Implement of Chinese Word Segment Techniques Based on the Conditional Random Field,TP391.1
  2. Integration of Spatial Information Bag of Feature in Image Annotation,TP391.41
  3. Based on self-learning social relation extraction research,TP391.1
  4. Chinese Automatic identification function block,TP391.1
  5. Design and Implementation of the Chinese Webpage Classifier Based on the Maximum Entropy Model,TP393.092
  6. Study on Chinese Name Entity Recognition and Some Related Issues,TP391.41
  7. Numerical Methods for Solving Nonsymmetric Algebraic Riccati Equation,O241.6
  8. High alkaline ammonia leaching of copper oxide ores,TD952
  9. The Study of Q & A System Based on Maximum Entropy Model Semantic Parsing,TP391.1
  10. The Research of Conditional Random Fields Based Chinese Named Entity Recognition,TP391.4
  11. The China Accounting Information installment of Development Outlook,F232
  12. Research on Product Named Entity Recognition and Normalization,TP391.1
  13. E-commerce for the product summary Mining Technology,TP391.1
  14. The Automatic Identification of the Semantic Core Words for Frame Elements,TP391.1
  15. The Characteristic Analysis of the Protein Secondary Structure and Protein Interaction Prediction,Q51
  16. The Research and Application of Automatic Term Extraction Technology,TP391.1
  17. Chinese information processing research on key issues,TP391.1
  18. Questions facing the financial sector Chinese semantic analysis method blocks,TP391.1
  19. Music Named Entity Recognition Technology,TP391.1
  20. In the name of the entity body application of information extraction,TP391.1
  21. Hybrid method based on the complexity of named entity extraction,TP391.1
