
A Study on the Recognition of Biomedical Named Entity Based on Statistic

Author: QiuSha
Tutor: YuZhongHua;WangZhenJiang
School: Sichuan University
Course: Applied Computer Technology
Keywords: Statistical Natural Language Processing (SNLP); Biomedical Named Entity Recognition (Bio-NER); corpus; Hidden Markov Model (HMM); Viterbi algorithm; smoothing technology
CLC: TN912.34
Type: Master's thesis
Year: 2006
Downloads: 148
Quote: 2

Abstract


Named entity recognition (NER) in biomedical literature is currently one of the natural language processing (NLP) research questions attracting international attention. NLP research has achieved remarkable success in a few fields, but it has achieved little in the biomedical domain. With the flourishing development of biomedicine, new named entities (NEs) are emerging one after another. Irregular naming and new uses of old words have made biomedical named entity recognition (Bio-NER) a hard task, which to some degree hinders research in the biomedical domain. Many methods exist for Bio-NER. Statistical natural language processing (SNLP) is among the most frequently used, because its statistics-based methods do not require researchers to have deep professional knowledge of biomedicine. Among SNLP methods, the Hidden Markov Model (HMM) is widely applied because of its statistical properties.

HMM is an important approach to building statistical models in modern speech recognition systems. It can learn rules from a small amount of training data. To date, many international researchers have addressed Bio-NER research questions with HMM and its variants. Although they have made remarkable progress, none has reached the goal of "approximating human beings". Many questions remain unanswered, and in China research on Bio-NER is still in its early stages. Against this background, this thesis describes a study on constructing a statistical model for Bio-NER using HMM. The study proceeds as follows:

1. The HMM is trained on an annotated corpus using statistics. By counting over the annotated data, the parameters of the HMM are obtained: the set of states (S), the output alphabet (K), the initial state probabilities (p), the state transition probabilities (A), and the symbol emission probabilities (B). Some regular patterns of NEs are found by applying different methods in various experiments, and those patterns are incorporated to form the set K. Probabilities are then computed on this basis. To address the lack of sufficient data when calculating probabilities, linear interpolation is used for smoothing. The study also introduces the concept of Lexical Structure Similarity (LSS), which provides a measurable standard for comparing symbols.

2. The trained HMM is tested on a non-annotated corpus. A sentence from the non-annotated corpus is used as the input sequence of the HMM, and an output sequence is computed with the Viterbi algorithm, which yields the recognized Bio-NEs. When the input sequence is formed, different ways of dividing a sentence into words are applied in different experiments. The boundaries for dividing a sentence into a word sequence are determined by computing the similarity between a series of words in the sentence and each item in the set K, supplemented by a simple part-of-speech analysis.

3. The HMM is improved by calculating and comparing the recall and precision of the test results. The above procedures are repeated until an HMM that can effectively recognize Bio-NEs is formed.

The study described above achieved notable results on Bio-NER, and the effectiveness of the algorithm is verified.
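The three steps in the abstract (estimating HMM parameters by counting, smoothing by linear interpolation, decoding with the Viterbi algorithm, then scoring by precision and recall) can be illustrated with a minimal Python sketch. This is not the thesis's implementation. The class name SmoothedHMM, the toy corpus TRAIN, the tag names PROTEIN and O, the weight lam=0.9, and the choice to interpolate each maximum-likelihood estimate with a simpler back-off distribution are all assumptions made for illustration. The abstract does not say which distributions are interpolated, and the sketch uses plain words as output symbols rather than the pattern-based K set or LSS.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy annotated corpus: sentences of (word, tag) pairs.
TRAIN = [
    [("IL-2", "PROTEIN"), ("activates", "O"), ("NF-kappaB", "PROTEIN")],
    [("the", "O"), ("IL-2", "PROTEIN"), ("gene", "O")],
    [("NF-kappaB", "PROTEIN"), ("binds", "O"), ("the", "O"), ("promoter", "O")],
]


class SmoothedHMM:
    """Toy HMM tagger whose parameters come from counts over annotated data."""

    def __init__(self, tagged_sentences, lam=0.9):
        self.lam = lam                    # interpolation weight (assumed; must be < 1)
        self.init = Counter()             # tag counts at sentence start
        self.trans = defaultdict(Counter)  # trans[prev_tag][tag]
        self.emit = defaultdict(Counter)   # emit[tag][word]
        self.tag_count = Counter()
        self.n_sents = 0
        vocab = set()
        for sent in tagged_sentences:
            prev = None
            for word, tag in sent:
                self.tag_count[tag] += 1
                self.emit[tag][word] += 1
                vocab.add(word)
                if prev is None:
                    self.init[tag] += 1
                    self.n_sents += 1
                else:
                    self.trans[prev][tag] += 1
                prev = tag
        self.tags = sorted(self.tag_count)
        self.n_tokens = sum(self.tag_count.values())
        self.vocab_size = len(vocab)

    def _unigram(self, tag):
        return self.tag_count[tag] / self.n_tokens

    def p_init(self, tag):
        # Interpolate the sentence-initial estimate with the tag unigram.
        ml = self.init[tag] / self.n_sents
        return self.lam * ml + (1 - self.lam) * self._unigram(tag)

    def p_trans(self, prev, tag):
        # Interpolate the bigram estimate with the tag unigram.
        total = sum(self.trans[prev].values())
        ml = self.trans[prev][tag] / total if total else 0.0
        return self.lam * ml + (1 - self.lam) * self._unigram(tag)

    def p_emit(self, tag, word):
        # Interpolate with a uniform distribution over vocabulary + one unknown slot.
        ml = self.emit[tag][word] / self.tag_count[tag]
        return self.lam * ml + (1 - self.lam) / (self.vocab_size + 1)

    def viterbi(self, words):
        """Return the most probable tag sequence for words (log-space)."""
        if not words:
            return []
        delta = {t: math.log(self.p_init(t)) + math.log(self.p_emit(t, words[0]))
                 for t in self.tags}
        backpointers = []
        for word in words[1:]:
            new_delta, ptr = {}, {}
            for t in self.tags:
                best = max(self.tags,
                           key=lambda p: delta[p] + math.log(self.p_trans(p, t)))
                new_delta[t] = (delta[best] + math.log(self.p_trans(best, t))
                                + math.log(self.p_emit(t, word)))
                ptr[t] = best
            delta = new_delta
            backpointers.append(ptr)
        path = [max(self.tags, key=delta.get)]
        for ptr in reversed(backpointers):
            path.append(ptr[path[-1]])
        return path[::-1]


def precision_recall(gold, pred, outside="O"):
    """Token-level precision and recall over entity (non-outside) tags."""
    tp = sum(1 for g, p in zip(gold, pred) if g == p and g != outside)
    n_pred = sum(1 for p in pred if p != outside)
    n_gold = sum(1 for g in gold if g != outside)
    return (tp / n_pred if n_pred else 0.0, tp / n_gold if n_gold else 0.0)


if __name__ == "__main__":
    model = SmoothedHMM(TRAIN)
    # prints ['O', 'PROTEIN', 'O', 'PROTEIN']
    print(model.viterbi(["the", "IL-2", "binds", "NF-kappaB"]))
```

Because every smoothed probability is strictly positive when lam < 1, the decoder can take logarithms safely and still tag words that never appeared in training. Scoring here is per token; an evaluation over whole entity spans would be stricter.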

Related Dissertations

  1. A Study of the Acquisition of Chinese Progressive Complex Sentence Based on the Interlanguage Corpus,H195
  2. A Corpus-Based Intertextuality Analysis of the Reportage on Shanghai World Expo,H052
  3. Research and Design of Electronic Equalization Based on Most Likelihood Sequence Estimation,TN911.5
  4. Modern Chinese Function sentence and corpus construction,H146
  5. English Academic title phrase characteristics of,H313
  6. Explicitation of Causal Links in Students’ Translation,H315.9
  7. A Corpus Study on HSK First-Degree Psych Verb Collocations in Chinese/Non-Chinese Cultural Circles,H195
  8. The Study of Zu Wuze’s Poetry and Essay,I206.2
  9. A Corpus-based Contrastive Study of High-frequency Verb Collocation and Sentence Patterns under Specific Semantic Categories,H319
  10. The Study of Idiomatic Phrase in Teaching Chinese as a Foreign Language,H195
  11. \,H212
  12. Research of Multi-Sensory Myoelectric Prosthetic Hand with Hardness and Thermal Conductivity,TP242
  13. Research on the Key Technologies for Speech Recognition for Robot Communication,TN912.34
  14. A Corpus-based Study of the Collocation Use of the English Verb--get,H319
  15. Research on Automatic Notation of Word for Tibetan Corpus Based on HMM,H214
  16. Comparison Study of Contrast-enhanced Ultrasonography on the Ovarian Endometriotic Cysts and Corpus Luteum Hematoma,R445.1
  17. A Corpus-Based Contrastive Analysis between Chinese Caused-Motion Ba-Construction and Its English Translation,H146
  18. The Status and Strategy of Conjunction Teaching in Teaching Chinese as a Foreign Language,H195
  19. The Research of Compliance Testing Technology of Traffic Terminology and Standards,TP391.1
  20. A Study on Ordering Molecular Marker Loci in a Genetic Linkage Group,TP18
  21. Image Processing Basing on Fuzzy Theory and Hidden Markov Model,TP391.41

CLC: Industrial Technology > Radio electronics, telecommunications technology > Communication > Electro-acoustic technology and speech signal processing > Speech Signal Processing > Speech Recognition and equipment
© 2012 www.DissertationTopic.Net