
The Recognition of Protein Name in the Biomedical Documents

Author: LiGang
Tutor: TangHuanWen; LinHongFei
School: Dalian University of Technology
Course: Operational Research and Cybernetics
Keywords: Entity recognition; Protein name recognition; Candidate word; Edit distance; Classifier
CLC: Q51-04
Type: Master's thesis
Year: 2006

Abstract


In recent years, the Human Genome Project and advances in molecular biology and information science have driven unprecedented growth in biological data: DNA, RNA, and protein data, along with large volumes of functional genomic and proteomic data, have begun to emerge. The biomedical literature is expanding rapidly as well. Data is not the same as knowledge, but it is a source of information and knowledge, and much important information lies behind this surge of data. How to extract knowledge from massive biomedical data has therefore become a hot research topic. To extract knowledge from the biomedical literature, the first task is to correctly identify the names of the biological entities that appear in it. The accuracy of entity recognition directly affects the quality of a data mining system, which makes entity recognition a critical step in biomedical literature mining.
Entity recognition methods fall mainly into three groups: methods based on manually written rules, dictionary-based methods, and machine-learning-based methods. The dictionary-based and machine-learning-based methods are the ones most commonly used. A dictionary method can provide ID information for an entity name, and a machine learning method gradually improves its recognition ability through training. However, because of the peculiarities of biological entity names (for example, there is no unified naming convention, and the same entity may have several different names), neither method achieves the desired results. The first problem is that the diversity of protein name spellings causes many recognition errors. The second is that many protein names consist of two or more words. For a multi-word entity name, the dictionary holds only the most common word order, and a typical matching algorithm has difficulty finding the name when it appears in the literature in another order, so many variant forms go unrecognized. The target words therefore cannot be found simply by looking them up in the dictionary.
Machine learning methods have been shown experimentally to be very effective, but they cannot provide identification (ID) information for the names they recognize. They also require large-scale training text to improve their recognition ability, and such training text is in short supply.
This work on biological entity recognition combines the advantages of the dictionary method and the machine learning method to improve both recognition accuracy and recall. The recognition process consists of two steps. The first is the recognition stage: a protein name dictionary and an approximate matching algorithm are used to determine candidate protein names, which addresses the spelling diversity problem and improves recall. The second is the filtering stage: a classifier trained with machine learning methods filters out the false protein names that the approximate matching algorithm identified in error, which improves the accuracy of identification. Some problems remain unresolved by this process, such as reversed word order. The thesis therefore makes a number of improvements, introducing the DICE coefficient and a first-word calculation, which preserve recall while addressing the inverted word order problem and reducing the amount of computation. The test results show that the improvements are effective.
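
The recognition stage described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the thresholds `max_dist` and `min_dice`, and the two-entry sample dictionary are all assumptions made for illustration. The sketch uses plain Levenshtein edit distance to tolerate spelling variants and a Dice coefficient over word sets to tolerate reordered words. The abstract does not specify the first-word calculation or the classifier used in the filtering stage, so neither is shown.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def dice_coefficient(a: str, b: str) -> float:
    """Dice coefficient over word sets: 2|A & B| / (|A| + |B|)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return 2 * len(words_a & words_b) / (len(words_a) + len(words_b))


def find_candidates(phrase, dictionary, max_dist=1, min_dice=0.8):
    """Return dictionary entries that approximately match `phrase`.

    An entry matches if it is within `max_dist` character edits of the
    phrase (spelling variants) or shares enough words with it regardless
    of order (word-order variants). Both thresholds are hypothetical.
    """
    p = phrase.lower()
    matches = []
    for entry in dictionary:
        e = entry.lower()
        if edit_distance(p, e) <= max_dist or dice_coefficient(p, e) >= min_dice:
            matches.append(entry)
    return matches


# Illustrative dictionary; a real one would also carry entity IDs.
PROTEIN_NAMES = ["Protein kinase C", "Interleukin 2"]
```

With this sample dictionary, `find_candidates("interleukin-2", PROTEIN_NAMES)` matches through edit distance, and `find_candidates("kinase C protein", PROTEIN_NAMES)` matches through the Dice coefficient even though the word order differs from the dictionary entry. Candidates found this way would still need the filtering stage to remove false matches.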

CLC: Biological Sciences > Biochemistry > Protein