Research and Implementation of a Protein Classification Algorithm Based on String Kernels

Author: TangDeChang
Advisor: ZhangYan
School: Harbin Institute of Technology
Major: Computer Science and Technology
Keywords: protein classification, string kernel, spectrum kernel, suffix tree
CLC: TP301.6
Type: Master's thesis
Year: 2008


An important research topic in bioinformatics is understanding the meaning and function of each protein encoded in the genome. One of the most successful approaches to this problem is protein classification. A long-standing central question is how to improve classification results, or how to improve computational efficiency and reduce memory requirements without degrading the results too much. Focusing on this question, we study protein classification algorithms and seek a better feature map and faster computation methods.

SVMs based on string kernels are among the best classifiers for proteins. Among these kernels, the spectrum kernel is fast and gives good classification results. The mismatch kernel, an improvement on the spectrum kernel, gives better classification results at the cost of increased computing time. We first analyze string kernels and the trie-based methods for computing them, and then propose improved methods for feature mapping and kernel computation. The main contributions of this thesis are as follows:

(1) To address shortcomings in the feature mapping, we propose a new feature mapping method called the sample kernel. The sample kernel defines the feature space of the kernel on the training samples. It is therefore defined on the basis of another kernel used for classification, and it can alter the classification result by adding prior knowledge or by changing the feature space. We then analyze the design, selection, and computation of the sample kernel for different applications.

(2) For computing string kernels, we design and adopt a data structure called the pruning suffix tree. The pruning suffix tree combines the suffix tree, which has suffix links, with the trie, which computes the kernel value at its leaves. It uses less space than the suffix tree and computes faster than the trie. We then design a fast method for computing the p-spectrum kernel using the pruning suffix tree.

(3) To address the limitations of character matching in the p-spectrum kernel, we propose a new inexact string matching kernel, the vague spectrum kernel. Like the mismatch kernel, the vague spectrum kernel incorporates the idea of inexact string matching. Unlike the mismatch kernel, it defines inexact matching on the sample strings. The vague spectrum kernel also uses the pruning suffix tree to speed up character matching.

Finally, we design and implement a classification model for proteins and use it to test the kernels described above. The test results show that the sample kernel significantly improves the classification results of the string kernel, and that the pruning suffix tree improves the computing speed of the kernels.
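
For readers unfamiliar with the two baseline kernels named in the abstract, the following is a minimal sketch of the standard textbook definitions of the p-spectrum kernel and the (p, m)-mismatch kernel. It is not the thesis's implementation: it counts substrings with a hash table instead of a trie or the pruning suffix tree, and it does not attempt the sample kernel or the vague spectrum kernel, whose definitions the abstract does not give. All function names and the `alphabet` parameter are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations, product


def spectrum_features(seq: str, p: int) -> Counter:
    """Count every contiguous length-p substring (p-mer) of seq."""
    return Counter(seq[i:i + p] for i in range(len(seq) - p + 1))


def dot(fs: Counter, ft: Counter) -> int:
    """Inner product of two sparse count vectors."""
    if len(fs) > len(ft):
        fs, ft = ft, fs  # iterate over the smaller vector
    return sum(count * ft[key] for key, count in fs.items())


def spectrum_kernel(s: str, t: str, p: int) -> int:
    """p-spectrum kernel: inner product of the p-mer count vectors."""
    return dot(spectrum_features(s, p), spectrum_features(t, p))


def mismatch_neighbors(kmer: str, m: int, alphabet: str) -> set:
    """All strings over alphabet within Hamming distance m of kmer."""
    out = {kmer}
    for d in range(1, m + 1):
        for positions in combinations(range(len(kmer)), d):
            for letters in product(alphabet, repeat=d):
                chars = list(kmer)
                for pos, ch in zip(positions, letters):
                    chars[pos] = ch
                out.add("".join(chars))  # the set removes duplicates
    return out


def mismatch_features(seq: str, p: int, m: int, alphabet: str) -> Counter:
    """Each p-mer of seq contributes 1 to every feature within m mismatches."""
    feats = Counter()
    for i in range(len(seq) - p + 1):
        for neighbor in mismatch_neighbors(seq[i:i + p], m, alphabet):
            feats[neighbor] += 1
    return feats


def mismatch_kernel(s: str, t: str, p: int, m: int, alphabet: str) -> int:
    """(p, m)-mismatch kernel: inner product of the mismatch feature vectors."""
    return dot(mismatch_features(s, p, m, alphabet),
               mismatch_features(t, p, m, alphabet))


if __name__ == "__main__":
    print(spectrum_kernel("ABAB", "BABA", 2))
    print(mismatch_kernel("AA", "BB", 2, 1, "AB"))
```

With m = 0 the mismatch kernel reduces to the spectrum kernel. The neighbor enumeration grows quickly with m and the alphabet size, which illustrates the extra computing time the abstract attributes to the mismatch kernel. The sketch uses a toy two-letter alphabet only to keep the example small.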

Related Dissertations

  1. Research and Application on Short Message Text Clustering,TP391.1
  2. Research of Finding Maximal Unique Matches in Genome,TP301.6
  3. Research on String Kernel Function SVM with Two Threshold Parameters,TP181
  4. Research and Applications on Network Protocol Anomaly Detection Models,TP393.08
  5. Research on Domain-Oriented Public Sentiment Analysis Technology,TP393.09
  6. Research on Kernels for Structured Data,TP18
  7. Research on Web-Oriented XML Retrieval,TP391.3
  8. Research on Key Technologies for Data Extracting in Data Warehousing,TP311.13
  9. Research on Method of Establishment of Related Articles Database for Biomedical Literatures Orienting Knowledge Service,G353
  10. An Algorithm Based on Suffix Tree for Identification of Repeats in DNA Sequence,TP301.6
  11. Open electronic document plagiarism detection services to build technology research,TP309.7
  12. Research and Implementation of Finding Duplicate Science Project Based on Non-segmentation Techniques,G311
  13. The Association between Coagulation System Activation and Risk Stratification of Cerebral infarction,R743.3
  14. Research and Implementation of the Domain-Dependent Vertical Search System,TP391.3
  15. Support Vector Machines and Its Applications in Study of Bio-Materials Function,TB39
  16. The Research and Implementation of Biological Sequence Alignment,TP399-C8
  17. Research on Web Document Clustering Approaches Based on Phrase Features,TP391.1
  18. The Realization of Chinese and English Clustering Engine Based on the Improved Suffix Tree Algorithm,TP391.3
  19. Research and Improvement of Detection Algorithm for Snort Detection Engine,TP393.08
  20. Frequent Sequence Pattern Mining in Web Log,TP311.13

CLC: Industrial Technology > Automation technology, computer technology > Computing technology, computer technology > General issues > Theories, methods > Algorithm Theory