
Research on Some New Methods of Statistical Learning Based on Chemical Data

Author: HuangXin
Tutor: XuQingSong
School: Central South University
Course: Probability Theory and Mathematical Statistics
Keywords: Statistical learning; Classification and regression tree; Kernel methods; Structure-activity relationship; Cross validation; Support vector machine; Sure independence screening; Partial least squares
CLC: O213
Type: PhD thesis
Year: 2013


As data grow increasingly complex, especially structure-activity relationship and spectral data, how to extract the most useful information with statistical learning methods has become one of the active topics in applied statistics research. Taking a data-driven approach and working with chemical data, we studied in depth the advantages and disadvantages of several classical statistical methods, such as the classification and regression tree, the support vector machine, and partial least squares, and propose several new statistical learning methods. The thesis consists of seven chapters.

Chapter 1 briefly introduces the research background and motivation, reviews the theories and methods of statistical learning used in chemical data analysis, which are the foundation of the new methods, and then outlines the main content and innovations of the thesis.

Chapter 2 proposes the constructed tree kernel for the first time, which is one of the most important innovations of the thesis. We discuss the classification and regression tree (CART) algorithm in detail and point out that samples under the same terminal node may share some specific similarity, not only class similarity. To obtain a diversity of tree structures, we coupled a Monte Carlo procedure with a classification tree algorithm and constructed a novel tree kernel using a fuzzy pruning strategy and an ensemble strategy. The fuzzy pruning strategy helps exploit the information in the inner nodes of the trees effectively without totally destroying the tree structure. The ensemble strategy makes the results of the tree kernel more stable and reliable than those of a single CART, so that they are less likely to arise by chance. This was our original motivation for building the tree kernel.
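The construction described above can be illustrated with a minimal sketch. It is not the thesis's algorithm: the fuzzy pruning strategy is omitted because the abstract does not specify it, and the function names (`gini`, `grow_tree`, `leaf_of`, `tree_kernel`) and parameters (`n_trees`, `frac`, `max_depth`) are hypothetical. It only shows the general idea of growing small classification trees on random subsamples and measuring the similarity of two samples as the fraction of trees in which they land in the same terminal node.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def grow_tree(X, y, idx, depth, max_depth, leaf_counter):
    """Greedily grow a small CART-style classification tree on rows idx."""
    best = None
    if depth < max_depth and len({y[i] for i in idx}) > 1:
        for f in range(len(X[0])):
            vals = sorted({X[i][f] for i in idx})
            for a, b in zip(vals, vals[1:]):
                t = (a + b) / 2.0  # candidate threshold: midpoint of neighbours
                left = [i for i in idx if X[i][f] <= t]
                right = [i for i in idx if X[i][f] > t]
                score = (len(left) * gini([y[i] for i in left])
                         + len(right) * gini([y[i] for i in right])) / len(idx)
                if best is None or score < best[0]:
                    best = (score, f, t, left, right)
    if best is None:  # pure node, depth limit reached, or no usable split
        leaf_counter[0] += 1
        return ("leaf", leaf_counter[0])
    _, f, t, left, right = best
    return ("split", f, t,
            grow_tree(X, y, left, depth + 1, max_depth, leaf_counter),
            grow_tree(X, y, right, depth + 1, max_depth, leaf_counter))


def leaf_of(tree, x):
    """Return the id of the terminal node that sample x falls into."""
    while tree[0] == "split":
        _, f, t, lo, hi = tree
        tree = lo if x[f] <= t else hi
    return tree[1]


def tree_kernel(X, y, n_trees=50, frac=0.7, max_depth=3, seed=0):
    """Assumed proximity kernel: share of trees where two samples co-occur
    in the same terminal node (fuzzy pruning is NOT implemented here)."""
    rng = random.Random(seed)
    n = len(X)
    counts = [[0] * n for _ in range(n)]
    for _ in range(n_trees):
        # Monte Carlo step: each tree sees a random subsample of the rows.
        idx = rng.sample(range(n), max(2, int(frac * n)))
        tree = grow_tree(X, y, idx, 0, max_depth, [0])
        leaves = [leaf_of(tree, x) for x in X]
        for i in range(n):
            for j in range(n):
                if leaves[i] == leaves[j]:
                    counts[i][j] += 1
    return [[c / n_trees for c in row] for row in counts]
```

Calling `tree_kernel(X, y)` with `X` as a list of feature rows and `y` as class labels returns a symmetric matrix with ones on the diagonal. Because the class labels drive every split, the resulting similarity is supervised, and variables that never produce a useful split have no effect on it.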
In fact, CART performs a greedy search, which may not be globally optimal, over samples and variables to find the variable subsets most relevant to classification and the sample subsets that share a specific similarity under different variable subspaces. The constructed tree kernel has several notable advantages: it is "supervised", because the class information dictates the structure of the trees while the kernel is being constructed; because irrelevant metabolites contribute little to the tree ensemble, they have little influence on the proximity measure, so the tree kernel can easily identify the important variables; and, being built on classification trees, it can deal effectively with nonlinear problems.

Then, within the framework of kernel methods, we coupled the novel tree kernel with the support vector machine, partial least squares, and k-nearest neighbor, and presented three new statistical learning methods: tree kernel support vector machine (TKSVM), tree kernel partial least squares (TKPLS), and tree kernel k-nearest neighbor (TKk-NN). Three datasets on different categorical bioactivities of compounds were used to test the performance of these methods. The results show that the advantages of the constructed tree kernel effectively improve on the traditional methods.

For high-dimensional spectral data, we proposed a novel modeling method, PLSSIS. One difficulty of high-dimensional data analysis is multicollinearity together with a large amount of redundant information. PLS is commonly used in this situation. However, a calibration model that includes all the variables contains much redundant information, which degrades the predictive ability of the model. Using PLS regression coefficients and the sure independence screening (SIS) principle, we developed a new stepwise variable selection strategy, named PLS regression combined with sure independence screening (PLSSIS).
PLSSIS is a forward iterative algorithm that combines PLS regression (PLSR) with SIS and can handle high-dimensional collinear data quickly and efficiently. On three spectral datasets, our study shows that PLSSIS gives better predictions than PLS modeling and moving window partial least squares regression (MWPLSR).

Finally, Chapter 7 summarizes the whole thesis and discusses prospects for future work.
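The abstract does not give the steps of PLSSIS, so the sketch below does not attempt to reproduce it. It shows only plain sure independence screening, the principle PLSSIS builds on: rank each variable by the absolute value of its marginal correlation with the response and keep the top `d`. The helper names `abs_corr` and `sis_screen` and the parameter `d` are assumptions made for illustration. In PLSSIS the ranking would presumably use PLS regression coefficients and be repeated in a forward, stepwise fashion.

```python
import math


def abs_corr(x, y):
    """Absolute Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:  # a constant column carries no information
        return 0.0
    return abs(sxy) / math.sqrt(sxx * syy)


def sis_screen(X, y, d):
    """Plain SIS (assumed simplification, not PLSSIS itself):
    return the indices of the d variables with the largest marginal
    correlation with y, together with all the scores."""
    p = len(X[0])
    scores = [abs_corr([row[j] for row in X], y) for j in range(p)]
    ranked = sorted(range(p), key=lambda j: scores[j], reverse=True)
    return ranked[:d], scores
```

For spectral data, `X` would hold one spectrum per row and one wavelength per column. Screening of this kind is cheap because each variable is scored on its own, which is why it suits very high-dimensional inputs. It ignores collinearity between variables, which is the gap that combining it with PLS is meant to address.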


© 2012 www.DissertationTopic.Net