
Design and Implementation of Machine Learning Platform Based on Spark

Author: TangZhenKun
Tutor: LinZuo
School: Xiamen University
Course: Computer technology
Keywords: Spark Machine Learning Massive Data Mining
CLC: TP181
Type: Master's thesis
Year: 2014

Abstract


With the development of cloud computing and distributed cluster technologies, the concept of big data has expanded widely in both volume and value, and machine learning, which plays an essential role in exploring big data, has attracted unprecedented attention in recent years. Traditional data mining algorithms are incapable of handling massive datasets. MapReduce has been applied successfully to many big data problems, but it lacks efficient support for parallelized, iterative machine learning algorithms. To address these problems, we propose a machine learning platform based on the emerging Spark framework. The platform not only processes massive data efficiently but also scales well, so it can satisfy the demands of many kinds of machine learning tasks.

The contributions of this thesis are as follows:

First, we develop a variety of machine learning algorithms based on Spark and the theory of large-scale machine learning: parallelized linear regression, support vector machines, and KMeans; matrix factorization and PageRank built on a graph computing model; and a dataflow version of KMeans. The aim is to achieve high utility, scalability, and efficiency together.

Second, several strategies are used in the implementation of the platform to optimize performance on large-scale datasets. For example, a Bagging strategy based on ensemble learning theory is adopted to improve the stability of the model, and sub-gradient optimization is used to improve the efficiency of model computation. A variant of the matrix factorization algorithm based on the graph computing framework is suited to the extremely sparse rating matrices found in massive datasets. In addition, the algorithms are implemented with object-oriented design methods for extensibility, and design patterns such as the Factory pattern and the Strategy pattern are encapsulated in the framework.

Third, following the design of the Lambda architecture, the platform is divided into three layers: a batch layer, a service layer, and a dataflow layer. The batch layer combines Spark and Hadoop to build models from batch datasets. The service layer constructs indexes over the batch model to support parallel real-time requests. The dataflow layer focuses on streaming computation to build models from real-time data. Each incoming request combines the batch and dataflow results into the final output.

The performance of the algorithms in the platform is verified by experimental results. Compared with serial algorithms on a single computer and with algorithms based on MapReduce, our methods show significant improvements in runtime, speedup ratio, and throughput.
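
The way a request combines batch and dataflow results can be illustrated with a minimal Python sketch. This is not the thesis implementation: the names ClusterStats and query_centroids, the choice to store each view as per-cluster sums and counts, and the sample numbers are all assumptions made for illustration, and plain Python stands in for Spark so that the example runs on its own.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

Vector = Tuple[float, ...]


@dataclass(frozen=True)
class ClusterStats:
    """Hypothetical sufficient statistics for one KMeans cluster."""
    total: Vector  # component-wise sum of the points assigned to the cluster
    count: int     # number of points assigned to the cluster

    def merge(self, other: "ClusterStats") -> "ClusterStats":
        # Sums and counts are additive, so two views can be combined exactly.
        total = tuple(a + b for a, b in zip(self.total, other.total))
        return ClusterStats(total, self.count + other.count)

    def centroid(self) -> Vector:
        return tuple(s / self.count for s in self.total)


def query_centroids(
    batch_view: Dict[int, ClusterStats],
    realtime_view: Dict[int, ClusterStats],
) -> Dict[int, Vector]:
    """Answer a request by merging the batch view with the real-time view."""
    merged = dict(batch_view)
    for cluster_id, stats in realtime_view.items():
        if cluster_id in merged:
            merged[cluster_id] = merged[cluster_id].merge(stats)
        else:
            merged[cluster_id] = stats
    # Skip empty clusters to avoid dividing by zero.
    return {cid: s.centroid() for cid, s in merged.items() if s.count > 0}


# Assumed sample data: the batch view comes from a periodic offline job,
# the real-time view from points that arrived since the last batch run.
batch_view = {
    0: ClusterStats((10.0, 20.0), 10),
    1: ClusterStats((50.0, 50.0), 5),
}
realtime_view = {0: ClusterStats((2.0, 0.0), 2)}

centroids = query_centroids(batch_view, realtime_view)
print(centroids)
```

Keeping sums and counts rather than finished centroids is one possible design choice: it makes the merge exact, whereas averaging two centroids directly would ignore how many points each view has seen.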

CLC: Industrial Technology > Automation technology, computer technology > Automated basic theory > Artificial intelligence theory > Automated reasoning, machine learning