
Extracting Informative Semantic Contents from Web Pages

Author: HeZheng
Tutor: LiuXiao
School: East China Normal University
Course: Computer Software and Theory
Keywords: Web Semantic Data; Semantic Blocks Annotation; Web Mining; Web Data Extraction; Machine Learning; SVM
CLC: TP393.092
Type: Master's thesis
Year: 2013

Abstract


The explosive growth in the number of web pages makes modeling and extracting semantic information from a web page increasingly challenging. Semantic information plays a significant role in ontology construction, web mining, and other applications, yet most current semantic interpretation methods require intensive human judgment, while others are restricted to particular domains. They therefore cannot keep up with today's large-scale, frequent application needs.

This thesis presents a knowledge model that describes the logical view of a web page. Using a small number of manually labeled training samples together with web mining and extraction techniques, a web page is automatically transformed from a stream of HTML tags and characters into a sequence of semantic blocks. The locations and functions of these blocks are the main semantic information of interest in this work.

Based on repeated structures, a long-studied type of data with many distinctive features, we propose a 3-step process to extract structural semantic information from a web page. In the first step, we design a compound classifier that combines decision tree and SVM algorithms to identify repeated structures in the web page. In the second step, meaningful repeated structures are defined as logical blocks that segment the page. In the last step, a semantic label is assigned to each segment to represent its function, and informative contents are then extracted accordingly.

Compared with existing methods, the proposed model and extraction method are easy to implement. The method is insensitive to changes in field, topic, and web page layout. It requires little manual effort and is expected to achieve precise block extraction for every web page. In this thesis, we walk through the proposed extraction process and explain each step in detail. In the experiment section, our method is compared with two state-of-the-art systems to demonstrate the value of our research.
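
The 3-step process described in the abstract can be illustrated with a rough Python sketch. This is not the thesis's implementation: the compound decision tree and SVM classifier is replaced here by a plain frequency threshold on sibling structure, and the labeling rule, the label names ("navigation", "content-list"), and every function and class name are hypothetical, chosen only to show the shape of the pipeline (detect repeated structures, treat them as blocks, label each block).

```python
from collections import Counter
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """Minimal DOM node: tag, parent, children and directly contained text."""

    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.children, self.text = [], []

    def signature(self):
        # Structural signature: own tag plus the tags of the direct children.
        return (self.tag, tuple(c.tag for c in self.children))

    def all_text(self):
        parts = list(self.text) + [c.all_text() for c in self.children]
        return " ".join(p for p in parts if p)

    def count(self, tag):
        # Number of descendants with the given tag.
        return sum((c.tag == tag) + c.count(tag) for c in self.children)


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML using only the standard library."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.text.append(data.strip())


def find_repeated(node, min_repeat=3):
    """Step 1 (stand-in): a container whose children share one signature at
    least min_repeat times is treated as a repeated structure. The thesis
    uses a trained decision tree + SVM classifier for this decision."""
    found = []
    counts = Counter(c.signature() for c in node.children)
    if counts:
        sig, n = counts.most_common(1)[0]
        if n >= min_repeat:
            found.append([c for c in node.children if c.signature() == sig])
    for child in node.children:
        found.extend(find_repeated(child, min_repeat))
    return found


def label_block(items):
    """Step 3 (stand-in): hypothetical rule, short link-heavy records are
    labeled as navigation, everything else as a content list."""
    links = sum(i.count("a") + (i.tag == "a") for i in items)
    words = sum(len(i.all_text().split()) for i in items)
    if links >= len(items) and words <= 3 * len(items):
        return "navigation"
    return "content-list"


def extract_blocks(html, min_repeat=3):
    """Steps 1-3 chained: each repeated structure becomes one block (step 2)
    and is returned as (label, [text of each record])."""
    builder = TreeBuilder()
    builder.feed(html)
    return [(label_block(items), [i.all_text() for i in items])
            for items in find_repeated(builder.root, min_repeat)]
```

Calling `extract_blocks(page_html)` on a page with a menu list and a list of article teasers would return one labeled block per repeated structure. In a setting closer to the thesis, the threshold in `find_repeated` and the rule in `label_block` would be replaced by classifiers trained on the manually labeled samples.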

Related Dissertations

  1. Soft Sensor of Naphtha Dry Point on Support Vector Machines Regression,TE622.1
  2. The Research on English-Chinese Name Entity Translation,TP391.2
  3. Research of Orange Quality Classification Technology Based on Computer Vision,TP391.41
  4. Based on Data Distribution Characteristics of Text Classification,TP391.1
  5. Research of License Plate Recognition Based on Rough Sets and Fuzzy SVM,TP391.41
  6. Prediction of Binding Affinity of Human Transporter Associated with Antigen Processing,R392.1
  7. Research on Several Machine Learning Methods and Their Applications in Video-Based Fingerprint Verification,TP391.41
  8. Research on Flat Features and Structural Information for Protein-Protein Interaction Extraction,TP181
  9. Research on Mapping Mechanism of Learning Expression,TP181
  10. The Research on Collective Multi-Label Classification,TP391.1
  11. Web-based Mining Technology and Its Application in Digital Library,G250.76
  12. The Research on Key Technologies for Web Information Personalization Collection and Management,TP393.09
  13. A Research on Financial Crisis Prediction of Listed Companies Based on Particle Swarm Optimization and Support Vector Machine,F275
  14. Semi-supervised Learning and Active Learning of Sentiment Classification Coupled with Domain Knowledge,TP181
  15. The Fatigue Estimation and Real-Time Monitoring Based on EEG,TN911.6
  16. Supervision topic model research and application,TP391.1
  17. Distortion effects on image quality evaluation and classification,TP391.41
  18. Ontology-based medicine named entity recognition technology research,TP391.1
  19. Application of Data Mining in Email Anti-Spam System,TP393.098
  20. Or figure based on license plate detection and recognition,TP391.41
  21. Based on self-learning social relation extraction research,TP391.1
