Study on Key Technology of GHMM-Based Web Text Information Extraction and System Design

Author: WangJing
Tutor: LiuZhiJing
School: Xi'an University of Electronic Science and Technology
Course: Applied Computer Technology
Keywords: Data Mining; Information Extraction; Generalized Hidden Markov Model; Named Entity Recognition
CLC: TP391.1
Type: Master's thesis
Year: 2008
Downloads: 128
Quote: 1

Abstract


With the rapid development of the Internet, the Web has become the world's largest source of information. How to obtain useful information from the Web is a problem everyone faces, and Web information extraction was proposed to solve it. At present, most information extraction work is limited to plain text and does not consider the particular characteristics of Web page text. It also rarely addresses semantic understanding. The hidden Markov model (HMM) is a commonly used information extraction model: it is easy to build, adaptable, and achieves high extraction precision, so it has attracted growing attention from researchers. However, the model suits only ordinary text and is a poor fit for Web pages, which carry more information. Web page text typically has several output attributes, such as term, layout, and format attributes. Because the traditional HMM uses only a single term attribute as the observation output feature during state transitions, this thesis uses multiple attributes (term, layout, and format) as the observation output features, which yields a generalized hidden Markov model (GHMM). For plain text, the traditional HMM takes a single sentence as the basic unit of extraction, and its assumed state-transition order (left to right, then top to bottom) is not appropriate for Web pages, whose content is arranged in more than one dimension. Analysis of Web pages shows that a page's visual layout is made up of blocks that are logically related to one another. This thesis therefore segments pages into blocks with the vision-based page segmentation algorithm (VIPS) to obtain a state-transition sequence that follows the page layout structure and suits Web pages better.
The output probability of the observation vector at any moment depends not only on the system's current state but also on its state at the previous moment, so the thesis proposes an improved GHMM based on a second-order Markov chain. For semantic analysis of pages, the thesis uses a named entity recognition method based on role labeling. The basic idea is to take the page text, combine it with a rule table of roles, label roles with the improved GHMM, match character strings against the resulting role sequence, and finally identify the named entities, so that information is extracted from both the structure and the semantics of Web pages. Applying Web information mining and information extraction to job postings on Shanghai recruitment websites, the thesis develops WebIE, a GHMM-based Web text extraction system.
The thesis first introduces the concepts of Web text information extraction. It then analyzes Web pages and, based on their characteristics, combines entity recognition with role labeling so that the improved GHMM extracts Web information from both the structural and the semantic side. Finally, it shows that the improved GHMM gives very good results for Web information extraction, and it discusses the system's shortcomings and directions for future research.
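
The abstract gives no formulas, so the following is only a minimal sketch of one way to read the model it describes, not the thesis's implementation. It assumes that the observation attributes are conditionally independent given the states, that the emission probability is conditioned on the previous and current state, and that transitions stay first-order (a simplification of the second-order chain the abstract mentions). The state names, attribute names, probabilities, and the functions `emission_logp` and `viterbi` are all hypothetical.

```python
import math

def emission_logp(emit, prev, cur, obs, floor=1e-6):
    """log P(obs | prev, cur), summed over attributes (assumed independent)."""
    return sum(
        math.log(emit[attr].get((prev, cur, value), floor))
        for attr, value in obs.items()
    )

def viterbi(states, start, trans, emit, observations):
    """Most likely state path when emissions depend on (prev, cur) states."""
    # The first observation has no predecessor, so prev is None.
    scores = {
        s: math.log(start[s]) + emission_logp(emit, None, s, observations[0])
        for s in states
    }
    backpointers = []
    for obs in observations[1:]:
        new_scores, pointers = {}, {}
        for cur in states:
            candidates = {
                p: scores[p] + math.log(trans[p][cur])
                   + emission_logp(emit, p, cur, obs)
                for p in states
            }
            best_prev = max(candidates, key=candidates.get)
            pointers[cur] = best_prev
            new_scores[cur] = candidates[best_prev]
        scores = new_scores
        backpointers.append(pointers)
    # Trace the best path backwards from the best final state.
    path = [max(scores, key=scores.get)]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]

# Hypothetical toy parameters; none of these values come from the thesis.
STATES = ["JOB", "PAY"]
START = {"JOB": 0.5, "PAY": 0.5}
TRANS = {"JOB": {"JOB": 0.5, "PAY": 0.5}, "PAY": {"JOB": 0.5, "PAY": 0.5}}
# EMIT[attribute][(previous state, current state, value)] = probability
EMIT = {
    "term": {(None, "JOB", "Engineer"): 0.8, ("JOB", "PAY", "5000"): 0.9},
    "format": {(None, "JOB", "bold"): 0.7, ("JOB", "PAY", "plain"): 0.8},
}
OBSERVATIONS = [
    {"term": "Engineer", "format": "bold"},
    {"term": "5000", "format": "plain"},
]

path = viterbi(STATES, START, TRANS, EMIT, OBSERVATIONS)
print(path)  # ['JOB', 'PAY']
```

Working in log space avoids numerical underflow on long sequences, and the `floor` value is a crude stand-in for proper smoothing of unseen attribute values.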

CLC: Industrial Technology > Automation Technology, Computer Technology > Computing Technology, Computer Technology > Computer Applications > Information Processing > Text Processing