
Knowledge Mining Based on Statistical Snowball Models

Author: LiuXiaoJiang
Tutor: YuNengHai;LiMingJing
School: University of Science and Technology of China
Course: Signal and Information Processing
Keywords: knowledge mining; named entity search; self-supervised learning; relationship extraction; named entity recognition; named entity summarization
CLC: TP391.3
Type: PhD thesis
Year: 2011

Abstract


With the rapid development of Internet technologies, the World Wide Web has grown into a huge knowledge repository containing many kinds of valuable information about real-world named entities. These named entities include organizations, locations and persons, ranging from celebrities to everyday individuals. Named entity search engines automatically mine named entities from Web pages and summarize knowledge about them based on their Web appearances, so that the results can be returned directly to users. Compared with general search engines, which can only return unstructured Web pages, this type of search engine offers a faster and more direct user experience, and it has become an active area of research and development in both industry and academia.

In order to build a fast and accurate named entity search engine, deep knowledge mining on named entities from the Web is required. There are three key knowledge mining problems in building named entity search engines: named entity recognition, named entity summarization and named entity relationship mining. Focusing on these three key problems, this dissertation proposes a statistical unsupervised learning framework named StatSnowball, which overcomes the disadvantages of state-of-the-art unsupervised learning models. The main contents and contributions of this dissertation are as follows:

1. Survey the state-of-the-art Web-scale knowledge mining systems, focusing on supervised methods based on natural language features and on self-supervised methods based on extraction patterns. Both types of methods have been widely used in different knowledge mining tasks. The analysis emphasizes the basic idea behind each type of method and its typical models.

2. Propose an unsupervised learning model for relationship extraction: StatSnowball (Statistical Snowball). The model adopts the bootstrapping framework and uses a general statistical model, Markov logic networks (MLN), as the underlying extraction model. With statistical pattern evaluation and selection methods, StatSnowball can incorporate all kinds of patterns. By adopting MLN, StatSnowball performs joint inference at various levels in relationship extraction. Experiments on both a small but fully labeled data set and large-scale Web data show the effectiveness of the method.

3. Propose a unified model for named entity recognition and relation extraction based on an iterative framework: EntSum. The model extends the conditional random field model used for named entity recognition so that relationship features can be added to it. The joint model uses the iterative framework to build a bidirectional connection between the two tasks, in which the results of each task inform the decisions of the other. Experiments on real Web data show performance improvements on both tasks.

4. Propose an entity summarization model, BioSnowball, which can be regarded as an extension of the basic StatSnowball model. Exploiting the Fact-Bio duality, BioSnowball adopts the bootstrapping framework and starts from only a small set of samples to jointly complete two types of summarization: fact extraction and biography ranking for Web entities. Experiments on real Web data and a user study show the effectiveness of the model on both problems. The success of BioSnowball also demonstrates the generality of the basic StatSnowball model.

5. Build two publicly available named entity search engines, Renlifang and EntityCube, for which the author served as the main researcher and developer. These two search engines automatically mine knowledge from billions of Chinese and English Web pages, respectively, and build an entry page for every extracted entity. StatSnowball has already been applied in these systems, and the other methods in this dissertation have also been verified on data from these two real systems.

The dissertation ends with conclusions and an outlook on future work.
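
The bootstrapping idea behind Snowball-style relationship extraction can be illustrated with a minimal Python sketch. It is not the dissertation's model: the Markov logic network and its statistical pattern selection are replaced here by a simple confidence ratio, and the toy corpus, the bracket notation for entity mentions, the function names and the 0.6 threshold are all assumptions made for illustration. The sketch also assumes the relation is functional, meaning each first entity has at most one partner.

```python
import re

# Hypothetical toy corpus; entity mentions are assumed to be pre-marked as [Name].
CORPUS = [
    "[Alice] is married to [Bob] .",
    "[Carol] is married to [Dave] .",
    "[Carol] and her husband [Dave] live in Paris .",
    "[Erin] and her husband [Frank] run a bakery .",
    "[Alice] met [Bob] at a party .",
    "[Alice] met [Frank] at a conference .",
]

MENTION = re.compile(r"\[([^\]]+)\]")


def occurrences(corpus):
    """Yield (first entity, text between the mentions, second entity)."""
    for sentence in corpus:
        mentions = list(MENTION.finditer(sentence))
        for left, right in zip(mentions, mentions[1:]):
            middle = sentence[left.end():right.start()].strip()
            yield left.group(1), middle, right.group(1)


def bootstrap(corpus, seeds, min_confidence=0.6, max_rounds=5):
    """Grow a tuple set from seeds by alternating pattern and tuple discovery."""
    known = set(seeds)
    patterns = set()
    occs = list(occurrences(corpus))
    for _ in range(max_rounds):
        # Functional-relation assumption: one partner per first entity.
        partner = dict(known)
        # Step 1: every context that links a known tuple is a candidate pattern.
        candidates = {mid for a, mid, b in occs if (a, b) in known}
        # Step 2: keep candidates that mostly agree with the known tuples.
        patterns = set()
        for pattern in candidates:
            positive = negative = 0
            for a, mid, b in occs:
                if mid != pattern or a not in partner:
                    continue  # unknown first entity: neither support nor conflict
                if partner[a] == b:
                    positive += 1
                else:
                    negative += 1
            if positive / (positive + negative) >= min_confidence:
                patterns.add(pattern)
        # Step 3: apply the accepted patterns to extract tuples for new entities.
        new = {(a, b) for a, mid, b in occs
               if mid in patterns and a not in partner}
        if not new:
            break
        known |= new
    return known, patterns


tuples, patterns = bootstrap(CORPUS, {("Alice", "Bob")})
print(sorted(tuples))    # [('Alice', 'Bob'), ('Carol', 'Dave'), ('Erin', 'Frank')]
print(sorted(patterns))  # ['and her husband', 'is married to']
```

Starting from the single seed pair, the first round accepts "is married to" and finds a second pair, which in turn lets the second round learn "and her husband". The context "met" is rejected in every round because it links Alice to both Bob and Frank, which gives it a confidence of only 0.5.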

Related Dissertations

  1. Research on Relationship Extraction Based on Semantic Pattern Matching in Web Environment,TP391.1
  2. The Research for Named Entity Recognition and Relation Extraction in Text,TP391.1
  3. Ontology-based medicine named entity recognition technology research,TP391.1
  4. CRF-based named joint extraction of entities and relationships,TP391.4
  5. Click data and search results based on fragments excavated named entities,TP391.3
  6. Human motion sequence data semantic analysis method,TP391.1
  7. Extension classification based knowledge mining complex product configuration design performance,TB472
  8. Research on Concept, Function and Application of Mechanical Structure Symmetry-Breaking,TH122
  9. Chinese named entity recognition and disambiguation,TP391.1
  10. Study on Chinese Name Entity Recognition and Some Related Issues,TP391.41
  11. The Research of Conditional Random Fields Based Chinese Named Entity Recognition,TP391.4
  12. Chinese Named Entity Recognition Based on Conditional Random Fields,TP391.43
  13. The Study of POI Abbreviations Dictionary in the Field of Location Search,TP391.3
  14. Research on Product Named Entity Recognition and Normalization,TP391.1
  15. English two-way time numbers and quantifiers identification and translation technology,TP391.2
  16. Study on CRF-based Chinese Named Entity Recognition,TP391.43
  17. Business Information Extraction Based on Internet,TP399-C2
  18. Study on the Kazakh Named Entity Recognition Method Based on N-gram Model,TP391.43
  19. Japanese Morphological Analysis and Its Application for CLIR,TP391.1
  20. Research on Knowledge Map Construction in Automobile R&D Management,F426.471
  21. Research on Named Entity Processing of Statistical Machine Translation,TP391.2
