
Creating Chinese-English Comparable Corpora

Author: WangShanShan
Tutor: HuangDeGen
School: Dalian University of Technology
Course: Applied Computer Technology
Keywords: Comparable Corpora; Cross-Language Information Retrieval; Keyword Extraction; Document Alignment
CLC: TP391.3
Type: Master's thesis
Year: 2013


Comparable corpora are valuable resources for many NLP applications, and extensive research has been done in recent years on mining information from them. Because few large-scale comparable corpora are publicly available at present, this paper presents a bi-directional CLIR-based method for creating comparable corpora from two independent news collections in different languages. The original Chinese and English document collections are each crawled from XinHuaNet and formatted in a consistent manner. For each document in the two collections, the best query keywords are extracted to represent the essential content of the document, and the keywords are then translated into the language of the other collection. The translated queries are run against the collection in the same language as the queries to retrieve candidate documents in the other language, and the candidates are aligned with the source document based on their publication dates and similarity scores. Results show that our approach significantly outperforms previous approaches to the construction of Chinese-English comparable corpora.

Our contributions are as follows:
(1) To obtain high-quality topic words for the Chinese documents, we propose a TF-IDF-based method that combines words and phrases. In the processing stage, the word segmentation results are corrected to improve their quality. Word extraction draws on the conventions of Chinese document expression, and phrase extraction draws on the characteristics of news articles. Finally, the results of the two stages are combined after duplicate strings are removed.
(2) In the document alignment stage, because the number of target documents corresponding to one source document is uncertain, this paper selects the top N target documents as the candidate set and applies a similarity threshold to filter the candidates, which improves the performance of the system.
(3) This paper presents a bi-directional CLIR-based method for creating comparable corpora from two independent news collections in different languages. In other words, each of the two languages serves in turn as the source language, with the other as the target. Experimental results show that the bi-directional method performs better than a method that uses only one direction.
(4) We design experiments to evaluate the performance of our system.
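
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation. The function names, the smoothed TF-IDF formula, the one-day publication-date window, and the union used to merge the two retrieval directions are all assumptions made for illustration. The abstract states only that TF-IDF, a top-N candidate set, a similarity threshold, publication dates, and two retrieval directions are used.

```python
import math
from collections import Counter
from datetime import date


def tfidf_keywords(doc_tokens, all_docs_tokens, k=5):
    """Return the k tokens of doc_tokens with the highest TF-IDF score.

    Assumption: smoothed IDF, log((1 + N) / (1 + df)) + 1. The thesis
    variant that combines words and phrases is not reproduced here.
    """
    n_docs = len(all_docs_tokens)
    doc_freq = Counter()
    for tokens in all_docs_tokens:
        doc_freq.update(set(tokens))  # count each term once per document
    term_freq = Counter(doc_tokens)
    scores = {
        term: (count / len(doc_tokens))
        * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in term_freq.items()
    }
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


def align(source_date, candidates, top_n=3, min_score=0.5, max_days=1):
    """Filter retrieved candidates for one source document.

    candidates: list of (doc_id, similarity, publication_date) tuples.
    Keeps the top_n by similarity, then drops those below min_score or
    published more than max_days away (the date window is an assumption).
    """
    ranked = sorted(candidates, key=lambda c: -c[1])[:top_n]
    return [
        doc_id
        for doc_id, score, published in ranked
        if score >= min_score and abs((published - source_date).days) <= max_days
    ]


def merge_directions(zh_to_en, en_to_zh):
    """Merge both retrieval directions into one set of (zh_id, en_id) pairs.

    Assumption: a plain union; the abstract does not say how the two
    directions are combined.
    """
    return set(zh_to_en) | {(zh_id, en_id) for en_id, zh_id in en_to_zh}
```

In this sketch, `align` would be called once per source document in each direction, and `merge_directions` would combine the resulting pairs. The values of `top_n`, `min_score`, and `max_days` are placeholders, not settings reported in the thesis.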

Related Dissertations

  1. Study on Web-based Translation Technology for Out-of-Vocabulary,TP391.2
  2. The Research of Link Structure in Tibetan Web Base on Social Network Analysis,TP393.09
  3. Research of the Model of Enterprise Competitive Intelligence Collection System Based on Cross-Language Information Retrieval,TP391.3
  4. Research on Construction and Application of English-Chinese Comparable Corpora,TP391.1
  5. Semantic Document Retrieval for English to Chinese Cross-Language Question Answering System,TP391.1
  6. Mining Chinese-English Named Entity Pairs from Comparable Corpora,TP391.1
  7. Personalized recommendation based image browsing and retrieval of relevant methods,TP391.41
  8. The Design and Implementation of Cross-Language Navigational Search Engine,TP391.3
  9. Leveling Out in Chinese Translated Fiction,I046
  10. Japanese Morphological Analysis and Its Application for Clir,TP391.1
  11. Chinese-English Cross-Language Question Answer Information Retrieval Technology,TP391.3
  12. Chinese word semantic similarity measure its cross-language information retrieval,TP391.1
  13. The Construction of Large-scale Chinese-English Comparable Corpora,TP391.1
  14. Research on Techniques of Query Translation for Cross-language Information Retrieval,TP391.3
  15. Building Mongolian and Chinese Bilingual Semantic Dictionary Oriented Cross Language Information Retrieval,TP391.1
  16. Research on Cross Language Information Retrieval Based on Interlingua Semantic,TP391.3
  17. The Research on Cross Language Text Categorization Based on Interlingua Semantic,TP391.1
  18. Research and Implementation of Mining Bilingual Named Entities from Large-Scale Web Pages,TP391.4
  19. Research and Implementation of English Audio Cross-Language Information Retrieval for the Mobile Learning,TP391.3
  20. The Application of Cross-Language Information Retrieval Based on Latent Semantic Analysis,TP391.3

CLC: Industrial Technology > Automation technology, computer technology > Computing technology, computer technology > Computer applications > Information processing > Retrieval machine