Research on Key Techniques of Multiple Documents Automatic Summarization

Author: XuYongDong
Advisor: WangXiaoLong
School: Harbin Institute of Technology
Course: Applied Computer Technology
Keywords: Multi-Document Automatic Summarization; Temporal Information Processing; Text Unit Similarity; Hierarchical Topic Identification; Multi-Document Rhetorical Structure
CLC: TP391.1
Type: PhD thesis
Year: 2007
Downloads: 506
Citations: 7

Abstract


Multi-document automatic summarization extracts important information, or information of interest to a user, from a set of texts on the same topic, and automatically generates a summary of fixed length. It is an applied technique that draws on several research fields, including linguistics, computational linguistics, artificial intelligence and information systems, so research on it can in turn contribute to progress in those fields. In addition, a workable multi-document summarization system has considerable practical value for improving the speed and precision of web information processing.

This thesis therefore studies general-purpose multi-document automatic summarization based on discourse structure. We first study the discourse relations between pairs of text units, including similarity relations between units from different documents, extraction of temporal information and identification of temporal relations between events, identification of rhetorical structure, and extraction of hierarchical topics. We then propose a multi-document representation based on rhetorical structure, the multi-document rhetorical structure (MRS). By representing the interrelationships between text units at different levels of granularity, together with the occurrence and evolution of events along the time dimension, this structure fuses information from multiple documents in parallel while preserving the original information in the set of related documents. Finally, we propose a series of algorithms built on MRS, covering summary sentence extraction, summary sentence ordering and summary generation.

The thesis consists of four parts. First, it studies the extraction of Chinese temporal information and the calculation of temporal semantics, along with temporal reasoning and the identification of temporal relations between events. Temporal information in text is very important for node anchoring, key event identification, event ordering and the revision of summary content.
Based on how temporal information is expressed in Chinese text, the thesis decomposes each temporal phrase into small elements that each carry a single meaning and are easy to extract, and then combines these elements into temporal expressions using integration rules. In the course of this process, the final temporal semantic value and the temporal relations between events are calculated.

Second, the thesis studies how to calculate the similarity of text units. Units from different documents can be semantically similar, and this is an important cue for finding important summary sentences. Because the semantic similarity of text units cannot be calculated with strategies designed for whole-document similarity, the thesis proposes a unit similarity method based on multi-feature fusion. It mines as many useful features as possible and fuses them automatically by machine learning, which avoids the information loss that arises when a text is represented in the traditional way by words or concepts alone. A logistic regression model is used to fit the relation between the features and text unit similarity automatically. Such a model fits well, makes it easy to add new features or remove existing ones, and is therefore more extensible.

Third, because automatic topic identification is a key technique in summarization, the thesis proposes the notion of hierarchical topics, based on an analysis of how topics are distributed in a document set and where topic boundaries lie, and uses a hierarchical tree in place of the traditional single-layer topic structure. We believe this reflects the true content of the document set more effectively. Concretely, a hierarchical clustering algorithm builds the hierarchical topic tree, and the clustering threshold is obtained automatically by identifying the inflection point of a density curve.

Fourth, building a sound formal representation of the document set is the foundation for the subsequent work. In describing cross-document structure theory (CST), Dragomir R. Radev proposed two basic data structures, the cube and the graph. The cube structure accounts for the influence of temporal information on topic identification in a document set. The graph structure divides the relationships between text units into multiple fine-grained rhetorical relations. Inspired by this idea, the thesis proposes a multi-document rhetorical structure (MRS) and designs a series of algorithms built on it, covering summary sentence extraction, summary sentence ordering and summary generation. An MRS comprises nodes, which represent text units, and links, which represent the relations between these units. The links include rhetorical relations, which determine the importance of a unit within its text, and similarity relations, which capture the similarity between a unit and all related nodes from other documents. The temporal information of a unit shows the occurrence and evolution of the event that the node describes. Combining these factors therefore determines the importance of a node within the whole document set.

Finally, the thesis proposes an evaluation method for multi-document automatic summarization in which the single standard summary sentence for a document set is extended to a set of standard summary sentences, making the precision and redundancy measurements of a summary more reasonable. Experimental results show that the MRS-based multi-document summarization system generates summaries of good quality.
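
The multi-feature fusion idea from the second part can be illustrated with a minimal sketch. It computes a few similarity features for a pair of text units and lets a logistic regression model learn how to weight them. The three features (word overlap, length ratio, bigram overlap), the toy English sentence pairs and the training settings are all assumptions made for illustration; the abstract does not list the features or data the thesis actually uses.

```python
import math

def tokens(text):
    # Whitespace tokenization; an assumption that only suits this toy English data.
    return text.lower().split()

def overlap(set_a, set_b):
    # Jaccard overlap of two sets, defined as 0.0 when both are empty.
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def features(a, b):
    # Hypothetical feature set; adding or removing a feature only changes this list.
    ta, tb = tokens(a), tokens(b)
    word_overlap = overlap(set(ta), set(tb))
    length_ratio = min(len(ta), len(tb)) / max(len(ta), len(tb)) if ta and tb else 0.0
    bigram_overlap = overlap(set(zip(ta, ta[1:])), set(zip(tb, tb[1:])))
    return [word_overlap, length_ratio, bigram_overlap]

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train(pairs, labels, epochs=2000, lr=0.5):
    # Plain stochastic gradient descent on the logistic loss.
    xs = [features(a, b) for a, b in pairs]
    weights = [0.0] * len(xs[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, labels):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = p - y
            bias -= lr * err
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
    return weights, bias

def similarity(a, b, weights, bias):
    # Fused similarity score in (0, 1) for a pair of text units.
    x = features(a, b)
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

# Toy labeled pairs: 1 = similar units, 0 = unrelated units (made-up data).
PAIRS = [
    ("the quake struck the city on monday", "an earthquake hit the city on monday"),
    ("rescue teams arrived on tuesday", "rescue teams reached the area on tuesday"),
    ("the quake struck the city on monday", "stock prices rose sharply"),
    ("rescue teams arrived on tuesday", "the festival opens next month"),
]
LABELS = [1, 1, 0, 0]

weights, bias = train(PAIRS, LABELS)
for (a, b), y in zip(PAIRS, LABELS):
    print(y, round(similarity(a, b, weights, bias), 3))
```

Because the model only sees the feature vector, a new cue can be tried by appending one value in `features` and retraining, which is the kind of extensibility the abstract attributes to the logistic regression approach.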

CLC: Industrial Technology > Automation Technology, Computer Technology > Computing Technology, Computer Technology > Computer Applications > Information Processing > Text Processing