Chinese BBS Information Extraction and Classification

Author: HanJie
Tutor: LiaoWenJian
School: WRI
Course: Communication and Information System
Keywords: Information Extraction; Information Classification; Bulletin Board System; Floor Split; Anchor Information; Anchor Induction Algorithm; Semantic Label Discovery
CLC: TP393.094
Type: Master's thesis
Year: 2009

Abstract


Network information resources are disordered, massive, and constantly changing. Information extraction and classification help users find the information they need quickly and yield structured data that other application systems can use directly, which promotes the use of network information. Among the different kinds of information sources, this dissertation focuses on methods for extracting and classifying BBS information and describes that information in structured form.

The BBS page is parsed into a DOM tree, and the rule governing the floor units (the individual posts in a thread) is found from the positional regularity of elements in the tree. Three kinds of anchor information are proposed: the structured anchor, the individual anchor, and the JavaScript anchor. Based on the distinctive characteristics of anchor information, an anchor induction algorithm is introduced. The algorithm uses the position, quantity, and relations of the anchor information in the DOM tree to obtain the anchors from BBS pages, and then deduces the floor units' rule in reverse from the anchor positions. Once a stable mapping between the anchor information and the floor units is established, the floor units are located on the DOM tree through the paths of the anchor information and then split from the tree accurately.
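The idea of deducing floor units in reverse from anchor positions can be illustrated with a minimal Python sketch. It is not the dissertation's anchor induction algorithm: the anchors are assumed to be already known (here, a hypothetical `is_author_link` test), the sample markup is invented, and the page is assumed to be well-formed so that the standard `xml.etree.ElementTree` parser can read it. The sketch takes the lowest common ancestor of all anchors and treats each child sub-tree that contains an anchor as one floor unit.

```python
import xml.etree.ElementTree as ET

# Invented sample page; real BBS markup is rarely this clean.
PAGE = """
<html><body>
  <div id="nav"><a href="/">Home</a></div>
  <div id="thread">
    <div class="post"><a class="author" href="/u/1">alice</a><p>First post</p></div>
    <div class="post"><a class="author" href="/u/2">bob</a><p>Second post</p></div>
    <div class="post"><a class="author" href="/u/3">carol</a><p>Third post</p></div>
  </div>
</body></html>
"""


def index_paths(root):
    """Map every element to its path from the root (tuple of child indices)."""
    paths = {root: ()}
    stack = [root]
    while stack:
        node = stack.pop()
        for i, child in enumerate(node):
            paths[child] = paths[node] + (i,)
            stack.append(child)
    return paths


def resolve(root, path):
    """Follow a path of child indices down from the root."""
    node = root
    for i in path:
        node = node[i]
    return node


def split_floors(root, is_anchor):
    """Return one sub-tree per anchor, located through the anchor paths."""
    paths = index_paths(root)
    anchor_paths = [paths[e] for e in root.iter() if is_anchor(e)]
    if len(anchor_paths) < 2:
        return []  # repetition is needed to infer a floor rule
    # The longest common path prefix identifies the lowest common ancestor.
    prefix_len = 0
    for steps in zip(*anchor_paths):
        if len(set(steps)) != 1:
            break
        prefix_len += 1
    # Each floor unit is the sub-tree one step below that ancestor.
    floor_paths = sorted({p[:prefix_len + 1] for p in anchor_paths})
    return [resolve(root, p) for p in floor_paths]


def is_author_link(element):
    """Hypothetical anchor test: an author link inside each post."""
    return element.tag == "a" and element.get("class") == "author"


root = ET.fromstring(PAGE.strip())
floors = split_floors(root, is_author_link)
for floor in floors:
    print(floor.get("class"), floor.find("p").text)
```

The navigation link is ignored because it does not satisfy the anchor test, so only the three repeated post containers are returned.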
Experimental analysis shows that the method correctly splits 87.39 percent of BBS pages.

Information is then extracted from BBS pages floor unit by floor unit. Floor units from the same BBS site share the same DOM sub-tree structure, so the position of the needed information within the sub-tree does not change. Comparing the DOM sub-trees of two floor units and extracting the content that differs at the same position yields the collection of information items for each floor unit. In the classification procedure, the information items are sorted by their position in the DOM sub-tree. Each information item is then mapped to the semantic label of its category according to its underlying semantic features; in this way the method recovers 70 percent of the structured schema information of the BBS back-end database table. The method greatly reduces manual labor. The structured table data obtained through BBS information extraction and classification is conducive to the design and management of BBS sites.
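The comparison step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the dissertation's implementation: the two floor units are invented, they are assumed to share exactly the same sub-tree shape, only each element's leading text is compared (tail text is ignored), and the function names are hypothetical. Positions are tuples of child indices, so sorting them orders the items by their place in the sub-tree.

```python
import xml.etree.ElementTree as ET

# Two invented floor units with the same sub-tree structure.
FLOOR_A = ('<div class="post"><span>Author:</span><a>alice</a>'
           '<span>2009-03-01</span><p>First post</p></div>')
FLOOR_B = ('<div class="post"><span>Author:</span><a>bob</a>'
           '<span>2009-03-02</span><p>Second post</p></div>')


def text_by_position(unit):
    """Map each element's position (tuple of child indices) to its text."""
    texts = {}
    stack = [((), unit)]
    while stack:
        path, node = stack.pop()
        texts[path] = (node.text or "").strip()
        for i, child in enumerate(node):
            stack.append((path + (i,), child))
    return texts


def varying_items(unit_a, unit_b):
    """Return (position, text_a, text_b) where the two units differ."""
    a = text_by_position(unit_a)
    b = text_by_position(unit_b)
    shared = sorted(set(a) & set(b))  # positions present in both units
    return [(p, a[p], b[p]) for p in shared if a[p] != b[p]]


items = varying_items(ET.fromstring(FLOOR_A), ET.fromstring(FLOOR_B))
for position, text_a, text_b in items:
    print(position, text_a, "|", text_b)
```

The fixed "Author:" caption is identical in both units and is therefore dropped, while the author name, the timestamp, and the post body are kept as candidate information items. Assigning a semantic label to each position would be a separate step and is not shown here.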

© 2012 www.DissertationTopic.Net