Forum data extraction without explicit rules

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

1 Scopus citation

Abstract

Web forum data contributed by millions of users are a mixture of well-formed user information and free-format user-generated content. Though easy for users to read, forum data are difficult for computer systems to analyze because of the various surrounding HTML tags. Extracting forum data automatically from a large number of Web sites is challenging because these sites may have different styles. In this paper, we propose an approach to extract user information and user-generated content from multiple forum sites by using both structural and textual characteristics of forums. A structural induction process and a term combination computation process are introduced to ensure extraction accuracy and automation. Extensive experiments on real-life data sets show the effectiveness of our proposed method.

Original language: English
Title of host publication: Proceedings - 2nd International Conference on Cloud and Green Computing and 2nd International Conference on Social Computing and Its Applications, CGC/SCA 2012
Pages: 460-465
Number of pages: 6
State: Published - 2012
Event: 2nd International Conference on Cloud and Green Computing, CGC 2012, Held Jointly with the 2nd International Conference on Social Computing and Its Applications, SCA 2012 - Xiangtan, Hunan, China
Duration: 1 Nov 2012 to 3 Nov 2012

Publication series

Name: Proceedings - 2nd International Conference on Cloud and Green Computing and 2nd International Conference on Social Computing and Its Applications, CGC/SCA 2012

Conference

Conference: 2nd International Conference on Cloud and Green Computing, CGC 2012, Held Jointly with the 2nd International Conference on Social Computing and Its Applications, SCA 2012
Country/Territory: China
City: Xiangtan, Hunan
Period: 1/11/12 to 3/11/12

Keywords

  • forum data extraction
  • user-generated content
