A refined TF-IDF algorithm based on channel distribution information for web news feature extraction

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

12 Scopus citations

Abstract

The TF-IDF algorithm is widely used in text feature extraction, where the IDF value indicates the importance of a term. When applied to the processing of web news, the traditional IDF does not work well, especially on a collection divided into channels. To solve this problem, a refined IDF scheme is proposed, named Channel Distribution Information (CDI) IDF, which is based on the information in the IDF values of the individual channel collections. From their statistical features, the top terms and the meaningless terms can be identified. Experiments on a manually labeled test set indicate that, relative to the traditional TF-IDF, the CDI TF-IDF increases the Recall, Precision and F0.5 measures by 2.71%, 3.07% and 3.00%, respectively.
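The abstract does not state the CDI IDF formula, so the following is only a minimal sketch of the general idea it describes: compute an IDF value per channel and look at how a term's IDF varies across channels. The smoothed IDF form, the standard-deviation "spread" statistic, the function names, and the toy data are all assumptions made for illustration and are not the authors' method. The `f_beta` helper uses the standard F-beta definition, with beta = 0.5 corresponding to the F0.5 measure reported above.

```python
import math
from statistics import pstdev


def channel_idf(term, docs):
    """Smoothed IDF of `term` within one channel's documents (assumed form)."""
    n_docs = len(docs)
    df = sum(1 for doc in docs if term in doc)
    return math.log((n_docs + 1) / (df + 1))


def idf_spread(term, docs_by_channel):
    """Hypothetical statistic: std. dev. of a term's IDF across channels."""
    values = [channel_idf(term, docs) for docs in docs_by_channel.values()]
    return pstdev(values)


def f_beta(precision, recall, beta=0.5):
    """Standard F-beta score; beta < 1 weights precision more than recall."""
    b2 = beta * beta
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0


# Toy corpus (invented for illustration): each document is a list of terms.
docs_by_channel = {
    "sports": [["match", "goal", "said"], ["goal", "team", "said"]],
    "finance": [["stock", "market", "said"], ["market", "bank", "said"]],
}

for term in ("said", "goal"):
    print(term, round(idf_spread(term, docs_by_channel), 4))
```

In this toy corpus, a term that appears in every document of every channel ("said") has the same IDF in each channel and so a spread of zero, while a channel-specific term ("goal") has a nonzero spread. How the paper actually turns such per-channel statistics into the CDI IDF weight is not given in the abstract.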

Original language: English
Title of host publication: 2nd International Workshop on Education Technology and Computer Science, ETCS 2010
Pages: 15-19
Number of pages: 5
State: Published - 2010
Event: 2nd International Workshop on Education Technology and Computer Science, ETCS 2010 - Wuhan, Hubei, China
Duration: 6 Mar 2010 → 7 Mar 2010

Publication series

Name: 2nd International Workshop on Education Technology and Computer Science, ETCS 2010
Volume: 2

Conference

Conference: 2nd International Workshop on Education Technology and Computer Science, ETCS 2010
Country/Territory: China
City: Wuhan, Hubei
Period: 6/03/10 → 7/03/10

Keywords

  • Channel distribution information
  • Feature extraction
  • TF-IDF
