A comparative study on term weighting schemes for text categorization

  • Man Lan*
  • Sam Yuan Sung
  • Hwee Boon Low
  • Chew Lim Tan

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

70 Scopus citations

Abstract

The term weighting scheme, which converts documents into vectors in the term space, is a vital step in automatic text categorization. Previous studies showed that the term weighting scheme, rather than the kernel function of the SVM, dominates performance on the text categorization task. In this paper, we conducted experiments to compare various term weighting schemes with SVMs on two widely used benchmark data sets. We also presented a new term weighting scheme, tf.rf, for text categorization. The cross-scheme comparison was performed using McNemar's tests. The controlled experimental results showed that the newly proposed tf.rf scheme is significantly better than the other term weighting schemes. Compared with schemes based on the tf factor alone, adding the idf factor does not improve, and may even decrease, the term's discriminating power for text categorization. The binary and tf.chi representations significantly underperform the other term weighting schemes.
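To make the compared schemes concrete, here is a minimal Python sketch of binary, tf, tf.idf, and tf.rf weighting for a single binary category. The abstract does not define tf.rf, so the sketch assumes a relevance-frequency factor of the form rf = log2(2 + a / max(1, c)), where a and c are the numbers of positive-category and negative-category documents containing the term; the formula used in the paper may differ. The function name `term_weights` and its parameters are hypothetical, and the idf variant (log2(N / df)) is one common choice among several.

```python
import math
from collections import Counter

def term_weights(docs, labels, scheme="tf.rf"):
    """Return one {term: weight} dict per document.

    docs   : list of token lists
    labels : list of bools, True if the document is in the positive category
    scheme : "binary", "tf", "tf.idf", or "tf.rf"
    """
    if scheme not in ("binary", "tf", "tf.idf", "tf.rf"):
        raise ValueError(f"unknown scheme: {scheme}")

    n_docs = len(docs)
    df = Counter()      # number of documents containing each term
    pos_df = Counter()  # number of positive-category documents containing it
    for tokens, is_pos in zip(docs, labels):
        for term in set(tokens):
            df[term] += 1
            if is_pos:
                pos_df[term] += 1

    def collection_factor(term):
        if scheme == "tf.idf":
            # One common idf variant; others add smoothing.
            return math.log2(n_docs / df[term])
        if scheme == "tf.rf":
            # ASSUMPTION: rf form not given in the abstract.
            a = pos_df[term]       # positive documents containing the term
            c = df[term] - a       # negative documents containing the term
            return math.log2(2 + a / max(1, c))
        return 1.0                 # plain tf: no collection-level factor

    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        if scheme == "binary":
            vectors.append({t: 1.0 for t in tf})
        else:
            vectors.append({t: tf[t] * collection_factor(t) for t in tf})
    return vectors


# Toy corpus: the first document belongs to the positive category.
docs = [["goal", "goal", "match"], ["match", "vote"], ["vote", "tax"]]
labels = [True, False, False]
for scheme in ("binary", "tf", "tf.idf", "tf.rf"):
    print(scheme, term_weights(docs, labels, scheme)[0])
```

Under this assumed rf form, the factor depends on how a term is distributed between the positive and negative categories, whereas idf looks only at how many documents contain the term regardless of category.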

Original language: English
Title of host publication: Proceedings of the International Joint Conference on Neural Networks, IJCNN 2005
Pages: 546-551
Number of pages: 6
State: Published - 2005
Externally published: Yes
Event: International Joint Conference on Neural Networks, IJCNN 2005 - Montreal, QC, Canada
Duration: 31 Jul 2005 to 4 Aug 2005

Publication series

Name: Proceedings of the International Joint Conference on Neural Networks
Volume: 1

Conference

Conference: International Joint Conference on Neural Networks, IJCNN 2005
Country/Territory: Canada
City: Montreal, QC
Period: 31/07/05 to 4/08/05
