Categorical term frequency probability based feature selection for document categorization

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

3 Scopus citations

Abstract

Document categorization relies heavily on how features are distributed across categories. Terms that occur unevenly across categories carry strong discriminative information for categorization. We first define CTFP (Categorical Term Frequency Probability), which is used to accurately reflect the categorical characteristics of a term in each category. We then introduce the CTFP-VM (Variance-Mean based on CTFP) feature selection criterion to reveal differences in category distribution. After computing and ranking the variance mean of the CTFP distribution for each term, feature sets are obtained for document categorization. We perform document categorization experiments with SVM classifiers on the well-known Reuters-21578 and 20 news-18828 corpora, used as the unbalanced and balanced corpus respectively. The experiments compare the proposed method with conventional feature selection algorithms, and the proposed method achieves the best feature set for document categorization. The experimental results also demonstrate that the proposed CTFP-based variance-mean feature selection method not only achieves a better F1 measure for document categorization but also adapts well across corpora.
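The pipeline described in the abstract (per-category term probabilities, a variance-mean score per term, then ranking) can be sketched roughly as follows. The abstract does not give the formulas, so this sketch makes two assumptions that may differ from the paper: CTFP is taken to be a term's share of all tokens in a category, and the variance-mean score is taken to be the variance of a term's CTFP values across categories divided by their mean. The function names `ctfp_table`, `variance_mean_scores`, and `select_features` are hypothetical.

```python
from collections import Counter
from statistics import mean, pvariance


def ctfp_table(docs, labels):
    """Map each term to its list of per-category probabilities.

    ASSUMPTION: CTFP(t, c) = tf(t, c) / total tokens in c.
    The paper's exact definition is not stated in the abstract.
    """
    counts = {}  # category -> Counter of term frequencies
    for tokens, label in zip(docs, labels):
        counts.setdefault(label, Counter()).update(tokens)

    categories = sorted(counts)
    totals = {cat: sum(counts[cat].values()) for cat in categories}
    vocab = sorted({term for c in counts.values() for term in c})

    return {
        term: [counts[cat][term] / totals[cat] for cat in categories]
        for term in vocab
    }


def variance_mean_scores(table):
    """Score each term by how unevenly it is spread over categories.

    ASSUMPTION: score = population variance / mean of the CTFP values.
    The mean is positive because every term occurs in some category.
    """
    return {term: pvariance(probs) / mean(probs) for term, probs in table.items()}


def select_features(docs, labels, k):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    scores = variance_mean_scores(ctfp_table(docs, labels))
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]
```

Here `docs` is a list of token lists and `labels` the matching category names. Under these assumptions, a term that is equally frequent in every category scores zero, while a term concentrated in one category scores high, which matches the abstract's intuition that unevenly distributed terms are the most useful for categorization.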

Original language: English
Title of host publication: 2013 International Conference on Soft Computing and Pattern Recognition, SoCPaR 2013
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 66-71
Number of pages: 6
ISBN (Electronic): 9781479934003
State: Published - 2013
Event: 2013 International Conference on Soft Computing and Pattern Recognition, SoCPaR 2013 - Hanoi, Viet Nam
Duration: 15 Dec 2013 to 18 Dec 2013

Publication series

Name: 2013 International Conference on Soft Computing and Pattern Recognition, SoCPaR 2013

Conference

Conference: 2013 International Conference on Soft Computing and Pattern Recognition, SoCPaR 2013
Country/Territory: Viet Nam
City: Hanoi
Period: 15/12/13 to 18/12/13

Keywords

  • categorical distribution
  • document categorization
  • feature selection
  • term frequency
  • variance mean
