Improved categorical distribution difference feature selection for Chinese document categorization

Research output: Contribution to conference › Paper › peer-review

3 Scopus citations

Abstract

Feature selection is an important step in document classification: it chooses a subset of features relevant to a particular application. First, a CDFP-VM criterion for feature selection was designed based on the categorical document frequency probability (CDFP). Second, a maximum conditional distribution factor was proposed to further improve the CDFP-VM criterion. The method is advantageous when a small number of features is selected, especially for classes with few training documents. It keeps the best features while favoring neither high nor low document frequency (DF) terms, thus improving the final performance of the document categorization system. We perform experiments on the standard Fudan Chinese corpus and a selected Sogou corpus, as balanced and unbalanced corpora respectively. The experimental results demonstrate the effectiveness of the proposed feature selection method in Chinese document categorization.

Original language: English
State: Published - 2014
Event: 8th International Conference on Ubiquitous Information Management and Communication, ICUIMC 2014 - Siem Reap, Cambodia
Duration: 9 Jan 2014 to 11 Jan 2014

Conference

Conference: 8th International Conference on Ubiquitous Information Management and Communication, ICUIMC 2014
Country/Territory: Cambodia
City: Siem Reap
Period: 9/01/14 to 11/01/14

Keywords

  • Categorical distribution difference
  • Category tendency
  • Document categorization
  • Feature selection
  • Variance mean
