Improved relative term frequency probability feature selection for document categorization

Qiang Li, Liang He, Xin Lin

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Feature selection is an important step in document classification: it chooses a subset of features relevant to a particular application. Terms that occur unevenly across categories carry strong discriminative information for categorization. First, based on the categorical document frequency probability (CTFP), a CTFP_VM feature selection algorithm is designed. Second, a maximum term frequency conditional distribution factor is proposed to further improve the CTFP_VM criterion. Document categorization experiments are performed with SVM classifiers on the well-known Reuters-21578 and 20news-18828 corpora, used as the unbalanced and balanced corpus respectively. The experiments compare the new methods with conventional feature selection algorithms, and the proposed method yields an excellent feature set for document categorization.

Original language: English
Title of host publication: Achievements in Engineering Sciences
Publisher: Trans Tech Publications
Pages: 1102-1109
Number of pages: 8
ISBN (Print): 9783038350842
State: Published - 2014
Event: 3rd International Conference on Manufacturing Engineering and Process, ICMEP 2014 - Seoul, Korea, Republic of
Duration: 10 Apr 2014 → 11 Apr 2014

Publication series

Name: Applied Mechanics and Materials
Volume: 548-549
ISSN (Print): 1660-9336
ISSN (Electronic): 1662-7482

Conference

Conference: 3rd International Conference on Manufacturing Engineering and Process, ICMEP 2014
Country/Territory: Korea, Republic of
City: Seoul
Period: 10/04/14 → 11/04/14

Keywords

  • Categorical distribution
  • Category tendency
  • Distribution probability
  • Term frequency
  • Variance mean
