Abstract
Feature selection is an important step in document classification: it chooses the subset of features relevant to a particular application. First, a CDFP-VM criterion for feature selection was designed based on the categorical document frequency probability (CDFP). Second, a maximum conditional distribution factor was proposed to improve the CDFP-VM criterion further. The method is advantageous when only a small number of features is selected, especially for classes with few training documents. It keeps the best features without favoring either high- or low-document-frequency (DF) terms, and thus improves the final performance of the document categorization system. We performed experiments on the standard Fudan Chinese corpus and a selected Sogou corpus, used as the balanced and unbalanced corpora respectively. The experimental results demonstrate the effectiveness of the proposed feature selection method in Chinese document categorization.
| Original language | English |
|---|---|
| State | Published - 2014 |
| Event | 8th International Conference on Ubiquitous Information Management and Communication, ICUIMC 2014 - Siem Reap, Cambodia; Duration: 9 Jan 2014 → 11 Jan 2014 |
Conference
| Conference | 8th International Conference on Ubiquitous Information Management and Communication, ICUIMC 2014 |
|---|---|
| Country/Territory | Cambodia |
| City | Siem Reap |
| Period | 9 Jan 2014 → 11 Jan 2014 |
Keywords
- Categorical distribution difference
- Category tendency
- Document categorization
- Feature selection
- Variance mean