Fast text classification: A training-corpus pruning based approach

Shuigeng Zhou, Tok Wang Ling, Jihong Guan, Jiangtao Hu, Aoying Zhou

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

14 Scopus citations

Abstract

With the rapid growth of information available online, text classification is becoming increasingly important. kNN is a widely used, high-performance text classification method. However, it is inefficient because it must compute the similarity between a test document and every training document. In this paper, we propose a fast kNN text classification approach based on pruning the training corpus. With this approach, the training corpus can be condensed sharply, so the time spent on kNN searching is cut significantly and classification efficiency improves substantially, while classification performance remains comparable to that obtained without pruning. An effective algorithm for text corpus pruning is designed. Experiments on the Reuters corpus validate the practicability of the proposed approach. Our approach is especially suitable for on-line text classification applications.
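To make the cost argument in the abstract concrete, here is a minimal Python sketch of kNN text classification in which each test document is compared against every training document, followed by a simple corpus-pruning step. The abstract does not describe the paper's pruning algorithm, so `prune_near_duplicates` is a hypothetical stand-in (it drops same-class near-duplicates) and not the authors' method. The function names, the toy corpus, the cosine-similarity choice, and the 0.8 threshold are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_corpus(labeled_texts):
    """Turn (text, label) pairs into (term_vector, label) pairs."""
    return [(Counter(text.split()), label) for text, label in labeled_texts]


def knn_classify(test_text, corpus, k=3):
    """Similarity-weighted vote among the k nearest training documents.

    Every entry of `corpus` is scored, so the cost grows with corpus size.
    """
    vec = Counter(test_text.split())
    scored = sorted(((cosine(vec, tv), label) for tv, label in corpus),
                    reverse=True)
    votes = Counter()
    for sim, label in scored[:k]:
        votes[label] += sim
    return votes.most_common(1)[0][0]


def prune_near_duplicates(corpus, threshold=0.8):
    """Hypothetical pruning rule (NOT the paper's algorithm): skip a document
    if an already kept document of the same class is at least `threshold`
    similar to it."""
    kept = []
    for vec, label in corpus:
        redundant = any(l == label and cosine(vec, v) >= threshold
                        for v, l in kept)
        if not redundant:
            kept.append((vec, label))
    return kept


# Toy data, invented for illustration only.
full = build_corpus([
    ("oil price rises", "energy"),
    ("oil price rises again", "energy"),
    ("crude oil output", "energy"),
    ("wheat harvest grows", "grain"),
    ("wheat harvest grows fast", "grain"),
    ("corn crop report", "grain"),
])
pruned = prune_near_duplicates(full)

for query in ("oil price falls", "wheat crop harvest"):
    print(query, "->", knn_classify(query, full), knn_classify(query, pruned))
```

On this toy data the pruned corpus is smaller, so each query needs fewer similarity computations, while both queries receive the same label as with the full corpus. Whether that holds on real data depends on the pruning rule, which is the question the paper's experiments address.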

Original language: English
Title of host publication: Proceedings - 8th International Conference on Database Systems for Advanced Applications, DASFAA 2003
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 127-136
Number of pages: 10
ISBN (Electronic): 0769518958, 9780769518954
State: Published - 2003
Externally published: Yes
Event: 8th International Conference on Database Systems for Advanced Applications, DASFAA 2003 - Kyoto, Japan
Duration: 26 Mar 2003 to 28 Mar 2003

Publication series

Name: Proceedings - 8th International Conference on Database Systems for Advanced Applications, DASFAA 2003

Conference

Conference: 8th International Conference on Database Systems for Advanced Applications, DASFAA 2003
Country/Territory: Japan
City: Kyoto
Period: 26/03/03 to 28/03/03

Keywords

  • Algorithm design and analysis
  • Computer science
  • Content based retrieval
  • Drives
  • High performance computing
  • Information retrieval
  • Machine learning
  • Supervised learning
  • Testing
  • Text categorization
