K-means clustering via principal component analysis

Chris Ding, Xiaofeng He

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

1037 Scopus citations

Abstract

Principal component analysis (PCA) is a widely used statistical technique for unsupervised dimension reduction. K-means clustering is a commonly used data clustering method for performing unsupervised learning tasks. Here we prove that principal components are the continuous solutions to the discrete cluster membership indicators for K-means clustering. New lower bounds for the K-means objective function are derived; each bound is the total variance minus the eigenvalues of the data covariance matrix. These results indicate that unsupervised dimension reduction is closely related to unsupervised learning. Several implications are discussed. On dimension reduction, the result provides new insight into the observed effectiveness of PCA-based data reductions, beyond the conventional noise-reduction explanation that PCA, via singular value decomposition, provides the best low-dimensional linear approximation of the data. On learning, the result suggests effective techniques for K-means data clustering. DNA gene expression and Internet newsgroups are analyzed to illustrate our results. Experiments indicate that the new bounds are within 0.5-1.5% of the optimal values.
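
The relationship described in the abstract can be illustrated with a small sketch. It is not the authors' code: it assumes two clusters (K = 2), two-dimensional synthetic data, and the unnormalized scatter matrix (sum of outer products of centered points) in place of the covariance matrix, so that the bound is on the same scale as the sum-of-squares K-means objective. Under those assumptions the bound subtracts only the largest eigenvalue, and the sign of each point's projection on the first principal direction serves as a relaxed cluster indicator. The function names `kmeans_objective` and `pca_bound_and_split` are hypothetical.

```python
import math
import random


def kmeans_objective(points, labels, k):
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if not members:
            continue  # an empty cluster contributes nothing
        cx = sum(x for x, _ in members) / len(members)
        cy = sum(y for _, y in members) / len(members)
        total += sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in members)
    return total


def pca_bound_and_split(points):
    """Return (lower bound for K=2, labels from the first principal direction)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    centered = [(x - mx, y - my) for x, y in points]

    # Entries of the 2x2 scatter matrix [[a, b], [b, c]] (assumed scaling).
    a = sum(x * x for x, _ in centered)
    b = sum(x * y for x, y in centered)
    c = sum(y * y for _, y in centered)
    total_variance = a + c  # trace of the scatter matrix

    # Largest eigenvalue of a symmetric 2x2 matrix, in closed form.
    top = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b * b)

    # A matching eigenvector; fall back to an axis when the matrix is diagonal.
    if abs(b) > 1e-12:
        v = (b, top - a)
    else:
        v = (1.0, 0.0) if a >= c else (0.0, 1.0)

    # Relaxed indicator: split points by the sign of their projection on v.
    labels = [1 if x * v[0] + y * v[1] > 0 else 0 for x, y in centered]
    return total_variance - top, labels


# Synthetic data: two Gaussian blobs (an assumption made for illustration).
random.seed(0)
points = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(50)]
points += [(random.gauss(5, 1), random.gauss(5, 1)) for _ in range(50)]

bound, labels = pca_bound_and_split(points)
j_split = kmeans_objective(points, labels, 2)
print(f"lower bound: {bound:.2f}, objective of PCA split: {j_split:.2f}")
```

The objective of any two-cluster partition, including the one produced by the sign split, should be at least as large as the bound. The split could also be used as a starting point for standard K-means iterations, which the sketch omits.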

Original language: English
Title of host publication: Proceedings, Twenty-First International Conference on Machine Learning, ICML 2004
Editors: R. Greiner, D. Schuurmans
Pages: 225-232
Number of pages: 8
State: Published - 2004
Externally published: Yes
Event: Proceedings, Twenty-First International Conference on Machine Learning, ICML 2004 - Banff, Alta, Canada
Duration: 4 Jul 2004 to 8 Jul 2004

Publication series

Name: Proceedings, Twenty-First International Conference on Machine Learning, ICML 2004

Conference

Conference: Proceedings, Twenty-First International Conference on Machine Learning, ICML 2004
Country/Territory: Canada
City: Banff, Alta
Period: 4/07/04 to 8/07/04
