Web document clustering using hyperlink structures

Xiaofeng He, Hongyuan Zha, Chris H.Q. Ding, Horst D. Simon

Research output: Contribution to journal › Article › peer-review

80 Scopus citations

Abstract

With the exponential growth of information on the World Wide Web, there is great demand for developing efficient methods for effectively organizing the large amount of retrieved information. Document clustering plays an important role in information retrieval and taxonomy management for the Web. In this paper we examine three clustering methods: K-means, multi-level METIS, and the recently developed normalized-cut method using a new approach of combining textual information, hyperlink structure and co-citation relations into a single similarity metric. We found the normalized-cut method with the new similarity metric is particularly effective, as demonstrated on three datasets of web query results. We also explore some theoretical connections between the normalized-cut method and the K-means method.
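The abstract does not state how the three information sources are weighted, so the following is only a minimal sketch of the general idea, not the authors' formulation. It assumes a hypothetical blending weight `alpha`, a link matrix where `links[i][j] = 1` means page i links to page j, and a two-way split obtained from the standard spectral relaxation of the normalized cut (the second eigenvector of the normalized graph Laplacian). The function names and the toy data are illustrative assumptions.

```python
import numpy as np

def combined_similarity(text_sim, links, alpha=0.5):
    """Blend text similarity with link evidence.

    The weighting by `alpha` is an illustrative assumption, not the
    formula used in the paper.
    """
    A = np.asarray(links, dtype=float)
    # Co-citation count: number of pages that link to both j and k.
    cocite = A.T @ A
    np.fill_diagonal(cocite, 0.0)
    # Treat a hyperlink in either direction as evidence of relatedness.
    link_strength = A + A.T + cocite
    if link_strength.max() > 0:
        link_strength = link_strength / link_strength.max()
    W = alpha * np.asarray(text_sim, dtype=float) + (1.0 - alpha) * link_strength
    np.fill_diagonal(W, 0.0)
    return W

def normalized_cut_bipartition(W):
    """Two-way split from the spectral relaxation of the normalized cut.

    Assumes every document has positive total similarity (degree > 0).
    """
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}.
    L_sym = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L_sym)  # eigenvalues in ascending order
    # Second-smallest eigenvector, mapped back to the generalized problem.
    fiedler = d_inv_sqrt * vecs[:, 1]
    return fiedler >= 0

# Toy data (made up): two topical groups of three pages each.
text_sim = np.full((6, 6), 0.1)
text_sim[:3, :3] = 0.8
text_sim[3:, 3:] = 0.8
links = np.zeros((6, 6))
for src, dst in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    links[src, dst] = 1.0

labels = normalized_cut_bipartition(combined_similarity(text_sim, links))
print(labels)  # pages 0-2 share one label, pages 3-5 the other
```

Which group receives `True` depends on the arbitrary sign of the eigenvector, so only the grouping is meaningful. For more than two clusters, a common approach is to apply the split recursively or to run K-means on several leading eigenvectors.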

Original language: English
Pages (from-to): 19-45
Number of pages: 27
Journal: Computational Statistics and Data Analysis
Volume: 41
Issue number: 1
State: Published - 28 Nov 2002
Externally published: Yes

Keywords

  • Cheeger constant
  • Clustering method
  • Eigenvalue decomposition
  • Graph partitioning
  • K-means method
  • Link structure
  • Normalized cut method
  • Similarity metric
  • World Wide Web
