Abstract
With the exponential growth of information on the World Wide Web, there is great demand for efficient methods that effectively organize the large amounts of retrieved information. Document clustering plays an important role in information retrieval and taxonomy management for the Web. In this paper we examine three clustering methods: K-means, multi-level METIS, and the recently developed normalized-cut method, together with a new approach that combines textual information, hyperlink structure, and co-citation relations into a single similarity metric. We find that the normalized-cut method with the new similarity metric is particularly effective, as demonstrated on three datasets of web query results. We also explore theoretical connections between the normalized-cut method and the K-means method.
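To make the idea of a combined similarity metric concrete, the sketch below blends a textual similarity matrix, a symmetrized hyperlink adjacency matrix, and a co-citation matrix with illustrative weights, then applies a Shi–Malik style normalized-cut bipartition via the normalized graph Laplacian. This is a minimal illustration under assumed choices (the weights `alpha`, `beta`, `gamma`, the component normalization, and the zero-threshold split are not taken from the paper), not the authors' implementation.

```python
import numpy as np
from scipy.linalg import eigh

def combined_similarity(S_text, A, alpha=0.5, beta=0.3, gamma=0.2):
    """Blend textual similarity, direct-link similarity, and co-citation
    similarity into one symmetric matrix (weights are illustrative)."""
    S_link = np.maximum(A, A.T)      # symmetrize the hyperlink graph
    S_cocite = A.T @ A               # pages cited together by the same pages
    np.fill_diagonal(S_cocite, 0)

    def scale(M):                    # put each component on a [0, 1] scale
        m = M.max()
        return M / m if m > 0 else M

    return alpha * scale(S_text) + beta * scale(S_link) + gamma * scale(S_cocite)

def normalized_cut_bipartition(S):
    """Spectral bipartition from the eigenvector of the second smallest
    eigenvalue of the symmetric normalized Laplacian."""
    d = S.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L_sym = np.eye(len(S)) - D_inv_sqrt @ S @ D_inv_sqrt
    _, vecs = eigh(L_sym)            # eigenvalues in ascending order
    fiedler = vecs[:, 1]
    return (fiedler > 0).astype(int) # threshold at zero -> two clusters

# Toy usage: 6 documents with random text similarities and hyperlinks.
rng = np.random.default_rng(0)
S_text = rng.random((6, 6))
S_text = (S_text + S_text.T) / 2
np.fill_diagonal(S_text, 1.0)
A = (rng.random((6, 6)) > 0.6).astype(float)
np.fill_diagonal(A, 0)

S = combined_similarity(S_text, A)
print(normalized_cut_bipartition(S))
```

For a k-way partition, a common variant (and one way to see the connection to K-means mentioned above) is to run K-means on the rows of the k leading Laplacian eigenvectors instead of thresholding a single eigenvector.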
| Original language | English |
|---|---|
| Pages (from-to) | 19-45 |
| Number of pages | 27 |
| Journal | Computational Statistics and Data Analysis |
| Volume | 41 |
| Issue number | 1 |
| DOIs | |
| State | Published - 28 Nov 2002 |
| Externally published | Yes |
Keywords
- Cheeger constant
- Clustering method
- Eigenvalue decomposition
- Graph partitioning
- K-means method
- Link structure
- Normalized cut method
- Similarity metric
- World Wide Web