Efficient computation of k-medians over data streams under memory constraints

  • Zhi Hong Chong*
  • , Jeffrey Xu Yu
  • , Zhen Jie Zhang
  • , Xue Min Lin
  • , Wei Wang
  • , Ao Ying Zhou
  • *Corresponding author for this work

Research output: Contribution to journalArticlepeer-review

3 Scopus citations

Abstract

In this paper, we study the problem of efficiently computing k-medians over high-dimensional and high speed data streams. The focus of this paper is on the issue of minimizing CPU time to handle high speed data streams on top of the requirements of high accuracy and small memory. Our work is motivated by the following observation: the existing algorithms have similar approximation behaviors in practice, even though they make noticeably different worst case theoretical guarantees. The underlying reason is that in order to achieve high approximation level with the smallest possible memory, they need rather complex techniques to maintain a sketch, along time dimension, by using some existing off-line clustering algorithms. Those clustering algorithms cannot guarantee the optimal clustering result over data segments in a data stream but accumulate errors over segments, which makes most algorithms behave the same in terms of approximation level, in practice. We propose a new grid-based approach which divides the entire data set into cells (not along time dimension). We can achieve high approximation level based on a novel concept called (1 - ∈)-dominant. We further extend the method to the data stream context, by leveraging a density-based heuristic and frequent item mining techniques over data streams. We only need to apply an existing clustering once to computing k-medians, on demand, which reduces CPU time significantly. We conducted extensive experimental studies, and show that our approaches outperform other well-known approaches.

Original languageEnglish
Pages (from-to)284-296
Number of pages13
JournalJournal of Computer Science and Technology
Volume21
Issue number2
DOIs
StatePublished - Mar 2006
Externally publishedYes

Keywords

  • Cluster
  • Data mining
  • Data streams
  • k-medians

Fingerprint

Dive into the research topics of 'Efficient computation of k-medians over data streams under memory constraints'. Together they form a unique fingerprint.

Cite this