TY - GEN
T1 - Minimum coresets for maxima representation of multidimensional data
AU - Wang, Yanhao
AU - Mathioudakis, Michael
AU - Li, Yuchen
AU - Tan, Kian Lee
N1 - Publisher Copyright:
© 2021 ACM.
PY - 2021/6/20
Y1 - 2021/6/20
N2 - Coresets are succinct summaries of large datasets such that, for a given problem, the solution obtained from a coreset is provably competitive with the solution obtained from the full dataset. As such, coreset-based data summarization techniques have been successfully applied to various problems, e.g., geometric optimization, clustering, and approximate query processing, for scaling them up to massive data. In this paper, we study coresets for the maxima representation of multidimensional data: Given a set P of points in mathbbRd , where d is a small constant, and an error parameter (0,1) , a subset Q ⊆ P is an -coreset for the maxima representation of P iff the maximum of Q is an -approximation of the maximum of P for any vector u in Rd , where the maximum is taken over the inner products between the set of points (P or Q) and u. We define a novel minimum -coreset problem that asks for an -coreset of the smallest size for the maxima representation of a point set. For the two-dimensional case, we develop an optimal polynomial-time algorithm for the minimum -coreset problem by transforming it into the shortest-cycle problem in a directed graph. Then, we prove that this problem is NP-hard in three or higher dimensions and present polynomial-time approximation algorithms in an arbitrary fixed dimension. Finally, we provide extensive experimental results on both real and synthetic datasets to demonstrate the superior performance of our proposed algorithms.
AB - Coresets are succinct summaries of large datasets such that, for a given problem, the solution obtained from a coreset is provably competitive with the solution obtained from the full dataset. As such, coreset-based data summarization techniques have been successfully applied to various problems, e.g., geometric optimization, clustering, and approximate query processing, for scaling them up to massive data. In this paper, we study coresets for the maxima representation of multidimensional data: Given a set P of points in mathbbRd , where d is a small constant, and an error parameter (0,1) , a subset Q ⊆ P is an -coreset for the maxima representation of P iff the maximum of Q is an -approximation of the maximum of P for any vector u in Rd , where the maximum is taken over the inner products between the set of points (P or Q) and u. We define a novel minimum -coreset problem that asks for an -coreset of the smallest size for the maxima representation of a point set. For the two-dimensional case, we develop an optimal polynomial-time algorithm for the minimum -coreset problem by transforming it into the shortest-cycle problem in a directed graph. Then, we prove that this problem is NP-hard in three or higher dimensions and present polynomial-time approximation algorithms in an arbitrary fixed dimension. Finally, we provide extensive experimental results on both real and synthetic datasets to demonstrate the superior performance of our proposed algorithms.
KW - Convex hull
KW - Coreset
KW - Epsilon-kernel
KW - Maxima representation
KW - Regret minimizing set
UR - https://www.scopus.com/pages/publications/85109214380
U2 - 10.1145/3452021.3458322
DO - 10.1145/3452021.3458322
M3 - 会议稿件
AN - SCOPUS:85109214380
T3 - Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems
SP - 138
EP - 152
BT - PODS 2021 - Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems
PB - Association for Computing Machinery
T2 - 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2021
Y2 - 20 June 2021 through 25 June 2021
ER -