TY - JOUR
T1 - Distributed error estimation of functional dependency
AU - Jin, Cheqing
AU - Lall, Ashwin
AU - Xu, Jun
AU - Zhang, Zhigang
AU - Zhou, Aoying
N1 - Publisher Copyright:
© 2016 Elsevier Inc. All rights reserved.
PY - 2016/6/1
Y1 - 2016/6/1
N2 - Measuring or estimating the number of errors in (i.e., violations to) a functional dependency (FD) offers valuable information about data semantics and quality. Most existing work focuses on FD error estimation in a centralized environment, where data are stored only in one site and the goal is to optimize the time and space complexities of the estimation algorithms. The distributed FD error estimation problem, in which the data can reside in multiple physically distributed sites, has never been studied in depth and is the subject of this work. In this work, we study a version of the distributed FD error estimation problem where a coordinator site communicates with multiple remote sites for arriving at such estimations, and the goal is to minimize this communication cost. We study two types of queries - that are dual to each other in semantics - for such estimations: one tries to maximize the accuracies of FD error estimations under fixed communication costs, and the other to minimize the communication costs needed to meet certain accuracy requirements. In our framework, each remote site maintains a concise synopsis data structure obtained by scanning its local data once, and the coordinator site receives and processes all such data structures to arrive at an estimate of the FD error. Our solution extends from the case of two remote sites to that of multiple remote sites. We demonstrate the efficacy of our proposed techniques via rigorous analysis and extensive experiments.
AB - Measuring or estimating the number of errors in (i.e., violations to) a functional dependency (FD) offers valuable information about data semantics and quality. Most existing work focuses on FD error estimation in a centralized environment, where data are stored only in one site and the goal is to optimize the time and space complexities of the estimation algorithms. The distributed FD error estimation problem, in which the data can reside in multiple physically distributed sites, has never been studied in depth and is the subject of this work. In this work, we study a version of the distributed FD error estimation problem where a coordinator site communicates with multiple remote sites for arriving at such estimations, and the goal is to minimize this communication cost. We study two types of queries - that are dual to each other in semantics - for such estimations: one tries to maximize the accuracies of FD error estimations under fixed communication costs, and the other to minimize the communication costs needed to meet certain accuracy requirements. In our framework, each remote site maintains a concise synopsis data structure obtained by scanning its local data once, and the coordinator site receives and processes all such data structures to arrive at an estimate of the FD error. Our solution extends from the case of two remote sites to that of multiple remote sites. We demonstrate the efficacy of our proposed techniques via rigorous analysis and extensive experiments.
KW - Distributed processing
KW - Error estimation
KW - Functional dependency
UR - https://www.scopus.com/pages/publications/84959386270
U2 - 10.1016/j.ins.2016.01.051
DO - 10.1016/j.ins.2016.01.051
M3 - 文章
AN - SCOPUS:84959386270
SN - 0020-0255
VL - 345
SP - 156
EP - 176
JO - Information Sciences
JF - Information Sciences
ER -