
Distributed error estimation of functional dependency

Research output: Contribution to journal, Article, peer-review

Abstract

Measuring or estimating the number of errors in (i.e., violations of) a functional dependency (FD) offers valuable information about data semantics and quality. Most existing work focuses on FD error estimation in a centralized environment, where the data are stored at a single site and the goal is to optimize the time and space complexity of the estimation algorithms. The distributed FD error estimation problem, in which the data can reside at multiple physically distributed sites, has never been studied in depth and is the subject of this work. We study a version of the problem in which a coordinator site communicates with multiple remote sites to arrive at such estimates, and the goal is to minimize the communication cost. We consider two types of queries that are dual to each other in semantics: one maximizes the accuracy of the FD error estimate under a fixed communication cost, and the other minimizes the communication cost needed to meet a given accuracy requirement. In our framework, each remote site maintains a concise synopsis data structure obtained by scanning its local data once, and the coordinator site receives and processes all of these synopses to arrive at an estimate of the FD error. Our solution extends from the case of two remote sites to that of multiple remote sites. We demonstrate the efficacy of the proposed techniques through rigorous analysis and extensive experiments.
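
To make the coordinator-and-sites setup concrete, the following is a minimal Python sketch and not the method of the article. It assumes that the FD error is measured as the minimum number of tuples that must be deleted for the dependency to hold, a common measure that the abstract does not confirm, and it uses exact per-site counts as a stand-in for the concise synopsis. The names `local_summary` and `fd_error` and the sample data are hypothetical.

```python
from collections import Counter, defaultdict

def local_summary(rows, lhs, rhs):
    """Remote site: one pass over local rows, counting each (lhs, rhs) value pair.
    Assumption: exact counts stand in for the article's concise synopsis."""
    counts = Counter()
    for row in rows:
        counts[(row[lhs], row[rhs])] += 1
    return counts

def fd_error(summaries):
    """Coordinator: merge the site summaries, then count the rows that would
    have to be deleted so that lhs -> rhs holds (assumed error measure)."""
    merged = Counter()
    for summary in summaries:
        merged.update(summary)  # adds counts for matching (lhs, rhs) pairs
    total = defaultdict(int)    # rows per lhs value
    best = defaultdict(int)     # rows of the most frequent rhs value per lhs value
    for (x, _y), c in merged.items():
        total[x] += c
        best[x] = max(best[x], c)
    return sum(total[x] - best[x] for x in total)

# Hypothetical data: zip -> city holds at each site but not globally.
site_a = [
    {"zip": "10001", "city": "NYC"},
    {"zip": "10001", "city": "NYC"},
    {"zip": "94105", "city": "SF"},
]
site_b = [
    {"zip": "10001", "city": "Newark"},
    {"zip": "94105", "city": "SF"},
]

error = fd_error([
    local_summary(site_a, "zip", "city"),
    local_summary(site_b, "zip", "city"),
])
print(error)
```

In this example neither site sees a violation locally, yet the merged data contain one, which is why the sites must send some information to the coordinator. Shipping exact counts costs communication proportional to the number of distinct value pairs; the article's concern is reducing that cost by sending concise synopses and accepting an estimate instead.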

Original language: English
Pages (from-to): 156-176
Number of pages: 21
Journal: Information Sciences
Volume: 345
State: Published - 1 Jun 2016

Keywords

  • Distributed processing
  • Error estimation
  • Functional dependency
