Integrating workload balancing and fault tolerance in distributed stream processing system

  • Junhua Fang*
  • , Pingfu Chao
  • , Rong Zhang
  • , Xiaofang Zhou
  • *Corresponding author for this work

Research output: Contribution to journalArticlepeer-review

13 Scopus citations

Abstract

Distributed Stream Processing Engine (DSPE) is designed for processing continuous streams so as to achieve the real-time performance with low latency guaranteed. To satisfy such requirement, the availability and efficiency are the main concern of the DSPE system, which can be achieved by a proper design of the fault tolerance module and the workload balancing module, respectively. However, the inherent characteristics of data streams, including persistence, dynamic and unpredictability, pose great challenges in satisfying both properties. As far as we know, most of the state-of-the-art DSPE systems take either fault tolerance or workload balancing as its single optimization goal, which in turn receives a higher resource overhead or longer recovery time. In this paper, we combine the fault tolerance and workload balancing mechanisms in the DSPE to reduce the overall resource consumption while keeping the system interactive, high-throughput, scalable and highly available. Based on our data-level replication strategy, our method can handle the dynamic data skewness and node failure scenario: during the distribution fluctuation of the incoming stream, we rebalance the workload by selectively inactivate the data in high-load nodes and activate their replicas on low-load nodes to minimize the migration overhead within the stateful operator; when a fault occurs in the process, the system activates the replicas of the data affected to ensure the correctness while keeping the workload balanced. Extensive experiments on various join workloads on both benchmark data and real data show our superior performance compared with baseline systems.

Original languageEnglish
Pages (from-to)2471-2496
Number of pages26
JournalWorld Wide Web
Volume22
Issue number6
DOIs
StatePublished - 1 Nov 2019

Keywords

  • Distributed systems
  • High availability computing
  • Load Balancing
  • Real-time data processing

Fingerprint

Dive into the research topics of 'Integrating workload balancing and fault tolerance in distributed stream processing system'. Together they form a unique fingerprint.

Cite this