Efficient fault-tolerance for iterative graph processing on distributed dataflow systems

Chen Xu, Markus Holzemer, Manohar Kaul, Volker Markl

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

23 Scopus citations

Abstract

Real-world graph processing applications often require combining the graph data with tabular data. Moreover, graph processing usually is part of a larger analytics workflow consiting of data preparation, analysis and model building, and model application. General-purpose distributed dataflow frameworks execute all steps of such workflows holistically. This holistic view enables these systems to reason about and automatically optimize the processing. Most big graph processing algorithms are iterative and incur a long runtime, as they require multiple passes over the data until convergence. Thus, fault tolerance and quick recovery from any intermittent failure at any step of the workflow are crucial for effective and efficient analysis. In this work, we propose a novel fault-tolerance mechanism for iterative graph processing on distributed data-flow systems with the objective to reduce the checkpointing cost and failure recovery time. Rather than writing checkpoints that block downstream operators, our mechanism writes checkpoints in an unblocking manner, without breaking pipelined tasks. In contrast to the typical unblocking checkpointing approaches (i.e., managing checkpoints independently for immutable datasets), we inject the checkpoints of mutable datasets into the iterative dataflow itself. Hence, our mechanism is iteration-aware by design. This simplifies the system architecture and facilitates coordinating the checkpoint creation during iterative graph processing. We achieve speedier recovery, i.e., confined recovery, by using the local log files on each node to avoid a complete re-computation from scratch. Our theoretical studies as well as our experimental analysis on Flink give further insight into our fault-tolerance strategies and show that they are more efficient than blocking checkpointing and complete recovery for iterative graph processing on dataflow systems.

Original languageEnglish
Title of host publication2016 IEEE 32nd International Conference on Data Engineering, ICDE 2016
PublisherInstitute of Electrical and Electronics Engineers Inc.
Pages613-624
Number of pages12
ISBN (Electronic)9781509020195
DOIs
StatePublished - 22 Jun 2016
Externally publishedYes
Event32nd IEEE International Conference on Data Engineering, ICDE 2016 - Helsinki, Finland
Duration: 16 May 201620 May 2016

Publication series

Name2016 IEEE 32nd International Conference on Data Engineering, ICDE 2016

Conference

Conference32nd IEEE International Conference on Data Engineering, ICDE 2016
Country/TerritoryFinland
CityHelsinki
Period16/05/1620/05/16

Fingerprint

Dive into the research topics of 'Efficient fault-tolerance for iterative graph processing on distributed dataflow systems'. Together they form a unique fingerprint.

Cite this