TY - GEN
T1 - Efficient fault-tolerance for iterative graph processing on distributed dataflow systems
AU - Xu, Chen
AU - Holzemer, Markus
AU - Kaul, Manohar
AU - Markl, Volker
N1 - Publisher Copyright:
© 2016 IEEE.
PY - 2016/6/22
Y1 - 2016/6/22
N2 - Real-world graph processing applications often require combining the graph data with tabular data. Moreover, graph processing usually is part of a larger analytics workflow consiting of data preparation, analysis and model building, and model application. General-purpose distributed dataflow frameworks execute all steps of such workflows holistically. This holistic view enables these systems to reason about and automatically optimize the processing. Most big graph processing algorithms are iterative and incur a long runtime, as they require multiple passes over the data until convergence. Thus, fault tolerance and quick recovery from any intermittent failure at any step of the workflow are crucial for effective and efficient analysis. In this work, we propose a novel fault-tolerance mechanism for iterative graph processing on distributed data-flow systems with the objective to reduce the checkpointing cost and failure recovery time. Rather than writing checkpoints that block downstream operators, our mechanism writes checkpoints in an unblocking manner, without breaking pipelined tasks. In contrast to the typical unblocking checkpointing approaches (i.e., managing checkpoints independently for immutable datasets), we inject the checkpoints of mutable datasets into the iterative dataflow itself. Hence, our mechanism is iteration-aware by design. This simplifies the system architecture and facilitates coordinating the checkpoint creation during iterative graph processing. We achieve speedier recovery, i.e., confined recovery, by using the local log files on each node to avoid a complete re-computation from scratch. Our theoretical studies as well as our experimental analysis on Flink give further insight into our fault-tolerance strategies and show that they are more efficient than blocking checkpointing and complete recovery for iterative graph processing on dataflow systems.
AB - Real-world graph processing applications often require combining the graph data with tabular data. Moreover, graph processing usually is part of a larger analytics workflow consiting of data preparation, analysis and model building, and model application. General-purpose distributed dataflow frameworks execute all steps of such workflows holistically. This holistic view enables these systems to reason about and automatically optimize the processing. Most big graph processing algorithms are iterative and incur a long runtime, as they require multiple passes over the data until convergence. Thus, fault tolerance and quick recovery from any intermittent failure at any step of the workflow are crucial for effective and efficient analysis. In this work, we propose a novel fault-tolerance mechanism for iterative graph processing on distributed data-flow systems with the objective to reduce the checkpointing cost and failure recovery time. Rather than writing checkpoints that block downstream operators, our mechanism writes checkpoints in an unblocking manner, without breaking pipelined tasks. In contrast to the typical unblocking checkpointing approaches (i.e., managing checkpoints independently for immutable datasets), we inject the checkpoints of mutable datasets into the iterative dataflow itself. Hence, our mechanism is iteration-aware by design. This simplifies the system architecture and facilitates coordinating the checkpoint creation during iterative graph processing. We achieve speedier recovery, i.e., confined recovery, by using the local log files on each node to avoid a complete re-computation from scratch. Our theoretical studies as well as our experimental analysis on Flink give further insight into our fault-tolerance strategies and show that they are more efficient than blocking checkpointing and complete recovery for iterative graph processing on dataflow systems.
UR - https://www.scopus.com/pages/publications/84980383629
U2 - 10.1109/ICDE.2016.7498275
DO - 10.1109/ICDE.2016.7498275
M3 - 会议稿件
AN - SCOPUS:84980383629
T3 - 2016 IEEE 32nd International Conference on Data Engineering, ICDE 2016
SP - 613
EP - 624
BT - 2016 IEEE 32nd International Conference on Data Engineering, ICDE 2016
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 32nd IEEE International Conference on Data Engineering, ICDE 2016
Y2 - 16 May 2016 through 20 May 2016
ER -