ACF2: Accelerating Checkpoint-Free Failure Recovery for Distributed Graph Processing

  • Chen Xu*
  • Yi Yang
  • Qingfeng Pan
  • Hongfu Zhou

* Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Scopus citations

Abstract

Iterative computation in distributed graph processing systems typically incurs a long runtime. Hence, it is crucial for graph processing to tolerate and quickly recover from intermittent failures. Existing solutions can be categorized into checkpoint-based and checkpoint-free solutions. The former write checkpoints periodically during execution, which leads to significant overhead. In contrast, the latter require no checkpoints: once a failure happens, they reload the input data and reset the values of lost vertices directly. However, reloading input data involves repartitioning, which incurs additional overhead. Moreover, we observe that checkpoint-free solutions cannot effectively handle failures for graph algorithms with topological mutations. To address these issues, we propose ACF2 with a partition-aware backup strategy and an incremental protocol. In particular, the partition-aware backup strategy backs up the sub-graphs of all nodes after initial partitioning. Once a failure happens, it recovers the lost sub-graphs from the backups and then resumes computation like a checkpoint-free solution. To effectively handle failures involving topological mutations, the incremental protocol logs topological mutations during normal execution, and the log is then exploited for recovery. We implement ACF2 based on Apache Giraph, and our experiments show that ACF2 significantly outperforms existing solutions.
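The two mechanisms named in the abstract can be illustrated with a small in-memory sketch. This is not the authors' implementation and does not use the Apache Giraph API: the class names `Worker` and `ToyRecovery`, the method names, and the idea of keeping the backup and the log in a local dictionary are all assumptions made for illustration. A real system would presumably keep backups and logs somewhere that survives the failed node.

```python
from copy import deepcopy


class Worker:
    """Hypothetical worker holding one partition of the graph."""

    def __init__(self, edges):
        # vertex id -> set of out-neighbours
        self.edges = {v: set(ns) for v, ns in edges.items()}
        # vertex id -> current vertex value
        self.values = {v: 0.0 for v in self.edges}


class ToyRecovery:
    """Illustrative only: backup after partitioning plus a mutation log."""

    def __init__(self, partitions):
        # partitions: worker id -> adjacency dict from the initial partitioning
        self.workers = {w: Worker(e) for w, e in partitions.items()}
        # Partition-aware backup: copy every sub-graph once, after partitioning.
        self.backup = deepcopy(partitions)
        # Incremental protocol: record topological mutations as they happen.
        self.mutation_log = {w: [] for w in partitions}

    def add_edge(self, w, src, dst):
        worker = self.workers[w]
        worker.edges.setdefault(src, set()).add(dst)
        worker.values.setdefault(src, 0.0)
        self.mutation_log[w].append(("add", src, dst))

    def remove_edge(self, w, src, dst):
        self.workers[w].edges[src].discard(dst)
        self.mutation_log[w].append(("remove", src, dst))

    def recover(self, w, initial_value=0.0):
        # Restore the lost sub-graph from the backup, avoiding a reload
        # and repartitioning of the input data.
        worker = Worker(self.backup[w])
        # Replay logged mutations so the topology matches the pre-failure state.
        for op, src, dst in self.mutation_log[w]:
            if op == "add":
                worker.edges.setdefault(src, set()).add(dst)
            else:
                worker.edges.get(src, set()).discard(dst)
        # As in checkpoint-free recovery, reset the values of lost vertices.
        worker.values = {v: initial_value for v in worker.edges}
        self.workers[w] = worker
        return worker
```

In this sketch, recovery restores the topology but not the vertex values, which matches the abstract's description of resuming computation like a checkpoint-free solution after the lost sub-graphs are recovered.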

Original language: English
Title of host publication: Web and Big Data - 6th International Joint Conference, APWeb-WAIM 2022, Proceedings
Editors: Bohan Li, Chuanqi Tao, Lin Yue, Xuming Han, Diego Calvanese, Toshiyuki Amagasa
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 45-59
Number of pages: 15
ISBN (Print): 9783031251573
State: Published - 2023
Event: 6th International Joint Conference on Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM), APWeb-WAIM 2022 - Nanjing, China
Duration: 25 Nov 2022 to 27 Nov 2022

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 13421 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 6th International Joint Conference on Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM), APWeb-WAIM 2022
Country/Territory: China
City: Nanjing
Period: 25/11/22 to 27/11/22

Keywords

  • Checkpoint-free
  • Failure recovery
  • Graph processing
