A generic approach to scheduling and checkpointing workflows

  • Li Han
  • Valentin Le Fèvre
  • Louis Claude Canon
  • Yves Robert
  • Frédéric Vivien

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

4 Scopus citations

Abstract

This work deals with scheduling and checkpointing strategies to execute scientific workflows on failure-prone large-scale platforms. To the best of our knowledge, this work is the first to target fail-stop errors for arbitrary workflows. Most previous work addresses soft errors, which corrupt the task being executed by a processor but, unlike fail-stop errors, do not cause the entire memory of that processor to be lost. We revisit classical mapping heuristics such as HEFT and MinMin and complement them with several checkpointing strategies. The objective is to derive an efficient trade-off between checkpointing every task (CkptAll), which is overkill when failures are rare events, and checkpointing no task (CkptNone), which induces dramatic re-execution overhead even when only a few failures strike during execution. Unlike previous work, our approach applies to arbitrary workflows, not just special classes of dependence graphs such as M-SPGs (Minimal Series-Parallel Graphs). Extensive experiments report significant gains over both CkptAll and CkptNone, for a wide variety of workflows.
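
The trade-off between CkptAll and CkptNone can be illustrated with a deliberately small sketch. It is not the authors' method: it assumes a linear chain of tasks on a single processor, exponentially distributed fail-stop failures, and the classical expected-time expression for that model, e^(λR) (1/λ + D) (e^(λ(W+C)) − 1). None of these assumptions come from the abstract, and the names `expected_time`, `ckpt_all`, and `ckpt_none`, as well as all numeric values, are hypothetical.

```python
import math


def expected_time(work, ckpt, recovery, rate, downtime=0.0):
    """Expected time to run `work` followed by a checkpoint of cost `ckpt`.

    Assumed model (not taken from the abstract): failures are exponential
    with the given `rate`; each failure costs `downtime` plus a `recovery`
    that can itself fail, after which the segment restarts from its start.
    """
    return (
        math.exp(rate * recovery)
        * (1.0 / rate + downtime)
        * math.expm1(rate * (work + ckpt))
    )


def ckpt_all(weights, ckpt, recovery, rate):
    """Checkpoint after every task: each task is its own protected segment."""
    return sum(expected_time(w, ckpt, recovery, rate) for w in weights)


def ckpt_none(weights, rate):
    """No checkpoint: any failure restarts the whole chain from scratch."""
    return expected_time(sum(weights), 0.0, 0.0, rate)


if __name__ == "__main__":
    # Hypothetical chain: 20 tasks of 100 s, 10 s checkpoint and recovery.
    weights = [100.0] * 20
    for mtbf in (1e6, 1e3):
        rate = 1.0 / mtbf
        print(
            f"MTBF={mtbf:.0e}s  "
            f"CkptAll={ckpt_all(weights, 10.0, 10.0, rate):.1f}s  "
            f"CkptNone={ckpt_none(weights, rate):.1f}s"
        )
```

Under these assumptions, CkptNone comes out ahead when failures are rare and CkptAll when they are frequent, which is the gap that intermediate checkpointing strategies aim to close. General workflows on several processors add dependences that this chain model ignores.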

Original language: English
Title of host publication: Proceedings of the 47th International Conference on Parallel Processing, ICPP 2018
Publisher: Association for Computing Machinery
ISBN (Print): 9781450365109
State: Published - 13 Aug 2018
Event: 47th International Conference on Parallel Processing, ICPP 2018 - Eugene, United States
Duration: 14 Aug 2018 to 16 Aug 2018

Publication series

Name: ACM International Conference Proceeding Series

Conference

Conference: 47th International Conference on Parallel Processing, ICPP 2018
Country/Territory: United States
City: Eugene
Period: 14/08/18 to 16/08/18
