TY - GEN
T1 - A generic approach to scheduling and checkpointing workflows
AU - Han, Li
AU - Le Fèvre, Valentin
AU - Canon, Louis Claude
AU - Robert, Yves
AU - Vivien, Frédéric
N1 - Publisher Copyright: © 2018 Copyright held by the owner/author(s). Publication rights licensed to ACM.
PY - 2018/8/13
Y1 - 2018/8/13
AB - This work deals with scheduling and checkpointing strategies for executing scientific workflows on failure-prone large-scale platforms. To the best of our knowledge, this work is the first to target fail-stop errors for arbitrary workflows. Most previous work addresses soft errors, which corrupt the task being executed by a processor but, unlike fail-stop errors, do not cause the entire memory of that processor to be lost. We revisit classical mapping heuristics such as HEFT and MinMin and complement them with several checkpointing strategies. The objective is to derive an efficient trade-off between checkpointing every task (CkptAll), which is overkill when failures are rare, and checkpointing no task (CkptNone), which induces dramatic re-execution overhead even when only a few failures strike during execution. Unlike previous work, our approach applies to arbitrary workflows, not just special classes of dependence graphs such as M-SPGs (Minimal Series-Parallel Graphs). Extensive experiments report significant gains over both CkptAll and CkptNone for a wide variety of workflows.
UR - https://www.scopus.com/pages/publications/85054875944
U2 - 10.1145/3225058.3225145
DO - 10.1145/3225058.3225145
M3 - Conference contribution
AN - SCOPUS:85054875944
SN - 9781450365109
T3 - ACM International Conference Proceeding Series
BT - Proceedings of the 47th International Conference on Parallel Processing, ICPP 2018
PB - Association for Computing Machinery
T2 - 47th International Conference on Parallel Processing, ICPP 2018
Y2 - 14 August 2018 through 16 August 2018
ER -