A generic approach to scheduling and checkpointing workflows

  • Li Han
  • Valentin Le Fèvre
  • Louis Claude Canon
  • Yves Robert
  • Frédéric Vivien
  • École normale supérieure de Lyon
  • Université de Bourgogne
  • University of Tennessee

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-reviewed

Abstract

This work deals with scheduling and checkpointing strategies to execute scientific workflows on failure-prone large-scale platforms. To the best of our knowledge, this work is the first to target fail-stop errors for arbitrary workflows. Most previous work addresses soft errors, which corrupt the task being executed by a processor but, unlike fail-stop errors, do not cause the entire memory of that processor to be lost. We revisit classical mapping heuristics such as HEFT and MinMin and complement them with several checkpointing strategies. The objective is to derive an efficient trade-off between checkpointing every task (CkptAll), which is overkill when failures are rare events, and checkpointing no task (CkptNone), which induces dramatic re-execution overhead even when only a few failures strike during execution. Unlike previous work, our approach applies to arbitrary workflows, not just special classes of dependence graphs such as M-SPGs (Minimal Series-Parallel Graphs). Extensive experiments report significant gains over both CkptAll and CkptNone for a wide variety of workflows.
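
The CkptAll versus CkptNone trade-off described in the abstract can be illustrated with a small sketch. It is not the paper's method: it assumes a linear chain of tasks on a single processor, exponentially distributed fail-stop failures, and the classical first-order model in which the expected time to complete W seconds of work followed by a checkpoint of C seconds is e^(λR) (1/λ + D) (e^(λ(W+C)) − 1), where R is the recovery time, D the downtime and λ the failure rate. The task weights, checkpoint cost and failure rates below are hypothetical values chosen only for illustration.

```python
import math


def expected_time(work, ckpt, recovery, downtime, rate):
    """Expected time to finish `work` plus a checkpoint of length `ckpt`.

    Assumes exponential fail-stop failures with the given `rate`; after each
    failure the processor is down for `downtime`, then spends `recovery`
    reloading the last checkpoint before re-executing the segment.
    """
    return (
        math.exp(rate * recovery)
        * (1.0 / rate + downtime)
        * math.expm1(rate * (work + ckpt))
    )


def ckpt_all(weights, ckpt, recovery, downtime, rate):
    """Checkpoint after every task: each task is its own protected segment.

    Simplification: the first task is also charged a recovery cost.
    """
    return sum(expected_time(w, ckpt, recovery, downtime, rate) for w in weights)


def ckpt_none(weights, downtime, rate):
    """No checkpoint: any failure restarts the whole chain from scratch."""
    return expected_time(sum(weights), 0.0, 0.0, downtime, rate)


# Hypothetical chain: 50 tasks of 100 s each, 10 s checkpoint and recovery.
weights = [100.0] * 50
for mtbf in (1e6, 1e3):
    rate = 1.0 / mtbf
    all_t = ckpt_all(weights, ckpt=10.0, recovery=10.0, downtime=0.0, rate=rate)
    none_t = ckpt_none(weights, downtime=0.0, rate=rate)
    print(f"MTBF={mtbf:g}s  CkptAll={all_t:.1f}s  CkptNone={none_t:.1f}s")
```

Under these assumptions, CkptNone comes out ahead when failures are rare and CkptAll comes out far ahead when they are frequent, which is the gap that intermediate checkpointing strategies aim to close.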

Original language: English
Title of host publication: Proceedings of the 47th International Conference on Parallel Processing, ICPP 2018
Publisher: Association for Computing Machinery
ISBN (Print): 9781450365109
DOI
Publication status: Published - 13 Aug 2018
Event: 47th International Conference on Parallel Processing, ICPP 2018 - Eugene, United States
Duration: 14 Aug 2018 to 16 Aug 2018

Publication series

Name: ACM International Conference Proceeding Series

Conference

Conference: 47th International Conference on Parallel Processing, ICPP 2018
Country/Territory: United States
City: Eugene
Period: 14/08/18 to 16/08/18
