A reliability analysis for successful execution of parallel DAG tasks

  • Ke Kun Hu
  • Guo Sun Zeng*
  • Wen Juan Liu
  • Wei Wang
  • *Corresponding author for this work
  • Tongji University

Research output: Contribution to journal › Article › peer-review

Abstract

Large-scale parallel computing systems are becoming increasingly failure-prone as the number of computational nodes grows, which causes serious reliability problems in parallel computing. To ensure that parallel tasks such as Meta tasks and DAG tasks run successfully, it is necessary to perform reliability analysis before scheduling them. For Meta tasks, the key factors that affect and impede the successful execution of a single task are discussed, and a reliability formula for Meta tasks is then presented. For DAG tasks, hardware failures, software failures, network link failures, and subtask execution order are all taken into account, so that the reliability of both the subtasks and the network communication is calculated. Two reliability algorithms for DAG tasks are then designed. Finally, experiments are conducted; the results show that the proposed reliability analysis methods are more effective and comprehensive.

Original language: English
Pages (from-to): 81-99
Number of pages: 19
Journal: Journal of Information Science and Engineering
Volume: 33
Issue number: 1
State: Published - Jan 2017
Externally published: Yes

Keywords

  • DAG tasks
  • Meta tasks
  • Parallel computing
  • Reliability
  • Successful execution
