Abstract
Large-scale parallel computing systems are becoming increasingly failure-prone as the number of computational nodes grows, which causes serious reliability problems in parallel computing. To ensure the successful execution of parallel tasks such as Meta tasks and DAG tasks, reliability analysis is needed before these tasks are scheduled. For Meta tasks, we discuss the key factors that impede the successful execution of a single task and then present a reliability formula for Meta tasks. For DAG tasks, hardware failures, software failures, network link failures, and subtask execution order are all taken into account: we calculate not only the reliability of the subtasks but also the reliability of network communication. Two reliability algorithms for DAG tasks are then designed. Finally, experiments are conducted, and the results show that our reliability analysis methods are more effective and comprehensive.
| Original language | English |
|---|---|
| Pages (from-to) | 81-99 |
| Number of pages | 19 |
| Journal | Journal of Information Science and Engineering |
| Volume | 33 |
| Issue number | 1 |
| State | Published - Jan 2017 |
| Externally published | Yes |
Keywords
- DAG tasks
- Meta tasks
- Parallel computing
- Reliability
- Successful execution
Fingerprint
Dive into the research topics of 'A reliability analysis for successful execution of parallel DAG tasks'. Together they form a unique fingerprint.