Abstract
Resource management of deep learning training (DLT) jobs is critical for cluster resource efficiency and client QoS assurance. Most existing scheduling frameworks require clients to specify job resource configurations, which can lead to over-provisioning or under-provisioning. Additionally, the performance of some static scheduling frameworks degrades in highly dynamic clusters. In this paper, we propose a QoS-aware joint resource optimization framework called Qore-DL for distributed DLT jobs. We divide the lifecycle of a DLT job into submission, queuing and running stages. Qore-DL automatically configures reasonable resources for submitted jobs and greedily assigns scheduled jobs to hosts. For running jobs, Qore-DL employs a heuristic scheme to adjust their resources. Qore-DL jointly considers the optimization of QoS satisfaction and resource efficiency across the three stages of DLT jobs. We implemented a prototype of Qore-DL in TensorFlow on top of Kubernetes and conducted extensive experiments in CPU and GPU clusters to evaluate its performance. The experimental results show that, compared with its counterparts, Qore-DL can improve the job completion rate by up to 42.4% and the cluster resource efficiency by up to 21.8%.
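To make the queuing-stage idea concrete, the sketch below illustrates one simple form of greedy job-to-host assignment. This is not the paper's actual algorithm; all names (`greedy_assign`, single-dimension CPU demands, worst-fit tie-breaking) are hypothetical simplifications for illustration only.

```python
# Hypothetical greedy placement sketch: assign each scheduled job to a host
# that can still fit it. The real Qore-DL scheme jointly considers QoS and
# resource efficiency and may differ substantially from this toy version.

def greedy_assign(jobs, hosts):
    """jobs: list of (job_id, cpu_demand); hosts: dict host_id -> free_cpu.

    Returns a dict job_id -> host_id (or None if the job must keep queuing).
    Mutates `hosts` to reflect remaining free capacity.
    """
    placement = {}
    # Place larger jobs first so big demands are not stranded by small ones.
    for job_id, demand in sorted(jobs, key=lambda j: j[1], reverse=True):
        # Candidate hosts with enough remaining capacity for this job.
        feasible = [(h, free) for h, free in hosts.items() if free >= demand]
        if not feasible:
            placement[job_id] = None  # no host fits: job stays in the queue
            continue
        # Greedy choice: pick the host with the most free capacity
        # (worst-fit, which tends to spread load across hosts).
        host, _ = max(feasible, key=lambda hf: hf[1])
        hosts[host] -= demand
        placement[job_id] = host
    return placement

jobs = [("j1", 4), ("j2", 2), ("j3", 8)]
hosts = {"h1": 8, "h2": 6}
print(greedy_assign(jobs, hosts))  # j3 -> h1; j1 and j2 -> h2
```

A production scheduler would use multi-dimensional demands (CPU, GPU, memory, bandwidth) and a QoS-aware scoring function rather than raw free capacity, but the greedy loop structure is the same.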
| Original language | English |
|---|---|
| Article number | 102640 |
| Journal | Journal of Systems Architecture |
| Volume | 130 |
| DOI | |
| Publication status | Published - Sep. 2022 |
UN Sustainable Development Goals
This output contributes to the following Sustainable Development Goals:
- SDG 8: Decent Work and Economic Growth
- SDG 12: Responsible Consumption and Production
Fingerprint
Explore the research topics of 'Qore-DL: A QoS-aware joint optimization framework for distributed deep learning training'. Together they form a unique fingerprint.