
Qore-DL: A QoS-aware joint optimization framework for distributed deep learning training

  • Qin Hua
  • Shiyou Qian*
  • Dingyu Yang
  • Jianmei Guo
  • Jian Cao
  • Guangtao Xue
  • Minglu Li

*Corresponding author for this work

Research output: Contribution to journal, Article, peer-reviewed

Abstract

Resource management of deep learning training (DLT) jobs is critical for cluster resource efficiency and client QoS assurance. Most existing scheduling frameworks require clients to specify the resource configuration of their jobs, which can lead to over-provisioning or under-provisioning. Additionally, the performance of some static scheduling frameworks degrades in highly dynamic clusters. In this paper, we propose Qore-DL, a QoS-aware joint resource optimization framework for distributed DLT jobs. We divide the lifecycle of a DLT job into submission, queuing, and running stages. Qore-DL automatically configures reasonable resources for submitted jobs and greedily assigns scheduled jobs to hosts. For running jobs, Qore-DL employs a heuristic scheme to adjust their resources. Qore-DL jointly optimizes QoS satisfaction and resource efficiency across these three stages. We implemented a prototype of Qore-DL in TensorFlow on Kubernetes and conducted extensive experiments in CPU and GPU clusters to evaluate its performance. The experimental results show that, compared with its counterparts, Qore-DL can improve the job completion rate by up to 42.4% and the cluster resource efficiency by up to 21.8%.
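
The abstract states that scheduled jobs are greedily assigned to hosts but does not give the placement rule. As an illustration only, the following is a minimal sketch of one common greedy strategy (best-fit placement), not the algorithm from the paper. The names `Host`, `Job`, and `greedy_assign`, the CPU and memory fields, and the largest-job-first ordering are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    name: str
    free_cpu: float  # free CPU cores (hypothetical unit)
    free_mem: float  # free memory in GiB (hypothetical unit)
    jobs: list = field(default_factory=list)


@dataclass
class Job:
    name: str
    cpu: float  # requested CPU cores
    mem: float  # requested memory in GiB


def greedy_assign(jobs, hosts):
    """Best-fit greedy placement (an assumed rule, not Qore-DL's).

    Each job goes to the feasible host that would be left with the least
    spare CPU, with ties broken by least spare memory. Returns a dict
    mapping job name to host name, or to None if no host can fit the job.
    """
    placement = {}
    # Consider the most demanding jobs first so they are not starved.
    for job in sorted(jobs, key=lambda j: (j.cpu, j.mem), reverse=True):
        feasible = [
            h for h in hosts
            if h.free_cpu >= job.cpu and h.free_mem >= job.mem
        ]
        if not feasible:
            placement[job.name] = None
            continue
        best = min(
            feasible,
            key=lambda h: (h.free_cpu - job.cpu, h.free_mem - job.mem),
        )
        # Commit the placement and update the host's remaining capacity.
        best.free_cpu -= job.cpu
        best.free_mem -= job.mem
        best.jobs.append(job.name)
        placement[job.name] = best.name
    return placement
```

A greedy rule like this decides each placement once and never revisits it, which keeps scheduling cheap. That is presumably why a separate adjustment step for running jobs, such as the heuristic scheme the abstract mentions, is useful alongside it.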

Original language: English
Article number: 102640
Journal: Journal of Systems Architecture
Volume: 130
Publication status: Published - September 2022

UN Sustainable Development Goals

This output contributes to the following Sustainable Development Goals:

  1. SDG 8 - Decent Work and Economic Growth
  2. SDG 12 - Responsible Consumption and Production
