Qore-DL: A QoS-aware joint optimization framework for distributed deep learning training

Qin Hua, Shiyou Qian*, Dingyu Yang, Jianmei Guo, Jian Cao, Guangtao Xue, Minglu Li

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

2 Scopus citations

Abstract

Resource management of deep learning training (DLT) jobs is critical for cluster resource efficiency and client QoS assurance. Most existing scheduling frameworks require clients to specify job resource configurations, which can lead to over-provisioning or under-provisioning. In addition, the performance of some static scheduling frameworks degrades in highly dynamic clusters. In this paper, we propose Qore-DL, a QoS-aware joint resource optimization framework for distributed DLT jobs. We divide the lifecycle of a DLT job into submission, queuing, and running stages. Qore-DL automatically configures reasonable resources for submitted jobs and greedily assigns scheduled jobs to hosts. For running jobs, Qore-DL employs a heuristic scheme to adjust their resources. Qore-DL jointly optimizes QoS satisfaction and resource efficiency across the three stages of DLT jobs. We implemented a prototype of Qore-DL in TensorFlow on Kubernetes and conducted extensive experiments in CPU and GPU clusters to evaluate its performance. The experimental results show that, compared with its counterparts, Qore-DL improves the job completion rate by up to 42.4% and cluster resource efficiency by up to 21.8%.
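To make the "greedily assigns scheduled jobs to hosts" step concrete, the following is a minimal sketch of one possible greedy placement policy (best-fit by remaining capacity). All class names, fields, and the scoring rule here are assumptions for illustration; the abstract does not specify Qore-DL's actual heuristic.

```python
# Hypothetical sketch of a greedy host-assignment step for scheduled DLT jobs.
# The data model and the "most remaining GPUs first" scoring rule are
# illustrative assumptions, not Qore-DL's published algorithm.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Host:
    name: str
    free_cpus: int
    free_gpus: int

@dataclass
class Job:
    name: str
    cpus: int
    gpus: int

def greedy_assign(jobs: list[Job], hosts: list[Host]) -> dict[str, Optional[str]]:
    """Assign each scheduled job to the feasible host with the most
    remaining capacity; jobs that fit nowhere stay queued (None)."""
    placement: dict[str, Optional[str]] = {}
    # Place larger jobs first so big requests are not starved by small ones.
    for job in sorted(jobs, key=lambda j: (j.gpus, j.cpus), reverse=True):
        candidates = [h for h in hosts
                      if h.free_cpus >= job.cpus and h.free_gpus >= job.gpus]
        if not candidates:
            placement[job.name] = None  # leave in the queuing stage
            continue
        best = max(candidates, key=lambda h: (h.free_gpus, h.free_cpus))
        best.free_cpus -= job.cpus   # commit the job's resources
        best.free_gpus -= job.gpus
        placement[job.name] = best.name
    return placement
```

A scheduler like this runs once per scheduling round; the running-stage heuristic described in the abstract would then adjust the committed resources as cluster load changes.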

Original language: English
Article number: 102640
Journal: Journal of Systems Architecture
Volume: 130
State: Published - Sep 2022

Keywords

  • Cluster
  • Distributed deep learning
  • QoS
  • Resource schedule
