Hadar: Heterogeneity-Aware Optimization-Based Online Scheduling for Deep Learning Cluster

  • Abeda Sultana
  • Fei Xu
  • Xu Yuan
  • Li Chen*
  • Nian Feng Tzeng

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

5 Scopus citations

Abstract

With the wide adoption of deep neural network (DNN) models for various applications, enterprises and cloud providers have built deep learning clusters and increasingly deployed specialized accelerators, such as GPUs and TPUs, for DNN training jobs. Existing schedulers fall short in arbitrating cluster resources among multi-user jobs: they either lack fine-grained heterogeneity awareness or generalize poorly to various scheduling policies. To fill this gap, we propose Hadar, a novel task-level heterogeneity-aware scheduler based on an online optimization framework that can express other scheduling algorithms. Hadar leverages the performance traits of DNN jobs on a heterogeneous cluster, characterizes the task-level performance heterogeneity in the optimization problem, and makes scheduling decisions across both spatial and temporal dimensions. A primal-dual framework, together with our design of a dual subroutine, is employed to solve the optimization problem and guide the scheduling design. Extensive trace-driven simulations with representative DNN models demonstrate that Hadar improves the average job completion time (JCT) by 3× over an Apache YARN-based resource manager used in production. Moreover, compared with Gavel [1], the state-of-the-art heterogeneity-aware scheduler, Hadar improves the average JCT by 2.5×, shortens the queuing delay by 13%, and improves finish-time fairness (FTF) by 1.5%.
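
The average-JCT comparison reported above can be illustrated with a minimal sketch. The job records below are hypothetical values invented for illustration and are not results from the paper; `JobRecord`, `average_jct`, and `improvement_factor` are assumed helper names, not part of Hadar, Gavel, or YARN.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    """One finished training job (times in seconds since trace start)."""
    arrival: float
    finish: float

    @property
    def jct(self) -> float:
        # Job completion time covers queuing delay plus run time.
        return self.finish - self.arrival


def average_jct(jobs):
    """Mean job completion time over a non-empty list of jobs."""
    if not jobs:
        raise ValueError("need at least one job")
    return sum(job.jct for job in jobs) / len(jobs)


def improvement_factor(baseline, candidate):
    """How many times shorter the candidate's average JCT is than the baseline's."""
    return average_jct(baseline) / average_jct(candidate)


# Hypothetical traces, NOT measurements from the paper.
baseline_jobs = [JobRecord(0, 600), JobRecord(100, 1300), JobRecord(200, 2000)]
candidate_jobs = [JobRecord(0, 200), JobRecord(100, 500), JobRecord(200, 800)]

factor = improvement_factor(baseline_jobs, candidate_jobs)
```

Under this reading, a claim such as "improves the average JCT by 3×" means the ratio of the baseline's average JCT to the new scheduler's average JCT equals 3 for the same set of jobs.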

Original language: English
Title of host publication: Proceedings - 2024 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2024
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 681-691
Number of pages: 11
ISBN (Electronic): 9798350337662
DOIs
State: Published - 2024
Event: 38th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2024 - San Francisco, United States
Duration: 27 May 2024 → 31 May 2024

Publication series

Name: Proceedings - 2024 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2024

Conference

Conference: 38th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2024
Country/Territory: United States
City: San Francisco
Period: 27/05/24 → 31/05/24

Keywords

  • distributed deep learning
  • optimization
  • scheduling
