TY - GEN
T1 - Hadar
T2 - 38th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2024
AU - Sultana, Abeda
AU - Xu, Fei
AU - Yuan, Xu
AU - Chen, Li
AU - Tzeng, Nian Feng
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - With the wide adoption of deep neural network (DNN) models for various applications, enterprises, and cloud providers have built deep learning clusters and increasingly deployed specialized accelerators, such as GPUs and TPUs, for DNN training jobs. To arbitrate cluster resources among multi-user jobs, existing schedulers fall short, either lacking fine-grained heterogeneity awareness or hardly generalizable to various scheduling policies. To fill this gap, we propose a novel design of a task-level heterogeneity-aware scheduler, Hadar, based on an online optimization framework that can express other scheduling algorithms. Hadar leverages the performance traits of DNN jobs on a heterogeneous cluster, characterizes the task-level performance heterogeneity in the optimization problem, and makes scheduling decisions across both spatial and temporal dimensions. The primal-dual framework is employed, with our design of a dual subroutine, to solve the optimization problem and guide the scheduling design. Extensive trace-driven simulations with representative DNN models have been conducted to demonstrate that Hadar improves the average job completion time (JCT) by 3× over an Apache YARN-based resource manager used in production. Moreover, Hadar outperforms Gavel [1], the state-of-the-art heterogeneity-aware scheduler, by 2.5× for the average JCT, shortens the queuing delay by 13%, and improves FTF (Finish-Time-Fairness) by 1.5%.
AB - With the wide adoption of deep neural network (DNN) models for various applications, enterprises, and cloud providers have built deep learning clusters and increasingly deployed specialized accelerators, such as GPUs and TPUs, for DNN training jobs. To arbitrate cluster resources among multi-user jobs, existing schedulers fall short, either lacking fine-grained heterogeneity awareness or hardly generalizable to various scheduling policies. To fill this gap, we propose a novel design of a task-level heterogeneity-aware scheduler, Hadar, based on an online optimization framework that can express other scheduling algorithms. Hadar leverages the performance traits of DNN jobs on a heterogeneous cluster, characterizes the task-level performance heterogeneity in the optimization problem, and makes scheduling decisions across both spatial and temporal dimensions. The primal-dual framework is employed, with our design of a dual subroutine, to solve the optimization problem and guide the scheduling design. Extensive trace-driven simulations with representative DNN models have been conducted to demonstrate that Hadar improves the average job completion time (JCT) by 3× over an Apache YARN-based resource manager used in production. Moreover, Hadar outperforms Gavel [1], the state-of-the-art heterogeneity-aware scheduler, by 2.5× for the average JCT, shortens the queuing delay by 13%, and improves FTF (Finish-Time-Fairness) by 1.5%.
KW - distributed deep learning
KW - optimization
KW - scheduling
UR - https://www.scopus.com/pages/publications/85198903480
U2 - 10.1109/IPDPS57955.2024.00066
DO - 10.1109/IPDPS57955.2024.00066
M3 - 会议稿件
AN - SCOPUS:85198903480
T3 - Proceedings - 2024 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2024
SP - 681
EP - 691
BT - Proceedings - 2024 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2024
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 27 May 2024 through 31 May 2024
ER -