
E-LAS: Design and Analysis of Completion-Time Agnostic Scheduling for Distributed Deep Learning Cluster

  • Abeda Sultana
  • Li Chen
  • Fei Xu
  • Xu Yuan
  • University of Louisiana at Lafayette

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-reviewed

Abstract

With the prosperity of deep learning, enterprises and large platform providers, such as Microsoft, Amazon, and Google, have built and provided GPU clusters to facilitate distributed deep learning training. As deep learning training workloads are heterogeneous, with a diverse range of characteristics and resource requirements, it becomes increasingly crucial to design an efficient scheduler for distributed deep learning jobs in the GPU cluster. This paper proposes a simple yet effective scheduler, called E-LAS, with the objective of reducing the average training completion time of deep learning jobs. Without relying on estimates or prior knowledge of job running time, E-LAS leverages the real-time epoch progress rate, which is unique to distributed deep learning training jobs, as well as the attained service in the temporal and spatial domains, to guide scheduling decisions. A theoretical analysis of E-LAS offers a deeper understanding of the components of the scheduling criteria. Furthermore, we present a placement algorithm, complementary to the scheduling algorithm, that achieves better resource utilization without much implementation overhead. Extensive simulations demonstrate that E-LAS improves the average job completion time (JCT) by 10× over an Apache YARN-based resource manager used in production. Moreover, E-LAS outperforms Tiresias, the state-of-the-art scheduling algorithm customized for deep learning jobs, by almost 1.5× in both average JCT and queuing time.

Original language: English
Title of host publication: Proceedings of the 49th International Conference on Parallel Processing, ICPP 2020
Publisher: Association for Computing Machinery
ISBN (electronic): 9781450388160
DOI
Publication status: Published - 17 Aug 2020
Event: 49th International Conference on Parallel Processing, ICPP 2020 - Virtual, Online, Canada
Duration: 17 Aug 2020 to 20 Aug 2020

Publication series

Name: ACM International Conference Proceeding Series

Conference

Conference: 49th International Conference on Parallel Processing, ICPP 2020
Country/Territory: Canada
Virtual, Online
Period: 17/08/20 to 20/08/20
