E-LAS: Design and Analysis of Completion-Time Agnostic Scheduling for Distributed Deep Learning Cluster

Abeda Sultana, Li Chen, Fei Xu, Xu Yuan

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

9 Scopus citations

Abstract

With the prosperity of deep learning, enterprises and large platform providers, such as Microsoft, Amazon, and Google, have built and provided GPU clusters to facilitate distributed deep learning training. As deep learning training workloads are heterogeneous, with a diverse range of characteristics and resource requirements, it becomes increasingly crucial to design an efficient and optimal scheduler for distributed deep learning jobs in the GPU cluster. This paper proposes a simple yet effective scheduler, called E-LAS, with the objective of reducing the average training completion time of deep learning jobs. Without relying on estimation or prior knowledge of the job running time, E-LAS leverages the real-time epoch progress rate, which is unique to distributed deep learning training jobs, as well as the service attained in the temporal and spatial domains, to guide its scheduling decisions. A theoretical analysis of E-LAS is conducted to offer a deeper understanding of the components of the scheduling criteria. Furthermore, we present a placement algorithm, complementary to the scheduling algorithm, that achieves better resource utilization without much implementation overhead. Extensive simulations demonstrate that E-LAS improves the average job completion time (JCT) by 10× over an Apache YARN-based resource manager used in production. Moreover, E-LAS outperforms Tiresias, the state-of-the-art scheduling algorithm customized for deep learning jobs, by almost 1.5× in both average JCT and queuing time.
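The abstract does not give the E-LAS scheduling criterion itself, so the following Python sketch is only an illustration of the general idea it describes: ordering jobs without knowing their running time, using the service attained across the spatial (GPUs) and temporal (run time) domains together with the epoch progress rate. The `Job` fields, the `score` formula, and the `schedule_order` helper are all assumptions made for this example and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List

EPS = 1e-9  # avoids division by zero for jobs with no measured progress


@dataclass
class Job:
    name: str
    gpus: int          # spatial domain: number of GPUs held by the job
    run_time: float    # temporal domain: seconds executed so far
    epochs_done: int
    epochs_total: int

    @property
    def attained_service(self) -> float:
        # GPU-seconds consumed so far (spatial x temporal).
        return self.gpus * self.run_time

    @property
    def progress_rate(self) -> float:
        # Fraction of all epochs completed per second of execution.
        if self.run_time == 0:
            return 0.0
        return (self.epochs_done / self.epochs_total) / self.run_time


def score(job: Job) -> float:
    # Hypothetical criterion (NOT the paper's formula): attained service
    # divided by progress rate. Lower score means higher priority, so jobs
    # that have used little service or are progressing quickly run first.
    return job.attained_service / (job.progress_rate + EPS)


def schedule_order(jobs: List[Job]) -> List[Job]:
    # No completion-time estimate is used, only observed quantities.
    return sorted(jobs, key=score)


if __name__ == "__main__":
    jobs = [
        Job("A", gpus=4, run_time=100.0, epochs_done=50, epochs_total=100),
        Job("B", gpus=2, run_time=100.0, epochs_done=10, epochs_total=100),
        Job("C", gpus=8, run_time=0.0, epochs_done=0, epochs_total=100),
    ]
    print([j.name for j in schedule_order(jobs)])
```

Under these assumptions, the new job C runs first because it has attained no service. Job A is ranked ahead of job B even though it has consumed more GPU-seconds, because it is progressing through its epochs faster. A plain least-attained-service ordering would place B before A.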

Original language: English
Title of host publication: Proceedings of the 49th International Conference on Parallel Processing, ICPP 2020
Publisher: Association for Computing Machinery
ISBN (Electronic): 9781450388160
State: Published - 17 Aug 2020
Event: 49th International Conference on Parallel Processing, ICPP 2020 - Virtual, Online, Canada
Duration: 17 Aug 2020 → 20 Aug 2020

Publication series

Name: ACM International Conference Proceeding Series

Conference

Conference: 49th International Conference on Parallel Processing, ICPP 2020
Country/Territory: Canada
City: Virtual, Online
Period: 17/08/20 → 20/08/20

Keywords

  • Distributed Deep learning
  • Scheduling
