spotDNN: Provisioning Spot Instances for Predictable Distributed DNN Training in the Cloud

Ruitao Shang, Fei Xu*, Zhuoyan Bai, Li Chen, Zhi Zhou, Fangming Liu

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

5 Scopus citations

Abstract

Distributed Deep Neural Network (DDNN) training on cloud spot instances is increasingly compelling because it can significantly reduce the user's budget. To handle unexpected instance revocations, provisioning a heterogeneous cluster with the asynchronous parallel mechanism has become the dominant method for DDNN training with spot instances. However, blindly provisioning a cluster of spot instances can easily result in unpredictable DDNN training performance, mainly because of bottlenecks on the parameter server's network bandwidth and on PCIe bandwidth, as well as inadequate cluster heterogeneity. To address these challenges, we propose spotDNN, a heterogeneity-aware spot instance provisioning framework that provides predictable performance for DDNN training in the cloud. By explicitly considering the contention for bottleneck resources, we first build an analytical performance model of DDNN training in heterogeneous clusters. The model leverages the weighted average batch size and a convergence coefficient to quantify the DDNN training loss in heterogeneous clusters. Through lightweight workload profiling, we further design a cost-efficient instance provisioning strategy that incorporates bounds calculation and sliding window techniques to effectively guarantee the training performance service level objectives (SLOs). We have implemented a prototype of spotDNN and conducted extensive experiments on Amazon EC2. Experimental results show that spotDNN can deliver predictable DDNN training performance while reducing the monetary cost by up to 68.1% compared to existing solutions, with acceptable runtime overhead.
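The abstract names a weighted average batch size as one input to the loss model but does not define it. As a minimal sketch only, the snippet below assumes one plausible reading: in asynchronous training each worker pushes updates at its own rate, so each worker's batch size is weighted by its share of all updates. The `Worker` class, the `iters_per_sec` field, the instance labels, and the weighting itself are illustrative assumptions, not the paper's formulation.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Worker:
    """Hypothetical description of one worker in a heterogeneous cluster."""
    name: str             # illustrative label, not a real instance type
    batch_size: int       # mini-batch size used by this worker
    iters_per_sec: float  # assumed profiled update rate of this worker


def weighted_average_batch_size(workers: Sequence[Worker]) -> float:
    """Average the batch sizes, weighting each worker by its update rate.

    Assumption: a worker that pushes updates more often contributes
    proportionally more to the effective batch size seen by the model.
    """
    total_rate = sum(w.iters_per_sec for w in workers)
    if total_rate <= 0:
        raise ValueError("at least one worker must have a positive update rate")
    return sum(w.batch_size * w.iters_per_sec for w in workers) / total_rate


# Two workers: the faster one sends three updates for every one from the slower.
cluster = [Worker("fast-gpu", 128, 3.0), Worker("slow-gpu", 64, 1.0)]
print(weighted_average_batch_size(cluster))  # (128*3 + 64*1) / 4 = 112.0
```

Under this assumed weighting, adding slow workers with small batches pulls the effective batch size down less than a plain arithmetic mean would, which is one way cluster heterogeneity could enter a loss model.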

Original language: English
Title of host publication: 2023 IEEE/ACM 31st International Symposium on Quality of Service, IWQoS 2023
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9798350399738
State: Published - 2023
Event: 31st IEEE/ACM International Symposium on Quality of Service, IWQoS 2023 - Orlando, United States
Duration: 19 Jun 2023 → 21 Jun 2023

Publication series

Name: IEEE International Workshop on Quality of Service, IWQoS
Volume: 2023-June
ISSN (Print): 1548-615X

Conference

Conference: 31st IEEE/ACM International Symposium on Quality of Service, IWQoS 2023
Country/Territory: United States
City: Orlando
Period: 19/06/23 → 21/06/23

Keywords

  • distributed DNN training
  • heterogeneous clusters
  • predictable performance
  • spot instance provisioning
