Cynthia: Cost-efficient cloud resource provisioning for predictable distributed deep neural network training

  • Haoyue Zheng
  • Fei Xu*
  • Li Chen
  • Zhi Zhou
  • Fangming Liu

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

35 Scopus citations

Abstract

Training deep neural networks on large-scale datasets in a distributed manner in the cloud is an increasingly popular trend. However, distributed deep neural network (DDNN) training, widely known to be resource-intensive and time-consuming, suffers from unpredictable performance in the cloud because of resource bottlenecks, resource heterogeneity, and the imbalance between computation and communication, which together cause severe resource under-utilization. In this paper, we propose Cynthia, a cost-efficient cloud resource provisioning framework that provides predictable DDNN training performance and reduces the training budget. To explicitly account for resource bottlenecks and heterogeneity, Cynthia predicts the DDNN training time with a lightweight analytical performance model based on the resource consumption of workers and parameter servers. Given an accurate performance prediction, Cynthia optimally provisions cost-efficient cloud instances that jointly guarantee the training performance and minimize the training budget. We implement Cynthia on top of Kubernetes and launch a 56-docker cluster to train four representative DNN models. Extensive prototype experiments on Amazon EC2 demonstrate that Cynthia provides predictable training performance while reducing the monetary cost of DDNN workloads by up to 50.6% compared with state-of-the-art resource provisioning strategies, with acceptable runtime overhead.
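
The provisioning idea described in the abstract (predict training time from worker and parameter-server resources, then pick the cheapest configuration that still meets a time target) can be illustrated with a minimal sketch. This is not Cynthia's actual performance model or search procedure, which the abstract does not specify. The bottleneck formula, the `InstanceType` fields, the instance names, and all function names below are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class InstanceType:
    name: str              # hypothetical label, not a real EC2 instance type
    gflops: float          # assumed sustained compute rate per instance
    net_gbps: float        # assumed network bandwidth per instance
    price_per_hour: float  # assumed on-demand price


@dataclass(frozen=True)
class Plan:
    instance: str
    workers: int
    ps: int
    seconds: float
    cost: float


def predict_seconds(inst: InstanceType, workers: int, ps: int,
                    iterations: int, gflop_per_iter: float,
                    model_gbit: float) -> float:
    """Toy bottleneck model: each iteration takes as long as the slower of
    computation (work split across workers) and communication (every worker
    pushes gradients and pulls parameters through the parameter servers)."""
    compute = gflop_per_iter / (workers * inst.gflops)
    comm = 2 * model_gbit * workers / (ps * inst.net_gbps)
    return iterations * max(compute, comm)


def cheapest_plan(types: Iterable[InstanceType], deadline_s: float,
                  max_workers: int, max_ps: int, iterations: int,
                  gflop_per_iter: float, model_gbit: float) -> Optional[Plan]:
    """Exhaustively search homogeneous configurations and return the
    cheapest one whose predicted time meets the deadline, or None."""
    best: Optional[Plan] = None
    for inst in types:
        for workers in range(1, max_workers + 1):
            for ps in range(1, max_ps + 1):
                seconds = predict_seconds(inst, workers, ps, iterations,
                                          gflop_per_iter, model_gbit)
                if seconds > deadline_s:
                    continue  # violates the training-time target
                cost = (workers + ps) * inst.price_per_hour * seconds / 3600
                if best is None or cost < best.cost:
                    best = Plan(inst.name, workers, ps, seconds, cost)
    return best
```

In this toy model, adding workers shortens computation but increases traffic through the parameter servers, so the cheapest feasible plan is not necessarily the smallest or the largest cluster. A real system would replace `predict_seconds` with a model fitted to measured resource consumption.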

Original language: English
Title of host publication: Proceedings of the 48th International Conference on Parallel Processing, ICPP 2019
Publisher: Association for Computing Machinery
ISBN (Electronic): 9781450362955
State: Published - 5 Aug 2019
Event: 48th International Conference on Parallel Processing, ICPP 2019 - Kyoto, Japan
Duration: 5 Aug 2019 to 8 Aug 2019

Publication series

Name: ACM International Conference Proceeding Series

Conference

Conference: 48th International Conference on Parallel Processing, ICPP 2019
Country/Territory: Japan
City: Kyoto
Period: 5/08/19 to 8/08/19

Keywords

  • Cloud resource provisioning
  • Deep neural network training
  • Predictable performance
  • Resource bottleneck
  • Resource heterogeneity
