Prophet: Speeding up Distributed DNN Training with Predictable Communication Scheduling

  • Zhenwei Zhang
  • Qiang Qi
  • Ruitao Shang
  • Li Chen
  • Fei Xu*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

8 Scopus citations

Abstract

Optimizing the performance of Distributed Deep Neural Network (DDNN) training has recently become increasingly compelling, as DNN models grow more complex and training datasets grow larger. While existing works on communication scheduling mostly focus on overlapping computation and communication to improve DDNN training performance, GPU and network resources are still under-utilized in DDNN training clusters. To tackle this issue, in this paper we design and implement a predictable communication scheduling strategy named Prophet, which schedules gradient transfers in an appropriate order with the aim of maximizing GPU and network resource utilization. Leveraging our observed stepwise pattern of gradient transfer start times, Prophet first uses the monitored network bandwidth and the profiled time intervals among gradients to predict the appropriate number of gradients that can be grouped into blocks. These gradient blocks are then transferred one by one to guarantee high utilization of GPU and network resources while preserving the priority of gradient transfers (i.e., low-priority gradients cannot preempt high-priority gradients in the network transfer). Prophet makes forward propagation start as early as possible so as to greedily reduce the waiting (idle) time of GPU resources during the DDNN training process. Prototype experiments with representative DNN models trained on Amazon EC2 demonstrate that Prophet improves DDNN training performance by up to 40% compared with state-of-the-art priority-based communication scheduling strategies, with negligible runtime overhead.

Original language: English
Title of host publication: 50th International Conference on Parallel Processing, ICPP 2021 - Main Conference Proceedings
Publisher: Association for Computing Machinery
ISBN (Electronic): 9781450390682
DOIs
State: Published - 9 Aug 2021
Event: 50th International Conference on Parallel Processing, ICPP 2021 - Virtual, Online, United States
Duration: 9 Aug 2021 - 12 Aug 2021

Publication series

Name: ACM International Conference Proceeding Series

Conference

Conference: 50th International Conference on Parallel Processing, ICPP 2021
Country/Territory: United States
City: Virtual, Online
Period: 9/08/21 - 12/08/21

Keywords

  • communication scheduling
  • distributed DNN training
  • gradient transfer
  • resource utilization
