
Prophet: Speeding up Distributed DNN Training with Predictable Communication Scheduling

  • Zhenwei Zhang
  • Qiang Qi
  • Ruitao Shang
  • Li Chen
  • Fei Xu*
  • *Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Optimizing performance for Distributed Deep Neural Network (DDNN) training has recently become increasingly compelling, as the DNN model gets complex and the training dataset grows large. While existing works on communication scheduling mostly focus on overlapping the computation and communication to improve DDNN training performance, the GPU and network resources are still under-utilized in DDNN training clusters. To tackle this issue, in this paper, we design and implement a predictable communication scheduling strategy named Prophet to schedule the gradient transfer in an adequate order, with the aim of maximizing the GPU and network resource utilization. Leveraging our observed stepwise pattern of gradient transfer start time, Prophet first uses the monitored network bandwidth and the profiled time interval among gradients to predict the appropriate number of gradients that can be grouped into blocks. Then, these gradient blocks can be transferred one by one to guarantee high utilization of GPU and network resources while ensuring the priority of gradient transfer (i.e., low-priority gradients cannot preempt high-priority gradients in the network transfer). Prophet can make the forward propagation start as early as possible so as to greedily reduce the waiting (idle) time of GPU resources during the DDNN training process. Prototype experiments with representative DNN models trained on Amazon EC2 demonstrate that Prophet can improve the DDNN training performance by up to 40% compared with the state-of-the-art priority-based communication scheduling strategies, yet with negligible runtime performance overhead.
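
The grouping step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `group_gradients`, its parameters, and the use of a single fixed time interval are assumptions made for illustration. It assumes gradients are listed in transfer order and that a block should hold only as many bytes as the monitored bandwidth can move within one profiled interval.

```python
from typing import List


def group_gradients(
    sizes_bytes: List[int],
    bandwidth_bytes_per_s: float,
    interval_s: float,
) -> List[List[int]]:
    """Greedily pack consecutive gradients into blocks (illustrative only).

    A block is closed as soon as adding the next gradient would push its
    estimated transfer time (bytes / bandwidth) past `interval_s`.
    All names here are hypothetical and do not come from the paper.
    """
    if bandwidth_bytes_per_s <= 0 or interval_s <= 0:
        raise ValueError("bandwidth and interval must be positive")

    # Bytes that can be sent within one interval at the given bandwidth.
    budget_bytes = bandwidth_bytes_per_s * interval_s

    blocks: List[List[int]] = []
    current: List[int] = []
    current_bytes = 0
    for idx, size in enumerate(sizes_bytes):
        # Close the current block if this gradient would not fit.
        if current and current_bytes + size > budget_bytes:
            blocks.append(current)
            current, current_bytes = [], 0
        # A gradient larger than the budget still gets a block of its own.
        current.append(idx)
        current_bytes += size
    if current:
        blocks.append(current)
    return blocks


if __name__ == "__main__":
    # Five gradients, 10 bytes/s of bandwidth, 1-second interval.
    print(group_gradients([4, 4, 4, 10, 2], 10.0, 1.0))
```

Each returned block is a list of gradient indices that would be transferred together. A real scheduler would also have to refresh the bandwidth estimate at runtime and respect gradient priorities, which this sketch leaves out.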

Original language: English
Title of host publication: 50th International Conference on Parallel Processing, ICPP 2021 - Main Conference Proceedings
Publisher: Association for Computing Machinery
ISBN (Electronic): 9781450390682
DOI
Publication status: Published - 9 Aug 2021
Event: 50th International Conference on Parallel Processing, ICPP 2021 - Virtual, Online, United States
Duration: 9 Aug 2021 to 12 Aug 2021

Publication series

Name: ACM International Conference Proceeding Series

Conference

Conference: 50th International Conference on Parallel Processing, ICPP 2021
Country/Territory: United States
Virtual, Online
Period: 9/08/21 to 12/08/21
