Rationing bandwidth resources for mitigating network resource contention in distributed DNN training clusters

Qiang Qi, Fei Xu, Li Chen, Zhi Zhou

Research output: Contribution to journal › Article › peer-review


Abstract

Distributed deep neural network (DDNN) training becomes increasingly compelling as DNN models get complex and datasets grow large. Through an in-depth analysis of the latest Microsoft GPU cluster trace, we show that the co-located Parameter Server (PS) configuration is not uncommon in production DDNN training clusters, which inevitably causes intense network resource contention among the co-located PS and worker tasks. Our motivating experiments on Amazon EC2 further show that such network resource contention brings severe performance variation to DDNN training jobs. While existing works largely mitigate inter-job network resource contention, the intra-job (i.e., task-level) network resource contention among co-located PS and worker tasks has received comparatively little attention. To tackle such performance issues, in this paper we design and implement Nebula, a Network bandwidth resource allocation strategy for DDNN training tasks, which mitigates the network resource contention and alleviates the performance variation of DDNN training jobs. Nebula monitors the weights of co-located PS and worker tasks and rations the network bandwidth between the two tasks by comparing the corresponding task weights. We implement a prototype of Nebula and conduct extensive prototype experiments with representative DNN models trained on Amazon EC2. Our experiment results demonstrate that Nebula can reduce the iteration time of a DDNN training job by up to 25% and improve cluster resource utilization by up to 30% in comparison to MXNet, with practically acceptable runtime overhead.
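The abstract describes Nebula's core mechanism only at a high level: ration the host's network bandwidth between a co-located PS task and a worker task in proportion to their monitored task weights. Below is a minimal sketch of what such weight-proportional rationing could look like; the `Task` type, the weight values, and the mention of Linux `tc` enforcement are illustrative assumptions on our part, not Nebula's actual API.

```python
# Sketch of weight-proportional bandwidth rationing between a co-located
# parameter-server (PS) task and a worker task. All names and numbers
# here are hypothetical, not taken from the Nebula implementation.
from dataclasses import dataclass


@dataclass
class Task:
    name: str      # e.g., "ps" or "worker"
    weight: float  # monitored task weight (higher => more bandwidth)


def ration_bandwidth(tasks, link_capacity_mbps):
    """Split the host NIC capacity across tasks in proportion to their weights."""
    total = sum(t.weight for t in tasks)
    return {t.name: link_capacity_mbps * t.weight / total for t in tasks}


if __name__ == "__main__":
    # Hypothetical weights for a PS task and a worker task sharing a 10 Gbps NIC.
    tasks = [Task("ps", weight=3.0), Task("worker", weight=1.0)]
    shares = ration_bandwidth(tasks, link_capacity_mbps=10_000)
    for name, mbps in shares.items():
        print(f"{name}: {mbps:.0f} Mbps")
    # One plausible way to enforce such shares is per-task Linux traffic
    # control (tc) classes, re-applied whenever the monitored weights change.
```

Under these assumptions, the PS task would receive 7,500 Mbps and the worker 2,500 Mbps; how Nebula actually derives and enforces the weights is detailed in the paper itself.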

Original language: English
Pages (from-to): 171-185
Number of pages: 15
Journal: CCF Transactions on High Performance Computing
Volume: 3
Issue number: 2
DOIs
State: Published - Jun 2021

Keywords

  • Bandwidth allocation
  • Distributed DNN training
  • Network resource contention
