TY - JOUR
T1 - Rationing bandwidth resources for mitigating network resource contention in distributed DNN training clusters
AU - Qi, Qiang
AU - Xu, Fei
AU - Chen, Li
AU - Zhou, Zhi
N1 - Publisher Copyright:
© 2021, China Computer Federation (CCF).
PY - 2021/6
Y1 - 2021/6
N2 - Distributed deep neural network (DDNN) training becomes increasingly compelling as DNN models grow more complex and datasets grow larger. Through an in-depth analysis of the latest Microsoft GPU cluster trace, we show that the co-located Parameter Server (PS) configuration is not uncommon in production DDNN training clusters, which inevitably causes intense network resource contention among the co-located PS and worker tasks. Our motivating experiments on Amazon EC2 further show that such network resource contention brings severe performance variation to DDNN training jobs. While existing works largely mitigate inter-job network resource contention, the intra-job (i.e., task-level) network resource contention among co-located PS and worker tasks has received comparatively little attention. To tackle such performance issues, in this paper we design and implement Nebula, a Network bandwidth resource allocation strategy for DDNN training tasks, in order to mitigate network resource contention and alleviate the performance variation of DDNN training jobs. Nebula monitors the weights of the co-located PS and worker tasks and rations network bandwidth between the two tasks by comparing the corresponding task weights. We implement a prototype of Nebula and conduct extensive prototype experiments with representative DNN models trained on Amazon EC2. Our experimental results demonstrate that Nebula can reduce the iteration time of a DDNN training job by up to 25% and improve cluster resource utilization by up to 30% in comparison to MXNet, yet with practically acceptable runtime overhead.
KW - Bandwidth allocation
KW - Distributed DNN training
KW - Network resource contention
UR - https://www.scopus.com/pages/publications/85129834684
U2 - 10.1007/s42514-021-00064-x
DO - 10.1007/s42514-021-00064-x
M3 - Article
AN - SCOPUS:85129834684
SN - 2524-4922
VL - 3
SP - 171
EP - 185
JO - CCF Transactions on High Performance Computing
JF - CCF Transactions on High Performance Computing
IS - 2
ER -