Accelerating Synchronous Distributed Data Parallel Training with Small Batch Sizes

  • Yushu Sun
  • Nifei Bi
  • Chen Xu*
  • Yuean Niu
  • Hongfu Zhou
  • *Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-reviewed

Abstract

Synchronous distributed data parallel (SDDP) training is widely employed in distributed deep learning systems to train DNN models on large datasets. The performance of SDDP training essentially depends on the communication overhead and the statistical efficiency. However, existing approaches only optimize either the communication overhead or the statistical efficiency to accelerate SDDP training. In this paper, we adopt the advantages of those approaches and design a new approach, namely SkipSMA, that benefits from both low communication overhead and high statistical efficiency. In particular, we exploit the skipping strategy with an adaptive interval to decrease the communication frequency, which guarantees low communication overhead. Moreover, we employ the correction technique to mitigate the divergence while keeping small batch sizes, which ensures high statistical efficiency. To demonstrate the performance of SkipSMA, we integrate it into TensorFlow. Our experiments show that SkipSMA outperforms the state-of-the-art solutions for SDDP training, e.g., 6.88x speedup over SSGD.
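
The two ideas named in the abstract, skipping communication rounds with an adaptive interval and correcting replicas toward a shared average, can be illustrated with a toy simulation. The sketch below is not the SkipSMA algorithm from the paper: the function name, the quadratic objective, the doubling/halving interval rule, the tolerance `tol`, and the correction weight `alpha` are all assumptions chosen for illustration only.

```python
import random


def train_with_skipped_sync(num_workers=4, steps=200, lr=0.1, alpha=0.5,
                            min_interval=1, max_interval=8, tol=0.05, seed=0):
    """Toy simulation of local SGD on f(w) = 0.5 * (w - 3)^2.

    Hypothetical illustration only: workers take noisy local steps and
    communicate only every `interval` steps, where `interval` adapts to
    how far the replicas have drifted apart.
    """
    rng = random.Random(seed)
    target = 3.0
    replicas = [rng.uniform(-1.0, 1.0) for _ in range(num_workers)]
    interval = min_interval
    since_sync = 0
    sync_count = 0

    for _ in range(steps):
        # Local step on every worker; Gaussian noise stands in for the
        # gradient variance of a small batch.
        for k in range(num_workers):
            grad = (replicas[k] - target) + rng.gauss(0.0, 0.1)
            replicas[k] -= lr * grad

        since_sync += 1
        if since_sync < interval:
            continue  # skip this communication round

        # Synchronization: average the replicas, then pull each replica
        # part of the way toward the average (assumed correction step).
        center = sum(replicas) / num_workers
        divergence = max(abs(w - center) for w in replicas)
        replicas = [w - alpha * (w - center) for w in replicas]
        sync_count += 1
        since_sync = 0

        # Assumed adaptive rule: communicate less often while replicas
        # agree, more often once they drift apart.
        if divergence < tol:
            interval = min(max_interval, interval * 2)
        else:
            interval = max(min_interval, interval // 2)

    return sum(replicas) / num_workers, sync_count


if __name__ == "__main__":
    average, syncs = train_with_skipped_sync()
    print(f"average model: {average:.3f}, synchronizations: {syncs}")
```

In this toy setting the averaged model settles near the optimum at 3 while the workers communicate on only a fraction of the 200 steps. The correction step leaves the replica average unchanged, so it limits drift between workers without biasing the averaged model.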

Original language: English
Title of host publication: Database Systems for Advanced Applications - 29th International Conference, DASFAA 2024, Proceedings
Editors: Makoto Onizuka, Chuan Xiao, Jae-Gil Lee, Yongxin Tong, Yoshiharu Ishikawa, Kejing Lu, Sihem Amer-Yahia, H.V. Jagadish
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 503-513
Number of pages: 11
ISBN (Print): 9789819755684
DOI
Publication status: Published - 2024
Event: 29th International Conference on Database Systems for Advanced Applications, DASFAA 2024 - Gifu, Japan
Duration: 2 Jul 2024 to 5 Jul 2024

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 14854 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 29th International Conference on Database Systems for Advanced Applications, DASFAA 2024
Country/Territory: Japan
City: Gifu
Period: 2/07/24 to 5/07/24
