Accelerating Synchronous Distributed Data Parallel Training with Small Batch Sizes

Yushu Sun, Nifei Bi, Chen Xu, Yuean Niu, Hongfu Zhou

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Synchronous distributed data parallel (SDDP) training is widely employed in distributed deep learning systems to train DNN models on large datasets. The performance of SDDP training essentially depends on the communication overhead and the statistical efficiency. However, existing approaches optimize only one of the two, either the communication overhead or the statistical efficiency, to accelerate SDDP training. In this paper, we combine the advantages of both kinds of approaches and design a new approach, namely SkipSMA, that benefits from both low communication overhead and high statistical efficiency. In particular, we exploit a skipping strategy with an adaptive interval to decrease the communication frequency, which guarantees low communication overhead. Moreover, we employ a correction technique to mitigate model divergence while keeping small batch sizes, which ensures high statistical efficiency. To demonstrate the performance of SkipSMA, we integrate it into TensorFlow. Our experiments show that SkipSMA outperforms state-of-the-art solutions for SDDP training, achieving, e.g., a 6.88x speedup over SSGD.
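
For intuition, the following is a minimal sketch of the general idea described in the abstract: workers run local SGD steps, synchronize by model averaging only every few iterations (skipping), apply a correction that pulls each replica toward the average to damp divergence, and adapt the skip interval to the observed drift. This is not the authors' SkipSMA implementation; the skip-interval rule, the correction form, and all names (local_grad, skip_sma_sketch, tau0) are illustrative assumptions.

    import numpy as np

    def local_grad(w, batch):
        # Toy least-squares gradient on one worker's mini-batch (x, y).
        x, y = batch
        return 2.0 * x.T @ (x @ w - y) / len(y)

    def skip_sma_sketch(workers, w0, lr=0.01, steps=200, tau0=4):
        # workers: list of (x, y) data shards, one per worker.
        # tau0: initial synchronization (skip) interval -- an assumption.
        w_local = [w0.copy() for _ in workers]
        w_global = w0.copy()
        tau = tau0
        for t in range(steps):
            # Each worker takes a local SGD step on its own shard (no communication).
            for i, shard in enumerate(workers):
                w_local[i] = w_local[i] - lr * local_grad(w_local[i], shard)
            # Synchronize only every tau steps; skipping rounds cuts communication.
            if (t + 1) % tau == 0:
                avg = np.mean(w_local, axis=0)
                # Correction (assumed form): pull each replica toward the average
                # instead of overwriting it, to damp divergence between syncs.
                for i in range(len(workers)):
                    w_local[i] = w_local[i] + 0.5 * (avg - w_local[i])
                # Adaptive interval (assumed rule): widen the interval when the
                # replicas have stopped drifting apart, shrink it otherwise.
                drift = np.mean([np.linalg.norm(w - avg) for w in w_local])
                tau = min(2 * tau, 32) if drift < 1e-3 else max(tau // 2, 1)
                w_global = avg
        return w_global

A call such as skip_sma_sketch([(x1, y1), (x2, y2)], np.zeros(d)) would simulate two workers on one machine; in a real deployment each shard lives on a separate device and the averaging step is an all-reduce over the network, so skipping it most of the time is what lowers the communication overhead, while the correction and the small per-worker batches are what preserve statistical efficiency.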

Original language: English
Title of host publication: Database Systems for Advanced Applications - 29th International Conference, DASFAA 2024, Proceedings
Editors: Makoto Onizuka, Chuan Xiao, Jae-Gil Lee, Yongxin Tong, Yoshiharu Ishikawa, Kejing Lu, Sihem Amer-Yahia, H.V. Jagadish
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 503-513
Number of pages: 11
ISBN (Print): 9789819755684
DOIs
State: Published - 2024
Event: 29th International Conference on Database Systems for Advanced Applications, DASFAA 2024 - Gifu, Japan
Duration: 2 Jul 2024 - 5 Jul 2024

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 14854 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 29th International Conference on Database Systems for Advanced Applications, DASFAA 2024
Country/Territory: Japan
City: Gifu
Period: 2/07/24 - 5/07/24

Keywords

  • Data Parallel
  • Deep Learning Systems
  • Distributed Training
  • Synchronous Training
