Elastic Averaging for Efficient Pipelined DNN Training

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

8 Scopus citations

Abstract

Nowadays, the size of DNN models has grown rapidly. To train a large model, pipeline parallelism-based frameworks partition the model across GPUs and slice each batch of data into multiple micro-batches. However, pipeline parallelism suffers from a bubble issue and low peak utilization of GPUs. Recent work tries to address the two issues, but fails to exploit the benefit of vanilla pipeline parallelism, i.e., overlapping communication with computation. In this work, we employ an elastic averaging-based framework which explores elastic averaging to add multiple parallel pipelines. To help the framework exploit the advantage of pipeline parallelism while reducing the memory footprints, we propose a schedule, advance forward propagation. Moreover, since the numbers of parallel pipelines and micro-batches are essential to the framework performance, we propose a profiling-based tuning method to automatically determine the settings. We integrate those techniques into a prototype system, namely AvgPipe, based on PyTorch. Our experiments show that Avg-Pipe achieves a 1.7x speedups over state-of-the-art solutions of pipeline parallelism on average.

Original languageEnglish
Title of host publicationPPoPP 2023 - Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming
PublisherAssociation for Computing Machinery
Pages380-391
Number of pages12
ISBN (Electronic)9798400700156
DOIs
StatePublished - 11 Feb 2023
Event28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, PPoPP 2023 - Montreal, Canada
Duration: 25 Feb 20231 Mar 2023

Publication series

NameProceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP
ISSN (Print)1542-0205

Conference

Conference28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, PPoPP 2023
Country/TerritoryCanada
CityMontreal
Period25/02/231/03/23

Keywords

  • deep learning system
  • elastic averaging
  • pipeline parallelism

Fingerprint

Dive into the research topics of 'Elastic Averaging for Efficient Pipelined DNN Training'. Together they form a unique fingerprint.

Cite this