
Espresso: Cost-Efficient Large Model Training by Exploiting GPU Heterogeneity in the Cloud

  • Qiannan Zhou
  • Fei Xu*
  • Lingxuan Weng
  • Ruixing Li
  • Xudong Wu
  • Li Chen
  • Zhi Zhou
  • Fangming Liu
  • *Corresponding author for this work
  • East China Normal University
  • University of Louisiana at Lafayette
  • Sun Yat-Sen University
  • Peng Cheng Laboratory

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-reviewed

Abstract

As Transformer-based models deepen and datasets expand, training large models demands numerous accelerators, particularly GPUs, bringing high cloud expenses. However, conventional homogeneous resource provisioning is inefficient due to limited cloud resources and low GPU utilization. This challenge necessitates heterogeneous GPU provisioning for training in clouds. Current research on large model training often focuses on load balancing of stages, neglecting the varying computing and memory demands across stages. Additionally, the allocation of heterogeneous GPUs for training has surprisingly received little attention. This paper introduces Espresso, a cost-efficient GPU provisioning framework that unifies the heterogeneous GPU allocation (GPU allocator) and adequate stage placement (stage placer) for large model training in the cloud. Specifically, the GPU allocator proposes a cost tree-based provisioning strategy to prioritize searching allocation plans with lower costs and reduce unnecessary branches by multi-dimensional pruning methods. The resource-aware stage placer further devises a compute-memory ratio to optimize communication and computation efficiency during training. We have open-sourced a prototype of Espresso and conducted prototype experiments on four representative large models in public clouds. Extensive experiment results demonstrate that Espresso guarantees the performance for large model training while saving costs by up to 49.8% compared to state-of-the-art solutions, yet with acceptable runtime overhead.

Original language: English
Title of host publication: INFOCOM 2025 - IEEE Conference on Computer Communications
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9798331543051
DOI
Publication status: Published - 2025
Event: 2025 IEEE Conference on Computer Communications, INFOCOM 2025 - London, United Kingdom
Duration: 19 May 2025 to 22 May 2025

Publication series

Name: Proceedings - IEEE INFOCOM
ISSN (Print): 0743-166X

Conference

Conference: 2025 IEEE Conference on Computer Communications, INFOCOM 2025
Country/Territory: United Kingdom
City: London
Period: 19/05/25 to 22/05/25
