TY - JOUR
T1 - λDNN
T2 - Achieving Predictable Distributed DNN Training with Serverless Architectures
AU - Xu, Fei
AU - Qin, Yiling
AU - Chen, Li
AU - Zhou, Zhi
AU - Liu, Fangming
N1 - Publisher Copyright:
© 2022 IEEE.
PY - 2022/2/1
Y1 - 2022/2/1
AB - Serverless computing is becoming a promising paradigm for Distributed Deep Neural Network (DDNN) training in the cloud, as it allows users to decompose complex model training into a number of functions without managing virtual machines or servers. Though serverless platforms provide a simpler resource interface (i.e., function number and memory size), inadequate function resource provisioning (either under-provisioning or over-provisioning) easily leads to unpredictable DDNN training performance. Our empirical studies on AWS Lambda indicate that such unpredictable performance of serverless DDNN training is mainly caused by the resource bottleneck of Parameter Servers (PS) and small local batch sizes. In this article, we design and implement λDNN, a cost-efficient function resource provisioning framework that provides predictable performance for serverless DDNN training workloads while saving the budget of provisioned functions. Leveraging the PS network bandwidth and function CPU utilization, we build a lightweight analytical DDNN training performance model to enable the design of the λDNN resource provisioning strategy, so as to guarantee DDNN training performance with serverless functions. Extensive prototype experiments on AWS Lambda and complementary trace-driven simulations demonstrate that λDNN can deliver predictable DDNN training performance and save the monetary cost of function resources by up to 66.7 percent compared with state-of-the-art resource provisioning strategies, yet with an acceptable runtime overhead.
KW - Distributed DNN training
KW - function resource provisioning
KW - predictable performance
KW - serverless computing
UR - https://www.scopus.com/pages/publications/85100476464
U2 - 10.1109/TC.2021.3054656
DO - 10.1109/TC.2021.3054656
M3 - Article
AN - SCOPUS:85100476464
SN - 0018-9340
VL - 71
SP - 450
EP - 463
JO - IEEE Transactions on Computers
JF - IEEE Transactions on Computers
IS - 2
ER -