TY - GEN
T1 - HarmonyBatch: Batching multi-SLO DNN Inference with Heterogeneous Serverless Functions
T2 - 32nd IEEE/ACM International Symposium on Quality of Service, IWQoS 2024
AU - Chen, Jiabin
AU - Xu, Fei
AU - Gu, Yikun
AU - Chen, Li
AU - Liu, Fangming
AU - Zhou, Zhi
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Deep Neural Network (DNN) inference on serverless functions is gaining prominence due to its potential for substantial budget savings. Existing works on serverless DNN inference solely optimize batching requests from one application with a single Service Level Objective (SLO) on CPU functions. However, production serverless DNN inference traces indicate that the request arrival rate of applications is surprisingly low, which inevitably causes a long batching time and SLO violations. Hence, there is an urgent need for batching multiple DNN inference requests with diverse SLOs (i.e., multi-SLO DNN inference) in serverless platforms. Moreover, the potential performance and cost benefits of deploying heterogeneous (i.e., CPU and GPU) functions for DNN inference have received scant attention. In this paper, we present HarmonyBatch, a cost-efficient resource provisioning framework designed to achieve predictable performance for multi-SLO DNN inference with heterogeneous serverless functions. Specifically, we construct an analytical performance and cost model of DNN inference on both CPU and GPU functions, by explicitly considering the GPU time-slicing scheduling mechanism and request arrival rate distribution. Based on such a model, we devise a two-stage merging strategy in HarmonyBatch to judiciously batch the multi-SLO DNN inference requests into application groups. It aims to minimize the budget of function provisioning for each application group while guaranteeing diverse performance SLOs of inference applications. We have implemented a prototype of HarmonyBatch on Alibaba Cloud Function Compute. Extensive prototype experiments with representative DNN inference workloads demonstrate that HarmonyBatch can provide predictable performance to serverless DNN inference workloads while reducing the monetary cost by up to 82.9% compared to the state-of-the-art methods.
KW - DNN inference
KW - SLO guarantee
KW - resource provisioning
KW - serverless computing
UR - https://www.scopus.com/pages/publications/85206354554
U2 - 10.1109/IWQoS61813.2024.10682915
DO - 10.1109/IWQoS61813.2024.10682915
M3 - Conference contribution
AN - SCOPUS:85206354554
T3 - IEEE International Workshop on Quality of Service, IWQoS
BT - 2024 IEEE/ACM 32nd International Symposium on Quality of Service, IWQoS 2024
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 19 June 2024 through 21 June 2024
ER -