TY - JOUR
T1 - JointPS
T2 - Joint Parameter Server Placement and Flow Scheduling for Machine Learning Clusters
AU - Zhao, Yangming
AU - Yang, Cheng
AU - Zhao, Gongming
AU - Hou, Yunfei
AU - Wang, Ting
AU - Qiao, Chunming
N1 - Publisher Copyright: © 1968-2012 IEEE.
PY - 2023/12/1
Y1 - 2023/12/1
N2 - To distill more information from training data, machine learning models are given ever more parameters. As a result, communication becomes the bottleneck of Distributed Machine Learning (DML) systems. To alleviate the communication resource contention among DML jobs in machine learning clusters, which prolongs model training time, this paper proposes JointPS. JointPS first minimizes the completion time of a single training epoch for each DML job by jointly optimizing parameter server placement and flow scheduling, and predicts the number of remaining training epochs for each job using a dynamic model fitting method. From these two quantities, JointPS estimates the remaining time to complete each DML job and schedules jobs following the Minimum Remaining Time First (MRTF) principle to minimize the average job completion time. To the best of our knowledge, JointPS is the first work that minimizes the average completion time of network-intensive DML training jobs by jointly optimizing parameter server placement and flow scheduling without modifying the DML models or training procedures. Through both testbed experiments and extensive simulations, we demonstrate that JointPS can reduce the average completion time of DML jobs by up to 88% compared with state-of-the-art technology.
AB - To distill more information from training data, machine learning models are given ever more parameters. As a result, communication becomes the bottleneck of Distributed Machine Learning (DML) systems. To alleviate the communication resource contention among DML jobs in machine learning clusters, which prolongs model training time, this paper proposes JointPS. JointPS first minimizes the completion time of a single training epoch for each DML job by jointly optimizing parameter server placement and flow scheduling, and predicts the number of remaining training epochs for each job using a dynamic model fitting method. From these two quantities, JointPS estimates the remaining time to complete each DML job and schedules jobs following the Minimum Remaining Time First (MRTF) principle to minimize the average job completion time. To the best of our knowledge, JointPS is the first work that minimizes the average completion time of network-intensive DML training jobs by jointly optimizing parameter server placement and flow scheduling without modifying the DML models or training procedures. Through both testbed experiments and extensive simulations, we demonstrate that JointPS can reduce the average completion time of DML jobs by up to 88% compared with state-of-the-art technology.
KW - Distributed machine learning
KW - average job completion time
KW - flow scheduling
KW - parameter server placement
UR - https://www.scopus.com/pages/publications/85169707201
U2 - 10.1109/TC.2023.3305753
DO - 10.1109/TC.2023.3305753
M3 - Article
AN - SCOPUS:85169707201
SN - 0018-9340
VL - 72
SP - 3503
EP - 3518
JO - IEEE Transactions on Computers
JF - IEEE Transactions on Computers
IS - 12
ER -