TY - GEN
T1 - Stage delay scheduling
T2 - 48th International Conference on Parallel Processing, ICPP 2019
AU - Shao, Wujie
AU - Xu, Fei
AU - Chen, Li
AU - Zheng, Haoyue
AU - Liu, Fangming
N1 - Publisher Copyright: © 2019 ACM.
PY - 2019/8/5
Y1 - 2019/8/5
N2 - To increase the resource utilization of datacenters, big data analytics jobs commonly run stages in parallel, which are organized into and scheduled according to a Directed Acyclic Graph (DAG). Through an in-depth analysis of the latest Alibaba cluster trace and our motivation experiments on Amazon EC2, however, we show that CPU and network resources are still under-utilized due to unwise stage scheduling, which prolongs the completion time of a DAG-style job (e.g., in Spark). While existing works on reducing job completion time focus on either task scheduling or job scheduling, stage scheduling has received comparatively little attention. In this paper, we design and implement DelayStage, a simple yet effective stage delay scheduling strategy that interleaves cluster resources across parallel stages, so as to increase cluster resource utilization and speed up job execution. With the aim of minimizing the makespan of parallel stages, DelayStage judiciously arranges the execution of stages in a pipelined manner to maximize the performance benefits of resource interleaving. Extensive prototype experiments on 30 Amazon EC2 instances and complementary trace-driven simulations show that DelayStage can improve cluster resource utilization by up to 81.8% and reduce job completion time by up to 41.3%, in comparison to stock Spark and state-of-the-art stage scheduling strategies, with acceptable runtime overhead.
KW - Big data analytics
KW - Job completion time
KW - Parallel stages
KW - Resource interleaving
KW - Stage delay scheduling
UR - https://www.scopus.com/pages/publications/85071093464
U2 - 10.1145/3337821.3337872
DO - 10.1145/3337821.3337872
M3 - Conference contribution
AN - SCOPUS:85071093464
T3 - ACM International Conference Proceeding Series
BT - Proceedings of the 48th International Conference on Parallel Processing, ICPP 2019
PB - Association for Computing Machinery
Y2 - 5 August 2019 through 8 August 2019
ER -