TY - GEN
T1 - Continuously bulk loading over range partitioned tables for large scale historical data
AU - He, Xiaolong
AU - Cai, Peng
AU - Zhou, Xuan
AU - Zhou, Aoying
N1 - Publisher Copyright: © 2021 IEEE.
PY - 2021/4
Y1 - 2021/4
N2 - To support efficient and continuous loading of large-scale historical data into a distributed data management system (DDMS), the bulk workload must be balanced across machines. The fundamental problem is to estimate, for each partition, the time needed to merge currently loaded data (the incremental data) into previously loaded data (the baseline data), an operation referred to as partition merge. In this work, we present a learning-based framework, LeaBalancer, that balances the merge loads across cluster nodes. When the system is scheduled to run regular bulk loading tasks, LeaBalancer learns to predict the partition merge time from the merge logs generated by previous bulk loads. Nevertheless, it is still difficult to balance the bulk workload using only a single planning phase, because merge time predictions can be inaccurate and other heavy workloads may be in progress during the bulk loading. To resolve this problem, we design a multi-round balancing strategy: at the beginning of each round, LeaBalancer chooses partitions for migration according to the remaining merge load on each node. Experimental results show that LeaBalancer adaptively balances load under various settings.
KW - Bulk Loading
KW - Load Balancing
KW - Machine Learning
KW - Range Partitioning
UR - https://www.scopus.com/pages/publications/85112865793
U2 - 10.1109/ICDE51399.2021.00088
DO - 10.1109/ICDE51399.2021.00088
M3 - Conference contribution
AN - SCOPUS:85112865793
T3 - Proceedings - International Conference on Data Engineering
SP - 960
EP - 971
BT - Proceedings - 2021 IEEE 37th International Conference on Data Engineering, ICDE 2021
PB - IEEE Computer Society
T2 - 37th IEEE International Conference on Data Engineering, ICDE 2021
Y2 - 19 April 2021 through 22 April 2021
ER -