Continuously bulk loading over range partitioned tables for large scale historical data

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

3 Scopus citations

Abstract

To load large-scale historical data into a distributed data management system (DDMS) efficiently and continuously, the system needs to balance the bulk workload across machines. The fundamental problem is to estimate, for each partition, the time needed to merge newly loaded data (the incremental data) into previously loaded data (the baseline data), an operation referred to as partition merge. In this work, we present a learning-based framework, LeaBalancer, that balances the merge loads across cluster nodes. When the system is scheduled to run regular bulk loading tasks, LeaBalancer learns to predict the partition merge time from the merge logs generated by previous bulk loads. Nevertheless, it remains difficult to balance the bulk workload with a single planning phase, because merge time predictions can be inaccurate and other heavy workloads may be in progress during the bulk load. To resolve this problem, we design a multi-round balancing strategy: at the beginning of each round, LeaBalancer chooses partitions for migration according to the remaining merge load on each node. Experimental results show that LeaBalancer adaptively balances load under various settings.
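
The per-round choice of partitions to migrate can be illustrated with a small sketch. This is not the paper's algorithm: the abstract does not specify how partitions are selected, so the greedy rule below (move a partition from the most loaded node to the least loaded node while doing so narrows the gap between them) is an assumption made for illustration only. The function name `plan_round` and the input layout are hypothetical, and the merge times stand in for whatever a learned predictor would estimate.

```python
from typing import Dict, List, Tuple


def plan_round(remaining: Dict[str, Dict[str, float]]) -> List[Tuple[str, str, str]]:
    """Plan migrations for one balancing round (illustrative greedy rule).

    `remaining` maps node -> {partition: predicted remaining merge time}.
    Returns a list of (partition, source_node, target_node) moves.
    """
    # Work on a copy so the caller's estimates are not mutated.
    nodes = {node: dict(parts) for node, parts in remaining.items()}
    moves: List[Tuple[str, str, str]] = []
    while True:
        load = {node: sum(parts.values()) for node, parts in nodes.items()}
        src = max(load, key=load.get)  # most loaded node
        dst = min(load, key=load.get)  # least loaded node
        gap = load[src] - load[dst]
        # A move narrows the gap only if the partition costs less than the gap.
        candidates = {p: t for p, t in nodes[src].items() if 0 < t < gap}
        if not candidates:
            return moves
        # Assumed heuristic: prefer the partition closest to half the gap.
        part = min(candidates, key=lambda p: abs(candidates[p] - gap / 2))
        nodes[dst][part] = nodes[src].pop(part)
        moves.append((part, src, dst))
```

Calling `plan_round` again at the start of each later round, with refreshed estimates of the remaining merge time, loosely mirrors the multi-round idea: errors in earlier predictions or interference from other workloads show up as a new imbalance that the next round can correct.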

Original language: English
Title of host publication: Proceedings - 2021 IEEE 37th International Conference on Data Engineering, ICDE 2021
Publisher: IEEE Computer Society
Pages: 960-971
Number of pages: 12
ISBN (Electronic): 9781728191843
State: Published - Apr 2021
Event: 37th IEEE International Conference on Data Engineering, ICDE 2021 - Virtual, Online, Chania, Greece
Duration: 19 Apr 2021 to 22 Apr 2021

Publication series

Name: Proceedings - International Conference on Data Engineering
Volume: 2021-April
ISSN (Print): 1084-4627
ISSN (Electronic): 2375-0286

Conference

Conference: 37th IEEE International Conference on Data Engineering, ICDE 2021
Country/Territory: Greece
City: Virtual, Online, Chania
Period: 19/04/21 to 22/04/21

Keywords

  • Bulk Loading
  • Load Balancing
  • Machine Learning
  • Range Partitioning
