DS2: Handling data skew using data stealings over high-speed networks

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Scopus citation

Abstract

Distributed in-memory computing systems offer dramatic performance improvements over traditional disk-based systems, which has made them widely used in large-scale data processing applications. Unfortunately, the uneven and unpredictable data distributions caused by data skew significantly degrade their performance. In Spark, when data skew occurs, some tasks process much more data than others and become the performance bottleneck. Traditional approaches to handling data skew rely on sampling and repartitioning, which incur additional overhead. In this paper, we divide data skew in distributed data processing systems into intra-node and inter-node skew. Based on data stealing, we propose DS2 to handle both kinds of skew. It aims to improve performance under data skew without introducing additional overhead. DS2 first balances the skewed data distribution locally within each node and then handles inter-node skew via RDMA during execution. It achieves up to 2.96× speedup on the aggregation operator and 2.81× speedup on the join operator.
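
To make the idea of data stealing concrete, the following is a minimal toy sketch, not the DS2 implementation. It assumes a single-process model in which each worker handles one record per time step, an idle worker takes half of the most loaded worker's remaining records, and stealing itself costs nothing (no RDMA or network is modelled). The function names `makespan_without_stealing` and `makespan_with_stealing` are hypothetical and chosen only for illustration.

```python
def makespan_without_stealing(partition_sizes):
    """Finish time when each worker processes only its own partition.

    Assumes one record per worker per time step, so the largest
    partition alone determines when the whole stage completes.
    """
    return max(partition_sizes)


def makespan_with_stealing(partition_sizes):
    """Finish time when idle workers steal records from the busiest worker.

    Toy model (an assumption, not the paper's algorithm): stealing is
    free and an idle worker takes half of the victim's remaining records.
    """
    remaining = list(partition_sizes)  # records still queued per worker
    steps = 0
    while any(remaining):
        # Phase 1: every idle worker steals from the currently most loaded one.
        for i in range(len(remaining)):
            if remaining[i] == 0:
                victim = max(range(len(remaining)), key=lambda j: remaining[j])
                stolen = remaining[victim] // 2
                remaining[victim] -= stolen
                remaining[i] += stolen
        # Phase 2: every worker with queued records processes one record.
        remaining = [r - 1 if r > 0 else 0 for r in remaining]
        steps += 1
    return steps


if __name__ == "__main__":
    skewed = [100, 10, 10, 10]  # one hot partition, three small ones
    print("without stealing:", makespan_without_stealing(skewed))
    print("with stealing:   ", makespan_with_stealing(skewed))
```

In this model the skewed stage can never finish faster than the total record count divided by the number of workers, so the gap between the two results shows how much of the skew penalty stealing can recover under these idealised assumptions. A real system would also have to account for the cost of moving the stolen data.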

Original language: English
Title of host publication: Proceedings - 2021 IEEE 37th International Conference on Data Engineering, ICDE 2021
Publisher: IEEE Computer Society
Pages: 1865-1870
Number of pages: 6
ISBN (Electronic): 9781728191843
State: Published - Apr 2021
Event: 37th IEEE International Conference on Data Engineering, ICDE 2021 - Virtual, Online, Chania, Greece
Duration: 19 Apr 2021 to 22 Apr 2021

Publication series

Name: Proceedings - International Conference on Data Engineering
Volume: 2021-April
ISSN (Print): 1084-4627
ISSN (Electronic): 2375-0286

Conference

Conference: 37th IEEE International Conference on Data Engineering, ICDE 2021
Country/Territory: Greece
City: Virtual, Online, Chania
Period: 19/04/21 to 22/04/21

Keywords

  • Data Skew
  • Data Stealing
  • OLAP
  • RDMA
