A distributed join algorithm on separated data storage

Qiu Shi Fan, Min Qi Zhou*, Ao Ying Zhou

*Corresponding author for this work

Research output: Contribution to journalArticlepeer-review

2 Scopus citations

Abstract

In this big data era, the efficiency of join operator is needed to be optimized imperatively, especially for database systems with separated baseline and incremental data. In this database system architecture, the baseline data is stored in the disk as usual, while the incremental data is stored in the memory to achieve both higher transactional processing efficiency and scalability. HBase, BigTable, OceanBase are typical database systems deployed with such separated baseline and incremental data architecture, but they provided join operator with very low efficiency only. The main reasons are as follows: they have to merge the baseline data and incremental data at first; and the network overhead is very heavy because of the complex data model they used. This paper proposes an algorithm for efficient join operator based on separated baseline data and incremental data. It partitions the join attributes into specified ranges first and merges each range on different nodes in parallel. The key point of this algorithm is that it partitions, sorts the baseline data and incremental data separately to achieve even higher parallelism before merge join and avoids the cost of merge of the baseline and incremental data tuples which will not be appeared in the result set. We implement this algorithm based on OceanBase, an open sourced distributed database system. The experimental results confirm that our algorithm can improve the join performance of OceanBase database by a large margin.

Original languageEnglish
Pages (from-to)2102-2113
Number of pages12
JournalJisuanji Xuebao/Chinese Journal of Computers
Volume39
Issue number10
DOIs
StatePublished - 1 Oct 2016

Keywords

  • Distributed join
  • Incremental data
  • Parallel process
  • Sort merge join

Fingerprint

Dive into the research topics of 'A distributed join algorithm on separated data storage'. Together they form a unique fingerprint.

Cite this