TY - GEN
T1 - Scalable parallel join for huge tables
AU - Weng, Nianlong
AU - Zhou, Minqi
AU - Shan, Ming Chien
AU - Zhou, Aoying
PY - 2013
Y1 - 2013
N2 - Parallel join processing, which combines tuples from two or more relational tables in parallel, is becoming increasingly important as tables grow huge, especially in the big data era. Several algorithms have been proposed based on the prevailing MapReduce paradigm, but most of them incur both high communication and high synchronization costs. In this paper, we propose a novel algorithm for scalable parallel join processing over column-wise stored data. To fit the widely deployed Hadoop system, we adopt the Hadoop Distributed File System (HDFS) as the file system across a large set of machines. Tables are projected (i.e., vertically partitioned), segmented (i.e., horizontally partitioned), clustered, and placed in a column-wise format over the distributed file system based on Gray code. By fetching the required tuples from other tables on demand using an optimized Bloom filter strategy, each segment (i.e., partition) can accomplish its join processing individually with dramatically reduced communication cost, thereby achieving the desired scalable parallelism. Tuples are transmitted across the network in a demand-driven manner, rather than via the hash-based data movement of the MapReduce paradigm. Our extensive performance studies confirm the effectiveness and efficiency of our methods.
UR - https://www.scopus.com/pages/publications/84886077373
U2 - 10.1109/BigData.Congress.2013.29
DO - 10.1109/BigData.Congress.2013.29
M3 - Conference contribution
AN - SCOPUS:84886077373
SN - 9780768550060
T3 - Proceedings - 2013 IEEE International Congress on Big Data, BigData 2013
SP - 157
EP - 164
BT - Proceedings - 2013 IEEE International Congress on Big Data, BigData 2013
T2 - 2013 IEEE International Congress on Big Data, BigData 2013
Y2 - 27 June 2013 through 2 July 2013
ER -