TY - GEN
T1 - Join optimization in the MapReduce environment for column-wise data store
AU - Zhou, Minqi
AU - Zhang, Rong
AU - Zeng, Dadan
AU - Qian, Weining
AU - Zhou, Aoying
PY - 2010
Y1 - 2010
N2 - Chain join processing, which combines records from two or more tables sequentially, has been well studied in centralized databases. However, it has seldom been discussed in the cloud computing era and remains an important problem to solve, especially where structured (or relational) data are stored column-wise (by attribute) in distributed file systems (e.g., Google File System) spanning hundreds or even thousands of commodity PCs. In this paper, we propose a novel method for chain join processing, one of the common primitives for analyzing column-wise stored data in the cloud era. By effectively selecting the dedicated records (tuples) for the chain join, based on information exploited within the bipartite join graph, the communication cost of record transmission can be reduced dramatically. A bushy tree structure is deployed to regulate the chain join sequence, which further reduces the number of intermediate results generated and transmitted and exploits higher parallelism in join processing, resulting in more efficient join processing. Our extensive performance study confirms the effectiveness and efficiency of our methods.
AB - Chain join processing, which combines records from two or more tables sequentially, has been well studied in centralized databases. However, it has seldom been discussed in the cloud computing era and remains an important problem to solve, especially where structured (or relational) data are stored column-wise (by attribute) in distributed file systems (e.g., Google File System) spanning hundreds or even thousands of commodity PCs. In this paper, we propose a novel method for chain join processing, one of the common primitives for analyzing column-wise stored data in the cloud era. By effectively selecting the dedicated records (tuples) for the chain join, based on information exploited within the bipartite join graph, the communication cost of record transmission can be reduced dramatically. A bushy tree structure is deployed to regulate the chain join sequence, which further reduces the number of intermediate results generated and transmitted and exploits higher parallelism in join processing, resulting in more efficient join processing. Our extensive performance study confirms the effectiveness and efficiency of our methods.
UR - https://www.scopus.com/pages/publications/84863149565
U2 - 10.1109/SKG.2010.18
DO - 10.1109/SKG.2010.18
M3 - Conference contribution
AN - SCOPUS:84863149565
SN - 9780769541891
T3 - Proceedings - 6th International Conference on Semantics, Knowledge and Grid, SKG 2010
SP - 97
EP - 104
BT - Proceedings - 6th International Conference on Semantics, Knowledge and Grid, SKG 2010
T2 - 6th International Conference on Semantics, Knowledge and Grid, SKG 2010
Y2 - 1 November 2010 through 3 November 2010
ER -