TY - JOUR
T1 - Efficient query processing framework for big data warehouse
T2 - an almost join-free approach
AU - Wang, Huiju
AU - Qin, Xiongpai
AU - Zhou, Xuan
AU - Li, Furong
AU - Qin, Zuoyan
AU - Zhu, Qing
AU - Wang, Shan
N1 - Publisher Copyright:
© 2014, Higher Education Press and Springer-Verlag Berlin Heidelberg.
PY - 2015/4
Y1 - 2015/4
AB - The rapidly increasing scale of data warehouses is challenging today’s data analytical technologies. A conventional data analytical platform processes data warehouse queries using a star schema: it normalizes the data into a fact table and a number of dimension tables, and during query processing it selectively joins the tables according to users’ demands. This model is space-economical. However, it faces two problems when applied to big data. First, join is an expensive operation, which prevents a parallel database or a MapReduce-based system from achieving efficiency and scalability simultaneously. Second, join operations have to be executed repeatedly, even though many join results could be reused by different queries. In this paper, we propose a new query processing framework for data warehouses. It pushes the join operations partially to the pre-processing phase and partially to the post-processing phase, so that data warehouse queries can be transformed into massively parallel filter-aggregation operations on the fact table. In contrast to the conventional query processing models, our approach is efficient, scalable and stable despite the large number of tables involved in the join. It is especially suitable for a large-scale parallel data warehouse. Our empirical evaluation on Hadoop shows that our framework exhibits linear scalability and outperforms some existing approaches by an order of magnitude.
KW - TAMP
KW - data warehouse
KW - join-free
KW - large scale
KW - multi-version schema
UR - https://www.scopus.com/pages/publications/84925438267
U2 - 10.1007/s11704-014-4025-6
DO - 10.1007/s11704-014-4025-6
M3 - Article
AN - SCOPUS:84925438267
SN - 2095-2228
VL - 9
SP - 224
EP - 236
JO - Frontiers of Computer Science
JF - Frontiers of Computer Science
IS - 2
ER -