TY - JOUR
T1 - D-Cubicle: boosting data transfer dynamically for large-scale analytical queries in single-GPU systems
AU - Wang, Jialun
AU - Pang, Wenhao
AU - Weng, Chuliang
AU - Zhou, Aoying
N1 - Publisher Copyright:
© 2023, Higher Education Press.
PY - 2023/8
Y1 - 2023/8
N2 - In analytical queries, important operators such as JOIN and GROUP BY are well suited to parallelization, and the GPU is an ideal accelerator given its parallel computing power. However, when data sizes grow to hundreds of gigabytes, a single GPU card becomes insufficient because of its small global memory capacity and the slow data transfer between host and device. A straightforward solution is to add more GPUs linked by high-bandwidth interconnects, but this greatly increases cost. We utilize unified memory (UM) provided by NVIDIA CUDA (Compute Unified Device Architecture) to make it possible to accelerate large-scale queries on a single GPU, but we observe that the transfer performance between the host and UM, which occurs before kernel execution, is often significantly below the theoretical bandwidth. An important reason is that, in a single-GPU environment, data processing systems usually invoke only one thread or a static number of threads for data copying, leading to inefficient transfers that heavily slow down overall performance. In this paper, we present D-Cubicle, a runtime module that accelerates data transfer between host-managed memory and unified memory. D-Cubicle boosts the actual transfer speed dynamically through a self-adaptive approach. In our experiments, with data transfer taken into account, D-Cubicle processes 200 GB of data on a single GPU with 32 GB of global memory, achieving 1.43x the performance of the baseline system on average and 2.09x at maximum.
AB - In analytical queries, important operators such as JOIN and GROUP BY are well suited to parallelization, and the GPU is an ideal accelerator given its parallel computing power. However, when data sizes grow to hundreds of gigabytes, a single GPU card becomes insufficient because of its small global memory capacity and the slow data transfer between host and device. A straightforward solution is to add more GPUs linked by high-bandwidth interconnects, but this greatly increases cost. We utilize unified memory (UM) provided by NVIDIA CUDA (Compute Unified Device Architecture) to make it possible to accelerate large-scale queries on a single GPU, but we observe that the transfer performance between the host and UM, which occurs before kernel execution, is often significantly below the theoretical bandwidth. An important reason is that, in a single-GPU environment, data processing systems usually invoke only one thread or a static number of threads for data copying, leading to inefficient transfers that heavily slow down overall performance. In this paper, we present D-Cubicle, a runtime module that accelerates data transfer between host-managed memory and unified memory. D-Cubicle boosts the actual transfer speed dynamically through a self-adaptive approach. In our experiments, with data transfer taken into account, D-Cubicle processes 200 GB of data on a single GPU with 32 GB of global memory, achieving 1.43x the performance of the baseline system on average and 2.09x at maximum.
KW - GPU
KW - data analytics
KW - unified memory
UR - https://www.scopus.com/pages/publications/85143653869
U2 - 10.1007/s11704-022-2160-z
DO - 10.1007/s11704-022-2160-z
M3 - Article
AN - SCOPUS:85143653869
SN - 2095-2228
VL - 17
JO - Frontiers of Computer Science
JF - Frontiers of Computer Science
IS - 4
M1 - 174610
ER -