Abstract
Big data processing systems are widely used in academia and industry to handle DNN-based inference workloads in fields such as video analytics. In these systems, multiple parallel inference tasks each load their own copy of the same read-only DNN model, so GPU memory is wasted and the GPU is underutilized, creating a bottleneck that limits inference performance. This paper presents a model sharing technique for a single GPU card that lets multiple DNN inference tasks share one copy of the model, together with an allocator that applies the technique to each GPU in a distributed environment. The method was implemented on a GPU-accelerated Spark platform to support large-scale inference workloads in a distributed data processing system. Tests on video analytics with the YOLO-v3 model show that model sharing reduces GPU memory overhead and improves system throughput by up to 136% compared with the same system without model sharing.
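To illustrate the core idea, the single-GPU model sharing can be sketched as a process-level cache keyed by model and GPU, so that concurrent inference tasks on one executor reuse a single read-only model instance rather than each loading its own. This is a minimal sketch under assumed conventions, not the paper's actual implementation; the names `SharedModelCache`, `load_yolov3`, and `model.detect` are hypothetical.

```python
import threading

class SharedModelCache:
    # Process-level cache: parallel inference tasks on the same GPU
    # reuse one read-only copy of the model instead of each task
    # loading its own, cutting GPU memory overhead.
    _lock = threading.Lock()
    _models = {}  # (model_path, gpu_id) -> loaded model

    @classmethod
    def get(cls, model_path, gpu_id, loader):
        key = (model_path, gpu_id)
        model = cls._models.get(key)
        if model is not None:
            return model  # fast path: model already resident on this GPU
        with cls._lock:
            # Double-checked locking: only the first task to arrive
            # pays the load cost; concurrent tasks wait and reuse it.
            model = cls._models.get(key)
            if model is None:
                model = loader(model_path, gpu_id)
                cls._models[key] = model
            return model

# Hypothetical use inside a Spark executor task: every task in the
# partition gets the same YOLO-v3 instance on whichever GPU the
# allocator assigned to this executor.
def infer_partition(frames, gpu_id, load_yolov3):
    model = SharedModelCache.get("yolov3.weights", gpu_id, load_yolov3)
    return [model.detect(frame) for frame in frames]
```

In a distributed deployment, the allocator described in the abstract would play the role of fixing `gpu_id` per executor, so that each GPU in the cluster hosts exactly one shared copy of the model.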
| Translated title of the contribution | Model sharing for GPU-accelerated DNN inference in big data processing systems |
|---|---|
| Original language | Chinese (Traditional) |
| Pages (from-to) | 1435-1441 |
| Number of pages | 7 |
| Journal | Qinghua Daxue Xuebao/Journal of Tsinghua University |
| Volume | 62 |
| Issue number | 9 |
| DOIs | |
| State | Published - 15 Sep 2022 |