Efficient CUDA stream management for multi-DNN real-time inference on embedded GPUs

  • Weiguang Pang
  • Xiantong Luo
  • Kailun Chen
  • Dong Ji*
  • Lei Qiao
  • Wang Yi

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

18 Scopus citations

Abstract

Deep Neural Networks (DNNs) are widely used in Cyber–Physical Systems (CPS), which often involve multiple DNN tasks with varying real-time requirements. These tasks must be deployed on a single embedded hardware platform with limited resources, such as an embedded GPU. Efficiently sharing one embedded GPU among multiple real-time DNN tasks is a complex challenge. Existing DNN frameworks (e.g., PyTorch and TensorFlow) focus on maximizing average performance and throughput on the GPU, but they lack scheduling mechanisms that account for multiple DNNs with different timing requirements. In this paper, we address this challenge by thoroughly examining and summarizing the scheduling rules for multiple kernels with different priorities in CUDA streams. Based on these rules, we design a framework that supports multi-DNN real-time inference and propose a method for allocating CUDA streams to DNN kernels that meets schedulability requirements while maximizing GPU resource utilization. Our proposed approach is implemented on an NVIDIA Jetson AGX Xavier embedded GPU system and validated using several popular DNNs. The results show that our approach achieves shorter response times than several state-of-the-art methods.
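The CUDA stream priority mechanism the abstract builds on can be illustrated with a minimal sketch (this is not the paper's framework, only the underlying runtime API it manages; the kernel and buffer names are placeholders for DNN layer work). Blocks from kernels in a higher-priority stream are dispatched ahead of pending blocks from lower-priority streams, which is what a priority-aware stream-allocation policy can exploit:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel standing in for one DNN layer's work.
__global__ void dummyLayer(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 0.5f + 1.0f;
}

int main() {
    // Query the priority range supported by this device.
    // In the CUDA runtime, numerically lower values mean higher priority.
    int leastPrio, greatestPrio;
    cudaDeviceGetStreamPriorityRange(&leastPrio, &greatestPrio);

    // One stream per timing class: a high-priority stream for the
    // latency-critical DNN and a low-priority stream for the rest.
    cudaStream_t hiStream, loStream;
    cudaStreamCreateWithPriority(&hiStream, cudaStreamNonBlocking, greatestPrio);
    cudaStreamCreateWithPriority(&loStream, cudaStreamNonBlocking, leastPrio);

    const int n = 1 << 20;
    float *hiBuf, *loBuf;
    cudaMalloc(&hiBuf, n * sizeof(float));
    cudaMalloc(&loBuf, n * sizeof(float));

    // Even though the low-priority kernel is enqueued first, pending
    // blocks from the high-priority stream are scheduled ahead of it.
    dummyLayer<<<(n + 255) / 256, 256, 0, loStream>>>(loBuf, n);
    dummyLayer<<<(n + 255) / 256, 256, 0, hiStream>>>(hiBuf, n);

    cudaDeviceSynchronize();

    cudaFree(hiBuf);
    cudaFree(loBuf);
    cudaStreamDestroy(hiStream);
    cudaStreamDestroy(loStream);
    printf("done\n");
    return 0;
}
```

Note that stream priorities influence the dispatch order of thread blocks but do not preempt blocks already running on the SMs, which is one reason kernel-to-stream allocation (as studied in the paper) matters for meeting deadlines.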

Original language: English
Article number: 102888
Journal: Journal of Systems Architecture
Volume: 139
DOIs
State: Published - Jun 2023
Externally published: Yes

Keywords

  • CUDA stream priority
  • DNN
  • GPU
  • Real-time scheduling

