TY - GEN
T1 - A Multimodal Trustworthy Joint Perception Prediction Model for Autonomous Driving
AU - Liu, Yixiao
AU - Zhang, Lei
AU - Xu, Qian
AU - Sun, Yan
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Vision-focused joint perception and prediction (PnP) marks an emerging trend in autonomous driving research: it predicts the future states of traffic participants in the surrounding environment directly from perceptual data. However, perception from a single vehicle is insufficient to obtain accurate environmental information; the fusion of perception data from different sources and modalities therefore becomes increasingly crucial for processing and predicting environmental data. To this end, this paper proposes a novel multimodal trustworthy fusion and prediction model. First, we introduce a pose-synchronized, multimodal Bird's-Eye View (BEV) encoder that projects raw image inputs from cameras of any modality, captured at any pose and time, into a shared, synchronized BEV space, thereby enhancing spatiotemporal synchronization. Second, we present a Trustworthy Spatial-Temporal Pyramid Transform (TSTPT), designed to comprehensively extract multiscale features from the BEV representation and forecast future BEV states by leveraging spatial priors. Comprehensive experiments on the KITTI and nuScenes datasets demonstrate that the proposed model is feasible overall and more reliable and safer than existing vision-based prediction methods.
AB - Vision-focused joint perception and prediction (PnP) marks an emerging trend in autonomous driving research: it predicts the future states of traffic participants in the surrounding environment directly from perceptual data. However, perception from a single vehicle is insufficient to obtain accurate environmental information; the fusion of perception data from different sources and modalities therefore becomes increasingly crucial for processing and predicting environmental data. To this end, this paper proposes a novel multimodal trustworthy fusion and prediction model. First, we introduce a pose-synchronized, multimodal Bird's-Eye View (BEV) encoder that projects raw image inputs from cameras of any modality, captured at any pose and time, into a shared, synchronized BEV space, thereby enhancing spatiotemporal synchronization. Second, we present a Trustworthy Spatial-Temporal Pyramid Transform (TSTPT), designed to comprehensively extract multiscale features from the BEV representation and forecast future BEV states by leveraging spatial priors. Comprehensive experiments on the KITTI and nuScenes datasets demonstrate that the proposed model is feasible overall and more reliable and safer than existing vision-based prediction methods.
KW - connected and automated vehicles
KW - machine-learning methods
KW - multimodal fusion
KW - trust management
UR - https://www.scopus.com/pages/publications/105009088148
U2 - 10.1109/ACAIT63902.2024.11021780
DO - 10.1109/ACAIT63902.2024.11021780
M3 - Conference contribution
AN - SCOPUS:105009088148
T3 - Proceedings of 2024 8th Asian Conference on Artificial Intelligence Technology, ACAIT 2024
SP - 133
EP - 138
BT - Proceedings of 2024 8th Asian Conference on Artificial Intelligence Technology, ACAIT 2024
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 8th Asian Conference on Artificial Intelligence Technology, ACAIT 2024
Y2 - 8 November 2024 through 10 November 2024
ER -