TY - GEN
T1 - Open-Scene Understanding-oriented 3D Scene Graph Generation
AU - Hao, Yuansu
AU - Yu, Fei
AU - Wang, Yanhao
AU - Li, Yuehua
AU - Deng, Quan
AU - Yu, Yuan
AU - Huang, Chen
AU - Che, Nan
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - Understanding complex 3D environments is essential for many computer vision and robotic applications, especially in highly dynamic open-scene scenarios. The 3D scene graph plays an important role in the comprehension of 3D environments. However, most existing methods for 3D scene graph generation depend on pre-specified object and relationship classes (i.e., closed vocabulary) and labeled data for training, which restricts their effectiveness in the open-scene setting. To address this issue, we propose a novel Open-Scene Understanding-oriented 3D Scene Graph (OSU-3DSG) framework that can operate without labeled training data. The OSU-3DSG framework effectively extracts visual features from RGB-D image sequences and fuses them with camera pose estimates to create accurate 3D object maps. Then, by leveraging a pre-trained Vision Language Model (VLM), it generates relational triplets and constructs 3D scene graphs in a zero-shot manner. In particular, it excels at adaptively recognizing and interpreting object relationships, making it suitable for open-world applications. Finally, we perform extensive experiments on two open-world 3D datasets, namely 3DSSG and Replica, to evaluate the effectiveness and adaptability of the OSU-3DSG framework, demonstrating its potential to pave the way for the advancement of open-scene understanding. Our code and data are published at https://github.com/YuansuHao/OSU-3DSG.
AB - Understanding complex 3D environments is essential for many computer vision and robotic applications, especially in highly dynamic open-scene scenarios. The 3D scene graph plays an important role in the comprehension of 3D environments. However, most existing methods for 3D scene graph generation depend on pre-specified object and relationship classes (i.e., closed vocabulary) and labeled data for training, which restricts their effectiveness in the open-scene setting. To address this issue, we propose a novel Open-Scene Understanding-oriented 3D Scene Graph (OSU-3DSG) framework that can operate without labeled training data. The OSU-3DSG framework effectively extracts visual features from RGB-D image sequences and fuses them with camera pose estimates to create accurate 3D object maps. Then, by leveraging a pre-trained Vision Language Model (VLM), it generates relational triplets and constructs 3D scene graphs in a zero-shot manner. In particular, it excels at adaptively recognizing and interpreting object relationships, making it suitable for open-world applications. Finally, we perform extensive experiments on two open-world 3D datasets, namely 3DSSG and Replica, to evaluate the effectiveness and adaptability of the OSU-3DSG framework, demonstrating its potential to pave the way for the advancement of open-scene understanding. Our code and data are published at https://github.com/YuansuHao/OSU-3DSG.
KW - 3D scene graph generation
KW - open-scene understanding
KW - vision language model
KW - zero-shot learning
UR - https://www.scopus.com/pages/publications/105022632131
U2 - 10.1109/ICME59968.2025.11209525
DO - 10.1109/ICME59968.2025.11209525
M3 - Conference contribution
AN - SCOPUS:105022632131
T3 - Proceedings - IEEE International Conference on Multimedia and Expo
BT - 2025 IEEE International Conference on Multimedia and Expo
PB - IEEE Computer Society
T2 - 2025 IEEE International Conference on Multimedia and Expo, ICME 2025
Y2 - 30 June 2025 through 4 July 2025
ER -