TY - GEN
T1 - 3D Scene Graph Generation with Cross-Modal Alignment and Adversarial Learning
AU - Hu, Yujun
AU - Zhou, Xiaoyu
AU - Wang, Changbo
AU - Meng, Weiliang
AU - He, Gaoqi
N1 - Publisher Copyright:
© 2025 ACM.
PY - 2025/6/30
Y1 - 2025/6/30
N2 - 3D Scene Graph Generation (3DSGG) aims to model spatial and semantic relationships among objects for comprehensive scene understanding and reasoning. However, existing methods face two core challenges: (i) semantic-geometric misalignment across heterogeneous modalities, where textual descriptions often overlook curvature cues in point clouds, and (ii) long-tail distribution bias in relation prediction, which conflates distinct predicates due to sparse samples. To address these issues, we propose a novel 3DSGG framework that integrates textual, visual, and point-cloud data through three dedicated modules: (1) Cross-Modal Consistency Enhancement (CMCE), which aligns RGB-D and point-cloud embeddings via cosine similarity and non-linear mappings; (2) Relation Enhancement Generation (REGM), which rebalances tail relations using dynamic weighting and relation embeddings; and (3) Generation Quality Optimization (GQOM), which refines graph precision and robustness with a quality discriminator and a structural-consistency loss. Extensive quantitative experiments and systematic ablations demonstrate the proposed framework's superiority and robustness.
AB - 3D Scene Graph Generation (3DSGG) aims to model spatial and semantic relationships among objects for comprehensive scene understanding and reasoning. However, existing methods face two core challenges: (i) semantic-geometric misalignment across heterogeneous modalities, where textual descriptions often overlook curvature cues in point clouds, and (ii) long-tail distribution bias in relation prediction, which conflates distinct predicates due to sparse samples. To address these issues, we propose a novel 3DSGG framework that integrates textual, visual, and point-cloud data through three dedicated modules: (1) Cross-Modal Consistency Enhancement (CMCE), which aligns RGB-D and point-cloud embeddings via cosine similarity and non-linear mappings; (2) Relation Enhancement Generation (REGM), which rebalances tail relations using dynamic weighting and relation embeddings; and (3) Generation Quality Optimization (GQOM), which refines graph precision and robustness with a quality discriminator and a structural-consistency loss. Extensive quantitative experiments and systematic ablations demonstrate the proposed framework's superiority and robustness.
KW - 3D point clouds
KW - cross-modal alignment
KW - scene graph generation
UR - https://www.scopus.com/pages/publications/105011592686
U2 - 10.1145/3731715.3733257
DO - 10.1145/3731715.3733257
M3 - Conference contribution
AN - SCOPUS:105011592686
T3 - ICMR 2025 - Proceedings of the 2025 International Conference on Multimedia Retrieval
SP - 487
EP - 496
BT - ICMR 2025 - Proceedings of the 2025 International Conference on Multimedia Retrieval
PB - Association for Computing Machinery, Inc
T2 - 2025 International Conference on Multimedia Retrieval, ICMR 2025
Y2 - 30 June 2025 through 3 July 2025
ER -