3D Scene Graph Generation with Cross-Modal Alignment and Adversarial Learning

Yujun Hu, Xiaoyu Zhou, Changbo Wang, Weiliang Meng, Gaoqi He

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

Abstract

3D Scene Graph Generation (3DSGG) aims to model spatial and semantic relationships among objects for comprehensive scene understanding and reasoning. However, existing methods encounter two core challenges: (i) semantic-geometric misalignment across heterogeneous modalities (for example, textual descriptions often overlook curvature cues in point clouds) and (ii) long-tail distribution bias in relation prediction, which conflates distinct predicates because of sparse samples. To address these issues, we propose a novel 3DSGG framework that integrates textual, visual, and point-cloud data through three dedicated modules: (1) Cross-Modal Consistency Enhancement (CMCE), which aligns RGB-D and point-cloud embeddings via cosine similarity and non-linear mappings; (2) Relation Enhancement Generation (REGM), which rebalances tail relations using dynamic weighting and relation embeddings; and (3) Generation Quality Optimization (GQOM), which refines graph precision and robustness with a quality discriminator and a structural-consistency loss. Extensive quantitative experiments and systematic ablations demonstrate the proposed framework's superiority and robustness.
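
As a rough illustration of two ideas named in the abstract, the sketch below computes a cosine-similarity alignment loss between paired embeddings after a non-linear mapping, and derives per-predicate weights that favor rare relations. It is not the authors' implementation: the function names, the single tanh layer, and the static inverse-frequency weighting (swapped in here for the paper's unspecified "dynamic weighting") are all assumptions made for illustration.

```python
import math
from collections import Counter

def mlp(x, w, b):
    """Hypothetical one-layer non-linear mapping: tanh(W x + b)."""
    return [math.tanh(sum(wi * xi for wi, xi in zip(row, x)) + bi)
            for row, bi in zip(w, b)]

def cosine(u, v):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-12)

def alignment_loss(rgbd_emb, pc_emb):
    """Mean of (1 - cosine similarity) over paired object embeddings."""
    pairs = list(zip(rgbd_emb, pc_emb))
    return sum(1.0 - cosine(u, v) for u, v in pairs) / len(pairs)

def inverse_frequency_weights(labels):
    """Assumed stand-in for tail rebalancing: weight = N / (K * count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {rel: n / (k * c) for rel, c in counts.items()}

# Toy data: two objects, each with an RGB-D and a point-cloud embedding.
identity = [[1.0, 0.0], [0.0, 1.0]]
zero_bias = [0.0, 0.0]
rgbd = [mlp([1.0, 0.0], identity, zero_bias), mlp([0.0, 1.0], identity, zero_bias)]
pc = [mlp([1.0, 0.0], identity, zero_bias), mlp([1.0, 0.0], identity, zero_bias)]
loss = alignment_loss(rgbd, pc)  # first pair aligned, second pair orthogonal

# Toy predicate labels with one frequent and one rare relation.
weights = inverse_frequency_weights(["on"] * 8 + ["leaning against"] * 2)
```

In this toy setup the rare predicate receives a larger weight than the frequent one, which is the general effect a tail-rebalancing scheme aims for. How the paper's modules actually compute these quantities is not stated in the abstract.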

Original language: English
Title of host publication: ICMR 2025 - Proceedings of the 2025 International Conference on Multimedia Retrieval
Publisher: Association for Computing Machinery, Inc
Pages: 487-496
Number of pages: 10
ISBN (Electronic): 9798400718779
State: Published - 30 Jun 2025
Event: 2025 International Conference on Multimedia Retrieval, ICMR 2025 - Chicago, United States
Duration: 30 Jun 2025 - 3 Jul 2025

Publication series

Name: ICMR 2025 - Proceedings of the 2025 International Conference on Multimedia Retrieval

Conference

Conference: 2025 International Conference on Multimedia Retrieval, ICMR 2025
Country/Territory: United States
City: Chicago
Period: 30/06/25 - 3/07/25

Keywords

  • 3d point clouds
  • cross-modal alignment
  • scene graph generation
