TY - GEN
T1 - Scene Graph Generation using Depth-based Multimodal Network
AU - Chen, Lianggangxu
AU - Lu, Jiale
AU - Wang, Changbo
AU - He, Gaoqi
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - Scene graph generation (SGG) provides an efficient way for scene understanding. However, it has been plagued by inaccurate classification of relative spatial relationships and incorrect aggregation of feature information from distant objects. In this paper, we introduce the depth information of objects into SGG and propose a multimodal edge-featured graph attention network (MEGA-Net). MEGA-Net primarily comprises three modules. First, the edge-aware message passing (EMP) module extracts multimodal features and fuses them as edge features in the graph network via a quadrilinear model. The multimodal features consist of depth, visual, spatial, and linguistic features. The depth feature in EMP provides the relative spatial relationships among objects, which prevents tail spatial predicates from being recognized as head predicates. Second, we propose a depth-based self-supervised graph attention (DSGAT) module to predict the correlation probability between object pairs. By encoding the depth ranking of different object pairs in 2D images, DSGAT learns more accurate directional attention to avoid unrelated neighbors. Third, we introduce a predicate-aware loss (PA-Loss) to alleviate the feature redundancy caused by the extra depth information. This is achieved by introducing semantic frequency information that reflects the priority among different types of relationships. Systematic experiments show that our method achieves state-of-the-art performance on two popular datasets, VG and VRD.
AB - Scene graph generation (SGG) provides an efficient way for scene understanding. However, it has been plagued by inaccurate classification of relative spatial relationships and incorrect aggregation of feature information from distant objects. In this paper, we introduce the depth information of objects into SGG and propose a multimodal edge-featured graph attention network (MEGA-Net). MEGA-Net primarily comprises three modules. First, the edge-aware message passing (EMP) module extracts multimodal features and fuses them as edge features in the graph network via a quadrilinear model. The multimodal features consist of depth, visual, spatial, and linguistic features. The depth feature in EMP provides the relative spatial relationships among objects, which prevents tail spatial predicates from being recognized as head predicates. Second, we propose a depth-based self-supervised graph attention (DSGAT) module to predict the correlation probability between object pairs. By encoding the depth ranking of different object pairs in 2D images, DSGAT learns more accurate directional attention to avoid unrelated neighbors. Third, we introduce a predicate-aware loss (PA-Loss) to alleviate the feature redundancy caused by the extra depth information. This is achieved by introducing semantic frequency information that reflects the priority among different types of relationships. Systematic experiments show that our method achieves state-of-the-art performance on two popular datasets, VG and VRD.
KW - Depth Information
KW - Scene Graph Generation
KW - Self-Supervised Graph Attention Network
UR - https://www.scopus.com/pages/publications/85171173924
U2 - 10.1109/ICME55011.2023.00199
DO - 10.1109/ICME55011.2023.00199
M3 - Conference contribution
AN - SCOPUS:85171173924
T3 - Proceedings - IEEE International Conference on Multimedia and Expo
SP - 1139
EP - 1144
BT - Proceedings - 2023 IEEE International Conference on Multimedia and Expo, ICME 2023
PB - IEEE Computer Society
T2 - 2023 IEEE International Conference on Multimedia and Expo, ICME 2023
Y2 - 10 July 2023 through 14 July 2023
ER -