Semantic-Orthogonal Multi-modal Attention Network for RGB-D Salient Object Detection

Jiawei Xu, Qiangqiang Zhou, Jiacong Yu, Chen Liao, Dandan Zhu

Research output: Contribution to journal › Article › peer-review

Abstract

In recent years, RGB-D salient object detection has advanced rapidly, but existing methods still face challenges in feature extraction, cross-modal fusion, and multi-scale processing, which limit their performance in complex scenes. To address these challenges, we propose SOMANet (Semantic-Orthogonal Multi-modal Attention Network), a novel and efficient RGB-D salient object detection model built around three key innovations. First, inspired by the “local focus-global reasoning” dual-path mechanism of the human visual system, we introduce Dual-Stage Sparse Semantic Enhancement (DSSE), a semantic token sparsification method built on the Swin Transformer architecture. DSSE filters out redundant semantic tokens, improving generalization and focusing computation on the most informative semantics; it reduces FLOPs by over 33% without sacrificing accuracy relative to the original Swin Transformer backbone. Second, we propose the Orthogonal Multi-Modal Mutual Attention Fusion (O-MMAF) module, which combines mutual attention with orthogonal channel attention to exploit the complementary relationship between RGB and depth features, improving the accuracy and robustness of cross-modal fusion. Finally, inspired by the visual processing mechanisms of primates, we design the Multi-Scale Self-Calibrating Spatial Recursive Attention (MSRA) module, which extracts multi-scale information and refines it in a coarse-to-fine manner, mimicking the brain’s information processing to generate high-precision saliency predictions. Experimental results show that SOMANet outperforms 12 state-of-the-art models across four evaluation metrics on nine publicly available RGB-D datasets, demonstrating its effectiveness. Our code is published at https://github.com/jiaweiXu1029/SOMANet.
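To make the cross-modal fusion idea concrete, the following is a minimal PyTorch sketch of mutual attention between RGB and depth token features followed by a channel-attention gate, loosely following the O-MMAF description above. It is an illustrative reconstruction, not the authors' implementation: the module and parameter names (MutualAttentionFusion, dim, num_heads, reduction) are hypothetical, the channel gate is a squeeze-excite stand-in for the orthogonal channel attention, and the released code at the GitHub link above is authoritative.

```python
# Hypothetical sketch of cross-modal mutual-attention fusion with a channel gate.
# Not the authors' O-MMAF module; names and shapes are illustrative assumptions.
import torch
import torch.nn as nn


class MutualAttentionFusion(nn.Module):
    """Fuse RGB and depth token features via bidirectional cross-attention,
    then re-weight channels with a squeeze-excite style gate (a stand-in for
    the orthogonal channel attention described in the abstract)."""

    def __init__(self, dim: int, num_heads: int = 8, reduction: int = 4):
        super().__init__()
        # RGB queries attend to depth keys/values, and vice versa.
        self.rgb_to_depth = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.depth_to_rgb = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_rgb = nn.LayerNorm(dim)
        self.norm_depth = nn.LayerNorm(dim)
        # Channel gate over the concatenated fused features.
        self.channel_gate = nn.Sequential(
            nn.Linear(2 * dim, 2 * dim // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(2 * dim // reduction, 2 * dim),
            nn.Sigmoid(),
        )
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        # rgb, depth: (B, N, C) token sequences from the two backbone streams.
        rgb_enh, _ = self.rgb_to_depth(self.norm_rgb(rgb), depth, depth)
        depth_enh, _ = self.depth_to_rgb(self.norm_depth(depth), rgb, rgb)
        fused = torch.cat([rgb + rgb_enh, depth + depth_enh], dim=-1)   # (B, N, 2C)
        gate = self.channel_gate(fused.mean(dim=1, keepdim=True))       # (B, 1, 2C)
        return self.proj(fused * gate)                                  # (B, N, C)


if __name__ == "__main__":
    # Fuse per-stage features of matching resolution (e.g. 14x14 = 196 tokens).
    rgb_feat = torch.randn(2, 196, 256)
    depth_feat = torch.randn(2, 196, 256)
    fused = MutualAttentionFusion(dim=256)(rgb_feat, depth_feat)
    print(fused.shape)  # torch.Size([2, 196, 256])
```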

Original language: English
Pages (from-to): 6917-6929
Number of pages: 13
Journal: Visual Computer
Volume: 41
Issue number: 9
State: Published - Jul 2025

Keywords

  • Cross-modal fusion
  • Multi-scale features
  • Orthogonal multi-modal mutual attention
  • Saliency object detection
  • Self-calibrating spatial recursive attention
  • Semantic token sparsification
