SegCFT: Context-aware Fourier Transform for efficient semantic segmentation

  • Yinqi Zhang
  • Lingfu Jiang
  • Fuhai Chen
  • Jiao Xie*
  • Baochang Zhang
  • Gaoqi He
  • Shaohui Lin

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

5 Scopus citations

Abstract

Semantic segmentation is one of the most critical tasks in computer vision. Recent works mainly focus on improving segmentation performance by designing high-capacity transformer architectures. They try to address the high data consumption and computing costs of training and deploying models in the cloud, but the high computational overhead still makes these models difficult to apply directly to resource-limited devices. In this paper, we propose SegCFT, a novel Fast Fourier Transform (FFT) based Context-aware Feature Mixer under a transformer-like architecture for precise and efficient semantic segmentation. Unlike self-attention-based transformers, SegCFT uses a Hierarchical Fourier Transform (HFT) to reduce computational cost via non-parametric calculation and to improve segmentation performance by fusing channel-wise and pixel-wise contexts. To integrate features from the frequency domain of the discrete Fourier transform (DFT) into the spatial domain of the transformer-like architecture, an Adaptive Modulation Unit (AMU) is designed to modulate the frequency-domain features and ensure consistency between the frequency and spatial domains. Experimental evaluation on two semantic segmentation benchmarks, ADE20K and Cityscapes, shows that SegCFT achieves competitive segmentation performance with lower training and inference costs than previous methods.
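
To make the idea of non-parametric Fourier mixing concrete, the following is a minimal sketch of a generic parameter-free mixer, not the authors' HFT or AMU, whose details are not given in this abstract. It assumes a feature map flattened to a pixels-by-channels matrix, applies a discrete Fourier transform along the channel axis and then along the pixel axis, and keeps the real part. The names `dft` and `fourier_mix` are hypothetical, and the naive O(n²) transform stands in for an FFT routine purely to keep the example self-contained.

```python
import cmath


def dft(values):
    """Naive O(n^2) discrete Fourier transform of a 1D sequence."""
    n = len(values)
    return [
        sum(v * cmath.exp(-2j * cmath.pi * k * t / n) for t, v in enumerate(values))
        for k in range(n)
    ]


def fourier_mix(tokens):
    """Mix a (num_pixels x num_channels) matrix without learned weights.

    Hypothetical helper for illustration only: a DFT over channels,
    then a DFT over pixels, keeping the real part of the result.
    """
    # Channel-wise mixing: transform each pixel's channel vector.
    rows = [dft(row) for row in tokens]
    # Pixel-wise mixing: transform each channel across all pixels.
    cols = [dft(list(col)) for col in zip(*rows)]
    # Transpose back to (pixels x channels) and drop the imaginary part.
    return [[value.real for value in row] for row in zip(*cols)]


# Toy input: 3 pixels with 2 channels each (made-up values).
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
mixed = fourier_mix(features)
```

The output has the same shape as the input, and every entry depends on every pixel and every channel, which is the sense in which such a mixer captures global context without any trainable parameters.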

Original language: English
Article number: 127946
Journal: Neurocomputing
Volume: 596
State: Published - 1 Sep 2024

Keywords

  • Efficient network architecture
  • Fast Fourier Transform
  • Semantic segmentation
  • Transformer
