LSRFormer: Efficient Transformer Supply Convolutional Neural Networks With Global Information for Aerial Image Segmentation

Renhe Zhang, Qian Zhang, Guixu Zhang

Research output: Contribution to journal › Article › peer-review

28 Scopus citations

Abstract

Both local context information and global context information are essential for the semantic segmentation of aerial images. Convolutional neural networks (CNNs) can capture local context information well but cannot model global dependencies. Vision transformers (ViTs) are good at extracting global information but cannot retain spatial details well. To leverage the advantages of these two paradigms, we integrate them into one model in this study. However, the global token interaction of ViTs brings high computational cost, which makes them difficult to apply to large-sized aerial images. To handle this problem, we propose a novel efficient ViT block named long-short-range transformer (LSRFormer). Unlike mainstream ViTs designed as backbones, LSRFormer is a pretraining-free, plug-and-play module appended after CNN stages to supplement global information. It is composed of long-range self-attention (LR-SA), short-range self-attention (SR-SA), and a multiscale-convolutional feed-forward network (MSC-FFN). LR-SA establishes long-range dependencies at the junctions of the windows, and SR-SA diffuses the long-range information from the window boundaries to their interiors. MSC-FFN captures multiscale information inside the ViT block. We append the LSRFormer block after each CNN stage of a pure convolutional network to build a model named ConvLSR-Net. Compared with existing models that combine CNNs and ViTs, our model can learn both local and global representations at all stages. In particular, ConvLSR-Net achieves state-of-the-art (SOTA) results on four challenging aerial image segmentation benchmarks: iSAID, LoveDA, ISPRS Potsdam, and Vaihingen. The code has been released at https://github.com/stdcoutzrh/ConvLSR-Net.
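The overall design described in the abstract — a plug-and-play global-attention block appended after every CNN stage — can be sketched in a few lines. The sketch below is illustrative only, assuming stand-in components: a 3x3 averaging filter in place of a real CNN stage, and plain global self-attention in place of the paper's windowed LR-SA/SR-SA and MSC-FFN (which exist precisely to avoid the naive O(N²) cost shown here). All function names are hypothetical, not taken from the released code.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def conv_stage(feat):
    """Stand-in for a CNN stage: per-channel 3x3 averaging (local context only)."""
    H, W, _ = feat.shape
    pad = np.pad(feat, ((1, 1), (1, 1), (0, 0)), mode="edge")
    out = np.empty_like(feat)
    for i in range(H):
        for j in range(W):
            out[i, j] = pad[i:i + 3, j:j + 3].mean(axis=(0, 1))
    return out

def global_attention_block(feat):
    """Stand-in for an LSRFormer block: single-head self-attention over all
    spatial tokens, with a residual connection. The paper's LR-SA/SR-SA split
    makes this interaction cheaper; here we use the naive form for clarity."""
    H, W, C = feat.shape
    tokens = feat.reshape(H * W, C)
    attn = softmax(tokens @ tokens.T / np.sqrt(C))   # (HW, HW) token affinities
    mixed = attn @ tokens                            # globally mixed features
    return (tokens + mixed).reshape(H, W, C)

def convlsr_net(feat, num_stages=4):
    """Append a global block after every CNN stage, as ConvLSR-Net does."""
    for _ in range(num_stages):
        feat = conv_stage(feat)               # local context from convolution
        feat = global_attention_block(feat)   # supplement global context
    return feat

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 16))
y = convlsr_net(x)
print(y.shape)  # (8, 8, 16): spatial resolution and channels are preserved
```

The key point the sketch captures is that global mixing happens at *every* stage, rather than only in a transformer tail bolted onto the final CNN features.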

Original language: English
Article number: 5610713
Pages (from-to): 1-13
Number of pages: 13
Journal: IEEE Transactions on Geoscience and Remote Sensing
Volume: 62
DOIs
State: Published - 2024

Keywords

  • Global dependencies
  • remote sensing
  • semantic segmentation
  • vision transformer (ViT)
