TY - JOUR
T1 - LSRFormer: Efficient Transformer Supply Convolutional Neural Networks With Global Information for Aerial Image Segmentation
AU - Zhang, Renhe
AU - Zhang, Qian
AU - Zhang, Guixu
N1 - Publisher Copyright:
© 1980-2012 IEEE.
PY - 2024
Y1 - 2024
AB - Both local and global context information are essential for the semantic segmentation of aerial images. Convolutional neural networks (CNNs) capture local context well but cannot model global dependencies. Vision transformers (ViTs) are good at extracting global information but cannot retain spatial details well. To leverage the advantages of both paradigms, we integrate them into one model in this study. However, the global token interaction of ViTs incurs a high computational cost, which makes them difficult to apply to large aerial images. To address this problem, we propose a novel efficient ViT block named the long-short-range transformer (LSRFormer). Unlike mainstream ViTs designed as backbones, LSRFormer is a pretraining-free, plug-and-play module appended after CNN stages to supplement global information. It is composed of long-range self-attention (LR-SA), short-range self-attention (SR-SA), and a multiscale-convolutional feed-forward network (MSC-FFN). LR-SA establishes long-range dependencies at the junctions of windows, and SR-SA diffuses the long-range information from the window boundaries to their interiors. MSC-FFN captures multiscale information inside the ViT block. We append an LSRFormer block after each CNN stage of a pure convolutional network to build a model named ConvLSR-Net. Compared with existing models that combine CNNs and ViTs, our model can learn both local and global representations at all stages. In particular, ConvLSR-Net achieves state-of-the-art (SOTA) results on four challenging aerial image segmentation benchmarks: iSAID, LoveDA, ISPRS Potsdam, and Vaihingen. The code has been released at https://github.com/stdcoutzrh/ConvLSR-Net.
KW - Global dependencies
KW - remote sensing
KW - semantic segmentation
KW - vision transformer (ViT)
UR - https://www.scopus.com/pages/publications/85185386097
U2 - 10.1109/TGRS.2024.3366709
DO - 10.1109/TGRS.2024.3366709
M3 - Article
AN - SCOPUS:85185386097
SN - 0196-2892
VL - 62
SP - 1
EP - 13
JO - IEEE Transactions on Geoscience and Remote Sensing
JF - IEEE Transactions on Geoscience and Remote Sensing
M1 - 5610713
ER -