TY - GEN
T1 - Cross-Stage Class-Specific Attention for Image Semantic Segmentation
AU - Shi, Zhengyi
AU - Sun, Li
AU - Li, Qingli
N1 - Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022.
PY - 2022
Y1 - 2022
N2 - Recent backbones built on transformers capture context within a significantly larger area than CNNs and greatly improve performance on semantic segmentation. However, the fact that the decoder utilizes features from different stages in the shallow layers indicates that local context is still important. Instead of simply incorporating features from different stages, we propose a cross-stage class-specific attention mainly for transformer-based backbones. Specifically, given a coarse prediction, we first employ the final-stage features to aggregate a class center over the whole image. Then high-resolution features from the earlier stage are used as queries to absorb the semantics from the class centers. To eliminate irrelevant classes within a local area, we build the context for each query position according to the classification score from the coarse prediction and remove the redundant classes. Thus only relevant classes provide keys and values in attention and participate in the value routing. We validate the proposed scheme on different datasets including ADE20K, Pascal Context and COCO-Stuff, showing that the proposed model improves performance compared with other works.
AB - Recent backbones built on transformers capture context within a significantly larger area than CNNs and greatly improve performance on semantic segmentation. However, the fact that the decoder utilizes features from different stages in the shallow layers indicates that local context is still important. Instead of simply incorporating features from different stages, we propose a cross-stage class-specific attention mainly for transformer-based backbones. Specifically, given a coarse prediction, we first employ the final-stage features to aggregate a class center over the whole image. Then high-resolution features from the earlier stage are used as queries to absorb the semantics from the class centers. To eliminate irrelevant classes within a local area, we build the context for each query position according to the classification score from the coarse prediction and remove the redundant classes. Thus only relevant classes provide keys and values in attention and participate in the value routing. We validate the proposed scheme on different datasets including ADE20K, Pascal Context and COCO-Stuff, showing that the proposed model improves performance compared with other works.
KW - Attention algorithm
KW - Semantic segmentation
KW - Vision transformer
UR - https://www.scopus.com/pages/publications/85142772433
U2 - 10.1007/978-3-031-18916-6_45
DO - 10.1007/978-3-031-18916-6_45
M3 - Conference contribution
AN - SCOPUS:85142772433
SN - 9783031189159
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 558
EP - 573
BT - Pattern Recognition and Computer Vision - 5th Chinese Conference, PRCV 2022, Proceedings
A2 - Yu, Shiqi
A2 - Zhang, Jianguo
A2 - Zhang, Zhaoxiang
A2 - Tan, Tieniu
A2 - Yuen, Pong C.
A2 - Guo, Yike
A2 - Han, Junwei
A2 - Lai, Jianhuang
PB - Springer Science and Business Media Deutschland GmbH
T2 - 5th Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2022
Y2 - 4 November 2022 through 7 November 2022
ER -