Semantic segmentation network fusing spatial criss-cross attention and channel attention

doi:10.11996/JG.j.2095-302X.2023030531

Abstract:

Existing semantic segmentation methods struggle to build contextual semantic associations effectively, and the semantic features they extract have limited representational power. To address both problems, this paper proposes a semantic segmentation network that fuses spatial criss-cross attention with channel attention. First, a spatial criss-cross attention module (SCCAM) aggregates the context information of each target pixel along the horizontal and vertical directions, which efficiently establishes non-local semantic dependencies between pixels. Second, a multi-head attention mechanism is introduced into the channel attention module (CAM) to mine semantically more salient channel features across multiple channel subspaces. On this basis, the attention features from the spatial and channel dimensions are fused to further strengthen the semantic representation of the features and improve segmentation accuracy. Experiments on the Cityscapes, PASCAL VOC2012, and CamVid datasets show that the proposed network achieves higher segmentation accuracy than other state-of-the-art semantic segmentation methods.

Key words: semantic segmentation, neural networks, attention mechanism, spatial attention, channel attention
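
The two attention mechanisms named in the abstract can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the abstract does not give the projection layers, the way channels are divided into subspaces, or the fusion rule. The sketch therefore assumes that raw features act as query, key, and value, that channels are split into contiguous equal groups (one per head), and that the two outputs are fused by element-wise sum. All function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def criss_cross_attention(feat):
    """feat[i][j] is the C-dimensional feature of pixel (i, j).

    Each pixel attends only to the H + W - 1 pixels in its own row and
    column. Raw features stand in for query, key, and value (assumption).
    """
    H, W, C = len(feat), len(feat[0]), len(feat[0][0])
    out = [[None] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            # Criss-cross path: the whole row, then the column minus the center.
            path = [(i, c) for c in range(W)] + [(r, j) for r in range(H) if r != i]
            weights = softmax([dot(feat[i][j], feat[r][c]) for r, c in path])
            out[i][j] = [
                sum(w * feat[r][c][k] for w, (r, c) in zip(weights, path))
                for k in range(C)
            ]
    return out

def multi_head_channel_attention(feat, heads):
    """Channel attention computed separately inside `heads` channel groups.

    Splitting the C channels into contiguous equal groups is an assumption.
    """
    H, W, C = len(feat), len(feat[0]), len(feat[0][0])
    if C % heads != 0:
        raise ValueError("channel count must be divisible by heads")
    size = C // heads
    # chans[k] is channel k flattened over all H * W pixels.
    chans = [[feat[i][j][k] for i in range(H) for j in range(W)] for k in range(C)]
    mixed = []
    for h in range(heads):
        group = chans[h * size:(h + 1) * size]
        for query in group:
            weights = softmax([dot(query, key) for key in group])
            mixed.append([
                sum(w * chan[n] for w, chan in zip(weights, group))
                for n in range(H * W)
            ])
    return [[[mixed[k][i * W + j] for k in range(C)] for j in range(W)] for i in range(H)]

def fuse(spatial, channel):
    """Element-wise sum of the two attention outputs (assumed fusion rule)."""
    return [
        [[s + c for s, c in zip(sp, cp)] for sp, cp in zip(srow, crow)]
        for srow, crow in zip(spatial, channel)
    ]

# Toy 2 x 3 feature map with 4 channels per pixel.
feat = [[[float(i + j + k) for k in range(4)] for j in range(3)] for i in range(2)]
fused = fuse(criss_cross_attention(feat), multi_head_channel_attention(feat, heads=2))
print(len(fused), len(fused[0]), len(fused[0][0]))  # 2 3 4
```

The point of the criss-cross path is cost: each pixel scores H + W - 1 neighbours instead of all H × W pixels, which is where the efficiency claimed for SCCAM would come from under these assumptions.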

CLC number:

TP391

WU Wen-huan, ZHANG Hao-kun. Semantic segmentation with fusion of spatial criss-cross and channel multi-head attention[J]. Journal of Graphics, 2023, 44(3): 531-539.

Figures/Tables: 10

References: 22

[1]	FENG D, HAASE-SCHÜTZ C, ROSENBAUM L, et al. Deep multi-modal object detection and semantic segmentation for autonomous driving: datasets, methods, and challenges[J]. IEEE Transactions on Intelligent Transportation Systems, 2021, 22(3): 1341-1360.
[2]	CHEN X, WILLIAMS B M, VALLABHANENI S R, et al. Learning active contour models for medical image segmentation[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 11632-11640.
[3]	ZHENG Z, ZHONG Y F, WANG J J, et al. Foreground-aware relation network for geospatial object segmentation in high spatial resolution remote sensing imagery[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 4096-4105.
[4]	LONG J, SHELHAMER E, DARRELL T. Fully convolutional networks for semantic segmentation[C]// 2015 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2015: 3431-3440.
[5]	CHEN L C, PAPANDREOU G, KOKKINOS I, et al. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(4): 834-848.
[6]	ZHAO H S, SHI J P, QI X J, et al. Pyramid scene parsing network[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 2881-2890.
[7]	RONNEBERGER O, FISCHER P, BROX T. U-Net: convolutional networks for biomedical image segmentation[C]// International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer, 2015: 234-241.
[8]	BADRINARAYANAN V, KENDALL A, CIPOLLA R. SegNet: a deep convolutional encoder-decoder architecture for image segmentation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(12): 2481-2495.
[9]	WANG X L, GIRSHICK R, GUPTA A, et al. Non-local neural networks[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 7794-7803.
[10]	YIN M H, YAO Z L, CAO Y, et al. Disentangled non-local neural networks[EB/OL]. [2022-09-08]. https://arxiv.org/pdf/2006.06668.pdf.
[11]	HUANG Z L, WANG X G, HUANG L C, et al. CCNet: criss-cross attention for semantic segmentation[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 603-612.
[12]	ZHANG H, DANA K, SHI J P, et al. Context encoding for semantic segmentation[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 7151-7160.
[13]	FU J, LIU J, TIAN H J, et al. Dual attention network for scene segmentation[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 3146-3154.
[14]	DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16×16 words: transformers for image recognition at scale[EB/OL]. [2022-08-22]. https://arxiv.org/abs/2010.11929.
[15]	STRUDEL R, GARCIA R, LAPTEV I, et al. Segmenter: transformer for semantic segmentation[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2021: 7262-7272.
[16]	ZHENG S X, LU J C, ZHAO H S, et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 6881-6890.
[17]	LIU Z, LIN Y T, CAO Y, et al. Swin transformer: hierarchical vision transformer using shifted windows[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2021: 10012-10022.
[18]	HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 770-778.
[19]	HE T, ZHANG Z, ZHANG H, et al. Bag of tricks for image classification with convolutional neural networks[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 558-567.
[20]	GUO M H, XU T X, LIU J J, et al. Attention mechanisms in computer vision: a survey[J]. Computational Visual Media, 2022, 8(3): 331-368.
[21]	MMSegmentation Contributors. OpenMMLab semantic segmentation toolbox and benchmark[EB/OL]. [2022-08-15]. https://github.com/open-mmlab/mmsegmentation.
[22]	YUAN Y H, HUANG L, GUO J Y, et al. OCNet: object context for semantic segmentation[J]. International Journal of Computer Vision, 2021, 129(8): 2375-2398.

Method	Backbone	SCCAM	CAM	mIoU (%)
Baseline	ResNet50	-	-	67.5
Ours	ResNet50	-	√	80.0
Ours	ResNet50	√	-	78.8
Ours	ResNet50	√	√	81.6
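
The contribution of each module in the ablation table above can be read off as a gain over the baseline. The short sketch below only redoes that subtraction on the table's own numbers; the dictionary labels are descriptive names chosen here, not terms from the paper.

```python
# mIoU (%) values copied from the ablation table above.
baseline = 67.5
ablation = {"CAM only": 80.0, "SCCAM only": 78.8, "SCCAM + CAM": 81.6}

# Gain over the baseline in percentage points, rounded to one decimal.
gains = {name: round(miou - baseline, 1) for name, miou in ablation.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain} points")
```

On these figures each module alone accounts for most of the improvement, and combining the two adds a further 1.6 points over the CAM-only variant.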

Method	Backbone	FPS	mIoU (%)
Baseline	ResNet50	0.95	67.5
EncNet^[12]	ResNet50	1.04	74.2
NLNet^[9]	ResNet50	0.82	77.0
SETR-MLA^[16]	ViT-L	0.23	77.3
DNLNet^[10]	ResNet50	0.81	78.6
OCNet^[22]	ResNet50	1.08	79.3
DANet^[13]	ResNet50	0.84	80.0
Ours	ResNet50	0.95	81.6

Method	mIoU	Road	Sidewalk	Building	Wall	Fence	Pole	Traffic Light	Traffic Sign	Vegetation
Baseline	67.5	97.8	83.0	92.0	34.4	58.7	66.0	73.6	80.2	92.1
EncNet^[12]	74.2	97.8	83.2	92.3	45.4	58.9	64.5	71.1	78.1	91.9
NLNet^[9]	77.0	98.0	84.7	93.0	58.1	61.1	65.8	73.3	79.4	92.4
SETR-MLA^[16]	77.3	98.2	85.3	92.2	63.7	64.4	53.0	63.3	73.4	91.8
DNLNet^[10]	78.6	98.2	85.4	93.2	61.0	62.5	66.3	72.8	79.9	92.6
OCNet^[22]	79.3	98.2	85.6	93.0	61.4	62.6	66.0	73.4	80.2	92.7
DANet^[13]	80.0	98.3	85.8	93.1	62.0	63.5	66.7	73.3	80.7	92.8
Ours	81.6	98.3	86.1	93.4	60.6	65.6	69.7	75.0	82.2	93.1
Method	Terrain	Sky	Person	Rider	Car	Truck	Bus	Train	Motorcycle	Bicycle
Baseline	59.1	94.6	82.4	62.4	91.6	17.3	33.4	35.0	51.3	77.9
EncNet^[12]	61.3	94.4	80.8	62.3	94.6	64.3	84.9	60.2	47.8	76.6
NLNet^[9]	62.7	94.7	82.7	62.1	95.4	68.0	85.7	76.9	51.7	78.2
SETR-MLA^[16]	65.8	94.2	78.4	58.6	94.4	82.0	89.5	81.4	65.2	73.6
DNLNet^[10]	64.7	95.1	83.3	65.4	95.6	73.9	85.2	69.7	69.5	79.0
OCNet^[22]	65.3	95.1	83.2	64.6	95.5	80.7	87.0	76.1	67.4	78.8
DANet^[13]	64.9	95.0	83.3	64.8	95.7	83.4	88.2	82.1	67.1	78.6
Ours	65.7	95.2	84.5	67.0	95.7	86.4	91.7	86.9	72.2	80.2
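
The mIoU column can be cross-checked against the per-class columns. The sketch below assumes the usual definition, mIoU as the unweighted mean of the 19 per-class IoU values, and uses only numbers copied from the two halves of the table above; `mean_iou` is a helper name introduced here for illustration.

```python
# Per-class IoU (%) copied from the two halves of the table above,
# in column order: Road ... Vegetation, then Terrain ... Bicycle.
per_class_iou = {
    "Baseline": [97.8, 83.0, 92.0, 34.4, 58.7, 66.0, 73.6, 80.2, 92.1,
                 59.1, 94.6, 82.4, 62.4, 91.6, 17.3, 33.4, 35.0, 51.3, 77.9],
    "Ours": [98.3, 86.1, 93.4, 60.6, 65.6, 69.7, 75.0, 82.2, 93.1,
             65.7, 95.2, 84.5, 67.0, 95.7, 86.4, 91.7, 86.9, 72.2, 80.2],
}

def mean_iou(ious):
    """Unweighted mean over classes (assumed definition of mIoU)."""
    return sum(ious) / len(ious)

for method, ious in per_class_iou.items():
    print(method, len(ious), round(mean_iou(ious), 1))
# Baseline 19 67.5
# Ours 19 81.6
```

Both rows reproduce the reported mIoU to one decimal place, so the two table halves are consistent with the summary column.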