Journal of Graphics ›› 2023, Vol. 44 ›› Issue (3): 531-539.DOI: 10.11996/JG.j.2095-302X.2023030531
Received: 2022-10-05
Accepted: 2023-02-22
Online: 2023-06-30
Published: 2023-06-30
About the author: WU Wen-huan (1985-), associate professor, Ph.D. His main research interests include computer vision and image processing. E-mail: wuwenhuan5@163.com
WU Wen-huan, ZHANG Hao-kun. Semantic segmentation with fusion of spatial criss-cross and channel multi-head attention[J]. Journal of Graphics, 2023, 44(3): 531-539.
URL: http://www.txxb.com.cn/EN/10.11996/JG.j.2095-302X.2023030531
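The spatial criss-cross attention module (SCCAM) named in the title builds on CCNet-style criss-cross attention, in which each position attends only to the other positions in its own row and column rather than the full feature map. A minimal NumPy sketch of that mechanism, with the learned query/key/value projections of the real module omitted (the function name and these simplifications are ours, not the paper's):

```python
import numpy as np

def criss_cross_attention(x):
    """Criss-cross attention over a feature map x of shape (C, H, W).

    For each position (i, j), the attention keys/values are the features
    along the same column and the same row (the "criss-cross" path);
    a softmax over their similarities to the query weights the aggregation.
    Note: (i, j) itself appears once in the column and once in the row --
    a simplification kept here for brevity.
    """
    C, H, W = x.shape
    out = np.zeros_like(x, dtype=np.float64)
    for i in range(H):
        for j in range(W):
            q = x[:, i, j]                           # query vector at (i, j)
            col = x[:, :, j]                         # (C, H) column features
            row = x[:, i, :]                         # (C, W) row features
            kv = np.concatenate([col, row], axis=1)  # (C, H + W)
            logits = q @ kv                          # similarity per path position
            weights = np.exp(logits - logits.max())  # numerically stable softmax
            weights /= weights.sum()
            out[:, i, j] = kv @ weights              # weighted aggregation
    return out
```

Two stacked criss-cross passes let information propagate between any pair of positions, which is what makes the sparse path competitive with full non-local attention at lower cost.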
| Method | Backbone | SCCAM | CAM | mIoU (%) |
|---|---|---|---|---|
| Baseline | ResNet50 | - | - | 67.5 |
| Ours | ResNet50 | - | √ | 80.0 |
| Ours | ResNet50 | √ | - | 78.8 |
| Ours | ResNet50 | √ | √ | 81.6 |
Table 1 Ablation study on the Cityscapes validation set
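The mIoU figures reported throughout average the per-class intersection-over-union, IoU_c = TP_c / (TP_c + FP_c + FN_c), over the dataset's classes (19 for Cityscapes). A small NumPy sketch of that computation from flattened prediction and ground-truth label maps (the helper name is ours; details such as ignored labels follow the benchmark's own evaluation tooling):

```python
import numpy as np

def per_class_iou(pred, gt, num_classes):
    """Per-class IoU and mean IoU from flattened integer label maps.

    Builds the confusion matrix via bincount, then reads TP off the
    diagonal, FP from column sums, and FN from row sums.  Classes absent
    from both maps yield NaN and are excluded from the mean.
    """
    idx = num_classes * gt.astype(int) + pred.astype(int)
    conf = np.bincount(idx, minlength=num_classes**2)
    conf = conf.reshape(num_classes, num_classes)   # rows: gt, cols: pred
    tp = np.diag(conf)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    with np.errstate(invalid="ignore"):             # 0/0 -> NaN for empty classes
        iou = tp / (tp + fp + fn)
    return iou, np.nanmean(iou)
```

For example, with `gt = [0, 1, 1, 2]` and `pred = [0, 1, 2, 2]`, classes 1 and 2 each score 0.5 and class 0 scores 1.0, giving a mean IoU of 2/3.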
Fig. 4 Visualized comparison of segmentation results of SCCAM and CAM ((a), (e) Original images; (b), (f) Ground truth; (c), (g) Results with the SCCAM and CAM removed, respectively; (d), (h) Results with the SCCAM and CAM used together)
| Method | Backbone | FPS | mIoU (%) |
|---|---|---|---|
| Baseline | ResNet50 | 0.95 | 67.5 |
| EncNet[12] | ResNet50 | 1.04 | 74.2 |
| NLNet[9] | ResNet50 | 0.82 | 77.0 |
| SETR-MLA[16] | ViT-L | 0.23 | 77.3 |
| DNLNet[10] | ResNet50 | 0.81 | 78.6 |
| OCNet[22] | ResNet50 | 1.08 | 79.3 |
| DANet[13] | ResNet50 | 0.84 | 80.0 |
| Ours | ResNet50 | 0.95 | 81.6 |
Table 2 Results of different methods with the same experimental setup on the Cityscapes validation set
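The FPS column reports inference throughput; the exact protocol (input resolution, hardware) is specified in the paper's full text. A generic timing sketch of how such a figure can be obtained, using a stand-in identity "model" (all names here are ours):

```python
import time
import numpy as np

def measure_fps(model, inputs, warmup=2):
    """Average frames per second of `model` over `inputs`.

    A few warm-up calls are excluded from the timing; for GPU models the
    device must additionally be synchronized before each clock read.
    """
    for x in inputs[:warmup]:
        model(x)
    start = time.perf_counter()
    for x in inputs:
        model(x)
    elapsed = time.perf_counter() - start
    return len(inputs) / elapsed

# Stand-in model (identity function) on dummy 3-channel inputs:
fps = measure_fps(lambda x: x, [np.zeros((3, 64, 64))] * 10)
```

Averaging over a batch of inputs rather than a single forward pass smooths out scheduler jitter, which matters when comparing methods whose throughputs differ by fractions of a frame per second, as in Table 2.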
Fig. 5 Results of image segmentation ((a) Original image; (b) Ground truth; (c) Baseline; (d) EncNet[12]; (e) NLNet[9]; (f) SETR-MLA[16]; (g) DNLNet[10]; (h) OCNet[22]; (i) DANet[13]; (j) Ours)
| Method | mIoU | Road | Sidewalk | Building | Wall | Fence | Pole | Traffic Light | Traffic Sign | Vegetation |
|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | 67.5 | 97.8 | 83.0 | 92.0 | 34.4 | 58.7 | 66.0 | 73.6 | 80.2 | 92.1 |
| EncNet[12] | 74.2 | 97.8 | 83.2 | 92.3 | 45.4 | 58.9 | 64.5 | 71.1 | 78.1 | 91.9 |
| NLNet[9] | 77.0 | 98.0 | 84.7 | 93.0 | 58.1 | 61.1 | 65.8 | 73.3 | 79.4 | 92.4 |
| SETR-MLA[16] | 77.3 | 98.2 | 85.3 | 92.2 | 63.7 | 64.4 | 53.0 | 63.3 | 73.4 | 91.8 |
| DNLNet[10] | 78.6 | 98.2 | 85.4 | 93.2 | 61.0 | 62.5 | 66.3 | 72.8 | 79.9 | 92.6 |
| OCNet[22] | 79.3 | 98.2 | 85.6 | 93.0 | 61.4 | 62.6 | 66.0 | 73.4 | 80.2 | 92.7 |
| DANet[13] | 80.0 | 98.3 | 85.8 | 93.1 | 62.0 | 63.5 | 66.7 | 73.3 | 80.7 | 92.8 |
| Ours | 81.6 | 98.3 | 86.1 | 93.4 | 60.6 | 65.6 | 69.7 | 75.0 | 82.2 | 93.1 |

| Method | Terrain | Sky | Person | Rider | Car | Truck | Bus | Train | Motorcycle | Bicycle |
|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | 59.1 | 94.6 | 82.4 | 62.4 | 91.6 | 17.3 | 33.4 | 35.0 | 51.3 | 77.9 |
| EncNet[12] | 61.3 | 94.4 | 80.8 | 62.3 | 94.6 | 64.3 | 84.9 | 60.2 | 47.8 | 76.6 |
| NLNet[9] | 62.7 | 94.7 | 82.7 | 62.1 | 95.4 | 68.0 | 85.7 | 76.9 | 51.7 | 78.2 |
| SETR-MLA[16] | 65.8 | 94.2 | 78.4 | 58.6 | 94.4 | 82.0 | 89.5 | 81.4 | 65.2 | 73.6 |
| DNLNet[10] | 64.7 | 95.1 | 83.3 | 65.4 | 95.6 | 73.9 | 85.2 | 69.7 | 69.5 | 79.0 |
| OCNet[22] | 65.3 | 95.1 | 83.2 | 64.6 | 95.5 | 80.7 | 87.0 | 76.1 | 67.4 | 78.8 |
| DANet[13] | 64.9 | 95.0 | 83.3 | 64.8 | 95.7 | 83.4 | 88.2 | 82.1 | 67.1 | 78.6 |
| Ours | 65.7 | 95.2 | 84.5 | 67.0 | 95.7 | 86.4 | 91.7 | 86.9 | 72.2 | 80.2 |

Table 3 Results of different methods on the Cityscapes validation set for each category (%)
| Method | Backbone | mIoU (%) |
|---|---|---|
| Baseline | ResNet50 | 42.4 |
| EncNet[12] | ResNet50 | 42.7 |
| DANet[13] | ResNet50 | 42.8 |
| OCNet[22] | ResNet50 | 42.9 |
| DNLNet[10] | ResNet50 | 43.0 |
| NLNet[9] | ResNet50 | 43.1 |
| Ours | ResNet50 | 43.8 |
Table 4 Cross-validation results
| Method | Backbone | FPS | mIoU (%) |
|---|---|---|---|
| Baseline | ResNet50 | 13.89 | 52.8 |
| EncNet[12] | ResNet50 | 14.29 | 72.7 |
| OCNet[22] | ResNet50 | 15.10 | 73.3 |
| DNLNet[10] | ResNet50 | 12.40 | 73.7 |
| NLNet[9] | ResNet50 | 12.52 | 74.0 |
| DANet[13] | ResNet50 | 12.76 | 74.3 |
| SETR-MLA[16] | ViT-L | 3.56 | 79.7 |
| Ours | ResNet50 | 13.43 | 78.2 |
Table 5 Generalization performance test
| [1] | FENG D, HAASE-SCHÜTZ C, ROSENBAUM L, et al. Deep multi-modal object detection and semantic segmentation for autonomous driving: datasets, methods, and challenges[J]. IEEE Transactions on Intelligent Transportation Systems, 2021, 22(3): 1341-1360. |
| [2] | CHEN X, WILLIAMS B M, VALLABHANENI S R, et al. Learning active contour models for medical image segmentation[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 11632-11640. | 
| [3] | ZHENG Z, ZHONG Y F, WANG J J, et al. Foreground-aware relation network for geospatial object segmentation in high spatial resolution remote sensing imagery[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 4096-4105. | 
| [4] | LONG J, SHELHAMER E, DARRELL T. Fully convolutional networks for semantic segmentation[C]// 2015 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2015: 3431-3440. | 
| [5] | CHEN L C, PAPANDREOU G, KOKKINOS I, et al. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 40(4): 834-848. |
| [6] | ZHAO H S, SHI J P, QI X J, et al. Pyramid scene parsing network[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 2881-2890. | 
| [7] | RONNEBERGER O, FISCHER P, BROX T. U-Net: convolutional networks for biomedical image segmentation[C]// International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer, 2015: 234-241. |
| [8] | BADRINARAYANAN V, KENDALL A, CIPOLLA R. SegNet: a deep convolutional encoder-decoder architecture for image segmentation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(12): 2481-2495. |
| [9] | WANG X L, GIRSHICK R, GUPTA A, et al. Non-local neural networks[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 7794-7803. | 
| [10] | YIN M H, YAO Z L, CAO Y, et al. Disentangled non-local neural networks[EB/OL]. [2022-09-08]. https://arxiv.org/pdf/2006.06668.pdf. | 
| [11] | HUANG Z L, WANG X G, HUANG L C, et al. CCNet: criss-cross attention for semantic segmentation[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 603-612. | 
| [12] | ZHANG H, DANA K, SHI J P, et al. Context encoding for semantic segmentation[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 7151-7160. | 
| [13] | FU J, LIU J, TIAN H J, et al. Dual attention network for scene segmentation[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 3146-3154. | 
| [14] | DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16×16 words: transformers for image recognition at scale[EB/OL]. [2022-08-22]. https://arxiv.org/abs/2010.11929. | 
| [15] | STRUDEL R, GARCIA R, LAPTEV I, et al. Segmenter: transformer for semantic segmentation[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2021: 7262-7272. | 
| [16] | ZHENG S X, LU J C, ZHAO H S, et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 6881-6890. | 
| [17] | LIU Z, LIN Y T, CAO Y, et al. Swin transformer: hierarchical vision transformer using shifted windows[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2021: 10012-10022. | 
| [18] | HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 770-778. | 
| [19] | HE T, ZHANG Z, ZHANG H, et al. Bag of tricks for image classification with convolutional neural networks[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 558-567. | 
| [20] | GUO M H, XU T X, LIU J J, et al. Attention mechanisms in computer vision: a survey[J]. Computational Visual Media, 2022, 8(3): 331-368. |
| [21] | MMSegmentation Contributors. OpenMMLab semantic segmentation toolbox and benchmark[EB/OL]. [2022-08-15]. https://github.com/open-mmlab/mmsegmentation. |
| [22] | YUAN Y H, HUANG L, GUO J Y, et al. OCNet: object context for semantic segmentation[J]. International Journal of Computer Vision, 2021, 129(8): 2375-2398. |