Journal of Graphics ›› 2023, Vol. 44 ›› Issue (3): 531-539.DOI: 10.11996/JG.j.2095-302X.2023030531
Received: 2022-10-05
Accepted: 2023-02-22
Online: 2023-06-30
Published: 2023-06-30
About the author: WU Wen-huan (1985-), associate professor, Ph.D. His main research interests include computer vision and image processing. E-mail: wuwenhuan5@163.com
WU Wen-huan, ZHANG Hao-kun. Semantic segmentation with fusion of spatial criss-cross and channel multi-head attention[J]. Journal of Graphics, 2023, 44(3): 531-539.
URL: http://www.txxb.com.cn/EN/10.11996/JG.j.2095-302X.2023030531
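The spatial criss-cross attention module (SCCAM) named in the title builds on CCNet-style criss-cross attention, in which each position attends only to the other positions in its own row and column rather than the full feature map. A minimal NumPy sketch of that mechanism, with the learned query/key/value projections of the real module omitted (the function name and these simplifications are ours, not the paper's):

```python
import numpy as np

def criss_cross_attention(x):
    """Criss-cross attention over a feature map x of shape (C, H, W).

    For each position (i, j), the attention keys/values are the features
    along the same column and the same row (the "criss-cross" path);
    a softmax over their similarities to the query weights the aggregation.
    Note: (i, j) itself appears once in the column and once in the row --
    a simplification kept here for brevity.
    """
    C, H, W = x.shape
    out = np.zeros_like(x, dtype=np.float64)
    for i in range(H):
        for j in range(W):
            q = x[:, i, j]                           # query vector at (i, j)
            col = x[:, :, j]                         # (C, H) column features
            row = x[:, i, :]                         # (C, W) row features
            kv = np.concatenate([col, row], axis=1)  # (C, H + W)
            logits = q @ kv                          # similarity per path position
            weights = np.exp(logits - logits.max())  # numerically stable softmax
            weights /= weights.sum()
            out[:, i, j] = kv @ weights              # weighted aggregation
    return out
```

Two stacked criss-cross passes let information propagate between any pair of positions, which is what makes the sparse path competitive with full non-local attention at lower cost.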
| Method | Backbone | SCCAM | CAM | mIoU (%) |
|---|---|---|---|---|
| Baseline | ResNet50 | - | - | 67.5 |
| Ours | ResNet50 | - | √ | 80.0 |
| Ours | ResNet50 | √ | - | 78.8 |
| Ours | ResNet50 | √ | √ | 81.6 |
Table 1 Ablation study on the Cityscapes validation set
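The mIoU figures reported throughout average the per-class intersection-over-union, IoU_c = TP_c / (TP_c + FP_c + FN_c), over the dataset's classes (19 for Cityscapes). A small NumPy sketch of that computation from flattened prediction and ground-truth label maps (the helper name is ours; details such as ignored labels follow the benchmark's own evaluation tooling):

```python
import numpy as np

def per_class_iou(pred, gt, num_classes):
    """Per-class IoU and mean IoU from flattened integer label maps.

    Builds the confusion matrix via bincount, then reads TP off the
    diagonal, FP from column sums, and FN from row sums.  Classes absent
    from both maps yield NaN and are excluded from the mean.
    """
    idx = num_classes * gt.astype(int) + pred.astype(int)
    conf = np.bincount(idx, minlength=num_classes**2)
    conf = conf.reshape(num_classes, num_classes)   # rows: gt, cols: pred
    tp = np.diag(conf)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    with np.errstate(invalid="ignore"):             # 0/0 -> NaN for empty classes
        iou = tp / (tp + fp + fn)
    return iou, np.nanmean(iou)
```

For example, with `gt = [0, 1, 1, 2]` and `pred = [0, 1, 2, 2]`, classes 1 and 2 each score 0.5 and class 0 scores 1.0, giving a mean IoU of 2/3.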
Fig. 4 Visualized comparison of segmentation results of SCCAM and CAM ((a), (e) Original images; (b), (f) Ground truth; (c), (g) Results with the SCCAM and CAM removed, respectively; (d), (h) Results with the SCCAM and CAM used together)
| Method | Backbone | FPS | mIoU (%) |
|---|---|---|---|
| Baseline | ResNet50 | 0.95 | 67.5 |
| EncNet[12] | ResNet50 | 1.04 | 74.2 |
| NLNet[9] | ResNet50 | 0.82 | 77.0 |
| SETR-MLA[16] | ViT-L | 0.23 | 77.3 |
| DNLNet[10] | ResNet50 | 0.81 | 78.6 |
| OCNet[22] | ResNet50 | 1.08 | 79.3 |
| DANet[13] | ResNet50 | 0.84 | 80.0 |
| Ours | ResNet50 | 0.95 | 81.6 |
Table 2 Results of different methods with the same experimental setup on the Cityscapes validation set
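The FPS column reports inference throughput; the exact protocol (input resolution, hardware) is specified in the paper's full text. A generic timing sketch of how such a figure can be obtained, using a stand-in identity "model" (all names here are ours):

```python
import time
import numpy as np

def measure_fps(model, inputs, warmup=2):
    """Average frames per second of `model` over `inputs`.

    A few warm-up calls are excluded from the timing; for GPU models the
    device must additionally be synchronized before each clock read.
    """
    for x in inputs[:warmup]:
        model(x)
    start = time.perf_counter()
    for x in inputs:
        model(x)
    elapsed = time.perf_counter() - start
    return len(inputs) / elapsed

# Stand-in model (identity function) on dummy 3-channel inputs:
fps = measure_fps(lambda x: x, [np.zeros((3, 64, 64))] * 10)
```

Averaging over a batch of inputs rather than a single forward pass smooths out scheduler jitter, which matters when comparing methods whose throughputs differ by fractions of a frame per second, as in Table 2.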
Fig. 5 Results of image segmentation ((a) Original image; (b) Ground truth; (c) Baseline; (d) EncNet[12]; (e) NLNet[9]; (f) SETR-MLA[16]; (g) DNLNet[10]; (h) OCNet[22]; (i) DANet[13]; (j) Ours)
| Method | mIoU | Road | Sidewalk | Building | Wall | Fence | Pole | Traffic Light | Traffic Sign | Vegetation |
|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | 67.5 | 97.8 | 83.0 | 92.0 | 34.4 | 58.7 | 66.0 | 73.6 | 80.2 | 92.1 |
| EncNet[12] | 74.2 | 97.8 | 83.2 | 92.3 | 45.4 | 58.9 | 64.5 | 71.1 | 78.1 | 91.9 |
| NLNet[9] | 77.0 | 98.0 | 84.7 | 93.0 | 58.1 | 61.1 | 65.8 | 73.3 | 79.4 | 92.4 |
| SETR-MLA[16] | 77.3 | 98.2 | 85.3 | 92.2 | 63.7 | 64.4 | 53.0 | 63.3 | 73.4 | 91.8 |
| DNLNet[10] | 78.6 | 98.2 | 85.4 | 93.2 | 61.0 | 62.5 | 66.3 | 72.8 | 79.9 | 92.6 |
| OCNet[22] | 79.3 | 98.2 | 85.6 | 93.0 | 61.4 | 62.6 | 66.0 | 73.4 | 80.2 | 92.7 |
| DANet[13] | 80.0 | 98.3 | 85.8 | 93.1 | 62.0 | 63.5 | 66.7 | 73.3 | 80.7 | 92.8 |
| Ours | 81.6 | 98.3 | 86.1 | 93.4 | 60.6 | 65.6 | 69.7 | 75.0 | 82.2 | 93.1 |

| Method | Terrain | Sky | Person | Rider | Car | Truck | Bus | Train | Motorcycle | Bicycle |
|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | 59.1 | 94.6 | 82.4 | 62.4 | 91.6 | 17.3 | 33.4 | 35.0 | 51.3 | 77.9 |
| EncNet[12] | 61.3 | 94.4 | 80.8 | 62.3 | 94.6 | 64.3 | 84.9 | 60.2 | 47.8 | 76.6 |
| NLNet[9] | 62.7 | 94.7 | 82.7 | 62.1 | 95.4 | 68.0 | 85.7 | 76.9 | 51.7 | 78.2 |
| SETR-MLA[16] | 65.8 | 94.2 | 78.4 | 58.6 | 94.4 | 82.0 | 89.5 | 81.4 | 65.2 | 73.6 |
| DNLNet[10] | 64.7 | 95.1 | 83.3 | 65.4 | 95.6 | 73.9 | 85.2 | 69.7 | 69.5 | 79.0 |
| OCNet[22] | 65.3 | 95.1 | 83.2 | 64.6 | 95.5 | 80.7 | 87.0 | 76.1 | 67.4 | 78.8 |
| DANet[13] | 64.9 | 95.0 | 83.3 | 64.8 | 95.7 | 83.4 | 88.2 | 82.1 | 67.1 | 78.6 |
| Ours | 65.7 | 95.2 | 84.5 | 67.0 | 95.7 | 86.4 | 91.7 | 86.9 | 72.2 | 80.2 |

Table 3 Results of different methods on the Cityscapes validation set for each category (%)
| Method | Backbone | mIoU (%) |
|---|---|---|
| Baseline | ResNet50 | 42.4 |
| EncNet[12] | ResNet50 | 42.7 |
| DANet[13] | ResNet50 | 42.8 |
| OCNet[22] | ResNet50 | 42.9 |
| DNLNet[10] | ResNet50 | 43.0 |
| NLNet[9] | ResNet50 | 43.1 |
| Ours | ResNet50 | 43.8 |
Table 4 Cross-validation results
| Method | Backbone | FPS | mIoU (%) |
|---|---|---|---|
| Baseline | ResNet50 | 13.89 | 52.8 |
| EncNet[12] | ResNet50 | 14.29 | 72.7 |
| OCNet[22] | ResNet50 | 15.10 | 73.3 |
| DNLNet[10] | ResNet50 | 12.40 | 73.7 |
| NLNet[9] | ResNet50 | 12.52 | 74.0 |
| DANet[13] | ResNet50 | 12.76 | 74.3 |
| SETR-MLA[16] | ViT-L | 3.56 | 79.7 |
| Ours | ResNet50 | 13.43 | 78.2 |
Table 5 Generalization performance test
| [1] | FENG D, HAASE-SCHÜTZ C, ROSENBAUM L, et al. Deep multi-modal object detection and semantic segmentation for autonomous driving: datasets, methods, and challenges[J]. IEEE Transactions on Intelligent Transportation Systems, 2021, 22(3): 1341-1360. |
| [2] | CHEN X, WILLIAMS B M, VALLABHANENI S R, et al. Learning active contour models for medical image segmentation[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 11632-11640. | 
| [3] | ZHENG Z, ZHONG Y F, WANG J J, et al. Foreground-aware relation network for geospatial object segmentation in high spatial resolution remote sensing imagery[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 4096-4105. | 
| [4] | LONG J, SHELHAMER E, DARRELL T. Fully convolutional networks for semantic segmentation[C]// 2015 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2015: 3431-3440. | 
| [5] | CHEN L C, PAPANDREOU G, KOKKINOS I, et al. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 40(4): 834-848. |
| [6] | ZHAO H S, SHI J P, QI X J, et al. Pyramid scene parsing network[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 2881-2890. | 
| [7] | RONNEBERGER O, FISCHER P, BROX T. U-Net: convolutional networks for biomedical image segmentation[C]// International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer, 2015: 234-241. |
| [8] | BADRINARAYANAN V, KENDALL A, CIPOLLA R. SegNet: a deep convolutional encoder-decoder architecture for image segmentation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(12): 2481-2495. |
| [9] | WANG X L, GIRSHICK R, GUPTA A, et al. Non-local neural networks[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 7794-7803. | 
| [10] | YIN M H, YAO Z L, CAO Y, et al. Disentangled non-local neural networks[EB/OL]. [2022-09-08]. https://arxiv.org/pdf/2006.06668.pdf. | 
| [11] | HUANG Z L, WANG X G, HUANG L C, et al. CCNet: criss-cross attention for semantic segmentation[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 603-612. | 
| [12] | ZHANG H, DANA K, SHI J P, et al. Context encoding for semantic segmentation[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 7151-7160. | 
| [13] | FU J, LIU J, TIAN H J, et al. Dual attention network for scene segmentation[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 3146-3154. | 
| [14] | DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16×16 words: transformers for image recognition at scale[EB/OL]. [2022-08-22]. https://arxiv.org/abs/2010.11929. | 
| [15] | STRUDEL R, GARCIA R, LAPTEV I, et al. Segmenter: transformer for semantic segmentation[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2021: 7262-7272. | 
| [16] | ZHENG S X, LU J C, ZHAO H S, et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 6881-6890. | 
| [17] | LIU Z, LIN Y T, CAO Y, et al. Swin transformer: hierarchical vision transformer using shifted windows[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2021: 10012-10022. | 
| [18] | HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 770-778. | 
| [19] | HE T, ZHANG Z, ZHANG H, et al. Bag of tricks for image classification with convolutional neural networks[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 558-567. | 
| [20] | GUO M H, XU T X, LIU J J, et al. Attention mechanisms in computer vision: a survey[J]. Computational Visual Media, 2022, 8(3): 331-368. |
| [21] | MMSegmentation Contributors. OpenMMLab semantic segmentation toolbox and benchmark[EB/OL]. [2022-08-15]. https://github.com/open-mmlab/mmsegmentation. |
| [22] | YUAN Y H, HUANG L, GUO J Y, et al. OCNet: object context for semantic segmentation[J]. International Journal of Computer Vision, 2021, 129(8): 2375-2398. |