Feature fusion and inter-layer transmission: an improved object detection method based on Anchor DETR

doi:10.11996/JG.j.2095-302X.2024050968

Abstract

Object detection is a core task in computer vision that aims to accurately identify and localize objects of interest in images or videos. An improved object detection algorithm was proposed that incorporated feature fusion, optimized the inter-layer transmission of the encoder, and introduced a random jump retention method. These improvements addressed limitations of general Transformer models in object detection. Specifically, because computational cost restricts Transformer vision models to a single feature level, which limits how much object information they can perceive, a convolutional attention mechanism was used to fuse multi-scale features effectively, enhancing object recognition and localization. Optimizing the transfer between encoder layers allowed each layer to transmit and learn more information, reducing information loss between layers. In addition, to address the problem that predictions from intermediate decoder stages could outperform those of the final stage, a random jump retention method was designed to improve the model’s prediction accuracy and stability. Experimental results demonstrated that the improved method significantly enhanced object detection performance. On the COCO2017 dataset, the model’s AP reached 42.3%, and the AP for small objects improved by 2.2 percentage points; on the PASCAL VOC2007 dataset, the model’s AP improved by 1.4 percentage points, and the AP for small objects improved by 2.4 percentage points.

Key words: object detection, feature fusion, Transformer, attention mechanism, image processing

CLC Number: TP391

ZHANG Dongping, WEI Yangyue, HE Shuji, XU Yunchao, HU Haimiao, HUANG Wenjun. Feature fusion and inter-layer transmission: an improved object detection method based on Anchor DETR[J]. Journal of Graphics, 2024, 45(5): 968-978.

Model	Epochs	AP/%	AP₅₀/%	AP₇₅/%	AP_S/%	AP_M/%	AP_L/%	Params/M	GFLOPs
RetinaNet	36	38.7	58.0	41.5	23.3	42.3	50.3	38	205
Faster R-CNN	36	40.2	61.0	43.8	24.2	43.5	52.0	42	180
DETR	500	42.0	62.4	44.2	20.5	45.8	61.1	41	86
Conditional DETR	50	40.9	61.8	43.3	20.8	44.6	59.2	43	90
DAB-DETR	50	42.2	63.1	44.7	21.5	45.7	60.3	43	94
Anchor DETR	50	42.1	63.1	44.9	22.3	46.2	60.0	37	164
Improved algorithm	50	42.3	63.0	45.4	24.5	46.5	58.6	39	169
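
The gains quoted in the abstract can be checked directly against the rows above. The following is a minimal Python sketch, assuming that this table reports the COCO2017 results (its 42.3% AP matches the abstract) and that Anchor DETR is the baseline behind the small-object figure. The names `point_gains`, `ANCHOR_DETR`, and `IMPROVED` are illustrative and do not come from the paper.

```python
# Rows copied from the comparison table above (AP metrics in %).
# Assumption: Anchor DETR is the baseline the abstract compares against.
METRICS = ["AP", "AP50", "AP75", "AP_S", "AP_M", "AP_L"]
ANCHOR_DETR = {"ap": [42.1, 63.1, 44.9, 22.3, 46.2, 60.0], "params_m": 37, "gflops": 164}
IMPROVED = {"ap": [42.3, 63.0, 45.4, 24.5, 46.5, 58.6], "params_m": 39, "gflops": 169}


def point_gains(baseline, candidate):
    """Return per-metric differences in percentage points, rounded to 0.1."""
    return {
        name: round(new - old, 1)
        for name, old, new in zip(METRICS, baseline["ap"], candidate["ap"])
    }


coco_gains = point_gains(ANCHOR_DETR, IMPROVED)
extra_params_m = IMPROVED["params_m"] - ANCHOR_DETR["params_m"]
extra_gflops = IMPROVED["gflops"] - ANCHOR_DETR["gflops"]

for name, delta in coco_gains.items():
    print(f"{name}: {delta:+.1f} points")
print(f"extra parameters: {extra_params_m} M, extra GFLOPs: {extra_gflops}")
```

Under these assumptions the small-object gain is 2.2 points, which matches the abstract, while AP_L drops by 1.4 points and the cost rises by 2 M parameters and 5 GFLOPs.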

Model	Epochs	AP/%	AP₅₀/%	AP₇₅/%	AP_S/%	AP_M/%	AP_L/%	Params/M	GFLOPs
Conditional DETR-R50	75	35.3	62.3	34.8	7.3	21.2	45.4	43	87
DAB-DETR-R50	75	35.9	64.5	35.3	8.6	24.2	45.5	43	89
Sparse DETR-R50	75	38.3	64.8	39.4	10.7	27.6	47.1	40	171
Anchor DETR-R50	75	39.4	67.5	39.1	8.9	23.8	50.4	37	172
Improved algorithm-R50	75	40.8	69.6	42.2	11.3	28.8	51.7	39	177
Conditional DETR-R101	75	36.5	63.5	35.3	8.0	25.8	46.0	62	154
DAB-DETR-R101	75	39.3	66.9	41.0	10.5	27.8	49.1	62	155
Sparse DETR-R101	75	42.1	68.2	43.9	11.5	30.3	52.2	59	238
Anchor DETR-R101	75	42.7	70.9	44.1	9.9	30.3	53.8	56	238
Improved algorithm-R101	75	43.1	71.1	45.3	12.8	30.7	55.6	58	243
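
An "improvement of 1.4%" can be read either as an absolute difference in percentage points or as a relative change, and the two readings differ considerably. The sketch below computes both from the Anchor DETR and improved-algorithm rows above, assuming this table holds the PASCAL VOC2007 results mentioned in the abstract. The helper `compare` and the dictionary layout are illustrative only.

```python
# (AP, AP_S) in %, copied from the Anchor DETR and improved-algorithm rows above.
# Assumption: these rows are the PASCAL VOC2007 results cited in the abstract.
RESULTS = {
    "R50": {"anchor_detr": (39.4, 8.9), "improved": (40.8, 11.3)},
    "R101": {"anchor_detr": (42.7, 9.9), "improved": (43.1, 12.8)},
}


def compare(old, new):
    """Return (absolute gain in points, relative gain in %), rounded to 0.1."""
    absolute = new - old
    return round(absolute, 1), round(100.0 * absolute / old, 1)


summary = {}
for backbone, rows in RESULTS.items():
    (old_ap, old_aps), (new_ap, new_aps) = rows["anchor_detr"], rows["improved"]
    summary[backbone] = {
        "AP": compare(old_ap, new_ap),
        "AP_S": compare(old_aps, new_aps),
    }

for backbone, gains in summary.items():
    for metric, (points, percent) in gains.items():
        print(f"{backbone} {metric}: +{points} points ({percent}% relative)")
```

For the R50 rows the absolute gains are 1.4 points in AP and 2.4 points in AP_S, which agrees with the abstract. Read as relative changes, the same rows would give roughly 3.6% and 27%, so the abstract's figures are best understood as percentage points.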

Feature fusion	Encoder inter-layer transmission optimization	Random jump retention	AP/%	AP₅₀/%	AP₇₅/%	AP_S/%	AP_M/%	AP_L/%
-	-	-	39.4	67.5	39.1	8.9	23.8	50.4
√	-	-	39.9	68.4	40.7	10.6	26.5	50.7
-	√	-	40.1	69.1	41.1	11.1	26.2	50.7
-	-	√	40.3	69.2	41.8	10.2	27.0	51.1
√	√	-	40.4	69.0	40.8	11.2	27.4	50.6
-	√	√	40.4	68.7	41.6	11.0	27.6	51.1
√	-	√	40.5	69.4	42.0	10.8	27.6	51.4
√	√	√	40.8	69.6	42.2	11.3	28.8	51.7
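
One way to read the ablation rows is to compare the gain from each component alone with the gain from all three together. The following is a minimal Python sketch using only the AP column above. It assumes the first row (no component enabled) is the Anchor DETR baseline, since its values match the Anchor DETR-R50 row. The component labels and variable names are shorthand chosen here, not identifiers from the paper.

```python
# Ablation rows copied from the table above:
# (feature fusion, inter-layer optimization, random jump retention) -> AP in %.
ABLATION_AP = {
    (False, False, False): 39.4,
    (True, False, False): 39.9,
    (False, True, False): 40.1,
    (False, False, True): 40.3,
    (True, True, False): 40.4,
    (False, True, True): 40.4,
    (True, False, True): 40.5,
    (True, True, True): 40.8,
}
COMPONENTS = ["feature fusion", "inter-layer optimization", "random jump retention"]

baseline = ABLATION_AP[(False, False, False)]

# Gain from enabling exactly one component on top of the baseline.
single_gains = {}
for index, name in enumerate(COMPONENTS):
    key = tuple(i == index for i in range(len(COMPONENTS)))
    single_gains[name] = round(ABLATION_AP[key] - baseline, 1)

sum_of_singles = round(sum(single_gains.values()), 1)
combined_gain = round(ABLATION_AP[(True, True, True)] - baseline, 1)

print(single_gains)
print(f"sum of single gains: {sum_of_singles}, all three together: {combined_gain}")
```

The single-component gains (0.5, 0.7, and 0.9 points) sum to 2.1 points, while enabling all three yields 1.4 points, so in these numbers the components overlap rather than add independently.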