
Journal of Graphics ›› 2024, Vol. 45 ›› Issue (5): 968-978. DOI: 10.11996/JG.j.2095-302X.2024050968

• Image Processing and Computer Vision •


Feature fusion and inter-layer transmission: an improved object detection method based on Anchor DETR

ZHANG Dongping1, WEI Yangyue1, HE Shuji1, XU Yunchao1, HU Haimiao2, HUANG Wenjun3

  1. College of Information Engineering, China Jiliang University, Hangzhou, Zhejiang 310018, China
    2. Hangzhou Innovation Institute, Beihang University, Hangzhou, Zhejiang 310051, China
    3. Supcon Technology Co., Ltd., Hangzhou, Zhejiang 310053, China
  • Received: 2024-07-02  Revised: 2024-07-12  Published: 2024-10-31  Online: 2024-10-31
  • First author: ZHANG Dongping (1970-), professor, Ph.D. His main research interests cover image processing and computer vision. E-mail: 06a0303103@cjlu.edu.cn
  • Supported by:
    Key Research and Development Program of Zhejiang Province (2024C01028); Key Research and Development Program of Zhejiang Province (2024C01108); Key Research and Development Program of Zhejiang Province (2022C01082); Key Research and Development Program of Zhejiang Province (2023C01032)


Abstract:

Object detection is a crucial task in computer vision, aiming to accurately identify and locate objects of interest in images or videos. An improved object detection algorithm was proposed that incorporates feature fusion, optimizes the inter-layer transmission of the encoder, and introduces a random jump retention method, addressing the limitations of general Transformer models in object detection tasks. Specifically, to counter the insufficient perception of object information caused by computational constraints that restrict Transformer vision models to a single feature scale, a convolutional attention mechanism was employed to achieve effective multi-scale feature fusion, thereby enhancing object recognition and localization. By optimizing the inter-layer transmission of the encoder, each encoder layer transmitted and learned more information, reducing information loss between layers. Additionally, to address the problem that predictions from intermediate decoder stages outperformed those of the final stage, a random jump retention method was designed, improving the model's prediction accuracy and stability. Experimental results demonstrated that the improved method significantly enhanced object detection performance: on the COCO2017 dataset, the model reached an AP of 42.3%, with the AP for small objects improved by 2.2%; on the PASCAL VOC2007 dataset, the AP improved by 1.4%, and the AP for small objects improved by 2.4%.
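The abstract names a convolutional attention mechanism for multi-scale feature fusion but gives no implementation details. As a rough illustration only, the sketch below shows a squeeze-and-excitation-style channel gate fusing two feature scales in NumPy; the function names (`channel_attention`, `fuse_scales`), the nearest-neighbour upsampling, the additive fusion, and the random weights standing in for trained parameters are all assumptions, not the paper's actual design.

```python
import numpy as np

def channel_attention(feat, reduction=4):
    """Channel attention: reweight each channel by a gate computed
    from its globally average-pooled response (SE-style)."""
    c = feat.shape[0]
    # Global average pooling over the spatial dims -> (c,)
    pooled = feat.mean(axis=(1, 2))
    # Tiny two-layer bottleneck standing in for learned 1x1 convolutions;
    # fixed random weights are placeholders for trained parameters.
    rng = np.random.default_rng(0)
    w1 = rng.standard_normal((c // reduction, c)) * 0.1
    w2 = rng.standard_normal((c, c // reduction)) * 0.1
    hidden = np.maximum(w1 @ pooled, 0.0)          # ReLU
    gate = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))    # sigmoid -> (c,)
    return feat * gate[:, None, None]

def fuse_scales(fine, coarse):
    """Upsample the coarse map to the fine resolution (nearest
    neighbour), gate both maps, and fuse by element-wise addition."""
    scale = fine.shape[1] // coarse.shape[1]
    up = coarse.repeat(scale, axis=1).repeat(scale, axis=2)
    return channel_attention(fine) + channel_attention(up)

# Two feature maps with the same channel width at different scales,
# e.g. adjacent pyramid levels from a backbone.
p3 = np.ones((8, 32, 32), dtype=np.float32)   # fine scale
p4 = np.ones((8, 16, 16), dtype=np.float32)   # coarse scale
fused = fuse_scales(p3, p4)
print(fused.shape)  # (8, 32, 32)
```

The fused map keeps the fine scale's resolution while carrying gated contributions from both levels, which is the general intent of attention-weighted multi-scale fusion; the paper's actual operator may differ substantially.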

Key words: object detection, feature fusion, Transformer, attention mechanism, image processing

CLC number: