Small object detection algorithm in UAV image based on feature fusion and attention mechanism

doi:10.11996/JG.j.2095-302X.2023040658

Abstract

Detecting small objects in UAV aerial images is challenging because the objects are small and carry little feature information. To address this, a multi-head attention mechanism was incorporated into the YOLOv5 backbone network to integrate global feature information. As network depth increases, the model tends to emphasize high-level semantic information at the expense of the low-level texture details that are vital for detecting small objects. A shallow feature enhancement module was therefore devised to capture low-level feature information and strengthen small-object features. In addition, a multi-level feature fusion module was developed to combine feature information from different layers, enabling the network to dynamically adjust the weight of each output detection layer. Experimental results on the publicly available VisDrone2021 dataset showed that the proposed algorithm attained a mean average precision (mAP) of 45.7%, which is 3.1 percentage points higher than the baseline YOLOv5 algorithm. The proposed algorithm also achieved a detection speed of 41 frames per second on high-resolution images, meeting the requirement for real-time performance while improving detection accuracy over other prevalent methods.
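
The idea of letting the network adjust per-layer weights during fusion can be illustrated with a minimal sketch. It is not the paper's module: the function names, the softmax normalization, and the toy 2×2 feature maps are all assumptions chosen for illustration, and the raw weights stand in for scalars that a real network would learn.

```python
import math

def softmax(scores):
    """Normalize raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_levels(features, raw_weights):
    """Weighted sum of same-shaped 2-D feature maps (lists of rows).

    `features` holds one map per level, assumed already resized to a
    common shape; `raw_weights` stands in for learnable scalars.
    """
    weights = softmax(raw_weights)
    rows, cols = len(features[0]), len(features[0][0])
    fused = [[0.0] * cols for _ in range(rows)]
    for w, fmap in zip(weights, features):
        for r in range(rows):
            for c in range(cols):
                fused[r][c] += w * fmap[r][c]
    return fused, weights

# Toy 2x2 maps standing in for shallow, middle, and deep features (hypothetical values).
shallow = [[1.0, 2.0], [3.0, 4.0]]
middle = [[0.5, 0.5], [0.5, 0.5]]
deep = [[0.0, 1.0], [1.0, 0.0]]

fused, weights = fuse_levels([shallow, middle, deep], raw_weights=[0.0, 0.0, 0.0])
print(weights)
print(fused)
```

With equal raw weights every level contributes one third; raising the first raw weight would shift the fused map toward the shallow features, which is the kind of adjustment the abstract describes.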

Key words: feature fusion, attention mechanism, UAV aerial imagery, small object detection, YOLOv5

CLC Number:

TP391

LI Li-xia, WANG Xin, WANG Jun, ZHANG You-yuan. Small object detection algorithm in UAV image based on feature fusion and attention mechanism[J]. Journal of Graphics, 2023, 44(4): 658-666.

Figures/Tables 11

References 20

[1]	JIANG B, QU R K, LI Y D, et al. Object detection in UAV imagery based on deep learning: review[J]. Acta Aeronautica et Astronautica Sinica, 2021, 42(4): 524519.1-524519.15 (in Chinese).
[2]	ZHOU L W, PAN T X, YANG Z X, et al. FocusNet: coarse-to-fine small object detection network[J]. Journal of Graphics, 2020, 41(1): 93-99 (in Chinese).
[3]	REDMON J, FARHADI A. YOLOv3: an incremental improvement[EB/OL]. [2022-05-26]. https://arxiv.org/abs/1804.02767.
[4]	LIU W, ANGUELOV D, ERHAN D, et al. SSD: single shot MultiBox detector[C]// The 14th European Conference on Computer Vision. Cham: Springer International Publishing, 2016: 21-37.
[5]	GIRSHICK R. Fast R-CNN[C]// 2015 IEEE International Conference on Computer Vision. New York: IEEE Press, 2015: 1440-1448.
[6]	REN S Q, HE K M, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks[C]// International Conference on Neural Information Processing Systems. Cambridge: MIT Press, 2015: 91-99.
[7]	CAO J, CHOLAKKAL H, ANWER R M, et al. D2Det: towards high quality object detection and instance segmentation[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 11485-11494.
[8]	ZHAN W, SUN C F, WANG M C, et al. An improved Yolov5 real-time detection method for small objects captured by UAV[J]. Soft Computing, 2022, 26(1): 361-373.
[9]	LIM J S, ASTRID M, YOON H J, et al. Small object detection using context and attention[C]// 2021 International Conference on Artificial Intelligence in Information and Communication. New York: IEEE Press, 2021: 181-186.
[10]	SONG Z Y, ZHANG Y, LIU Y, et al. MSFYOLO: feature fusion-based detection for small objects[J]. IEEE Latin America Transactions, 2022, 20(5): 823-830.
[11]	LIU Y J, YANG F B, HU P. Small-object detection in UAV-captured images via multi-branch parallel feature pyramid networks[J]. IEEE Access, 2020, 8: 145740-145750.
[12]	HU J, GU J J, WANG Q H. Multimodal small target detection based on remote sensing image[J]. Journal of Graphics, 2022, 43(2): 197-204 (in Chinese).
[13]	LIN T Y, DOLLÁR P, GIRSHICK R, et al. Feature pyramid networks for object detection[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 2117-2125.
[14]	LI H C, XIONG P F, AN J, et al. Pyramid attention network for semantic segmentation[EB/OL]. [2022-05-26]. https://arxiv.org/abs/1805.10180.
[15]	VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]// The 31st International Conference on Neural Information Processing Systems. New York: ACM, 2017: 6000-6010.
[16]	PAN X R, GE C J, LU R, et al. On the integration of self-attention and convolution[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 815-825.
[17]	SRINIVAS A, LIN T Y, PARMAR N, et al. Bottleneck transformers for visual recognition[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 16514-16524.
[18]	CAO Y R, HE Z J, WANG L J, et al. VisDrone-DET2021: the vision meets drone object detection challenge results[C]// 2021 IEEE/CVF International Conference on Computer Vision Workshops. New York: IEEE Press, 2021: 2847-2854.
[19]	LI C L, YANG T, ZHU S J, et al. Density map guided object detection in aerial images[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. New York: IEEE Press, 2020: 737-746.
[20]	BOCHKOVSKIY A, WANG C Y, LIAO H Y M. YOLOv4: optimal speed and accuracy of object detection[EB/OL]. [2022-05-26]. https://arxiv.org/abs/2004.10934.

Model	Params (M)	Depth	Width	GFLOPs	mAP (%)	FPS¹⁵³⁶ (frames/s)
YOLOv5n	1.777	0.33	0.25	4.3	32.4	81
YOLOv5s	7.037	0.33	0.50	15.8	42.6	56
YOLOv5m	20.889	0.67	0.75	48.0	46.1	32
YOLOv5l	46.157	1.00	1.00	107.8	48.2	19

Model	BT-MHSA	SP	MF	P (%)	R (%)	mAP (%)	Params (M)	FPS¹⁵³⁶ (frames/s)
YOLOv5s	-	-	-	53.2	43.5	42.6	7.037	56
M1	√	-	-	54.0	44.2	43.5	6.719	58
M2	-	√	-	52.7	45.3	43.7	5.388	60
M3	-	-	√	53.1	45.0	43.5	8.174	47
M4	-	√	√	54.9	44.6	43.9	9.159	44
M5	√	√	-	52.1	45.3	43.6	7.061	54
M6	√	-	√	53.3	46.0	44.4	9.747	43
M7	√	√	√	55.6	46.5	45.7	9.832	41
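
The ablation rows can be compared against the YOLOv5s baseline with a few lines of arithmetic. The sketch below only re-uses the mAP, parameter, and FPS figures from the table; the `compare` helper and the relative-gain figure are additions for illustration and are not reported in the source.

```python
# (mAP %, parameters in millions, FPS) copied from the ablation table above.
ablation = {
    "YOLOv5s": (42.6, 7.037, 56),
    "M1": (43.5, 6.719, 58),
    "M2": (43.7, 5.388, 60),
    "M3": (43.5, 8.174, 47),
    "M4": (43.9, 9.159, 44),
    "M5": (43.6, 7.061, 54),
    "M6": (44.4, 9.747, 43),
    "M7": (45.7, 9.832, 41),
}

base_map, _, base_fps = ablation["YOLOv5s"]

def compare(name):
    """Return (absolute mAP gain in points, relative gain in %, FPS change)."""
    m, _, fps = ablation[name]
    gain = m - base_map
    return gain, 100 * gain / base_map, fps - base_fps

for name in ablation:
    gain, relative, fps_delta = compare(name)
    print(f"{name}: {gain:+.1f} points ({relative:+.1f}% relative), {fps_delta:+d} FPS")
```

For M7 this gives a gain of 3.1 percentage points (about 7.3% relative to the 42.6% baseline) at a cost of 15 frames per second, which matches the 45.7% mAP and 41 FPS quoted in the abstract.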

Algorithm	Input size	Object category										mAP (%)
		Awn-tr	Bicycle	Bus	Car	Motor	Pedestrian	People	Tricycle	Truck	Van	
Faster R-CNN	640×640	8.73	5.86	43.79	44.16	16.83	12.55	8.10	8.53	30.42	20.45	19.94
YOLOv3	640×640	7.71	6.80	39.36	68.87	21.53	22.54	12.50	8.41	26.41	24.31	23.84
CenterNet	640×640	14.28	7.51	42.66	61.96	18.86	22.94	11.67	13.08	24.74	19.38	23.71
DMNet [19]	640×640	14.11	8.89	49.23	58.90	29.38	27.67	18.93	20.32	29.30	30.27	28.70
YOLOv4 [20]	640×640	12.39	8.68	48.86	69.21	22.71	26.67	14.48	12.67	29.94	27.19	27.28
SSD	640×640	11.15	7.38	49.82	63.17	19.09	18.71	9.01	11.74	33.10	29.96	25.31
YOLOX	640×640	15.43	9.03	51.80	72.16	29.33	25.44	17.07	16.47	39.21	35.16	31.11
Proposed algorithm	640×640	18.20	11.90	57.60	74.80	28.50	32.50	18.80	17.60	39.00	35.60	33.45
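
The mAP column in this table appears to be the unweighted mean of the ten per-category values. The short check below, which assumes those values are per-class average precision in percent, recomputes the mean for three rows copied from the table; the dictionary keys and the `mean_ap` helper are illustrative names only.

```python
# Per-category values copied from the comparison table, in column order:
# Awn-tr, Bicycle, Bus, Car, Motor, Pedestrian, People, Tricycle, Truck, Van.
per_class_ap = {
    "Faster R-CNN": [8.73, 5.86, 43.79, 44.16, 16.83, 12.55, 8.10, 8.53, 30.42, 20.45],
    "YOLOX": [15.43, 9.03, 51.80, 72.16, 29.33, 25.44, 17.07, 16.47, 39.21, 35.16],
    "Proposed": [18.20, 11.90, 57.60, 74.80, 28.50, 32.50, 18.80, 17.60, 39.00, 35.60],
}

# mAP values as reported in the last column of the same rows.
reported_map = {"Faster R-CNN": 19.94, "YOLOX": 31.11, "Proposed": 33.45}

def mean_ap(aps):
    """Unweighted mean of per-class average precision values."""
    return sum(aps) / len(aps)

for name, aps in per_class_ap.items():
    computed = mean_ap(aps)
    print(f"{name}: computed {computed:.2f}, reported {reported_map[name]:.2f}")
```

For these three rows the recomputed mean agrees with the reported mAP to two decimal places.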