[1] KRIZHEVSKY A, SUTSKEVER I, HINTON G E. ImageNet classification with deep convolutional neural networks[J]. Communications of the ACM, 2017, 60(6): 84-90.
[2] SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[EB/OL]. [2023-01-10]. https://arxiv.org/abs/1409.1556.pdf.
[3] SZEGEDY C, LIU W, JIA Y Q, et al. Going deeper with convolutions[C]// 2015 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2015: 1-9.
[4] HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 770-778.
[5] LIU W, ANGUELOV D, ERHAN D, et al. SSD: single shot MultiBox detector[C]// European Conference on Computer Vision. Cham: Springer, 2016: 21-37.
[6] REDMON J, DIVVALA S, GIRSHICK R, et al. You only look once: unified, real-time object detection[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 779-788.
[7] HUANG G, LIU Z, VAN DER MAATEN L, et al. Densely connected convolutional networks[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 2261-2269.
[8] REDMON J, FARHADI A. YOLOv3: an incremental improvement[EB/OL]. [2023-03-10]. https://arxiv.org/abs/1804.02767.pdf.
[9] BOCHKOVSKIY A, WANG C Y, LIAO H Y M. YOLOv4: optimal speed and accuracy of object detection[EB/OL]. [2023-03-10]. https://arxiv.org/abs/2004.10934.pdf.
[10] GLENN R J. YOLOv5[EB/OL]. [2023-03-10]. https://github.com/ultralytics/yolov5.
[11] REN S Q, HE K M, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149.
[12] HU J, SHEN L, SUN G. Squeeze-and-excitation networks[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 7132-7141.
[13] JADERBERG M, SIMONYAN K, ZISSERMAN A, et al. Spatial transformer networks[EB/OL]. [2023-03-10]. https://arxiv.org/abs/1506.02025.pdf.
[14] WOO S, PARK J, LEE J Y, et al. CBAM: convolutional block attention module[C]// European Conference on Computer Vision. Cham: Springer, 2018: 3-19.
[15] DAI J F, QI H Z, XIONG Y W, et al. Deformable convolutional networks[C]// 2017 IEEE International Conference on Computer Vision. New York: IEEE Press, 2017: 764-773.
[16] ZHU L, WANG X J, KE Z H, et al. BiFormer: vision transformer with Bi-level routing attention[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 10323-10333.
[17] YU F, CHEN H F, WANG X, et al. BDD100K: a diverse driving dataset for heterogeneous multitask learning[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 2636-2645.
[18] KLEIN I. NEXET-the largest and most diverse road dataset in the world[EB/OL]. [2023-03-10]. https://www.kaggle.com/datasets/solesensei/nexet-original.
[19] TIAN Z, SHEN C H, CHEN H, et al. FCOS: fully convolutional one-stage object detection[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2020: 9626-9635.