Journal of Graphics ›› 2024, Vol. 45 ›› Issue (3): 516-527. DOI: 10.11996/JG.j.2095-302X.2024030516
• Computer Graphics and Virtual Reality •
Lightweight human pose estimation algorithm combined with coordinate Transformer
HUANG Youwen, LIN Zhiqin, ZHANG Jin, CHEN Junkuan
Received: 2023-11-17
Accepted: 2024-02-24
Online: 2024-06-30
Published: 2024-06-11
About author: HUANG Youwen (1982-), associate professor, Ph.D. His main research interests cover computer vision, natural language processing, and machine learning. E-mail: ywhuang@jxust.edu.cn
HUANG Youwen, LIN Zhiqin, ZHANG Jin, CHEN Junkuan. Lightweight human pose estimation algorithm combined with coordinate Transformer[J]. Journal of Graphics, 2024, 45(3): 516-527.
URL: http://www.txxb.com.cn/EN/10.11996/JG.j.2095-302X.2024030516
Method | Input size | Params/MB | FLOPs/G | AP/% | AP50/% | AP75/% | APL/% | AR/% |
---|---|---|---|---|---|---|---|---|
Lightweight OpenPose | 368×368 | 4.1 | 18.0 | 42.8 | - | - | - | - |
EfficientHRNet-H1 | 480×480 | 16.0 | 28.4 | 59.2 | 82.6 | 64.0 | 67.2 | 64.7 |
EfficientHRNet-H2 | 448×448 | 10.3 | 15.4 | 52.9 | 80.5 | 59.1 | 61.9 | 59.3 |
EfficientHRNet-H3 | 416×416 | 6.9 | 8.4 | 44.8 | 76.7 | 48.3 | 52.3 | 52.4 |
EfficientHRNet-H4 | 384×384 | 3.7 | 4.2 | 35.7 | 69.6 | 33.7 | 44.3 | 42.9 |
YOLOv5s6-Pose-ti-lite | 640×640 | 12.6 | 8.6 | 54.9 | 82.2 | 59.9 | 66.6 | 61.8 |
Baseline | 640×640 | 15.1 | 10.2 | 56.7 | 83.7 | 61.3 | 71.1 | 63.7 |
Ours | 640×640 | 13.4 | 9.6 | 59.2 | 85.3 | 63.3 | 73.2 | 66.3 |
Table 1 Comparison of lightweight bottom-up methods on the COCO2017 dataset
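The AP, AP50, AP75, APL, and AR columns follow the COCO keypoint protocol: each predicted pose is scored against the ground truth with object keypoint similarity (OKS), and precision/recall are averaged over OKS thresholds from 0.50 to 0.95 (AP50 and AP75 fix the threshold at 0.50 and 0.75; APL restricts to large instances). As a minimal illustration of the OKS term only, the sketch below uses the standard COCO per-keypoint constants; the helper name is ours, and the full evaluation is normally run with pycocotools rather than hand-rolled code.

```python
import numpy as np

# Standard COCO per-keypoint sigmas (nose, eyes, ears, shoulders, elbows,
# wrists, hips, knees, ankles).
COCO_SIGMAS = np.array([0.026, 0.025, 0.025, 0.035, 0.035, 0.079, 0.079,
                        0.072, 0.072, 0.062, 0.062, 0.107, 0.107, 0.087,
                        0.087, 0.089, 0.089])

def object_keypoint_similarity(pred, gt, visibility, area):
    """OKS between one predicted and one ground-truth pose.

    pred, gt: (17, 2) arrays of (x, y); visibility: (17,) with v > 0 for
    labelled keypoints; area: ground-truth instance area in pixels.
    """
    d2 = np.sum((pred - gt) ** 2, axis=1)          # squared keypoint distances
    var = (2.0 * COCO_SIGMAS) ** 2                 # per-keypoint tolerance
    e = d2 / (2.0 * area * var + np.spacing(1))    # normalised error
    labelled = visibility > 0
    return float(np.exp(-e)[labelled].mean()) if labelled.any() else 0.0
```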
Fig. 9 Comparison of visual results of lightweight bottom-up multi-person pose estimation methods ((a) EfficientHRNet-H3; (b) EfficientHRNet-H2; (c) EfficientHRNet-H1; (d) YOLOv5s6-Pose; (e) Ours)
Method | Params/MB | FLOPs/G | AP/% |
---|---|---|---|
YOLOv5s6-Pose | 15.1 | 10.2 | 56.7 |
YOLOv5s6-Pose+SCConv | 12.3 | 8.3 | 53.6 |
Backbone+SCConv | 14.0 | 9.4 | 55.3 |
Neck+SCConv | 13.3 | 9.0 | 56.4 |
Table 2 Experimental comparison of the SCConv module at different stages on the COCO2017 dataset
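The Params/MB and FLOPs/G columns in Tables 1-2 describe model size and the cost of one forward pass. A hedged sketch of how such figures are commonly obtained with the third-party thop profiler is shown below; the function name is illustrative, and counting conventions (MACs vs. FLOPs) and input resolution both affect the result, so numbers obtained this way will not necessarily match the tables.

```python
import torch
from thop import profile  # pip install thop

def report_complexity(model, img_size=640):
    """Print parameter count and multiply-accumulate cost for one
    forward pass at the given square input resolution."""
    model.eval()
    dummy = torch.randn(1, 3, img_size, img_size)
    with torch.no_grad():
        macs, params = profile(model, inputs=(dummy,), verbose=False)
    # Some papers report MACs directly as "FLOPs"; others double them.
    print(f"Params: {params / 1e6:.1f} M   MACs: {macs / 1e9:.1f} G")
```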
Method | Params/MB | FLOPs/G | AP/% |
---|---|---|---|
YOLOv5s6-Pose | 15.1 | 10.2 | 56.7 |
YOLOv5s6-Pose+CA | 15.1 | 10.2 | 56.9 |
YOLOv5s6-Pose+Swin Transformer | 15.6 | 12.5 | 57.2 |
YOLOv5s6-Pose+CT | 15.2 | 10.8 | 58.6 |
Table 3 Experimental comparison of different attention modules in the model on the COCO2017 dataset
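The CA row in Table 3 refers to coordinate attention [37]. For orientation, a simplified PyTorch sketch of that standard block is given below; the published module uses a hard-swish activation (ReLU is substituted here), the class name is ours, and it should not be read as the authors' CT module.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Simplified coordinate attention block (Hou et al., CVPR 2021).

    Pools the feature map separately along height and width, encodes the two
    direction-aware descriptors with a shared 1x1 conv, then re-weights the
    input with per-direction sigmoid gates.
    """
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, 1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, 1)
        self.conv_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x):
        n, c, h, w = x.shape
        x_h = x.mean(dim=3, keepdim=True)                      # N x C x H x 1
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)  # N x C x W x 1
        y = self.act(self.bn(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = self.conv_h(y_h).sigmoid()                        # N x C x H x 1
        a_w = self.conv_w(y_w.permute(0, 1, 3, 2)).sigmoid()    # N x C x 1 x W
        return x * a_h * a_w
```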
Module | Experiment 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
SCConv | - | √ | - | - | - | - | √ | √ | √ | √ | √ | √ |
CT | - | - | √ | - | - | - | - | - | √ | - | √ | √ |
UFPA | - | - | - | √ | - | √ | - | √ | - | √ | √ | √ |
MPDIoU | - | - | - | - | √ | √ | √ | - | - | √ | - | √ |
Table 4 Ablation experimental design
Experiment | Params/MB | FLOPs/G | AP/% | AP50/% | AR/% |
---|---|---|---|---|---|
1 (Baseline) | 15.1 | 10.2 | 56.7 | 83.7 | 63.7 |
2 (SCConv) | 13.3 | 9.0 | 56.4 | 83.1 | 63.5 |
3 (CT) | 15.2 | 10.8 | 58.6 | 84.8 | 65.6 |
4 (UFPA) | 15.1 | 10.2 | 57.4 | 83.9 | 64.5 |
5 (MPDIoU) | 15.1 | 10.2 | 57.2 | 83.9 | 64.3 |
6 (UFPA+MPDIoU) | 15.1 | 10.2 | 57.8 | 84.2 | 64.9 |
7 (SCConv+MPDIoU) | 13.3 | 9.0 | 56.8 | 83.7 | 63.8 |
8 (SCConv+UFPA) | 13.3 | 9.0 | 57.1 | 83.8 | 64.1 |
9 (SCConv+CT) | 13.4 | 9.6 | 58.2 | 84.6 | 65.4 |
10 (SCConv+UFPA+MPDIoU) | 13.3 | 9.0 | 57.4 | 84.0 | 64.4 |
11 (SCConv+CT+UFPA) | 13.4 | 9.6 | 58.8 | 85.1 | 65.8 |
12 (SCConv+CT+UFPA+MPDIoU) | 13.4 | 9.6 | 59.2 | 85.3 | 66.3 |
Table 5 Comparison of ablation experiment results
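Experiments 5-7, 10, and 12 in Table 5 replace the baseline's box-regression loss with MPDIoU [23]. As described in that paper, MPDIoU penalises the plain IoU by the squared distances between the two boxes' top-left and bottom-right corners, normalised by the squared diagonal of the input image, and the loss is 1 - MPDIoU. A minimal PyTorch sketch under that formulation follows; the function name is ours and this is an illustration of the published loss, not the authors' training code.

```python
import torch

def mpdiou_loss(pred, target, img_w, img_h, eps=1e-7):
    """MPDIoU bounding-box loss; pred/target are (N, 4) boxes as (x1, y1, x2, y2)."""
    # Intersection and union for the plain IoU term.
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Squared corner distances, normalised by the image diagonal.
    diag2 = img_w ** 2 + img_h ** 2
    d1 = ((pred[:, :2] - target[:, :2]) ** 2).sum(dim=1)   # top-left corners
    d2 = ((pred[:, 2:] - target[:, 2:]) ** 2).sum(dim=1)   # bottom-right corners

    mpdiou = iou - d1 / diag2 - d2 / diag2
    return (1.0 - mpdiou).mean()
```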
References
[1] FENG J, ZHENG J L. A comparative study of human pose estimation based on convolution and Transformer[J]. Software Engineering, 2023, 26(3): 18-24 (in Chinese).
[2] CAI X Q, HUO Y Q, LI F J, et al. Human pose estimation and similarity calculation for Tai Chi learning[J]. Journal of Graphics, 2022, 43(4): 695-706 (in Chinese).
[3] CAI M M, HUANG J F, LIN X, et al. Acquisition method of specific motion frame based on human attitude estimation and clustering[J]. Journal of Graphics, 2022, 43(1): 44-52 (in Chinese).
[4] FAN Y H, WANG Y Z, YAN X F, et al. Face recognition-driven low-light image enhancement[J]. Journal of Graphics, 2022, 43(6): 1170-1181 (in Chinese).
[5] ZHAO X C, HU A M, HE W. Fall detection based on convolutional neural network and XGBoost[J]. Laser & Optoelectronics Progress, 2020, 57(16): 161024 (in Chinese).
[6] LU J, YANG T F, ZHAO B, et al. Review of deep learning-based human pose estimation[J]. Laser & Optoelectronics Progress, 2021, 58(24): 69-88 (in Chinese).
[7] TOSHEV A, SZEGEDY C. DeepPose: human pose estimation via deep neural networks[C]// 2014 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2014: 1653-1660.
[8] WEI S H, RAMAKRISHNA V, KANADE T, et al. Convolutional pose machines[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 4724-4732.
[9] NEWELL A, YANG K Y, DENG J. Stacked hourglass networks for human pose estimation[EB/OL]. (2016-10-11) [2023-08-07]. https://link.springer.com/content/pdf/10.1007/978-3-319-46484-8_29.pdf.
[10] REN H P, WANG W M, WEI D J, et al. Human pose estimation based on high-resolution net[J]. Journal of Graphics, 2021, 42(3): 432-438 (in Chinese).
[11] FANG H S, XIE S Q, TAI Y W, et al. RMPE: regional multi-person pose estimation[C]// 2017 IEEE International Conference on Computer Vision. New York: IEEE Press, 2017: 2353-2362.
[12] CHEN Y L, WANG Z C, PENG Y X, et al. Cascaded pyramid network for multi-person pose estimation[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 7103-7112.
[13] SUN K, XIAO B, LIU D, et al. Deep high-resolution representation learning for human pose estimation[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 5686-5696.
[14] ZENG W X, MA Y, LI W G. A survey of lightweight two-dimensional human skeleton key point detection algorithms[J]. Science Technology and Engineering, 2022, 22(16): 6377-6392 (in Chinese).
[15] CAO Z, SIMON T, WEI S H, et al. Realtime multi-person 2D pose estimation using part affinity fields[EB/OL]. (2016-11-24) [2023-08-07]. http://arxiv.org/abs/1611.08050.
[16] OSOKIN D. Real-time 2D multi-person pose estimation on CPU: lightweight OpenPose[EB/OL]. (2018-11-29) [2023-07-07]. http://arxiv.org/abs/1811.12004.
[17] CHENG B W, XIAO B, WANG J D, et al. HigherHRNet: scale-aware representation learning for bottom-up human pose estimation[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 5385-5394.
[18] GENG Z G, SUN K, XIAO B, et al. Bottom-up human pose estimation via disentangled keypoint regression[EB/OL]. (2021-04-06) [2023-07-07]. http://arxiv.org/abs/2104.02300.
[19] MAJI D, NAGORI S, MATHEW M, et al. YOLO-pose: enhancing YOLO for multi person pose estimation using object keypoint similarity loss[EB/OL]. (2022-04-14) [2023-08-12]. http://arxiv.org/abs/2204.06806.
[20] LI J N, WANG Y W, ZHANG S L. PolarPose: single-stage multi-person pose estimation in polar coordinates[J]. IEEE Transactions on Image Processing, 2023, 32: 1108-1119.
[21] LI J F, WEN Y, HE L H. SCConv: spatial and channel reconstruction convolution for feature redundancy[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 6153-6162.
[22] WANG C, ZHOU Y H, ZHANG F, et al. Unbiased feature position alignment for human pose estimation[J]. Neurocomputing, 2023, 537: 152-163.
[23] MA S L, XU Y. MPDIoU: a loss for efficient and accurate bounding box regression[EB/OL]. (2023-07-14) [2023-08-17]. http://arxiv.org/abs/2307.07662.
[24] HOWARD A G, ZHU M L, CHEN B, et al. MobileNets: efficient convolutional neural networks for mobile vision applications[EB/OL]. (2017-04-17) [2023-08-17]. http://arxiv.org/abs/1704.04861.
[25] NEFF C, SHETH A, FURGURSON S, et al. EfficientHRNet: efficient and scalable high-resolution networks for real-time multi-person 2D human pose estimation[J]. Journal of Real-Time Image Processing, 2021, 18(4): 1037-1049.
[26] TAN M X, PANG R M, LE Q V. EfficientDet: scalable and efficient object detection[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 10778-10787.
[27] WANG C Y, MARK LIAO H Y, WU Y H, et al. CSPNet: a new backbone that can enhance learning capability of CNN[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. New York: IEEE Press, 2020: 1571-1580.
[28] PIAO Y R, JIANG Y Y, ZHANG M, et al. PANet: patch-aware network for light field salient object detection[J]. IEEE Transactions on Cybernetics, 2023, 53(1): 379-391.
[29] WANG X L, GIRSHICK R, GUPTA A, et al. Non-local neural networks[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 7794-7803.
[30] HU J, SHEN L, SUN G. Squeeze-and-excitation networks[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 7132-7141.
[31] WOO S, PARK J, LEE J Y, et al. CBAM: convolutional block attention module[C]// Computer Vision - ECCV 2018. Cham: Springer International Publishing, 2018: 3-19.
[32] CAO Y, XU J R, WEI F Y, et al. GCNet: non-local networks meet squeeze-excitation networks and beyond[C]// 2019 IEEE/CVF International Conference on Computer Vision Workshops. New York: IEEE Press, 2019: 1971-1980.
[33] LIU H J, LIU F Q, FAN X Y, et al. Polarized self-attention: towards high-quality pixel-wise mapping[J]. Neurocomputing, 2022, 506: 158-167.
[34] ZHU L, WANG X J, KE Z H, et al. BiFormer: vision transformer with bi-level routing attention[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 10323-10333.
[35] LIU Z, LIN Y T, CAO Y, et al. Swin Transformer: hierarchical vision Transformer using shifted windows[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2021: 9992-10002.
[36] TOLSTIKHIN I, HOULSBY N, KOLESNIKOV A, et al. MLP-Mixer: an all-MLP architecture for vision[EB/OL]. (2021-05-04) [2023-07-23]. http://arxiv.org/abs/2105.01601.
[37] HOU Q B, ZHOU D Q, FENG J S. Coordinate attention for efficient mobile network design[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 13708-13717.
[38] ZHENG Z H, WANG P, REN D W, et al. Enhancing geometric factors in model learning and inference for object detection and instance segmentation[J]. IEEE Transactions on Cybernetics, 2022, 52(8): 8574-8586.
[39] LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: common objects in context[J]. Lecture Notes in Computer Science, 2014, 8693: 740-755.