Lightweight human pose estimation algorithm combined with coordinate Transformer

doi:10.11996/JG.j.2095-302X.2024030516

Abstract:

Most existing bottom-up human pose estimation algorithms suffer from large model size, high computational cost, and poor suitability for edge devices. To address these problems, this study proposes YOLOv5s6-Pose-CT, a lightweight multi-person pose estimation network based on YOLOv5s6-Pose. First, spatial and channel reconstruction convolution (SCConv) is introduced in the neck network to reduce feature redundancy along the spatial and channel dimensions. Second, a coordinate Transformer (CT) is proposed and embedded in the backbone network so that the model captures long-range dependencies while retaining efficient local feature extraction. Third, unbiased feature position alignment (UFPA) is used to resolve the feature misalignment that arises during multi-scale fusion. Finally, the bounding-box regression loss is redefined with the MPDIoU (minimum point distance-based IoU) loss function. Experiments on the COCO 2017 dataset show that, compared with the mainstream lightweight network EfficientHRNet-H1, the proposed model reduces the parameter count by 16.2% and the computation by 66.1% at the same accuracy. Compared with the baseline YOLOv5s6-Pose, it reduces the parameter count by 11.2% and the computation by 5.8%, while raising average precision (AP) and average recall (AR) by 2.5 and 2.6 percentage points, respectively.

Key words: human pose estimation, lightweight, coordinate Transformer, unbiased feature position alignment, loss function

CLC number: TP391

HUANG Youwen, LIN Zhiqin, ZHANG Jin, CHEN Junkuan. Lightweight human pose estimation algorithm combined with coordinate Transformer[J]. Journal of Graphics, 2024, 45(3): 516-527.

Figures/Tables 14

Fig. 1 YOLOv5s6-Pose-CT network structure

Fig. 2 Spatial and channel reconstruction convolution module

Fig. 3 Transformer architecture diagrams ((a) Swin Transformer; (b) Coordinate attention module; (c) Coordinate Transformer)

Fig. 4 Visual comparison ((a) False detection by Swin Transformer; (b) Missed detection by coordinate attention; (c), (d) Corrections by coordinate Transformer)

Fig. 5 Misalignment error caused by traditional corner-aligned interpolation
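
Fig. 5 concerns the positional error that corner-aligned interpolation introduces when feature maps are resized. As a generic illustration only (this is not the UFPA method, whose details are not given on this page), the sketch below compares the source coordinate that each upsampled pixel samples from under two common conventions: corner alignment and pixel-center (half-pixel) alignment. The function names and the 4-to-8 example size are assumptions chosen for illustration.

```python
def src_coord_corner_aligned(dst_index: int, in_size: int, out_size: int) -> float:
    """Source coordinate when the first and last pixel centers are pinned together."""
    if out_size == 1:
        return 0.0
    return dst_index * (in_size - 1) / (out_size - 1)


def src_coord_center_aligned(dst_index: int, in_size: int, out_size: int) -> float:
    """Source coordinate when pixels are treated as cells and their centers are mapped."""
    return (dst_index + 0.5) * in_size / out_size - 0.5


# Hypothetical example: upsample a 4-pixel row to 8 pixels.
in_size, out_size = 4, 8
offsets = [
    src_coord_corner_aligned(i, in_size, out_size)
    - src_coord_center_aligned(i, in_size, out_size)
    for i in range(out_size)
]

for i, offset in enumerate(offsets):
    print(f"output pixel {i}: corner-aligned minus center-aligned = {offset:+.3f}")
```

The two conventions disagree by a position-dependent amount, so features resized under one convention and fused with features indexed under the other do not line up exactly.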

Fig. 6 Example of the corner-aligned interpolation strategy using UFPA

Fig. 7 Comparison of visualization results ((a) Traditional corner-aligned interpolation; (b) UFPA corner-aligned interpolation)

Fig. 8 Example of the parameters of the MPDIoU loss function
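
To make the parameters of the MPDIoU loss concrete, the following is a minimal sketch of the loss as it is commonly formulated: the IoU of the two boxes minus the squared distances between their top-left corners and between their bottom-right corners, each normalized by the squared diagonal of the input image. The function name, the `(x1, y1, x2, y2)` box layout, and the `eps` term are assumptions made for this illustration and are not taken from the authors' implementation.

```python
def mpdiou_loss(pred, target, img_w, img_h, eps=1e-9):
    """Sketch of the MPDIoU loss for two boxes given as (x1, y1, x2, y2).

    Assumes x1 < x2 and y1 < y2; img_w and img_h are the input image size.
    """
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target

    # Plain IoU of the two boxes.
    inter_w = max(0.0, min(px2, tx2) - max(px1, tx1))
    inter_h = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = inter_w * inter_h
    union = (px2 - px1) * (py2 - py1) + (tx2 - tx1) * (ty2 - ty1) - inter
    iou = inter / (union + eps)

    # Squared distances between matching corners.
    d1_sq = (px1 - tx1) ** 2 + (py1 - ty1) ** 2  # top-left corners
    d2_sq = (px2 - tx2) ** 2 + (py2 - ty2) ** 2  # bottom-right corners

    # Normalize by the squared image diagonal and subtract from IoU.
    norm = img_w ** 2 + img_h ** 2
    mpdiou = iou - d1_sq / norm - d2_sq / norm
    return 1.0 - mpdiou


# Hypothetical boxes on a 640x640 input.
print(mpdiou_loss((10, 10, 50, 50), (10, 10, 50, 50), 640, 640))
print(mpdiou_loss((10, 10, 50, 50), (20, 20, 60, 60), 640, 640))
```

Unlike plain IoU, the corner-distance terms still change as non-overlapping boxes move, so the loss continues to distinguish near misses from distant ones.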

Table 1 Comparison of lightweight bottom-up methods on the COCO 2017 human keypoint dataset

Method	Input size	Params/MB	Computation/G	AP/%	AP⁵⁰/%	AP⁷⁵/%	AP^L/%	AR/%
Lightweight OpenPose	368×368	4.1	18.0	42.8	-	-	-	-
EfficientHRNet-H₁	480×480	16.0	28.4	59.2	82.6	64.0	67.2	64.7
EfficientHRNet-H₂	448×448	10.3	15.4	52.9	80.5	59.1	61.9	59.3
EfficientHRNet-H₃	416×416	6.9	8.4	44.8	76.7	48.3	52.3	52.4
EfficientHRNet-H₄	384×384	3.7	4.2	35.7	69.6	33.7	44.3	42.9
YOLOv5s6-Pose-ti-lite	640×640	12.6	8.6	54.9	82.2	59.9	66.6	61.8
Baseline	640×640	15.1	10.2	56.7	83.7	61.3	71.1	63.7
Ours	640×640	13.4	9.6	59.2	85.3	63.3	73.2	66.3
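
The relative reductions quoted in the abstract can be checked directly against the parameter and computation columns of Table 1. The short sketch below recomputes them; the helper name `relative_reduction` is an assumption for illustration, and the inputs are the one-decimal values printed in the table.

```python
def relative_reduction(reference: float, ours: float) -> float:
    """Percentage by which `ours` is smaller than `reference`."""
    return (reference - ours) / reference * 100.0


# (parameter count, computation) as listed in Table 1.
table1 = {
    "EfficientHRNet-H1": (16.0, 28.4),
    "Baseline": (15.1, 10.2),
    "Ours": (13.4, 9.6),
}

ours_params, ours_comp = table1["Ours"]
reductions = {}
for name in ("EfficientHRNet-H1", "Baseline"):
    ref_params, ref_comp = table1[name]
    reductions[name] = (
        relative_reduction(ref_params, ours_params),
        relative_reduction(ref_comp, ours_comp),
    )
    params_red, comp_red = reductions[name]
    print(f"vs {name}: params -{params_red:.2f}%, computation -{comp_red:.2f}%")
```

The recomputed values agree with the abstract's 16.2%, 66.1%, 11.2%, and 5.8% to within about 0.1 percentage point; the small differences are consistent with the table entries being rounded to one decimal place.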

Fig. 9 Comparison of visual results of lightweight bottom-up multi-person pose estimation methods ((a) EfficientHRNet-H3; (b) EfficientHRNet-H2; (c) EfficientHRNet-H1; (d) YOLOv5s6-Pose; (e) Ours)

Table 2 Comparison of applying the SCConv module at different stages on the COCO 2017 human keypoint dataset

Method	Params/MB	Computation/G	AP/%
YOLOv5s6-Pose	15.1	10.2	56.7
YOLOv5s6-Pose+SCConv	12.3	8.3	53.6
Backbone+SCConv	14.0	9.4	55.3
Neck+SCConv	13.3	9.0	56.4

Table 3 Comparison of different attention mechanisms in the model on the COCO 2017 human keypoint dataset

Method	Params/MB	Computation/G	AP/%
YOLOv5s6-Pose	15.1	10.2	56.7
YOLOv5s6-Pose+CA	15.1	10.2	56.9
YOLOv5s6-Pose+Swin Transformer	15.6	12.5	57.2
YOLOv5s6-Pose+CT	15.2	10.8	58.6

Table 4 Ablation experiment design

Module / Experiment	1	2	3	4	5	6	7	8	9	10	11	12
SCConv	-	√	-	-	-	-	√	√	√	√	√	√
CT	-	-	√	-	-	-	-	-	√	-	√	√
UFPA	-	-	-	√	-	√	-	√	-	√	√	√
MPDIoU	-	-	-	-	√	√	√	-	-	√	-	√

Table 5 Comparison of ablation experiment results

Experiment	Params/MB	Computation/G	AP/%	AP⁵⁰/%	AR/%
1 (Baseline)	15.1	10.2	56.7	83.7	63.7
2 (SCConv)	13.3	9.0	56.4	83.1	63.5
3 (CT)	15.2	10.8	58.6	84.8	65.6
4 (UFPA)	15.1	10.2	57.4	83.9	64.5
5 (MPDIoU)	15.1	10.2	57.2	83.9	64.3
6 (UFPA+MPDIoU)	15.1	10.2	57.8	84.2	64.9
7 (SCConv+MPDIoU)	13.3	9.0	56.8	83.7	63.8
8 (SCConv+UFPA)	13.3	9.0	57.1	83.8	64.1
9 (SCConv+CT)	13.4	9.6	58.2	84.6	65.4
10 (SCConv+UFPA+MPDIoU)	13.3	9.0	57.4	84.0	64.4
11 (SCConv+CT+UFPA)	13.4	9.6	58.8	85.1	65.8
12 (SCConv+CT+UFPA+MPDIoU)	13.4	9.6	59.2	85.3	66.3
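
One way to read Table 5 is to compare the AP change that each module produces on its own (experiments 2 to 5) with the change produced by the full model (experiment 12). The sketch below does this arithmetic using only the AP values from the table; the variable names are illustrative assumptions.

```python
baseline_ap = 56.7  # experiment 1

# AP of the single-module experiments 2-5 in Table 5.
single_module_ap = {"SCConv": 56.4, "CT": 58.6, "UFPA": 57.4, "MPDIoU": 57.2}
full_model_ap = 59.2  # experiment 12

# AP change of each module relative to the baseline, in percentage points.
gains = {name: round(ap - baseline_ap, 1) for name, ap in single_module_ap.items()}
sum_of_gains = round(sum(gains.values()), 1)
full_gain = round(full_model_ap - baseline_ap, 1)

for name, gain in gains.items():
    print(f"{name}: {gain:+.1f} AP")
print(f"sum of individual changes: {sum_of_gains:+.1f} AP")
print(f"full model: {full_gain:+.1f} AP")
```

In this reading, CT accounts for the largest single gain, SCConv trades a small AP drop for fewer parameters and less computation, and the combined gain is slightly smaller than the sum of the individual changes, which suggests the modules' contributions are not fully additive.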
