
Journal of Graphics ›› 2024, Vol. 45 ›› Issue (3): 516-527. DOI: 10.11996/JG.j.2095-302X.2024030516

• Computer Graphics and Virtual Reality •

Lightweight human pose estimation algorithm combined with coordinate Transformer

HUANG Youwen, LIN Zhiqin, ZHANG Jin, CHEN Junkuan

  1. School of Information Engineering, Jiangxi University of Science and Technology, Ganzhou, Jiangxi 341000, China
  • Received: 2023-11-17 Accepted: 2024-02-24 Published: 2024-06-30 Online: 2024-06-11
  • First author: HUANG Youwen (1982-), associate professor, Ph.D. His main research interests are computer vision, natural language processing, and machine learning. E-mail: ywhuang@jxust.edu.cn
  • Supported by:
    Funded Project of the Jiangxi Provincial Department of Education (GJJ180443)

Abstract:

Most existing bottom-up human pose estimation algorithms suffer from large model size, high computational cost, and poor suitability for edge devices. To address these issues, this study proposes YOLOv5s6-Pose-CT, a lightweight multi-person pose estimation network based on YOLOv5s6-Pose. The model introduces spatial and channel reconstruction convolution into the neck network to reduce feature redundancy along both the spatial and channel dimensions. A coordinate Transformer is proposed and embedded in the backbone network so that the model captures long-range dependencies while retaining efficient local feature extraction. Unbiased feature position alignment is then used to resolve the feature misalignment that arises during multi-scale fusion. Finally, the bounding-box regression loss is redefined using the MPDIoU (minimum point distance-based IoU) loss function. Experiments on the COCO 2017 dataset show that, compared with the mainstream lightweight network EfficientHRNet-H1, the proposed model reduces the parameter count by 16.2% and the computational cost by 66.1% while maintaining comparable accuracy. Compared with the baseline YOLOv5s6-Pose, it reduces the parameter count by 11.2% and the computational cost by 5.8%, and improves average detection precision and average recall by 2.5% and 2.6%, respectively.

Key words: human pose estimation, lightweight, coordinate Transformer, unbiased feature position alignment, loss function
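
The abstract names MPDIoU as the bounding-box regression loss but does not spell it out. The following is a minimal sketch of how such a loss is commonly formulated, assuming the usual definition (IoU minus the squared distances between the two boxes' top-left and bottom-right corners, each normalized by the squared diagonal of the input image). The function names `mpdiou` and `mpdiou_loss`, the `(x1, y1, x2, y2)` box layout, and the `eps` term are illustrative assumptions, not taken from the paper's implementation.

```python
def mpdiou(pred, target, img_w, img_h, eps=1e-9):
    """Return an MPDIoU score for two boxes given as (x1, y1, x2, y2).

    Assumed formulation: IoU - d1^2 / (w^2 + h^2) - d2^2 / (w^2 + h^2),
    where d1 and d2 are the distances between the top-left and the
    bottom-right corners of the two boxes, and w, h are the image size.
    """
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target

    # Intersection area (zero when the boxes do not overlap).
    inter_w = max(0.0, min(px2, tx2) - max(px1, tx1))
    inter_h = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = inter_w * inter_h

    # Union area; eps guards against division by zero for degenerate boxes.
    area_p = (px2 - px1) * (py2 - py1)
    area_t = (tx2 - tx1) * (ty2 - ty1)
    iou = inter / (area_p + area_t - inter + eps)

    # Squared corner distances, normalized by the squared image diagonal.
    d1_sq = (px1 - tx1) ** 2 + (py1 - ty1) ** 2
    d2_sq = (px2 - tx2) ** 2 + (py2 - ty2) ** 2
    norm = img_w ** 2 + img_h ** 2
    return iou - d1_sq / norm - d2_sq / norm


def mpdiou_loss(pred, target, img_w, img_h):
    """Loss is one minus the MPDIoU score, so a perfect match gives about 0."""
    return 1.0 - mpdiou(pred, target, img_w, img_h)
```

Under this formulation the corner-distance terms keep the loss informative even when the predicted and ground-truth boxes do not overlap, which is where plain IoU gives no gradient signal. A training pipeline would apply the same arithmetic to batched tensors rather than to single tuples.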
