
Journal of Graphics (图学学报) ›› 2025, Vol. 46 ›› Issue (4): 746-755. DOI: 10.11996/JG.j.2095-302X.2025040746

• Image Processing and Computer Vision •

Hierarchical attention spatial-temporal feature fusion algorithm for 3D human pose and shape estimation

YAN Zhuoyue1, LIU Li1,2, FU Xiaodong1,2, LIU Lijun1,2, PENG Wei1,2

  1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, Yunnan 650500, China
  2. Yunnan Key Laboratory of Computer Technologies Application, Kunming University of Science and Technology, Kunming, Yunnan 650500, China
  • Received: 2024-11-06 Accepted: 2025-03-18 Published: 2025-08-30 Online: 2025-08-11
  • Corresponding author: LIU Li (1979-), professor, Ph.D. Her main research interests cover computer graphics, computer vision, and image processing. E-mail: ieall@kust.edu.cn
  • First author: YAN Zhuoyue (1998-), master's student. Her main research interest is computer vision. E-mail: yanzhuoyue@stu.kust.edu.cn
  • Supported by:
    National Natural Science Foundation of China (62262036); National Natural Science Foundation of China (62362043); Xingdian Talent Support Project (KKXY202203008)

Abstract:

3D human pose and shape estimation from monocular video plays an important role in fields such as virtual try-on and film visual effects production. To address insufficient human body modeling, overly simple spatial-temporal representations, and limited estimation accuracy in this task, a hierarchical-attention spatial-temporal feature fusion algorithm is proposed. First, hierarchical attention is used to model human body parts hierarchically in space, yielding learnable spatial features of human pose. Second, these learnable spatial features are combined with a parametric human template to jointly guide the spatial-temporal modeling of the temporal features of human motion, achieving spatial-temporal feature fusion. Finally, a joint optimization method for 3D human pose and shape is proposed, in which a multilayer perceptron regresses a more accurate and smoother 3D human mesh. Experimental results on the Human3.6M dataset show that the method achieves an MPJPE of 56.1 mm and an ACC-ERR of 3.4 mm/s², reductions of 0.5% and 5.6%, respectively, relative to existing methods, improving the accuracy of 3D human pose and shape estimation and producing accurate, smooth 3D human meshes. In addition, tests on the 3DPW dataset and on Internet videos indicate that the method retains a degree of robustness in scenarios such as fast motion.
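
For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how MPJPE and an acceleration error are commonly computed from predicted and ground-truth 3D joint positions. It is not the authors' evaluation code: the function names `mpjpe` and `accel_error` are hypothetical, and the abstract does not specify the joint set, any root or Procrustes alignment, or how the frame rate enters the ACC-ERR unit, so none of those choices are modeled here.

```python
import numpy as np


def mpjpe(pred, gt):
    """Mean per-joint position error.

    pred, gt: arrays of shape (frames, joints, 3), e.g. in millimetres.
    Returns the Euclidean joint distance averaged over frames and joints.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    return float(np.linalg.norm(pred - gt, axis=-1).mean())


def accel_error(pred, gt):
    """Mean acceleration error from second-order finite differences.

    Acceleration at frame t is approximated by x[t-1] - 2*x[t] + x[t+1],
    so the result is per frame^2 unless rescaled by the frame rate
    (an assumption: the abstract reports mm/s^2 without giving the scaling).
    Requires at least 3 frames.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    acc_pred = pred[:-2] - 2.0 * pred[1:-1] + pred[2:]
    acc_gt = gt[:-2] - 2.0 * gt[1:-1] + gt[2:]
    return float(np.linalg.norm(acc_pred - acc_gt, axis=-1).mean())


if __name__ == "__main__":
    # Toy data only: 4 frames, 2 joints, ground truth at the origin.
    gt = np.zeros((4, 2, 3))
    pred = gt + np.array([3.0, 4.0, 0.0])  # constant offset of length 5
    print(mpjpe(pred, gt))        # position error sees the offset
    print(accel_error(pred, gt))  # a constant offset has no acceleration error
```

The toy case illustrates why the two metrics are reported together: a prediction can be offset from the ground truth yet perfectly smooth, or accurate on average yet jittery, and only the pair captures both accuracy and temporal smoothness.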

Key words: 3D human pose and shape estimation, hierarchical attention, spatial-temporal modeling, spatial-temporal feature fusion, pose and shape co-optimization

CLC number: