Hierarchical attention spatial-temporal feature fusion algorithm for 3D human pose and shape estimation

Journal of Graphics, 2025, Vol. 46, Issue 4: 746-755. DOI: 10.11996/JG.j.2095-302X.2025040746

• Image Processing and Computer Vision •

YAN Zhuoyue¹, LIU Li¹,², FU Xiaodong¹,², LIU Lijun¹,², PENG Wei¹,²

1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, Yunnan 650500, China
2. Yunnan Key Laboratory of Computer Technologies Application, Kunming University of Science and Technology, Kunming, Yunnan 650500, China

Received: 2024-11-06 Accepted: 2025-03-18 Published: 2025-08-30 Online: 2025-08-11
Corresponding author: LIU Li (1979-), female, professor, Ph.D. Her main research interests cover computer graphics, computer vision, and image processing. E-mail: ieall@kust.edu.cn
First author: YAN Zhuoyue (1998-), female, master's student. Her main research interest is computer vision. E-mail: yanzhuoyue@stu.kust.edu.cn
Supported by:
National Natural Science Foundation of China (62262036); National Natural Science Foundation of China (62362043); Xingdian Talent Support Project (KKXY202203008)

Abstract

Monocular-video-based 3D human pose and shape estimation plays an important role in fields such as virtual try-on and film visual-effects production. To address insufficient human body modeling, overly simple spatial-temporal feature representation, and limited estimation accuracy in 3D human pose and shape estimation from monocular videos, a hierarchical attention spatial-temporal feature fusion algorithm is proposed. First, hierarchical attention is used to model human body parts hierarchically in space, yielding learnable human pose spatial features. Second, these learnable spatial features are combined with a parametric human template to jointly guide the spatial-temporal modeling of human motion temporal features, achieving spatial-temporal feature fusion. Finally, a joint optimization method for 3D human pose and shape is proposed, in which a multilayer perceptron regresses a more accurate and smoother 3D human mesh. Experimental results on the Human3.6M dataset show that the method achieves an MPJPE of 56.1 mm and an ACC-ERR of 3.4 mm/s², reductions of 0.5% and 5.6% relative to existing methods, improving the accuracy of 3D human pose and shape estimation and producing accurate and smooth 3D human meshes. Furthermore, tests on the 3DPW dataset and on Internet videos indicate that the method retains a degree of robustness in scenarios such as fast motion.

Key words: 3D human pose and shape estimation, hierarchical attention, spatial-temporal modeling, spatial-temporal feature fusion, pose and shape co-optimization
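
For readers unfamiliar with the two metrics quoted in the abstract, the following is a minimal plain-Python sketch of how MPJPE and acceleration error (ACC-ERR) are commonly computed. It is not taken from the paper. The function names, the data layout (a list of frames, each a list of (x, y, z) joints in millimetres), and the use of a second-order finite difference in per-frame units are all assumptions; published evaluations typically also align root joints and may scale by the frame rate.

```python
import math


def mpjpe(pred, gt):
    """Mean per-joint position error: mean Euclidean distance over all frames and joints."""
    dists = [
        math.dist(p, g)
        for frame_p, frame_g in zip(pred, gt)
        for p, g in zip(frame_p, frame_g)
    ]
    return sum(dists) / len(dists)


def acceleration(seq):
    """Second-order finite difference x[t-1] - 2*x[t] + x[t+1] for every joint."""
    return [
        [
            tuple(a - 2.0 * b + c for a, b, c in zip(j_prev, j_curr, j_next))
            for j_prev, j_curr, j_next in zip(f_prev, f_curr, f_next)
        ]
        for f_prev, f_curr, f_next in zip(seq, seq[1:], seq[2:])
    ]


def accel_error(pred, gt):
    """Mean distance between predicted and ground-truth joint accelerations."""
    return mpjpe(acceleration(pred), acceleration(gt))


# Hypothetical toy sequence: 5 frames, 2 joints moving at constant velocity along x.
gt = [[(float(t), float(j), 0.0) for j in range(2)] for t in range(5)]

# Prediction A: every joint shifted by a constant (3, 4, 0) mm offset.
offset_pred = [[(x + 3.0, y + 4.0, z) for x, y, z in frame] for frame in gt]

# Prediction B: a small error that flips sign every frame (temporal jitter).
jitter_pred = [
    [(x + (1.0 if t % 2 == 0 else -1.0), y, z) for x, y, z in frame]
    for t, frame in enumerate(gt)
]

print(mpjpe(offset_pred, gt), accel_error(offset_pred, gt))  # 5.0 0.0
print(mpjpe(jitter_pred, gt), accel_error(jitter_pred, gt))  # 1.0 4.0
```

The toy example shows why both metrics are reported: a constant offset is penalized only by MPJPE, while frame-to-frame jitter with a small positional error is penalized mainly by the acceleration error, which is why ACC-ERR is used as a smoothness measure.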

CLC number: TP391.41

Cite this article: YAN Zhuoyue, LIU Li, FU Xiaodong, LIU Lijun, PENG Wei. Hierarchical attention spatial-temporal feature fusion algorithm for 3D human pose and shape estimation[J]. Journal of Graphics, 2025, 46(4): 746-755.

Figures/Tables 13

Fig. 1 Flowchart of the hierarchical attention spatial-temporal feature fusion algorithm for 3D human pose and shape estimation

Fig. 2 Structure of the human body part encoding

Table 1 Experimental environment versions

Name	Version
Ubuntu	18.04
CUDA	11.3
PyTorch	2.3.1
Python	3.8

Table 2 Quantitative results on the Human3.6M dataset

Method	MPJPE/mm	PA-MPJPE/mm	ACC-ERR/(mm/s²)
Reference [10]	65.9	41.5	18.3
Reference [12]	76.0	53.2	15.3
Reference [13]	62.2	41.1	5.3
Reference [19]	56.4	38.7	-
Reference [20]	69.4	47.4	3.6
Reference [6]	67.0	46.3	3.6
Reference [14]	62.8	41.0	-
Reference [8]	73.2	51.0	3.6
Reference [5]	70.4	44.5	4.8
Reference [9]	58.3	41.3	3.8
Ours	56.1	39.5	3.4
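
The 0.5% and 5.6% reductions quoted in the abstract can be reproduced from Table 2 if the lowest prior value in each column is taken as the baseline. That pairing is an inference from the table, not something the page states, and the helper name `relative_reduction` is illustrative only.

```python
def relative_reduction(baseline, ours):
    """Percentage by which `ours` is lower than `baseline`."""
    return (baseline - ours) / baseline * 100.0


# Values copied from Table 2 (Human3.6M); lower is better for both metrics.
# Assumed baselines: the lowest value reported by any compared method.
mpjpe_best_prior, mpjpe_ours = 56.4, 56.1  # Reference [19] vs. the proposed method
acc_best_prior, acc_ours = 3.6, 3.4        # References [20], [6], [8] vs. the proposed method

mpjpe_gain = round(relative_reduction(mpjpe_best_prior, mpjpe_ours), 1)
acc_gain = round(relative_reduction(acc_best_prior, acc_ours), 1)

print(mpjpe_gain)  # 0.5
print(acc_gain)    # 5.6
```

Note that the same table shows Reference [19] with a lower PA-MPJPE (38.7 mm) than the proposed method (39.5 mm), so the reported gains apply to MPJPE and ACC-ERR specifically.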

Table 3 Quantitative results on the 3DPW dataset

Method	MPJPE/mm	PA-MPJPE/mm	PVE/mm	ACC-ERR/(mm/s²)
Reference [10]	91.9	57.6	99.1	25.4
Reference [12]	86.9	54.7	-	11.6
Reference [13]	86.5	52.7	102.9	7.1
Reference [19]	79.1	45.7	92.6	17.6
Reference [20]	84.3	52.1	99.7	7.4
Reference [6]	80.7	50.6	96.3	6.6
Reference [14]	85.5	50.2	99.1	-
Reference [8]	83.4	51.7	98.9	7.2
Reference [5]	80.6	48.0	95.3	8.2
Reference [9]	75.0	45.5	90.2	7.1
Ours	74.4	50.0	90.0	7.1

Table 4 Quantitative results on the MPI-INF-3DHP dataset

Method	MPJPE/mm	PA-MPJPE/mm	ACC-ERR/(mm/s²)
Reference [10]	103.9	68.9	27.3
Reference [12]	96.4	65.4	11.1
Reference [13]	97.6	63.5	8.5
Reference [19]	83.6	56.2	-
Reference [20]	96.7	62.8	9.6
Reference [6]	93.9	61.5	7.9
Reference [8]	98.2	62.5	8.6
Reference [5]	93.7	59.6	10.0
Reference [9]	94.4	60.4	9.2
Ours	89.6	61.8	8.0

Table 5 Qualitative results on the Human3.6M dataset

Action	Reference [20]	Reference [5]	Ours
Directions
Sitdown
Walk
Walkdog

Fig. 3 Qualitative results on the 3DPW dataset ((a) Reference [20]; (b) Reference [5]; (c) Ours)

Fig. 4 Qualitative results on Internet videos ((a) Reference [20]; (b) Reference [5]; (c) Ours)

Fig. 5 Comparison of failure cases on the 3DPW dataset ((a) Input; (b) Reference [5]; (c) Ours)

Table 6 Effectiveness of each module on the 3DPW dataset

Model	MPJPE/mm	PA-MPJPE/mm	ACC-ERR/(mm/s²)
Base	78.4	52.8	8.9
Base+Hierarchical	75.6	52.0	8.9
Base+Hierarchical+Fusion	74.4	50.0	7.1
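
To make the contribution of each module in Table 6 easier to read, the short sketch below computes the change in every metric from one ablation row to the next. The numbers are copied from the table; the row labels in English and the helper name `step_deltas` are illustrative assumptions.

```python
# Rows copied from Table 6 (3DPW): (MPJPE mm, PA-MPJPE mm, ACC-ERR mm/s^2).
ablation = [
    ("Base", (78.4, 52.8, 8.9)),
    ("Base+Hierarchical", (75.6, 52.0, 8.9)),
    ("Base+Hierarchical+Fusion", (74.4, 50.0, 7.1)),
]


def step_deltas(rows):
    """Change in each metric from one row to the next (negative means improvement)."""
    return {
        curr_name: tuple(round(c - p, 1) for p, c in zip(prev_vals, curr_vals))
        for (_, prev_vals), (curr_name, curr_vals) in zip(rows, rows[1:])
    }


deltas = step_deltas(ablation)
for name, change in deltas.items():
    print(name, change)
# Base+Hierarchical (-2.8, -0.8, 0.0)
# Base+Hierarchical+Fusion (-1.2, -2.0, -1.8)
```

Read this way, adding hierarchical attention lowers MPJPE but leaves ACC-ERR unchanged, while adding the fusion step accounts for the whole reduction in acceleration error.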

Fig. 6 Qualitative results of each module in the ablation experiment ((a) Base; (b) Base+Hierarchical; (c) Base+Hierarchical+Fusion)

Table 7 Parameter verification experiment

T	MPJPE/mm	PA-MPJPE/mm	ACC-ERR/(mm/s²)
4	59.1	43.7	11.1
8	59.9	42.8	7.1
16	56.1	39.5	3.4

References 31

[1]	TIAN Y T, ZHANG H W, LIU Y B, et al. Recovering 3D human mesh from monocular images: a survey[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(12): 15406-15425.
[2]	YE V, PAVLAKOS G, MALIK J, et al. Decoupling human and camera motion from videos in the wild[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 21222-21232.
[3]	KANAZAWA A, BLACK M J, JACOBS D W, et al. End-to-end recovery of human shape and pose[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 7122-7131.
[4]	KANAZAWA A, ZHANG J Y, FELSEN P, et al. Learning 3D human dynamics from video[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 5614-5623.
[5]	YAO W, ZHANG H W, SUN Y L, et al. STAF: 3D human mesh recovery from video with spatio-temporal alignment fusion[EB/OL]. [2024-05-05]. https://arxiv.org/abs/2401.01730.pdf.
[6]	SHEN X L, YANG Z X, WANG X H, et al. Global-to-local modeling for video-based 3D human pose and shape estimation[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 8887-8896.
[7]	YANG S, HENG W, LIU G, et al. Capturing the motion of every joint: 3D human pose and shape estimation with independent tokens[EB/OL]. [2024-05-05]. https://arxiv.org/pdf/2303.00298.pdf.
[8]	ZHANG B Y, MA K H, WU S P, et al. Two-stage co-segmentation network based on discriminative representation for recovering human mesh from videos[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 5662-5670.
[9]	LEE M, LEE H, KIM B, et al. UNSPAT: Uncertainty-guided spatio-temporal transformer for 3D human pose and shape estimation on videos[C]// 2024 IEEE/CVF Winter Conference on Applications of Computer Vision. New York: IEEE Press, 2024: 3004-3013.
[10]	KOCABAS M, ATHANASIOU N, BLACK M J. VIBE: Video inference for human body pose and shape estimation[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 5253-5263.
[11]	MAHMOOD N, GHORBANI N, TROJE N F, et al. AMASS: Archive of motion capture as surface shapes[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 5442-5451.
[12]	LUO Z Y, GOLESTANEH S A, KITANI K M. 3D human motion estimation via motion compression and refinement[C]// Computer Vision - ACCV 2020: 15th Asian Conference. New York: ACM, 2020: 324-340.
[13]	CHOI H, MOON G, CHANG J Y, et al. Beyond static features for temporally consistent 3D human pose and shape from a video[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 1964-1973.
[14]	ZHU W T, MA X X, LIU Z Y, et al. MotionBERT: a unified perspective on learning human motion representations[C]// 2023 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2023: 15085-15099.
[15]	VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]// The 31st International Conference on Neural Information Processing Systems. New York: ACM, 2017: 6000-6010.
[16]	WANG Y P, ZENG Y, LI S H, et al. A Transformer-based 3D human pose estimation method[J]. Journal of Graphics, 2023, 44(1): 139-145 (in Chinese).
[17]	YOU Y X, LIU H, WANG T, et al. Co-evolution of pose and mesh for 3D human body estimation from video[C]// 2023 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2023: 14963-14973.
[18]	LV H, YANG H Y. A 3D human pose estimation approach based on spatio-temporal motion interaction modeling[J]. Journal of Graphics, 2024, 45(1): 159-168 (in Chinese).
[19]	WAN Z N, LI Z J, TIAN M Q, et al. Encoder-decoder with multi-level attention for 3D human shape and pose estimation[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2021: 13033-13042.
[20]	WEI W L, LIN J C, LIU T L, et al. Capturing humans in motion: Temporal-attentive 3D human pose and shape estimation from monocular video[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 13211-13220.
[21]	JIN K M, LIM B S, LEE G H, et al. Kinematic-aware hierarchical attention network for human pose estimation in videos[C]// 2023 IEEE/CVF Winter Conference on Applications of Computer Vision. New York: IEEE Press, 2023: 5725-5734.
[22]	CUI M M, ZHANG K B, SUN Z N. Graph and Skipped Transformer: Exploiting spatial and temporal modeling capacities for efficient 3D human pose estimation[EB/OL]. [2024-05-05]. https://arxiv.org/pdf/2407.02990.pdf.
[23]	TANG Z H, QIU Z F, HAO Y B, et al. 3D human pose estimation with spatio-temporal criss-cross attention[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 4790-4799.
[24]	XU J L, GUO Y J, PENG Y X. FinePOSE: fine-grained prompt-driven 3D human pose estimation via diffusion models[C]// 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2024: 561-570.
[25]	JIAO J B, CHENG X N, CHEN W J, et al. Towards precise 3D human pose estimation with multi-perspective spatial-temporal relational transformer[EB/OL]. [2024-05-05]. https://arxiv.org/pdf/2401.16700.pdf.
[26]	CHEN Y L, WANG Z C, PENG Y X, et al. Cascaded pyramid network for multi-person pose estimation[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 7103-7112.
[27]	XU Y F, ZHANG J, ZHANG Q M, et al. ViTPose: simple vision transformer baselines for human pose estimation[J]. Advances in Neural Information Processing Systems, 2022, 35: 38571-38584.
[28]	KOLOTOUROS N, PAVLAKOS G, BLACK M J, et al. Learning to reconstruct 3D human pose and shape via model-fitting in the loop[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 2252-2261.
[29]	IONESCU C, PAPAVA D, OLARU V, et al. Human3.6M: large scale datasets and predictive methods for 3D human sensing in natural environments[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 36(7): 1325-1339.
[30]	MEHTA D, RHODIN H, CASAS D, et al. Monocular 3D human pose estimation in the wild using improved CNN supervision[C]// 2017 IEEE/CVF International Conference on 3D Vision. New York: IEEE Press, 2017: 506-516.
[31]	MARCARD T V, HENSCHEL R, BLACK M J, et al. Recovering accurate 3D human pose in the wild using IMUs and a moving camera[C]// Computer Vision - ECCV 2018: 15th European Conference. New York: ACM, 2018: 601-617.
