Hierarchical attention spatial-temporal feature fusion algorithm for 3D human pose and shape estimation

doi:10.11996/JG.j.2095-302X.2025040746

Abstract

Monocular video-based 3D human pose and shape estimation plays an important role in virtual try-on and special-effects production. To address insufficient human body modeling, simplistic spatial-temporal feature representation, and limited estimation accuracy in 3D human pose and shape estimation from monocular video, a hierarchical-attention spatial-temporal feature-fusion algorithm is proposed. First, hierarchical attention is applied to model human body parts hierarchically in space, yielding learnable human pose spatial features. Second, these learnable spatial features are combined with a parametric human template to guide the spatial-temporal modeling of human motion temporal features, achieving spatial-temporal feature fusion. Finally, a method for co-optimizing 3D human pose and shape is proposed, and a more accurate and smoother 3D human mesh is regressed by a multilayer perceptron. Experimental results on the Human3.6M dataset show an MPJPE of 56.1 mm and an ACC-ERR of 3.4 mm/s², reductions of 0.5% and 5.6%, respectively, compared with the state-of-the-art method, so the algorithm improves the accuracy of 3D human pose and shape estimation and generates an accurate, smooth 3D human mesh. Furthermore, tests on 3DPW and Internet videos confirm the robustness of the proposed method under fast motion.
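The abstract reports MPJPE and ACC-ERR without defining them. The following is a minimal sketch of how these two metrics are commonly computed, not the authors' evaluation code. The function names and the (frames, joints, 3) array layout are illustrative assumptions, and the acceleration is a per-frame second finite difference, so converting it to mm/s² would additionally require the video frame rate.

```python
import numpy as np


def mpjpe(pred, gt):
    """Mean per-joint position error.

    pred, gt: arrays of shape (frames, joints, 3), assumed to be in mm.
    Returns the mean Euclidean distance over all joints and frames.
    """
    return float(np.linalg.norm(pred - gt, axis=-1).mean())


def accel_error(pred, gt):
    """Mean acceleration error between two joint trajectories.

    Acceleration is approximated by the second finite difference along
    the frame axis (units: mm per frame squared under the assumption above).
    """
    acc_pred = pred[2:] - 2.0 * pred[1:-1] + pred[:-2]
    acc_gt = gt[2:] - 2.0 * gt[1:-1] + gt[:-2]
    return float(np.linalg.norm(acc_pred - acc_gt, axis=-1).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gt = rng.normal(size=(16, 17, 3))          # hypothetical 16 frames, 17 joints
    pred = gt + rng.normal(scale=0.01, size=gt.shape)
    print(mpjpe(pred, gt), accel_error(pred, gt))
```

A constant offset between prediction and ground truth raises MPJPE but leaves the acceleration error at zero, which is why the two metrics are reported together: one measures accuracy, the other temporal smoothness.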

Key words: 3D human pose and shape estimation, hierarchical attention, spatial-temporal modeling, spatial-temporal feature fusion, pose and shape co-optimization

CLC Number: TP391.41

YAN Zhuoyue, LIU Li, FU Xiaodong, LIU Lijun, PENG Wei. Hierarchical attention spatial-temporal feature fusion algorithm for 3D human pose and shape estimation[J]. Journal of Graphics, 2025, 46(4): 746-755.

References 31

[1]	TIAN Y T, ZHANG H W, LIU Y B, et al. Recovering 3D human mesh from monocular images: a survey[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(12): 15406-15425.
[2]	YE V, PAVLAKOS G, MALIK J, et al. Decoupling human and camera motion from videos in the wild[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 21222-21232.
[3]	KANAZAWA A, BLACK M J, JACOBS D W, et al. End-to-end recovery of human shape and pose[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 7122-7131.
[4]	KANAZAWA A, ZHANG J Y, FELSEN P, et al. Learning 3D human dynamics from video[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 5614-5623.
[5]	YAO W, ZHANG H W, SUN Y L, et al. STAF: 3D human mesh recovery from video with spatio-temporal alignment fusion[EB/OL]. [2024-05-05]. https://arxiv.org/abs/2401.01730.pdf.
[6]	SHEN X L, YANG Z X, WANG X H, et al. Global-to-local modeling for video-based 3D human pose and shape estimation[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 8887-8896.
[7]	YANG S, HENG W, LIU G, et al. Capturing the motion of every joint: 3D human pose and shape estimation with independent tokens[EB/OL]. [2024-05-05]. https://arxiv.org/pdf/2303.00298.pdf.
[8]	ZHANG B Y, MA K H, WU S P, et al. Two-stage co-segmentation network based on discriminative representation for recovering human mesh from videos[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 5662-5670.
[9]	LEE M, LEE H, KIM B, et al. UNSPAT: Uncertainty-guided spatio-temporal transformer for 3D human pose and shape estimation on videos[C]// 2024 IEEE/CVF Winter Conference on Applications of Computer Vision. New York: IEEE Press, 2024: 3004-3013.
[10]	KOCABAS M, ATHANASIOU N, BLACK M J. VIBE: Video inference for human body pose and shape estimation[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 5253-5263.
[11]	MAHMOOD N, GHORBANI N, TROJE N F, et al. AMASS: Archive of motion capture as surface shapes[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 5442-5451.
[12]	LUO Z Y, GOLESTANEH S A, KITANI K M. 3D human motion estimation via motion compression and refinement[C]// Computer Vision - ACCV 2020: 15th Asian Conference. New York: ACM, 2020: 324-340.
[13]	CHOI H, MOON G, CHANG J Y, et al. Beyond static features for temporally consistent 3D human pose and shape from a video[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 1964-1973.
[14]	ZHU W T, MA X X, LIU Z Y, et al. MotionBERT: a unified perspective on learning human motion representations[C]// 2023 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2023: 15085-15099.
[15]	VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]// The 31st International Conference on Neural Information Processing Systems. New York: ACM, 2017: 6000-6010.
[16]	WANG Y P, ZENG Y, LI S H, et al. A Transformer-based 3D human pose estimation method[J]. Journal of Graphics, 2023, 44(1): 139-145 (in Chinese).
[17]	YOU Y X, LIU H, WANG T, et al. Co-evolution of pose and mesh for 3D human body estimation from video[C]// 2023 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2023: 14963-14973.
[18]	LV H, YANG H Y. A 3D human pose estimation approach based on spatio-temporal motion interaction modeling[J]. Journal of Graphics, 2024, 45(1): 159-168 (in Chinese).
[19]	WAN Z N, LI Z J, TIAN M Q, et al. Encoder-decoder with multi-level attention for 3D human shape and pose estimation[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2021: 13033-13042.
[20]	WEI W L, LIN J C, LIU T L, et al. Capturing humans in motion: Temporal-attentive 3D human pose and shape estimation from monocular video[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 13211-13220.
[21]	JIN K M, LIM B S, LEE G H, et al. Kinematic-aware hierarchical attention network for human pose estimation in videos[C]// 2023 IEEE/CVF Winter Conference on Applications of Computer Vision. New York: IEEE Press, 2023: 5725-5734.
[22]	CUI M M, ZHANG K B, SUN Z N. Graph and Skipped Transformer: Exploiting spatial and temporal modeling capacities for efficient 3D human pose estimation[EB/OL]. [2024-05-05]. https://arxiv.org/pdf/2407.02990.pdf.
[23]	TANG Z H, QIU Z F, HAO Y B, et al. 3D human pose estimation with spatio-temporal criss-cross attention[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 4790-4799.
[24]	XU J L, GUO Y J, PENG Y X. FinePOSE: fine-grained prompt-driven 3D human pose estimation via diffusion models[C]// 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2024: 561-570.
[25]	JIAO J B, CHENG X N, CHEN W J, et al. Towards precise 3D human pose estimation with multi-perspective spatial-temporal relational transformer[EB/OL]. [2024-05-05]. https://arxiv.org/pdf/2401.16700.pdf.
[26]	CHEN Y L, WANG Z C, PENG Y X, et al. Cascaded pyramid network for multi-person pose estimation[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 7103-7112.
[27]	XU Y F, ZHANG J, ZHANG Q M, et al. ViTPose: simple vision transformer baselines for human pose estimation[J]. Advances in Neural Information Processing Systems, 2022, 35: 38571-38584.
[28]	KOLOTOUROS N, PAVLAKOS G, BLACK M J, et al. Learning to reconstruct 3D human pose and shape via model-fitting in the loop[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 2252-2261.
[29]	IONESCU C, PAPAVA D, OLARU V, et al. Human3.6M: large scale datasets and predictive methods for 3D human sensing in natural environments[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 36(7): 1325-1339.
[30]	MEHTA D, RHODIN H, CASAS D, et al. Monocular 3D human pose estimation in the wild using improved CNN supervision[C]// 2017 IEEE/CVF International Conference on 3D Vision. New York: IEEE Press, 2017: 506-516.
[31]	MARCARD T V, HENSCHEL R, BLACK M J, et al. Recovering accurate 3D human pose in the wild using IMUs and a moving camera[C]// Computer Vision - ECCV 2018: 15th European Conference. New York: ACM, 2018: 601-617.

Name	Version
Ubuntu	18.04
CUDA	11.3
PyTorch	2.3.1
Python	3.8

Method	MPJPE/mm	PA-MPJPE/mm	ACC-ERR/(mm/s²)
Ref. [10]	65.9	41.5	18.3
Ref. [12]	76.0	53.2	15.3
Ref. [13]	62.2	41.1	5.3
Ref. [19]	56.4	38.7	-
Ref. [20]	69.4	47.4	3.6
Ref. [6]	67.0	46.3	3.6
Ref. [14]	62.8	41.0	-
Ref. [8]	73.2	51.0	3.6
Ref. [5]	70.4	44.5	4.8
Ref. [9]	58.3	41.3	3.8
Proposed method	56.1	39.5	3.4
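
The 0.5% and 5.6% reductions quoted in the abstract can be reproduced from the table above, assuming it reports the Human3.6M results and that the baselines are the best prior values in each column (MPJPE 56.4 mm from reference [19]; ACC-ERR 3.6 mm/s² from references [20], [6], and [8]). The choice of baselines is an inference from the numbers, not something the source states, and the helper name is illustrative.

```python
def relative_reduction(baseline, ours):
    """Percentage by which `ours` is lower than `baseline`."""
    return (baseline - ours) / baseline * 100.0


# Assumed baselines: best prior value in each column of the table.
mpjpe_drop = relative_reduction(56.4, 56.1)   # MPJPE in mm
acc_drop = relative_reduction(3.6, 3.4)       # ACC-ERR in mm/s^2

print(f"MPJPE reduction: {mpjpe_drop:.1f}%")    # prints 0.5%
print(f"ACC-ERR reduction: {acc_drop:.1f}%")    # prints 5.6%
```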

Method	MPJPE/mm	PA-MPJPE/mm	PVE/mm	ACC-ERR/(mm/s²)
Ref. [10]	91.9	57.6	99.1	25.4
Ref. [12]	86.9	54.7	-	11.6
Ref. [13]	86.5	52.7	102.9	7.1
Ref. [19]	79.1	45.7	92.6	17.6
Ref. [20]	84.3	52.1	99.7	7.4
Ref. [6]	80.7	50.6	96.3	6.6
Ref. [14]	85.5	50.2	99.1	-
Ref. [8]	83.4	51.7	98.9	7.2
Ref. [5]	80.6	48.0	95.3	8.2
Ref. [9]	75.0	45.5	90.2	7.1
Proposed method	74.4	50.0	90.0	7.1