Journal of Graphics ›› 2025, Vol. 46 ›› Issue (4): 746-755.DOI: 10.11996/JG.j.2095-302X.2025040746
• Image Processing and Computer Vision •
YAN Zhuoyue1, LIU Li1,2, FU Xiaodong1,2, LIU Lijun1,2, PENG Wei1,2
Received: 2024-11-06
Accepted: 2025-03-18
Online: 2025-08-30
Published: 2025-08-11
Contact: LIU Li
About author: YAN Zhuoyue (1998-), master student. Her main research interests cover computer vision. E-mail: yanzhuoyue@stu.kust.edu.cn
YAN Zhuoyue, LIU Li, FU Xiaodong, LIU Lijun, PENG Wei. Hierarchical attention spatial-temporal feature fusion algorithm for 3D human pose and shape estimation[J]. Journal of Graphics, 2025, 46(4): 746-755.
URL: http://www.txxb.com.cn/EN/10.11996/JG.j.2095-302X.2025040746
Name | Version |
---|---|
Ubuntu | 18.04 |
CUDA | 11.3 |
PyTorch | 2.3.1 |
Python | 3.8 |

Table 1 Experimental environment versions
Method | MPJPE/mm | PA-MPJPE/mm | ACC-ERR/(mm/s²) |
---|---|---|---|
Ref. [ ] | 65.9 | 41.5 | 18.3 |
Ref. [ ] | 76.0 | 53.2 | 15.3 |
Ref. [ ] | 62.2 | 41.1 | 5.3 |
Ref. [ ] | 56.4 | 38.7 | - |
Ref. [ ] | 69.4 | 47.4 | 3.6 |
Ref. [ ] | 67.0 | 46.3 | 3.6 |
Ref. [ ] | 62.8 | 41.0 | - |
Ref. [ ] | 73.2 | 51.0 | 3.6 |
Ref. [ ] | 70.4 | 44.5 | 4.8 |
Ref. [ ] | 58.3 | 41.3 | 3.8 |
Ours | 56.1 | 39.5 | 3.4 |

Table 2 Quantitative results on Human3.6M dataset
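The error metrics reported above follow the community's standard definitions: MPJPE is the mean Euclidean distance between predicted and ground-truth 3D joints, and PA-MPJPE is the same error after a Procrustes (similarity) alignment that removes global scale, rotation, and translation. The following is a minimal sketch of how these metrics are typically computed, not the paper's actual evaluation code; array shapes and function names are illustrative:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance
    between predicted and ground-truth joints, shape (J, 3)."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def pa_mpjpe(pred, gt):
    """MPJPE after Procrustes (similarity) alignment, which removes
    global scale, rotation, and translation before comparison."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    p, g = pred - mu_p, gt - mu_g
    # Optimal rotation/scale from the SVD of the cross-covariance matrix
    U, S, Vt = np.linalg.svd(p.T @ g)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:   # guard against reflections
        Vt[-1] *= -1
        S[-1] *= -1
        R = Vt.T @ U.T
    scale = S.sum() / (p ** 2).sum()
    aligned = scale * p @ R.T + mu_g
    return mpjpe(aligned, gt)
```

PA-MPJPE of a prediction that differs from ground truth only by a similarity transform is zero, which is why it is consistently lower than MPJPE in the tables.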
Method | MPJPE/mm | PA-MPJPE/mm | PVE/mm | ACC-ERR/(mm/s²) |
---|---|---|---|---|
Ref. [ ] | 91.9 | 57.6 | 99.1 | 25.4 |
Ref. [ ] | 86.9 | 54.7 | - | 11.6 |
Ref. [ ] | 86.5 | 52.7 | 102.9 | 7.1 |
Ref. [ ] | 79.1 | 45.7 | 92.6 | 17.6 |
Ref. [ ] | 84.3 | 52.1 | 99.7 | 7.4 |
Ref. [ ] | 80.7 | 50.6 | 96.3 | 6.6 |
Ref. [ ] | 85.5 | 50.2 | 99.1 | - |
Ref. [ ] | 83.4 | 51.7 | 98.9 | 7.2 |
Ref. [ ] | 80.6 | 48.0 | 95.3 | 8.2 |
Ref. [ ] | 75.0 | 45.5 | 90.2 | 7.1 |
Ours | 74.4 | 50.0 | 90.0 | 7.1 |

Table 3 Quantitative results on 3DPW dataset
Method | MPJPE/mm | PA-MPJPE/mm | ACC-ERR/(mm/s²) |
---|---|---|---|
Ref. [ ] | 103.9 | 68.9 | 27.3 |
Ref. [ ] | 96.4 | 65.4 | 11.1 |
Ref. [ ] | 97.6 | 63.5 | 8.5 |
Ref. [ ] | 83.6 | 56.2 | - |
Ref. [ ] | 96.7 | 62.8 | 9.6 |
Ref. [ ] | 93.9 | 61.5 | 7.9 |
Ref. [ ] | 98.2 | 62.5 | 8.6 |
Ref. [ ] | 93.7 | 59.6 | 10.0 |
Ref. [ ] | 94.4 | 60.4 | 9.2 |
Ours | 89.6 | 61.8 | 8.0 |

Table 4 Quantitative results on MPI-INF-3DHP dataset
[Qualitative comparison images for the actions Directions, Sitdown, Walk, and Walkdog, contrasting two reference methods with the proposed method; images not reproduced]

Table 5 Qualitative results on Human3.6M dataset
Model | MPJPE/mm | PA-MPJPE/mm | ACC-ERR/(mm/s²) |
---|---|---|---|
Base | 78.4 | 52.8 | 8.9 |
Base + hierarchical attention | 75.6 | 52.0 | 8.9 |
Base + hierarchical attention + fusion | 74.4 | 50.0 | 7.1 |

Table 6 Validity of each module on 3DPW dataset
T | MPJPE/mm | PA-MPJPE/mm | ACC-ERR/(mm/s2) |
---|---|---|---|
4 | 59.1 | 43.7 | 11.1 |
8 | 59.9 | 42.8 | 7.1 |
16 | 56.1 | 39.5 | 3.4 |
Table 7 Parameter verification experiment
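ACC-ERR, reported throughout the tables above, quantifies temporal smoothness: it is the mean difference between predicted and ground-truth per-joint accelerations across a frame sequence. A sketch of a typical implementation using second-order finite differences is shown below (frame-rate scaling omitted; the function name and array shapes are illustrative, not the paper's code):

```python
import numpy as np

def accel_error(pred_seq, gt_seq):
    """Acceleration error (up to a frame-rate factor): mean distance
    between predicted and ground-truth per-joint accelerations,
    estimated by second-order finite differences over time.
    Inputs have shape (T, J, 3)."""
    # Second difference over time: a_t = x_{t-1} - 2*x_t + x_{t+1}
    accel_p = pred_seq[:-2] - 2 * pred_seq[1:-1] + pred_seq[2:]
    accel_g = gt_seq[:-2] - 2 * gt_seq[1:-1] + gt_seq[2:]
    return np.linalg.norm(accel_p - accel_g, axis=-1).mean()
```

A prediction that is correct on average but jitters frame-to-frame scores a high ACC-ERR even when its MPJPE is low, which is why the two metrics are reported together.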
[1] | TIAN Y T, ZHANG H W, LIU Y B, et al. Recovering 3D human mesh from monocular images: a survey[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(12): 15406-15425. |
[2] | YE V, PAVLAKOS G, MALIK J, et al. Decoupling human and camera motion from videos in the wild[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 21222-21232. |
[3] | KANAZAWA A, BLACK M J, JACOBS D W, et al. End-to-end recovery of human shape and pose[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 7122-7131. |
[4] | KANAZAWA A, ZHANG J Y, FELSEN P, et al. Learning 3D human dynamics from video[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 5614-5623. |
[5] | YAO W, ZHANG H W, SUN Y L, et al. STAF: 3D human mesh recovery from video with spatio-temporal alignment fusion[EB/OL]. [2024-05-05]. https://arxiv.org/abs/2401.01730.pdf. |
[6] | SHEN X L, YANG Z X, WANG X H, et al. Global-to-local modeling for video-based 3D human pose and shape estimation[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 8887-8896. |
[7] | YANG S, HENG W, LIU G, et al. Capturing the motion of every joint: 3D human pose and shape estimation with independent tokens[EB/OL]. [2024-05-05]. https://arxiv.org/pdf/2303.00298.pdf. |
[8] | ZHANG B Y, MA K H, WU S P, et al. Two-stage co-segmentation network based on discriminative representation for recovering human mesh from videos[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 5662-5670. |
[9] | LEE M, LEE H, KIM B, et al. UNSPAT: Uncertainty-guided spatio-temporal transformer for 3D human pose and shape estimation on videos[C]// 2024 IEEE/CVF Winter Conference on Applications of Computer Vision. New York: IEEE Press, 2024: 3004-3013. |
[10] | KOCABAS M, ATHANASIOU N, BLACK M J. VIBE: video inference for human body pose and shape estimation[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 5253-5263. |
[11] | MAHMOOD N, GHORBANI N, TROJE N F, et al. AMASS: Archive of motion capture as surface shapes[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 5442-5451. |
[12] | LUO Z Y, GOLESTANEH S A, KITANI K M. 3D human motion estimation via motion compression and refinement[C]// Computer Vision - ACCV 2020: 15th Asian Conference. New York: ACM, 2020: 324-340. |
[13] | CHOI H, MOON G, CHANG J Y, et al. Beyond static features for temporally consistent 3D human pose and shape from a video[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 1964-1973. |
[14] | ZHU W T, MA X X, LIU Z Y, et al. MotionBERT: a unified perspective on learning human motion representations[C]// 2023 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2023: 15085-15099. |
[15] | VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]// The 31st International Conference on Neural Information Processing Systems. New York: ACM, 2017: 6000-6010. |
[16] | WANG Y P, ZENG Y, LI S H, et al. A Transformer-based 3D human pose estimation method[J]. Journal of Graphics, 2023, 44(1): 139-145 (in Chinese). |
[17] | YOU Y X, LIU H, WANG T, et al. Co-evolution of pose and mesh for 3D human body estimation from video[C]// 2023 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2023: 14963-14973. |
[18] | LV H, YANG H Y. A 3D human pose estimation approach based on spatio-temporal motion interaction modeling[J]. Journal of Graphics, 2024, 45(1): 159-168 (in Chinese). |
[19] | WAN Z N, LI Z J, TIAN M Q, et al. Encoder-decoder with multi-level attention for 3D human shape and pose estimation[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2021: 13033-13042. |
[20] | WEI W L, LIN J C, LIU T L, et al. Capturing humans in motion: Temporal-attentive 3D human pose and shape estimation from monocular video[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 13211-13220. |
[21] | JIN K M, LIM B S, LEE G H, et al. Kinematic-aware hierarchical attention network for human pose estimation in videos[C]// 2023 IEEE/CVF Winter Conference on Applications of Computer Vision. New York: IEEE Press, 2023: 5725-5734. |
[22] | CUI M M, ZHANG K B, SUN Z N. Graph and skipped Transformer: exploiting spatial and temporal modeling capacities for efficient 3D human pose estimation[EB/OL]. [2024-05-05]. https://arxiv.org/pdf/2407.02990.pdf. |
[23] | TANG Z H, QIU Z F, HAO Y B, et al. 3D human pose estimation with spatio-temporal criss-cross attention[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 4790-4799. |
[24] | XU J L, GUO Y J, PENG Y X. FinePOSE: fine-grained prompt-driven 3D human pose estimation via diffusion models[C]// 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2024: 561-570. |
[25] | JIAO J B, CHENG X N, CHEN W J, et al. Towards precise 3D human pose estimation with multi-perspective spatial-temporal relational transformer[EB/OL]. [2024-05-05]. https://arxiv.org/pdf/2401.16700.pdf. |
[26] | CHEN Y L, WANG Z C, PENG Y X, et al. Cascaded pyramid network for multi-person pose estimation[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 7103-7112. |
[27] | XU Y F, ZHANG J, ZHANG Q M, et al. Vitpose: simple vision transformer baselines for human pose estimation[J]. Advances in Neural Information Processing Systems, 2022, 35: 38571-38584. |
[28] | KOLOTOUROS N, PAVLAKOS G, BLACK M J, et al. Learning to reconstruct 3D human pose and shape via model-fitting in the loop[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 2252-2261. |
[29] | IONESCU C, PAPAVA D, OLARU V, et al. Human3.6m: large scale datasets and predictive methods for 3D human sensing in natural environments[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 36(7): 1325-1339. |
[30] | MEHTA D, RHODIN H, CASAS D, et al. Monocular 3D human pose estimation in the wild using improved CNN supervision[C]// 2017 IEEE/CVF International Conference on 3D Vision. New York: IEEE Press, 2017: 506-516. |
[31] | MARCARD T V, HENSCHEL R, BLACK M J, et al. Recovering accurate 3D human pose in the wild using imus and a moving camera[C]// Computer Vision - ECCV 2018: 15th European Conference. New York: ACM, 2018: 601-617. |