Journal of Graphics ›› 2025, Vol. 46 ›› Issue (4): 746-755. DOI: 10.11996/JG.j.2095-302X.2025040746
• Image Processing and Computer Vision •
YAN Zhuoyue1, LIU Li1,2, FU Xiaodong1,2, LIU Lijun1,2, PENG Wei1,2
Received:2024-11-06
Accepted:2025-03-18
Online:2025-08-30
Published:2025-08-11
Contact: LIU Li
About the author: First author: YAN Zhuoyue (1998-), master's student. Her main research interest is computer vision. E-mail: yanzhuoyue@stu.kust.edu.cn
YAN Zhuoyue, LIU Li, FU Xiaodong, LIU Lijun, PENG Wei. Hierarchical attention spatial-temporal feature fusion algorithm for 3D human pose and shape estimation[J]. Journal of Graphics, 2025, 46(4): 746-755.
URL: http://www.txxb.com.cn/EN/10.11996/JG.j.2095-302X.2025040746
| Name | Version |
|---|---|
| Ubuntu | 18.04 |
| CUDA | 11.3 |
| PyTorch | 2.3.1 |
| Python | 3.8 |
Table 1 Experimental environment version
| Method | MPJPE/mm | PA-MPJPE/mm | ACC-ERR/(mm/s²) |
|---|---|---|---|
| Ref. [ | 65.9 | 41.5 | 18.3 |
| Ref. [ | 76.0 | 53.2 | 15.3 |
| Ref. [ | 62.2 | 41.1 | 5.3 |
| Ref. [ | 56.4 | 38.7 | - |
| Ref. [ | 69.4 | 47.4 | 3.6 |
| Ref. [ | 67.0 | 46.3 | 3.6 |
| Ref. [ | 62.8 | 41.0 | - |
| Ref. [ | 73.2 | 51.0 | 3.6 |
| Ref. [ | 70.4 | 44.5 | 4.8 |
| Ref. [ | 58.3 | 41.3 | 3.8 |
| Ours | 56.1 | 39.5 | 3.4 |
Table 2 Quantitative results on Human3.6M dataset
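
The error metrics reported in these tables can be sketched as follows. This is a minimal NumPy illustration that assumes the conventional definitions of MPJPE, PA-MPJPE (error after a per-frame similarity alignment), and acceleration error (second-order finite differences); it is not the authors' evaluation code. The function names and the `(T, J, 3)` array layout are assumptions made for the example, and the acceleration error is left in per-frame units rather than scaled by a frame rate.

```python
import numpy as np


def mpjpe(pred, gt):
    """Mean per-joint position error for arrays of shape (T, J, 3)."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())


def procrustes_align(pred, gt):
    """Similarity-align one frame `pred` (J, 3) onto `gt` (J, 3)."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    # Optimal rotation from the SVD of the 3x3 cross-covariance matrix.
    u, s, vt = np.linalg.svd(p.T @ g)
    # Flip the last axis if needed so the result is a rotation, not a reflection.
    sign = 1.0 if np.linalg.det(u @ vt) >= 0 else -1.0
    fix = np.array([1.0, 1.0, sign])
    rot = (u * fix) @ vt  # same as u @ diag(fix) @ vt
    scale = (s * fix).sum() / (p ** 2).sum()
    return scale * p @ rot + mu_g


def pa_mpjpe(pred, gt):
    """MPJPE after aligning every predicted frame to its ground truth."""
    aligned = np.stack([procrustes_align(p, g) for p, g in zip(pred, gt)])
    return mpjpe(aligned, gt)


def accel_error(pred, gt):
    """Mean difference of per-joint accelerations (needs T >= 3 frames)."""
    acc_pred = pred[2:] - 2 * pred[1:-1] + pred[:-2]
    acc_gt = gt[2:] - 2 * gt[1:-1] + gt[:-2]
    return float(np.linalg.norm(acc_pred - acc_gt, axis=-1).mean())
```

Under the same assumptions, PVE is the `mpjpe` computation applied to mesh vertices instead of joints, that is, to arrays of shape `(T, V, 3)`.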
| Method | MPJPE/mm | PA-MPJPE/mm | PVE/mm | ACC-ERR/(mm/s²) |
|---|---|---|---|---|
| Ref. [ | 91.9 | 57.6 | 99.1 | 25.4 |
| Ref. [ | 86.9 | 54.7 | - | 11.6 |
| Ref. [ | 86.5 | 52.7 | 102.9 | 7.1 |
| Ref. [ | 79.1 | 45.7 | 92.6 | 17.6 |
| Ref. [ | 84.3 | 52.1 | 99.7 | 7.4 |
| Ref. [ | 80.7 | 50.6 | 96.3 | 6.6 |
| Ref. [ | 85.5 | 50.2 | 99.1 | - |
| Ref. [ | 83.4 | 51.7 | 98.9 | 7.2 |
| Ref. [ | 80.6 | 48.0 | 95.3 | 8.2 |
| Ref. [ | 75.0 | 45.5 | 90.2 | 7.1 |
| Ours | 74.4 | 50.0 | 90.0 | 7.1 |
Table 3 Quantitative results on 3DPW dataset
| Method | MPJPE/mm | PA-MPJPE/mm | ACC-ERR/(mm/s²) |
|---|---|---|---|
| Ref. [ | 103.9 | 68.9 | 27.3 |
| Ref. [ | 96.4 | 65.4 | 11.1 |
| Ref. [ | 97.6 | 63.5 | 8.5 |
| Ref. [ | 83.6 | 56.2 | - |
| Ref. [ | 96.7 | 62.8 | 9.6 |
| Ref. [ | 93.9 | 61.5 | 7.9 |
| Ref. [ | 98.2 | 62.5 | 8.6 |
| Ref. [ | 93.7 | 59.6 | 10.0 |
| Ref. [ | 94.4 | 60.4 | 9.2 |
| Ours | 89.6 | 61.8 | 8.0 |
Table 4 Quantitative results on MPI-INF-3DHP dataset
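| Action | Ref. [ | Ref. [ | Ours |
|---|---|---|---|
| Directions | (image) | (image) | (image) |
| Sitdown | (image) | (image) | (image) |
| Walk | (image) | (image) | (image) |
| Walkdog | (image) | (image) | (image) |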
Table 5 Qualitative results on Human3.6M dataset
| Model | MPJPE/mm | PA-MPJPE/mm | ACC-ERR/(mm/s²) |
|---|---|---|---|
| Base | 78.4 | 52.8 | 8.9 |
| Base + hierarchical | 75.6 | 52.0 | 8.9 |
| Base + hierarchical + fusion | 74.4 | 50.0 | 7.1 |
Table 6 Validity of each module on 3DPW dataset
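
The gains in Table 6 can also be read as relative reductions over the baseline. The short sketch below uses only the numbers in the table; the dictionary keys are illustrative labels for its three rows, and the helper name `relative_reduction` is an assumption made for the example.

```python
# (MPJPE, PA-MPJPE, ACC-ERR) for each row of Table 6, in table order.
ablation = {
    "Base": (78.4, 52.8, 8.9),
    "Base + hierarchical": (75.6, 52.0, 8.9),
    "Base + hierarchical + fusion": (74.4, 50.0, 7.1),
}
metrics = ("MPJPE", "PA-MPJPE", "ACC-ERR")


def relative_reduction(base, new):
    """Percentage by which `new` is lower than `base`."""
    return 100.0 * (base - new) / base


base = ablation["Base"]
for name, values in ablation.items():
    gains = [relative_reduction(b, v) for b, v in zip(base, values)]
    print(name, ", ".join(f"{m}: {g:.1f}%" for m, g in zip(metrics, gains)))
```

On these numbers the full model lowers MPJPE by about 5.1% and the acceleration error by about 20.2% relative to the baseline, while the hierarchical module alone leaves the acceleration error unchanged.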
| T | MPJPE/mm | PA-MPJPE/mm | ACC-ERR/(mm/s²) |
|---|---|---|---|
| 4 | 59.1 | 43.7 | 11.1 |
| 8 | 59.9 | 42.8 | 7.1 |
| 16 | 56.1 | 39.5 | 3.4 |
Table 7 Parameter verification experiment
| [1] | TIAN Y T, ZHANG H W, LIU Y B, et al. Recovering 3D human mesh from monocular images: a survey[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(12): 15406-15425. |
| [2] | YE V, PAVLAKOS G, MALIK J, et al. Decoupling human and camera motion from videos in the wild[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 21222-21232. |
| [3] | KANAZAWA A, BLACK M J, JACOBS D W, et al. End-to-end recovery of human shape and pose[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 7122-7131. |
| [4] | KANAZAWA A, ZHANG J Y, FELSEN P, et al. Learning 3D human dynamics from video[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 5614-5623. |
| [5] | YAO W, ZHANG H W, SUN Y L, et al. STAF: 3D human mesh recovery from video with spatio-temporal alignment fusion[EB/OL]. [2024-05-05]. https://arxiv.org/abs/2401.01730.pdf. |
| [6] | SHEN X L, YANG Z X, WANG X H, et al. Global-to-local modeling for video-based 3D human pose and shape estimation[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 8887-8896. |
| [7] | YANG S, HENG W, LIU G, et al. Capturing the motion of every joint: 3D human pose and shape estimation with independent tokens[EB/OL]. [2024-05-05]. https://arxiv.org/pdf/2303.00298.pdf. |
| [8] | ZHANG B Y, MA K H, WU S P, et al. Two-stage co-segmentation network based on discriminative representation for recovering human mesh from videos[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 5662-5670. |
| [9] | LEE M, LEE H, KIM B, et al. UNSPAT: Uncertainty-guided spatio-temporal transformer for 3D human pose and shape estimation on videos[C]// 2024 IEEE/CVF Winter Conference on Applications of Computer Vision. New York: IEEE Press, 2024: 3004-3013. |
| [10] | KOCABAS M, ATHANASIOU N, BLACK M J. VIBE: Video inference for human body pose and shape estimation[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 5253-5263. |
| [11] | MAHMOOD N, GHORBANI N, TROJE N F, et al. AMASS: Archive of motion capture as surface shapes[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 5442-5451. |
| [12] | LUO Z Y, GOLESTANEH S A, KITANI K M. 3D human motion estimation via motion compression and refinement[C]// Computer Vision - ACCV 2020: 15th Asian Conference. New York: ACM, 2020: 324-340. |
| [13] | CHOI H, MOON G, CHANG J Y, et al. Beyond static features for temporally consistent 3D human pose and shape from a video[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 1964-1973. |
| [14] | ZHU W T, MA X X, LIU Z Y, et al. MotionBERT: a unified perspective on learning human motion representations[C]// 2023 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2023: 15085-15099. |
| [15] | VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]// The 31st International Conference on Neural Information Processing Systems. New York: ACM, 2017: 6000-6010. |
| [16] | WANG Y P, ZENG Y, LI S H, et al. A Transformer-based 3D human pose estimation method[J]. Journal of Graphics, 2023, 44(1): 139-145 (in Chinese). |
| [17] | YOU Y X, LIU H, WANG T, et al. Co-evolution of pose and mesh for 3D human body estimation from video[C]// 2023 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2023: 14963-14973. |
| [18] | LV H, YANG H Y. A 3D human pose estimation approach based on spatio-temporal motion interaction modeling[J]. Journal of Graphics, 2024, 45(1): 159-168 (in Chinese). |
| [19] | WAN Z N, LI Z J, TIAN M Q, et al. Encoder-decoder with multi-level attention for 3D human shape and pose estimation[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2021: 13033-13042. |
| [20] | WEI W L, LIN J C, LIU T L, et al. Capturing humans in motion: Temporal-attentive 3D human pose and shape estimation from monocular video[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 13211-13220. |
| [21] | JIN K M, LIM B S, LEE G H, et al. Kinematic-aware hierarchical attention network for human pose estimation in videos[C]// 2023 IEEE/CVF Winter Conference on Applications of Computer Vision. New York: IEEE Press, 2023: 5725-5734. |
| [22] | CUI M M, ZHANG K B, SUN Z N. Graph and Skipped Transformer: Exploiting spatial and temporal modeling capacities for efficient 3D human pose estimation[EB/OL]. [2024-05-05]. https://arxiv.org/pdf/2407.02990.pdf. |
| [23] | TANG Z H, QIU Z F, HAO Y B, et al. 3D human pose estimation with spatio-temporal criss-cross attention[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 4790-4799. |
| [24] | XU J L, GUO Y J, PENG Y X. FinePOSE: fine-grained prompt-driven 3D human pose estimation via diffusion models[C]// 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2024: 561-570. |
| [25] | JIAO J B, CHENG X N, CHEN W J, et al. Towards precise 3D human pose estimation with multi-perspective spatial-temporal relational transformer[EB/OL]. [2024-05-05]. https://arxiv.org/pdf/2401.16700.pdf. |
| [26] | CHEN Y L, WANG Z C, PENG Y X, et al. Cascaded pyramid network for multi-person pose estimation[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 7103-7112. |
| [27] | XU Y F, ZHANG J, ZHANG Q M, et al. ViTPose: simple vision transformer baselines for human pose estimation[J]. Advances in Neural Information Processing Systems, 2022, 35: 38571-38584. |
| [28] | KOLOTOUROS N, PAVLAKOS G, BLACK M J, et al. Learning to reconstruct 3D human pose and shape via model-fitting in the loop[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 2252-2261. |
| [29] | IONESCU C, PAPAVA D, OLARU V, et al. Human3.6M: large scale datasets and predictive methods for 3D human sensing in natural environments[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 36(7): 1325-1339. |
| [30] | MEHTA D, RHODIN H, CASAS D, et al. Monocular 3D human pose estimation in the wild using improved CNN supervision[C]// 2017 IEEE/CVF International Conference on 3D Vision. New York: IEEE Press, 2017: 506-516. |
| [31] | MARCARD T V, HENSCHEL R, BLACK M J, et al. Recovering accurate 3D human pose in the wild using IMUs and a moving camera[C]// Computer Vision - ECCV 2018: 15th European Conference. New York: ACM, 2018: 601-617. |