Journal of Graphics ›› 2024, Vol. 45 ›› Issue (1): 159-168. DOI: 10.11996/JG.j.2095-302X.2024010159
• Computer Graphics and Virtual Reality •
Received: 2023-07-25
Accepted: 2023-10-28
Online: 2024-02-29
Published: 2024-02-29
Contact: YANG Hongyu (1990-), associate professor, Ph.D. Her main research interests cover computer vision and pattern recognition, etc.
About author: LV Heng (2001-), master student. His main research interests cover computer vision and machine learning. E-mail: 19373716@buaa.edu.cn
LV Heng, YANG Hongyu. A 3D human pose estimation approach based on spatio-temporal motion interaction modeling[J]. Journal of Graphics, 2024, 45(1): 159-168.
URL: http://www.txxb.com.cn/EN/10.11996/JG.j.2095-302X.2024010159
Fig. 1 Architecture of the Transformer-based joint motion information interaction model ((a) Overall structure; (b) Structure of the spatial Transformer encoder)
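As a rough illustration of the spatial encoder in Fig. 1(b): within each frame, the 2D joints are treated as tokens so that self-attention can model joint-to-joint dependencies. The following is a minimal PyTorch sketch under that reading; the layer count, head count, and embedding dimension follow Table 2 below, but the module structure, joint count (J = 17 for Human3.6M), and all names are our assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class SpatialTransformerEncoder(nn.Module):
    """Minimal sketch of a per-frame spatial encoder: the J joints of one frame
    become J tokens, and self-attention models joint-to-joint relations.
    Hyperparameters follow Table 2 (6 layers, 8 heads, dim 512); the rest is assumed."""
    def __init__(self, num_joints=17, in_dim=2, embed_dim=512, num_layers=6, num_heads=8):
        super().__init__()
        self.joint_embed = nn.Linear(in_dim, embed_dim)   # lift each (x, y) joint to a token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_joints, embed_dim))  # learnable joint positions
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, joints_2d):
        # joints_2d: (batch * frames, num_joints, 2)
        tokens = self.joint_embed(joints_2d) + self.pos_embed
        return self.encoder(tokens)       # (batch * frames, num_joints, embed_dim)

# Usage: a 27-frame receptive field folded into the batch dimension
x = torch.randn(8 * 27, 17, 2)
features = SpatialTransformerEncoder()(x)  # -> (216, 17, 512)
```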
Name | Version
---|---
Ubuntu | 20.04
CUDA | 11.2
PyTorch | 1.9
Python | 3.9
cuDNN | 7

Table 1 Experimental environment versions
Parameter | Value
---|---
Receptive field (frames) | 27
Temporal Transformer encoder layers | 3
Spatial Transformer encoder layers | 6
Motion Transformer encoder layers | 2
Learning rate | 0.00014
Number of attention heads | 8
Feature dimension of the 2D joint embedding | 512
Feature dimension of the motion Transformer encoder | 32

Table 2 Model parameter configuration
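For reference, the settings of Table 2 can be gathered into a single configuration object. The sketch below only transcribes the table; the field names are illustrative, not taken from the paper's code.

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    # Values transcribed from Table 2; field names are our own.
    receptive_field: int = 27     # input frames per prediction window
    temporal_layers: int = 3      # temporal Transformer encoder layers
    spatial_layers: int = 6       # spatial Transformer encoder layers
    motion_layers: int = 2        # motion Transformer encoder layers
    learning_rate: float = 1.4e-4
    num_heads: int = 8            # attention heads
    joint_embed_dim: int = 512    # feature dim the 2D joints are mapped to
    motion_embed_dim: int = 32    # feature dim of the motion encoder

cfg = ModelConfig()
```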
Action | MPJPE/mm | P-MPJPE/mm | MPJVE/mm | Total frames | Total inference time/ms
---|---|---|---|---|---
Dir. | 42.2 | 33.1 | 3.1 | 26 568 | 314
Disc. | 46.1 | 36.0 | 3.3 | 64 476 | 348
Eat. | 44.2 | 35.3 | 2.5 | 39 528 | 368
Greet | 44.3 | 35.6 | 3.6 | 30 780 | 328
Phone | 49.0 | 37.4 | 2.4 | 56 268 | 275
Photo | 54.1 | 41.6 | 3.0 | 29 484 | 294
Pose | 43.4 | 33.0 | 2.9 | 27 432 | 344
Purch. | 44.1 | 33.2 | 3.3 | 19 440 | 305
Sit | 56.0 | 45.9 | 2.1 | 40 392 | 336
SitD. | 65.0 | 51.2 | 2.9 | 33 588 | 303
Smoke | 47.5 | 37.5 | 2.5 | 55 836 | 311
Wait | 43.5 | 32.1 | 2.7 | 38 016 | 373
WalkD. | 48.9 | 37.3 | 4.0 | 28 512 | 345
Walk | 32.5 | 25.1 | 3.4 | 29 484 | 284
WalkT. | 34.3 | 27.4 | 3.0 | 26 460 | 317
Average | 46.3 | 36.1 | 3.0 | 36 418 | 323

Table 3 Experimental results on the Human3.6M dataset
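Table 3 reports three standard metrics: MPJPE (mean per-joint position error), P-MPJPE (MPJPE after a rigid Procrustes alignment of the prediction to the ground truth), and MPJVE (the error of frame-to-frame joint velocities). A minimal NumPy sketch of how these are conventionally computed (not the authors' evaluation code):

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance in mm.
    pred, gt: (frames, joints, 3)."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def p_mpjpe(pred, gt):
    """MPJPE after per-frame similarity (Procrustes) alignment of pred to gt."""
    errs = []
    for p, g in zip(pred, gt):
        p0, g0 = p - p.mean(0), g - g.mean(0)        # center both point sets
        U, s, Vt = np.linalg.svd(g0.T @ p0)          # Kabsch rotation via SVD
        if np.linalg.det(U @ Vt) < 0:                # avoid an improper reflection
            U[:, -1] *= -1
            s[-1] *= -1
        R = U @ Vt
        scale = s.sum() / (p0 ** 2).sum()            # optimal isotropic scale
        aligned = scale * p0 @ R.T + g.mean(0)
        errs.append(np.linalg.norm(aligned - g, axis=-1).mean())
    return float(np.mean(errs))

def mpjve(pred, gt):
    """Mean per-joint velocity error: MPJPE of frame-to-frame differences."""
    return mpjpe(np.diff(pred, axis=0), np.diff(gt, axis=0))
```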
Method | AVG-MPJPE/mm
---|---
Ref. [ | 62.9
Ref. [ | 73.9
Ref. [ | 59.5
Ref. [ | 58.1
Ref. [ | 47.5
Ref. [ | 47.6
Ref. [ | 48.3
Ref. [ | 47.0
Jointformer[ | 50.1
PoseFormer[ | 47.5
PoseFormer+Seq2Seq[ | 53.6
MixSTE[ | 45.3
Ours (f=27) | 46.3

Table 4 Comparison results with existing methods
Model (f=27) | Avg. inference time per 1 000 frames/ms | Parameters/M | FLOPs/M
---|---|---|---
PoseFormer[ | 48.5 | 9.59 | 452
PoseFormer+Seq2Seq[ | 39.5 | 44.6 | 26 405
MixSTE[ | 11.3 | 33.7 | 83
Ours | 8.9 | 23.1 | 177

Table 5 Model inference speed comparison
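Numbers like those in Table 5 are typically produced by counting model parameters and timing repeated forward passes (FLOPs are usually obtained with a profiling tool such as thop or fvcore). A hedged PyTorch timing sketch, where `model` and the assumed input shape (batch, frames, joints, 2) are placeholders for the actual interface:

```python
import time
import torch

def benchmark(model, frames=1000, receptive_field=27, num_joints=17, device="cuda"):
    """Rough parameter count and per-1 000-window timing in the spirit of Table 5.
    The input shape (1, frames, joints, 2) is an assumption about the model's API."""
    model = model.to(device).eval()
    params_m = sum(p.numel() for p in model.parameters()) / 1e6
    x = torch.randn(1, receptive_field, num_joints, 2, device=device)
    with torch.no_grad():
        for _ in range(10):              # warm-up so one-time setup costs are excluded
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()     # ensure queued GPU work has finished
        start = time.perf_counter()
        for _ in range(frames):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"params: {params_m:.2f} M, time for {frames} windows: {elapsed_ms:.1f} ms")
```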
Fig. 4 Visualization of prediction results under occlusion scenarios ((a) Input; (b) Results of PoseFormer; (c) Results of our model; (d) Ground truth)
Method | MPJPE/mm↓ | AUC/%↑ | PCK/%↑
---|---|---|---
Ref. [ | 79.8 | 51.4 | 83.6
Ref. [ | 99.7 | 46.1 | 81.2
Ref. [ | 90.3 | 50.1 | 81.0
Ref. [ | 101.5 | 43.1 | 79.5
Ref. [ | 68.1 | 62.1 | 86.9
Ref. [ | 73.0 | 57.3 | 88.6
PoseFormer[ | 77.1 | 56.4 | 88.6
Ours (f=27) | 57.4 | 64.2 | 94.1

Table 6 Experimental results on the MPI-INF-3DHP dataset
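MPI-INF-3DHP is conventionally evaluated with PCK (the percentage of joints whose 3D error falls below 150 mm) and AUC (PCK averaged over thresholds from 0 to 150 mm). A minimal NumPy sketch of these two metrics, following that convention rather than the authors' code:

```python
import numpy as np

def pck(pred, gt, threshold=150.0):
    """Percentage of joints whose 3D error is below the threshold (mm);
    MPI-INF-3DHP conventionally uses 150 mm. pred, gt: (frames, joints, 3)."""
    err = np.linalg.norm(pred - gt, axis=-1)
    return 100.0 * float((err < threshold).mean())

def auc(pred, gt, thresholds=np.linspace(0, 150, 31)):
    """Area under the PCK curve: PCK averaged over thresholds 0-150 mm (5 mm steps)."""
    return float(np.mean([pck(pred, gt, t) for t in thresholds]))
```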
Motion Transformer encoder layers | Feature dimension of the 2D joint embedding | Feature dimension of the motion Transformer encoder | AVG-MPJPE/mm↓ | Inference time per 1 000 frames/ms↓
---|---|---|---|---
1 | 64 | 64 | 49.3 | 4.6
2 | 64 | 64 | 48.2 | 4.9
3 | 64 | 64 | 48.2 | 6.1
2 | 128 | 64 | 48.5 | 6.7
2 | 256 | 64 | 47.5 | 7.9
2 | 512 | 64 | 46.4 | 9.4
2 | 512 | 128 | 46.4 | 16.3
2 | 512 | 32 | 46.3 | 8.9
0 | 512 | 32 | 48.4 | 5.9

Table 7 Results of hyperparameter tuning and ablation experiments
Lw | Lt | Lm | AVG-MPJPE/mm↓
---|---|---|---
√ | × | × | 47.0
√ | √ | × | 46.6
√ | √ | √ | 46.3

Table 8 Results of the loss function ablation experiment
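Table 8 indicates that the full objective combines a position term Lw with a temporal term Lt and a motion term Lm, each contributing a further accuracy gain. The exact definitions are given in the paper body and are not reproduced on this page; purely as a generic illustration, a position loss augmented with velocity- and acceleration-style terms might look like the sketch below. This is not the authors' loss.

```python
import torch

def total_loss(pred, gt, w_t=1.0, w_m=1.0):
    """Generic illustration of an Lw + Lt + Lm style objective (cf. Table 8).
    NOT the authors' loss: the real definitions and weights are in the paper.
    pred, gt: (batch, frames, joints, 3) 3D poses."""
    l_pos = torch.norm(pred - gt, dim=-1).mean()          # position term (Lw-like)
    vel_p = pred[:, 1:] - pred[:, :-1]                    # frame-to-frame motion
    vel_g = gt[:, 1:] - gt[:, :-1]
    l_tmp = torch.norm(vel_p - vel_g, dim=-1).mean()      # temporal term (Lt-like)
    acc_p = vel_p[:, 1:] - vel_p[:, :-1]                  # second differences
    acc_g = vel_g[:, 1:] - vel_g[:, :-1]
    l_mot = torch.norm(acc_p - acc_g, dim=-1).mean()      # motion term (Lm-like)
    return l_pos + w_t * l_tmp + w_m * l_mot
```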
[1] | HE Y H, YAN R, FRAGKIADAKI K, et al. Epipolar transformers[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 7776-7785. |
[2] | CHEN X P, LIN K Y, LIU W T, et al. Weakly-supervised discovery of geometry-aware representation for 3D human pose estimation[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 10887-10896. |
[3] | ZHENG C, ZHU S J, MENDIETA M, et al. 3D human pose estimation with spatial and temporal transformers[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2022: 11636-11645. |
[4] | SHI M Y, ABERMAN K, ARISTIDOU A, et al. MotioNet: 3D human motion reconstruction from monocular video with skeleton consistency[J]. ACM Transactions on Graphics, 40(1): 1:1-1:15. |
[5] | PAVLLO D, FEICHTENHOFER C, GRANGIER D, et al. 3D human pose estimation in video with temporal convolutions and semi-supervised training[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 7745-7754. |
[6] | MARTINEZ J, HOSSAIN R, ROMERO J, et al. A simple yet effective baseline for 3D human pose estimation[C]// 2017 IEEE International Conference on Computer Vision. New York: IEEE Press, 2017: 2659-2668. |
[7] | HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 770-778. |
[8] | VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]// The 31st International Conference on Neural Information Processing Systems. New York: ACM, 2017: 6000-6010. |
[9] | DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16×16 words: transformers for image recognition at scale[EB/OL]. [2023-06-02]. https://arxiv.org/abs/2010.11929.pdf. |
[10] | ZHANG J L, TU Z G, YANG J Y, et al. MixSTE: Seq2seq mixed spatio-temporal encoder for 3D human pose estimation in video[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 13222-13232. |
[11] | LIU Z, LIN Y T, CAO Y, et al. Swin transformer: hierarchical vision transformer using shifted windows[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2022: 9992-10002. |
[12] | SUTSKEVER I, VINYALS O, LE Q V. Sequence to sequence learning with neural networks[C]// The 27th International Conference on Neural Information Processing Systems - Volume 2. New York:ACM, 2014: 3104-3112. |
[13] | HOSSAIN M R I, LITTLE J J. Exploiting temporal information for 3D human pose estimation[C]// Computer Vision - ECCV 2018: 15th European Conference. Cham: Springer, 2018: 69-86. |
[14] | IONESCU C, PAPAVA D, OLARU V, et al. Human3.6M: large scale datasets and predictive methods for 3D human sensing in natural environments[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014, 36(7): 1325-1339. |
[15] | CHEN Y L, WANG Z C, PENG Y X, et al. Cascaded pyramid network for multi-person pose estimation[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 7103-7112. |
[16] | ZHENG C, WU W H, CHEN C, et al. Deep learning-based human pose estimation: a survey[J]. ACM Computing Surveys, 56(1): 11:1-11:37. |
[17] | LI W H, LIU H, TANG H, et al. MHFormer: multi-hypothesis transformer for 3D human pose estimation[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 13137-13146. |
[18] | LI C, LEE G H. Weakly supervised generative network for multiple 3D human pose hypotheses[EB/OL]. [2023-06-02]. https://arxiv.org/abs/2008.05770.pdf. |
[19] | BANIK S, GARCÍA A M, KNOLL A. 3D human pose regression using graph convolutional network[C]// 2021 IEEE International Conference on Image Processing. New York: IEEE Press, 2021: 924-928. |
[20] | XU Y L, WANG W G, LIU T Y, et al. Monocular 3D pose estimation via pose grammar and data augmentation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44(10): 6327-6344. |
[21] | ORESHKIN B N. HybrIK-Transformer[EB/OL]. [2023-06-22]. https://arxiv.org/abs/2302.04774. |
[22] | CAI J L, LIU H, DING R W, et al. HTNet: human topology aware network for 3D human pose estimation[C]// ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing. New York: IEEE Press, 2023: 1-5. |
[23] | KIM J, GWON M G, PARK H, et al. Sampling is matter: point-guided 3D human mesh reconstruction[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 12880-12889. |
[24] | HASSAN M T, BEN HAMZA A. Regular splitting graph network for 3D human pose estimation[J]. IEEE Transactions on Image Processing, 2023, 32: |
[25] | LUTZ S, BLYTHMAN R, GHOSAL K, et al. Jointformer: single-frame lifting transformer with error prediction and refinement for 3D human pose estimation[C]// 2022 26th International Conference on Pattern Recognition. New York: IEEE Press, 2022: 1156-1163. |
[26] | LIN J H, LEE G H. Trajectory space factorization for deep video-based 3D human pose estimation[EB/OL]. [2023-06-02]. https://arxiv.org/abs/1908.08289.pdf. |
[27] | LI S C, KE L, PRATAMA K, et al. Cascaded deep monocular 3D human pose estimation with evolutionary training data[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 6172-6182. |
[28] | BOUAZIZI A, KRESSEL U, BELAGIANNIS V. Learning temporal 3D human pose estimation with pseudo- labels[C]// 2021 17th IEEE International Conference on Advanced Video and Signal Based Surveillance. New York: IEEE Press, 2022: 1-8. |
[29] | GUAN S Y, XU J W, HE M Z, et al. Out-of-domain human mesh reconstruction via dynamic bilevel online adaptation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(4): |
[30] | WANG J B, YAN S J, XIONG Y J, et al. Motion guided 3D pose estimation from videos[EB/OL]. [2023-06-02]. https://www.ecva.net/papers/eccv_2020/paper_ECCV/papers/123580749.pdf. |
[31] | GONG K H, ZHANG J F, FENG J S. PoseAug: a differentiable pose augmentation framework for 3D human pose estimation[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 8571-8580. |