Journal of Graphics ›› 2024, Vol. 45 ›› Issue (1): 159-168. DOI: 10.11996/JG.j.2095-302X.2024010159
• Computer Graphics and Virtual Reality •
Received: 2023-07-25
Accepted: 2023-10-28
Online: 2024-02-29
Published: 2024-02-29
Contact: YANG Hongyu (1990-), associate professor, Ph.D. Her main research interests cover computer vision and pattern recognition, etc.
About author: LV Heng (2001-), master student. His main research interests cover computer vision and machine learning. E-mail: 19373716@buaa.edu.cn
LV Heng, YANG Hongyu. A 3D human pose estimation approach based on spatio-temporal motion interaction modeling[J]. Journal of Graphics, 2024, 45(1): 159-168.
URL: http://www.txxb.com.cn/EN/10.11996/JG.j.2095-302X.2024010159
Fig. 1 Architecture of the Transformer-based joint motion information interaction model ((a) Overall structure; (b) Structure of the spatial Transformer encoder)
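As a rough illustration of the spatial encoder in Fig. 1(b): within each frame, the 2D joints are treated as tokens so that self-attention can model joint-to-joint dependencies. The following is a minimal PyTorch sketch under that reading; the layer count, head count, and embedding dimension follow Table 2 below, but the module structure, joint count (J = 17 for Human3.6M), and all names are our assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class SpatialTransformerEncoder(nn.Module):
    """Minimal sketch of a per-frame spatial encoder: the J joints of one frame
    become J tokens, and self-attention models joint-to-joint relations.
    Hyperparameters follow Table 2 (6 layers, 8 heads, dim 512); the rest is assumed."""
    def __init__(self, num_joints=17, in_dim=2, embed_dim=512, num_layers=6, num_heads=8):
        super().__init__()
        self.joint_embed = nn.Linear(in_dim, embed_dim)   # lift each (x, y) joint to a token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_joints, embed_dim))  # learnable joint positions
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, joints_2d):
        # joints_2d: (batch * frames, num_joints, 2)
        tokens = self.joint_embed(joints_2d) + self.pos_embed
        return self.encoder(tokens)       # (batch * frames, num_joints, embed_dim)

# Usage: a 27-frame receptive field folded into the batch dimension
x = torch.randn(8 * 27, 17, 2)
features = SpatialTransformerEncoder()(x)  # -> (216, 17, 512)
```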
Name | Version
---|---
Ubuntu | 20.04
CUDA | 11.2
PyTorch | 1.9
Python | 3.9
cuDNN | 7

Table 1 Experimental environment versions
Parameter | Value
---|---
Receptive field (frames) | 27
Temporal Transformer encoder layers | 3
Spatial Transformer encoder layers | 6
Motion Transformer encoder layers | 2
Learning rate | 0.00014
Number of attention heads | 8
Feature dimension of the 2D joint embedding | 512
Feature dimension of the motion Transformer encoder | 32

Table 2 Model parameter configuration
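For reference, the settings of Table 2 can be gathered into a single configuration object. The sketch below only transcribes the table; the field names are illustrative, not taken from the paper's code.

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    # Values transcribed from Table 2; field names are our own.
    receptive_field: int = 27     # input frames per prediction window
    temporal_layers: int = 3      # temporal Transformer encoder layers
    spatial_layers: int = 6       # spatial Transformer encoder layers
    motion_layers: int = 2        # motion Transformer encoder layers
    learning_rate: float = 1.4e-4
    num_heads: int = 8            # attention heads
    joint_embed_dim: int = 512    # feature dim the 2D joints are mapped to
    motion_embed_dim: int = 32    # feature dim of the motion encoder

cfg = ModelConfig()
```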
Action | MPJPE/mm | P-MPJPE/mm | MPJVE/mm | Total frames | Total inference time/ms
---|---|---|---|---|---
Dir. | 42.2 | 33.1 | 3.1 | 26 568 | 314
Disc. | 46.1 | 36.0 | 3.3 | 64 476 | 348
Eat. | 44.2 | 35.3 | 2.5 | 39 528 | 368
Greet | 44.3 | 35.6 | 3.6 | 30 780 | 328
Phone | 49.0 | 37.4 | 2.4 | 56 268 | 275
Photo | 54.1 | 41.6 | 3.0 | 29 484 | 294
Pose | 43.4 | 33.0 | 2.9 | 27 432 | 344
Purch. | 44.1 | 33.2 | 3.3 | 19 440 | 305
Sit | 56.0 | 45.9 | 2.1 | 40 392 | 336
SitD. | 65.0 | 51.2 | 2.9 | 33 588 | 303
Smoke | 47.5 | 37.5 | 2.5 | 55 836 | 311
Wait | 43.5 | 32.1 | 2.7 | 38 016 | 373
WalkD. | 48.9 | 37.3 | 4.0 | 28 512 | 345
Walk | 32.5 | 25.1 | 3.4 | 29 484 | 284
WalkT. | 34.3 | 27.4 | 3.0 | 26 460 | 317
Average | 46.3 | 36.1 | 3.0 | 36 418 | 323

Table 3 Experimental results on the Human3.6M dataset
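Table 3 reports three standard metrics: MPJPE (mean per-joint position error), P-MPJPE (MPJPE after a rigid Procrustes alignment of the prediction to the ground truth), and MPJVE (the error of frame-to-frame joint velocities). A minimal NumPy sketch of how these are conventionally computed (not the authors' evaluation code):

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance in mm.
    pred, gt: (frames, joints, 3)."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def p_mpjpe(pred, gt):
    """MPJPE after per-frame similarity (Procrustes) alignment of pred to gt."""
    errs = []
    for p, g in zip(pred, gt):
        p0, g0 = p - p.mean(0), g - g.mean(0)        # center both point sets
        U, s, Vt = np.linalg.svd(g0.T @ p0)          # Kabsch rotation via SVD
        if np.linalg.det(U @ Vt) < 0:                # avoid an improper reflection
            U[:, -1] *= -1
            s[-1] *= -1
        R = U @ Vt
        scale = s.sum() / (p0 ** 2).sum()            # optimal isotropic scale
        aligned = scale * p0 @ R.T + g.mean(0)
        errs.append(np.linalg.norm(aligned - g, axis=-1).mean())
    return float(np.mean(errs))

def mpjve(pred, gt):
    """Mean per-joint velocity error: MPJPE of frame-to-frame differences."""
    return mpjpe(np.diff(pred, axis=0), np.diff(gt, axis=0))
```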
Method | AVG-MPJPE/mm
---|---
Ref. [ | 62.9
Ref. [ | 73.9
Ref. [ | 59.5
Ref. [ | 58.1
Ref. [ | 47.5
Ref. [ | 47.6
Ref. [ | 48.3
Ref. [ | 47.0
Jointformer[ | 50.1
PoseFormer[ | 47.5
PoseFormer+Seq2Seq[ | 53.6
MixSTE[ | 45.3
Ours (f=27) | 46.3

Table 4 Comparison results with existing methods
Model (f=27) | Avg. inference time per 1 000 frames/ms | Parameters/M | FLOPs/M
---|---|---|---
PoseFormer[ | 48.5 | 9.59 | 452
PoseFormer+Seq2Seq[ | 39.5 | 44.6 | 26 405
MixSTE[ | 11.3 | 33.7 | 83
Ours | 8.9 | 23.1 | 177

Table 5 Model inference speed comparison
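Numbers like those in Table 5 are typically produced by counting model parameters and timing repeated forward passes (FLOPs are usually obtained with a profiling tool such as thop or fvcore). A hedged PyTorch timing sketch, where `model` and the assumed input shape (batch, frames, joints, 2) are placeholders for the actual interface:

```python
import time
import torch

def benchmark(model, frames=1000, receptive_field=27, num_joints=17, device="cuda"):
    """Rough parameter count and per-1 000-window timing in the spirit of Table 5.
    The input shape (1, frames, joints, 2) is an assumption about the model's API."""
    model = model.to(device).eval()
    params_m = sum(p.numel() for p in model.parameters()) / 1e6
    x = torch.randn(1, receptive_field, num_joints, 2, device=device)
    with torch.no_grad():
        for _ in range(10):              # warm-up so one-time setup costs are excluded
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()     # ensure queued GPU work has finished
        start = time.perf_counter()
        for _ in range(frames):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"params: {params_m:.2f} M, time for {frames} windows: {elapsed_ms:.1f} ms")
```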
Fig. 4 Visualization of prediction results under occlusion scenarios ((a) Input; (b) Results of PoseFormer; (c) Results of our model; (d) Ground truth)
Method | MPJPE/mm↓ | AUC/%↑ | PCK/%↑
---|---|---|---
Ref. [ | 79.8 | 51.4 | 83.6
Ref. [ | 99.7 | 46.1 | 81.2
Ref. [ | 90.3 | 50.1 | 81.0
Ref. [ | 101.5 | 43.1 | 79.5
Ref. [ | 68.1 | 62.1 | 86.9
Ref. [ | 73.0 | 57.3 | 88.6
PoseFormer[ | 77.1 | 56.4 | 88.6
Ours (f=27) | 57.4 | 64.2 | 94.1

Table 6 Experimental results on the MPI-INF-3DHP dataset
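MPI-INF-3DHP is conventionally evaluated with PCK (the percentage of joints whose 3D error falls below 150 mm) and AUC (PCK averaged over thresholds from 0 to 150 mm). A minimal NumPy sketch of these two metrics, following that convention rather than the authors' code:

```python
import numpy as np

def pck(pred, gt, threshold=150.0):
    """Percentage of joints whose 3D error is below the threshold (mm);
    MPI-INF-3DHP conventionally uses 150 mm. pred, gt: (frames, joints, 3)."""
    err = np.linalg.norm(pred - gt, axis=-1)
    return 100.0 * float((err < threshold).mean())

def auc(pred, gt, thresholds=np.linspace(0, 150, 31)):
    """Area under the PCK curve: PCK averaged over thresholds 0-150 mm (5 mm steps)."""
    return float(np.mean([pck(pred, gt, t) for t in thresholds]))
```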
Motion Transformer encoder layers | Feature dimension of the 2D joint embedding | Feature dimension of the motion Transformer encoder | AVG-MPJPE/mm↓ | Inference time per 1 000 frames/ms↓
---|---|---|---|---
1 | 64 | 64 | 49.3 | 4.6
2 | 64 | 64 | 48.2 | 4.9
3 | 64 | 64 | 48.2 | 6.1
2 | 128 | 64 | 48.5 | 6.7
2 | 256 | 64 | 47.5 | 7.9
2 | 512 | 64 | 46.4 | 9.4
2 | 512 | 128 | 46.4 | 16.3
2 | 512 | 32 | 46.3 | 8.9
0 | 512 | 32 | 48.4 | 5.9

Table 7 Results of hyperparameter tuning and ablation experiments
Lw | Lt | Lm | AVG-MPJPE/mm↓
---|---|---|---
√ | × | × | 47.0
√ | √ | × | 46.6
√ | √ | √ | 46.3

Table 8 Results of the loss function ablation experiment
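Table 8 indicates that the full objective combines a position term Lw with a temporal term Lt and a motion term Lm, each contributing a further accuracy gain. The exact definitions are given in the paper body and are not reproduced on this page; purely as a generic illustration, a position loss augmented with velocity- and acceleration-style terms might look like the sketch below. This is not the authors' loss.

```python
import torch

def total_loss(pred, gt, w_t=1.0, w_m=1.0):
    """Generic illustration of an Lw + Lt + Lm style objective (cf. Table 8).
    NOT the authors' loss: the real definitions and weights are in the paper.
    pred, gt: (batch, frames, joints, 3) 3D poses."""
    l_pos = torch.norm(pred - gt, dim=-1).mean()          # position term (Lw-like)
    vel_p = pred[:, 1:] - pred[:, :-1]                    # frame-to-frame motion
    vel_g = gt[:, 1:] - gt[:, :-1]
    l_tmp = torch.norm(vel_p - vel_g, dim=-1).mean()      # temporal term (Lt-like)
    acc_p = vel_p[:, 1:] - vel_p[:, :-1]                  # second differences
    acc_g = vel_g[:, 1:] - vel_g[:, :-1]
    l_mot = torch.norm(acc_p - acc_g, dim=-1).mean()      # motion term (Lm-like)
    return l_pos + w_t * l_tmp + w_m * l_mot
```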
[1] | HE Y H, YAN R, FRAGKIADAKI K, et al. Epipolar transformers[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 7776-7785. |
[2] | CHEN X P, LIN K Y, LIU W T, et al. Weakly-supervised discovery of geometry-aware representation for 3D human pose estimation[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 10887-10896. |
[3] | ZHENG C, ZHU S J, MENDIETA M, et al. 3D human pose estimation with spatial and temporal transformers[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2022: 11636-11645. |
[4] | SHI M Y, ABERMAN K, ARISTIDOU A, et al. MotioNet: 3D human motion reconstruction from monocular video with skeleton consistency[J]. ACM Transactions on Graphics, 40(1): 1:1-1:15. |
[5] | PAVLLO D, FEICHTENHOFER C, GRANGIER D, et al. 3D human pose estimation in video with temporal convolutions and semi-supervised training[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 7745-7754. |
[6] | MARTINEZ J, HOSSAIN R, ROMERO J, et al. A simple yet effective baseline for 3D human pose estimation[C]// 2017 IEEE International Conference on Computer Vision. New York: IEEE Press, 2017: 2659-2668. |
[7] | HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 770-778. |
[8] | VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]// The 31st International Conference on Neural Information Processing Systems. New York: ACM, 2017: 6000-6010. |
[9] | DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16×16 words: transformers for image recognition at scale[EB/OL]. [2023-06-02]. https://arxiv.org/abs/2010.11929.pdf. |
[10] | ZHANG J L, TU Z G, YANG J Y, et al. MixSTE: Seq2seq mixed spatio-temporal encoder for 3D human pose estimation in video[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 13222-13232. |
[11] | LIU Z, LIN Y T, CAO Y, et al. Swin transformer: hierarchical vision transformer using shifted windows[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2022: 9992-10002. |
[12] | SUTSKEVER I, VINYALS O, LE Q V. Sequence to sequence learning with neural networks[C]// The 27th International Conference on Neural Information Processing Systems - Volume 2. New York:ACM, 2014: 3104-3112. |
[13] | HOSSAIN M R I, LITTLE J J. Exploiting temporal information for 3D human pose estimation[C]// Computer Vision - ECCV 2018: 15th European Conference. Cham: Springer, 2018: 69-86. |
[14] | IONESCU C, PAPAVA D, OLARU V, et al. Human3.6M: large scale datasets and predictive methods for 3D human sensing in natural environments[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014, 36(7): 1325-1339. |
[15] | CHEN Y L, WANG Z C, PENG Y X, et al. Cascaded pyramid network for multi-person pose estimation[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 7103-7112. |
[16] | ZHENG C, WU W H, CHEN C, et al. Deep learning-based human pose estimation: a survey[J]. ACM Computing Surveys, 56(1): 11:1-11:37. |
[17] | LI W H, LIU H, TANG H, et al. MHFormer: multi-hypothesis transformer for 3D human pose estimation[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 13137-13146. |
[18] | LI C, LEE G H. Weakly supervised generative network for multiple 3D human pose hypotheses[EB/OL]. [2023-06-02]. https://arxiv.org/abs/2008.05770.pdf. |
[19] | BANIK S, GARCÍA A M, KNOLL A. 3D human pose regression using graph convolutional network[C]// 2021 IEEE International Conference on Image Processing. New York: IEEE Press, 2021: 924-928. |
[20] | XU Y L, WANG W G, LIU T Y, et al. Monocular 3D pose estimation via pose grammar and data augmentation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44(10): 6327-6344. |
[21] | ORESHKIN B N. HybrIK-Transformer[EB/OL]. [2023-06-22]. https://arxiv.org/abs/2302.04774. |
[22] | CAI J L, LIU H, DING R W, et al. HTNet: human topology aware network for 3D human pose estimation[C]// ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing. New York: IEEE Press, 2023: 1-5. |
[23] | KIM J, GWON M G, PARK H, et al. Sampling is matter: point-guided 3D human mesh reconstruction[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 12880-12889. |
[24] | HASSAN M T, BEN HAMZA A. Regular splitting graph network for 3D human pose estimation[J]. IEEE Transactions on Image Processing, 2023, 32: |
[25] | LUTZ S, BLYTHMAN R, GHOSAL K, et al. Jointformer: single-frame lifting transformer with error prediction and refinement for 3D human pose estimation[C]// 2022 26th International Conference on Pattern Recognition. New York: IEEE Press, 2022: 1156-1163. |
[26] | LIN J H, LEE G H. Trajectory space factorization for deep video-based 3D human pose estimation[EB/OL]. [2023-06-02]. https://arxiv.org/abs/1908.08289.pdf. |
[27] | LI S C, KE L, PRATAMA K, et al. Cascaded deep monocular 3D human pose estimation with evolutionary training data[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 6172-6182. |
[28] | BOUAZIZI A, KRESSEL U, BELAGIANNIS V. Learning temporal 3D human pose estimation with pseudo- labels[C]// 2021 17th IEEE International Conference on Advanced Video and Signal Based Surveillance. New York: IEEE Press, 2022: 1-8. |
[29] | GUAN S Y, XU J W, HE M Z, et al. Out-of-domain human mesh reconstruction via dynamic bilevel online adaptation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(4): |
[30] | WANG J B, YAN S J, XIONG Y J, et al. Motion guided 3D pose estimation from videos[EB/OL]. [2023-06-02]. https://www.ecva.net/papers/eccv_2020/paper_ECCV/papers/123580749.pdf. |
[31] | GONG K H, ZHANG J F, FENG J S. PoseAug: a differentiable pose augmentation framework for 3D human pose estimation[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 8571-8580. |