Journal of Graphics ›› 2023, Vol. 44 ›› Issue (1): 139-145.DOI: 10.11996/JG.j.2095-302X.2023010139
• Computer Graphics and Virtual Reality •

A Transformer-based 3D human pose estimation method
WANG Yu-ping1(), ZENG Yi1, LI Sheng-hui2, ZHANG Lei3
Received: 2022-04-07
Revised: 2022-07-19
Online: 2023-10-31
Published: 2023-02-16
About author: WANG Yu-ping (1979-), professor, master's degree. Her main research interests include machine vision, virtual reality, and machine learning. E-mail: wangyupingpaper@163.com
WANG Yu-ping, ZENG Yi, LI Sheng-hui, ZHANG Lei. A Transformer-based 3D human pose estimation method[J]. Journal of Graphics, 2023, 44(1): 139-145.
URL: http://www.txxb.com.cn/EN/10.11996/JG.j.2095-302X.2023010139
Device | Parameter
---|---
Operating system | Ubuntu 20.04
Deep learning framework | PyTorch 1.10
CUDA version | 11.5
Development software | PyCharm
CPU | i7-12700KF
GPU | 3090 (×1)

Table 1 Experimental environment
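Results of this kind depend on the exact framework versions in Table 1, so a small pre-flight check is common practice. The helper below is a hypothetical sketch (the name `version_matches` is not from the paper) that compares dotted version prefixes rather than raw string prefixes:

```python
def version_matches(installed: str, expected: str) -> bool:
    """True when `installed` (e.g. '1.10.2') starts with the dotted
    prefix `expected` (e.g. '1.10'). Splitting on '.' avoids the
    string-prefix pitfall where '1.1' would wrongly match '1.10'."""
    parts = installed.split("+")[0].split(".")  # drop local tags like '+cu115'
    want = expected.split(".")
    return parts[:len(want)] == want

# Checks against the versions listed in Table 1:
assert version_matches("1.10.2", "1.10")      # PyTorch 1.10.x
assert version_matches("11.5", "11.5")        # CUDA 11.5
assert not version_matches("1.11.0", "1.10")  # a different PyTorch line
```

In practice the installed strings would come from `torch.__version__` and `torch.version.cuda`.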
Models | PA-MPJPE | MPJPE | PVE | Accel
---|---|---|---|---
HMR | 76.7 | 130.0 | - | 37.4
SPIN | 59.2 | 96.9 | 116.4 | 29.8
VIBE (direct) | 58.7 | 100.0 | 118.5 | 28.7
VIBE | 55.2 | 93.8 | 110.4 | 28.2
TR-VIBE (direct) | 58.8 | 100.7 | 126.6 | 32.2
TR-VIBE | 53.5 | 86.3 | 101.8 | 25.5

Table 2 Comparison of 3DPW experimental results (PA-MPJPE, MPJPE and PVE in mm; Accel in mm/s²)
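Table 2 reports both raw and Procrustes-aligned joint error. As a point of reference, these two metrics can be sketched in a few lines of numpy; this is an illustrative implementation of the standard definitions, not the paper's evaluation code:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance
    between predicted and ground-truth joints, both of shape (J, 3)."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def pa_mpjpe(pred, gt):
    """MPJPE after Procrustes alignment: the prediction is first fitted
    to the ground truth with the best similarity transform (scale,
    rotation, translation), so only pose-shape errors remain."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g            # center both point sets
    U, S, Vt = np.linalg.svd(p.T @ g)        # SVD of the cross-covariance
    if np.linalg.det(Vt.T @ U.T) < 0:        # guard against reflections
        Vt[-1] *= -1
        S[-1] *= -1
    R = Vt.T @ U.T                           # optimal rotation
    scale = S.sum() / (p ** 2).sum()         # optimal isotropic scale
    aligned = scale * p @ R.T + mu_g
    return mpjpe(aligned, gt)
```

A prediction that is merely rotated, scaled, and shifted has PA-MPJPE near zero even when its raw MPJPE is large, which is why the tables report both columns.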
Models | PA-MPJPE | MPJPE | PVE | Accel
---|---|---|---|---
HMR | 89.8 | 124.2 | - | -
SPIN | 67.5 | 105.2 | - | -
VIBE (direct) | 66.8 | 103.2 | 916.8 | 33.2
VIBE | 64.3 | 100.8 | 915.0 | 32.2
TR-VIBE (direct) | 66.7 | 102.7 | 915.3 | 34.8
TR-VIBE | 64.9 | 99.0 | 907.9 | 30.1

Table 3 Comparison of MPI-INF-3DHP experimental results (PA-MPJPE, MPJPE and PVE in mm; Accel in mm/s²)
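The Accel column measures temporal smoothness rather than per-frame accuracy. A commonly used form (e.g. in public video-pose evaluation code) approximates joint acceleration with second-order finite differences over the sequence; the sketch below follows that convention and is not taken from the paper:

```python
import numpy as np

def accel_error(pred, gt):
    """Acceleration error for joint sequences of shape (T, J, 3).
    Acceleration is approximated by the second-order finite difference
    j[t+1] - 2*j[t] + j[t-1]; the Euclidean distance between predicted
    and ground-truth accelerations is averaged over the T-2 interior
    frames and all joints."""
    a_pred = pred[2:] - 2.0 * pred[1:-1] + pred[:-2]
    a_gt = gt[2:] - 2.0 * gt[1:-1] + gt[:-2]
    return np.linalg.norm(a_pred - a_gt, axis=-1).mean()
```

A constant offset or constant-velocity drift between the two sequences leaves this error at zero, which is why Accel complements MPJPE instead of replacing it.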
Models | PA-MPJPE | MPJPE | PVE | Accel
---|---|---|---|---
VIBE | 55.2 | 93.8 | 110.4 | 28.2
VIBE-α | 55.1 | 94.2 | 110.2 | 28.5
VIBE-β | 55.0 | 87.7 | 104.2 | 25.8
TR-VIBE | 53.5 | 86.3 | 101.8 | 25.5
TR-VIBE-α | 55.9 | 92.5 | 110.1 | 28.6
TR-VIBE-β | 55.0 | 94.2 | 110.4 | 29.8

Table 4 Ablation experiments for LSTM and Transformer on 3DPW (PA-MPJPE, MPJPE and PVE in mm; Accel in mm/s²)
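The ablation in Table 4 contrasts recurrent and Transformer temporal encoders. The core difference is that self-attention lets every frame weigh every other frame directly in one step, instead of propagating a hidden state frame by frame as an LSTM does. A minimal numpy sketch of single-head scaled dot-product self-attention [13] is shown below; the projection matrices `Wq`, `Wk`, `Wv` are illustrative placeholders, not the paper's learned weights:

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a frame
    sequence x of shape (T, D). Each output frame is a weighted mix
    of all T frames, so temporal context is aggregated globally."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])        # (T, T) frame affinities
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # row-wise softmax
    return weights @ V                             # (T, d_v)
```

A full Transformer block would add multiple heads, positional encodings, residual connections, and a feed-forward sublayer on top of this operation.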
[1] | PAVLAKOS G, ZHOU X W, DERPANIS K G, et al. Coarse-to-fine volumetric prediction for single-image 3D human pose[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 1263-1272. |
[2] | LI S J, CHAN A B. 3D human pose estimation from monocular images with deep convolutional neural network[M]//Computer Vision - ACCV 2014. Cham: Springer International Publishing, 2015: 332-347. |
[3] | KOCABAS M, ATHANASIOU N, BLACK M J. VIBE: video inference for human body pose and shape estimation[C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 5252-5262. |
[4] | PAVLLO D, FEICHTENHOFER C, GRANGIER D, et al. 3D human pose estimation in video with temporal convolutions and semi-supervised training[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 7745-7754. |
[5] | DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[EB/OL]. [2021-12-02]. https://arxiv.org/abs/1810.04805. |
[6] | DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: transformers for image recognition at scale[EB/OL]. [2021-12-05]. https://arxiv.org/abs/2010.11929. |
[7] | LI K, WANG S J, ZHANG X, et al. Pose recognition with cascade transformers[C]//2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 1944-1953. |
[8] | ZHENG C, ZHU S J, MENDIETA M, et al. 3D human pose estimation with spatial and temporal transformers[C]//2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2021: 11636-11645. |
[9] | SHI X J, CHEN Z R, WANG H, et al. Convolutional LSTM Network: a machine learning approach for precipitation nowcasting[C]// The 28th International Conference on Neural Information Processing Systems - Volume 1. New York: ACM, 2015: 802-810. |
[10] | LOPER M, MAHMOOD N, ROMERO J, et al. SMPL: a skinned multi-person linear model[J]. ACM Transactions on Graphics, 2015, 34(6): 248. |
[11] | KANAZAWA A, BLACK M J, JACOBS D W, et al. End-to-end recovery of human shape and pose[C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 7122-7131. |
[12] | JIANG Y, CHANG S, WANG Z. TransGAN: two pure transformers can make one strong GAN, and that can scale up[J]. Advances in Neural Information Processing Systems, 2021, 34: 14745-14758. |
[13] | VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//The 31st International Conference on Neural Information Processing Systems. New York: ACM, 2017: 6000-6010. |
[14] | DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: transformers for image recognition at scale[EB/OL]. [2021-12-02]. https://arxiv.org/abs/2010.11929. |
[15] | SHAW P, USZKOREIT J, VASWANI A. Self-attention with relative position representations[EB/OL]. [2021-12-01]. https://arxiv.org/abs/1803.02155. |
[16] | RAFFEL C, SHAZEER N, ROBERTS A, et al. Exploring the limits of transfer learning with a unified text-to-text transformer[EB/OL]. [2021-12-02]. https://arxiv.org/abs/1910.10683. |
[17] | HUANG C Z A, VASWANI A, USZKOREIT J, et al. Music transformer[EB/OL]. [2021-12-05]. https://arxiv.org/abs/1809.04281. |
[18] | LIU Z, LIN Y T, CAO Y, et al. Swin transformer: hierarchical vision transformer using shifted windows[C]//2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2021: 9992-10002. |
[19] | HU H, ZHANG Z, XIE Z D, et al. Local relation networks for image recognition[C]//2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 3463-3472. |
[20] | HE K M, ZHANG X Y, REN S Q, et al. Identity mappings in deep residual networks[M]//Computer Vision - ECCV 2016. Cham: Springer International Publishing, 2016: 630-645. |
[21] | KOLOTOUROS N, PAVLAKOS G, BLACK M, et al. Learning to reconstruct 3D human pose and shape via model-fitting in the loop[C]//2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 2252-2261. |
[22] | KANAZAWA A, ZHANG J Y, FELSEN P, et al. Learning 3D human dynamics from video[C]//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 5607-5616. |
[23] | KINGMA D P, BA J. Adam: a method for stochastic optimization[EB/OL]. [2022-01-03]. https://arxiv.org/abs/1412.6980. |
[24] | MEHTA D, RHODIN H, CASAS D, et al. Monocular 3D human pose estimation in the wild using improved CNN supervision[C]// 2017 International Conference on 3D Vision. New York: IEEE Press, 2017: 506-516. |
[25] | MAHMOOD N, GHORBANI N, TROJE N F, et al. AMASS: archive of motion capture as surface shapes[C]//2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 5441-5450. |
[26] | VON MARCARD T, HENSCHEL R, BLACK M J, et al. Recovering accurate 3D human pose in the wild using IMUs and a moving camera[M]//Computer Vision - ECCV 2018. Cham: Springer International Publishing, 2018: 614-631. |
[27] | KOLOTOUROS N, PAVLAKOS G, BLACK M, et al. Learning to reconstruct 3D human pose and shape via model-fitting in the loop[C]//2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 2252-2261. |