
Journal of Graphics ›› 2023, Vol. 44 ›› Issue (1): 139-145. DOI: 10.11996/JG.j.2095-302X.2023010139

• Computer Graphics and Virtual Reality •


A Transformer-based 3D human pose estimation method

WANG Yu-ping1(), ZENG Yi1, LI Sheng-hui2, ZHANG Lei3   

  1. School of Information Engineering, Zhengzhou University of Science and Technology, Zhengzhou, Henan 450064, China
    2. College of Big Data, Henan Electromechanical Vocational College, Zhengzhou, Henan 450064, China
    3. School of Information Engineering, Zhengzhou University, Zhengzhou, Henan 450001, China
  • Received: 2022-04-07  Revised: 2022-07-19  Online: 2023-10-31  Published: 2023-02-16
  • About the author: WANG Yu-ping (1979-), professor, holds a master's degree. Her main research interests cover machine vision, virtual reality, and machine learning. E-mail: wangyupingpaper@163.com
  • Supported by:
    Science and Technology Research Project of the Henan Provincial Department of Science and Technology (222102210174)


Abstract:

3D human pose estimation is the foundation of human behavior understanding, but predicting plausible 3D human pose sequences remains a challenging problem. To address this problem, a Transformer-based 3D human pose estimation method was proposed, employing multi-layer long short-term memory (LSTM) units and a multi-scale Transformer structure to enhance the accuracy of human pose sequence prediction. First, a time-series-based generator was designed, extracting image features with a pre-trained ResNet network. Second, multi-layer LSTM units were used to learn the relationships between human poses in temporally continuous image sequences, outputting a plausible sequence of skinned multi-person linear (SMPL) human parameter models. Finally, a multi-scale Transformer-based discriminator was constructed, in which the multi-scale Transformer structure learned detailed features at multiple segmentation granularities; in particular, the Transformer block encoded relative positions to strengthen local feature learning. Experimental results show that the proposed method achieves better prediction accuracy than the VIBE method: its mean per joint position error (MPJPE) is 7.5% lower than that of VIBE on the 3DPW dataset and 1.8% lower on the MPI-INF-3DHP dataset.
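The evaluation metric quoted above, MPJPE, is the mean Euclidean distance between predicted and ground-truth 3D joint positions, averaged over joints and frames. A minimal sketch of the computation, assuming joints are supplied as `(frames, joints, 3)` arrays in a common unit such as millimetres (the function name `mpjpe` is illustrative, not from the paper's code):

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per joint position error: the average Euclidean distance
    between predicted and ground-truth 3D joints, over all joints
    and frames. Inputs have shape (num_frames, num_joints, 3)."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    # Per-joint Euclidean distance, then mean over joints and frames.
    return np.linalg.norm(pred - gt, axis=-1).mean()

# Toy example: one frame, two joints, each predicted 3 units off along one axis.
gt = np.zeros((1, 2, 3))
pred = np.array([[[3.0, 0.0, 0.0],
                  [0.0, 3.0, 0.0]]])
print(mpjpe(pred, gt))  # → 3.0
```

Reported percentage improvements (e.g. "7.5% lower than VIBE") compare these averages between methods on the same dataset.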

Key words: multi-scale Transformer structure, LSTM unit, time series, attention mechanism
