
Journal of Graphics ›› 2023, Vol. 44 ›› Issue (1): 139-145. DOI: 10.11996/JG.j.2095-302X.2023010139

• Computer Graphics and Virtual Reality •


A Transformer-based 3D human pose estimation method

WANG Yu-ping1(), ZENG Yi1, LI Sheng-hui2, ZHANG Lei3   

  1. School of Information Engineering, Zhengzhou University of Science and Technology, Zhengzhou, Henan 450064, China
    2. College of Big Data, Henan Electromechanical Vocational College, Zhengzhou, Henan 450064, China
    3. School of Information Engineering, Zhengzhou University, Zhengzhou, Henan 450001, China
  • Received: 2022-04-07  Revised: 2022-07-19  Online: 2023-10-31  Published: 2023-02-16
  • About the author: WANG Yu-ping (1979-), professor, holds a master's degree. Her main research interests cover machine vision, virtual reality, and machine learning. E-mail: wangyupingpaper@163.com
  • Supported by:
    Science and Technology Research Project of the Henan Provincial Department of Science and Technology (222102210174)


Abstract:

3D human pose estimation is the foundation of human behavior understanding, but predicting plausible 3D human pose sequences remains a challenging problem. To address this problem, a Transformer-based 3D human pose estimation method was proposed, employing multi-layer long short-term memory (LSTM) units and a multi-scale Transformer structure to enhance the accuracy of human pose sequence prediction. First, a time-series-based generator was designed, extracting image features with a pre-trained ResNet network. Second, multi-layer LSTM units were used to learn the relationships between human poses in temporally continuous image sequences, outputting a plausible sequence of skinned multi-person linear (SMPL) human parameter models. Finally, a multi-scale Transformer-based discriminator was constructed, in which the multi-scale Transformer structure learned detailed features at multiple segmentation granularities; in particular, the Transformer block encoded relative positions to strengthen local feature learning. Experimental results show that the proposed method achieves better prediction accuracy than the VIBE method: its mean per joint position error (MPJPE) is 7.5% lower than that of VIBE on the 3DPW dataset and 1.8% lower on the MPI-INF-3DHP dataset.
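The evaluation metric quoted above, MPJPE, is the mean Euclidean distance between predicted and ground-truth 3D joint positions, averaged over joints and frames. A minimal sketch of the computation, assuming joints are supplied as `(frames, joints, 3)` arrays in a common unit such as millimetres (the function name `mpjpe` is illustrative, not from the paper's code):

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per joint position error: the average Euclidean distance
    between predicted and ground-truth 3D joints, over all joints
    and frames. Inputs have shape (num_frames, num_joints, 3)."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    # Per-joint Euclidean distance, then mean over joints and frames.
    return np.linalg.norm(pred - gt, axis=-1).mean()

# Toy example: one frame, two joints, each predicted 3 units off along one axis.
gt = np.zeros((1, 2, 3))
pred = np.array([[[3.0, 0.0, 0.0],
                  [0.0, 3.0, 0.0]]])
print(mpjpe(pred, gt))  # → 3.0
```

Reported percentage improvements (e.g. "7.5% lower than VIBE") compare these averages between methods on the same dataset.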

Key words: multi-scale Transformer structure, LSTM unit, time series, attention mechanism
