
Journal of Graphics ›› 2023, Vol. 44 ›› Issue (1): 139-145. DOI: 10.11996/JG.j.2095-302X.2023010139

• Computer Graphics and Virtual Reality •

A Transformer-based 3D human pose estimation method

WANG Yu-ping1, ZENG Yi1, LI Sheng-hui2, ZHANG Lei3

  1. School of Information Engineering, Zhengzhou University of Science and Technology, Zhengzhou Henan 450064, China
    2. College of Big Data, Henan Electromechanical Vocational College, Zhengzhou Henan 450064, China
    3. School of Information Engineering, Zhengzhou University, Zhengzhou Henan 450001, China
  • Received: 2022-04-07 Revised: 2022-07-19 Online: 2023-10-31 Published: 2023-02-16
  • About author: WANG Yu-ping (1979-), professor, master's degree. Her main research interests cover machine vision, virtual reality, and machine learning. E-mail: wangyupingpaper@163.com
  • Supported by:
    Henan Provincial Department of Science and Technology Science and Technology Project(222102210174)

Abstract:

3D human pose estimation is the foundation of human behavior understanding, but predicting reasonable 3D human pose sequences remains a challenging problem. To address this problem, a Transformer-based 3D human pose estimation method was proposed, utilizing multi-layer long short-term memory (LSTM) units and a multi-scale Transformer structure to enhance the accuracy of human pose sequence prediction. First, a generator based on time series was designed, extracting image features through a pre-trained ResNet neural network. Secondly, multi-layer LSTM units were employed to learn the relationships between human poses in temporally continuous image sequences, thereby outputting a reasonable sequence of skinned multi-person linear (SMPL) human parameter models. Finally, a discriminator based on a multi-scale Transformer was constructed, in which the multi-scale Transformer structure learned detailed features at multiple segmentation granularities, and the Transformer blocks encoded relative positions to enhance the ability to learn local features. Experimental results show that the proposed method achieved better prediction accuracy than the VIBE method: its mean per joint position error (MPJPE) was 7.5% lower than that of VIBE on the 3DPW dataset and 1.8% lower on the MPI-INF-3DHP dataset.
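For context on the reported numbers: MPJPE is the standard accuracy metric for 3D pose estimation, the mean Euclidean distance between predicted and ground-truth joint positions. A common formulation (root-alignment conventions vary between benchmarks) is

$$\mathrm{MPJPE}=\frac{1}{T\,J}\sum_{t=1}^{T}\sum_{j=1}^{J}\bigl\lVert \hat{\mathbf{p}}_{t,j}-\mathbf{p}_{t,j}\bigr\rVert_{2},$$

where $\hat{\mathbf{p}}_{t,j}$ and $\mathbf{p}_{t,j}$ denote the predicted and ground-truth 3D positions of joint $j$ in frame $t$, over $J$ joints and $T$ frames; lower is better.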
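The following is a minimal PyTorch sketch of the pipeline the abstract describes: a pre-trained ResNet extracting per-frame features, multi-layer LSTM units producing an SMPL parameter sequence, and a Transformer-based discriminator scoring the sequence. All module names and layer sizes, the 85-dimensional SMPL parameter layout (72 pose + 10 shape + 3 camera, as in VIBE), and the single-scale encoder with a learned positional embedding (standing in for the paper's multi-scale structure and relative-position encoding) are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch, assuming a VIBE-style generator/discriminator split.
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights


class TemporalSMPLGenerator(nn.Module):
    """Pre-trained ResNet features -> multi-layer LSTM -> SMPL parameter sequence."""

    def __init__(self, hidden_dim=1024, num_lstm_layers=2, smpl_dim=85):
        super().__init__()
        backbone = resnet50(weights=ResNet50_Weights.DEFAULT)
        # Drop the classification head; keep the 2048-dim pooled features.
        self.backbone = nn.Sequential(*list(backbone.children())[:-1])
        self.lstm = nn.LSTM(2048, hidden_dim, num_layers=num_lstm_layers,
                            batch_first=True)
        self.head = nn.Linear(hidden_dim, smpl_dim)

    def forward(self, frames):                       # frames: (B, T, 3, H, W)
        B, T = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1))  # (B*T, 2048, 1, 1)
        feats = feats.flatten(1).view(B, T, -1)      # (B, T, 2048)
        hidden, _ = self.lstm(feats)                 # temporal modelling
        return self.head(hidden)                     # (B, T, 85) SMPL params


class MotionDiscriminator(nn.Module):
    """Transformer encoder scoring whether an SMPL sequence looks plausible."""

    def __init__(self, smpl_dim=85, d_model=256, nhead=8, num_layers=3,
                 max_len=128):
        super().__init__()
        self.embed = nn.Linear(smpl_dim, d_model)
        # Learned positional embedding; the paper instead uses relative-position
        # encoding inside multi-scale Transformer blocks.
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.score = nn.Linear(d_model, 1)

    def forward(self, smpl_seq):                     # (B, T, 85)
        x = self.embed(smpl_seq) + self.pos[:, :smpl_seq.size(1)]
        x = self.encoder(x)
        return self.score(x.mean(dim=1))             # one realism score per clip


if __name__ == "__main__":
    gen, disc = TemporalSMPLGenerator(), MotionDiscriminator()
    clip = torch.randn(2, 16, 3, 224, 224)           # 2 clips of 16 frames
    params = gen(clip)
    print(params.shape, disc(params).shape)          # (2, 16, 85) (2, 1)
```

In an adversarial setup of this kind, the generator would be trained with a pose/parameter reconstruction loss plus the discriminator's realism score; those training details, and the multi-scale segmentation of the sequence, are omitted from this sketch.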

Key words: multi-scale Transformer structure, LSTM unit, time series, attention mechanism
