
Journal of Graphics ›› 2024, Vol. 45 ›› Issue (1): 159-168. DOI: 10.11996/JG.j.2095-302X.2024010159

• Computer Graphics and Virtual Reality •

A 3D human pose estimation approach based on spatio-temporal motion interaction modeling

LV Heng1, YANG Hongyu2

  1. School of Computer Science and Engineering, Beihang University, Beijing 100191, China
    2. Institute of Artificial Intelligence, Beihang University, Beijing 100191, China
  • Received: 2023-07-25 Accepted: 2023-10-28 Online: 2024-02-29 Published: 2024-02-29
  • Contact: YANG Hongyu (1990-), associate professor, Ph.D. Her main research interests include computer vision and pattern recognition. E-mail: hongyuyang@buaa.edu.cn
  • About author:

    LV Heng (2001-), master student. His main research interests include computer vision and machine learning. E-mail: 19373716@buaa.edu.cn

  • Supported by:
    Beijing Natural Science Foundation (4222049); National Natural Science Foundation of China (62202031)

Abstract:

3D human pose estimation plays a crucial role in fields such as virtual reality and human-computer interaction. In recent years, the Transformer has been introduced into 3D human pose estimation to capture the spatio-temporal motion information of human joints. However, existing studies typically focus on the collective movement of joint clusters or exclusively model the movement of individual joints, without delving into the distinct movement pattern of each joint and the interdependencies among joints. Consequently, an innovative approach was proposed that learned the spatial information of the 2D human joints in each frame and analyzed the specific movement pattern of each joint. Through a motion information interaction module built on the Transformer encoder, the proposed method accurately captured the dynamic relationships between different joints. Compared with existing models that directly learned the overall motion of human joints, the proposed method improved prediction accuracy by approximately 3%. When benchmarked against the state-of-the-art MixSTE model, which primarily focused on individual joint movement, the proposed model captured the spatio-temporal features of joints more efficiently, achieving an inference speed boost of over 20% and making it especially suitable for real-time inference scenarios.
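The abstract does not include code, but the core idea of the motion information interaction module can be illustrated with a minimal sketch: treat each joint's motion as a token and let joints exchange information through self-attention, as in a Transformer encoder layer. The sketch below is an assumption-laden illustration, not the paper's implementation; the function name `joint_motion_interaction`, the feature dimension, and the single-head attention are all hypothetical simplifications (the paper's module would stack such layers with feed-forward sublayers, normalization, and learned embeddings).

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_motion_interaction(motion, Wq, Wk, Wv):
    """Single-head self-attention across joints (hypothetical sketch).

    motion: (J, C) per-joint motion features, e.g. a linear embedding of
            frame-to-frame displacements of the 2D joints.
    Returns a (J, C) array in which each joint's feature has attended to
    every other joint's motion, modeling inter-joint dynamic relationships.
    """
    Q, K, V = motion @ Wq, motion @ Wk, motion @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (J, J) joint-to-joint affinities
    attn = softmax(scores, axis=-1)           # each row sums to 1
    return attn @ V

# Toy example: 17 joints (as in the Human3.6M skeleton), 32-dim features.
rng = np.random.default_rng(0)
J, C = 17, 32
motion = rng.standard_normal((J, C))
Wq, Wk, Wv = (rng.standard_normal((C, C)) * 0.1 for _ in range(3))
out = joint_motion_interaction(motion, Wq, Wk, Wv)
print(out.shape)  # (17, 32)
```

Attending across joints rather than over a single joint's trajectory is what distinguishes this interaction step from purely per-joint temporal modeling; in the full method it would be combined with the per-frame spatial learning described above.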

Key words: 3D human pose estimation, Transformer encoder, inter-joint motion, temporal-spatial information correlation, real-time inference
