
Journal of Graphics ›› 2024, Vol. 45 ›› Issue (1): 159-168. DOI: 10.11996/JG.j.2095-302X.2024010159

• Computer Graphics and Virtual Reality •

A 3D human pose estimation approach based on spatio-temporal motion interaction modeling

LV Heng1, YANG Hongyu2

  1. School of Computer Science and Engineering, Beihang University, Beijing 100191, China
    2. Institute of Artificial Intelligence, Beihang University, Beijing 100191, China
  • Received: 2023-07-25  Accepted: 2023-10-28  Published: 2024-02-29  Online: 2024-02-29
  • Corresponding author: YANG Hongyu (1990-), associate professor, Ph.D. Her main research interests cover computer vision and pattern recognition. E-mail: hongyuyang@buaa.edu.cn
  • First author: LV Heng (2001-), master student. His main research interests cover computer vision and machine learning. E-mail: 19373716@buaa.edu.cn
  • Supported by:
    Beijing Natural Science Foundation (4222049); National Natural Science Foundation of China (62202031)

Associate Professor YANG Hongyu of Beihang University and her student LV Heng designed a model based on spatio-temporal motion interaction modeling to estimate 3D human poses from monocular video. The model first learns the spatial information of the 2D human joints in each frame, then models the motion pattern of each joint separately, and finally uses a Transformer-encoder-based spatio-temporal correlation module to fully learn the dynamic motion relationships among different human joints. Experimental results show that this model captures the spatio-temporal features of human joints faster, making it better suited to real-time inference scenarios.

Abstract:

3D human pose estimation plays a crucial role in fields such as virtual reality and human-computer interaction. In recent years, the Transformer has been introduced into 3D human pose estimation to capture the spatio-temporal motion information of human joints. However, existing studies typically focus either on the collective movement of joint clusters or on modeling the movement of individual joints in isolation, without delving into the unique movement pattern of each joint or the interdependencies among joints. Consequently, an innovative approach was proposed that meticulously learns the spatial information of the 2D human joints in each frame and conducts an in-depth analysis of the specific movement pattern of each joint. Through a motion information interaction module based on the Transformer encoder, the proposed method accurately captures the dynamic relationships between different joints. Compared with existing models that directly learn the overall motion of human joints, the proposed method improves prediction accuracy by approximately 3%. Benchmarked against the state-of-the-art MixSTE model, which focuses primarily on individual joint movement, the proposed model captures the spatio-temporal features of joints more efficiently and achieves an inference speed improvement of over 20%, making it especially suitable for real-time inference scenarios.
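The abstract does not specify the interaction module at implementation level; as a rough, hypothetical sketch of the core idea it describes — each joint's motion feature attending to every other joint's via Transformer-style self-attention to capture inter-joint dynamic relationships — a single-head attention step might look like the following (the joint count, feature dimension, and random weights are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_interaction(motion_feats, Wq, Wk, Wv):
    """Single-head self-attention across joints.

    motion_feats: (J, d) array, one motion feature per joint.
    Returns the updated features and the (J, J) attention map,
    where row i weights how much joint i attends to each joint j.
    """
    Q, K, V = motion_feats @ Wq, motion_feats @ Wk, motion_feats @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # scaled dot-product scores
    attn = softmax(scores, axis=-1)
    return attn @ V, attn

rng = np.random.default_rng(0)
J, d = 17, 64  # 17 joints (Human3.6M convention), illustrative feature dim
feats = rng.standard_normal((J, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

out, attn = joint_interaction(feats, Wq, Wk, Wv)
print(out.shape, attn.shape)  # (17, 64) (17, 17)
```

In a full Transformer encoder this step would be followed by residual connections, layer normalization, and a feed-forward network, and stacked over several layers; the sketch only isolates the attention map through which every joint's motion can influence every other joint's.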

Key words: 3D human pose estimation, Transformer encoder, inter-joint motion, spatio-temporal information correlation, real-time inference
