
Journal of Graphics ›› 2024, Vol. 45 ›› Issue (1): 159-168. DOI: 10.11996/JG.j.2095-302X.2024010159

• Computer Graphics and Virtual Reality •

A 3D human pose estimation approach based on spatio-temporal motion interaction modeling

LV Heng1, YANG Hongyu2

  1. School of Computer Science and Engineering, Beihang University, Beijing 100191, China
    2. Institute of Artificial Intelligence, Beihang University, Beijing 100191, China
  • Received: 2023-07-25  Accepted: 2023-10-28  Published: 2024-02-29  Online: 2024-02-29
  • Corresponding author: YANG Hongyu (1990-), associate professor, Ph.D. Her main research interests cover computer vision and pattern recognition. E-mail: hongyuyang@buaa.edu.cn
  • First author: LV Heng (2001-), master student. His main research interests cover computer vision and machine learning. E-mail: 19373716@buaa.edu.cn
  • Supported by:
    Beijing Natural Science Foundation (4222049); National Natural Science Foundation of China (62202031)

Associate Professor YANG Hongyu of Beihang University and her student LV Heng designed a model based on spatio-temporal motion interaction modeling to estimate 3D human poses from monocular video. The model first learns the spatial information of the 2D human joints in each frame, then models the motion pattern of each joint separately, and finally uses a Transformer-encoder-based spatio-temporal correlation module to fully learn the dynamic motion relationships among different human joints. Experimental results show that this model captures the spatio-temporal features of human joints faster, making it better suited to real-time inference scenarios.

Abstract:

3D human pose estimation plays a crucial role in fields such as virtual reality and human-computer interaction. In recent years, the Transformer has been introduced into 3D human pose estimation to capture the spatio-temporal motion information of human joints. However, existing studies typically focus either on the collective movement of joint clusters or on modeling the movement of individual joints in isolation, without delving into the unique movement pattern of each joint or the interdependencies among joints. Consequently, an innovative approach was proposed that meticulously learns the spatial information of the 2D human joints in each frame and conducts an in-depth analysis of the specific movement pattern of each joint. Through a motion information interaction module based on the Transformer encoder, the proposed method accurately captures the dynamic relationships between different joints. Compared with existing models that directly learn the overall motion of human joints, the proposed method improves prediction accuracy by approximately 3%. Benchmarked against the state-of-the-art MixSTE model, which focuses primarily on individual joint movement, the proposed model captures the spatio-temporal features of joints more efficiently and achieves an inference speed improvement of over 20%, making it especially suitable for real-time inference scenarios.
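The abstract does not specify the interaction module at implementation level; as a rough, hypothetical sketch of the core idea it describes — each joint's motion feature attending to every other joint's via Transformer-style self-attention to capture inter-joint dynamic relationships — a single-head attention step might look like the following (the joint count, feature dimension, and random weights are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_interaction(motion_feats, Wq, Wk, Wv):
    """Single-head self-attention across joints.

    motion_feats: (J, d) array, one motion feature per joint.
    Returns the updated features and the (J, J) attention map,
    where row i weights how much joint i attends to each joint j.
    """
    Q, K, V = motion_feats @ Wq, motion_feats @ Wk, motion_feats @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # scaled dot-product scores
    attn = softmax(scores, axis=-1)
    return attn @ V, attn

rng = np.random.default_rng(0)
J, d = 17, 64  # 17 joints (Human3.6M convention), illustrative feature dim
feats = rng.standard_normal((J, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

out, attn = joint_interaction(feats, Wq, Wk, Wv)
print(out.shape, attn.shape)  # (17, 64) (17, 17)
```

In a full Transformer encoder this step would be followed by residual connections, layer normalization, and a feed-forward network, and stacked over several layers; the sketch only isolates the attention map through which every joint's motion can influence every other joint's.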

Key words: 3D human pose estimation, Transformer encoder, inter-joint motion, spatio-temporal information correlation, real-time inference
