Journal of Graphics ›› 2024, Vol. 45 ›› Issue (1): 159-168. DOI: 10.11996/JG.j.2095-302X.2024010159
Received: 2023-07-25
Accepted: 2023-10-28
Published: 2024-02-29
Online: 2024-02-29
Corresponding author: YANG Hongyu (1990-), female, associate professor, Ph.D. Her main research interests cover computer vision and pattern recognition. E-mail: hongyuyang@buaa.edu.cn
First author: LV Heng (2001-), master student. His main research interests cover computer vision and machine learning. E-mail: 19373716@buaa.edu.cn
Abstract: 3D human pose estimation plays an important role in fields such as virtual reality and human-computer interaction. In recent years, Transformers have been introduced into 3D human pose estimation to capture the spatio-temporal motion information of human joints. However, existing studies typically focus either on the overall motion of the whole set of joints or on modeling the motion of each joint in isolation, without examining in depth the distinct motion pattern of each joint and the mutual influence between the motions of different joints. Therefore, this paper proposes a novel method that learns the spatial information of the 2D human joints in each frame in detail and analyzes the specific motion pattern of each joint. A motion information interaction module based on the Transformer encoder is designed to accurately capture the dynamic motion relationships between different joints. Compared with existing models that directly learn the overall motion of the human joints, the proposed method improves prediction accuracy by about 3%. Compared with the state-of-the-art MixSTE model, which focuses on the motion of individual joints, the proposed model captures the spatio-temporal features of joints more efficiently and achieves an inference speed improvement of more than 20%, making it better suited to real-time inference scenarios.
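The paper itself includes no source code. Purely as an illustration of the idea sketched in the abstract (per-frame spatial encoding of 2D joints followed by a Transformer encoder in which per-joint motion features attend to one another), a minimal PyTorch sketch might look as follows; every module name, dimension, and wiring choice here is an assumption, not the authors' implementation.

```python
# Illustrative sketch only -- NOT the authors' implementation.
# Assumed input: x of shape (batch, frames, joints, 2) holding 2D keypoints.
# Positional encodings are omitted for brevity.
import torch
import torch.nn as nn

class PoseSketch(nn.Module):
    def __init__(self, joints=17, frames=27, d_spatial=512, d_motion=32,
                 heads=8, spatial_layers=6, motion_layers=2):
        super().__init__()
        self.joint_embed = nn.Linear(2, d_spatial)  # lift each 2D joint to a feature vector
        spatial_layer = nn.TransformerEncoderLayer(d_spatial, heads, batch_first=True)
        self.spatial_encoder = nn.TransformerEncoder(spatial_layer, spatial_layers)
        # Compress each joint's feature trajectory into a compact "motion" token.
        self.to_motion = nn.Linear(frames * d_spatial, d_motion)
        motion_layer = nn.TransformerEncoderLayer(d_motion, 4, batch_first=True)
        self.motion_encoder = nn.TransformerEncoder(motion_layer, motion_layers)
        self.head = nn.Linear(d_motion, 3)  # regress a 3D position per joint

    def forward(self, x):                                     # x: (B, T, J, 2)
        b, t, j, _ = x.shape
        feats = self.joint_embed(x).reshape(b * t, j, -1)     # spatial attention within each frame
        feats = self.spatial_encoder(feats).reshape(b, t, j, -1)
        per_joint = feats.permute(0, 2, 1, 3).reshape(b, j, -1)   # gather each joint's trajectory
        motion = self.motion_encoder(self.to_motion(per_joint))   # joints' motions attend to each other
        return self.head(motion)                              # (B, J, 3)

pose_3d = PoseSketch()(torch.randn(1, 27, 17, 2))
print(pose_3d.shape)  # torch.Size([1, 17, 3])
```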
LV Heng, YANG Hongyu. A 3D human pose estimation approach based on spatio-temporal motion interaction modeling[J]. Journal of Graphics, 2024, 45(1): 159-168.
Fig. 1 Architecture of the Transformer-based node motion information interaction model ((a) Overall architecture; (b) Structure of the spatial Transformer encoder)
Table 1 Experimental environment versions
Name | Version |
---|---|
Ubuntu | 20.04 |
CUDA | 11.2 |
PyTorch | 1.9 |
Python | 3.9 |
cuDNN | 7 |
Table 2 Model parameter configuration
Parameter | Value |
---|---|
Receptive field (frames) | 27 |
Temporal Transformer encoder layers | 3 |
Spatial Transformer encoder layers | 6 |
Motion Transformer encoder layers | 2 |
Learning rate | 0.00014 |
Number of attention heads | 8 |
Feature dimension of the 2D joint embedding | 512 |
Feature dimension of the motion Transformer encoder | 32 |
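For convenience, the settings in Table 2 could be gathered in a single configuration object along the following lines; the field names are illustrative only and do not come from the paper's code.

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    receptive_field: int = 27     # frames seen by the model
    temporal_layers: int = 3      # temporal Transformer encoder layers
    spatial_layers: int = 6       # spatial Transformer encoder layers
    motion_layers: int = 2        # motion Transformer encoder layers
    learning_rate: float = 1.4e-4
    num_heads: int = 8            # attention heads
    spatial_dim: int = 512        # feature dimension of the 2D joint embedding
    motion_dim: int = 32          # feature dimension of the motion Transformer encoder

cfg = TrainConfig()
print(cfg)
```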
Table 3 Experimental results on the Human3.6M dataset
Action | MPJPE/mm | P-MPJPE/mm | MPJVE/mm | Total frames | Total inference time/ms |
---|---|---|---|---|---|
Dir. | 42.2 | 33.1 | 3.1 | 26 568 | 314 |
Disc. | 46.1 | 36.0 | 3.3 | 64 476 | 348 |
Eat. | 44.2 | 35.3 | 2.5 | 39 528 | 368 |
Greet | 44.3 | 35.6 | 3.6 | 30 780 | 328 |
Phone | 49.0 | 37.4 | 2.4 | 56 268 | 275 |
Photo | 54.1 | 41.6 | 3.0 | 29 484 | 294 |
Pose | 43.4 | 33.0 | 2.9 | 27 432 | 344 |
Purch. | 44.1 | 33.2 | 3.3 | 19 440 | 305 |
Sit | 56.0 | 45.9 | 2.1 | 40 392 | 336 |
SitD. | 65.0 | 51.2 | 2.9 | 33 588 | 303 |
Smoke | 47.5 | 37.5 | 2.5 | 55 836 | 311 |
Wait | 43.5 | 32.1 | 2.7 | 38 016 | 373 |
WalkD. | 48.9 | 37.3 | 4.0 | 28 512 | 345 |
Walk | 32.5 | 25.1 | 3.4 | 29 484 | 284 |
WalkT. | 34.3 | 27.4 | 3.0 | 26 460 | 317 |
Average | 46.3 | 36.1 | 3.0 | 36 418 | 323 |
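The metrics reported in Table 3 follow standard definitions: MPJPE is the mean Euclidean distance between predicted and ground-truth 3D joints, P-MPJPE is the same error after rigid Procrustes alignment, and MPJVE measures the error of per-frame joint velocities. A minimal NumPy sketch of MPJPE and MPJVE (assuming arrays of shape frames × joints × 3, in mm):

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance over joints and frames."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def mpjve(pred, gt):
    """Mean per-joint velocity error: MPJPE of first-order temporal differences."""
    return mpjpe(np.diff(pred, axis=0), np.diff(gt, axis=0))

pred = np.random.rand(100, 17, 3) * 1000  # dummy sequences, frames x joints x 3 (mm)
gt = np.random.rand(100, 17, 3) * 1000
print(mpjpe(pred, gt), mpjve(pred, gt))
```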
Table 4 Comparison with existing methods
Method | AVG-MPJPE/mm |
---|---|
Ref. [ | 62.9 |
Ref. [ | 73.9 |
Ref. [ | 59.5 |
Ref. [ | 58.1 |
Ref. [ | 47.5 |
Ref. [ | 47.6 |
Ref. [ | 48.3 |
Ref. [ | 47.0 |
Jointformer[ | 50.1 |
PoseFormer[ | 47.5 |
PoseFormer+Seq2Seq[ | 53.6 |
MixSTE[ | 45.3 |
Ours (f=27) | 46.3 |
Table 5 Comparison of model inference speed
Model (f=27) | Average inference time per 1 000 frames/ms | Parameters/M | FLOPs/M |
---|---|---|---|
PoseFormer[ | 48.5 | 9.59 | 452 |
PoseFormer+Seq2Seq[ | 39.5 | 44.6 | 26 405 |
MixSTE[ | 11.3 | 33.7 | 83 |
Ours | 8.9 | 23.1 | 177 |
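The quantities in Table 5 can be measured roughly along the following lines; the model below is a placeholder, and FLOPs counting (which normally relies on an external profiler such as thop or fvcore) is omitted.

```python
import time
import torch
import torch.nn as nn

def count_parameters_m(model: nn.Module) -> float:
    """Trainable parameter count in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

@torch.no_grad()
def avg_time_per_1000_frames_ms(model, frames=27, joints=17, n_runs=50):
    """Average wall-clock inference time, normalized to 1 000 input frames."""
    model.eval()
    x = torch.randn(1, frames, joints, 2)
    start = time.perf_counter()
    for _ in range(n_runs):
        model(x)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return elapsed_ms / (n_runs * frames) * 1000

dummy = nn.Sequential(nn.Flatten(), nn.Linear(27 * 17 * 2, 17 * 3))  # placeholder model
print(count_parameters_m(dummy), avg_time_per_1000_frames_ms(dummy))
```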
Fig. 4 Visualization of prediction results under occlusion scenarios ((a) Input; (b) Results of PoseFormer; (c) Results of our model; (d) Ground truth)
Table 6 Experimental results on the MPI-INF-3DHP dataset
Method | MPJPE/mm↓ | AUC/%↑ | PCK/%↑ |
---|---|---|---|
Ref. [ | 79.8 | 51.4 | 83.6 |
Ref. [ | 99.7 | 46.1 | 81.2 |
Ref. [ | 90.3 | 50.1 | 81.0 |
Ref. [ | 101.5 | 43.1 | 79.5 |
Ref. [ | 68.1 | 62.1 | 86.9 |
Ref. [ | 73.0 | 57.3 | 88.6 |
PoseFormer[ | 77.1 | 56.4 | 88.6 |
Ours (f=27) | 57.4 | 64.2 | 94.1 |
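PCK and AUC on MPI-INF-3DHP are conventionally derived from per-joint errors: PCK is the percentage of joints within a 150 mm threshold, and AUC averages PCK over thresholds from 0 to 150 mm. A sketch under those standard definitions (the paper does not spell out its exact evaluation code):

```python
import numpy as np

def pck(pred, gt, threshold_mm=150.0):
    """Percentage of joints whose error is below the threshold (pred/gt: frames x joints x 3, mm)."""
    errors = np.linalg.norm(pred - gt, axis=-1)
    return (errors < threshold_mm).mean() * 100

def auc(pred, gt, max_threshold_mm=150.0, steps=31):
    """Area under the PCK curve, averaged over evenly spaced thresholds in [0, 150] mm."""
    thresholds = np.linspace(0, max_threshold_mm, steps)
    return np.mean([pck(pred, gt, t) for t in thresholds])

pred = np.random.rand(10, 17, 3) * 200
gt = np.random.rand(10, 17, 3) * 200
print(pck(pred, gt), auc(pred, gt))
```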
Table 7 Results of hyperparameter tuning and ablation experiments
Motion Transformer encoder layers | Feature dimension of the 2D joint embedding | Feature dimension of the motion Transformer encoder | AVG-MPJPE/mm↓ | Inference time/ms↓ |
---|---|---|---|---|
1 | 64 | 64 | 49.3 | 4.6 |
2 | 64 | 64 | 48.2 | 4.9 |
3 | 64 | 64 | 48.2 | 6.1 |
2 | 128 | 64 | 48.5 | 6.7 |
2 | 256 | 64 | 47.5 | 7.9 |
2 | 512 | 64 | 46.4 | 9.4 |
2 | 512 | 128 | 46.4 | 16.3 |
2 | 512 | 32 | 46.3 | 8.9 |
0 | 512 | 32 | 48.4 | 5.9 |
Table 8 Results of the loss function ablation experiment
Lw | Lt | Lm | AVG-MPJPE/mm↓ |
---|---|---|---|
√ | × | × | 47.0 |
√ | √ | × | 46.6 |
√ | √ | √ | 46.3 |
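Table 8 indicates that the final objective sums three terms. Their exact forms are not reproduced here; one hedged reading is that Lw is a (possibly joint-weighted) position loss, Lt a temporal term on joint velocities, and Lm a higher-order motion term. A sketch under those assumptions:

```python
import torch

def combined_loss(pred, gt, joint_weights=None, lambda_t=1.0, lambda_m=1.0):
    """Illustrative total loss: position term Lw + temporal term Lt + motion term Lm.

    pred, gt: (batch, frames, joints, 3); joint_weights: optional (joints,) tensor.
    """
    per_joint = torch.norm(pred - gt, dim=-1)                      # (B, T, J)
    if joint_weights is not None:
        per_joint = per_joint * joint_weights
    l_w = per_joint.mean()                                         # weighted position error
    vel_pred, vel_gt = pred[:, 1:] - pred[:, :-1], gt[:, 1:] - gt[:, :-1]
    l_t = torch.norm(vel_pred - vel_gt, dim=-1).mean()             # velocity (temporal) error
    acc_pred, acc_gt = vel_pred[:, 1:] - vel_pred[:, :-1], vel_gt[:, 1:] - vel_gt[:, :-1]
    l_m = torch.norm(acc_pred - acc_gt, dim=-1).mean()             # higher-order motion error
    return l_w + lambda_t * l_t + lambda_m * l_m

loss = combined_loss(torch.randn(2, 27, 17, 3), torch.randn(2, 27, 17, 3))
print(loss.item())
```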
[1] | HE Y H, YAN R, FRAGKIADAKI K, et al. Epipolar transformers[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 7776-7785. |
[2] | CHEN X P, LIN K Y, LIU W T, et al. Weakly-supervised discovery of geometry-aware representation for 3D human pose estimation[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 10887-10896. |
[3] | ZHENG C, ZHU S J, MENDIETA M, et al. 3D human pose estimation with spatial and temporal transformers[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2022: 11636-11645. |
[4] | SHI M Y, ABERMAN K, ARISTIDOU A, et al. MotioNet: 3D human motion reconstruction from monocular video with skeleton consistency[J]. ACM Transactions on Graphics, 2021, 40(1): 1:1-1:15. |
[5] | PAVLLO D, FEICHTENHOFER C, GRANGIER D, et al. 3D human pose estimation in video with temporal convolutions and semi-supervised training[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 7745-7754. |
[6] | MARTINEZ J, HOSSAIN R, ROMERO J, et al. A simple yet effective baseline for 3D human pose estimation[C]// 2017 IEEE International Conference on Computer Vision. New York: IEEE Press, 2017: 2659-2668. |
[7] | HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 770-778. |
[8] | VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]// The 31st International Conference on Neural Information Processing Systems. New York: ACM, 2017: 6000-6010. |
[9] | DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16×16 words: transformers for image recognition at scale[EB/OL]. [2023-06-02]. https://arxiv.org/abs/2010.11929.pdf. |
[10] | ZHANG J L, TU Z G, YANG J Y, et al. MixSTE: Seq2seq mixed spatio-temporal encoder for 3D human pose estimation in video[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 13222-13232. |
[11] | LIU Z, LIN Y T, CAO Y, et al. Swin transformer: hierarchical vision transformer using shifted windows[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2022: 9992-10002. |
[12] | SUTSKEVER I, VINYALS O, LE Q V. Sequence to sequence learning with neural networks[C]// The 27th International Conference on Neural Information Processing Systems - Volume 2. New York: ACM, 2014: 3104-3112. |
[13] | HOSSAIN M R I, LITTLE J J. Exploiting temporal information for 3D human pose estimation[C]// Computer Vision - ECCV 2018: 15th European Conference. New York: ACM, 2018: 69-86. |
[14] | IONESCU C, PAPAVA D, OLARU V, et al. Human3.6M: large scale datasets and predictive methods for 3D human sensing in natural environments[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014, 36(7): 1325-1339. |
[15] | CHEN Y L, WANG Z C, PENG Y X, et al. Cascaded pyramid network for multi-person pose estimation[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 7103-7112. |
[16] | ZHENG C, WU W H, CHEN C, et al. Deep learning-based human pose estimation: a survey[J]. ACM Computing Surveys, 56(1): 11:1-11:37. |
[17] | LI W H, LIU H, TANG H, et al. MHFormer: multi-hypothesis transformer for 3D human pose estimation[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 13137-13146. |
[18] | LI C, LEE G H. Weakly supervised generative network for multiple 3D human pose hypotheses[EB/OL]. [2023-06-02]. https://arxiv.org/abs/2008.05770.pdf. |
[19] | BANIK S, GARCÍA A M, KNOLL A. 3D human pose regression using graph convolutional network[C]// 2021 IEEE International Conference on Image Processing. New York: IEEE Press, 2021: 924-928. |
[20] | XU Y L, WANG W G, LIU T Y, et al. Monocular 3D pose estimation via pose grammar and data augmentation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44(10): 6327-6344. |
[21] | ORESHKIN B N. HybrIK-Transformer[EB/OL]. [2023-06-22]. https://arxiv.org/abs/2302.04774. |
[22] | CAI J L, LIU H, DING R W, et al. HTNet: human topology aware network for 3D human pose estimation[C]// ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing. New York: IEEE Press, 2023: 1-5. |
[23] | KIM J, GWON M G, PARK H, et al. Sampling is matter: point-guided 3D human mesh reconstruction[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 12880-12889. |
[24] | HASSAN M T, BEN HAMZA A. Regular splitting graph network for 3D human pose estimation[J]. IEEE Transactions on Image Processing, 2023, 32. |
[25] | LUTZ S, BLYTHMAN R, GHOSAL K, et al. Jointformer: single-frame lifting transformer with error prediction and refinement for 3D human pose estimation[C]// 2022 26th International Conference on Pattern Recognition. New York: IEEE Press, 2022: 1156-1163. |
[26] | LIN J H, LEE G H. Trajectory space factorization for deep video-based 3D human pose estimation[EB/OL]. [2023-06-02]. https://arxiv.org/abs/1908.08289.pdf. |
[27] | LI S C, KE L, PRATAMA K, et al. Cascaded deep monocular 3D human pose estimation with evolutionary training data[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 6172-6182. |
[28] | BOUAZIZI A, KRESSEL U, BELAGIANNIS V. Learning temporal 3D human pose estimation with pseudo- labels[C]// 2021 17th IEEE International Conference on Advanced Video and Signal Based Surveillance. New York: IEEE Press, 2022: 1-8. |
[29] | GUAN S Y, XU J W, HE M Z, et al. Out-of-domain human mesh reconstruction via dynamic bilevel online adaptation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(4): |
[30] | WANG J B, YAN S J, XIONG Y J, et al. Motion guided 3D pose estimation from videos[EB/OL]. [2023-06-02]. https://www.ecva.net/papers/eccv_2020/paper_ECCV/papers/123580749.pdf. |
[31] | GONG K H, ZHANG J F, FENG J S. PoseAug: a differentiable pose augmentation framework for 3D human pose estimation[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 8571-8580. |