Journal of Graphics ›› 2025, Vol. 46 ›› Issue (2): 270-278. DOI: 10.11996/JG.j.2095-302X.2025020270
WANG Xueting, GUO Xin, WANG Song, CHEN Enqing
Received: 2024-08-27
Accepted: 2024-12-16
Published: 2025-04-30
Online: 2025-04-24
Contact: GUO Xin (1988-), associate professor, Ph.D. Her main research interests include machine learning and artificial intelligence. E-mail: iexguo@zzu.edu.cn
First author: WANG Xueting (2000-), master's student. Her main research interests include intelligent information processing. E-mail: wangxueting7270@163.com
Abstract:
Masked autoencoders (MAE) have been adopted in many fields for their strong self-supervised learning ability, achieving particularly good results on tasks where data are partially masked or little training data is available. In visual classification tasks such as action recognition, however, the limited feature-learning capacity of the encoder in the autoencoder architecture leads to unsatisfactory classification performance. To enable training with only a small amount of labeled data and to strengthen the feature-extraction ability of autoencoders for skeleton-based action recognition, a spatio-temporal masked reconstruction model based on the variational autoencoder (VAE), named SkeletonMVAE, was proposed. SkeletonMVAE inserts the latent space of a VAE after the encoder of the conventional masked reconstruction model, so that the encoder learns the latent structure of the data and richer information, and it uses a parameter β to regulate reconstruction quality during masked-reconstruction pre-training on skeleton data. When the pre-trained encoder serves as the feature extractor for downstream classification tasks, its output representations are more compact, discriminative, and robust, which improves classification accuracy and generalization and boosts performance when only a small amount of labeled training data is available. Experimental results on the NTU-60 and NTU-120 datasets demonstrate the effectiveness of the proposed method for skeleton-based action recognition.
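To connect the description above to an implementation, the sketch below shows the pre-training objective in PyTorch: a reconstruction loss on the masked skeleton positions plus a β-weighted KL term over the VAE latent space. All module and variable names (`MaskedVAE`, `fc_mu`, `feat_dim`, etc.) are illustrative assumptions rather than the authors' released code; the default β = 0.005 follows the best value in Table 6, and restricting the reconstruction loss to masked positions follows the usual masked-autoencoder convention.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedVAE(nn.Module):
    """Sketch of a masked-reconstruction VAE (all names are assumptions)."""

    def __init__(self, encoder: nn.Module, decoder: nn.Module,
                 feat_dim: int, latent_dim: int):
        super().__init__()
        self.encoder = encoder                            # e.g., an STTFormer-style backbone
        self.decoder = decoder
        self.fc_mu = nn.Linear(feat_dim, latent_dim)      # latent mean
        self.fc_logvar = nn.Linear(feat_dim, latent_dim)  # latent log-variance
        self.fc_up = nn.Linear(latent_dim, feat_dim)      # project z back to feature size

    def reparameterize(self, mu, logvar):
        # z = mu + sigma * eps, eps ~ N(0, I): keeps sampling differentiable
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)

    def forward(self, x_masked):
        h = self.encoder(x_masked)                 # features of the visible input
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = self.reparameterize(mu, logvar)        # sample from the latent space
        x_rec = self.decoder(self.fc_up(z))        # reconstruct the full sequence
        return x_rec, mu, logvar

def pretrain_loss(x_rec, x_orig, mask, mu, logvar, beta=0.005):
    # MSE on masked positions only, plus the beta-weighted KL divergence
    # between the approximate posterior N(mu, sigma^2) and the prior N(0, I).
    rec = F.mse_loss(x_rec[mask], x_orig[mask])
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + beta * kl
```

A small β (0.005 in Table 6) keeps the KL term from dominating the reconstruction objective, matching the abstract's description of β as a knob for reconstruction quality.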
WANG Xueting, GUO Xin, WANG Song, CHEN Enqing. Human skeleton action recognition method based on variational autoencoder masked reconstruction[J]. Journal of Graphics, 2025, 46(2): 270-278.
Fig. 1 SkeletonMVAE structure ((a) The pre-training process of SkeletonMVAE; (b) The fine-tuning process of SkeletonMVAE; (c) Latent space sampling; (d) The structure of the SkeletonMVAE encoder; (e) The structure of the SkeletonMVAE decoder; (f) The structure of the STTA block in STTFormer)
Fig. 2 The spatio-temporal random masking process ((a) Random masking on temporal frames; (b) Masking of skeleton joints on the unmasked frames)
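As a rough illustration of this two-stage masking, the following sketch first drops random frames and then masks random joints on the surviving frames. The (T, V, C) tensor layout and zero-filling of masked coordinates are assumptions; the default ratios follow the best setting in Table 4 (0.4/0.4).

```python
import torch

def spatio_temporal_mask(x: torch.Tensor,
                         frame_ratio: float = 0.4,
                         joint_ratio: float = 0.4):
    """x: skeleton sequence of shape (T frames, V joints, C channels).
    Returns the masked sequence and a boolean mask (True = masked)."""
    T, V, _ = x.shape
    mask = torch.zeros(T, V, dtype=torch.bool)

    # (a) randomly mask whole frames on the temporal dimension
    masked_frames = torch.randperm(T)[: int(T * frame_ratio)]
    mask[masked_frames] = True

    # (b) on the remaining frames, randomly mask individual joints
    dropped = set(masked_frames.tolist())
    for t in (t for t in range(T) if t not in dropped):
        masked_joints = torch.randperm(V)[: int(V * joint_ratio)]
        mask[t, masked_joints] = True

    x_masked = x.clone()
    x_masked[mask] = 0.0        # zero out the masked coordinates
    return x_masked, mask

# Example: a 64-frame, 25-joint sequence of 3-D coordinates (NTU joint layout)
x_masked, mask = spatio_temporal_mask(torch.randn(64, 25, 3))
```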
Fig. 3 t-SNE visualization of the latent space ((a) SkeletonMAE latent space; (b) SkeletonMVAE latent space)
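A figure like Fig. 3 can be reproduced with scikit-learn's t-SNE applied to the pre-trained encoder's output features; `features` and `labels` below are placeholder arrays, not values from the paper.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

def plot_latent_tsne(features: np.ndarray, labels: np.ndarray,
                     out_path: str = "tsne.png"):
    """features: (N, D) encoder outputs; labels: (N,) action classes."""
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
    plt.figure(figsize=(6, 6))
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab20", s=4)
    plt.title("t-SNE of encoder features")
    plt.savefig(out_path, dpi=300)
```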
Table 1 Fine-tuned results on the NTU RGB+D 60 and NTU RGB+D 120 datasets/%
| Method | Backbone | NTU-60 X-sub | NTU-60 X-view | NTU-120 X-sub | NTU-120 X-set |
|---|---|---|---|---|---|
| SkeletonCLR[23] | ST-GCN | 82.2 | 88.9 | 73.6 | 75.3 |
| CPM[21] | ST-GCN | 84.8 | 91.1 | 78.4 | 78.9 |
| CrosSCLR[23] | ST-GCN | 86.2 | 92.5 | 80.5 | 80.4 |
| AimCLR[24] | ST-GCN | 86.9 | 92.8 | 80.1 | 80.9 |
| AimCLR[24] | STTFormer | 83.9 | 90.4 | 74.6 | 77.2 |
| CrosSCLR[23] | STTFormer | 84.6 | 90.5 | 75.0 | 77.9 |
| Hi-TRS[25] | Transformer | 86.0 | 93.0 | 80.6 | 81.6 |
| SkeletonMAE[12] | STTFormer | 86.6 | 92.9 | 76.8 | 79.1 |
| Ours | STTFormer | 88.4 | 93.1 | 80.6 | 83.5 |
Table 2 Comparison of fine-tuned results on the NTU RGB+D 60 dataset with fewer labeled data/%
| Method | X-sub 5% | X-sub 10% | X-view 5% | X-view 10% |
|---|---|---|---|---|
| Hi-TRS[25] | 63.3 | 70.7 | 68.3 | 74.8 |
| CrosSCLR[23] | 63.5 | 71.0 | 66.9 | 75.1 |
| AimCLR[24] | 63.9 | 70.2 | 67.5 | 76.2 |
| CPM[21] | - | 73.0 | - | 77.1 |
| SkeletonMAE[12] | 64.4 | 73.0 | 68.8 | 76.9 |
| Ours | 65.1 | 73.7 | 69.3 | 77.5 |
Table 3 Comparison of fine-tuned results on the NTU RGB+D 120 dataset with fewer labeled data/%
| Method | X-sub 5% | X-sub 10% | X-set 5% | X-set 10% |
|---|---|---|---|---|
| CrosSCLR[23] | 50.2 | 58.5 | 50.4 | 60.6 |
| AimCLR[24] | 49.0 | 58.6 | 51.8 | 60.5 |
| SkeletonMAE[12] | 50.4 | 61.8 | 52.0 | 62.5 |
| Ours | 53.9 | 62.7 | 53.0 | 64.6 |
Table 4 Ablation study on frame and joint masking ratios/%
| Frame masking ratio | Joint masking ratio | NTU-60 X-sub |
|---|---|---|
| 0.4 | 0.4 | 88.4 |
| 0.4 | 0.6 | 87.0 |
| 0.4 | 0.8 | 87.8 |
| 0.5 | 0.4 | 87.5 |
| 0.5 | 0.6 | 87.5 |
| 0.5 | 0.8 | 87.4 |
| 0.6 | 0.4 | 87.4 |
| 0.6 | 0.6 | 87.5 |
| 0.6 | 0.8 | 87.4 |
Fig. 4 Masking strategies ((a) Random masking on the temporal dimension only; (b) Random masking on the spatial dimension only; (c) Fixed masking on both the temporal and spatial dimensions; (d) Random masking on both the temporal and spatial dimensions)
Table 5 Ablation study on masking strategies/%
| Masking strategy | NTU-60 X-sub |
|---|---|
| Random masking on temporal dimension only | 87.5 |
| Random masking on spatial dimension only | 86.7 |
| Fixed masking on both temporal and spatial dimensions | 87.3 |
| Random masking on both temporal and spatial dimensions | 88.4 |
Table 6 Ablation study on the β value/%
| β value | NTU-60 X-sub |
|---|---|
| 5 | 87.4 |
| 1 | 88.0 |
| 0.5 | 87.9 |
| 0.005 | 88.4 |
| 0.0005 | 87.7 |
Table 7 Ablation study on the latent variable dimension/%
| Latent variable dimension | NTU-60 X-sub |
|---|---|
| 15 | 87.7 |
| 25 | 88.4 |
| 35 | 87.5 |
| 45 | 87.6 |
| 55 | 86.9 |
| 65 | 87.6 |
Table 8 Ablation study on the decoder embedding dimension/%
| Decoder embedding dimension | NTU-60 X-sub |
|---|---|
| 128 | 86.1 |
| 256 | 88.4 |
| 512 | 86.6 |
Table 9 Ablation study on decoder depth/%
| Decoder depth | NTU-60 X-sub |
|---|---|
| 5 | 87.9 |
| 7 | 87.5 |
| 9 | 88.4 |
| 11 | 87.9 |
| [1] | FEICHTENHOFER C, FAN H Q, MALIK J, et al. SlowFast networks for video recognition[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 6201-6210. |
| [2] | BI C Y, LIU Y. A survey of video human action recognition based on deep learning[J]. Journal of Graphics, 2023, 44(4): 625-639 (in Chinese). |
| [3] | XU J W, YU Z B, NI B B, et al. Deep kinematics analysis for monocular 3D human pose estimation[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 896-905. |
| [4] | DUAN H D, ZHAO Y, CHEN K, et al. Revisiting skeleton-based action recognition[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 2959-2968. |
| [5] | LI S Y, WANG X T, CHEN X L, et al. Human action recognition based on skeleton dynamic temporal filter[J]. Journal of Graphics, 2024, 45(4): 760-769 (in Chinese). |
| [6] | LU P, JIANG T, LI Y N, et al. RTMO: towards high-performance one-stage real-time multi-person pose estimation[C]// 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2024: 1491-1500. |
| [7] | LI S, LI W Q, COOK C, et al. Independently recurrent neural network: building a longer and deeper RNN[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 5457-5466. |
| [8] | BANERJEE A, SINGH P K, SARKAR R. Fuzzy integral-based CNN classifier fusion for 3D skeleton action recognition[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2021, 31(6): 2206-2216. |
| [9] | LIU Z Y, ZHANG H W, CHEN Z H, et al. Disentangling and unifying graph convolutions for skeleton-based action recognition[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 140-149. |
| [10] | ZHANG Y H, WU B, LI W, et al. STST: spatial-temporal specialized transformer for skeleton-based action recognition[C]// The 29th ACM International Conference on Multimedia. New York: ACM, 2021: 3229-3237. |
| [11] | HE K M, CHEN X L, XIE S N, et al. Masked autoencoders are scalable vision learners[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 15979-15988. |
| [12] | WU W H, HUA Y L, ZHENG C, et al. SkeletonMAE: spatial-temporal masked autoencoders for self-supervised skeleton action recognition[C]// 2023 IEEE International Conference on Multimedia and Expo Workshops. New York: IEEE Press, 2023: 224-229. |
| [13] | KINGMA D P, WELLING M. Auto-encoding variational Bayes[EB/OL]. (2022-12-10)[2024-06-27]. https://arxiv.org/abs/1312.6114. |
| [14] | DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[EB/OL]. (2019-05-24)[2024-06-27]. https://arxiv.org/abs/1810.04805. |
| [15] | TONG Z, SONG Y B, WANG J, et al. VideoMAE: masked autoencoders are data-efficient learners for self-supervised video pre-training[C]// The 36th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2022: 732. |
| [16] | QING Z W, ZHANG S W, HUANG Z Y, et al. MAR: masked autoencoders for efficient action recognition[J]. IEEE Transactions on Multimedia, 2024, 26: 218-233. |
| [17] | HIGGINS I, MATTHEY L, PAL A, et al. β-VAE: learning basic visual concepts with a constrained variational framework[EB/OL]. [2024-06-27]. https://openreview.net/pdf?id=Sy2fzU9gl. |
| [18] | SHAHROUDY A, LIU J, NG T T, et al. NTU RGB+D: a large scale dataset for 3D human activity analysis[C]// IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 1010-1019. |
| [19] | LIU J, SHAHROUDY A, PEREZ M, et al. NTU RGB+D 120: a large-scale benchmark for 3D human activity understanding[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 42(10): 2684-2701. |
| [20] | PASZKE A, GROSS S, MASSA F, et al. PyTorch: an imperative style, high-performance deep learning library[C]// The 33rd International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2019: 721. |
| [21] | ZHANG H Y, HOU Y H, ZHANG W J, et al. Contrastive positive mining for unsupervised 3D action representation learning[C]// The 17th European Conference on Computer Vision. Cham: Springer, 2022: 36-51. |
| [22] | HUA Y L, WU W H, ZHENG C, et al. Part aware contrastive learning for self-supervised action recognition[EB/OL]. (2023-05-11)[2024-06-27]. https://arxiv.org/abs/2305.00666. |
| [23] | LI L G, WANG M S, NI B B, et al. 3D human action representation learning via cross-view consistency pursuit[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 4739-4748. |
| [24] | GUO T Y, LIU H, CHEN Z, et al. Contrastive learning from extremely augmented skeleton sequences for self-supervised action recognition[C]// The 36th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2022: 762-770. |
| [25] | CHEN Y X, ZHAO L, YUAN J B, et al. Hierarchically self-supervised transformer for human skeleton representation learning[C]// The 17th European Conference on Computer Vision. Cham: Springer, 2022: 185-202. |