Journal of Graphics ›› 2025, Vol. 46 ›› Issue (2): 270-278. DOI: 10.11996/JG.j.2095-302X.2025020270

• Image Processing and Computer Vision •

Human skeleton action recognition method based on variational autoencoder masked reconstruction

WANG Xueting, GUO Xin, WANG Song, CHEN Enqing

  1. College of Electrical and Information Engineering, Zhengzhou University, Zhengzhou, Henan 450001, China
  • Received: 2024-08-27 Accepted: 2024-12-16 Published: 2025-04-30 Online: 2025-04-24
  • Corresponding author: GUO Xin (1988-), female, associate professor, Ph.D. Her main research interests cover machine learning and artificial intelligence. E-mail: iexguo@zzu.edu.cn
  • First author: WANG Xueting (2000-), female, master's student. Her main research interest covers intelligent information processing. E-mail: wangxueting7270@163.com
  • Supported by:
    National Natural Science Foundation of China Youth Science Foundation (62301497); National Natural Science Foundation of China Youth Science Foundation (62101503); Joint Fund for Science and Technology Research Projects of Henan Province (235200810050)

Abstract:

Masked autoencoders (MAE) have been applied in many fields because of their strong self-supervised learning ability, and they perform particularly well in tasks where data are occluded or little training data is available. However, in visual classification tasks such as action recognition, classification performance is poor because the encoder in the autoencoder structure has limited feature-learning ability. To train the model with a small amount of labeled data and to improve the feature extraction ability of autoencoders in skeleton action recognition, a spatial-temporal masked reconstruction model based on the variational autoencoder (VAE), named SkeletonMVAE, was proposed. The model introduced the latent space of the VAE after the encoder of the conventional masked reconstruction model, allowing the encoder to learn the latent structure of the data and richer information, and used the parameter β to regulate reconstruction quality during masked-reconstruction pretraining on skeleton data. When the pretrained encoder was used as a feature extractor for the downstream classification task, its output feature representations were more compact, discriminative, and robust, which helped improve classification accuracy and generalization ability and improved model performance when only a small amount of labeled data was available for training. Experimental results on the NTU-60 and NTU-120 datasets demonstrated the effectiveness of the proposed method for human skeleton action recognition.

Key words: human skeleton action recognition, self-supervised learning, spatial-temporal masked reconstruction, variational autoencoder, latent space aggregation
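The abstract describes a pretraining objective that combines masked reconstruction with a VAE latent space weighted by β, but it does not give the formulas. The following is a minimal sketch of a generic β-VAE style objective under stated assumptions: the reconstruction term is a squared error computed only on masked joints, the latent prior is a standard normal, and tokens are masked uniformly at random. All function names (`random_mask`, `reparameterize`, `kl_to_standard_normal`, `masked_mse`, `beta_vae_masked_loss`) are hypothetical and are not taken from the paper; the actual SkeletonMVAE encoder, masking strategy, and loss may differ.

```python
import math
import random


def random_mask(num_tokens, ratio, rng):
    """Return a boolean list with round(ratio * num_tokens) entries set to True.

    Assumption: tokens (for example, joint/frame patches) are masked uniformly
    at random; the paper's spatial-temporal masking strategy may differ.
    """
    num_masked = round(ratio * num_tokens)
    masked = set(rng.sample(range(num_tokens), num_masked))
    return [i in masked for i in range(num_tokens)]


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), per latent dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(sigma^2)) || N(0, I)), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def masked_mse(target, recon, mask):
    """Mean squared Euclidean distance over masked joints only.

    target, recon: lists of joints, each a tuple of coordinates.
    mask[i] is True when joint i was hidden from the encoder.
    """
    errors = [sum((t - r) ** 2 for t, r in zip(tj, rj))
              for tj, rj, m in zip(target, recon, mask) if m]
    return sum(errors) / len(errors) if errors else 0.0


def beta_vae_masked_loss(target, recon, mask, mu, log_var, beta=1.0):
    """Reconstruction error on masked joints plus beta times the KL term."""
    return masked_mse(target, recon, mask) + beta * kl_to_standard_normal(mu, log_var)
```

In this sketch, a larger `beta` pushes the latent distribution toward the prior at the cost of reconstruction fidelity, while a smaller `beta` favors reconstruction; this is the general trade-off in β-weighted VAE objectives and matches the abstract's statement that β regulates reconstruction quality.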

CLC Number: