
Journal of Graphics ›› 2025, Vol. 46 ›› Issue (2): 270-278.DOI: 10.11996/JG.j.2095-302X.2025020270

• Image Processing and Computer Vision •

Human skeleton action recognition method based on variational autoencoder masked reconstruction

WANG Xueting, GUO Xin, WANG Song, CHEN Enqing

  1. College of Electrical and Information Engineering, Zhengzhou University, Zhengzhou Henan 450001, China
  • Received: 2024-08-27 Accepted: 2024-12-16 Online: 2025-04-30 Published: 2025-04-24
  • Contact: GUO Xin
  • About the first author:

    WANG Xueting (2000-), master's student. Her main research interest is intelligent information processing. E-mail: wangxueting7270@163.com

  • Supported by:
    National Natural Science Foundation of China Youth Science Foundation (62301497); National Natural Science Foundation of China Youth Science Foundation (62101503); Joint Fund for Science and Technology Research Projects of Henan Province (235200810050)

Abstract:

Masked autoencoders (MAE) have been applied in many fields because of their strong self-supervised learning ability, especially in tasks where data is occluded or little training data is available. However, in visual classification tasks such as action recognition, classification performance is poor because the encoder in the autoencoder structure has limited feature-learning ability. To train the model with a small amount of labeled data and to enhance the feature extraction ability of autoencoders in human skeleton action recognition, a spatial-temporal masked reconstruction model based on the variational autoencoder (SkeletonMVAE) was proposed. The model introduced the latent space of a variational autoencoder after the encoder of a conventional masked reconstruction model, allowing the encoder to learn the latent structure of the data and richer information about it. The model was pretrained on masked reconstruction of the skeleton data, with the parameter β adjusting the reconstruction quality. When the pretrained encoder was used as a feature extractor for downstream classification tasks, its output feature representations were more compact, discriminative, and robust, which improved classification accuracy and generalization ability, including when only a small amount of labeled data was used for training. Experimental results on the NTU-60 and NTU-120 datasets demonstrated the effectiveness of the proposed method for human skeleton action recognition.

Key words: human skeleton action recognition, self-supervised learning, spatial-temporal masked reconstruction, variational autoencoder, latent space aggregation
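
The abstract does not give the training objective, so the following is only a minimal sketch of how a β-weighted masked reconstruction loss is commonly written for a variational autoencoder. It should not be read as the SkeletonMVAE implementation. The function names, the flat-list data layout, the choice to score reconstruction on masked positions only, and the use of β as a weight on the KL term are all assumptions made for illustration.

```python
import math
import random


def masked_beta_vae_loss(x, x_hat, mask, mu, logvar, beta=1.0):
    """Hypothetical beta-weighted masked reconstruction loss.

    x, x_hat : flat lists of joint coordinates (target and reconstruction)
    mask     : list of 0/1 flags, 1 where the input coordinate was masked out
    mu, logvar : mean and log-variance of the approximate posterior
    """
    # Assumption: as in MAE-style pretraining, only masked positions are scored.
    n_masked = sum(mask)
    recon = sum(
        flag * (target - pred) ** 2 for target, pred, flag in zip(x, x_hat, mask)
    ) / max(n_masked, 1)
    # Closed-form KL divergence between N(mu, sigma^2) and N(0, 1),
    # summed over latent dimensions.
    kl = -0.5 * sum(
        1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar)
    )
    # Assumption: beta trades reconstruction quality against latent regularity.
    return recon + beta * kl


def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1) (reparameterization trick)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


if __name__ == "__main__":
    rng = random.Random(0)
    z = reparameterize([0.0, 0.0], [0.0, 0.0], rng)
    loss = masked_beta_vae_loss(
        x=[0.0, 0.0], x_hat=[1.0, 3.0], mask=[1, 0], mu=[1.0], logvar=[0.0], beta=2.0
    )
    print(len(z), loss)
```

In this sketch a larger β pulls the latent distribution closer to the standard normal prior at the cost of reconstruction accuracy, which is one plausible reading of the abstract's statement that β adjusts the reconstruction quality.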
