Journal of Graphics ›› 2025, Vol. 46 ›› Issue (2): 270-278.DOI: 10.11996/JG.j.2095-302X.2025020270
• Image Processing and Computer Vision •
WANG Xueting, GUO Xin, WANG Song, CHEN Enqing
Received: 2024-08-27
Accepted: 2024-12-16
Online: 2025-04-30
Published: 2025-04-24
Contact: GUO Xin
About author: WANG Xueting (2000-), master student. Her main research interest covers intelligent information processing. E-mail: wangxueting7270@163.com
WANG Xueting, GUO Xin, WANG Song, CHEN Enqing. Human skeleton action recognition method based on variational autoencoder masked reconstruction[J]. Journal of Graphics, 2025, 46(2): 270-278.
URL: http://www.txxb.com.cn/EN/10.11996/JG.j.2095-302X.2025020270
Fig. 1 SkeletonMVAE structure ((a) The pre-training process of SkeletonMVAE; (b) The fine-tuning process of SkeletonMVAE; (c) Latent space sampling; (d) The structure of the SkeletonMVAE encoder; (e) The structure of the SkeletonMVAE decoder; (f) The structure of the STTA block in STTFormer)
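Panel (c) shows latent space sampling: instead of a deterministic code, the encoder predicts a mean and a log-variance, and the latent vector passed to the decoder is drawn with the reparameterization trick. A minimal PyTorch sketch of that step, assuming the generic VAE formulation of [13] rather than the authors' released code (all names below are illustrative):

```python
import torch

def sample_latent(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I).

    Writing the sample this way keeps it differentiable with respect to
    the encoder outputs mu and log_var, so the model trains end to end.
    """
    std = torch.exp(0.5 * log_var)  # sigma recovered from the log-variance
    eps = torch.randn_like(std)     # standard-normal noise
    return mu + eps * std
```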
| Method | Backbone | NTU-60 X-sub | NTU-60 X-view | NTU-120 X-sub | NTU-120 X-set |
|---|---|---|---|---|---|
| SkeletonCLR[23] | ST-GCN | 82.2 | 88.9 | 73.6 | 75.3 |
| CPM[21] | ST-GCN | 84.8 | 91.1 | 78.4 | 78.9 |
| CrosSCLR[23] | ST-GCN | 86.2 | 92.5 | 80.5 | 80.4 |
| AimCLR[24] | ST-GCN | 86.9 | 92.8 | 80.1 | 80.9 |
| AimCLR[24] | STTFormer | 83.9 | 90.4 | 74.6 | 77.2 |
| CrosSCLR[23] | STTFormer | 84.6 | 90.5 | 75.0 | 77.9 |
| Hi-TRS[25] | Transformer | 86.0 | 93.0 | 80.6 | 81.6 |
| SkeletonMAE[12] | STTFormer | 86.6 | 92.9 | 76.8 | 79.1 |
| Ours | STTFormer | 88.4 | 93.1 | 80.6 | 83.5 |
Table 1 Fine-tuned results on NTU RGB+D 60 and NTU RGB+D 120 datasets/%
| Method | X-sub (5% labels) | X-sub (10% labels) | X-view (5% labels) | X-view (10% labels) |
|---|---|---|---|---|
| Hi-TRS[25] | 63.3 | 70.7 | 68.3 | 74.8 |
| CrosSCLR[23] | 63.5 | 71.0 | 66.9 | 75.1 |
| AimCLR[24] | 63.9 | 70.2 | 67.5 | 76.2 |
| CPM[21] | - | 73.0 | - | 77.1 |
| SkeletonMAE[12] | 64.4 | 73.0 | 68.8 | 76.9 |
| Ours | 65.1 | 73.7 | 69.3 | 77.5 |
Table 2 Fine-tuned results comparison on the NTU RGB+D 60 dataset with fewer labeled data/%
| Method | X-sub (5% labels) | X-sub (10% labels) | X-set (5% labels) | X-set (10% labels) |
|---|---|---|---|---|
| CrosSCLR[23] | 50.2 | 58.5 | 50.4 | 60.6 |
| AimCLR[24] | 49.0 | 58.6 | 51.8 | 60.5 |
| SkeletonMAE[12] | 50.4 | 61.8 | 52.0 | 62.5 |
| Ours | 53.9 | 62.7 | 53.0 | 64.6 |
Table 3 Fine-tuned results comparison on the NTU RGB+D 120 dataset with fewer labeled data/%
| Frame masking ratio | Joint masking ratio | NTU-60 X-sub |
|---|---|---|
| 0.4 | 0.4 | 88.4 |
| 0.4 | 0.6 | 87.0 |
| 0.4 | 0.8 | 87.8 |
| 0.5 | 0.4 | 87.5 |
| 0.5 | 0.6 | 87.5 |
| 0.5 | 0.8 | 87.4 |
| 0.6 | 0.4 | 87.4 |
| 0.6 | 0.6 | 87.5 |
| 0.6 | 0.8 | 87.4 |
Table 4 Ablation study on frame and joint masking ratio/%
Fig. 4 Masking strategy ((a) Random masking on frames; (b) Random masking on joints; (c) Fixed masking of both temporal and spatial dimensions; (d) Random masking of both temporal and spatial dimensions)
| Masking strategy | NTU-60 X-sub |
|---|---|
| Random masking on the temporal dimension only | 87.5 |
| Random masking on the spatial dimension only | 86.7 |
| Fixed masking on both temporal and spatial dimensions | 87.3 |
| Random masking on both temporal and spatial dimensions | 88.4 |
Table 5 Ablation study on masking strategy/%
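Tables 4 and 5 point to the same setting: random masking on both dimensions, with a frame masking ratio of 0.4 and a joint masking ratio of 0.4, gives the best X-sub accuracy (88.4%). Below is a sketch of such random spatio-temporal masking on a skeleton clip of shape (T frames, V joints, C channels); the zeroing of masked entries and every name are assumptions for illustration, not the paper's implementation:

```python
import torch

def random_spatiotemporal_mask(x: torch.Tensor,
                               frame_ratio: float = 0.4,
                               joint_ratio: float = 0.4) -> torch.Tensor:
    """Randomly mask whole frames, then random joints in every frame.

    x: skeleton clip of shape (T, V, C). Masked entries are zeroed here;
    replacing them with a learned mask token is an equally common choice.
    """
    T, V, C = x.shape
    x = x.clone()
    # Temporal dimension: drop a random subset of frames entirely.
    frame_idx = torch.randperm(T)[: int(T * frame_ratio)]
    x[frame_idx] = 0.0
    # Spatial dimension: drop a random subset of joints, independently per frame.
    joint_idx = torch.rand(T, V).topk(int(V * joint_ratio), dim=1).indices
    x.scatter_(1, joint_idx.unsqueeze(-1).expand(-1, -1, C), 0.0)
    return x
```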
| β value | NTU-60 X-sub |
|---|---|
| 5 | 87.4 |
| 1 | 88.0 |
| 0.5 | 87.9 |
| 0.005 | 88.4 |
| 0.0005 | 87.7 |
Table 6 Ablation study on β/%
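β plays the role it has in β-VAE [17]: it weights the KL term that pulls the approximate posterior toward the standard-normal prior, so the pre-training objective is reconstruction error plus β times the KL divergence. A hedged sketch of that loss (the MSE reconstruction term and the mean reduction are assumptions):

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(recon: torch.Tensor, target: torch.Tensor,
                  mu: torch.Tensor, log_var: torch.Tensor,
                  beta: float = 0.005) -> torch.Tensor:
    """Reconstruction error plus beta-weighted KL(q(z|x) || N(0, I))."""
    recon_loss = F.mse_loss(recon, target)
    # Closed-form KL divergence between N(mu, sigma^2) and N(0, I).
    kl = -0.5 * torch.mean(1 + log_var - mu.pow(2) - log_var.exp())
    return recon_loss + beta * kl
```

At β = 0.005, the table's best setting, the KL term acts as a light regularizer rather than dominating the reconstruction objective.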
| Latent variable dimension | NTU-60 X-sub |
|---|---|
| 15 | 87.7 |
| 25 | 88.4 |
| 35 | 87.5 |
| 45 | 87.6 |
| 55 | 86.9 |
| 65 | 87.6 |
Table 7 Ablation study on latent variable dimension/%
| Decoder embedding dimension | NTU-60 X-sub |
|---|---|
| 128 | 86.1 |
| 256 | 88.4 |
| 512 | 86.6 |
Table 8 Ablation study on decoder embedding dimension/%
| Decoder depth | NTU-60 X-sub |
|---|---|
| 5 | 87.9 |
| 7 | 87.5 |
| 9 | 88.4 |
| 11 | 87.9 |
Table 9 Ablation study on decoder depth/%
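As in other MAE-style pipelines [11-12], Fig. 1(b) suggests the decoder matters only during pre-training; for recognition, the pre-trained encoder is retained and topped with a classifier. A schematic sketch of that fine-tuning wiring (the module names, feature shape, and 60-class head are assumptions):

```python
import torch.nn as nn

class FineTuneClassifier(nn.Module):
    """Pre-trained encoder plus a linear head, in the spirit of Fig. 1(b)."""

    def __init__(self, encoder: nn.Module, feat_dim: int, num_classes: int = 60):
        super().__init__()
        self.encoder = encoder  # weights loaded from pre-training
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, x):
        feats = self.encoder(x)  # assumed to return (batch, feat_dim)
        return self.head(feats)
```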
[1] | FEICHTENHOFER C, FAN H Q, MALIK J, et al. SlowFast networks for video recognition[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 6201-6210. |
[2] | BI C Y, LIU Y. A survey of video human action recognition based on deep learning[J]. Journal of Graphics, 2023, 44(4): 625-639 (in Chinese). |
[3] | XU J W, YU Z B, NI B B, et al. Deep kinematics analysis for monocular 3D human pose estimation[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 896-905. |
[4] | DUAN H D, ZHAO Y, CHEN K, et al. Revisiting skeleton-based action recognition[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 2959-2968. |
[5] | LI S Y, WANG X T, CHEN X L, et al. Human action recognition based on skeleton dynamic temporal filter[J]. Journal of Graphics, 2024, 45(4): 760-769 (in Chinese). |
[6] | LU P, JIANG T, LI Y N, et al. RTMO: towards high-performance one-stage real-time multi-person pose estimation[C]// 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2024: 1491-1500. |
[7] | LI S, LI W Q, COOK C, et al. Independently recurrent neural network: building a longer and deeper RNN[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 5457-5466. |
[8] | BANERJEE A, SINGH P K, SARKAR R. Fuzzy integral-based CNN classifier fusion for 3D skeleton action recognition[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2021, 31(6): 2206-2216. |
[9] | LIU Z Y, ZHANG H W, CHEN Z H, et al. Disentangling and unifying graph convolutions for skeleton-based action recognition[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 140-149. |
[10] | ZHANG Y H, WU B, LI W, et al. STST: spatial-temporal specialized transformer for skeleton-based action recognition[C]// The 29th ACM International Conference on Multimedia. New York: ACM, 2021: 3229-3237. |
[11] | HE K M, CHEN X L, XIE S N, et al. Masked autoencoders are scalable vision learners[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 15979-15988. |
[12] | WU W H, HUA Y L, ZHENG C, et al. SkeletonMAE: spatial-temporal masked autoencoders for self-supervised skeleton action recognition[C]// 2023 IEEE International Conference on Multimedia and Expo Workshops. New York: IEEE Press, 2023: 224-229. |
[13] | KINGMA D P, WELLING M. Auto-encoding variational Bayes[EB/OL]. (2022-12-10)[2024-06-27]. https://arxiv.org/abs/1312.6114. |
[14] | DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[EB/OL]. (2019-05-24)[2024-06-27]. https://arxiv.org/abs/1810.04805. |
[15] | TONG Z, SONG Y B, WANG J, et al. VideoMAE: masked autoencoders are data-efficient learners for self-supervised video pre-training[C]// The 36th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2022: 732. |
[16] | QING Z W, ZHANG S W, HUANG Z Y, et al. MAR: masked autoencoders for efficient action recognition[J]. IEEE Transactions on Multimedia, 2024, 26: 218-233. |
[17] | HIGGINS I, MATTHEY L, PAL A, et al. β-VAE: learning basic visual concepts with a constrained variational framework[EB/OL]. [2024-06-27]. https://openreview.net/pdf?id=Sy2fzU9gl. |
[18] | SHAHROUDY A, LIU J, NG T T, et al. NTU RGB+D: a large scale dataset for 3D human activity analysis[C]// IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 1010-1019. |
[19] | LIU J, SHAHROUDY A, PEREZ M, et al. NTU RGB+D 120: a large-scale benchmark for 3D human activity understanding[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 42(10): 2684-2701. |
[20] | PASZKE A, GROSS S, MASSA F, et al. PyTorch: an imperative style, high-performance deep learning library[C]// The 33rd International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2019: 721. |
[21] | ZHANG H Y, HOU Y H, ZHANG W J, et al. Contrastive positive mining for unsupervised 3D action representation learning[C]// The 17th European Conference on Computer Vision. Cham: Springer, 2022: 36-51. |
[22] | HUA Y L, WU W H, ZHENG C, et al. Part aware contrastive learning for self-supervised action recognition[EB/OL]. (2023-05-11)[2024-06-27]. https://arxiv.org/abs/2305.00666. |
[23] | LI L G, WANG M S, NI B B, et al. 3D human action representation learning via cross-view consistency pursuit[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 4739-4748. |
[24] | GUO T Y, LIU H, CHEN Z, et al. Contrastive learning from extremely augmented skeleton sequences for self-supervised action recognition[C]// The 36th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2022: 762-770. |
[25] | CHEN Y X, ZHAO L, YUAN J B, et al. Hierarchically self-supervised transformer for human skeleton representation learning[C]// The 17th European Conference on Computer Vision. Cham: Springer, 2022: 185-202. |