图学学报 ›› 2024, Vol. 45 ›› Issue (4): 791-803.DOI: 10.11996/JG.j.2095-302X.2024040791
LIANG Chengwu1,2, YANG Jie1,2, HU Wei1,2, JIANG Songqi1,2, QIAN Qiyang2, HOU Ning2

Received: 2023-12-25
Accepted: 2024-04-07
Published: 2024-08-31
Online: 2024-09-03

Contact: HOU Ning (1982-), male, associate professor, Ph.D. His main research interests cover computer vision and pattern recognition. E-mail: 30090807@huuc.edu.cn
First author: LIANG Chengwu (1982-), male, professor, Ph.D. His main research interests cover artificial intelligence and multimedia analysis. E-mail: liangchengwu0615@126.com
Abstract:
Skeleton-based action recognition is a research hotspot in computer vision and machine learning. Existing data-driven neural networks often neglect temporal dynamic frame selection over skeleton sequences and the human-understandable decision logic inside the model, resulting in insufficient interpretability. To address this, an interpretable skeleton-based action recognition method based on temporal dynamic frame selection and spatio-temporal graph convolution is proposed to improve both interpretability and recognition performance. First, a skeleton-frame confidence evaluation function is used to discard low-quality skeleton frames, addressing noise in skeleton sequences. Second, drawing on domain knowledge of human motion, an adaptive temporal dynamic frame selection module is proposed to compute motion-salient regions and capture the dynamic patterns of key skeleton frames. To learn the intrinsic topology of skeleton joints, an improved spatio-temporal graph convolutional network is employed for interpretable skeleton-based action recognition. Experimental evaluations on three large public datasets (NTU RGB+D, NTU RGB+D 120, and FineGym) show that the proposed method outperforms the compared methods in recognition accuracy while offering interpretability.
LIANG Chengwu, YANG Jie, HU Wei, JIANG Songqi, QIAN Qiyang, HOU Ning. Temporal dynamic frame selection and spatio-temporal graph convolution for interpretable skeleton-based action recognition[J]. Journal of Graphics, 2024, 45(4): 791-803.
| Method | θ | Top-1/% |
|---|---|---|
| Baseline | 0.0 | 88.7 |
| Baseline + skeleton quality evaluation module | 0.1 | 89.1 |
| | 0.2 | 89.9 |
| | 0.4 | 88.5 |
| | 0.6 | 88.2 |
| | 0.8 | 86.5 |

Table 1 Comparison of recognition accuracy for different values of θ
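The confidence evaluation function itself is not reproduced on this page; the following is a minimal sketch of the thresholding idea that Table 1 varies, assuming per-joint detection confidences in [0, 1] and a mean-confidence score per frame (function and variable names are ours, not the paper's):

```python
import numpy as np

def filter_low_quality_frames(seq, conf, theta=0.2):
    """Drop skeleton frames whose mean joint confidence falls below theta.

    seq:  (T, J, C) array of joint coordinates
    conf: (T, J) array of per-joint detection confidences in [0, 1]
    """
    keep = conf.mean(axis=1) >= theta   # one boolean per frame
    return seq[keep]

# Toy sequence: 4 frames, 2 joints, 2-D coordinates.
seq = np.arange(16, dtype=float).reshape(4, 2, 2)
conf = np.array([[0.90, 0.80],   # mean 0.85  -> keep
                 [0.10, 0.20],   # mean 0.15  -> drop
                 [0.50, 0.70],   # mean 0.60  -> keep
                 [0.05, 0.10]])  # mean 0.075 -> drop
print(filter_low_quality_frames(seq, conf).shape)  # (2, 2, 2)
```

With θ = 0.2 as above, two noisy frames are removed before any further processing; Table 1 suggests that too aggressive a threshold (θ ≥ 0.4) starts discarding informative frames and hurts accuracy.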
| Method | μ | Top-1/% |
|---|---|---|
| Baseline | - | 88.7 |
| Baseline + adaptive temporal dynamic frame selection module | 0.2 | 90.7 |
| | 0.5 | 91.5 |
| | 1.0 | 89.7 |

Table 2 Comparison of recognition accuracy for different values of μ
| Method | Skeleton quality evaluation module | Adaptive temporal dynamic frame selection module | Top-1/% |
|---|---|---|---|
| Baseline | | | 88.7 |
| | √ | | 89.9 (+1.2) |
| | | √ | 89.3 (+0.6) |
| | √ | √ | 91.5 (+2.8) |

Table 3 Ablation experiments of temporal dynamic frame selection
| Method | Frame selection strategy | Top-1/% |
|---|---|---|
| Baseline | Equal-interval selection | 88.2 |
| | Uniform selection | 88.7 |
| Baseline + temporal dynamic frame selection | Max-value selection | 88.9 |
| | Cumulative selection | 91.5 |

Table 4 Comparison of recognition accuracy of different frame selection strategies
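The winning cumulative strategy in Table 4 is only named here; a sketch of one plausible reading, sampling frames at uniform quantiles of accumulated motion rather than of time, in the spirit of MGSampler [18] (all names and the fallback behavior are our assumptions):

```python
import numpy as np

def cumulative_frame_selection(seq, n_frames=4):
    """Pick frames at equal steps of accumulated motion, not of time.

    seq: (T, J, C) skeleton sequence. Motion-salient stretches are
    sampled densely; near-static stretches are sampled sparsely.
    """
    motion = np.abs(np.diff(seq, axis=0)).sum(axis=(1, 2))  # (T-1,) motion per step
    motion = np.concatenate([[motion[0]], motion])          # pad to length T
    total = motion.sum()
    if total == 0:  # fully static clip: fall back to uniform sampling
        return np.linspace(0, len(seq) - 1, n_frames).astype(int)
    cdf = np.cumsum(motion) / total                  # cumulative motion in (0, 1]
    targets = (np.arange(n_frames) + 0.5) / n_frames # uniform quantiles
    return np.clip(np.searchsorted(cdf, targets), 0, len(seq) - 1)

# A clip that is static for 4 frames, then moves: all samples land
# in the moving half, unlike equal-interval or uniform selection.
seq = np.array([0, 0, 0, 0, 1, 2, 3, 4], dtype=float).reshape(8, 1, 1)
print(cumulative_frame_selection(seq))  # [4 5 6 7]
```

This illustrates why the cumulative strategy outperforms equal-interval selection in Table 4: the frame budget is spent where the skeleton actually moves.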
Fig. 15 Visualization of key skeleton-joint features learned by the model ((a) Hand waving; (b) Jump up; (c) Salute; (d) Hugging; (e) Shaking hands; (f) Walking towards each other)
| Type | Method | NTU RGB+D Xsub/% | NTU RGB+D Xview/% | NTU RGB+D 120 Xsub/% | NTU RGB+D 120 Xset/% |
|---|---|---|---|---|---|
| CNN | TSRJI [7] | 73.3 | 80.0 | 65.5 | 59.7 |
| | SkeleMotion [29] | 76.5 | 84.7 | 67.7 | 66.9 |
| | RotClips+MTCNN [30] | 81.1 | 87.4 | 62.2 | 61.8 |
| | 3SCNN [8] | 88.6 | 93.7 | - | - |
| RNN | STA-LSTM [9] | 73.4 | 81.4 | - | - |
| | GCA-LSTM [31] | 74.4 | 82.8 | 58.3 | 59.2 |
| | VA-LSTM [10] | 79.4 | 87.6 | - | - |
| | SR-TSL [32] | 84.8 | 92.4 | - | - |
| | AGC-LSTM [11] | 89.2 | 95.0 | - | - |
| GCN | ST-GCN [3] | 81.5 | 88.3 | 70.7 | 73.2 |
| | AS-GCN [33] | 86.8 | 94.2 | 78.3 | 79.8 |
| | RA-GCN [34] | 87.3 | 93.6 | 81.1 | 82.7 |
| | 2s-AGCN [12] | 88.5 | 95.1 | 79.2 | 81.5 |
| | GCN-HCRF [35] | 90.0 | 95.5 | - | - |
| | FGCN [36] | 90.2 | 96.3 | 85.4 | 87.4 |
| | AdaSGN [17] | 90.5 | 95.3 | 85.9 | 86.8 |
| | Shift-GCN [37] | 90.7 | 96.5 | 85.9 | 87.6 |
| | Ours | 93.4 | 98.2 | 87.0 | 90.0 |

Table 5 Comparison of recognition accuracy of different methods on the NTU RGB+D and NTU RGB+D 120 datasets
| Method | Data modality | Mean Top-1/% |
|---|---|---|
| ST-GCN [3] | Skeleton | 25.2 |
| ActionVLAD [38] | RGB | 50.1 |
| I3D [4] | RGB | 63.2 |
| TSN [5] | RGB, optical flow | 76.4 |
| TRN [39] | RGB, optical flow | 79.8 |
| TRNms [39] | RGB, optical flow | 80.2 |
| TSM [40] | RGB, optical flow | 81.2 |
| RSANet [41] | RGB | 86.4 |
| RGBSformer [42] | RGB, skeleton | 86.7 |
| Ours | Skeleton | 90.3 |

Table 6 Comparison of recognition accuracy of different methods on the FineGym dataset
[1] | SUN Z H, KE Q H, RAHMANI H, et al. Human action recognition from various data modalities: a review[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(3): 3200-3225. |
[2] | SHI H Y, HOU Z J, CHAO X, et al. Multimodal spatial-temporal feature representation and its application in action recognition[J]. Journal of Image and Graphics, 2023, 28(4): 1041-1055 (in Chinese). |
[3] | YAN S J, XIONG Y J, LIN D H. Spatial temporal graph convolutional networks for skeleton-based action recognition[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2018, 32(1): 7444-7452. |
[4] | CARREIRA J, ZISSERMAN A. Quo vadis, action recognition? a new model and the kinetics dataset[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 4724-4733. |
[5] | WANG L M, XIONG Y J, WANG Z, et al. Temporal segment networks: towards good practices for deep action recognition[C]// European Conference on Computer Vision. Cham: Springer, 2016: 20-36. |
[6] | ZIAEEFARD M, EBRAHIMNEZHAD H. Hierarchical human action recognition by normalized-polar histogram[C]// 2010 20th International Conference on Pattern Recognition. New York: IEEE Press, 2010: 3720-3723. |
[7] | CAETANO C, BRÉMOND F, SCHWARTZ W R. Skeleton image representation for 3D action recognition based on tree structure and reference joints[C]// 2019 32nd SIBGRAPI Conference on Graphics, Patterns and Images. New York: IEEE Press, 2019: 16-23. |
[8] | LIANG D H, FAN G L, LIN G F, et al. Three-stream convolutional neural network with multi-task and ensemble learning for 3D action recognition[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. New York: IEEE Press, 2019: 934-940. |
[9] | SONG S J, LAN C L, XING J L, et al. An end-to-end spatio-temporal attention model for human action recognition from skeleton data[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2017, 31(1): 4263-4270. |
[10] | ZHANG P F, LAN C L, XING J L, et al. View adaptive recurrent neural networks for high performance human action recognition from skeleton data[C]// 2017 IEEE International Conference on Computer Vision. New York: IEEE Press, 2017: 2136-2145. |
[11] | SI C Y, CHEN W T, WANG W, et al. An attention enhanced graph convolutional LSTM network for skeleton-based action recognition[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 1227-1236. |
[12] | SHI L, ZHANG Y F, CHENG J, et al. Two-stream adaptive graph convolutional networks for skeleton-based action recognition[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 12018-12027. |
[13] | CHEN Y X, ZHANG Z Q, YUAN C F, et al. Channel-wise topology refinement graph convolution for skeleton-based action recognition[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2021: 13339-13348. |
[14] | PLIZZARI C, CANNICI M, MATTEUCCI M. Skeleton-based action recognition via spatial and temporal transformer networks[J]. Computer Vision and Image Understanding, 2021, 208-209: 103219. |
[15] | ZHAO H, XUAN S B. Optimization and behavior identification of keyframes in human action video[J]. Journal of Graphics, 2018, 39(3): 463-469 (in Chinese). |
[16] | TANG Y S, TIAN Y, LU J W, et al. Deep progressive reinforcement learning for skeleton-based action recognition[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 5323-5332. |
[17] | SHI L, ZHANG Y F, CHENG J, et al. AdaSGN: adapting joint number and model size for efficient skeleton-based action recognition[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2021: 13393-13402. |
[18] | ZHI Y, TONG Z, WANG L M, et al. MGSampler: an explainable sampling strategy for video action recognition[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2021: 1493-1502. |
[19] | FEICHTENHOFER C, FAN H Q, MALIK J, et al. SlowFast networks for video recognition[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 6201-6210. |
[20] | WANG C F, CHEN H, ZHANG R X, et al. Research on DTW action recognition algorithm with joint weighting[J]. Journal of Graphics, 2016, 37(4): 537-544 (in Chinese). |
[21] | SHAHROUDY A, LIU J, NG T T, et al. NTU RGB+D: a large scale dataset for 3D human activity analysis[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 1010-1019. |
[22] | LIU J, SHAHROUDY A, PEREZ M, et al. NTU RGB+D 120: a large-scale benchmark for 3D human activity understanding[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 42(10): 2684-2701. |
[23] | SHAO D, ZHAO Y, DAI B, et al. FineGym: a hierarchical video dataset for fine-grained action understanding[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 2613-2622. |
[24] | ZHANG B, ZHU J, SU H. Toward the third generation of artificial intelligence[J]. Scientia Sinica: Informationis, 2020, 50(9): 1281-1302 (in Chinese). |
[25] | KWON H, KIM M, KWAK S, et al. Learning self-similarity in space and time as generalized motion for video action recognition[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2021: 13045-13055. |
[26] | DUAN H D, ZHAO Y, CHEN K, et al. Revisiting skeleton-based action recognition[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 2959-2968. |
[27] | DUAN H D, WANG J Q, CHEN K, et al. PYSKL: towards good practices for skeleton action recognition[C]// The 30th ACM International Conference on Multimedia. New York: ACM, 2022: 7351-7354. |
[28] | WANG J D, SUN K, CHENG T H, et al. Deep high-resolution representation learning for visual recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, 43(10): 3349-3364. |
[29] | CAETANO C, SENA J, BRÉMOND F, et al. SkeleMotion: a new representation of skeleton joint sequences based on motion information for 3D action recognition[C]// 2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance. New York: IEEE Press, 2019: 1-8. |
[30] | KE Q H, BENNAMOUN M, AN S J, et al. Learning clip representations for skeleton-based 3D action recognition[J]. IEEE Transactions on Image Processing, 2018, 27(6): 2842-2855. |
[31] | LIU J, WANG G, HU P, et al. Global context-aware attention LSTM networks for 3D action recognition[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 3671-3680. |
[32] | SI C Y, JING Y, WANG W, et al. Skeleton-based action recognition with spatial reasoning and temporal stack learning[C]// European Conference on Computer Vision. Cham: Springer, 2018: 106-121. |
[33] | LI M S, CHEN S H, CHEN X, et al. Actional-structural graph convolutional networks for skeleton-based action recognition[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 3590-3598. |
[34] | SONG Y F, ZHANG Z, SHAN C F, et al. Richly activated graph convolutional network for robust skeleton-based action recognition[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2021, 31(5): 1915-1925. |
[35] | LIU K, GAO L, KHAN N M, et al. A multi-stream graph convolutional networks-hidden conditional random field model for skeleton-based action recognition[J]. IEEE Transactions on Multimedia, 2020, 23: 64-76. |
[36] | YANG H, YAN D, ZHANG L, et al. Feedback graph convolutional network for skeleton-based action recognition[J]. IEEE Transactions on Image Processing, 2021, 31: 164-175. |
[37] | CHENG K, ZHANG Y F, HE X Y, et al. Skeleton-based action recognition with shift graph convolutional network[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 180-189. |
[38] | GIRDHAR R, RAMANAN D, GUPTA A, et al. ActionVLAD: learning spatio-temporal aggregation for action classification[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 3165-3174. |
[39] | ZHOU B L, ANDONIAN A, OLIVA A, et al. Temporal relational reasoning in videos[C]// European Conference on Computer Vision. Cham: Springer, 2018: 831-846. |
[40] | LIN J, GAN C, HAN S. TSM: temporal shift module for efficient video understanding[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 7082-7092. |
[41] | KIM M, KWON H, WANG C Y, et al. Relational self-attention: what’s missing in attention for video understanding[EB/OL]. [2023-11-10]. http://arxiv.org/abs/2111.01673. |
[42] | SHI J, ZHANG Y Y, WANG W H, et al. A novel two-stream transformer-based framework for multi-modality human action recognition[J]. Applied Sciences, 2023, 13(4): 2058. |