Temporal dynamic frame selection and spatio-temporal graph convolution for interpretable skeleton-based action recognition

Journal of Graphics, 2024, Vol. 45, Issue (4): 791-803. DOI: 10.11996/JG.j.2095-302X.2024040791

Image Processing and Computer Vision

LIANG Chengwu¹,², YANG Jie¹,², HU Wei¹,², JIANG Songqi¹,², QIAN Qiyang², HOU Ning²

1. College of Electrical Engineering and New Energy, China Three Gorges University, Yichang, Hubei 443002, China
2. School of Electrical and Control Engineering, Henan University of Urban Construction, Pingdingshan, Henan 467036, China

Received: 2023-12-25; Accepted: 2024-04-07; Published: 2024-08-31; Online: 2024-09-03
Corresponding author: HOU Ning (1982-), associate professor, Ph.D. His main research interests include computer vision and pattern recognition. E-mail: 30090807@huuc.edu.cn
First author: LIANG Chengwu (1982-), professor, Ph.D. His main research interests include artificial intelligence and multimedia analysis. E-mail: liangchengwu0615@126.com
Supported by: National Natural Science Foundation of China (62176086); National Natural Science Foundation of China (U1804152); Henan Province Science and Technology Project (242102211055)

Abstract:

Skeleton-based action recognition is an active research topic in computer vision and machine learning. Existing data-driven neural networks often overlook temporal dynamic frame selection for skeleton sequences and lack intrinsic, human-understandable decision logic, which leaves them insufficiently interpretable. To address this, we propose an interpretable skeleton-based action recognition method that combines temporal dynamic frame selection with spatio-temporal graph convolution, improving both interpretability and recognition performance. First, a confidence-based evaluation function scores each skeleton frame and low-quality frames are removed, which mitigates noise in the skeleton sequence. Second, drawing on domain knowledge of human motion, an adaptive temporal dynamic frame selection module computes the motion-salient regions of an action in order to capture the dynamics of the key skeleton frames. To learn the intrinsic topology of the skeleton joints, an improved spatio-temporal graph convolutional network is then used for interpretable skeleton-based action recognition. Experiments on three large public datasets, NTU RGB+D, NTU RGB+D 120, and FineGym, show that the method achieves higher recognition accuracy than the compared methods while remaining interpretable.

Key words: action recognition, skeleton sequence, interpretability, motion salient regions, spatio-temporal graph convolutional network
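
As a rough illustration of the spatio-temporal graph convolution named in the abstract, the sketch below applies one spatial graph-convolution step in the style of ST-GCN [3], using a symmetrically normalized adjacency matrix with self-loops. It is not the improved network of this paper, whose details are not given on this page; the 5-joint chain, the function names, and the random weights are all assumptions made for the example.

```python
import numpy as np

def normalized_adjacency(edges, num_joints):
    """Build D^{-1/2} (A + I) D^{-1/2} for an undirected skeleton graph."""
    A = np.eye(num_joints)                 # self-loops
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0            # undirected bone between joints i and j
    deg = A.sum(axis=1)
    D_inv_sqrt = np.diag(deg ** -0.5)
    return D_inv_sqrt @ A @ D_inv_sqrt

def spatial_graph_conv(x, A_hat, W):
    """x: (T, V, C_in) features, A_hat: (V, V), W: (C_in, C_out)."""
    # Aggregate features from neighbouring joints, then mix channels.
    return np.einsum("uv,tvc,cd->tud", A_hat, x, W)

# Hypothetical 5-joint chain observed over 4 frames with 3 input channels.
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
A_hat = normalized_adjacency(edges, num_joints=5)
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 5, 3))
W = rng.normal(size=(3, 8))
out = spatial_graph_conv(x, A_hat, W)
```

Each frame is processed independently here; in an ST-GCN-style block, a temporal convolution over the frame axis would normally follow this spatial step.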

CLC number: TP391; TP181

LIANG Chengwu, YANG Jie, HU Wei, JIANG Songqi, QIAN Qiyang, HOU Ning. Temporal dynamic frame selection and spatio-temporal graph convolution for interpretable skeleton-based action recognition[J]. Journal of Graphics, 2024, 45(4): 791-803.

Figures and tables (21)

Fig. 1 Overall structure of the proposed method

Fig. 2 Skeleton data visualization

Fig. 3 Adaptive temporal dynamic frame selection module

Fig. 4 Motion distance distributions of three actions ((a) Reading; (b) Punching/slapping another person; (c) Hugging another person)

Fig. 5 Motion cumulative distribution function under different values of μ

Fig. 6 Two different sampling strategies ((a) Cumulative sampling; (b) Slope sampling)

Fig. 7 Spatial and temporal feature extraction ((a) Spatial feature extraction; (b) Temporal feature extraction)

Fig. 8 Improved spatio-temporal graph convolutional network

Fig. 9 Some typical actions in NTU RGB+D

Fig. 10 Some typical actions in NTU RGB+D 120

Fig. 11 Four gymnastic events in FineGym

Fig. 12 Top-1 recognition accuracy and loss curves

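Table 1 Comparison of recognition accuracy for different values of θ

Method	θ	Top-1/%
Baseline	0.0	88.7
Baseline + skeleton quality assessment module	0.1	89.1
Baseline + skeleton quality assessment module	0.2	89.9
Baseline + skeleton quality assessment module	0.4	88.5
Baseline + skeleton quality assessment module	0.6	88.2
Baseline + skeleton quality assessment module	0.8	86.5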
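
The role of the threshold θ in Table 1 can be sketched as follows. The abstract only states that frame quality is estimated from joint confidence; taking the frame score as the mean joint confidence is an assumption of this example, as are the function name, the 17-joint layout, and the synthetic data.

```python
import numpy as np

def filter_low_quality_frames(keypoints, confidences, theta=0.2):
    """Drop frames whose assumed quality score falls below theta.

    keypoints:   (T, V, C) joint coordinates
    confidences: (T, V) per-joint confidence in [0, 1]
    """
    # Assumption: frame quality = mean confidence over the joints of that frame.
    frame_score = confidences.mean(axis=1)
    keep = frame_score >= theta
    return keypoints[keep], keep

# Synthetic sequence: 6 frames, 17 joints, 2D coordinates.
rng = np.random.default_rng(0)
kp = rng.normal(size=(6, 17, 2))
conf = np.full((6, 17), 0.9)
conf[2] = 0.05   # two frames with poorly detected joints
conf[4] = 0.15
filtered, keep = filter_low_quality_frames(kp, conf, theta=0.2)
```

With `theta=0.0` no frame is removed, which matches the baseline row of Table 1. A large θ discards many frames, which is one plausible reading of the accuracy drop at θ = 0.8.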

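Table 2 Comparison of recognition accuracy for different values of μ

Method	μ	Top-1/%
Baseline	-	88.7
Baseline + adaptive temporal dynamic frame selection module	0.2	90.7
Baseline + adaptive temporal dynamic frame selection module	0.5	91.5
Baseline + adaptive temporal dynamic frame selection module	1.0	89.7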
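
This page does not give the formula behind the motion cumulative distribution function of Fig. 5. The sketch below assumes a construction similar to MGSampler [18]: per-frame motion is the summed joint displacement between consecutive frames, it is raised to the power μ, and the result is normalized into a cumulative distribution. The function name, the `eps` term, and the toy sequence are hypothetical.

```python
import numpy as np

def motion_cdf(skeleton, mu=0.5, eps=1e-8):
    """skeleton: (T, V, C). Returns a non-decreasing array of length T-1 ending at 1."""
    disp = np.diff(skeleton, axis=0)                      # (T-1, V, C) joint displacement
    motion = np.linalg.norm(disp, axis=-1).sum(axis=1)    # (T-1,) motion per transition
    weight = np.power(motion + eps, mu)                   # assumed role of mu: smoothing exponent
    return np.cumsum(weight) / weight.sum()

# One joint that stays still for four transitions and then moves at constant speed.
positions = np.array([0, 0, 0, 0, 0, 1, 2, 3, 4], dtype=float)
skeleton = positions.reshape(-1, 1, 1)
cdf_sharp = motion_cdf(skeleton, mu=1.0)   # follows the raw motion
cdf_flat = motion_cdf(skeleton, mu=0.0)    # degenerates to a uniform ramp
```

Under this assumption, μ = 1 concentrates the distribution on high-motion frames, while μ close to 0 approaches uniform sampling; intermediate values such as those in Table 2 trade off the two.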

Table 3 Ablation experiments on temporal dynamic frame selection

Method	Skeleton quality assessment module	Adaptive temporal dynamic frame selection module	Top-1/%
Baseline	×	×	88.7
Baseline	√	×	89.9 (+1.2)
Baseline	×	√	89.3 (+0.6)
Baseline	√	√	91.5 (+2.8)
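
The gains in parentheses in Table 3 are differences from the 88.7% baseline. The short check below recomputes them from the reported accuracies; the dictionary keys are labels chosen for this example.

```python
# Top-1 accuracies reported in Table 3 (percent).
baseline = 88.7
ablation = {
    "quality assessment only": 89.9,
    "adaptive frame selection only": 89.3,
    "both modules": 91.5,
}

# Gain of each configuration over the baseline, rounded to one decimal place.
gains = {name: round(acc - baseline, 1) for name, acc in ablation.items()}

# Sum of the two single-module gains, for comparison with the combined gain.
sum_of_single_gains = round(
    gains["quality assessment only"] + gains["adaptive frame selection only"], 1
)
```

The combined gain (2.8 points) is larger than the sum of the two single-module gains (1.8 points), which suggests that the two modules complement each other rather than overlap.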

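Table 4 Comparison of recognition accuracy for different frame selection strategies

Method	Frame selection strategy	Top-1/%
Baseline	Equal time-interval selection	88.2
Baseline	Uniform selection	88.7
Baseline + temporal dynamic frame selection	Extreme-value selection (Fig. 6(b))	88.9
Baseline + temporal dynamic frame selection	Cumulative selection (Fig. 6(a))	91.5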
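
To make the contrast in Table 4 concrete, the sketch below compares uniform frame selection with a cumulative strategy that picks frames at evenly spaced levels of a motion cumulative distribution. This is an assumed reading of the cumulative strategy in Fig. 6(a), modelled on MGSampler [18], not the authors' exact procedure; the function names, the `mu` exponent, and the toy motion signal are hypothetical.

```python
import numpy as np

def uniform_select(num_frames, n):
    """Pick n frame indices evenly spaced over the sequence."""
    return np.linspace(0, num_frames - 1, n).round().astype(int)

def cumulative_select(motion, n, mu=0.5, eps=1e-8):
    """Pick n frame indices at evenly spaced levels of the motion CDF."""
    weight = np.power(np.asarray(motion, dtype=float) + eps, mu)
    cdf = np.cumsum(weight) / weight.sum()
    targets = (np.arange(n) + 0.5) / n        # evenly spaced levels in (0, 1)
    return np.searchsorted(cdf, targets)      # first frame whose CDF reaches each level

# Toy sequence: 10 static frames followed by 10 frames with constant motion.
motion = [0.0] * 10 + [1.0] * 10
uniform_idx = uniform_select(len(motion), 4)
cumulative_idx = cumulative_select(motion, 4, mu=1.0)
```

On this toy signal, uniform selection spends half of its frames on the static segment, while the cumulative strategy places every selected frame in the moving segment.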

Fig. 13 Comparison of recognition accuracy for the 60 action classes of the NTU RGB+D dataset

Fig. 14 Normalized confusion matrix on the NTU RGB+D dataset

Fig. 15 Feature visualization of the key joints learned by the model ((a) Hand waving; (b) Jump up; (c) Salute; (d) Hugging; (e) Shaking hands; (f) Walking towards)

Table 5 Comparison of recognition accuracy of different methods on the NTU RGB+D and NTU RGB+D 120 datasets

Type	Method	NTU RGB+D X_sub/%	NTU RGB+D X_view/%	NTU RGB+D 120 X_sub/%	NTU RGB+D 120 X_set/%
CNN	TSRJI [7]	73.3	80.0	65.5	59.7
CNN	SkeleMotion [29]	76.5	84.7	67.7	66.9
CNN	RotClips+MTCNN [30]	81.1	87.4	62.2	61.8
CNN	3SCNN [8]	88.6	93.7	-	-
RNN	STA-LSTM [9]	73.4	81.4	-	-
RNN	GCA-LSTM [31]	74.4	82.8	58.3	59.2
RNN	VA-LSTM [10]	79.4	87.6	-	-
RNN	SR-TSL [32]	84.8	92.4	-	-
RNN	AGC-LSTM [11]	89.2	95.0	-	-
GCN	ST-GCN [3]	81.5	88.3	70.7	73.2
GCN	AS-GCN [33]	86.8	94.2	78.3	79.8
GCN	RA-GCN [34]	87.3	93.6	81.1	82.7
GCN	2s-AGCN [12]	88.5	95.1	79.2	81.5
GCN	GCN-HCRF [35]	90.0	95.5	-	-
GCN	FGCN [36]	90.2	96.3	85.4	87.4
GCN	AdaSGN [17]	90.5	95.3	85.9	86.8
GCN	Shift-GCN [37]	90.7	96.5	85.9	87.6
GCN	Proposed method	93.4	98.2	87.0	90.0

Table 6 Comparison of recognition accuracy of different methods on the FineGym dataset

Method	Data modality	Mean Top-1/%
ST-GCN [3]	Skeleton	25.2
ActionVLAD [38]	RGB	50.1
I3D [4]	RGB	63.2
TSN [5]	RGB, optical flow	76.4
TRN [39]	RGB, optical flow	79.8
TRNms [39]	RGB, optical flow	80.2
TSM [40]	RGB, optical flow	81.2
RSANet [41]	RGB	86.4
RGBSformer [42]	RGB, skeleton	86.7
Proposed method	Skeleton	90.3

References (42)

[1]	SUN Z H, KE Q H, RAHMANI H, et al. Human action recognition from various data modalities: a review[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(3): 3200-3225.
[2]	SHI H Y, HOU Z J, CHAO X, et al. Multimodal spatial-temporal feature representation and its application in action recognition[J]. Journal of Image and Graphics, 2023, 28(4): 1041-1055 (in Chinese).
[3]	YAN S J, XIONG Y J, LIN D H. Spatial temporal graph convolutional networks for skeleton-based action recognition[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2018, 32(1): 7444-7452.
[4]	CARREIRA J, ZISSERMAN A. Quo vadis, action recognition? a new model and the kinetics dataset[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 4724-4733.
[5]	WANG L M, XIONG Y J, WANG Z, et al. Temporal segment networks: towards good practices for deep action recognition[C]// European Conference on Computer Vision. Cham: Springer, 2016: 20-36.
[6]	ZIAEEFARD M, EBRAHIMNEZHAD H. Hierarchical human action recognition by normalized-polar histogram[C]// 2010 20th International Conference on Pattern Recognition. New York: IEEE Press, 2010: 3720-3723.
[7]	CAETANO C, BRÉMOND F, SCHWARTZ W R. Skeleton image representation for 3D action recognition based on tree structure and reference joints[C]// 2019 32nd SIBGRAPI Conference on Graphics, Patterns and Images. New York: IEEE Press, 2019: 16-23.
[8]	LIANG D H, FAN G L, LIN G F, et al. Three-stream convolutional neural network with multi-task and ensemble learning for 3D action recognition[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. New York: IEEE Press, 2019: 934-940.
[9]	SONG S J, LAN C L, XING J L, et al. An end-to-end spatio-temporal attention model for human action recognition from skeleton data[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2017, 31(1): 4263-4270.
[10]	ZHANG P F, LAN C L, XING J L, et al. View adaptive recurrent neural networks for high performance human action recognition from skeleton data[C]// 2017 IEEE International Conference on Computer Vision. New York: IEEE Press, 2017: 2136-2145.
[11]	SI C Y, CHEN W T, WANG W, et al. An attention enhanced graph convolutional LSTM network for skeleton-based action recognition[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 1227-1236.
[12]	SHI L, ZHANG Y F, CHENG J, et al. Two-stream adaptive graph convolutional networks for skeleton-based action recognition[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 12018-12027.
[13]	CHEN Y X, ZHANG Z Q, YUAN C F, et al. Channel-wise topology refinement graph convolution for skeleton-based action recognition[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2021: 13339-13348.
[14]	PLIZZARI C, CANNICI M, MATTEUCCI M. Skeleton-based action recognition via spatial and temporal transformer networks[J]. Computer Vision and Image Understanding, 2021, 208-209: 103219.
[15]	ZHAO H, XUAN S B. Optimization and behavior identification of keyframes in human action video[J]. Journal of Graphics, 2018, 39(3): 463-469 (in Chinese).
[16]	TANG Y S, TIAN Y, LU J W, et al. Deep progressive reinforcement learning for skeleton-based action recognition[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 5323-5332.
[17]	SHI L, ZHANG Y F, CHENG J, et al. AdaSGN: adapting joint number and model size for efficient skeleton-based action recognition[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2021: 13393-13402.
[18]	ZHI Y, TONG Z, WANG L M, et al. MGSampler: an explainable sampling strategy for video action recognition[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2021: 1493-1502.
[19]	FEICHTENHOFER C, FAN H Q, MALIK J, et al. SlowFast networks for video recognition[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 6201-6210.
[20]	WANG C F, CHEN H, ZHANG R X, et al. Research on DTW action recognition algorithm with joint weighting[J]. Journal of Graphics, 2016, 37(4): 537-544 (in Chinese).
[21]	SHAHROUDY A, LIU J, NG T T, et al. NTU RGB+D: a large scale dataset for 3D human activity analysis[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 1010-1019.
[22]	LIU J, SHAHROUDY A, PEREZ M, et al. NTU RGB+D 120: a large-scale benchmark for 3D human activity understanding[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 42(10): 2684-2701.
[23]	SHAO D, ZHAO Y, DAI B, et al. FineGym: a hierarchical video dataset for fine-grained action understanding[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 2613-2622.
[24]	ZHANG B, ZHU J, SU H. Toward the third generation of artificial intelligence[J]. Scientia Sinica: Informationis, 2020, 50(9): 1281-1302 (in Chinese).
[25]	KWON H, KIM M, KWAK S, et al. Learning self-similarity in space and time as generalized motion for video action recognition[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2021: 13045-13055.
[26]	DUAN H D, ZHAO Y, CHEN K, et al. Revisiting skeleton-based action recognition[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 2959-2968.
[27]	DUAN H D, WANG J Q, CHEN K, et al. PYSKL: towards good practices for skeleton action recognition[C]// The 30th ACM International Conference on Multimedia. New York: ACM, 2022: 7351-7354.
[28]	WANG J D, SUN K, CHENG T H, et al. Deep high-resolution representation learning for visual recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, 43(10): 3349-3364.
[29]	CAETANO C, SENA J, BRÉMOND F, et al. SkeleMotion: a new representation of skeleton joint sequences based on motion information for 3D action recognition[C]// 2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance. New York: IEEE Press, 2019: 1-8.
[30]	KE Q H, BENNAMOUN M, AN S J, et al. Learning clip representations for skeleton-based 3D action recognition[J]. IEEE Transactions on Image Processing, 2018, 27(6): 2842-2855.
[31]	LIU J, WANG G, HU P, et al. Global context-aware attention LSTM networks for 3D action recognition[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 3671-3680.
[32]	SI C Y, JING Y, WANG W, et al. Skeleton-based action recognition with spatial reasoning and temporal stack learning[C]// European Conference on Computer Vision. Cham: Springer, 2018: 106-121.
[33]	LI M S, CHEN S H, CHEN X, et al. Actional-structural graph convolutional networks for skeleton-based action recognition[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 3590-3598.
[34]	SONG Y F, ZHANG Z, SHAN C F, et al. Richly activated graph convolutional network for robust skeleton-based action recognition[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2021, 31(5): 1915-1925.
[35]	LIU K, GAO L, KHAN N M, et al. A multi-stream graph convolutional networks-hidden conditional random field model for skeleton-based action recognition[J]. IEEE Transactions on Multimedia, 2020, 23: 64-76.
[36]	YANG H, YAN D, ZHANG L, et al. Feedback graph convolutional network for skeleton-based action recognition[J]. IEEE Transactions on Image Processing, 2021, 31: 164-175.
[37]	CHENG K, ZHANG Y F, HE X Y, et al. Skeleton-based action recognition with shift graph convolutional network[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 180-189.
[38]	GIRDHAR R, RAMANAN D, GUPTA A, et al. ActionVLAD: learning spatio-temporal aggregation for action classification[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 3165-3174.
[39]	ZHOU B L, ANDONIAN A, OLIVA A, et al. Temporal relational reasoning in videos[C]// European Conference on Computer Vision. Cham: Springer, 2018: 831-846.
[40]	LIN J, GAN C, HAN S. TSM: temporal shift module for efficient video understanding[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 7082-7092.
[41]	KIM M, KWON H, WANG C Y, et al. Relational self-attention: what’s missing in attention for video understanding[EB/OL]. [2023-11-10]. http://arxiv.org/abs/2111.01673.
[42]	SHI J, ZHANG Y Y, WANG W H, et al. A novel two-stream transformer-based framework for multi-modality human action recognition[J]. Applied Sciences, 2023, 13(4): 2058.
