Temporal dynamic frame selection and spatio-temporal graph convolution for interpretable skeleton-based action recognition

doi:10.11996/JG.j.2095-302X.2024040791

Abstract

Skeleton-based action recognition is a prominent research topic in computer vision and machine learning. Existing data-driven neural networks often neglect temporal dynamic frame selection in skeleton sequences and lack an understandable decision logic, which limits their interpretability. To address this, we propose an interpretable skeleton-based action recognition method based on temporal dynamic frame selection and spatio-temporal graph convolution, aiming to improve both interpretability and recognition performance. First, the quality of each skeleton frame is estimated from joint confidence, and low-quality frames are removed to reduce skeleton noise. Second, drawing on domain knowledge of human activity, an adaptive temporal dynamic frame selection module computes motion salient regions to capture the dynamic patterns of key skeleton frames in human motion. Third, to represent the intrinsic topology of human joints, an improved spatio-temporal graph convolutional network performs the recognition. Experiments on three large public datasets, NTU RGB+D, NTU RGB+D 120, and FineGym, show that the method achieves higher recognition accuracy than the compared methods while remaining interpretable.

Key words: action recognition, skeleton sequence, interpretability, motion salient regions, spatio-temporal graph convolutional network

CLC Number: TP391, TP181

LIANG Chengwu, YANG Jie, HU Wei, JIANG Songqi, QIAN Qiyang, HOU Ning. Temporal dynamic frame selection and spatio-temporal graph convolution for interpretable skeleton-based action recognition[J]. Journal of Graphics, 2024, 45(4): 791-803.

Figures/Tables 21

Fig. 1 Overall structure of proposed method

Fig. 2 Skeleton data visualization

Fig. 3 Adaptive temporal dynamic frame selection module

Fig. 4 The motion distance distribution of three actions ((a) Reading; (b) Punch/slap; (c) Hugging)

Fig. 5 Motion cumulative distribution function under different values of μ

Fig. 6 Two different sampling strategies ((a) Cumulative sampling; (b) Slope sampling)

Fig. 7 Spatial and temporal feature extraction ((a) Spatial feature extraction; (b) Temporal feature extraction)

Fig. 8 Improved spatio-temporal graph convolutional network

Fig. 9 Some typical actions in NTU RGB+D

Fig. 10 Some typical actions in NTU RGB+D 120

Fig. 11 Four gymnastic events in FineGym

Fig. 12 Top1 recognition accuracy and loss curves

Table 1 Comparison of recognition accuracy for different values of θ

Method	θ	Top1/%
Baseline	0.0	88.7
Baseline + skeleton quality assessment module	0.1	89.1
	0.2	89.9
	0.4	88.5
	0.6	88.2
	0.8	86.5
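
The role of θ in Table 1 can be illustrated with a minimal sketch. It assumes that the quality of a frame is the mean confidence of its joints and that frames scoring below θ are discarded; the abstract only states that quality is estimated from joint confidence, so the aggregation rule and the names `frame_quality` and `filter_low_quality_frames` are hypothetical.

```python
from statistics import mean


def frame_quality(joint_confidences):
    """Score one frame as the mean confidence of its joints (assumed aggregate)."""
    return mean(joint_confidences)


def filter_low_quality_frames(sequence_confidences, theta=0.2):
    """Return the indices of frames whose quality score is at least theta."""
    return [
        t
        for t, confidences in enumerate(sequence_confidences)
        if frame_quality(confidences) >= theta
    ]


# Toy sequence: three frames, three joints each (values are illustrative only).
confidences = [
    [0.90, 0.80, 0.95],  # well-detected frame
    [0.10, 0.05, 0.20],  # noisy frame, mean confidence below 0.2
    [0.70, 0.60, 0.90],  # well-detected frame
]
kept = filter_low_quality_frames(confidences, theta=0.2)
```

Under this reading, a larger θ removes more frames, which would be consistent with accuracy falling in Table 1 once θ exceeds 0.2: useful frames start to be discarded along with noisy ones.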

Table 2 Comparison of recognition accuracy for different values of μ

Method	μ	Top1/%
Baseline	-	88.7
Baseline + adaptive temporal dynamic frame selection module	0.2	90.7
	0.5	91.5
	1.0	89.7
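
As a rough illustration of how μ could shape the motion cumulative distribution function of Fig. 5, the sketch below assumes that the motion distance between consecutive frames is the summed Euclidean displacement of the joints, and that μ is an exponent applied to each distance before accumulation (as in motion-guided sampling). Neither definition is given on this page, so both, along with the function names, are assumptions.

```python
import math


def motion_distances(frames):
    """Summed per-joint Euclidean displacement between consecutive frames."""
    return [
        sum(math.dist(p, c) for p, c in zip(prev, curr))
        for prev, curr in zip(frames, frames[1:])
    ]


def motion_cdf(distances, mu=0.5):
    """Normalized cumulative motion after raising each distance to the power mu."""
    weights = [d ** mu for d in distances]
    total = sum(weights)
    if total == 0:
        # No motion at all: fall back to a linear (uniform) curve.
        n = len(weights)
        return [(i + 1) / n for i in range(n)]
    cdf, running = [], 0.0
    for w in weights:
        running += w
        cdf.append(running / total)
    return cdf


# One joint in 2D moving from (0, 0) to (3, 4) in a single step.
frames = [[(0.0, 0.0)], [(0.0, 0.0)], [(3.0, 4.0)], [(3.0, 4.0)]]
distances = motion_distances(frames)

# A smaller mu flattens the curve, so low-motion frames keep some weight.
smoothed = motion_cdf([1.0, 4.0, 1.0], mu=0.5)
```

With μ = 1 the curve follows the raw motion distances, while smaller values of μ move it toward a uniform distribution over time.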

Table 3 Ablation experiments of temporal dynamic frame selection

Method	Skeleton quality assessment module	Adaptive temporal dynamic frame selection module	Top1/%
Baseline			88.7
	√		89.9 (+1.2)
		√	89.3 (+0.6)
	√	√	91.5 (+2.8)

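Table 4 Comparison of recognition accuracy of different sampling strategies

Method	Frame selection strategy	Top1/%
Baseline	Equal time-interval selection	88.2
Baseline	Uniform selection	88.7
Baseline + temporal dynamic frame selection	Extreme-value selection	88.9
Baseline + temporal dynamic frame selection	Cumulative selection	91.5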
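
The difference between uniform and cumulative selection in Table 4 can be sketched as follows. The sketch assumes that cumulative selection picks frames at equal steps along a normalized cumulative motion curve with one entry per frame; the exact rule used in the paper is not stated on this page, and the function names are hypothetical.

```python
from bisect import bisect_left


def uniform_selection(num_frames, k):
    """Pick k frame indices at roughly equal spacing in time."""
    return [int((i + 0.5) * num_frames / k) for i in range(k)]


def cumulative_selection(cdf, k):
    """Pick k frame indices at equal steps of the cumulative motion curve.

    For each target level, the first frame whose cumulative value reaches
    that level is chosen, so frames cluster where the curve rises quickly.
    """
    targets = [(i + 0.5) / k for i in range(k)]
    return [min(bisect_left(cdf, t), len(cdf) - 1) for t in targets]


# Toy curve for 8 frames in which all motion happens in frames 3 to 5.
cdf = [0.0, 0.0, 0.0, 0.25, 0.75, 1.0, 1.0, 1.0]
uniform = uniform_selection(len(cdf), 4)
cumulative = cumulative_selection(cdf, 4)
```

In this toy case uniform selection spreads its picks over the whole sequence, whereas cumulative selection concentrates them on the frames where motion occurs. Indices may repeat when motion is highly concentrated; a practical implementation would need a policy for such duplicates.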

Fig. 13 Comparison of recognition accuracy of 60 action classes on NTU RGB+D dataset

Fig. 14 The normalized confusion matrix on NTU RGB+D dataset

Fig. 15 Visualization of key-joint features learned by the model ((a) Hand waving; (b) Jump up; (c) Salute; (d) Hugging; (e) Shaking hands; (f) Walking towards)

Table 5 Comparison of recognition accuracy of different methods on NTU RGB+D and NTU RGB+D 120 datasets

Type	Method	NTU RGB+D X_sub/%	NTU RGB+D X_view/%	NTU RGB+D 120 X_sub/%	NTU RGB+D 120 X_set/%
CNN	TSRJI^[7]	73.3	80.0	65.5	59.7
	SkeleMotion^[29]	76.5	84.7	67.7	66.9
	RotClips+MTCNN^[30]	81.1	87.4	62.2	61.8
	3SCNN^[8]	88.6	93.7	-	-
RNN	STA-LSTM^[9]	73.4	81.4	-	-
	GCA-LSTM^[31]	74.4	82.8	58.3	59.2
	VA-LSTM^[10]	79.4	87.6	-	-
	SR-TSL^[32]	84.8	92.4	-	-
	AGC-LSTM^[11]	89.2	95.0	-	-
GCN	ST-GCN^[3]	81.5	88.3	70.7	73.2
	AS-GCN^[33]	86.8	94.2	78.3	79.8
	RA-GCN^[34]	87.3	93.6	81.1	82.7
	2s-AGCN^[12]	88.5	95.1	79.2	81.5
	GCN-HCRF^[35]	90.0	95.5	-	-
	FGCN^[36]	90.2	96.3	85.4	87.4
	AdaSGN^[17]	90.5	95.3	85.9	86.8
	Shift-GCN^[37]	90.7	96.5	85.9	87.6
	Proposed method	93.4	98.2	87.0	90.0

Table 6 Comparison of recognition accuracy of different methods on FineGym dataset

Method	Data modality	Mean Top1/%
ST-GCN^[3]	Skeleton	25.2
ActionVLAD^[38]	RGB	50.1
I3D^[4]	RGB	63.2
TSN^[5]	RGB, optical flow	76.4
TRN^[39]	RGB, optical flow	79.8
TRNms^[39]	RGB, optical flow	80.2
TSM^[40]	RGB, optical flow	81.2
RSANet^[41]	RGB	86.4
RGBSformer^[42]	RGB, skeleton	86.7
Proposed method	Skeleton	90.3

References 42

[1]	SUN Z H, KE Q H, RAHMANI H, et al. Human action recognition from various data modalities: a review[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(3): 3200-3225.
[2]	施海勇, 侯振杰, 巢新, 等. 多模态时空特征表示及其在行为识别中的应用[J]. 中国图象图形学报, 2023, 28(4): 1041-1055.
	SHI H Y, HOU Z J, CHAO X, et al. Multimodal spatial-temporal feature representation and its application in action recognition[J]. Journal of Image and Graphics, 2023, 28(4): 1041-1055 (in Chinese).
[3]	YAN S J, XIONG Y J, LIN D H. Spatial temporal graph convolutional networks for skeleton-based action recognition[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2018, 32(1): 7444-7452.
[4]	CARREIRA J, ZISSERMAN A. Quo vadis, action recognition? a new model and the kinetics dataset[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 4724-4733.
[5]	WANG L M, XIONG Y J, WANG Z, et al. Temporal segment networks: towards good practices for deep action recognition[C]// European Conference on Computer Vision. Cham: Springer, 2016: 20-36.
[6]	ZIAEEFARD M, EBRAHIMNEZHAD H. Hierarchical human action recognition by normalized-polar histogram[C]// 2010 20th International Conference on Pattern Recognition. New York: IEEE Press, 2010: 3720-3723.
[7]	CAETANO C, BRÉMOND F, SCHWARTZ W R. Skeleton image representation for 3D action recognition based on tree structure and reference joints[C]// 2019 32nd SIBGRAPI Conference on Graphics, Patterns and Images. New York: IEEE Press, 2019: 16-23.
[8]	LIANG D H, FAN G L, LIN G F, et al. Three-stream convolutional neural network with multi-task and ensemble learning for 3D action recognition[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. New York: IEEE Press, 2019: 934-940.
[9]	SONG S J, LAN C L, XING J L, et al. An end-to-end spatio-temporal attention model for human action recognition from skeleton data[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2017, 31(1): 4263-4270.
[10]	ZHANG P F, LAN C L, XING J L, et al. View adaptive recurrent neural networks for high performance human action recognition from skeleton data[C]// 2017 IEEE International Conference on Computer Vision. New York: IEEE Press, 2017: 2136-2145.
[11]	SI C Y, CHEN W T, WANG W, et al. An attention enhanced graph convolutional LSTM network for skeleton-based action recognition[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 1227-1236.
[12]	SHI L, ZHANG Y F, CHENG J, et al. Two-stream adaptive graph convolutional networks for skeleton-based action recognition[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 12018-12027.
[13]	CHEN Y X, ZHANG Z Q, YUAN C F, et al. Channel-wise topology refinement graph convolution for skeleton-based action recognition[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2021: 13339-13348.
[14]	PLIZZARI C, CANNICI M, MATTEUCCI M. Skeleton-based action recognition via spatial and temporal transformer networks[J]. Computer Vision and Image Understanding, 2021, 208-209: 103219.
[15]	赵洪, 宣士斌. 人体运动视频关键帧优化及行为识别[J]. 图学学报, 2018, 39(3): 463-469.
	ZHAO H, XUAN S B. Optimization and behavior identification of keyframes in human action video[J]. Journal of Graphics, 2018, 39(3): 463-469 (in Chinese).
[16]	TANG Y S, TIAN Y, LU J W, et al. Deep progressive reinforcement learning for skeleton-based action recognition[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 5323-5332.
[17]	SHI L, ZHANG Y F, CHENG J, et al. AdaSGN: adapting joint number and model size for efficient skeleton-based action recognition[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2021: 13393-13402.
[18]	ZHI Y, TONG Z, WANG L M, et al. MGSampler: an explainable sampling strategy for video action recognition[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2021: 1493-1502.
[19]	FEICHTENHOFER C, FAN H Q, MALIK J, et al. SlowFast networks for video recognition[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 6201-6210.
[20]	汪成峰, 陈洪, 张瑞萱, 等. 带有关节权重的DTW动作识别算法研究[J]. 图学学报, 2016, 37(4): 537-544.
	WANG C F, CHEN H, ZHANG R X, et al. Research on DTW action recognition algorithm with joint weighting[J]. Journal of Graphics, 2016, 37(4): 537-544 (in Chinese).
[21]	SHAHROUDY A, LIU J, NG T T, et al. NTU RGB+D: a large scale dataset for 3D human activity analysis[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 1010-1019.
[22]	LIU J, SHAHROUDY A, PEREZ M, et al. NTU RGB+D 120: a large-scale benchmark for 3D human activity understanding[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 42(10): 2684-2701.
[23]	SHAO D, ZHAO Y, DAI B, et al. FineGym: a hierarchical video dataset for fine-grained action understanding[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 2613-2622.
[24]	张钹, 朱军, 苏航. 迈向第三代人工智能[J]. 中国科学: 信息科学, 2020, 50(9): 1281-1302.
	ZHANG B, ZHU J, SU H. Toward the third generation of artificial intelligence[J]. Scientia Sinica: Informationis, 2020, 50(9): 1281-1302 (in Chinese).
[25]	KWON H, KIM M, KWAK S, et al. Learning self-similarity in space and time as generalized motion for video action recognition[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2021: 13045-13055.
[26]	DUAN H D, ZHAO Y, CHEN K, et al. Revisiting skeleton-based action recognition[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 2959-2968.
[27]	DUAN H D, WANG J Q, CHEN K, et al. PYSKL: towards good practices for skeleton action recognition[C]// The 30th ACM International Conference on Multimedia. New York: ACM, 2022: 7351-7354.
[28]	WANG J D, SUN K, CHENG T H, et al. Deep high-resolution representation learning for visual recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, 43(10): 3349-3364.
[29]	CAETANO C, SENA J, BRÉMOND F, et al. SkeleMotion: a new representation of skeleton joint sequences based on motion information for 3D action recognition[C]// 2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance. New York: IEEE Press, 2019: 1-8.
[30]	KE Q H, BENNAMOUN M, AN S J, et al. Learning clip representations for skeleton-based 3D action recognition[J]. IEEE Transactions on Image Processing, 2018, 27(6): 2842-2855.
[31]	LIU J, WANG G, HU P, et al. Global context-aware attention LSTM networks for 3D action recognition[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 3671-3680.
[32]	SI C Y, JING Y, WANG W, et al. Skeleton-based action recognition with spatial reasoning and temporal stack learning[C]// European Conference on Computer Vision. Cham: Springer, 2018: 106-121.
[33]	LI M S, CHEN S H, CHEN X, et al. Actional-structural graph convolutional networks for skeleton-based action recognition[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 3590-3598.
[34]	SONG Y F, ZHANG Z, SHAN C F, et al. Richly activated graph convolutional network for robust skeleton-based action recognition[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2021, 31(5): 1915-1925.
[35]	LIU K, GAO L, KHAN N M, et al. A multi-stream graph convolutional networks-hidden conditional random field model for skeleton-based action recognition[J]. IEEE Transactions on Multimedia, 2020, 23: 64-76.
[36]	YANG H, YAN D, ZHANG L, et al. Feedback graph convolutional network for skeleton-based action recognition[J]. IEEE Transactions on Image Processing, 2021, 31: 164-175.
[37]	CHENG K, ZHANG Y F, HE X Y, et al. Skeleton-based action recognition with shift graph convolutional network[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 180-189.
[38]	GIRDHAR R, RAMANAN D, GUPTA A, et al. ActionVLAD: learning spatio-temporal aggregation for action classification[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 3165-3174.
[39]	ZHOU B L, ANDONIAN A, OLIVA A, et al. Temporal relational reasoning in videos[C]// European Conference on Computer Vision. Cham: Springer, 2018: 831-846.
[40]	LIN J, GAN C, HAN S. TSM: temporal shift module for efficient video understanding[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 7082-7092.
[41]	KIM M, KWON H, WANG C Y, et al. Relational self-attention: what’s missing in attention for video understanding[EB/OL]. [2023-11-10]. http://arxiv.org/abs/2111.01673.
[42]	SHI J, ZHANG Y Y, WANG W H, et al. A novel two-stream transformer-based framework for multi-modality human action recognition[J]. Applied Sciences, 2023, 13(4): 2058.