Journal of Graphics ›› 2024, Vol. 45 ›› Issue (4): 760-769.DOI: 10.11996/JG.j.2095-302X.2024040760
• Image Processing and Computer Vision •
LI Songyang, WANG Xueting, CHEN Xianglong, CHEN Enqing
Received: 2023-10-08
Accepted: 2024-02-20
Online: 2024-08-31
Published: 2024-09-03
Contact: CHEN Enqing
About author: LI Songyang (1998-), master student. His main research interests cover computer vision and pattern recognition. E-mail: lisongyang1998@gs.zzu.edu.cn
LI Songyang, WANG Xueting, CHEN Xianglong, CHEN Enqing. Human action recognition based on skeleton dynamic temporal filter[J]. Journal of Graphics, 2024, 45(4): 760-769.
URL: http://www.txxb.com.cn/EN/10.11996/JG.j.2095-302X.2024040760
Layer | Input channels | Output channels | Parameters
---|---|---|---
1 | 3 | 64 | 1792
2, 3, 4 | 64 | 64 | 36928
5 | 64 | 128 | 73856
6, 7 | 128 | 128 | 147584
8 | 128 | 256 | 295168
9, 10 | 256 | 256 | 590080
Total | 3 | 256 | 1145408

Table 1 Convolution parameter quantities in AGCN
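The per-layer figures in Table 1 are consistent with a temporal convolution of kernel length 9 plus a per-channel bias (parameters = C_in × C_out × 9 + C_out); note that the table's total counts each distinct layer configuration once, not once per layer. A minimal sketch reproducing the column under that assumed formula:

```python
def temporal_conv_params(c_in, c_out, k=9):
    # weights (c_in * c_out * k) plus one bias per output channel
    return c_in * c_out * k + c_out

# Distinct (input, output) channel configurations from Table 1
rows = [(3, 64), (64, 64), (64, 128), (128, 128), (128, 256), (256, 256)]
for c_in, c_out in rows:
    print(c_in, c_out, temporal_conv_params(c_in, c_out))
print("total:", sum(temporal_conv_params(ci, co) for ci, co in rows))  # total: 1145408
```

For example, the first row gives 3 × 64 × 9 + 64 = 1792, matching the table.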
Layer | Input channels | Output channels | Parameters
---|---|---|---
1 | 3 | 64 | 1792
2, 3, 4 | 300 | 302 | 90902
5 | 64 | 128 | 73856
6, 7 | 150 | 152 | 22952
8 | 128 | 256 | 295168
9, 10 | 75 | 76 | 5776
Total | 3 | 256 | 490446

Table 2 Convolution parameter quantities in AGCN-SDTF
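The SDTF rows in Table 2 (layers 2-4, 6-7, 9-10) match a pointwise filter (kernel length 1) whose channel counts follow the temporal length rather than the feature width, e.g. 300 × 302 + 302 = 90902. A sketch of where the parameter saving over Table 1 comes from, assuming that same weights-plus-bias formula:

```python
def conv_params(c_in, c_out, k):
    # weights (c_in * c_out * k) plus one bias per output channel
    return c_in * c_out * k + c_out

# Table 2: unchanged AGCN temporal convs (k = 9) ...
agcn_rows = [(3, 64), (64, 128), (128, 256)]
# ... and SDTF stages over the temporal dimension (k = 1), halving each stage
sdtf_rows = [(300, 302), (150, 152), (75, 76)]

total = sum(conv_params(ci, co, 9) for ci, co in agcn_rows) \
      + sum(conv_params(ci, co, 1) for ci, co in sdtf_rows)
print(total)  # 490446, matching the table's total
```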
Activation function | Accuracy/%
---|---
None | 85.4
Sigmoid | 86.4
Tanh | 84.8
Softmax | 86.1
LeakyReLU | 84.5
ReLU | 86.9

Table 3 Influence of activation function selection on model performance
Component | Accuracy/%
---|---
Conv* | 86.1
BN* | 84.6

Table 4 The influence of convolutional layer and BN layer on SDTF performance
Model | Accuracy/%
---|---
AGCN | 86.5
Full-SDTF | 80.1
Meta-SDTF | 85.4
AGCN-SDTF | 86.9

Table 5 The impact of using different layers of SDTF on model performance
Model | Parameters/M | FLOPs/GFLOPs
---|---|---
AGCN | 3.45 | 18.65
Full-SDTF | 1.89 | 10.35
Meta-SDTF | 2.46 | 13.27
AGCN-SDTF | 2.37 | 12.61

Table 6 The impact of using different layers of SDTF on model complexity
Model | Data type | CS accuracy/% | CV accuracy/%
---|---|---|---
AGCN | Joint | 86.5 | 93.7
AGCN | Bone | 87.1 | 93.2
AGCN-TCN* | Joint | 85.8 | 93.4
AGCN-TCN* | Bone | 86.5 | 93.3

Table 7 Performance of AGCN on NTU-RGBD Dataset
Model | Data type | CS accuracy/% | CV accuracy/%
---|---|---|---
AGCN-SDTF | Joint | 86.9 | 93.7
AGCN-SDTF | Bone | 87.3 | 93.6

Table 8 Performance of AGCN-SDTF on NTU-RGBD Dataset
Model | CS accuracy/% | CV accuracy/% | Parameters/M | FLOPs/GFLOPs
---|---|---|---|---
ST-GCN[13] | 81.5 | 88.3 | 3.08 | 16.32
AS-GCN[26] | 86.8 | 94.2 | - | -
NAS-GCN[27] | 87.6 | 94.5 | 6.50 | 36.60
ST-TR-AGCN[16] | 89.3 | 96.1 | 12.11 | 64.41
2s-AGCN[14] | 88.5 | 95.1 | 6.90 | 37.30
2s-AGCN-SDTF | 89.1 | 95.1 | 4.74 | 25.22

Table 9 Comparison of accuracy between different models in NTU-RGBD dataset
Model | Top-1 accuracy/% | Top-5 accuracy/%
---|---|---
ST-GCN[13] | 30.7 | 52.8
AS-GCN[26] | 34.8 | 56.5
NAS-GCN[27] | 35.5 | 57.9
2s-AGCN[14] | 36.1 | 58.7
2s-AGCN-SDTF | 35.8 | 59.0

Table 10 Comparison of accuracy between different models in Kinetics-Skeleton dataset
Model | CS accuracy/%
---|---
ST-GCN[13] | 81.5
ST-GCN-SDTF | 81.7
CTR-GCN[21] | 89.8
CTR-GCN-SDTF | 90.0

Table 11 Performance of ST-GCN-SDTF and CTR-GCN-SDTF
[1] | KRIZHEVSKY A, SUTSKEVER I, HINTON G E. ImageNet classification with deep convolutional neural networks[J]. Communications of the ACM, 2017, 60(6): 84-90. |
[2] | KARPATHY A, TODERICI G, SHETTY S, et al. Large-scale video classification with convolutional neural networks[C]// 2014 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2014: 1725-1732. |
[3] | SIMONYAN K, ZISSERMAN A. Two-stream convolutional networks for action recognition in videos[C]// The 27th International Conference on Neural Information Processing Systems. New York: IEEE Press, 2014: 568-576. |
[4] | JIANG S N, CHEN E Q, ZHENG M Y, et al. Human action recognition based on ResNeXt[J]. Journal of Graphics, 2020, 41(2): 277-282 (in Chinese). |
[5] | NG J Y H, HAUSKNECHT M, VIJAYANARASIMHAN S, et al. Beyond short snippets: deep networks for video classification[C]// 2015 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2015: 4694-4702. |
[6] | YANG S Q, YANG J T, LI Z, et al. Human action recognition based on LSTM neural network[J]. Journal of Graphics, 2021, 42(2): 174-181 (in Chinese). |
[7] | WANG L M, XIONG Y J, WANG Z, et al. Temporal segment networks: towards good practices for deep action recognition[C]// European Conference on Computer Vision. Cham: Springer, 2016: 20-36. |
[8] | TRAN D, BOURDEV L, FERGUS R, et al. Learning spatiotemporal features with 3D convolutional networks[C]// 2015 IEEE International Conference on Computer Vision. New York: IEEE Press, 2015: 4489-4497. |
[9] | CARREIRA J, ZISSERMAN A. Quo vadis, action recognition? a new model and the kinetics dataset[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 6299-6308. |
[10] | TRAN D, WANG H, TORRESANI L, et al. A closer look at spatiotemporal convolutions for action recognition[C]// 2018 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 6450-6459. |
[11] | ZHANG Z Y. Microsoft kinect sensor and its effect[J]. IEEE Multimedia, 2012, 19(2): 4-10. |
[12] | FANG H S, XIE S Q, TAI Y W, et al. RMPE: regional multi-person pose estimation[C]// 2017 IEEE International Conference on Computer Vision. New York: IEEE Press, 2017: 2334-2343. |
[13] | YAN S J, XIONG Y J, LIN D H. Spatial temporal graph convolutional networks for skeleton-based action recognition[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2018, 32(1): 7444-7452. |
[14] | SHI L, ZHANG Y F, CHENG J, et al. Two-stream adaptive graph convolutional networks for skeleton-based action recognition[C]// 2019 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 12026-12035. |
[15] | SHI L, ZHANG Y F, CHENG J, et al. Skeleton-based action recognition with directed graph neural networks[C]// 2019 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 7912-7921. |
[16] | PLIZZARI C, CANNICI M, MATTEUCCI M. Skeleton-based action recognition via spatial and temporal transformer networks[J]. Computer Vision and Image Understanding, 2021, 208: 103219. |
[17] | AN F, DAI J, HAN Z, et al. Self-supervised optical flow estimation with attention module[J]. Journal of Graphics, 2022, 43(5): 841-848 (in Chinese). |
[18] | LEE J, LEE M, LEE D, et al. Hierarchically decomposed graph convolutional networks for skeleton-based action recognition[C]// 2023 IEEE International Conference on Computer Vision. New York: IEEE Press, 2023: 10444-10453. |
[19] | DONG J, SUN S, LIU Z, et al. Hierarchical contrast for unsupervised skeleton-based action representation learning[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2023, 37(1): 525-533. |
[20] | SUN S K, LIU D Z, DONG J F, et al. Unified multi-modal unsupervised representation learning for skeleton-based action understanding[C]// The 31st ACM International Conference on Multimedia. New York: ACM, 2023: 2973-2984. |
[21] | CHEN Y X, ZHANG Z Q, YUAN C F, et al. Channel-wise topology refinement graph convolution for skeleton-based action recognition[C]// 2021 IEEE International Conference on Computer Vision. New York: IEEE Press, 2021: 13359-13368. |
[22] | OBINATA Y, YAMAMOTO T. Temporal extension module for skeleton-based action recognition[C]// The 25th International Conference on Pattern Recognition. New York: IEEE Press, 2021: 534-540. |
[23] | LONG F, QIU Z, PAN Y, et al. Dynamic temporal filtering in video models[C]// The 17th European Conference on Computer Vision. Cham: Springer, 2022: 475-492. |
[24] | SHAHROUDY A, LIU J, NG T T, et al. NTU RGB+D: a large scale dataset for 3D human activity analysis[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 1010-1019. |
[25] | KAY W, CARREIRA J, SIMONYAN K, et al. The kinetics human action video dataset[EB/OL]. [2023-08-19]. https://arxiv.org/abs/1705.06950v1. |
[26] | LI M S, CHEN S H, CHEN X, et al. Actional-structural graph convolutional networks for skeleton-based action recognition[C]// 2019 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 3595-3603. |
[27] | PENG W, HONG X P, CHEN H Y, et al. Learning graph convolutional network for skeleton-based human action recognition by neural searching[C]// The AAAI Conference on Artificial Intelligence. Palo Alto: AAAI, 2020: 2669-2676. |