Human action recognition based on skeleton dynamic temporal filter

doi:10.11996/JG.j.2095-302X.2024040760

Journal of Graphics ›› 2024, Vol. 45 ›› Issue (4): 760-769. DOI: 10.11996/JG.j.2095-302X.2024040760

• Image Processing and Computer Vision •

LI Songyang, WANG Xueting, CHEN Xianglong, CHEN Enqing

School of Electrical and Information Engineering, Zhengzhou University, Zhengzhou, Henan 450001, China

Received: 2023-10-08 Accepted: 2024-02-20 Published: 2024-08-31 Online: 2024-09-03
Contact: CHEN Enqing (1977-), professor, Ph.D. His main research interests cover computer vision, pattern recognition and multimedia information processing. E-mail: ieeqchen@zzu.edu.cn
First author: LI Songyang (1998-), master's student. His main research interests cover computer vision and pattern recognition. E-mail: lisongyang1998@gs.zzu.edu.cn
Supported by:
National Natural Science Foundation of China (62101503); National Natural Science Foundation of China (U1804152); Scientific and Technological Project of Henan (222102210102); National Supercomputing Center in Zhengzhou Project

Abstract

Human action recognition is a key research area in computer vision, with wide applications in intelligent surveillance, human-computer interaction, and related fields. Existing skeleton-based action recognition methods mostly cascade graph convolutional networks (GCN) with temporal convolutional networks (TCN). However, the limited kernel size of the TCN restricts the model's global temporal modeling capability. Moreover, processing skeleton data with convolution alone provides little ability to discriminate between different skeleton points, and TCN feature extraction often involves repeated computation, so the parameter count of the TCN grows as the network deepens. To address these issues, a skeleton dynamic temporal filter (SDTF) module based on signal processing methods was proposed to replace the TCN for global modeling of temporal features. On this basis, lightweight improvements were made to AGCN, and the resulting AGCN-SDTF action recognition model reduced model complexity. SDTF modeled temporal features through the Fourier transform: the frequency-domain features obtained from the Fourier transform were multiplied by the frequency-domain output of the filter and then passed through the inverse Fourier transform, thereby extracting global temporal features. Experiments on the large-scale NTU-RGBD and Kinetics-Skeleton datasets showed that the model reduced the required parameters and computation (for the two-stream model on NTU-RGBD, from 6.90 M to 4.74 M parameters and from 37.30 to 25.22 GFLOPs) while achieving recognition performance comparable to that of the original model.

Key words: human action recognition, graph convolutional network, dynamic temporal filter, Fourier transform, temporal convolutional network
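
The frequency-domain filtering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' SDTF implementation: the filter `h` is a hypothetical fixed sequence (in SDTF it would be produced by the network), the input is a single made-up coordinate track, and a naive DFT from the standard library stands in for an FFT routine. The sketch only shows why multiplying spectra and transforming back gives every output frame access to all input frames: the result equals a circular convolution over the whole sequence.

```python
import cmath

def dft(seq, inverse=False):
    """Naive O(T^2) discrete Fourier transform of a 1-D sequence."""
    T = len(seq)
    sign = 1 if inverse else -1
    out = []
    for k in range(T):
        acc = 0j
        for t in range(T):
            acc += seq[t] * cmath.exp(sign * 2j * cmath.pi * k * t / T)
        out.append(acc / T if inverse else acc)
    return out

def filter_in_frequency(x, h):
    """Multiply the spectra of x and h, then transform back to the time domain."""
    X, H = dft(x), dft(h)
    y = dft([a * b for a, b in zip(X, H)], inverse=True)
    return [v.real for v in y]

def circular_convolution(x, h):
    """Reference computation: every output frame mixes all T input frames."""
    T = len(x)
    return [sum(h[s] * x[(t - s) % T] for s in range(T)) for t in range(T)]

# One joint coordinate tracked over 8 frames (made-up values).
x = [0.0, 1.0, 0.5, -0.5, -1.0, 0.0, 0.5, 1.0]
# Hypothetical temporal filter; an assumption for illustration only.
h = [0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.25]

y_freq = filter_in_frequency(x, h)
y_time = circular_convolution(x, h)
print(max(abs(a - b) for a, b in zip(y_freq, y_time)))  # numerically close to zero
```

A temporal convolution with a small kernel would only touch a few neighbouring frames per layer, whereas the product in the frequency domain has a receptive field equal to the full sequence length in a single step.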

CLC number:

TP391
TP183

LI Songyang, WANG Xueting, CHEN Xianglong, CHEN Enqing. Human action recognition based on skeleton dynamic temporal filter[J]. Journal of Graphics, 2024, 45(4): 760-769.

Figures/Tables 14

Fig. 1 Adaptive graph convolutional skeleton dynamic temporal filter network

Fig. 2 Structure of the SDTF module

Table 1 Theoretical convolution parameter counts in AGCN

Layer	Input channels	Output channels	Parameters
1	3	64	1792
2, 3, 4	64	64	36928
5	64	128	73856
6, 7	128	128	147584
8	128	256	295168
9, 10	256	256	590080
Total	3	256	1145408

Table 2 Theoretical convolution parameter counts in AGCN-SDTF

Layer	Input channels	Output channels	Parameters
1	3	64	1792
2, 3, 4	300	302	90902
5	64	128	73856
6, 7	150	152	22952
8	128	256	295168
9, 10	75	76	5776
Total	3	256	490446
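
The per-row values in Tables 1 and 2 can be reproduced with the usual convolution parameter count, weights plus one bias per output channel. The kernel sizes below (9 for the temporal convolution rows, 1 for the SDTF rows) are not stated on this page; they are assumptions chosen because they match the tabulated numbers. With these values, each listed total equals the sum of the six row values counted once each.

```python
def conv_params(in_channels, out_channels, kernel_size):
    """Parameter count of a convolution: weights plus one bias per output channel."""
    return in_channels * out_channels * kernel_size + out_channels

# Table 1 rows as (input channels, output channels); kernel size 9 is an assumption.
agcn_rows = [(3, 64), (64, 64), (64, 128), (128, 128), (128, 256), (256, 256)]
agcn = [conv_params(i, o, 9) for i, o in agcn_rows]

# Table 2 rows as (input, output, kernel size); kernel sizes are assumptions.
sdtf_rows = [(3, 64, 9), (300, 302, 1), (64, 128, 9),
             (150, 152, 1), (128, 256, 9), (75, 76, 1)]
sdtf = [conv_params(i, o, k) for i, o, k in sdtf_rows]

print(agcn, sum(agcn))
print(sdtf, sum(sdtf))
print(f"ratio of totals: {sum(sdtf) / sum(agcn):.3f}")
```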

Fig. 3 Input computation of AGCN and AGCN-SDTF ((a) AGCN; (b) AGCN-SDTF)

Table 3 Influence of activation function selection on model performance

Activation function	Accuracy/%
None	85.4
Sigmoid	86.4
Tanh	84.8
Softmax	86.1
LeakyReLU	84.5
ReLU	86.9

Table 4 Influence of the convolutional layer and BN layer on SDTF performance

Component	Accuracy/%
Conv*	86.1
BN*	84.6

Table 5 Impact of the number of SDTF layers on model performance

Model	Accuracy/%
AGCN	86.5
Full-SDTF	80.1
Meta-SDTF	85.4
AGCN-SDTF	86.9

Table 6 Impact of the number of SDTF layers on model complexity

Model	Parameters/M	Computation/GFLOPs
AGCN	3.45	18.65
Full-SDTF	1.89	10.35
Meta-SDTF	2.46	13.27
AGCN-SDTF	2.37	12.61
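
The complexity figures in Table 6 are easier to compare as relative reductions against the AGCN baseline. The short calculation below uses only the tabulated values; the helper name `reduction_percent` is introduced here for illustration.

```python
def reduction_percent(before, after):
    """Relative reduction of `after` with respect to `before`, in percent."""
    return 100.0 * (before - after) / before

# (parameters in M, computation in GFLOPs), copied from Table 6
baseline = (3.45, 18.65)            # AGCN
variants = {
    "Full-SDTF": (1.89, 10.35),
    "Meta-SDTF": (2.46, 13.27),
    "AGCN-SDTF": (2.37, 12.61),
}

for name, (params, flops) in variants.items():
    print(f"{name}: parameters -{reduction_percent(baseline[0], params):.1f}%, "
          f"computation -{reduction_percent(baseline[1], flops):.1f}%")
```

For AGCN-SDTF this gives a reduction of about 31.3% in parameters and 32.4% in computation relative to AGCN.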

Table 7 Performance of AGCN on the NTU-RGBD dataset

Model	Data type	CS accuracy/%	CV accuracy/%
AGCN	Joint	86.5	93.7
AGCN	Bone	87.1	93.2
AGCN-TCN*	Joint	85.8	93.4
AGCN-TCN*	Bone	86.5	93.3

Table 8 Performance of AGCN-SDTF on the NTU-RGBD dataset

Model	Data type	CS accuracy/%	CV accuracy/%
AGCN-SDTF	Joint	86.9	93.7
AGCN-SDTF	Bone	87.3	93.6

Table 9 Accuracy comparison of different models on the NTU-RGBD dataset

Model	CS accuracy/%	CV accuracy/%	Parameters/M	Computation/GFLOPs
ST-GCN^[13]	81.5	88.3	3.08	16.32
AS-GCN^[26]	86.8	94.2	-	-
NAS-GCN^[27]	87.6	94.5	6.50	36.60
ST-TR-AGCN^[16]	89.3	96.1	12.11	64.41
2s-AGCN^[14]	88.5	95.1	6.90	37.30
2s-AGCN-SDTF	89.1	95.1	4.74	25.22

Table 10 Accuracy comparison of different models on the Kinetics-Skeleton dataset

Model	Top-1 accuracy/%	Top-5 accuracy/%
ST-GCN^[13]	30.7	52.8
AS-GCN^[26]	34.8	56.5
NAS-GCN^[27]	35.5	57.9
2s-AGCN^[14]	36.1	58.7
2s-AGCN-SDTF	35.8	59.0
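
Table 10 reports Top-1 and Top-5 accuracy. As a reminder of how such a metric is commonly computed, here is a minimal sketch: a sample counts as correct when its true label is among the k highest-scoring classes. The scores and labels are made-up toy values (three classes, so k = 2 stands in for k = 5), not data from the paper.

```python
def top_k_accuracy(scores, labels, k):
    """Percentage of samples whose true label is among the k highest scores."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        hits += label in ranked[:k]
    return 100.0 * hits / len(labels)

# Toy class scores for 4 samples over 3 classes (hypothetical values).
scores = [
    [0.10, 0.70, 0.20],
    [0.50, 0.30, 0.20],
    [0.20, 0.30, 0.50],
    [0.40, 0.35, 0.25],
]
labels = [1, 1, 2, 2]

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
print(top1, top2)
```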

Table 11 Accuracy of ST-GCN-SDTF and CTR-GCN-SDTF

Model	CS accuracy/%
ST-GCN^[13]	81.5
ST-GCN-SDTF	81.7
CTR-GCN^[21]	89.8
CTR-GCN-SDTF	90.0

References 27

[1]	KRIZHEVSKY A, SUTSKEVER I, HINTON G E. ImageNet classification with deep convolutional neural networks[J]. Communications of the ACM, 2017, 60(6): 84-90.
[2]	KARPATHY A, TODERICI G, SHETTY S, et al. Large-scale video classification with convolutional neural networks[C]// 2014 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2014: 1725-1732.
[3]	SIMONYAN K, ZISSERMAN A. Two-stream convolutional networks for action recognition in videos[C]// The 27th International Conference on Neural Information Processing Systems. New York: IEEE Press, 2014: 568-576.
[4]	JIANG S N, CHEN E Q, ZHENG M Y, et al. Human action recognition based on ResNeXt[J]. Journal of Graphics, 2020, 41(2): 277-282 (in Chinese).
[5]	NG J Y H, HAUSKNECHT M, VIJAYANARASIMHAN S, et al. Beyond short snippets: deep networks for video classification[C]// 2015 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2015: 4694-4702.
[6]	YANG S Q, YANG J T, LI Z, et al. Human action recognition based on LSTM neural network[J]. Journal of Graphics, 2021, 42(2): 174-181 (in Chinese).
[7]	WANG L M, XIONG Y J, WANG Z, et al. Temporal segment networks: towards good practices for deep action recognition[C]// European Conference on Computer Vision. Cham: Springer, 2016: 20-36.
[8]	TRAN D, BOURDEV L, FERGUS R, et al. Learning spatiotemporal features with 3D convolutional networks[C]// 2015 IEEE International Conference on Computer Vision. New York: IEEE Press, 2015: 4489-4497.
[9]	CARREIRA J, ZISSERMAN A. Quo vadis, action recognition? a new model and the Kinetics dataset[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 6299-6308.
[10]	TRAN D, WANG H, TORRESANI L, et al. A closer look at spatiotemporal convolutions for action recognition[C]// 2018 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 6450-6459.
[11]	ZHANG Z Y. Microsoft Kinect sensor and its effect[J]. IEEE Multimedia, 2012, 19(2): 4-10.
[12]	FANG H S, XIE S Q, TAI Y W, et al. RMPE: regional multi-person pose estimation[C]// 2017 IEEE International Conference on Computer Vision. New York: IEEE Press, 2017: 2334-2343.
[13]	YAN S J, XIONG Y J, LIN D H. Spatial temporal graph convolutional networks for skeleton-based action recognition[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2018, 32(1): 7444-7452.
[14]	SHI L, ZHANG Y F, CHENG J, et al. Two-stream adaptive graph convolutional networks for skeleton-based action recognition[C]// 2019 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 12026-12035.
[15]	SHI L, ZHANG Y F, CHENG J, et al. Skeleton-based action recognition with directed graph neural networks[C]// 2019 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 7912-7921.
[16]	PLIZZARI C, CANNICI M, MATTEUCCI M. Skeleton-based action recognition via spatial and temporal transformer networks[J]. Computer Vision and Image Understanding, 2021, 208: 103219.
[17]	AN F, DAI J, HAN Z, et al. Self-supervised optical flow estimation with attention module[J]. Journal of Graphics, 2022, 43(5): 841-848 (in Chinese).
[18]	LEE J, LEE M, LEE D, et al. Hierarchically decomposed graph convolutional networks for skeleton-based action recognition[C]// 2023 IEEE International Conference on Computer Vision. New York: IEEE Press, 2023: 10444-10453.
[19]	DONG J, SUN S, LIU Z, et al. Hierarchical contrast for unsupervised skeleton-based action representation learning[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2023, 37(1): 525-533.
[20]	SUN S K, LIU D Z, DONG J F, et al. Unified multi-modal unsupervised representation learning for skeleton-based action understanding[C]// The 31st ACM International Conference on Multimedia. New York: ACM, 2023: 2973-2984.
[21]	CHEN Y X, ZHANG Z Q, YUAN C F, et al. Channel-wise topology refinement graph convolution for skeleton-based action recognition[C]// 2021 IEEE International Conference on Computer Vision. New York: IEEE Press, 2021: 13359-13368.
[22]	OBINATA Y, YAMAMOTO T. Temporal extension module for skeleton-based action recognition[C]// The 25th International Conference on Pattern Recognition. New York: IEEE Press, 2021: 534-540.
[23]	LONG F, QIU Z, PAN Y, et al. Dynamic temporal filtering in video models[C]// The 17th European Conference on Computer Vision. Cham: Springer, 2022: 475-492.
[24]	SHAHROUDY A, LIU J, NG T T, et al. NTU RGB+D: a large scale dataset for 3D human activity analysis[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 1010-1019.
[25]	KAY W, CARREIRA J, SIMONYAN K, et al. The kinetics human action video dataset[EB/OL]. [2023-08-19]. https://arxiv.org/abs/1705.06950v1.
[26]	LI M S, CHEN S H, CHEN X, et al. Actional-structural graph convolutional networks for skeleton-based action recognition[C]// 2019 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 3595-3603.
[27]	PENG W, HONG X P, CHEN H Y, et al. Learning graph convolutional network for skeleton-based human action recognition by neural searching[C]// The AAAI Conference on Artificial Intelligence. Palo Alto: AAAI, 2020: 2669-2676.
