Multiscale temporal enhanced action recognition method based on hypergraph Transformer

doi:10.11996/JG.j.2095-302X.2026020311

Abstract:

Skeleton-based human action recognition has attracted wide attention because skeleton data are robust to background interference and provide a structured representation. In recent years, the Transformer architecture has been widely applied to this task owing to its strong modeling capability. However, existing methods still struggle with actions that involve local detail changes, complex temporal dynamics, or strong temporal dependence, mainly because of insufficient local spatial semantic modeling, limited multi-scale dynamic perception, and a lack of explicit temporal position awareness. In addition, conventional Transformer-based methods reduce the temporal dimension with standard temporal convolution, which is prone to losing important dynamic information. To overcome these problems, a multi-scale temporal-enhanced model based on a hypergraph Transformer is proposed. First, a Local Multi-scale Enhancement (LME) module is designed: a rectangular context modeling mechanism strengthens the perception of local features in key regions such as the limbs, and an efficient multi-scale attention mechanism fuses action patterns at different temporal granularities, improving the model's adaptability to actions performed at different rhythms. Second, a learnable Temporal Positional Encoding (TPE) is introduced into the spatial attention module to inject a temporal prior into spatial dependency modeling, so that spatio-temporal coupling is captured more accurately. Third, a temporal compression module, Squeeze and Excitation Downsampling (SEDS), based on the Haar wavelet transform and a channel attention mechanism, replaces the conventional temporal-convolution downsampling, reducing computation while preserving key dynamic information. Experimental results on three public datasets, NTU RGB+D 60, NTU RGB+D 120, and Northwestern-UCLA, show that the proposed model outperforms a range of mainstream methods in recognition accuracy and is more robust and accurate in scenarios with complex backgrounds, fine-grained actions, and large-scale data.

Key words: skeleton-based action recognition, temporal positional encoding, multi-scale features, rectangular context modeling, local features
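
The abstract names two components that are easy to picture in code: a learnable temporal positional encoding added to the features, and temporal downsampling built from a Haar wavelet step followed by channel attention. The NumPy sketch below is only an illustration under stated assumptions, not the authors' implementation. The tensor layout `(C, T, V)`, all function names, the squeeze-and-excitation weights `w1` and `w2`, and the way the two Haar bands are fused are hypothetical choices that the abstract does not specify.

```python
import numpy as np


def add_temporal_position(x, tpe):
    """Add a per-frame embedding to skeleton features.

    x:   (C, T, V) features (channels, frames, joints); layout is an assumption.
    tpe: (T, C) embedding table; it would be a trainable parameter in a real model.
    """
    return x + tpe.T[:, :, None]  # broadcast each frame embedding over all joints


def haar_split_time(x):
    """One-level Haar transform along time: returns (low, high), each (C, T//2, V)."""
    if x.shape[1] % 2 != 0:
        raise ValueError("number of frames T must be even")
    even, odd = x[:, 0::2, :], x[:, 1::2, :]
    low = (even + odd) / np.sqrt(2.0)   # smooth trend between frame pairs
    high = (even - odd) / np.sqrt(2.0)  # frame-to-frame change (motion detail)
    return low, high


def se_gate(feat, w1, w2):
    """Squeeze-and-excitation style gate: (K, T, V) -> per-channel weights (K,)."""
    squeezed = feat.mean(axis=(1, 2))            # global average pool per channel
    hidden = np.maximum(w1 @ squeezed, 0.0)      # bottleneck layer with ReLU
    return 1.0 / (1.0 + np.exp(-(w2 @ hidden)))  # sigmoid gate


def seds_sketch(x, w1, w2):
    """Halve the temporal length while keeping both Haar bands (assumed fusion)."""
    low, high = haar_split_time(x)
    bands = np.concatenate([low, high], axis=0)  # (2C, T//2, V)
    gated = bands * se_gate(bands, w1, w2)[:, None, None]
    c = x.shape[0]
    return gated[:c] + gated[c:]                 # sum the gated bands: (C, T//2, V)


# Toy input: 4 channels, 8 frames, 25 joints (NTU RGB+D skeletons have 25 joints).
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 25))
tpe = 0.02 * rng.standard_normal((8, 4))
w1 = rng.standard_normal((2, 8))  # 2C -> bottleneck of size 2
w2 = rng.standard_normal((8, 2))  # bottleneck -> 2C
out = seds_sketch(add_temporal_position(x, tpe), w1, w2)
print(out.shape)  # (4, 4, 25)
```

Unlike a strided convolution, the Haar step is orthogonal, so the low and high bands together carry the same energy as the input. In this sketch, information is discarded only by the gating and fusion step.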

CLC number: TP391.41

CHEN Qingshuan, CHEN Enqing, GUO Xin, WANG Song. Multiscale temporal enhanced action recognition method based on hypergraph Transformer[J]. Journal of Graphics, 2026, 47(2): 311-321.

Figures/Tables: 15

References: 29

[1]	SUN M Z, ZHANG P, SU B Y. Survey of human action recognition methods based on skeleton data features[J]. Software Guide, 2022, 21(4): 233-239 (in Chinese).
[2]	HUANG Q, CUI J W, LI C. A review of skeleton-based human action recognition[J]. Journal of Computer-Aided Design & Computer Graphics, 2024, 36(2): 173-194 (in Chinese).
[3]	LU J, LI X F, ZHAO B, et al. A review of skeleton-based human action recognition[J]. Journal of Image and Graphics, 2023, 28(12): 3651-3669 (in Chinese).
[4]	KARPATHY A, TODERICI G, SHETTY S, et al. Large-scale video classification with convolutional neural networks[C]// 2014 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2014: 1725-1732.
[5]	JIANG S N, CHEN E Q, ZHENG M Y, et al. Human action recognition based on ResNeXt[J]. Journal of Graphics, 2020, 41(2): 277-282 (in Chinese).
[6]	YAN S J, XIONG Y J, LIN D H. Spatial temporal graph convolutional networks for skeleton-based action recognition[EB/OL]. [2025-06-22]. https://ojs.aaai.org/index.php/aaai/article/view/12328.
[7]	SHI L, ZHANG Y F, CHENG J, et al. Two-stream adaptive graph convolutional networks for skeleton-based action recognition[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 12026-12035.
[8]	TIAN X Y, JIN Y, ZHANG Z, et al. STGA-Net: spatial-temporal graph attention network for skeleton-based temporal action segmentation[C]// 2023 IEEE International Conference on Multimedia and Expo Workshops. New York: IEEE Press, 2023: 218-223.
[9]	SHI M, TANG Y F, ZHU X Q, et al. Multi-class imbalanced graph convolutional network learning[EB/OL]. [2025-06-22]. https://dl.acm.org/doi/10.5555/3491440.3491838.
[10]	PLIZZARI C, CANNICI M, MATTEUCCI M. Skeleton-based action recognition via spatial and temporal transformer networks[J]. Computer Vision and Image Understanding, 2021, 208-209: 103219.
[11]	WANG K X. Two-person interaction behavior recognition and data augmentation method based on Transformer[D]. Xi’an: Xidian University, 2023 (in Chinese).
[12]	ZHOU Y X, CHENG Z Q, LI C, et al. Hypergraph transformer for skeleton-based action recognition[EB/OL]. [2025-06-22]. https://arxiv.org/abs/2211.09590.
[13]	SHAHROUDY A, LIU J, NG T T, et al. NTU RGB+D: a large scale dataset for 3D human activity analysis[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 1010-1019.
[14]	LIU J, SHAHROUDY A, PEREZ M, et al. NTU RGB+D 120: a large-scale benchmark for 3D human activity understanding[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 42(10): 2684-2701.
[15]	WANG J, NIE X H, XIA Y, et al. Cross-view action modeling, learning, and recognition[C]// 2014 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2014: 2649-2656.
[16]	CHEN Y F, YOU Z, ZHANG S H, et al. Core context aware transformers for long context language modeling[EB/OL]. [2025-06-22]. https://icml.cc/virtual/2025/poster/45555.
[17]	ZHANG P F, LAN C L, XING J L, et al. View adaptive recurrent neural networks for high performance human action recognition from skeleton data[C]// 2017 IEEE International Conference on Computer Vision. New York: IEEE Press, 2017: 2136-2145.
[18]	SI C Y, CHEN W T, WANG W, et al. An attention enhanced graph convolutional LSTM network for skeleton-based action recognition[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 1227-1236.
[19]	ZHANG P F, LAN C L, XING J L, et al. View adaptive neural networks for high performance skeleton-based human action recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 41(8): 1963-1978.
[20]	XU K L, YE F F, ZHONG Q Y, et al. Topology-aware convolutional neural network for efficient skeleton-based action recognition[EB/OL]. [2025-06-22]. https://ojs.aaai.org/index.php/AAAI/article/view/20191.
[21]	CHENG K, ZHANG Y F, HE X Y, et al. Skeleton-based action recognition with shift graph convolutional network[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 180-189.
[22]	LEE J, LEE M, LEE D, et al. Hierarchically decomposed graph convolutional networks for skeleton-based action recognition[C]// 2023 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2023: 10410-10419.
[23]	ZHU Q L, DENG H M. Spatial adaptive graph convolutional network for skeleton-based action recognition[J]. Applied Intelligence, 2023, 53(14): 17796-17808.
[24]	ZHOU H Y, LIU Q J, WANG Y H. Learning discriminative representations for skeleton based action recognition[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 10608-10617.
[25]	WANG X H, XU X, MU Y D. Neural Koopman pooling: control-inspired temporal dynamics encoding for skeleton-based action recognition[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 10597-10607.
[26]	XIANG W M, LI C, ZHOU Y X, et al. Generative action description prompts for skeleton-based action recognition[C]// 2023 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2023: 10242-10251.
[27]	ZHOU Y X, YAN X D, CHENG Z Q, et al. BlockGCN: redefine topology awareness for skeleton-based action recognition[C]// 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2024: 2049-2058.
[28]	CHEN D, CHEN M D, WU P S, et al. Two-stream spatio-temporal GCN-transformer networks for skeleton-based action recognition[J]. Scientific Reports, 2025, 15(1): 4982.
[29]	LEE I, KIM D, KANG S, et al. Ensemble deep learning for skeleton-based action recognition using temporal sliding LSTM networks[C]// 2017 IEEE International Conference on Computer Vision. New York: IEEE Press, 2017: 1012-1020.

Type	Method	Params (M)	NTU RGB+D 60 X-Sub (%)	NTU RGB+D 60 X-View (%)	NTU RGB+D 120 X-Sub (%)	NTU RGB+D 120 X-Set (%)
RNN	VA-LSTM [17]	-	79.4	87.6	-	-
RNN	AGC-LSTM [18]	22.90	89.2	95.0	-	-
CNN	VA-CNN [19]	24.90	88.7	94.3	-	-
CNN	Ta-CNN+ [20]	1.06	90.7	95.1	85.7	87.3
GCN	Shift-GCN (4-ensemble) [21]	2.80	90.7	96.5	85.9	87.6
GCN	HD-GCN [22]	-	93.0	97.0	89.8	91.2
GCN	SARGCN [23]	1.09	88.9	94.8	83.8	85.1
GCN	FR-Head [24]	1.45	92.8	96.8	89.5	90.9
GCN	Koopman [25]	5.38	92.9	96.8	90.0	91.3
GCN	LST [26]	2.10	92.9	97.0	89.9	91.1
GCN	BlockGCN [27]	1.30	93.1	97.0	90.3	91.5
Transformer	ST-TR [10]	12.10	89.9	96.1	82.7	84.7
Transformer	SA-TDGFormer [28]	-	92.7	96.8	86.8	88.9
Transformer	Hyperformer (joint)	2.72	90.5	94.8	86.4	88.0
Transformer	Hyperformer [12]	2.72	92.7	96.2	89.7	91.0
Transformer	LTPEformer (joint)	2.88	91.4	95.8	87.2	88.4
Transformer	LTPEformer	2.88	93.3	97.0	90.2	91.5
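
The gap between LTPEformer and the Hyperformer rows can be read directly off the table above. The short script below is a convenience sketch: the numbers are copied from the table rows, while the variable and function names are illustrative. It computes the per-benchmark accuracy gain and the relative parameter overhead.

```python
# Accuracy (%) in column order: NTU60 X-Sub, NTU60 X-View, NTU120 X-Sub, NTU120 X-Set.
BENCHMARKS = ("NTU60 X-Sub", "NTU60 X-View", "NTU120 X-Sub", "NTU120 X-Set")

# Values copied from the Hyperformer and LTPEformer rows of the table.
HYPERFORMER = {"params_m": 2.72, "acc": (92.7, 96.2, 89.7, 91.0)}
LTPEFORMER = {"params_m": 2.88, "acc": (93.3, 97.0, 90.2, 91.5)}
HYPERFORMER_JOINT = {"params_m": 2.72, "acc": (90.5, 94.8, 86.4, 88.0)}
LTPEFORMER_JOINT = {"params_m": 2.88, "acc": (91.4, 95.8, 87.2, 88.4)}


def accuracy_gains(new, base):
    """Percentage-point gain of `new` over `base` on each benchmark."""
    return [round(n - b, 1) for n, b in zip(new["acc"], base["acc"])]


def param_overhead_percent(new, base):
    """Relative increase in parameter count, in percent."""
    return round((new["params_m"] - base["params_m"]) / base["params_m"] * 100, 1)


full_gain = accuracy_gains(LTPEFORMER, HYPERFORMER)
joint_gain = accuracy_gains(LTPEFORMER_JOINT, HYPERFORMER_JOINT)
overhead = param_overhead_percent(LTPEFORMER, HYPERFORMER)

for name, full, joint in zip(BENCHMARKS, full_gain, joint_gain):
    print(f"{name}: +{full} points, joint stream only: +{joint} points")
print(f"parameter overhead: {overhead}%")
```

With the table values, the gains are 0.6, 0.8, 0.5, and 0.5 percentage points, and 0.9, 1.0, 0.8, and 0.4 for the joint stream alone, for roughly 5.9% more parameters.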

Type	Method	Accuracy (%)
RNN	TS-LSTM [29]	89.2
RNN	2s-AGC-LSTM [18]	93.3
CNN	VA-CNN [19]	90.7
CNN	Ta-CNN [20]	96.1
GCN	4s-shift-GCN [21]	94.6
GCN	BlockGCN [27]	96.9
Transformer	Hyperformer [12]	96.5
Transformer	LTPEformer	97.0

Model	Method	Params (M)	X-Sub (%)
1	Baseline	2.72	90.5
2	+1LME	2.75	91.0
3	+2LME	2.75	90.8
4	+3LME	2.75	90.6