Multiscale temporal enhanced action recognition method based on hypergraph Transformer

doi:10.11996/JG.j.2095-302X.2026020311

Abstract

Skeleton-based human action recognition has gained widespread attention because of its robustness to background interference and its structured representation. In recent years, the Transformer architecture has been widely applied to this task owing to its strong modeling capability. However, existing methods still struggle to recognize actions with subtle local changes, complex temporal dynamics, or strong temporal dependence, mainly because of insufficient local spatial semantic modeling, limited multi-scale dynamic perception, and a lack of explicit temporal position awareness. In addition, the traditional temporal convolution used for dimensionality reduction is prone to losing important dynamic information. To address these problems, a multi-scale temporal-enhanced model based on a hypergraph Transformer is proposed. Specifically, a Local-Multi-Scale Enhancement (LME) module is designed. It enhances the perception of local features in key regions such as the limbs through a rectangular context modeling mechanism, and it uses an efficient multi-scale attention mechanism to integrate action patterns at different temporal granularities, improving the model's adaptability to actions with varying rhythms. In addition, a learnable Temporal Positional Encoding (TPE) is introduced into the spatial attention module to inject temporal priors into spatial dependency modeling, so that spatio-temporal coupling is captured more accurately. Furthermore, a temporal compression module, Squeeze and Excitation Downsampling (SEDS), based on the Haar wavelet transform and a channel attention mechanism, replaces the dimensionality reduction performed by traditional temporal convolution, reducing computation while preserving key dynamic information.
Experimental results on three public datasets, NTU RGB+D 60, NTU RGB+D 120, and Northwestern-UCLA, show that the proposed model outperforms many mainstream methods in recognition accuracy, especially in scenes with complex backgrounds, fine-grained actions, and large-scale data.
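To make the SEDS idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a single-level orthonormal Haar split along the time axis, concatenation of the low- and high-frequency sub-bands as channels, and a standard squeeze-and-excitation gate. The function names (`haar_split`, `se_gate`, `seds_sketch`) and the weight matrices `w1` and `w2` are hypothetical; the abstract does not specify how the sub-bands are fused or what reduction ratio is used.

```python
import math


def haar_split(x):
    """Single-level orthonormal Haar transform along time.

    x: list of C channels, each a list of T samples (T must be even).
    Returns (low, high), each of shape C x T/2. `low` holds scaled
    pairwise sums (coarse motion), `high` holds scaled pairwise
    differences (frame-to-frame detail).
    """
    s = 1.0 / math.sqrt(2.0)
    low = [[s * (ch[2 * i] + ch[2 * i + 1]) for i in range(len(ch) // 2)] for ch in x]
    high = [[s * (ch[2 * i] - ch[2 * i + 1]) for i in range(len(ch) // 2)] for ch in x]
    return low, high


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def se_gate(feats, w1, w2):
    """Squeeze-and-excitation channel gating.

    feats: K x T' features. w1: H x K and w2: K x H are hypothetical
    weights standing in for the two learned fully connected layers.
    """
    # Squeeze: global average over time gives one scalar per channel.
    squeezed = [sum(ch) / len(ch) for ch in feats]
    # Excitation: bottleneck with ReLU, then sigmoid gates in (0, 1).
    hidden = [max(0.0, sum(w * z for w, z in zip(row, squeezed))) for row in w1]
    gates = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w2]
    # Rescale every channel by its gate.
    return [[g * v for v in ch] for g, ch in zip(gates, feats)]


def seds_sketch(x, w1, w2):
    """Halve the temporal length, keep both sub-bands, reweight channels."""
    low, high = haar_split(x)
    stacked = low + high  # assumption: sub-bands are concatenated as 2C channels
    return se_gate(stacked, w1, w2)


# Toy input: 2 channels, 4 frames. Untrained (zero) weights give gates of 0.5.
x = [[1.0, 3.0, 5.0, 7.0], [2.0, 2.0, 4.0, 4.0]]
w1 = [[0.0] * 4 for _ in range(2)]
w2 = [[0.0] * 2 for _ in range(4)]
out = seds_sketch(x, w1, w2)
print(len(out), "channels x", len(out[0]), "frames")
```

Because the Haar split is orthonormal, the two sub-bands together carry the same signal energy as the input, so the detail that a strided convolution could discard stays available for the channel gate to weight.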

Key words: skeleton-based action recognition, temporal positional encoding, multi-scale features, rectangular context modeling, local features

CLC Number: TP391.41

CHEN Qingshuan, CHEN Enqing, GUO Xin, WANG Song. Multiscale temporal enhanced action recognition method based on hypergraph Transformer[J]. Journal of Graphics, 2026, 47(2): 311-321.

References

[1]	孙满贞, 张鹏, 苏本跃. 基于骨骼数据特征的人体行为识别方法综述[J]. 软件导刊, 2022, 21(4): 233-239.
	SUN M Z, ZHANG P, SU B Y. Survey of human action recognition methods based on skeleton data features[J]. Software Guide, 2022, 21(4): 233-239 (in Chinese).
[2]	黄倩, 崔静雯, 李畅. 基于骨骼的人体行为识别方法研究综述[J]. 计算机辅助设计与图形学学报, 2024, 36(2): 173-194.
	HUANG Q, CUI J W, LI C. A review of skeleton-based human action recognition[J]. Journal of Computer-Aided Design & Computer Graphics, 2024, 36(2): 173-194 (in Chinese).
[3]	卢健, 李萱峰, 赵博, 等. 骨骼信息的人体行为识别综述[J]. 中国图像图形学报, 2023, 28(12): 3651-3669.
	LU J, LI X F, ZHAO B, et al. A review of skeleton-based human action recognition[J]. Journal of Image and Graphics, 2023, 28(12): 3651-3669 (in Chinese).
[4]	KARPATHY A, TODERICI G, SHETTY S, et al. Large-scale video classification with convolutional neural networks[C]// 2014 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2014: 1725-1732.
[5]	蒋圣南, 陈恩庆, 郑铭耀, 等. 基于ResNeXt的人体动作识别[J]. 图学学报, 2020, 41(2): 277-282.
	JIANG S N, CHEN E Q, ZHENG M Y, et al. Human action recognition based on ResNeXt[J]. Journal of Graphics, 2020, 41(2): 277-282 (in Chinese).
[6]	YAN S J, XIONG Y J, LIN D H. Spatial temporal graph convolutional networks for skeleton-based action recognition[EB/OL]. [2025-06-22]. https://ojs.aaai.org/index.php/aaai/article/view/12328.
[7]	SHI L, ZHANG Y F, CHENG J, et al. Two-stream adaptive graph convolutional networks for skeleton-based action recognition[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 12026-12035.
[8]	TIAN X Y, JIN Y, ZHANG Z, et al. STGA-Net: spatial-temporal graph attention network for skeleton-based temporal action segmentation[C]// 2023 IEEE International Conference on Multimedia and Expo Workshops. New York: IEEE Press, 2023: 218-223.
[9]	SHI M, TANG Y F, ZHU X Q, et al. Multi-class imbalanced graph convolutional network learning[EB/OL]. [2025-06-22]. https://dl.acm.org/doi/10.5555/3491440.3491838.
[10]	PLIZZARI C, CANNICI M, MATTEUCCI M. Skeleton-based action recognition via spatial and temporal transformer networks[J]. Computer Vision and Image Understanding, 2021, 208-209: 103219.
[11]	王可心. 基于 Transformer 的双人交互行为识别及数据增强方法[D]. 西安: 西安电子科技大学, 2023.
	WANG K X. Two-person interaction behavior recognition and data augmentation method based on Transformer[D]. Xi’an: Xidian University, 2023 (in Chinese).
[12]	ZHOU Y X, CHENG Z Q, LI C, et al. Hypergraph transformer for skeleton-based action recognition[EB/OL]. [2025-06-22]. https://arxiv.org/abs/2211.09590.
[13]	SHAHROUDY A, LIU J, NG T T, et al. NTU RGB+D: a large scale dataset for 3D human activity analysis[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 1010-1019.
[14]	LIU J, SHAHROUDY A, PEREZ M, et al. NTU RGB+D 120: a large-scale benchmark for 3D human activity understanding[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 42(10): 2684-2701.
[15]	WANG J, NIE X H, XIA Y, et al. Cross-view action modeling, learning, and recognition[C]// 2014 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2014: 2649-2656.
[16]	CHEN Y F, YOU Z, ZHANG S H, et al. Core context aware transformers for long context language modeling[EB/OL]. [2025-06-22]. https://icml.cc/virtual/2025/poster/45555.
[17]	ZHANG P F, LAN C L, XING J L, et al. View adaptive recurrent neural networks for high performance human action recognition from skeleton data[C]// 2017 IEEE International Conference on Computer Vision. New York: IEEE Press, 2017: 2136-2145.
[18]	SI C Y, CHEN W T, WANG W, et al. An attention enhanced graph convolutional LSTM network for skeleton-based action recognition[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 1227-1236.
[19]	ZHANG P F, LAN C L, XING J L, et al. View adaptive neural networks for high performance skeleton-based human action recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 41(8): 1963-1978.
[20]	XU K L, YE F F, ZHONG Q Y, et al. Topology-aware convolutional neural network for efficient skeleton-based action recognition[EB/OL]. [2025-06-22]. https://ojs.aaai.org/index.php/AAAI/article/view/20191.
[21]	CHENG K, ZHANG Y F, HE X Y, et al. Skeleton-based action recognition with shift graph convolutional network[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 180-189.
[22]	LEE J, LEE M, LEE D, et al. Hierarchically decomposed graph convolutional networks for skeleton-based action recognition[C]// 2023 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2023: 10410-10419.
[23]	ZHU Q L, DENG H M. Spatial adaptive graph convolutional network for skeleton-based action recognition[J]. Applied Intelligence, 2023, 53(14): 17796-17808.
[24]	ZHOU H Y, LIU Q J, WANG Y H. Learning discriminative representations for skeleton based action recognition[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 10608-10617.
[25]	WANG X H, XU X, MU Y D. Neural Koopman pooling: control-inspired temporal dynamics encoding for skeleton-based action recognition[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 10597-10607.
[26]	XIANG W M, LI C, ZHOU Y X, et al. Generative action description prompts for skeleton-based action recognition[C]// 2023 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2023: 10242-10251.
[27]	ZHOU Y X, YAN X D, CHENG Z Q, et al. BlockGCN: redefine topology awareness for skeleton-based action recognition[C]// 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2024: 2049-2058.
[28]	CHEN D, CHEN M D, WU P S, et al. Two-stream spatio-temporal GCN-transformer networks for skeleton-based action recognition[J]. Scientific Reports, 2025, 15(1): 4982.
[29]	LEE I, KIM D, KANG S, et al. Ensemble deep learning for skeleton-based action recognition using temporal sliding LSTM networks[C]// 2017 IEEE International Conference on Computer Vision. New York: IEEE Press, 2017: 1012-1020.

Type	Method	Params/M	NTU RGB+D 60 X-Sub/%	NTU RGB+D 60 X-View/%	NTU RGB+D 120 X-Sub/%	NTU RGB+D 120 X-Set/%
RNN	VA-LSTM^[17]	—	79.4	87.6	—	—
RNN	AGC-LSTM^[18]	22.90	89.2	95.0	—	—
CNN	VA-CNN^[19]	24.90	88.7	94.3	—	—
CNN	Ta-CNN+^[20]	1.06	90.7	95.1	85.7	87.3
GCN	Shift-GCN(4-ensemble)^[21]	2.80	90.7	96.5	85.9	87.6
GCN	HD-GCN^[22]	—	93.0	97.0	89.8	91.2
GCN	SARGCN^[23]	1.09	88.9	94.8	83.8	85.1
GCN	FR-Head^[24]	1.45	92.8	96.8	89.5	90.9
GCN	Koopman^[25]	5.38	92.9	96.8	90.0	91.3
GCN	LST^[26]	2.10	92.9	97.0	89.9	91.1
GCN	BlockGCN^[27]	1.30	93.1	97.0	90.3	91.5
Transformer	ST-TR^[10]	12.10	89.9	96.1	82.7	84.7
Transformer	SA-TDGFormer^[28]	—	92.7	96.8	86.8	88.9
Transformer	Hyperformer(joint)	2.72	90.5	94.8	86.4	88.0
Transformer	Hyperformer^[12]	2.72	92.7	96.2	89.7	91.0
Transformer	LTPEformer(joint)	2.88	91.4	95.8	87.2	88.4
Transformer	LTPEformer	2.88	93.3	97.0	90.2	91.5

Type	Method	Accuracy/%
RNN	TS-LSTM^[29]	89.2
RNN	2s-AGC-LSTM^[18]	93.3
CNN	VA-CNN^[19]	90.7
CNN	Ta-CNN^[20]	96.1
GCN	4s-shift-GCN^[21]	94.6
GCN	BlockGCN^[27]	96.9
Transformer	Hyperformer^[12]	96.5
Transformer	LTPEformer	97.0

Model	Method	Params/M	X-Sub/%
1	Baseline	2.72	90.5
2	+1LME	2.75	91.0
3	+2LME	2.75	90.8
4	+3LME	2.75	90.6