Journal of Graphics ›› 2026, Vol. 47 ›› Issue (2): 311-321. DOI: 10.11996/JG.j.2095-302X.2026020311
CHEN Qingshuan, CHEN Enqing, GUO Xin, WANG Song
Received: 2025-08-22
Accepted: 2025-11-21
Published: 2026-04-30
Online: 2026-05-20
Contact: CHEN Enqing, E-mail: ceq2003@163.com
Abstract:
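Skeleton-based human action recognition has attracted wide attention for its robustness to background interference and its structured representation. In recent years, the Transformer architecture has been widely applied to this task owing to its strong modeling capacity. However, existing methods still struggle to recognize actions that involve local detail variations, complex temporal dynamics, or strong temporal dependencies, mainly because of insufficient local spatial semantic modeling, limited multi-scale dynamic perception, and the lack of explicit temporal position awareness. In addition, conventional Transformer methods reduce the temporal dimension with traditional temporal convolution, which easily loses important dynamic information. To overcome these problems, a multi-scale temporal enhanced model based on a hypergraph Transformer is proposed. First, a local multi-scale enhancement module (LME) is designed: a rectangular context modeling mechanism strengthens local feature perception of key regions such as the limbs, and an efficient multi-scale attention mechanism fuses action patterns at different temporal granularities, improving the model's adaptability to actions with multiple rhythms. Meanwhile, a learnable temporal position encoding (TPE) is introduced into the spatial attention module to inject a temporal prior into spatial dependency modeling, so that spatio-temporal coupling is captured more accurately. Furthermore, a temporal compression module (SEDS) based on the Haar wavelet transform and a channel attention mechanism replaces the traditional temporal convolution for downsampling, reducing computation while retaining key dynamic information. Experiments on three public datasets, NTU RGB+D 60, NTU RGB+D 120, and Northwestern-UCLA, show that the model outperforms a variety of mainstream methods in recognition accuracy, and shows stronger robustness and accuracy particularly in scenarios with complex backgrounds, fine-grained actions, and large-scale data.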
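The abstract names the ingredients of the temporal compression module (a Haar wavelet transform followed by channel attention) but not its exact layout. As a minimal sketch under stated assumptions, the NumPy code below halves the frame count with a single-level orthonormal Haar transform along time, stacks the low- and high-frequency sub-bands along the channel axis, and reweights channels with a squeeze-and-excitation style gate. The function names, the (C, T, V) tensor layout, the sub-band stacking, the gate form, and the random weights `w1` and `w2` are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def haar_downsample_time(x):
    """Single-level orthonormal Haar transform along the time axis.

    x: array of shape (C, T, V) with even T (channels, frames, joints).
    Returns (low, high), each of shape (C, T // 2, V).
    """
    even, odd = x[:, 0::2, :], x[:, 1::2, :]
    low = (even + odd) / np.sqrt(2.0)   # approximation band: smoothed motion
    high = (even - odd) / np.sqrt(2.0)  # detail band: frame-to-frame change
    return low, high


def channel_attention(feat, w1, w2):
    """Squeeze-and-excitation style channel gate (assumed form)."""
    squeezed = feat.mean(axis=(1, 2))            # (C,) global average pool
    hidden = np.maximum(w1 @ squeezed, 0.0)      # ReLU bottleneck
    gate = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))  # sigmoid weights in (0, 1)
    return feat * gate[:, None, None]


def compress_time(x, w1, w2):
    """Halve the frame count while keeping both Haar sub-bands."""
    low, high = haar_downsample_time(x)
    stacked = np.concatenate([low, high], axis=0)  # (2C, T // 2, V)
    return channel_attention(stacked, w1, w2)


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 25))      # 4 channels, 8 frames, 25 joints
w1 = rng.standard_normal((2, 8)) * 0.1   # hypothetical bottleneck weights
w2 = rng.standard_normal((8, 2)) * 0.1   # hypothetical expansion weights
out = compress_time(x, w1, w2)
low, high = haar_downsample_time(x)
print(out.shape)
```

Because the Haar step in this sketch is orthonormal, the two sub-bands together carry the same energy as the input and the original frames can be rebuilt from them, so the halving itself discards nothing before the gate reweights channels. A strided temporal convolution gives no such guarantee.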
CHEN Qingshuan, CHEN Enqing, GUO Xin, WANG Song. Multiscale temporal enhanced action recognition method based on hypergraph Transformer[J]. Journal of Graphics, 2026, 47(2): 311-321.
| Type | Method | Params/M | NTU RGB+D 60 X-Sub/% | NTU RGB+D 60 X-View/% | NTU RGB+D 120 X-Sub/% | NTU RGB+D 120 X-Set/% |
|---|---|---|---|---|---|---|
| RNN | VA-LSTM | — | 79.4 | 87.6 | — | — |
| RNN | AGC-LSTM | 22.90 | 89.2 | 95.0 | — | — |
| CNN | VA-CNN | 24.90 | 88.7 | 94.3 | — | — |
| CNN | Ta-CNN+ | 1.06 | 90.7 | 95.1 | 85.7 | 87.3 |
| GCN | Shift-GCN(4-ensemble) | 2.80 | 90.7 | 96.5 | 85.9 | 87.6 |
| GCN | HD-GCN | — | 93.0 | 97.0 | 89.8 | 91.2 |
| GCN | SARGCN | 1.09 | 88.9 | 94.8 | 83.8 | 85.1 |
| GCN | FR-Head | 1.45 | 92.8 | 96.8 | 89.5 | 90.9 |
| GCN | Koopman | 5.38 | 92.9 | 96.8 | 90.0 | 91.3 |
| GCN | LST | 2.10 | 92.9 | 97.0 | 89.9 | 91.1 |
| GCN | BlockGCN | 1.30 | 93.1 | 97.0 | 90.3 | 91.5 |
| Transformer | ST-TR | 12.10 | 89.9 | 96.1 | 82.7 | 84.7 |
| Transformer | SA-TDGFormer | — | 92.7 | 96.8 | 86.8 | 88.9 |
| Transformer | Hyperformer(joint) | 2.72 | 90.5 | 94.8 | 86.4 | 88.0 |
| Transformer | Hyperformer | 2.72 | 92.7 | 96.2 | 89.7 | 91.0 |
| Transformer | LTPEformer(joint) | 2.88 | 91.4 | 95.8 | 87.2 | 88.4 |
| Transformer | LTPEformer | 2.88 | 93.3 | 97.0 | 90.2 | 91.5 |
Table 1 Comparison of the accuracy of different models on the NTU RGB+D datasets
| Type | Method | Accuracy/% |
|---|---|---|
| RNN | TS-LSTM | 89.2 |
| RNN | 2s-AGC-LSTM | 93.3 |
| CNN | VA-CNN | 90.7 |
| CNN | Ta-CNN | 96.1 |
| GCN | 4s-shift-GCN | 94.6 |
| GCN | BlockGCN | 96.9 |
| Transformer | Hyperformer | 96.5 |
| Transformer | LTPEformer | 97.0 |
Table 2 Comparison of the accuracy of different models on the Northwestern-UCLA dataset
| Model | Method | Params/M | X-Sub/% |
|---|---|---|---|
| 1 | Baseline | 2.72 | 90.5 |
| 2 | +1LME | 2.75 | 91.0 |
| 3 | +2LME | 2.75 | 90.8 |
| 4 | +3LME | 2.75 | 90.6 |
Table 3 Effectiveness of LME
| Method | RCM | EMA | Params/M | X-Sub/% |
|---|---|---|---|---|
| Baseline | × | × | 2.72 | 90.5 |
| +RCM | √ | × | 2.75 | 90.8 |
| +EMA | × | √ | 2.73 | 90.8 |
| +LME | √ | √ | 2.75 | 91.0 |
Table 4 Effectiveness analysis of LME internal components
| Number of groups | X-Sub/% |
|---|---|
| 2 | 90.6 |
| 4 | 90.8 |
| 8 | 91.0 |
| 16 | 90.9 |
| 32 | 90.4 |
Table 5 Effect of different numbers of EMA groups
| Method | Params/M | X-Sub/% |
|---|---|---|
| Baseline | 2.72 | 90.5 |
| +LME | 2.75 | 91.0 |
| +LME+TPE | 2.88 | 91.2 |
Table 6 Effectiveness of TPE
| Step length (Tmax) | Params/M | X-Sub/% |
|---|---|---|
| 16 | 2.78 | 91.23 |
| 32 | 2.81 | 91.23 |
| 48 | 2.85 | 91.20 |
| 64 | 2.88 | 91.25 |
| 80 | 2.92 | 91.23 |
| 96 | 2.95 | 91.23 |
Table 7 Effect of time length Tmax
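Table 7 varies Tmax for the learnable temporal position encoding. One common way to realize such an encoding, sketched here as an assumption rather than the authors' design, is a trainable table with one vector per frame index that is added to every joint of that frame. Under that assumption the parameter count grows linearly with Tmax, which is at least consistent with the steadily rising parameter column. The class name, the (T, V, C) layout, and the initialization scale are hypothetical.

```python
import numpy as np


class TemporalPositionEncoding:
    """Learnable per-frame offsets, shared by all joints (illustrative only)."""

    def __init__(self, t_max, channels, seed=0):
        rng = np.random.default_rng(seed)
        # In a real model this table would be a trainable parameter.
        self.table = rng.standard_normal((t_max, channels)) * 0.02

    def __call__(self, x):
        # x: (T, V, C) = frames, joints, channels
        t = x.shape[0]
        if t > self.table.shape[0]:
            raise ValueError("sequence is longer than t_max")
        # Broadcast the frame-t vector over the joint axis.
        return x + self.table[:t, None, :]

    @property
    def num_params(self):
        return self.table.size


tpe = TemporalPositionEncoding(t_max=64, channels=16)
x = np.zeros((32, 25, 16))   # 32 frames, 25 joints, 16 channels
y = tpe(x)
print(y.shape, tpe.num_params)
```

With a zero input, every joint in a given frame receives the same offset, so the spatial attention that follows can tell frames apart without the encoding favoring any joint.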
| Method | RCM | EMA | TPE | SEDS | Params/M | FLOPs/G | X-Sub/% |
|---|---|---|---|---|---|---|---|
| Baseline | × | × | × | × | 2.72 | 14.39 | 90.5 |
| RCM | √ | × | × | × | 2.75 | 14.40 | 90.8 |
| EMA | × | √ | × | × | 2.73 | 14.40 | 90.8 |
| LME | √ | √ | × | × | 2.75 | 14.40 | 91.0 |
| LME+TPE | √ | √ | √ | × | 2.88 | 14.40 | 91.2 |
| LME+TPE+SEDS | √ | √ | √ | √ | 2.95 | 13.61 | 91.4 |
Table 8 Module ablation experiment and computational complexity analysis
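The trade-off in Table 8 is easier to read as relative changes. The short calculation below uses only the baseline and full-model rows of the table. The helper name `rel_change` and the dictionary layout are illustrative choices.

```python
# (Params/M, FLOPs/G, X-Sub/%) copied from Table 8
rows = {
    "baseline": (2.72, 14.39, 90.5),
    "LME+TPE": (2.88, 14.40, 91.2),
    "LME+TPE+SEDS": (2.95, 13.61, 91.4),
}


def rel_change(new, old):
    """Relative change of new versus old, in percent."""
    return 100.0 * (new - old) / old


base = rows["baseline"]
full = rows["LME+TPE+SEDS"]
params_pct = rel_change(full[0], base[0])
flops_pct = rel_change(full[1], base[1])
acc_gain = full[2] - base[2]
print(f"{params_pct:+.1f}% params, {flops_pct:+.1f}% FLOPs, {acc_gain:+.1f} points")
# prints: +8.5% params, -5.4% FLOPs, +0.9 points
```

Relative to the baseline, the full model has about 8.5% more parameters and about 5.4% fewer FLOPs, and gains 0.9 percentage points on X-Sub. The FLOPs reduction appears only in the row that adds SEDS.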
[1] SUN M Z, ZHANG P, SU B Y. Survey of human action recognition methods based on skeleton data features[J]. Software Guide, 2022, 21(4): 233-239 (in Chinese).
[2] HUANG Q, CUI J W, LI C. A review of skeleton-based human action recognition[J]. Journal of Computer-Aided Design & Computer Graphics, 2024, 36(2): 173-194 (in Chinese).
[3] LU J, LI X F, ZHAO B, et al. A review of skeleton-based human action recognition[J]. Journal of Image and Graphics, 2023, 28(12): 3651-3669 (in Chinese).
[4] KARPATHY A, TODERICI G, SHETTY S, et al. Large-scale video classification with convolutional neural networks[C]// 2014 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2014: 1725-1732.
[5] JIANG S N, CHEN E Q, ZHENG M Y, et al. Human action recognition based on ResNeXt[J]. Journal of Graphics, 2020, 41(2): 277-282 (in Chinese).
[6] YAN S J, XIONG Y J, LIN D H. Spatial temporal graph convolutional networks for skeleton-based action recognition[EB/OL]. [2025-06-22]. https://ojs.aaai.org/index.php/aaai/article/view/12328.
[7] SHI L, ZHANG Y F, CHENG J, et al. Two-stream adaptive graph convolutional networks for skeleton-based action recognition[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 12026-12035.
[8] TIAN X Y, JIN Y, ZHANG Z, et al. STGA-Net: spatial-temporal graph attention network for skeleton-based temporal action segmentation[C]// 2023 IEEE International Conference on Multimedia and Expo Workshops. New York: IEEE Press, 2023: 218-223.
[9] SHI M, TANG Y F, ZHU X Q, et al. Multi-class imbalanced graph convolutional network learning[EB/OL]. [2025-06-22]. https://dl.acm.org/doi/10.5555/3491440.3491838.
[10] PLIZZARI C, CANNICI M, MATTEUCCI M. Skeleton-based action recognition via spatial and temporal transformer networks[J]. Computer Vision and Image Understanding, 2021, 208-209: 103219.
[11] WANG K X. Two-person interaction behavior recognition and data augmentation method based on Transformer[D]. Xi’an: Xidian University, 2023 (in Chinese).
[12] ZHOU Y X, CHENG Z Q, LI C, et al. Hypergraph transformer for skeleton-based action recognition[EB/OL]. [2025-06-22]. https://arxiv.org/abs/2211.09590.
[13] SHAHROUDY A, LIU J, NG T T, et al. NTU RGB+D: a large scale dataset for 3D human activity analysis[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 1010-1019.
[14] LIU J, SHAHROUDY A, PEREZ M, et al. NTU RGB+D 120: a large-scale benchmark for 3D human activity understanding[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 42(10): 2684-2701.
[15] WANG J, NIE X H, XIA Y, et al. Cross-view action modeling, learning, and recognition[C]// 2014 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2014: 2649-2656.
[16] CHEN Y F, YOU Z, ZHANG S H, et al. Core context aware transformers for long context language modeling[EB/OL]. [2025-06-22]. https://icml.cc/virtual/2025/poster/45555.
[17] ZHANG P F, LAN C L, XING J L, et al. View adaptive recurrent neural networks for high performance human action recognition from skeleton data[C]// 2017 IEEE International Conference on Computer Vision. New York: IEEE Press, 2017: 2136-2145.
[18] SI C Y, CHEN W T, WANG W, et al. An attention enhanced graph convolutional LSTM network for skeleton-based action recognition[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 1227-1236.
[19] ZHANG P F, LAN C L, XING J L, et al. View adaptive neural networks for high performance skeleton-based human action recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 41(8): 1963-1978.
[20] XU K L, YE F F, ZHONG Q Y, et al. Topology-aware convolutional neural network for efficient skeleton-based action recognition[EB/OL]. [2025-06-22]. https://ojs.aaai.org/index.php/AAAI/article/view/20191.
[21] CHENG K, ZHANG Y F, HE X Y, et al. Skeleton-based action recognition with shift graph convolutional network[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 180-189.
[22] LEE J, LEE M, LEE D, et al. Hierarchically decomposed graph convolutional networks for skeleton-based action recognition[C]// 2023 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2023: 10410-10419.
[23] ZHU Q L, DENG H M. Spatial adaptive graph convolutional network for skeleton-based action recognition[J]. Applied Intelligence, 2023, 53(14): 17796-17808.
[24] ZHOU H Y, LIU Q J, WANG Y H. Learning discriminative representations for skeleton based action recognition[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 10608-10617.
[25] WANG X H, XU X, MU Y D. Neural Koopman pooling: control-inspired temporal dynamics encoding for skeleton-based action recognition[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 10597-10607.
[26] XIANG W M, LI C, ZHOU Y X, et al. Generative action description prompts for skeleton-based action recognition[C]// 2023 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2023: 10242-10251.
[27] ZHOU Y X, YAN X D, CHENG Z Q, et al. BlockGCN: redefine topology awareness for skeleton-based action recognition[C]// 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2024: 2049-2058.
[28] CHEN D, CHEN M D, WU P S, et al. Two-stream spatio-temporal GCN-transformer networks for skeleton-based action recognition[J]. Scientific Reports, 2025, 15(1): 4982.
[29] LEE I, KIM D, KANG S, et al. Ensemble deep learning for skeleton-based action recognition using temporal sliding LSTM networks[C]// 2017 IEEE International Conference on Computer Vision. New York: IEEE Press, 2017: 1012-1020.