图学学报 ›› 2026, Vol. 47 ›› Issue (2): 311-321. DOI: 10.11996/JG.j.2095-302X.2026020311

• 图像处理与计算机视觉 •

基于超图Transformer的多尺度时序增强动作识别方法

陈庆拴, 陈恩庆, 郭新, 汪松

  1. 郑州大学电气与信息工程学院,河南 郑州 450001
  • 收稿日期:2025-08-22 接受日期:2025-11-21 出版日期:2026-04-30 发布日期:2026-05-20
  • 通讯作者:陈恩庆,E-mail:ceq2003@163.com
  • 基金资助:
    国家自然科学基金(62301497);国家自然科学基金(62101503);河南省科技攻关项目(252102211024);河南省科技项目(242102211017);河南重点研发专项(231111212000)

Multiscale temporal enhanced action recognition method based on hypergraph Transformer

CHEN Qingshuan, CHEN Enqing, GUO Xin, WANG Song

  1. College of Electrical and Information Engineering, Zhengzhou University, Zhengzhou, Henan 450001, China
  • Received:2025-08-22 Accepted:2025-11-21 Published:2026-04-30 Online:2026-05-20
  • Contact: CHEN Enqing, E-mail: ceq2003@163.com
  • Supported by:
    National Natural Science Foundation of China(62301497);National Natural Science Foundation of China(62101503);Science and Technology Research Program of Henan(252102211024);Science and Technology Project of Henan Province(242102211017);The Key Research Program of Henan(231111212000)

摘要:

基于骨骼的人体动作识别因其对背景干扰的鲁棒性和结构化表示而受到广泛关注。近年来,Transformer架构因其强大的建模能力被广泛应用于该任务。然而,现有方法在识别包含局部细节变化、复杂时间动态或强时序依赖的动作时仍面临挑战,主要归因于其局部空间语义建模不足、多尺度动态感知能力有限以及缺乏显式的时间位置感知。此外,现有Transformer方法采用传统时间卷积降维,易导致重要动态信息丢失。为克服上述问题,提出一种基于超图Transformer的多尺度时序增强模型。首先,设计了一个局部多尺度增强模块(LME),通过矩形上下文建模机制增强对四肢等关键区域的局部特征感知,并利用高效多尺度注意力机制融合不同时间粒度的动作模式,从而提升模型对多节奏动作的适应性。同时,在空间注意力模块中引入可学习的时间位置编码(TPE),为空间依赖建模注入时序先验,以更准确地捕捉时空耦合关系。进一步地,采用基于Haar小波变换与通道注意力机制的时间压缩模块(SEDS)替代传统时间卷积降维,在降低计算量的同时保留关键动态信息。在NTU RGB+D 60、NTU RGB+D 120和Northwestern-UCLA 3个公开数据集上的实验结果表明,该模型在识别准确率上优于多种主流方法,尤其在复杂背景、细节动作及大规模数据场景下展现出更强的鲁棒性与准确性。

关键词: 骨骼点动作识别, 时间位置编码, 多尺度特征, 矩形上下文建模, 局部特征

Abstract:

Skeleton-based human action recognition has gained widespread attention due to its robustness to background interference and its structured representation. In recent years, the Transformer architecture has been widely applied to this task owing to its powerful modeling capability. However, existing methods still struggle to recognize actions that involve local detail changes, complex temporal dynamics, or strong temporal dependencies, mainly because of insufficient local spatial semantic modeling, limited multi-scale dynamic perception, and a lack of explicit temporal position awareness. In addition, the conventional temporal convolution that existing Transformer methods use for temporal dimensionality reduction is prone to losing important dynamic information. To overcome these problems, a multi-scale temporal-enhanced model based on a hypergraph Transformer was proposed. First, a Local Multi-scale Enhancement (LME) module was designed, which enhances the perception of local features in key regions such as the limbs through a rectangular context modeling mechanism and uses an efficient multi-scale attention mechanism to fuse action patterns at different temporal granularities, improving the adaptability of the model to actions performed at different rhythms. Meanwhile, a learnable Temporal Positional Encoding (TPE) was introduced into the spatial attention module to inject temporal priors into spatial dependency modeling, so that spatio-temporal coupling relationships are captured more accurately. Furthermore, a temporal compression module, Squeeze and Excitation Downsampling (SEDS), based on the Haar wavelet transform and a channel attention mechanism, was adopted to replace conventional temporal convolution for dimensionality reduction, reducing the computational cost while preserving key dynamic information. Experimental results on three public datasets, NTU RGB+D 60, NTU RGB+D 120, and Northwestern-UCLA, showed that the proposed model outperformed many mainstream methods in recognition accuracy and exhibited stronger robustness and accuracy particularly in scenarios involving complex backgrounds, fine-grained actions, and large-scale data.
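
The idea behind a Haar-wavelet temporal compression step gated by channel attention can be illustrated with a minimal NumPy sketch. It is not the authors' SEDS implementation: the abstract does not say how the wavelet sub-bands are combined, so stacking the low-pass and high-pass bands along the channel axis, the squeeze-and-excitation style gate, and all names below (`haar_temporal_downsample`, `se_gate`, `seds_sketch`, `w1`, `w2`) are assumptions made for illustration only.

```python
import numpy as np


def haar_temporal_downsample(x):
    """One-level orthonormal Haar transform along the time axis.

    x: array of shape (C, T, V) with even T (channels, frames, joints).
    Returns (low, high), each of shape (C, T // 2, V).
    """
    even, odd = x[:, 0::2, :], x[:, 1::2, :]
    low = (even + odd) / np.sqrt(2.0)   # approximation: smoothed motion
    high = (even - odd) / np.sqrt(2.0)  # detail: frame-to-frame change
    return low, high


def se_gate(feat, w1, w2):
    """Squeeze-and-excitation style channel gate (assumed form).

    feat: (C, T, V); w1: (C_r, C); w2: (C, C_r).
    """
    squeezed = feat.mean(axis=(1, 2))              # global average pool -> (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)        # bottleneck + ReLU
    scale = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))   # sigmoid gate in (0, 1)
    return feat * scale[:, None, None]


def seds_sketch(x, w1, w2):
    """Halve the temporal length, keep both sub-bands, then reweight channels."""
    low, high = haar_temporal_downsample(x)
    stacked = np.concatenate([low, high], axis=0)  # (2C, T // 2, V)
    return se_gate(stacked, w1, w2)
```

Unlike a strided convolution, which discards part of the signal, the one-level Haar transform is invertible: the two half-length sub-bands together retain all of the input's energy, and the channel gate only decides how strongly each sub-band channel is weighted. In a trained model `w1` and `w2` would be learned parameters; here they are plain arrays supplied by the caller.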

Key words: skeleton-based action recognition, temporal positional encoding, multi-scale features, rectangular context modeling, local features

中图分类号: