Journal of Graphics, 2026, Vol. 47, Issue (2): 311-321. DOI: 10.11996/JG.j.2095-302X.2026020311

• Image Processing and Computer Vision •

Multiscale temporal enhanced action recognition method based on hypergraph Transformer

CHEN Qingshuan, CHEN Enqing, GUO Xin, WANG Song

  1. College of Electrical and Information Engineering, Zhengzhou University, Zhengzhou Henan 450001, China
  • Received: 2025-08-22 Accepted: 2025-11-21 Online: 2026-04-30 Published: 2026-05-20
  • Contact: CHEN Enqing
  • Supported by:
    National Natural Science Foundation of China (62301497); National Natural Science Foundation of China (62101503); Science and Technology Research Program of Henan (252102211024); Science and Technology Project of Henan Province (242102211017); The Key Research Program of Henan (231111212000)

Abstract:

Skeleton-based human action recognition has attracted widespread attention because of its robustness to background interference and its structured representation. In recent years, the Transformer architecture has been widely applied to this task owing to its powerful modeling capability. However, existing methods still struggle to recognize actions with subtle local changes, complex temporal dynamics, or strong temporal dependence, mainly because of insufficient local spatial semantic modeling, limited multi-scale dynamic perception, and a lack of explicit temporal position awareness. In addition, the traditional temporal convolution used for dimensionality reduction is prone to losing important dynamic information. To address these problems, a multi-scale temporal-enhanced model based on a hypergraph Transformer was proposed. Specifically, a Local-Multi-Scale Enhancement (LME) module was designed to enhance the perception of local features in key areas such as the limbs through a rectangular context modeling mechanism, and an efficient multi-scale attention mechanism was used to integrate action patterns at different temporal granularities, improving the model's adaptability to actions with multiple rhythms. Meanwhile, a learnable Temporal Positional Encoding (TPE) was introduced into the spatial attention module to inject temporal priors into spatial dependency modeling, so that spatio-temporal coupling could be captured more accurately. Furthermore, a temporal compression module, Squeeze and Excitation Downsampling (SEDS), based on the Haar wavelet transform and a channel attention mechanism, was adopted to replace dimensionality reduction by traditional temporal convolution, reducing the computational cost while preserving key dynamic information.
Experimental results on three public datasets, NTU RGB+D 60, NTU RGB+D 120, and Northwestern UCLA, showed that the proposed model outperformed many mainstream methods in recognition accuracy, especially in scenarios involving complex backgrounds, fine-grained actions, and large-scale data.
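
The abstract does not give the internal layout of SEDS, so the following is only a minimal NumPy sketch of the general idea it names: halving the temporal length with a one-level Haar wavelet transform and reweighting the resulting channels with squeeze-and-excitation style channel attention. The function names (haar_split, se_gate, haar_se_downsample), the (C, T, V) tensor layout, the reduction width, the final channel projection, and the random untrained weights are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def haar_split(x):
    """One-level Haar transform along the time axis of a (C, T, V) array.

    Returns the low-pass (average) and high-pass (difference) bands,
    each with T/2 frames.
    """
    if x.shape[1] % 2:
        raise ValueError("number of frames must be even")
    even, odd = x[:, 0::2, :], x[:, 1::2, :]
    low = (even + odd) / np.sqrt(2.0)
    high = (even - odd) / np.sqrt(2.0)
    return low, high


def se_gate(feat, w1, w2):
    """Squeeze-and-excitation gate: one weight in (0, 1) per channel."""
    squeezed = feat.mean(axis=(1, 2))            # squeeze: (2C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)      # ReLU bottleneck: (r,)
    return 1.0 / (1.0 + np.exp(-(w2 @ hidden)))  # sigmoid: (2C,)


def haar_se_downsample(x, w1, w2, w_proj):
    """Hypothetical SEDS-like block: Haar split, channel gating, projection."""
    low, high = haar_split(x)
    feat = np.concatenate([low, high], axis=0)     # (2C, T/2, V)
    gate = se_gate(feat, w1, w2)
    feat = feat * gate[:, None, None]              # reweight channels
    return np.einsum("oc,ctv->otv", w_proj, feat)  # back to (C, T/2, V)


# Toy input: 4 channels, 8 frames, 25 joints (25 joints as in NTU RGB+D).
rng = np.random.default_rng(0)
C, T, V = 4, 8, 25
x = rng.standard_normal((C, T, V))

# Untrained random weights, for shape illustration only (assumption).
w1 = rng.standard_normal((2, 2 * C)) * 0.1
w2 = rng.standard_normal((2 * C, 2)) * 0.1
w_proj = rng.standard_normal((C, 2 * C)) * 0.1

y = haar_se_downsample(x, w1, w2, w_proj)
low, high = haar_split(x)
print(y.shape)  # (4, 4, 25)
```

Because the Haar transform is orthonormal, the low and high bands together retain all the signal energy of the input, which is one way to read the claim that key dynamic information is preserved while the temporal length is halved; a strided temporal convolution offers no such guarantee.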

Key words: skeleton point action recognition, temporal positional coding, multi-scale features, rectangular context modeling, local features
