
Journal of Graphics, 2024, Vol. 45, Issue (4): 791-803. DOI: 10.11996/JG.j.2095-302X.2024040791

• Image Processing and Computer Vision •

Temporal dynamic frame selection and spatio-temporal graph convolution for interpretable skeleton-based action recognition

LIANG Chengwu1,2, YANG Jie1,2, HU Wei1,2, JIANG Songqi1,2, QIAN Qiyang2, HOU Ning2

    1. College of Electrical Engineering and New Energy, China Three Gorges University, Yichang Hubei 443002, China
    2. School of Electrical and Control Engineering, Henan University of Urban Construction, Pingdingshan Henan 467036, China
  • Received: 2023-12-25 Accepted: 2024-04-07 Online: 2024-08-31 Published: 2024-09-03
  • Contact: HOU Ning
  • About the first author:

    LIANG Chengwu (1982-), professor, Ph.D. His main research interests cover artificial intelligence and multimedia. E-mail: liangchengwu0615@126.com

  • Supported by:
    National Natural Science Foundation of China (62176086); National Natural Science Foundation of China (U1804152); Henan Province Science and Technology Project (242102211055)

Abstract:

Skeleton-based action recognition is a prominent research topic in computer vision and machine learning. Existing data-driven neural networks often neglect temporal dynamic frame selection for skeleton sequences and lack understandable decision logic, which limits their interpretability. To this end, we proposed an interpretable skeleton-based action recognition method based on temporal dynamic frame selection and spatio-temporal graph convolution, aiming to improve both interpretability and recognition performance. Firstly, the quality of each skeleton frame was estimated from joint confidence, and low-quality frames were removed to mitigate skeleton noise. Secondly, drawing on domain knowledge of human activity, an adaptive temporal dynamic frame selection module was proposed to compute motion salient regions and capture the dynamic patterns of key skeleton frames in human motion. Finally, to represent the intrinsic topology of human joints, an improved spatio-temporal graph convolutional network was used for interpretable skeleton-based action recognition. Experiments on three large public datasets, NTU RGB+D, NTU RGB+D 120, and FineGym, showed that the method achieved higher recognition accuracy than the compared methods while remaining interpretable.
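
The first two steps of the abstract (confidence-based frame filtering, then motion-based frame selection) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no formulas, so the mean-confidence quality score, the 0.5 threshold, the displacement-based motion energy, and the fixed top-k selection (a simple stand-in for the paper's adaptive module) are all assumptions, as are the function names.

```python
from math import sqrt

# Assumed data layout: a frame is a list of joints, each joint a tuple
# (x, y, confidence). Names and thresholds here are illustrative only.

def frame_quality(frame):
    """Mean joint confidence of one skeleton frame (assumed quality score)."""
    return sum(c for _, _, c in frame) / len(frame)

def drop_low_quality(frames, min_quality=0.5):
    """Keep frames whose mean joint confidence reaches min_quality."""
    return [f for f in frames if frame_quality(f) >= min_quality]

def motion_energy(prev, curr):
    """Sum of per-joint Euclidean displacements between two frames."""
    return sum(
        sqrt((x1 - x0) ** 2 + (y1 - y0) ** 2)
        for (x0, y0, _), (x1, y1, _) in zip(prev, curr)
    )

def select_salient_frames(frames, k):
    """Pick the k frames with the largest motion energy, in temporal order."""
    if len(frames) <= k:
        return list(frames)
    # The first frame has no predecessor, so its energy is set to 0.
    energy = [0.0] + [
        motion_energy(frames[i - 1], frames[i]) for i in range(1, len(frames))
    ]
    top = sorted(range(len(frames)), key=lambda i: energy[i], reverse=True)[:k]
    return [frames[i] for i in sorted(top)]
```

In this sketch the selected frames would then be passed to the graph convolutional network. Filtering before selection keeps noisy joint estimates from being mistaken for large motion.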

Key words: action recognition, skeleton sequence, interpretability, motion salient regions, spatio-temporal graph convolutional network
