Journal of Graphics ›› 2025, Vol. 46 ›› Issue (4): 746-755. DOI: 10.11996/JG.j.2095-302X.2025040746

• Image Processing and Computer Vision •

Hierarchical attention spatial-temporal feature fusion algorithm for 3D human pose and shape estimation

YAN Zhuoyue1, LIU Li1,2, FU Xiaodong1,2, LIU Lijun1,2, PENG Wei1,2

  1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming Yunnan 650500, China
    2. Yunnan Key Laboratory of Computer Technologies Application, Kunming University of Science and Technology, Kunming Yunnan 650500, China
  • Received:2024-11-06 Accepted:2025-03-18 Online:2025-08-30 Published:2025-08-11
  • Contact: LIU Li
  • About author:

    YAN Zhuoyue (1998-), master's student. Her main research interest is computer vision. E-mail: yanzhuoyue@stu.kust.edu.cn

  • Supported by:
    National Natural Science Foundation of China (62262036); National Natural Science Foundation of China (62362043); Xingdian Talent Support Project (KKXY202203008)

Abstract:

Monocular-video-based 3D human pose and shape estimation plays an important role in virtual try-on and special-effects production. To address the problems of insufficient human body modeling, overly simple spatial-temporal feature representation, and limited estimation accuracy in 3D human pose and shape estimation from monocular videos, a hierarchical-attention spatial-temporal feature-fusion algorithm was proposed. First, hierarchical attention was applied to model human body parts in hierarchical spatial modeling, yielding learnable human pose spatial features. Second, these learnable spatial features were combined with a parametric human template to guide the spatial-temporal modeling of human motion temporal features, achieving spatial-temporal feature fusion. Finally, a 3D human pose and shape co-optimization method was proposed, and a more accurate and smoother 3D human mesh was regressed by a multilayer perceptron. Experimental results on the Human3.6M dataset showed that MPJPE and ACC-ERR were 56.1 mm and 3.4 mm/s², respectively, reductions of 0.5% and 5.6% compared with the state-of-the-art method, indicating improved accuracy of 3D human pose and shape estimation and the generation of accurate and smooth 3D human meshes. Furthermore, test results on 3DPW and Internet videos confirmed the robustness of the proposed method under fast motion.
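
For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how MPJPE (mean per-joint position error) and an acceleration error are commonly computed for joint sequences. It is not the authors' evaluation code: the function names, the `(frames, joints, 3)` array layout, the finite-difference acceleration, and the `fps` scaling parameter are all assumptions made for illustration, and the abstract does not state the exact protocol (alignment, joint set, or time units) used.

```python
import numpy as np


def mpjpe(pred, gt):
    """Mean Euclidean distance per joint; inputs shaped (frames, joints, 3)."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())


def accel_error(pred, gt, fps=1.0):
    """Mean norm of the difference between predicted and reference joint
    accelerations, using a second-order finite difference over frames.
    fps=1.0 gives per-frame units (assumed convention, not from the paper)."""
    def accel(x):
        return (x[:-2] - 2.0 * x[1:-1] + x[2:]) * fps ** 2
    return float(np.linalg.norm(accel(pred) - accel(gt), axis=-1).mean())


# Toy data (hypothetical): 5 frames, 2 joints, positions in mm.
T, J = 5, 2
t = np.arange(T, dtype=float)
gt = np.zeros((T, J, 3))
gt[:, :, 0] = t[:, None]          # reference moves at constant velocity along x

pred = gt.copy()
pred[:, :, 1] += 3.0              # constant 3 mm offset along y
pred[:, :, 2] += 4.0              # constant 4 mm offset along z

print("MPJPE:", mpjpe(pred, gt))            # position error of the offset
print("ACC-ERR:", accel_error(pred, gt))    # a constant offset adds no jitter
```

In this toy case a constant positional offset raises MPJPE but leaves the acceleration error at zero, which illustrates why the two metrics are reported together: one measures accuracy, the other temporal smoothness.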

Key words: 3D human pose and shape estimation, hierarchical attention, spatial-temporal modeling, spatial-temporal feature fusion, pose and shape co-optimization
