
Journal of Graphics ›› 2025, Vol. 46 ›› Issue (3): 625-634. DOI: 10.11996/JG.j.2095-302X.2025030625

• Computer Graphics and Virtual Reality •

3D human mesh reconstruction based on dual-stream network fusion

YU Bing1,2, CHENG Guang1,2, HUANG Dongjin1,2, DING Youdong1,2

  1. Shanghai Film Academy, Shanghai University, Shanghai 200072, China
  2. Shanghai Film Special Effects Engineering Technology Research Center, Shanghai 200072, China
  • Received: 2024-08-23 Accepted: 2024-12-24 Online: 2025-06-30 Published: 2025-06-13
  • About author:

    YU Bing (1989-), lecturer, Ph.D. His main research interests include image processing and deep learning. E-mail: yubing@shu.edu.cn

  • Supported by:
    Shanghai Talent Development Funding Program (2021016)

Abstract:

The reconstruction of 3D human body meshes holds significant application value in fields such as computer vision, animation production, and virtual reality. However, most existing methods focus on 3D human body reconstruction from single images, and accurately and smoothly reconstructing 3D human motion from video data remains a challenging problem. To address this issue, a dual-stream network fusion architecture was proposed that utilized the 3D human pose as an intermediary to achieve 3D human body mesh reconstruction from video data. Specifically, the proposed method comprised three components. First, a 3D pose estimation stream network estimated 3D joint points from the input video, providing precise joint information. Second, a temporal feature aggregation stream network extracted temporal image features from the video, capturing spatial motion and temporal pose characteristics. Finally, a fusion decoder was designed to regress the 3D mesh vertex coordinates by integrating the 3D joint points, the temporal image features, and the mesh structure provided by the SMPL template. Experimental results demonstrated that the proposed method achieved superior prediction accuracy compared to MPS-Net: on the 3DPW dataset, the mean per joint position error (MPJPE) was reduced by 9.3%, and on the MPI-INF-3DHP dataset, the MPJPE was reduced by 9.2%. Moreover, the reconstructed meshes were more visually plausible, exhibiting higher accuracy and smoothness.
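To illustrate the kind of dual-stream fusion the abstract describes, the following is a minimal PyTorch sketch of how a 3D pose stream, a temporal feature aggregation stream, and a fusion decoder regressing SMPL vertices might be wired together. The module designs, feature dimensions, and joint count are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of a dual-stream fusion pipeline for video-based 3D human
# mesh recovery. Dimensions, joint count, and module internals are
# assumptions for illustration only; only the SMPL vertex count (6890) is
# a property of the SMPL template itself.

import torch
import torch.nn as nn

NUM_JOINTS = 17        # assumed 3D skeleton size
IMG_FEAT_DIM = 2048    # assumed per-frame image feature size (e.g., a ResNet backbone)
NUM_VERTICES = 6890    # SMPL template vertex count


class PoseStream(nn.Module):
    """Stand-in 3D pose estimation stream: maps a window of 2D joints to 3D joints."""
    def __init__(self, seq_len: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(seq_len * NUM_JOINTS * 2, 1024),
            nn.ReLU(),
            nn.Linear(1024, NUM_JOINTS * 3),
        )

    def forward(self, joints_2d: torch.Tensor) -> torch.Tensor:
        # joints_2d: (B, T, J, 2) -> 3D joints for the centre frame: (B, J, 3)
        b = joints_2d.shape[0]
        return self.net(joints_2d.flatten(1)).view(b, NUM_JOINTS, 3)


class TemporalFeatureStream(nn.Module):
    """Stand-in temporal aggregation stream: attends over per-frame image features."""
    def __init__(self, d_model: int = 512):
        super().__init__()
        self.proj = nn.Linear(IMG_FEAT_DIM, d_model)
        self.attn = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, T, IMG_FEAT_DIM) -> aggregated feature of the centre frame
        x = self.attn(self.proj(frame_feats))
        return x[:, x.shape[1] // 2]


class FusionDecoder(nn.Module):
    """Regresses SMPL vertex coordinates from the fused joint and temporal features."""
    def __init__(self, d_model: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(NUM_JOINTS * 3 + d_model, 1024),
            nn.ReLU(),
            nn.Linear(1024, NUM_VERTICES * 3),
        )

    def forward(self, joints_3d: torch.Tensor, temporal_feat: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([joints_3d.flatten(1), temporal_feat], dim=1)
        return self.mlp(fused).view(-1, NUM_VERTICES, 3)


if __name__ == "__main__":
    B, T = 2, 16
    pose_stream = PoseStream(seq_len=T)
    feat_stream = TemporalFeatureStream()
    decoder = FusionDecoder()

    joints_3d = pose_stream(torch.randn(B, T, NUM_JOINTS, 2))
    temporal_feat = feat_stream(torch.randn(B, T, IMG_FEAT_DIM))
    vertices = decoder(joints_3d, temporal_feat)
    print(vertices.shape)  # torch.Size([2, 6890, 3])
```

In this sketch the fusion decoder directly regresses vertex coordinates from the concatenated stream outputs; the actual method additionally conditions the regression on the mesh structure of the SMPL template, which is omitted here for brevity.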

Key words: 3D human reconstruction, SMPL model, attention mechanisms, dual-stream network architecture, spatio-temporal information association
