
Journal of Graphics, 2025, Vol. 46, Issue (3): 625-634. DOI: 10.11996/JG.j.2095-302X.2025030625

• Computer Graphics and Virtual Reality •

3D human mesh reconstruction based on dual-stream network fusion

YU Bing1,2, CHENG Guang1,2, HUANG Dongjin1,2, DING Youdong1,2

  1. Shanghai Film Academy, Shanghai University, Shanghai 200072, China
  2. Shanghai Film Special Effects Engineering Technology Research Center, Shanghai 200072, China
  • Received: 2024-08-23 Accepted: 2024-12-24 Published: 2025-06-30 Online: 2025-06-13
  • First author: YU Bing (1989-), male, lecturer, Ph.D. His main research interests are image processing and deep learning. E-mail: yubing@shu.edu.cn
  • Supported by:
    Shanghai Talent Development Funding Program (2021016)

Abstract:

3D human mesh reconstruction has significant application value in computer vision, animation production, and virtual reality. However, most existing methods focus on reconstruction from a single image, and accurately and smoothly reconstructing 3D human motion from video remains a challenging problem. To address this issue, a dual-stream network fusion architecture was proposed that used 3D human pose as an intermediary to reconstruct 3D human meshes from video. The method comprised three components. First, a 3D pose estimation stream network estimated 3D joint points from the input video, providing precise joint information. Second, a temporal feature aggregation stream network extracted temporal image features from the video, capturing human motion position information and temporal pose features. Finally, a fusion decoder regressed the 3D mesh vertex coordinates by integrating the 3D joint points, the temporal image features, and the mesh structure provided by the SMPL template. Experimental results demonstrated that the proposed method achieved higher prediction accuracy than MPS-Net: its mean per joint position error (MPJPE) was 9.3% lower than that of MPS-Net on the 3DPW dataset and 9.2% lower on the MPI-INF-3DHP dataset. The reconstructions were also more visually plausible, showing higher accuracy and smoothness.

Key words: 3D human reconstruction, SMPL model, attention mechanisms, dual-stream network architecture, spatio-temporal information association
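
The abstract reports accuracy as MPJPE and as a percentage reduction relative to MPS-Net. The following is a minimal sketch of how such figures are typically computed, not the authors' evaluation code. The function names `mpjpe` and `relative_reduction`, the toy joint coordinates, and the 10 mm baseline are all hypothetical, and the percentages in the abstract are read here as relative reductions. Evaluation protocols usually align the root joint before measuring the error; that step is omitted for brevity.

```python
import math


def mpjpe(pred, gt):
    """Mean per joint position error: the average Euclidean distance
    between predicted and ground-truth 3D joints, in the input units."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and of equal length")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


def relative_reduction(baseline, ours):
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline


# Toy data (hypothetical, in millimetres): three joints of a single frame.
gt = [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0), (0.0, 200.0, 0.0)]
pred = [(3.0, 4.0, 0.0), (100.0, 0.0, 12.0), (0.0, 200.0, 0.0)]

error = mpjpe(pred, gt)  # per-joint distances are 5, 12 and 0
print(f"MPJPE = {error:.3f} mm")
print(f"reduction vs. a 10 mm baseline = {relative_reduction(10.0, error):.1f}%")
```

For a video, the same average would be taken over all joints of all frames, so a sequence can simply be flattened into one list of joints before calling `mpjpe`.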

CLC number: