3D human mesh reconstruction based on dual-stream network fusion

doi:10.11996/JG.j.2095-302X.2025030625

Abstract:

Three-dimensional (3D) human mesh reconstruction has significant application value in computer vision, animation production, and virtual reality. However, most existing methods focus on reconstruction from a single image, and accurately and smoothly reconstructing 3D human motion from video remains challenging. To address this, a dual-stream network fusion architecture is proposed that uses 3D human pose as an intermediary for reconstructing 3D human meshes from video. First, a 3D pose estimation stream network estimates 3D joint positions from the input video, providing precise joint information. Second, a temporal feature aggregation stream network extracts temporal image features from the video, capturing human motion position information and temporal pose features. Finally, a fusion decoder regresses the 3D mesh vertex coordinates from the 3D joints, the temporal image features, and the mesh structure provided by the SMPL template. Experimental results show that the proposed method predicts more accurately than MPS-Net: its mean per joint position error (MPJPE) is 9.3% lower than that of MPS-Net on the 3DPW dataset and 9.2% lower on the MPI-INF-3DHP dataset. The reconstructions are also more visually plausible, with higher accuracy and smoothness.

Key words: 3D human reconstruction, SMPL model, attention mechanisms, dual-stream network architecture, spatio-temporal information association
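The abstract reports accuracy as MPJPE and as a relative reduction against MPS-Net. The following is a minimal sketch of how these two quantities are commonly computed, not the authors' evaluation code. The function names `mpjpe` and `relative_reduction` and the `(frames, joints, 3)` array layout are assumptions made for illustration; the two MPJPE values are the 3DPW entries from Table 1.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per joint position error.

    pred, gt: arrays of shape (frames, joints, 3) in the same units.
    Returns the Euclidean joint distance averaged over frames and joints.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def relative_reduction(baseline, ours):
    """Relative error reduction with respect to a baseline, in percent."""
    return 100.0 * (baseline - ours) / baseline

# Toy check (hypothetical data): every joint is offset by the vector (3, 4, 0),
# so every per-joint distance is exactly 5 units.
gt = np.zeros((2, 4, 3))
pred = gt + np.array([3.0, 4.0, 0.0])
toy_error = mpjpe(pred, gt)

# 3DPW MPJPE from Table 1: MPS-Net 84.3, proposed method 76.5.
reduction_3dpw = relative_reduction(84.3, 76.5)

print(f"{toy_error:.1f}, {reduction_3dpw:.1f}%")  # prints "5.0, 9.3%"
```

The second number reproduces the 9.3% reduction on 3DPW stated in the abstract.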

CLC number:

TP391

YU Bing, CHENG Guang, HUANG Dongjin, DING Youdong. 3D human mesh reconstruction based on dual-stream network fusion[J]. Journal of Graphics, 2025, 46(3): 625-634.

Figures/Tables 14

Fig. 1 Overall pipeline of the 3D human mesh reconstruction method based on dual-stream network fusion

Fig. 2 Spatial-temporal Transformer network structure

Fig. 3 Fusion decoder architecture

Table 1 Comparison of experimental results on 3DPW

Method	MPJPE↓	P-MPJPE↓	MPVPE↓	ACCEL↓
HMMR^[6]	116.5	72.6	139.3	15.2
MEVA^[4]	86.9	54.7	-	11.6
VIBE^[33]	91.9	57.6	99.1	25.4
TCMR^[3]	86.5	52.7	102.9	7.1
MPS-Net^[32]	84.3	52.1	99.7	7.4
GLOT^[20]	80.7	50.6	96.3	6.6
Ours	76.5	46.7	90.8	6.2
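The metric columns in the tables also include P-MPJPE and ACCEL; MPVPE is the same mean distance as MPJPE, taken over mesh vertices instead of joints. The sketch below shows one common way to compute P-MPJPE (MPJPE after a per-frame similarity alignment of the prediction to the ground truth) and an acceleration error based on second-order finite differences. The source does not specify its evaluation code, so the function names, the `(frames, joints, 3)` layout, and the finite-difference convention are assumptions for illustration only.

```python
import numpy as np

def procrustes_align(pred, gt):
    """Similarity-align one frame of predicted joints (J, 3) to the ground truth (J, 3)."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    # SVD of the 3x3 cross-covariance gives the optimal rotation (Kabsch method).
    U, S, Vt = np.linalg.svd(p.T @ g)
    d = np.sign(np.linalg.det(U @ Vt))      # guard against reflections
    D = np.diag([1.0, 1.0, d])
    R = U @ D @ Vt                          # rotation for row-vector points
    scale = (S * np.diag(D)).sum() / (p ** 2).sum()
    return scale * p @ R + mu_g

def p_mpjpe(pred_seq, gt_seq):
    """Procrustes-aligned MPJPE over a sequence of shape (T, J, 3)."""
    errors = [np.linalg.norm(procrustes_align(p, g) - g, axis=-1).mean()
              for p, g in zip(pred_seq, gt_seq)]
    return float(np.mean(errors))

def accel_error(pred_seq, gt_seq):
    """Mean difference between predicted and ground-truth joint accelerations.

    Acceleration is approximated per joint as x[t-1] - 2*x[t] + x[t+1].
    """
    def second_diff(x):
        return x[:-2] - 2.0 * x[1:-1] + x[2:]
    diff = second_diff(np.asarray(pred_seq, float)) - second_diff(np.asarray(gt_seq, float))
    return float(np.linalg.norm(diff, axis=-1).mean())

# Toy data (hypothetical): the prediction is a scaled, rotated, shifted copy of the
# ground truth, so the aligned error should vanish up to rounding.
rng = np.random.default_rng(0)
gt_seq = rng.normal(size=(5, 14, 3))
c, s = np.cos(np.pi / 6), np.sin(np.pi / 6)
Rz = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
pred_seq = 2.0 * gt_seq @ Rz + np.array([0.1, -0.2, 0.3])
aligned_error = p_mpjpe(pred_seq, gt_seq)

# Joints moving as x(t) = t^2 along one axis have a constant second difference of 2.
t = np.arange(5, dtype=float).reshape(-1, 1)
moving = np.zeros((5, 14, 3))
moving[..., 0] = t ** 2
toy_accel = accel_error(moving, np.zeros((5, 14, 3)))

print(f"{aligned_error:.3f}, {toy_accel:.1f}")  # prints "0.000, 2.0"
```

Because P-MPJPE removes global rotation, translation, and scale, it isolates errors in the articulated pose, while the acceleration error reflects temporal smoothness; this is why the tables report both alongside MPJPE.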

Table 2 Comparison of experimental results on MPI-INF-3DHP

Method	MPJPE↓	P-MPJPE↓	ACCEL↓
MEVA^[4]	96.4	65.4	11.1
VIBE^[33]	103.9	68.9	27.3
TCMR^[3]	97.6	63.5	8.5
MPS-Net^[32]	96.7	62.8	9.6
GLOT^[20]	93.9	61.5	7.9
Ours	87.7	54.5	7.1

Table 3 Comparison of experimental results on Human3.6M

Method	MPJPE↓	P-MPJPE↓	ACCEL↓
MEVA^[4]	76.0	53.2	15.3
VIBE^[33]	65.9	41.5	18.3
TCMR^[3]	62.3	41.1	5.3
MPS-Net^[32]	69.4	47.4	3.6
GLOT^[20]	67.0	46.3	3.6
Ours	57.9	38.9	3.3

Table 4 Comparison of experimental results with image-based methods on 3DPW

Method	MPJPE↓	P-MPJPE↓	ACCEL↓
PQ-GCN^[28]	89.2	58.3	-
Pose2Mesh^[34]	88.9	58.3	22.6
GTRS^[25]	88.5	58.9	25.0
HybrIK^[35]	81.0	76.0	7.1
NIKI^[36]	85.5	53.5	-
ReFit^[37]	71.0	43.9	-
Ours	74.6	47.7	6.9

Table 5 Ablation experiments with different modules added on 3DPW

Method	MPJPE↓	P-MPJPE↓	MPVPE↓
F(baseline)	84.3	52.1	99.7
F+ST	79.1	49.6	93.6
F+Dec	81.3	50.2	96.5
Ours	76.5	46.7	90.8

Table 6 Ablation experiments on 3D pose estimation on 3DPW

Method	MPJPE↓	P-MPJPE↓	MPVPE↓	ACCEL↓
Crop	80.2	50.2	96.4	7.3
Crop+Bbox	78.1	47.9	94.1	7.4
Ours	76.5	46.7	90.8	6.2

Fig. 4 MPJPE for different video sequence lengths

Fig. 5 Reconstruction results of different methods on the 3DPW dataset ((a) Input; (b) MPS-Net; (c) GLOT; (d) Ours)

Fig. 6 Reconstruction results of different methods on challenging videos ((a) Input; (b) MPS-Net; (c) GLOT; (d) Ours)

Fig. 7 Reconstruction results of the proposed method on occluded videos ((a) Original image; (b) Reconstructed image)

Fig. 8 Failure cases of the proposed method

References 37

[1]	WANG J B, TAN S J, ZHEN X T, et al. Deep 3D human pose estimation: a review[J]. Computer Vision and Image Understanding, 2021, 210: 103225.
[2]	DUAN H D, ZHAO Y, CHEN K, et al. Revisiting skeleton-based action recognition[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 2969-2978.
[3]	LV H, YANG H Y. A 3D human pose estimation approach based on spatio-temporal motion interaction modeling[J]. Journal of Graphics, 2024, 45(1): 159-168 (in Chinese).
[4]	LUO Z Y, GOLESTANEH S A, KITANI K M. 3D human motion estimation via motion compression and refinement[C]// The 15th Asian Conference on Computer Vision. Cham: Springer, 2021: 324-340.
[5]	SUN Y, YE Y, LIU W, et al. Human mesh recovery from monocular images via a skeleton-disentangled representation[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 5349-5358.
[6]	KANAZAWA A, ZHANG J Y, FELSEN P, et al. Learning 3D human dynamics from video[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 5614-5623.
[7]	WANG Y P, ZENG Y, LI S H, et al. A Transformer-based 3D human pose estimation method[J]. Journal of Graphics, 2023, 44(1): 139-145 (in Chinese).
[8]	LOPER M, MAHMOOD N, ROMERO J, et al. SMPL: a skinned multi-person linear model[J]. ACM Transactions on Graphics, 2015, 34(6): 248.
[9]	ANGUELOV D, SRINIVASAN P, KOLLER D, et al. SCAPE: shape completion and animation of people[C]// ACM SIGGRAPH 2005 Papers. New York: ACM, 2005: 408-416.
[10]	OSMAN A A A, BOLKART T, BLACK M J. STAR: sparse trained articulated human body regressor[C]// The 16th European Conference on Computer Vision. Cham: Springer, 2020: 598-613.
[11]	KANAZAWA A, BLACK M J, JACOBS D W, et al. End-to-end recovery of human shape and pose[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 7122-7131.
[12]	ZHANG J L, TU Z G, YANG J Y, et al. MixSTE: seq2seq mixed spatio-temporal encoder for 3D human pose estimation in video[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 13232-13242.
[13]	HUANG Y W, LIN Z Q, ZHANG J, et al. Lightweight human pose estimation algorithm combined with coordinate Transformer[J]. Journal of Graphics, 2024, 45(3): 516-527 (in Chinese).
[14]	LI Z, CHEN L L, LIU C L, et al. 3D human avatar digitization from a single image[C]// The 17th ACM SIGGRAPH International Conference on Virtual-Reality Continuum and its Applications in Industry. New York: ACM, 2019: 12.
[15]	DEY R, SALEM F M. Gate-variants of gated recurrent unit (GRU) neural networks[C]// The 60th IEEE International Midwest Symposium on Circuits and Systems. New York: IEEE Press, 2017: 1597-1600.
[16]	WANG J, HU Y Z. An improved enhancement algorithm based on CNN applicable for weak contrast images[J]. IEEE Access, 2020, 8: 8459-8476.
[17]	LI W H, LIU H, TANG H, et al. MHFormer: multi-hypothesis transformer for 3D human pose estimation[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 13147-13156.
[18]	WAN Z N, LI Z J, TIAN M Q, et al. Encoder-decoder with multi-level attention for 3D human shape and pose estimation[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2021: 13033-13042.
[19]	KISSOS I, FRITZ L, GOLDMAN M, et al. Beyond weak perspective for monocular 3D human pose estimation[C]// Computer Vision-ECCV 2020 Workshops. Cham: Springer, 2020: 541-554.
[20]	SHEN X L, YANG Z X, WANG X H, et al. Global-to-local modeling for video-based 3D human pose and shape estimation[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 8887-8896.
[21]	RONG Y, LIU Z W, LI C, et al. Delving deep into hybrid annotations for 3D human recovery in the wild[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 5340-5348.
[22]	LI Z W, XU B, HUANG H, et al. Deep two-stream video inference for human body pose and shape estimation[C]// 2022 IEEE/CVF Winter Conference on Applications of Computer Vision. New York: IEEE Press, 2022: 430-439.
[23]	ZHANG Z X, LU X Q, CAO G J, et al. ViT-YOLO: transformer-based YOLO for object detection[C]// 2021 IEEE/CVF International Conference on Computer Vision Workshops. New York: IEEE Press, 2021: 2799-2808.
[24]	LI Z H, LIU J Z, ZHANG Z S, et al. CLIFF: carrying location information in full frames into human pose and shape estimation[C]// The 17th European Conference on Computer Vision. Cham: Springer, 2022: 590-606.
[25]	ZHENG C, MENDIETA M, WANG P, et al. A lightweight graph transformer network for human mesh reconstruction from 2D human pose[C]// The 30th ACM International Conference on Multimedia. New York: ACM, 2022: 5496-5507.
[26]	CHENG G, HUANG Y, YU B. Recurrent transformer for 3D human pose estimation[C]// The 4th International Conference on Big Data & Artificial Intelligence & Software Engineering. New York: IEEE Press, 2023: 207-210.
[27]	KOLOTOUROS N, PAVLAKOS G, BLACK M J, et al. Learning to reconstruct 3D human pose and shape via model-fitting in the loop[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 2252-2261.
[28]	CHO K, VAN MERRIËNBOER B, GULCEHRE C, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation[EB/OL]. [2024-12-24]. https://arxiv.org/abs/1406.1078.
[29]	VON MARCARD T, HENSCHEL R, BLACK M J, et al. Recovering accurate 3D human pose in the wild using IMUs and a moving camera[C]// The 15th European Conference on Computer Vision. Cham: Springer, 2018: 601-617.
[30]	MEHTA D, RHODIN H, CASAS D, et al. Monocular 3D human pose estimation in the wild using improved CNN supervision[C]// 2017 International Conference on 3D Vision. New York: IEEE Press, 2017: 506-516.
[31]	IONESCU C, PAPAVA D, OLARU V, et al. Human3.6M: large scale datasets and predictive methods for 3D human sensing in natural environments[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014, 36(7): 1325-1339.
[32]	WEI W L, LIN J C, LIU T L, et al. Capturing humans in motion: temporal-attentive 3D human pose and shape estimation from monocular video[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 13211-13220.
[33]	KOCABAS M, ATHANASIOU N, BLACK M J. VIBE: video inference for human body pose and shape estimation[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 5253-5263.
[34]	CHOI H, MOON G, LEE K M. Pose2Mesh: graph convolutional network for 3D human pose and mesh recovery from a 2D human pose[C]// The 16th European Conference on Computer Vision. Cham: Springer, 2020: 769-787.
[35]	LI J F, XU C, CHEN Z C, et al. HybrIK: a hybrid analytical-neural inverse kinematics solution for 3D human pose and shape estimation[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 3383-3393.
[36]	LI J F, BIAN S Y, LIU Q, et al. NIKI: neural inverse kinematics with invertible neural networks for 3D human pose and shape estimation[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 12933-12942.
[37]	WANG Y F, DANIILIDIS K. ReFit: recurrent fitting network for 3D human recovery[C]// 2023 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2023: 14644-14654.