
Journal of Graphics ›› 2025, Vol. 46 ›› Issue (3): 625-634. DOI: 10.11996/JG.j.2095-302X.2025030625

• Computer Graphics and Virtual Reality •

3D human mesh reconstruction based on dual-stream network fusion

YU Bing1,2, CHENG Guang1,2, HUANG Dongjin1,2, DING Youdong1,2

  1. Shanghai Film Academy, Shanghai University, Shanghai 200072, China
  2. Shanghai Film Special Effects Engineering Technology Research Center, Shanghai 200072, China
  • Received: 2024-08-23 Accepted: 2024-12-24 Online: 2025-06-30 Published: 2025-06-13
  • About author:

    YU Bing (1989-), lecturer, Ph.D. His main research interests include image processing and deep learning. E-mail: yubing@shu.edu.cn

  • Supported by:
    Shanghai Talent Development Funding Program (2021016)

Abstract:

The reconstruction of 3D human body meshes holds significant application value in fields such as computer vision, animation production, and virtual reality. However, most existing methods focus on 3D human body reconstruction from single images, and accurately and smoothly reconstructing 3D human motion from video data remains a challenging problem. To address this issue, a dual-stream network fusion architecture was proposed that utilized the 3D human pose as an intermediary to achieve 3D human body mesh reconstruction from video data. Specifically, the proposed method comprised three components. First, a 3D pose estimation stream network estimated 3D joint points from the input video, providing precise joint information. Second, a temporal feature aggregation stream network extracted temporal image features from the video, capturing spatial motion and temporal pose characteristics. Finally, a fusion decoder was designed to regress the 3D mesh vertex coordinates by integrating the 3D joint points, the temporal image features, and the mesh structure provided by the SMPL template. Experimental results demonstrated that the proposed method achieved superior prediction accuracy compared to MPS-Net: on the 3DPW dataset, the mean per joint position error (MPJPE) was reduced by 9.3%, and on the MPI-INF-3DHP dataset, the MPJPE was reduced by 9.2%. Moreover, the reconstructed meshes were more visually plausible, exhibiting higher accuracy and smoothness.
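To illustrate the kind of dual-stream fusion the abstract describes, the following is a minimal PyTorch sketch of how a 3D pose stream, a temporal feature aggregation stream, and a fusion decoder regressing SMPL vertices might be wired together. The module designs, feature dimensions, and joint count are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of a dual-stream fusion pipeline for video-based 3D human
# mesh recovery. Dimensions, joint count, and module internals are
# assumptions for illustration only; only the SMPL vertex count (6890) is
# a property of the SMPL template itself.

import torch
import torch.nn as nn

NUM_JOINTS = 17        # assumed 3D skeleton size
IMG_FEAT_DIM = 2048    # assumed per-frame image feature size (e.g., a ResNet backbone)
NUM_VERTICES = 6890    # SMPL template vertex count


class PoseStream(nn.Module):
    """Stand-in 3D pose estimation stream: maps a window of 2D joints to 3D joints."""
    def __init__(self, seq_len: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(seq_len * NUM_JOINTS * 2, 1024),
            nn.ReLU(),
            nn.Linear(1024, NUM_JOINTS * 3),
        )

    def forward(self, joints_2d: torch.Tensor) -> torch.Tensor:
        # joints_2d: (B, T, J, 2) -> 3D joints for the centre frame: (B, J, 3)
        b = joints_2d.shape[0]
        return self.net(joints_2d.flatten(1)).view(b, NUM_JOINTS, 3)


class TemporalFeatureStream(nn.Module):
    """Stand-in temporal aggregation stream: attends over per-frame image features."""
    def __init__(self, d_model: int = 512):
        super().__init__()
        self.proj = nn.Linear(IMG_FEAT_DIM, d_model)
        self.attn = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, T, IMG_FEAT_DIM) -> aggregated feature of the centre frame
        x = self.attn(self.proj(frame_feats))
        return x[:, x.shape[1] // 2]


class FusionDecoder(nn.Module):
    """Regresses SMPL vertex coordinates from the fused joint and temporal features."""
    def __init__(self, d_model: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(NUM_JOINTS * 3 + d_model, 1024),
            nn.ReLU(),
            nn.Linear(1024, NUM_VERTICES * 3),
        )

    def forward(self, joints_3d: torch.Tensor, temporal_feat: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([joints_3d.flatten(1), temporal_feat], dim=1)
        return self.mlp(fused).view(-1, NUM_VERTICES, 3)


if __name__ == "__main__":
    B, T = 2, 16
    pose_stream = PoseStream(seq_len=T)
    feat_stream = TemporalFeatureStream()
    decoder = FusionDecoder()

    joints_3d = pose_stream(torch.randn(B, T, NUM_JOINTS, 2))
    temporal_feat = feat_stream(torch.randn(B, T, IMG_FEAT_DIM))
    vertices = decoder(joints_3d, temporal_feat)
    print(vertices.shape)  # torch.Size([2, 6890, 3])
```

In this sketch the fusion decoder directly regresses vertex coordinates from the concatenated stream outputs; the actual method additionally conditions the regression on the mesh structure of the SMPL template, which is omitted here for brevity.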

Key words: 3D human reconstruction, SMPL model, attention mechanisms, dual-stream network architecture, spatio-temporal information association
