Generative model based unsupervised multi-view stereo network

doi:10.11996/JG.j.2095-302X.2026010029

Abstract

Existing multi-view stereo schemes use depth-estimation algorithms to achieve stereo representation by establishing a mapping between the physical and digital worlds. Supervised learning-based neural networks have achieved accurate, high-fidelity 3D reconstruction through training. However, in-the-wild visual reconstruction remains challenging because rendered depth priors are unavailable and the images have wide baselines. A novel system was proposed to obtain optimized depth for naturally collected multi-view images without prior information by combining an unsupervised learning network with semantically optimized Neural Radiance Field (NeRF) rendering. First, preliminary depth information for in-the-wild multi-view images was produced without ground truth using unsupervised deep learning. Subsequently, in a separate NeRF module, a diffusion model was used to construct a surface semantic rendering loss, enabling a fine-grained volumetric representation. Experimental results on the benchmark dataset validated the proposed system, which improved the overall metrics by an average of 24.6% compared with other state-of-the-art schemes. A novel in-the-wild wide-baseline dataset was also used to verify generalization, and the proposed system reduced the reconstruction error by up to 40.8% relative to the compared methods.

Key words: unsupervised deep learning, multi-view stereo, 3D reconstruction, neural radiance field, depth optimization

CLC Number: TP391.41

PAN Yuxuan, JIN Rui, LIU Yu, ZHANG Lin. Generative model based unsupervised multi-view stereo network[J]. Journal of Graphics, 2026, 47(1): 29-38.

References

[1]	LEE L H, BRAUD T, ZHOU P Y, et al. All one needs to know about metaverse: a complete survey on technological singularity, virtual ecosystem, and research agenda[J]. Foundations and Trends® in Human-Computer Interaction, 2024, 18(2/3): 100-337.
[2]	YAO Y, LUO Z X, LI S W, et al. MVSNet: depth inference for unstructured multi-view stereo[C]// The 15th European Conference on Computer Vision - ECCV 2018. Cham: Springer, 2018: 785-801.
[3]	HUANG B C, YI H W, HUANG C, et al. M3VSNET: unsupervised multi-metric multi-view stereo network[C]// 2021 IEEE International Conference on Image Processing. New York: IEEE Press, 2021: 3163-3167.
[4]	LI J L, LU Z D, WANG Y Q, et al. DS-MVSNet: unsupervised multi-view stereo via depth synthesis[C]// The 30th ACM International Conference on Multimedia. New York: ACM, 2022: 5593-5601.
[5]	XIONG K Q, PENG R, ZHANG Z, et al. CL-MVSNet: unsupervised multi-view stereo with dual-level contrastive learning[C]// 2023 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2023: 3746-3757.
[6]	MILDENHALL B, SRINIVASAN P P, TANCIK M, et al. NeRF: representing scenes as neural radiance fields for view synthesis[C]// The 16th European Conference on Computer Vision. Cham: Springer, 2020: 405-421.
[7]	WANG D L, DING Z J, YANG J, et al. Large scene reconstruction method based on voxel grid feature of NeRF[J]. Journal of Graphics, 2025, 46(3): 502-509 (in Chinese).
[8]	ZAWISH M, DHAREJO F A, KHOWAJA S A, et al. AI and 6G into the Metaverse: fundamentals, challenges and future research trends[J]. IEEE Open Journal of the Communications Society, 2024, 5: 730-778.
[9]	LIU X, LI Y, FENG S J, et al. Line extraction and representation algorithm for RGB-D data[J]. Journal of Graphics, 2025, 46(3): 542-550 (in Chinese).
[10]	SCHÖNBERGER J L, ZHENG E L, FRAHM J M, et al. Pixelwise view selection for unstructured multi-view stereo[C]// The 14th European Conference on Computer Vision. Cham: Springer, 2016: 501-518.
[11]	HEEP M, ZELL E. ShadowPatch: shadow based segmentation for reliable depth discontinuities in photometric stereo[J]. Computer Graphics Forum, 2022, 41(7): 635-646.
[12]	LIANG J, WANG R J, PENG R, et al. High fidelity aggregated planar prior assisted PatchMatch multi-view stereo[C]// The 32nd ACM International Conference on Multimedia. New York: ACM, 2024: 3141-3150.
[13]	TANG J Y, CAI Y G, GAO X S, et al. Generalized sampling of non-local textural clues multi-view stereo framework[C]// The 32nd ACM International Conference on Multimedia. New York: ACM, 2024: 11222-11225.
[14]	XU H B, CHEN W T, SUN B G, et al. RobustMVS: single domain generalized deep multi-view stereo[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2024, 34(10): 9181-9194.
[15]	ZHU J, PENG B, LIU B Z, et al. Self-constructing stereo correspondences for unsupervised multi-view stereo[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2024, 34(11): 10732-10742.
[16]	JIANG J F, CAO M F, YI J, et al. DI-MVS: learning efficient multi-view stereo with depth-aware iterations[C]// 2024 IEEE International Conference on Acoustics, Speech and Signal Processing. New York: IEEE Press, 2024: 3180-3184.
[17]	KHOT T, AGRAWAL S, TULSIANI S, et al. Learning unsupervised multi-view stereopsis via robust photometric consistency[EB/OL]. (2019-06-06)[2025-01-27]. https://arxiv.org/abs/1905.02706.
[18]	RENDLE G, KRESKOWSKI A, FROEHLICH B. Volumetric avatar reconstruction with spatio-temporally offset RGBD cameras[C]// 2023 IEEE Conference Virtual Reality and 3D User Interfaces. New York: IEEE Press, 2023: 72-82.
[19]	WANG S C, JIANG H, XIANG L. CT-MVSNet: efficient multi-view stereo with cross-scale transformer[C]// The 30th International Conference on Multimedia Modeling. Cham: Springer, 2024: 394-408.
[20]	CHANG D, BOŽIČ A, ZHANG T, et al. RC-MVSNet: unsupervised multi-view stereo with neural rendering[C]// The 17th European Conference on Computer Vision. Cham: Springer, 2022: 665-680.
[21]	DENG K L, LIU A, ZHU J Y, et al. Depth-supervised NeRF: fewer views and faster training for free[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 12872-12881.
[22]	TOSI F, TONIONI A, DE GREGORIO D, et al. NeRF-supervised deep stereo[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 855-866.
[23]	SANTO H, OKURA F, MATSUSHITA Y. MVCPS-NeuS: multi-view constrained photometric stereo for neural surface reconstruction[C]// 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2024: 20475-20484.
[24]	ZHU D X, KONG H R, QIU Q, et al. Multi-view stereo network based on attention mechanism and neural volume rendering[J]. Electronics, 2023, 12(22): 4603.
[25]	WEI Y, LIU S H, ZHOU J, et al. Depth-guided optimization of neural radiance fields for indoor multi-view stereo[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(9): 10835-10849.
[26]	ITO S, MIURA K, ITO K, et al. Neural radiance field-inspired depth map refinement for accurate multi-view stereo[J]. Journal of Imaging, 2024, 10(3): 68.
[27]	ZHU H X, CHEN Z B. CMC: few-shot novel view synthesis via cross-view multiplane consistency[C]// 2024 IEEE Conference Virtual Reality and 3D User Interfaces. New York: IEEE Press, 2024: 960-968.
[28]	SCHÖNBERGER J L, FRAHM J M. Structure-from-motion revisited[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 4104-4113.
[29]	CAO T S, KREIS K, FIDLER S, et al. TexFusion: synthesizing 3D textures with text-guided image diffusion models[C]// 2023 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2023: 4146-4158.
[30]	AANÆS H, JENSEN R R, VOGIATZIS G, et al. Large-scale data for multiple-view stereopsis[J]. International Journal of Computer Vision, 2016, 120(2): 153-168.
[31]	ZHANG Y S, ZHU J K, LIN L X. Multi-view stereo representation revist: region-aware MVSNet[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 17376-17385.
[32]	KNAPITSCH A, PARK J, ZHOU Q Y, et al. Tanks and temples: benchmarking large-scale scene reconstruction[J]. ACM Transactions on Graphics, 2017, 36(4): 78.
[33]	PAN Y X, LIU Y, ZHANG L. LiTrix: a lightweight live light field video scheme for metaverse stereoscopic applications[J]. IEEE Internet of Things Magazine, 2023, 6(2): 137-142.

Parameter	Accuracy	Completeness	Overall
N=3	0.352	0.276	0.314
N=4	0.338	0.256	0.297
N=5	0.337	0.256	0.295
N=6	0.340	0.261	0.300
N=7	0.357	0.284	0.321

Method	Accuracy	Completeness	Overall
Colmap^[10]	0.400	0.664	0.532
MVSNet^[2]	0.396	0.527	0.462
M3VSNet^[3]	0.636	0.531	0.583
Unsup-MVS^[17]	0.881	1.073	0.977
RC-MVSNet^[20]	0.396	0.295	0.345
CL-MVSNet^[5]	0.375	0.283	0.329
RA-MVSNet^[31]	0.326	0.268	0.297
CT-MVSNet^[19]	0.341	0.264	0.302
ColNeRF^[26]	0.384	0.378	0.381
Proposed	0.337	0.256	0.295
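
The relationship between the three columns of the comparison table can be checked with a short script. The sketch below assumes that the overall metric is the arithmetic mean of accuracy and completeness and that lower values are better for all three columns; neither assumption is stated on this page, although both are the usual convention for distance-based multi-view stereo benchmarks. The names `ROWS`, `mean_overall`, `max_deviation`, and `best_method` are illustrative, and the last row (the proposed scheme) is labelled "Proposed" in the code.

```python
# Rows copied from the comparison table: (method, accuracy, completeness, overall).
ROWS = [
    ("Colmap", 0.400, 0.664, 0.532),
    ("MVSNet", 0.396, 0.527, 0.462),
    ("M3VSNet", 0.636, 0.531, 0.583),
    ("Unsup-MVS", 0.881, 1.073, 0.977),
    ("RC-MVSNet", 0.396, 0.295, 0.345),
    ("CL-MVSNet", 0.375, 0.283, 0.329),
    ("RA-MVSNet", 0.326, 0.268, 0.297),
    ("CT-MVSNet", 0.341, 0.264, 0.302),
    ("ColNeRF", 0.384, 0.378, 0.381),
    ("Proposed", 0.337, 0.256, 0.295),
]


def mean_overall(accuracy, completeness):
    """Assumed definition: overall = mean of accuracy and completeness."""
    return (accuracy + completeness) / 2.0


def max_deviation(rows):
    """Largest gap between a tabulated overall value and the assumed mean."""
    return max(abs(mean_overall(a, c) - o) for _, a, c, o in rows)


def best_method(rows):
    """Method with the smallest overall value (assumes lower is better)."""
    return min(rows, key=lambda row: row[3])[0]


if __name__ == "__main__":
    # A small deviation is expected because the table is rounded to 3 decimals.
    print(f"max deviation from assumed mean: {max_deviation(ROWS):.4f}")
    print(f"lowest overall value: {best_method(ROWS)}")
```

Under these assumptions every tabulated overall value lies within 0.002 of the mean of its row, which is consistent with rounding of the underlying figures, and the proposed scheme has the lowest overall value even though RA-MVSNet reports a lower accuracy figure.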

Method	Accuracy	Completeness	Overall
L_P	0.432	0.349	0.391
L_P+L_FV	0.391	0.285	0.338
L and L_NeRF	0.337	0.256	0.295