Generative model based unsupervised multi-view stereo network

doi:10.11996/JG.j.2095-302X.2026010029

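Abstract

Chinese abstract (in translation):

Existing multi-view stereo research uses depth-estimation algorithms to achieve stereo representation by establishing a mapping between the physical and digital worlds. Neural networks based on supervised learning can, through training, produce accurate, high-fidelity 3D reconstructions. However, because depth priors are lacking and the images have a large field of view, visual reconstruction of natural scenes remains challenging. This study applies an unsupervised learning network and semantically optimized neural radiance field (NeRF) rendering to estimate depth for naturally captured multi-view images without prior information. First, unsupervised learning generates preliminary depth for the multi-view images without reference data. Then, in a separate NeRF model, a diffusion model is used to build a surface semantic rendering loss that enables fine-grained 3D representation. Experiments on a benchmark dataset show that the method improves the overall reconstruction metric by 24.6% on average compared with other state-of-the-art schemes. In a generalization test on a wide-baseline dataset, it reduces the reconstruction error measured for existing methods by up to 40.8%.

Keywords: unsupervised deep learning, multi-view stereo, 3D reconstruction, neural radiance field, depth optimization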

Abstract:

Existing research on multi-view stereo uses depth-estimation algorithms to achieve stereo representation by establishing a mapping between the physical and digital worlds. Supervised learning-based neural networks achieve accurate, high-fidelity 3D reconstruction through training. However, in-the-wild visual reconstruction remains challenging because rendered depth priors are lacking and the images are captured with wide baselines. This paper proposes a system that obtains optimized depth for naturally collected multi-view images without prior information by applying an unsupervised learning network and semantically optimized neural radiance field (NeRF) rendering. First, preliminary depth information for in-the-wild multi-view images is produced without ground truth using unsupervised deep learning. Subsequently, in a separate NeRF module, a diffusion model is used to construct a surface semantic rendering loss, enabling a fine-grained volumetric representation. Experimental results on the benchmark dataset show that the proposed system improves the overall metric by an average of 24.6% compared with other state-of-the-art schemes. A new in-the-wild wide-baseline dataset was also used to verify generalization performance, where the proposed system reduced the reconstruction error by up to 40.8% relative to the compared methods.

Key words: unsupervised deep learning, multi-view stereo, 3D reconstruction, neural radiance field, depth optimization
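The abstract relies on NeRF rendering to refine depth. As a rough illustration of how a NeRF-style module can turn per-sample densities along a camera ray into a depth value, the sketch below uses the standard volume-rendering quadrature from the original NeRF formulation. It is not the authors' implementation: the function name `render_depth`, the treatment of the last sample interval, and the toy inputs are all assumptions made for this example.

```python
import math


def render_depth(sigmas, t_vals):
    """Expected ray-termination depth from per-sample volume densities.

    sigmas: non-negative densities at each sample along the ray.
    t_vals: sample distances along the ray, strictly increasing.
    Returns (expected_depth, accumulated_weight).
    """
    if len(sigmas) != len(t_vals):
        raise ValueError("sigmas and t_vals must have the same length")

    transmittance = 1.0  # probability that the ray reaches the current sample
    depth = 0.0
    weight_sum = 0.0
    for i, (sigma, t) in enumerate(zip(sigmas, t_vals)):
        # Interval to the next sample; the last interval is treated as
        # effectively infinite (an assumption of this sketch).
        delta = t_vals[i + 1] - t if i + 1 < len(t_vals) else 1e10
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this interval
        weight = transmittance * alpha          # contribution of this sample
        depth += weight * t
        weight_sum += weight
        transmittance *= 1.0 - alpha
    return depth, weight_sum


if __name__ == "__main__":
    # Toy ray: empty space, then a dense surface near t = 3.
    d, w = render_depth([0.0, 0.0, 50.0, 0.0], [1.0, 2.0, 3.0, 4.0])
    print(f"depth={d:.4f} accumulated weight={w:.4f}")
```

A ray that passes only through empty space accumulates no weight, so the returned depth is 0 and callers would normally mask such rays before comparing against another depth estimate.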

CLC number:

TP391.41

PAN Yuxuan, JIN Rui, LIU Yu, ZHANG Lin. Generative model based unsupervised multi-view stereo network[J]. Journal of Graphics, 2026, 47(1): 29-38 (in Chinese).

Figures/Tables (10)

Fig. 1 System architecture

Table 1 Parameter optimization on the DTU dataset

Parameter	Accuracy	Completeness	Overall
N=3	0.352	0.276	0.314
N=4	0.338	0.256	0.297
N=5	0.337	0.256	0.295
N=6	0.340	0.261	0.300
N=7	0.357	0.284	0.321

Fig. 2 Point cloud results on the DTU dataset ((a) Original image; (b) Depth estimation result; (c) Reconstruction result)

Table 2 Evaluation metrics on the DTU dataset

Method	Accuracy	Completeness	Overall
Colmap^[10]	0.400	0.664	0.532
MVSNet^[2]	0.396	0.527	0.462
M3VSNet^[3]	0.636	0.531	0.583
Unsup-MVS^[17]	0.881	1.073	0.977
RC-MVSNet^[20]	0.396	0.295	0.345
CL-MVSNet^[5]	0.375	0.283	0.329
RA-MVSNet^[31]	0.326	0.268	0.297
CT-MVSNet^[19]	0.341	0.264	0.302
ColNeRF^[26]	0.384	0.378	0.381
Proposed	0.337	0.256	0.295
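The relationship between the three columns of Table 2 can be checked with a few lines of code. The sketch below assumes the usual DTU convention that the overall score is the arithmetic mean of accuracy and completeness, with lower values being better; the page itself does not state the formula, so this is an assumption. The row values are copied from Table 2, and the helper names are hypothetical.

```python
# Rows copied from Table 2: (method, accuracy, completeness, reported overall).
TABLE_2 = [
    ("Colmap", 0.400, 0.664, 0.532),
    ("MVSNet", 0.396, 0.527, 0.462),
    ("M3VSNet", 0.636, 0.531, 0.583),
    ("Unsup-MVS", 0.881, 1.073, 0.977),
    ("RC-MVSNet", 0.396, 0.295, 0.345),
    ("CL-MVSNet", 0.375, 0.283, 0.329),
    ("RA-MVSNet", 0.326, 0.268, 0.297),
    ("CT-MVSNet", 0.341, 0.264, 0.302),
    ("ColNeRF", 0.384, 0.378, 0.381),
    ("Proposed", 0.337, 0.256, 0.295),
]


def overall(accuracy, completeness):
    """Arithmetic mean of the two scores (assumed DTU convention)."""
    return (accuracy + completeness) / 2.0


def max_deviation(rows):
    """Largest gap between the recomputed mean and the reported overall."""
    return max(abs(overall(a, c) - reported) for _, a, c, reported in rows)


def best_method(rows):
    """Method with the lowest reported overall score."""
    return min(rows, key=lambda row: row[3])[0]


if __name__ == "__main__":
    for name, acc, comp, reported in TABLE_2:
        print(f"{name:10s} recomputed={overall(acc, comp):.4f} reported={reported:.3f}")
    print("max deviation:", round(max_deviation(TABLE_2), 4))
    print("best overall:", best_method(TABLE_2))
```

Under this assumption the recomputed means stay within about 0.002 of the reported overall values, so small differences in the last digit are most likely rounding in the published table.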

Table 3 Ablation study on the DTU dataset

Method	Accuracy	Completeness	Overall
L_P	0.432	0.349	0.391
L_P+L_FV	0.391	0.285	0.338
L and L_NeRF	0.337	0.256	0.295

Fig. 3 Visual results of the ablation study on the DTU dataset ((a) L_P; (b) L_P+L_FV; (c) L and L_NeRF; (d) Dataset reference result)

Table 4 Evaluation metrics on the Tanks and Temples dataset

Method	Lighthouse	Panther	Train
Colmap^[10]	56.43	46.97	42.04
MVSNet^[2]	50.79	50.86	34.69
M3VSNet^[3]	44.42	44.95	30.31
Unsup-MVS^[17]	42.03	44.00	36.45
RC-MVSNet^[20]	53.49	52.30	49.37
CL-MVSNet^[5]	60.02	59.97	52.28
RA-MVSNet^[31]	64.78	65.60	58.08
CT-MVSNet^[19]	62.60	64.83	58.68
ColNeRF^[26]	60.23	59.46	52.57
Proposed (D=128)	61.17	61.20	53.14
Proposed (D=160)	64.97	65.90	58.74
Proposed (D=192)	64.89	65.85	58.71

Fig. 4 Point cloud results on the Tanks and Temples dataset ((a) Original image; (b) Reconstruction result)

Fig. 5 Point cloud results on the NERULN dataset ((a) Proposed system (full architecture); (b) Ablated proposed system (unsupervised MVS network only); (c) ColNeRF; (d) RC-MVSNet; (e) M3VSNet)

Table 5 Evaluation metrics on the NERULN dataset

Method	Points	Faces	Reconstruction error/px	Processing time/s	Model size/MB
M3VSNet^[3]	519 291	103 854	0.284	354.9	6320
RC-MVSNet^[20]	504 690	100 574	0.189	294.5	9189
ColNeRF^[26]	538 372	119 578	0.204	341.5	5964
Proposed (L only)	530 249	118 988	0.202	286.7	5970
Proposed (Full)	560 560	130 609	0.168	321.3	8672
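The "up to 40.8%" error reduction quoted in the abstract can be related to the reconstruction-error column of Table 5. The sketch below computes the relative reduction of the full system against each other row, assuming the figure is defined as (baseline - proposed) / baseline; that definition and the helper names are assumptions, while the error values are copied from Table 5.

```python
# Reconstruction error in pixels, copied from Table 5.
ERRORS_PX = {
    "M3VSNet": 0.284,
    "RC-MVSNet": 0.189,
    "ColNeRF": 0.204,
    "Proposed (L only)": 0.202,
    "Proposed (Full)": 0.168,
}


def relative_reduction(baseline, ours):
    """Fraction by which `ours` lowers the error relative to `baseline`."""
    return (baseline - ours) / baseline


def reductions_vs(target, errors):
    """Relative reduction of `target` against every other entry."""
    ours = errors[target]
    return {
        name: relative_reduction(err, ours)
        for name, err in errors.items()
        if name != target
    }


if __name__ == "__main__":
    result = reductions_vs("Proposed (Full)", ERRORS_PX)
    for name, value in sorted(result.items(), key=lambda kv: -kv[1]):
        print(f"{name:18s} {100 * value:5.1f}%")
```

With these values the largest reduction is against M3VSNet, (0.284 - 0.168) / 0.284, which is roughly 40.8% and is consistent with the figure in the abstract; the reductions against the other rows are smaller.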

References (33)

[1]	LEE L H, BRAUD T, ZHOU P Y, et al. All one needs to know about metaverse: a complete survey on technological singularity, virtual ecosystem, and research agenda[J]. Foundations and Trends® in Human-Computer Interaction, 2024, 18(2/3): 100-337.
[2]	YAO Y, LUO Z X, LI S W, et al. MVSNet: depth inference for unstructured multi-view stereo[C]// The 15th European Conference on Computer Vision - ECCV 2018. Cham: Springer, 2018: 785-801.
[3]	HUANG B C, YI H W, HUANG C, et al. M3VSNET: unsupervised multi-metric multi-view stereo network[C]// 2021 IEEE International Conference on Image Processing. New York: IEEE Press, 2021: 3163-3167.
[4]	LI J L, LU Z D, WANG Y Q, et al. DS-MVSNet: unsupervised multi-view stereo via depth synthesis[C]// The 30th ACM International Conference on Multimedia. New York: ACM, 2022: 5593-5601.
[5]	XIONG K Q, PENG R, ZHANG Z, et al. CL-MVSNet: unsupervised multi-view stereo with dual-level contrastive learning[C]// 2023 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2023: 3746-3757.
[6]	MILDENHALL B, SRINIVASAN P P, TANCIK M, et al. NeRF: representing scenes as neural radiance fields for view synthesis[C]// The 16th European Conference on Computer Vision. Cham: Springer, 2020: 405-421.
[7]	WANG D L, DING Z J, YANG J, et al. Large scene reconstruction method based on voxel grid feature of NeRF[J]. Journal of Graphics, 2025, 46(3): 502-509 (in Chinese).
[8]	ZAWISH M, DHAREJO F A, KHOWAJA S A, et al. AI and 6G into the Metaverse: fundamentals, challenges and future research trends[J]. IEEE Open Journal of the Communications Society, 2024, 5: 730-778.
[9]	LIU X, LI Y, FENG S J, et al. Line extraction and representation algorithm for RGB-D data[J]. Journal of Graphics, 2025, 46(3): 542-550 (in Chinese).
[10]	SCHÖNBERGER J L, ZHENG E L, FRAHM J M, et al. Pixelwise view selection for unstructured multi-view stereo[C]// The 14th European Conference on Computer Vision. Cham: Springer, 2016: 501-518.
[11]	HEEP M, ZELL E. ShadowPatch: shadow based segmentation for reliable depth discontinuities in photometric stereo[J]. Computer Graphics Forum, 2022, 41(7): 635-646.
[12]	LIANG J, WANG R J, PENG R, et al. High fidelity aggregated planar prior assisted PatchMatch multi-view stereo[C]// The 32nd ACM International Conference on Multimedia. New York: ACM, 2024: 3141-3150.
[13]	TANG J Y, CAI Y G, GAO X S, et al. Generalized sampling of non-local textural clues multi-view stereo framework[C]// The 32nd ACM International Conference on Multimedia. New York: ACM, 2024: 11222-11225.
[14]	XU H B, CHEN W T, SUN B G, et al. RobustMVS: single domain generalized deep multi-view stereo[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2024, 34(10): 9181-9194.
[15]	ZHU J, PENG B, LIU B Z, et al. Self-constructing stereo correspondences for unsupervised multi-view stereo[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2024, 34(11): 10732-10742.
[16]	JIANG J F, CAO M F, YI J, et al. DI-MVS: learning efficient multi-view stereo with depth-aware iterations[C]// 2024 IEEE International Conference on Acoustics, Speech and Signal Processing. New York: IEEE Press, 2024: 3180-3184.
[17]	KHOT T, AGRAWAL S, TULSIANI S, et al. Learning unsupervised multi-view stereopsis via robust photometric consistency[EB/OL]. (2019-06-06)[2025-01-27]. https://arxiv.org/abs/1905.02706.
[18]	RENDLE G, KRESKOWSKI A, FROEHLICH B. Volumetric avatar reconstruction with spatio-temporally offset RGBD cameras[C]// 2023 IEEE Conference Virtual Reality and 3D User Interfaces. New York: IEEE Press, 2023: 72-82.
[19]	WANG S C, JIANG H, XIANG L. CT-MVSNet: efficient multi-view stereo with cross-scale transformer[C]// The 30th International Conference on Multimedia Modeling. Cham: Springer, 2024: 394-408.
[20]	CHANG D, BOŽIČ A, ZHANG T, et al. RC-MVSNet: unsupervised multi-view stereo with neural rendering[C]// The 17th European Conference on Computer Vision. Cham: Springer, 2022: 665-680.
[21]	DENG K L, LIU A, ZHU J Y, et al. Depth-supervised NeRF: fewer views and faster training for free[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 12872-12881.
[22]	TOSI F, TONIONI A, DE GREGORIO D, et al. NeRF-supervised deep stereo[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 855-866.
[23]	SANTO H, OKURA F, MATSUSHITA Y. MVCPS-NeuS: multi-view constrained photometric stereo for neural surface reconstruction[C]// 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2024: 20475-20484.
[24]	ZHU D X, KONG H R, QIU Q, et al. Multi-view stereo network based on attention mechanism and neural volume rendering[J]. Electronics, 2023, 12(22): 4603.
[25]	WEI Y, LIU S H, ZHOU J, et al. Depth-guided optimization of neural radiance fields for indoor multi-view stereo[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(9): 10835-10849.
[26]	ITO S, MIURA K, ITO K, et al. Neural radiance field-inspired depth map refinement for accurate multi-view stereo[J]. Journal of Imaging, 2024, 10(3): 68.
[27]	ZHU H X, CHEN Z B. CMC: few-shot novel view synthesis via cross-view multiplane consistency[C]// 2024 IEEE Conference Virtual Reality and 3D User Interfaces. New York: IEEE Press, 2024: 960-968.
[28]	SCHÖNBERGER J L, FRAHM J M. Structure-from-motion revisited[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 4104-4113.
[29]	CAO T S, KREIS K, FIDLER S, et al. TexFusion: synthesizing 3D textures with text-guided image diffusion models[C]// 2023 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2023: 4146-4158.
[30]	AANÆS H, JENSEN R R, VOGIATZIS G, et al. Large-scale data for multiple-view stereopsis[J]. International Journal of Computer Vision, 2016, 120(2): 153-168.
[31]	ZHANG Y S, ZHU J K, LIN L X. Multi-view stereo representation revist: region-aware MVSNet[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 17376-17385.
[32]	KNAPITSCH A, PARK J, ZHOU Q Y, et al. Tanks and temples: benchmarking large-scale scene reconstruction[J]. ACM Transactions on Graphics, 2017, 36(4): 78.
[33]	PAN Y X, LIU Y, ZHANG L. LiTrix: a lightweight live light field video scheme for metaverse stereoscopic applications[J]. IEEE Internet of Things Magazine, 2023, 6(2): 137-142.