
图学学报 (Journal of Graphics), 2026, Vol. 47, Issue (1): 29-38. DOI: 10.11996/JG.j.2095-302X.2026010029

• 图像处理与计算机视觉 (Image Processing and Computer Vision) •

基于生成模型的无监督多视点立体视觉网络

潘宇轩1, 金锐1, 刘雨1, 张琳1,2

  1. 北京邮电大学人工智能学院,北京 100876
  2. 北京市大数据中心,北京 100086
  • 收稿日期:2025-04-29 接受日期:2025-06-28 出版日期:2026-02-28 发布日期:2026-03-16
  • 通讯作者 (Corresponding author):张琳 (ZHANG Lin),E-mail:zhanglin@bupt.edu.cn
  • 基金资助:
    国家重点研发计划(2023YFB2704500);北京市自然科学基金(4222033)

Generative model based unsupervised multi-view stereo network

PAN Yuxuan1, JIN Rui1, LIU Yu1, ZHANG Lin1,2

  1. School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China
  2. Beijing Big Data Center, Beijing 100086, China
  • Received: 2025-04-29 Accepted: 2025-06-28 Published: 2026-02-28 Online: 2026-03-16
  • Supported by:
    National Key Research and Development Program of China (2023YFB2704500); Beijing Natural Science Foundation (4222033)

摘要:

现有的多视点立体视觉研究利用深度估计算法,通过建立物理世界与数字世界的映射关系来实现立体表征。基于有监督学习的神经网络算法通过训练能够取得准确且高保真的三维重建结果。然而,由于缺乏深度先验信息且图像具备大视场的特性,面向自然场景的视觉重建仍然具有挑战性。研究应用无监督学习网络和基于语义优化的神经辐射场(NeRF)渲染,在没有先验信息的情况下实现对自然采集的多视点图像的深度估计。首先通过无监督学习无参考地生成多视点图像初步的深度信息,进一步在独立的NeRF模型中,利用扩散模型建立表面语义渲染损失来实现细粒度的三维表征。在基准数据集上的实验结果表明,该方法与其他最先进的方案相比整体重建的指标平均提高了24.6%;在宽基线数据集的泛化性能验证中,该方法将现有方法测得的重建误差最多降低了40.8%。

关键词: 无监督深度学习, 多视点立体视觉, 三维重建, 神经辐射场, 深度优化

Abstract:

Existing multi-view stereo research uses depth-estimation algorithms to achieve stereo representation by establishing a mapping between the physical and digital worlds. Supervised neural networks achieve accurate, high-fidelity 3D reconstruction through training. However, in-the-wild visual reconstruction remains challenging because depth priors are unavailable and the images are captured over wide baselines. This study proposes a system that obtains optimized depth for naturally collected multi-view images without prior information by combining an unsupervised learning network with semantically optimized Neural Radiance Field (NeRF) rendering. First, preliminary depth information for the in-the-wild multi-view images is produced by unsupervised deep learning, without ground truth. Subsequently, in a separate NeRF module, a diffusion model is used to construct a surface semantic rendering loss, enabling a fine-grained volumetric representation. On the benchmark dataset, the proposed system improves the overall reconstruction metrics by an average of 24.6% compared with other state-of-the-art schemes. On a novel in-the-wild wide-baseline dataset used to verify generalization, it reduces the reconstruction error measured for existing methods by up to 40.8%.
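The abstract does not specify how depth is read out of the NeRF module. As a rough illustration only, the sketch below uses the standard NeRF volume-rendering quadrature to compute an expected depth along one ray from per-sample densities. The function name `render_depth`, its arguments, and the spacing rule for the last sample are assumptions made for this example, not the authors' implementation, and the diffusion-based semantic loss is not modeled.

```python
import math

def render_depth(sigmas, ts):
    """Expected depth along one ray via standard NeRF quadrature.

    sigmas: volume densities at the samples (hypothetical inputs).
    ts:     increasing sample distances along the ray.
    Returns (expected_depth, accumulated_opacity).
    """
    depth = 0.0
    transmittance = 1.0  # fraction of light not yet absorbed
    for i, (sigma, t) in enumerate(zip(sigmas, ts)):
        # Interval length; the last sample reuses the previous spacing
        # (an assumption of this sketch).
        if i + 1 < len(ts):
            delta = ts[i + 1] - t
        elif i > 0:
            delta = t - ts[i - 1]
        else:
            delta = 1.0
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this interval
        depth += transmittance * alpha * t      # weight times distance
        transmittance *= 1.0 - alpha
    return depth, 1.0 - transmittance

# A ray that hits an almost opaque surface at distance 3.
depth, opacity = render_depth([0.0, 0.0, 50.0, 0.0], [1.0, 2.0, 3.0, 4.0])
```

In a pipeline like the one summarized above, a depth rendered this way could be compared against the preliminary depth from the unsupervised stage; how the paper actually couples the two stages is not stated in the abstract.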

Key words: unsupervised deep learning, multi-view stereo, 3D reconstruction, neural radiance field, depth optimization

中图分类号 (CLC number):