Journal of Graphics, 2024, Vol. 45, Issue (1): 230-239. DOI: 10.11996/JG.j.2095-302X.2024010230
WANG Jiang'an, HUANG Le, PANG Dawei, QIN Linzhen, LIANG Wenqian
Received: 2023-06-19
Accepted: 2023-12-04
Published: 2024-02-29
Online: 2024-02-29
First author: WANG Jiang'an (1981-), male, associate professor, Ph.D. His main research interests cover computer vision and 3D modeling. E-mail: wangjiangan@126.com
Abstract: To address the difficulty of reconstructing weakly textured regions, high resource consumption, and long reconstruction time, this paper proposes A2R2-MVSNet (adaptive aggregation recurrent recursive multi-view stereo network), a multi-stage dense point cloud reconstruction network based on adaptive aggregation recurrent recursive convolution. The method first introduces a feature extraction module based on multi-scale recurrent recursive residuals, which aggregates contextual semantic information to ease feature extraction in weakly textured or textureless regions. For cost volume regularization, a residual regularization module is proposed that, at a slight increase in memory consumption, improves the ability of the 3D CNN to extract and aggregate contextual semantics. Experimental results show that the proposed method ranks among the top on the overall DTU benchmark metric and reproduces reconstruction details more faithfully, and that it produces good depth maps and point clouds on the BlendedMVS dataset; the network was further tested for generalization on a self-collected large-scale high-resolution dataset. Thanks to the coarse-to-fine multi-stage design and the proposed modules, the network generates depth maps of high accuracy and completeness while also supporting high-resolution reconstruction for practical applications.
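The residual regularization idea described in the abstract, i.e. a 3D CNN unit with an identity shortcut applied to the cost volume, can be sketched as follows. This is an illustrative reconstruction, not the paper's released module: the class name `Residual3DBlock`, the channel width, the GroupNorm group count, and the LeakyReLU slope are all our own assumptions.

```python
import torch
import torch.nn as nn


class Residual3DBlock(nn.Module):
    """Hedged sketch of a residual regularization unit: two 3x3x3 convolutions
    with GroupNorm + LeakyReLU and an identity shortcut over the cost volume.
    Hyper-parameters are assumptions, not taken from the paper."""

    def __init__(self, ch=8):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(ch, ch, 3, padding=1, bias=False),
            nn.GroupNorm(4, ch),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv3d(ch, ch, 3, padding=1, bias=False),
            nn.GroupNorm(4, ch),
        )
        self.act = nn.LeakyReLU(0.1, inplace=True)

    def forward(self, volume):
        # volume: (B, C, D, H, W) cost volume; shortcut keeps the shape.
        return self.act(self.body(volume) + volume)
```

Because the shortcut is an identity, the block adds context-aggregation capacity while only the two convolutions contribute extra memory, which matches the abstract's claim of a slight memory increase.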
WANG Jiang’an, HUANG Le, PANG Dawei, QIN Linzhen, LIANG Wenqian. Dense point cloud reconstruction network based on adaptive aggregation recurrent recursion[J]. Journal of Graphics, 2024, 45(1): 230-239.
| Input size | Structure | Output size |
|---|---|---|
| H×W×3 | Conv+GN+LeakyReLU, 3×3, stride=1 | H×W×8 |
| H×W×8 | Conv+GN+LeakyReLU, 3×3, stride=1 | H×W×8 |
| H×W×8 | Conv+GN+LeakyReLU, 3×3, stride=2 | H/2×W/2×16 |
| H/2×W/2×16 | Conv+GN+LeakyReLU, 3×3, stride=1 | H/2×W/2×16 |
| H/2×W/2×16 | Conv+GN+LeakyReLU, 3×3, stride=2 | H/4×W/4×32 |
| H/4×W/4×32 | Conv+GN+LeakyReLU, 3×3, stride=1 | H/4×W/4×32 |
| H/4×W/4×32 | Conv+GN+LeakyReLU, 3×3, stride=2 | H/8×W/8×64 |
| H/8×W/8×64 | Conv+GN+LeakyReLU, 3×3, stride=1 | H/8×W/8×64 |
| H/8×W/8×64 | Conv+GN+LeakyReLU, 3×3, stride=1 | H/8×W/8×64 |
| H/4×W/4×96 | Conv+GN+LeakyReLU, 3×3, stride=1 | H/4×W/4×32 |
| H/2×W/2×48 | Conv+GN+LeakyReLU, 3×3, stride=1 | H/2×W/2×16 |
| H×W×24 | Conv+GN+LeakyReLU, 3×3, stride=1 | H×W×8 |
| H/4×W/4×32 | Conv+GN+LeakyReLU, 3×3, stride=1 | H/4×W/4×16 |
| H/2×W/2×16 | Conv+GN+LeakyReLU, 3×3, stride=1 | H/2×W/2×16 |
| H×W×8 | Conv+GN+LeakyReLU, 3×3, stride=1 | H×W×16 |
Table 1 Multi-scale feature extraction network composition
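The layer list in Table 1 can be read as a small U-shaped encoder-decoder that emits 16-channel feature maps at full, half, and quarter resolution. The PyTorch sketch below is reconstructed from the table alone, not from the authors' code: the recurrent-recursive residual connections described in the text are omitted, and the class name `MultiScaleFeatureNet` and the GroupNorm group count of 4 are our own assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def conv_gn_lrelu(in_ch, out_ch, stride=1):
    # The 3x3 Conv + GroupNorm + LeakyReLU unit used throughout Table 1.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
        nn.GroupNorm(4, out_ch),  # group count of 4 is an assumption
        nn.LeakyReLU(0.1, inplace=True),
    )


class MultiScaleFeatureNet(nn.Module):
    """Sketch of Table 1: encoder to 1/8 scale, decoder with skip concats,
    and three 16-channel output heads at 1x, 1/2, and 1/4 resolution."""

    def __init__(self):
        super().__init__()
        self.enc0 = nn.Sequential(conv_gn_lrelu(3, 8), conv_gn_lrelu(8, 8))
        self.enc1 = nn.Sequential(conv_gn_lrelu(8, 16, stride=2), conv_gn_lrelu(16, 16))
        self.enc2 = nn.Sequential(conv_gn_lrelu(16, 32, stride=2), conv_gn_lrelu(32, 32))
        self.enc3 = nn.Sequential(conv_gn_lrelu(32, 64, stride=2),
                                  conv_gn_lrelu(64, 64), conv_gn_lrelu(64, 64))
        self.dec2 = conv_gn_lrelu(96, 32)  # up(64) ++ skip(32) -> 96 in
        self.dec1 = conv_gn_lrelu(48, 16)  # up(32) ++ skip(16) -> 48 in
        self.dec0 = conv_gn_lrelu(24, 8)   # up(16) ++ skip(8)  -> 24 in
        self.out2 = conv_gn_lrelu(32, 16)
        self.out1 = conv_gn_lrelu(16, 16)
        self.out0 = conv_gn_lrelu(8, 16)

    def forward(self, x):
        f0 = self.enc0(x)
        f1 = self.enc1(f0)
        f2 = self.enc2(f1)
        f3 = self.enc3(f2)
        up = lambda t: F.interpolate(t, scale_factor=2, mode="bilinear",
                                     align_corners=False)
        d2 = self.dec2(torch.cat([up(f3), f2], dim=1))
        d1 = self.dec1(torch.cat([up(d2), f1], dim=1))
        d0 = self.dec0(torch.cat([up(d1), f0], dim=1))
        # Features at full, 1/2, and 1/4 resolution for the coarse-to-fine stages.
        return self.out0(d0), self.out1(d1), self.out2(d2)
```

The channel arithmetic of the three decoder rows (96→32, 48→16, 24→8) only works out if each upsampled coarse feature is concatenated with the encoder skip at the same scale, which is why the sketch uses `torch.cat` before each decoder convolution.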
输入尺寸 | 结构 | 输出尺寸 |
---|---|---|
H×W×3 | Conv+GN+LeakyReLU,3×3, stride=1 | H×W×8 |
H×W×8 | Conv+GN+LeakyReLU,3×3, stride=1 | H×W×8 |
H×W×8 | Conv+GN+LeakyReLU,3×3, stride=2 | H/2×W/2×16 |
H/2×W/2×16 | Conv+GN+LeakyReLU,3×3, stride=1 | H/2×W/2×16 |
H/2×W/2×16 | Conv+GN+LeakyReLU,3×3, stride=2 | H/4×W/4×32 |
H/4×W/4×32 | Conv+GN+LeakyReLU,3×3, stride=1 | H/4×W/4×32 |
H/4×W/4×32 | Conv+GN+LeakyReLU,3×3, stride=2 | H/8×W/8×64 |
H/8×W/8×64 | Conv+GN+LeakyReLU,3×3, stride=1 | H/8×W/8×64 |
H/8×W/8×64 | Conv+GN+LeakyReLU,3×3, stride=1 | H/8×W/8×64 |
H/4×W/4×96 | Conv+GN+LeakyReLU,3×3, stride=1 | H/4×W/4×32 |
H/2×W/2×48 | Conv+GN+LeakyReLU,3×3, stride=1 | H/2×W/2×16 |
H×W×24 | Conv+GN+LeakyReLU,3×3, stride=1 | H×W×8 |
H/4×W/4×32 | Conv+GN+LeakyReLU,3×3, stride=1 | H/4×W/4×16 |
H/2×W/2×16 | Conv+GN+LeakyReLU,3×3, stride=1 | H/2×W/2×16 |
H×W×8 | Conv+GN+LeakyReLU,3×3, stride=1 | H×W×16 |
Fig. 3 Depth-map comparison of different networks on the DTU dataset ((a) R-MVSNet; (b) UCSNet; (c) Vis-MVSNet; (d) Cas-MVSNet; (e) CVP-MVSNet; (f) Ours; (g) Ground truth)
| Method | Acc | Comp | Overall |
|---|---|---|---|
| Furu | 0.613 | 0.941 | 0.777 |
| Gipuma | 0.283 | 0.873 | 0.578 |
| COLMAP | 0.400 | 0.664 | 0.532 |
| MVSNet | 0.396 | 0.527 | 0.462 |
| R-MVSNet | 0.383 | 0.452 | 0.417 |
| D2HC-RMVSNet | 0.395 | 0.378 | 0.386 |
| IterMVS | 0.373 | 0.354 | 0.363 |
| EPP-MVSNet | 0.413 | 0.296 | 0.355 |
| Cas-MVSNet | 0.325 | 0.385 | 0.355 |
| PatchmatchNet | 0.427 | 0.277 | 0.352 |
| CVP-MVSNet | 0.296 | 0.406 | 0.351 |
| MG-MVSNet | 0.358 | 0.338 | 0.348 |
| UCSNet | 0.338 | 0.349 | 0.344 |
| LANet | 0.320 | 0.349 | 0.335 |
| UniMVSNet | 0.352 | 0.278 | 0.315 |
| Ours | 0.321 | 0.346 | 0.334 |
Table 2 Comparison of DTU dataset evaluation results (mm)
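On the DTU benchmark, the Overall score is conventionally the arithmetic mean of the accuracy and completeness distances, which is how the last column of Table 2 follows from the first two (up to rounding). A quick check on a few rows (values copied from Table 2):

```python
def overall(acc_mm, comp_mm):
    """DTU 'Overall' score: arithmetic mean of accuracy and completeness (mm)."""
    return (acc_mm + comp_mm) / 2


# (Acc, Comp) pairs taken from Table 2.
rows = {
    "MVSNet": (0.396, 0.527),       # table Overall: 0.462
    "Cas-MVSNet": (0.325, 0.385),   # table Overall: 0.355
    "Ours": (0.321, 0.346),         # table Overall: 0.334
}
for name, (acc, comp) in rows.items():
    print(f"{name}: overall = {overall(acc, comp):.4f}")
```

Lower is better for all three columns, so a method can trade accuracy against completeness (e.g. PatchmatchNet vs. CVP-MVSNet) while landing at nearly the same Overall score.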
Fig. 6 Point cloud results on the self-collected dataset ((a) R-MVSNet, overall; (b) R-MVSNet, block; (c) Ours, overall; (d) Ours, block)
| Method | Params/M | GPU memory/GB | Runtime/s | Acc/mm | Comp/mm | Overall/mm |
|---|---|---|---|---|---|---|
| Baseline | 0.44 | 7.545 | 2.872 | 0.348 | 0.357 | 0.353 |
| Baseline+FPN | 0.55 | 7.548 | 2.885 | 0.345 | 0.352 | 0.349 |
| Baseline+A2R2CNN | 0.56 | 7.548 | 3.165 | 0.337 | 0.342 | 0.340 |
| Baseline+RU-Net | 0.56 | 8.339 | 2.911 | 0.327 | 0.356 | 0.342 |
| Ours | 0.67 | 8.342 | 3.210 | 0.321 | 0.346 | 0.334 |
Table 3 Network module comparison
| Method | Input resolution | Output resolution | GPU memory/GB | Runtime/s | Acc/mm | Comp/mm |
|---|---|---|---|---|---|---|
| R-MVSNet | 1536×1152 | 384×288 | 9.800 | 2.518 | 0.383 | 0.452 |
| Vis-MVSNet | 1536×1152 | 768×576 | 5.583 | 3.902 | 0.369 | 0.361 |
| CVP-MVSNet | 1536×1152 | 1536×1152 | 8.335 | 3.118 | 0.296 | 0.406 |
| UniMVSNet | 1536×1152 | 1536×1152 | 9.991 | 1.466 | 0.352 | 0.278 |
| Ours | 1536×1152 | 1536×1152 | 8.342 | 3.210 | 0.321 | 0.346 |
| Ours | 1536×1152 | 768×576 | 4.043 | 0.862 | 0.367 | 0.381 |
Table 4 Effect of output resolution on network performance
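The two "Ours" rows of Table 4 quantify the cost of full-resolution output: halving each output dimension (1536×1152 → 768×576) cuts the pixel count by 4×, and in these runs GPU memory and runtime drop by roughly 2× and 3.7× respectively. The ratios can be checked directly (all values copied from Table 4):

```python
# Pixel counts at the two output resolutions of the "Ours" rows.
full_px = 1536 * 1152
half_px = 768 * 576
print(full_px // half_px)        # pixel-count ratio -> 4

# Resource ratios for "Ours" at full vs. half output resolution.
print(round(8.342 / 4.043, 2))   # GPU-memory ratio
print(round(3.210 / 0.862, 2))   # runtime ratio
```

Memory and time grow sublinearly in the pixel count here, which is consistent with per-stage costs (cost volume construction and regularization) dominating over the per-pixel output work.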
[1] AANAES H, JENSEN R R, VOGIATZIS G, et al. Large-scale data for multiple-view stereopsis[J]. International Journal of Computer Vision, 2016, 120(2): 153-168.
[2] FURUKAWA Y, HERNÁNDEZ C. Multi-view stereo: a tutorial[J]. Foundations and Trends® in Computer Graphics and Vision, 2015, 9(1-2): 1-148.
[3] WANG S Q, ZHANG J Q, LI L Y, et al. Application of MVSNet in 3D reconstruction of space objects[J]. Chinese Journal of Lasers, 2022, 49(23): 176-185 (in Chinese).
[4] SCHÖNBERGER J L, FRAHM J M. Structure-from-motion revisited[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 4104-4113.
[5] KANG S B, SZELISKI R, CHAI J X. Handling occlusions in dense multi-view stereo[C]// The 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2003: I-103-I-110.
[6] SCHÖNBERGER J L, ZHENG E L, FRAHM J M, et al. Pixelwise view selection for unstructured multi-view stereo[C]// European Conference on Computer Vision. Cham: Springer, 2016: 501-518.
[7] LIU W J, WANG J K, QU H C. Multi-scale cost volumes information sharing based multi-view stereo reconstructed model[J]. Journal of Image and Graphics, 2022, 27(11): 3331-3342 (in Chinese).
[8] WANG J A, PANG D W, HUANG L, et al. Dense point cloud reconstruction network using multi-scale feature recursive convolution[J]. Journal of Graphics, 2022, 43(5): 875-883 (in Chinese).
[9] NIRKIN Y, WOLF L, HASSNER T. HyperSeg: patch-wise hypernetwork for real-time semantic segmentation[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 4060-4069.
[10] LUO X D, WU Y Q, CHEN J L. Research progress on deep learning methods for object detection and semantic segmentation in UAV aerial images[J/OL]. Acta Aeronautica et Astronautica Sinica, 2023: 1-33. [2023-06-12]. https://kns.cnki.net/kcms/detail/11.1929.V.20230609.1350.008.html (in Chinese).
[11] WANG Y X, HU Y F, KONG Q Q, et al. 3D point cloud semantic segmentation: state of the art and challenges[J]. Chinese Journal of Engineering, 2023, 45(10): 1653-1665 (in Chinese).
[12] HAMID M S, MANAP N A, HAMZAH R A, et al. Stereo matching algorithm based on deep learning: a survey[J]. Journal of King Saud University - Computer and Information Sciences, 2022, 34(5): 1663-1673.
[13] ZHANG X Y, GAO H B, ZHAO J H, et al. Overview of deep learning intelligent driving methods[J]. Journal of Tsinghua University: Science and Technology, 2018, 58(4): 438-444 (in Chinese).
[14] KNAPITSCH A, PARK J, ZHOU Q Y, et al. Tanks and temples: benchmarking large-scale scene reconstruction[J]. ACM Transactions on Graphics, 2017, 36(4): 78:1-78:13.
[15] ZHU Q T, MIN C, WEI Z Z, et al. Deep learning for multi-view stereo via plane sweep: a survey[EB/OL]. [2023-06-22]. http://arxiv.org/abs/2106.15328v2.
[16] XU Y B, ZHANG J B, TAN N S. Improved algorithm for line buffering based on plane sweep technique[J]. Application Research of Computers, 2012, 29(11): 4364-4366, 4389 (in Chinese).
[17] YAO Y, LUO Z X, LI S W, et al. MVSNet: depth inference for unstructured multi-view stereo[C]// European Conference on Computer Vision. Cham: Springer, 2018: 785-801.
[18] YAO Y, LUO Z X, LI S W, et al. Recurrent MVSNet for high-resolution multi-view stereo depth inference[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 5520-5529.
[19] YU Z H, GAO S H. Fast-MVSNet: sparse-to-dense multi-view stereo with learned propagation and Gauss-Newton refinement[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 1946-1955.
[20] TANG J L, XIE J L, XUE C J. TDOA-FDOA passive location algorithm using Gauss-Newton iteration[J]. Journal of Xidian University, 2023, 50(1): 19-28, 47 (in Chinese).
[21] ZHANG J Y, YAO Y, LI S W, et al. Visibility-aware multi-view stereo network[EB/OL]. [2023-06-22]. https://arxiv.org/abs/2008.07928.pdf.
[22] YAN J F, WEI Z Z, YI H W, et al. Dense hybrid recurrent multi-view stereo net with dynamic consistency checking[C]// European Conference on Computer Vision. Cham: Springer, 2020: 674-689.
[23] WEI Z Z, ZHU Q T, MIN C, et al. AA-RMVSNet: adaptive aggregation recurrent multi-view stereo network[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2022: 6167-6176.
[24] SHI X J, CHEN Z R, WANG H, et al. Convolutional LSTM network: a machine learning approach for precipitation nowcasting[C]// The 28th International Conference on Neural Information Processing Systems - Volume 1. New York: ACM, 2015: 802-810.
[25] GU X D, FAN Z W, ZHU S Y, et al. Cascade cost volume for high-resolution multi-view stereo and stereo matching[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 2492-2501.
[26] YANG J Y, MAO W, ALVAREZ J M, et al. Cost volume pyramid based depth inference for multi-view stereo[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 4876-4885.
[27] ZHANG X D, HU Y T, WANG H C, et al. Long-range attention network for multi-view stereo[C]// 2021 IEEE Winter Conference on Applications of Computer Vision. New York: IEEE Press, 2021: 3781-3790.
[28] WANG F, GALLIANI S, VOGEL C, et al. PatchmatchNet: learned multi-view patchmatch stereo[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 14189-14198.
[29] MA X J, GONG Y, WANG Q R, et al. EPP-MVSNet: epipolar-assembling based depth prediction for multi-view stereo[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2022: 5712-5720.
[30] WANG F, GALLIANI S, VOGEL C, et al. IterMVS: iterative probability estimation for efficient multi-view stereo[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 8596-8605.
[31] CHO K, VAN MERRIENBOER B, GULCEHRE C, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation[EB/OL]. [2023-06-22]. https://arxiv.org/abs/1406.1078.pdf.
[32] PENG R, WANG R J, WANG Z Y, et al. Rethinking depth estimation for multi-view stereo: a unified representation[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 8635-8644.
[33] XI J H, SHI Y F, WANG Y J, et al. RayMVSNet: learning ray-based 1D implicit fields for accurate multi-view stereo[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 8585-8595.
[34] DING Y K, YUAN W T, ZHU Q T, et al. TransMVSNet: global context-aware multi-view stereo network with transformers[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 8575-8584.
[35] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]// The 31st International Conference on Neural Information Processing Systems. New York: ACM, 2017: 6000-6010.
[36] MI Z X, DI C, XU D. Generalized binary search network for highly-efficient multi-view stereo[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 12981-12990.
[37] YAMASHITA K, ENYO Y, NOBUHARA S, et al. nLMVS-Net: deep non-Lambertian multi-view stereo[C]// 2023 IEEE/CVF Winter Conference on Applications of Computer Vision. New York: IEEE Press, 2023: 3036-3045.
[38] CHIU C Y, WU Y T, SHEN I C, et al. 360MVSNet: deep multi-view stereo network with 360° images for indoor scene reconstruction[C]// 2023 IEEE/CVF Winter Conference on Applications of Computer Vision. New York: IEEE Press, 2023: 3056-3065.
[39] ZHANG X D, YANG F Z, CHANG M, et al. MG-MVSNet: multiple granularities feature fusion network for multi-view stereo[J]. Neurocomputing, 2023, 528: 35-47.
[40] ZHANG Y, ZHU J K, LIN L X. Multi-view stereo representation revisit: region-aware MVSNet[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 17376-17385.
[41] QIAO S Y, CHEN L C, YUILLE A. DetectoRS: detecting objects with recursive feature pyramid and switchable atrous convolution[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 10208-10219.
[42] HUANG G, LIU Z, VAN DER MAATEN L, et al. Densely connected convolutional networks[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 2261-2269.
[43] HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 770-778.
[44] YAN H B, XU F Q, HUANG L E, et al. Review of multi-view stereo reconstruction methods based on deep learning[J]. Optics and Precision Engineering, 2023, 31(16): 2444-2464 (in Chinese).
[45] RONNEBERGER O, FISCHER P, BROX T. U-Net: convolutional networks for biomedical image segmentation[M]// Lecture Notes in Computer Science. Cham: Springer International Publishing, 2015: 234-241.
[46] YANG H, CHEN R, AN S P, et al. The growth of image-related three dimensional reconstruction techniques in deep learning-driven era: a critical summary[J]. Journal of Image and Graphics, 2023, 28(8): 2396-2409 (in Chinese).
[47] IOFFE S, SZEGEDY C. Batch normalization: accelerating deep network training by reducing internal covariate shift[C]// The 32nd International Conference on Machine Learning - Volume 37. New York: ACM, 2015: 448-456.
[48] GLOROT X, BORDES A, BENGIO Y. Deep sparse rectifier neural networks[C]// The 14th International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings, 2011: 315-323.
[49] WU Y X, HE K M. Group normalization[C]// European Conference on Computer Vision. Cham: Springer, 2018: 3-19.
[50] XU B, WANG N Y, CHEN T Q, et al. Empirical evaluation of rectified activations in convolutional network[EB/OL]. [2023-06-22]. https://arxiv.org/abs/1505.00853.pdf.
[51] XU B, DONG Y Q, ZHANG L, et al. A hybrid SfM method based on partition optimization[J]. Acta Geodaetica et Cartographica Sinica, 2022, 51(1): 115-126 (in Chinese).
[52] YUAN Y T, LIN C Y, ZHAO Y, et al. A post processing algorithm for upsampling depth image based on boundary correction[J]. Journal of the China Railway Society, 2015, 37(12): 67-73 (in Chinese).
[53] YAO Y, LUO Z X, LI S W, et al. BlendedMVS: a large-scale dataset for generalized multi-view stereo networks[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 1787-1796.
[54] GALLIANI S, LASINGER K, SCHINDLER K. Massively parallel multiview stereopsis by surface normal diffusion[C]// 2015 IEEE International Conference on Computer Vision. New York: IEEE Press, 2016: 873-881.
[55] FURUKAWA Y, PONCE J. Accurate, dense, and robust multiview stereopsis[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010, 32(8): 1362-1376.
[56] CHENG S, XU Z X, ZHU S L, et al. Deep stereo using adaptive thin volume representation with uncertainty awareness[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 2521-2531.