Monocular depth estimation combining pyramid structure and attention mechanism

doi:10.11996/JG.j.2095-302X.2024030454

Abstract:

Monocular depth estimation predicts a dense depth map from a single color image. To address the blurred boundaries and insufficient capture of contextual information in current monocular depth estimation algorithms, a monocular depth estimation algorithm combining a pyramid structure and an attention mechanism was proposed. The algorithm adopted an encoder-decoder framework. The encoder used the PVTv2 network, exploiting the strength of Transformer networks in modeling global information to obtain richer global semantic information. The decoder consisted of a main depth estimation branch and two pyramid sub-branches. The main branch applied spatial and channel attention to adaptively focus on the important feature regions and feature channels across the encoder and decoder features. The Laplacian pyramid sub-branch and the depth residual pyramid sub-branch learned rich local information from the color image and from the depth features of the main branch and passed it to the main branch, further alleviating the missing details and disordered structures in monocular depth estimation. Experimental results showed that, compared with the advanced algorithm P3Depth on the indoor public dataset NYU Depth V2, the proposed algorithm improved the δ_1.25 threshold accuracy by 1.22% and reduced the absolute relative error and root mean square error by 5.8% and 2.8%, respectively. On the outdoor public dataset KITTI, it reduced the absolute relative error, root mean square logarithmic error, and root mean square error by 8.5%, 3.9%, and 0.4%, respectively. The algorithm improved depth estimation accuracy and produced visually good depth maps.

Key words: deep learning, monocular depth estimation, pyramid structure, attention mechanism, Transformer
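The abstract names a Laplacian pyramid sub-branch that draws local detail from the color image. As a minimal sketch of the underlying idea only, and not the authors' implementation, the following decomposes an image into per-level residuals plus a coarse base and then rebuilds it. The function names are hypothetical, and 2x2 average pooling with nearest-neighbor upsampling is an assumed stand-in for the Gaussian filtering used in the classical Laplacian pyramid.

```python
import numpy as np

def downsample(img):
    """Halve height and width by 2x2 average pooling (assumed simplification)."""
    h, w = img.shape[:2]
    return img.reshape(h // 2, 2, w // 2, 2, *img.shape[2:]).mean(axis=(1, 3))

def upsample(img):
    """Double height and width by nearest-neighbor repetition."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

def laplacian_pyramid(img, levels):
    """Return (residuals, coarsest): fine-to-coarse detail maps and the base image.

    Height and width must be divisible by 2 ** levels.
    """
    residuals = []
    current = np.asarray(img, dtype=np.float64)
    for _ in range(levels):
        smaller = downsample(current)
        # Each residual keeps what is lost by going one level coarser (edges, texture).
        residuals.append(current - upsample(smaller))
        current = smaller
    return residuals, current

def reconstruct(residuals, coarsest):
    """Invert the decomposition by adding residuals back from coarse to fine."""
    current = coarsest
    for residual in reversed(residuals):
        current = upsample(current) + residual
    return current

if __name__ == "__main__":
    image = np.random.default_rng(0).random((16, 16, 3))
    details, base = laplacian_pyramid(image, levels=3)
    print([d.shape for d in details], base.shape)
    print(np.allclose(reconstruct(details, base), image))
```

Each residual holds only the high-frequency content of its scale, which is why such a decomposition is a natural source of boundary and detail cues for a depth decoder.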

CLC number: TP391

LI Tao, HU Ting, WU Dandan. Monocular depth estimation combining pyramid structure and attention mechanism[J]. Journal of Graphics, 2024, 45(3): 454-463.

Figures and tables

Fig. 1 Overall network architecture

Fig. 2 Dual attention fusion module

Fig. 3 Depth residual module

Table 1 Comparison of quantitative results of different methods on the NYU Depth V2 dataset

Method	RMSE↓	AbsRel↓	log10↓	$\delta_{1.25}$↑	$\delta_{1.25^2}$↑	$\delta_{1.25^3}$↑
Ref. [3]	0.641	0.158	-	0.769	0.950	0.988
Ref. [15]	0.562	-	0.064	0.800	0.952	0.988
Ref. [16]	0.514	0.110	0.048	0.878	0.977	0.994
Ref. [17]	0.495	0.139	0.047	0.888	0.978	0.995
Ref. [18]	0.470	0.109	-	0.859	0.973	0.995
Ref. [19]	0.416	0.108	0.048	0.875	0.976	0.994
Ref. [20]	0.398	0.108	0.047	0.884	0.979	-
Ref. [21]	0.398	0.116	0.048	0.875	0.980	0.995
Ref. [5]	0.392	0.110	0.047	0.885	0.978	0.994
Ref. [4]	0.384	0.105	0.045	0.895	0.983	0.996
Ref. [22]	0.374	0.103	0.044	0.902	0.985	0.997
Ref. [23]	0.373	0.107	0.046	0.893	0.985	0.997
Ref. [10]	0.365	0.106	0.045	0.900	0.983	0.996
Ref. [24]	0.364	0.103	0.044	0.903	0.984	0.997
Ref. [8]	0.357	0.110	0.045	0.904	0.988	0.998
Ref. [6]	0.356	0.104	0.043	0.898	0.981	0.996
Ours	0.346	0.098	0.042	0.909	0.988	0.998
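The column names in Table 1 follow the error and accuracy measures commonly used in monocular depth estimation. This page does not give the evaluation protocol, so the sketch below only assumes the usual definitions from the literature (root mean square error, mean absolute relative error, mean log10 error, and the fraction of pixels whose ratio max(pred/gt, gt/pred) falls below 1.25, 1.25^2 and 1.25^3). The function name `depth_metrics` is hypothetical, and any valid-pixel masking or cropping is assumed to have been applied beforehand.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Compute common depth metrics for strictly positive predicted and true depths."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    # Symmetric ratio used by the threshold accuracies.
    ratio = np.maximum(pred / gt, gt / pred)
    return {
        "RMSE": float(np.sqrt(np.mean((pred - gt) ** 2))),
        "AbsRel": float(np.mean(np.abs(pred - gt) / gt)),
        "log10": float(np.mean(np.abs(np.log10(pred) - np.log10(gt)))),
        "delta_1.25": float(np.mean(ratio < 1.25)),
        "delta_1.25^2": float(np.mean(ratio < 1.25 ** 2)),
        "delta_1.25^3": float(np.mean(ratio < 1.25 ** 3)),
    }

if __name__ == "__main__":
    gt = np.array([1.0, 2.0, 4.0])
    pred = 1.5 * gt  # every pixel overestimated by 50%
    for name, value in depth_metrics(pred, gt).items():
        print(f"{name}: {value:.4f}")
```

The three error columns are lower-is-better and the three threshold columns are higher-is-better, matching the arrows in the table header.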

Table 2 Comparison of quantitative results of different methods on the KITTI dataset

Method	RMSE↓	RMSElog↓	AbsRel↓	$\delta_{1.25}$↑	$\delta_{1.25^2}$↑	$\delta_{1.25^3}$↑
Ref. [25]	4.387	0.184	0.097	0.891	0.962	0.982
Ref. [26]	4.251	0.174	0.117	0.895	0.972	0.985
Ref. [27]	3.933	0.173	0.098	0.890	0.964	0.985
Ref. [17]	3.802	0.151	0.090	0.902	0.972	0.990
Ref. [28]	3.325	0.116	0.074	0.933	0.989	0.997
Ref. [19]	3.258	0.117	0.072	0.938	0.990	0.998
Ref. [29]	3.248	0.143	0.092	0.902	0.978	0.994
Ref. [21]	3.076	0.120	0.082	0.926	0.986	0.997
Ref. [6]	2.842	0.103	0.071	0.953	0.993	0.998
Ours	2.831	0.099	0.065	0.955	0.993	0.999
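The percentages quoted in the abstract can be checked against the P3Depth rows (reference [6]) and the rows for the proposed method in Tables 1 and 2. The short sketch below assumes the abstract reports relative change with respect to P3Depth; the helper name `relative_change` and the dictionary layout are illustrative only, while the numbers are copied from the tables.

```python
def relative_change(baseline, ours):
    """Signed relative change of `ours` with respect to `baseline`, in percent."""
    return (ours - baseline) / baseline * 100.0

# Values copied from Table 1 (NYU Depth V2) and Table 2 (KITTI).
p3depth = {
    "NYU RMSE": 0.356, "NYU AbsRel": 0.104, "NYU delta_1.25": 0.898,
    "KITTI RMSE": 2.842, "KITTI RMSElog": 0.103, "KITTI AbsRel": 0.071,
}
ours = {
    "NYU RMSE": 0.346, "NYU AbsRel": 0.098, "NYU delta_1.25": 0.909,
    "KITTI RMSE": 2.831, "KITTI RMSElog": 0.099, "KITTI AbsRel": 0.065,
}

changes = {name: relative_change(p3depth[name], ours[name]) for name in p3depth}

if __name__ == "__main__":
    for name, change in changes.items():
        print(f"{name}: {change:+.2f}%")
```

Under that assumption the rounded values agree with the abstract: about 1.22% higher δ_1.25 accuracy, 5.8% lower AbsRel and 2.8% lower RMSE on NYU Depth V2, and 8.5%, 3.9% and 0.4% lower AbsRel, RMSElog and RMSE on KITTI.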

Table 3 Comparison of quantitative results of different methods on the KITTI DP benchmark public test dataset

Method	SIlog↓	SqRel↓	AbsRel↓	iRMSE↓
Ref. [30]	15.18	3.79	12.33	17.86
Ref. [31]	14.68	3.90	12.31	15.96
Ref. [5]	14.67	3.12	12.42	16.84
Ref. [32]	13.53	3.06	10.35	15.96
Ref. [33]	13.08	2.72	10.27	13.95
Ref. [34]	13.00	2.95	10.38	13.78
Ref. [35]	12.86	2.87	10.03	14.40
Ref. [8]	12.83	3.62	11.01	13.43
Ref. [6]	12.82	2.53	9.92	13.71
Ours	12.45	2.68	9.92	13.26

Fig. 4 Comparison of qualitative results of different methods on the NYU Depth V2 dataset ((a) Input image; (b) Ground-truth depth; (c) BTS; (d) LapDepth; (e) ASTransformer; (f) Ours)

Fig. 5 Comparison of qualitative results of different methods on the KITTI DP benchmark public test dataset ((a) Input image; (b) VNL; (c) P3Depth; (d) Ours)

Fig. 6 Comparison of qualitative results of our method, BTS, PWA, and NeWCRFs on the KITTI DP benchmark public test dataset ((a) Input image; (b) BTS; (c) PWA; (d) NeWCRFs; (e) Ours)

Table 4 Comparison of complexity of different methods on the NYU Depth V2 dataset

Method	Params/M	FLOPs/G
Ref. [5]	47.001	244.652
Ref. [4]	58.018	134.285
Ref. [6]	84.419	850.502
Ref. [22]	95.728	219.423
Ours	91.308	254.758

Table 5 Quantitative ablation results of different modules on the NYU Depth V2 dataset

Method	Backbone	RMSE↓	AbsRel↓	log10↓	$\delta_{1.25}$↑	$\delta_{1.25^2}$↑	$\delta_{1.25^3}$↑	Params/M
Base	PVTv2-b5	0.355	0.104	0.044	0.904	0.987	0.998	88.26
Base+Lap	PVTv2-b5	0.354	0.104	0.043	0.906	0.986	0.998	88.37
Base+DAFM	PVTv2-b5	0.353	0.106	0.044	0.898	0.986	0.997	88.75
Base+DRM	PVTv2-b5	0.349	0.101	0.043	0.908	0.988	0.998	91.17
Base+Lap+DAFM	PVTv2-b5	0.351	0.102	0.043	0.905	0.987	0.997	88.87
Base+Lap+DRM	PVTv2-b5	0.350	0.102	0.043	0.907	0.987	0.998	91.29
Base+DAFM+DRM	PVTv2-b5	0.348	0.100	0.042	0.907	0.987	0.997	91.67
Base+Lap+DAFM+DRM	PVTv2-b5	0.346	0.098	0.042	0.909	0.988	0.998	91.82

Fig. 7 Decoder structures corresponding to different Laplacian pyramid levels ((a) Lap0; (b) Lap3; (c) Lap4_0; (d) Lap4_1; (e) Lap5; (f) Lap6)

Table 6 Comparison of the impact of different Laplacian pyramid structures on performance

Method	RMSE↓	AbsRel↓	log10↓	$\delta_{1.25}$↑	$\delta_{1.25^2}$↑	$\delta_{1.25^3}$↑
Lap0	0.349	0.101	0.043	0.910	0.988	0.997
Lap3	0.348	0.101	0.043	0.907	0.986	0.997
Lap4_0	0.352	0.099	0.042	0.909	0.987	0.997
Lap4_1	0.346	0.098	0.042	0.909	0.988	0.998
Lap5	0.349	0.099	0.042	0.912	0.987	0.998
Lap6	0.349	0.100	0.042	0.906	0.986	0.997

References

[1]	SAXENA A, CHUNG S H, NG A Y. Learning depth from single monocular images[C]// The 18th International Conference on Neural Information Processing Systems. Cambridge: MIT Press, 2005: 1161-1168.
[2]	CHOI S, MIN D B, HAM B, et al. Depth analogy: data-driven approach for single image depth estimation using gradient samples[J]. IEEE Transactions on Image Processing, 2015, 24(12): 5953-5966.
[3]	EIGEN D, PUHRSCH C, FERGUS R. Depth map prediction from a single image using a multi-scale deep network[C]// The 27th International Conference on Neural Information Processing Systems. Cambridge: MIT Press, 2014: 2366-2374.
[4]	SONG M, LIM S, KIM W. Monocular depth estimation using Laplacian pyramid-based depth residuals[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2021, 31(11): 4381-4393.
[5]	LEE J H, HAN M K, KO D W, et al. From big to small: multi-scale local planar guidance for monocular depth estimation[EB/OL]. [2023-06-20]. https://arxiv.org/abs/1907.10326.pdf.
[6]	PATIL V, SAKARIDIS C, LINIGER A, et al. P3Depth: monocular depth estimation with a piecewise planarity prior[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 1600-1611.
[7]	DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: transformers for image recognition at scale[EB/OL]. [2023-06-20]. http://arxiv.org/abs/2010.11929.pdf.
[8]	RANFTL R, BOCHKOVSKIY A, KOLTUN V. Vision transformers for dense prediction[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2021: 12159-12168.
[9]	HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 770-778.
[10]	YANG G L, TANG H, DING M L, et al. Transformer-based attention networks for continuous pixel-wise prediction[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2021: 16249-16259.
[11]	WANG W H, XIE E Z, LI X, et al. PVT v2: improved baselines with pyramid vision transformer[J]. Computational Visual Media, 2022, 8(3): 415-424.
[12]	SILBERMAN N, HOIEM D, KOHLI P, et al. Indoor segmentation and support inference from RGBD images[EB/OL]. [2023-06-20]. https://www.doc88.com/p-05429309382760.html?id=7&s=rel.
[13]	GEIGER A, LENZ P, STILLER C, et al. Vision meets robotics: the KITTI dataset[J]. International Journal of Robotics Research, 2013, 32(11): 1231-1237.
[14]	RUSSAKOVSKY O, DENG J, SU H, et al. ImageNet large scale visual recognition challenge[J]. International Journal of Computer Vision, 2015, 115(3): 211-252.
[15]	XIE Z, MA H L, WU K W, et al. Sampling aggregate network for scene depth estimation[J]. Acta Automatica Sinica, 2020, 46(3): 600-612 (in Chinese).
[16]	CHEN X T, CHEN X J, ZHA Z J. Structure-aware residual pyramid network for monocular depth estimation[C]// The 28th International Joint Conference on Artificial Intelligence. New York: ACM, 2019: 694-700.
[17]	RAMAMONJISOA M, LEPETIT V. SharpNet: fast and accurate recovery of occluding contours in monocular depth estimation[C]// 2019 IEEE/CVF International Conference on Computer Vision Workshop. New York: IEEE Press, 2019: 2109-2118.
[18]	MENG X Y, FAN C X, MING Y, et al. CORNet: context-based ordinal regression network for monocular depth estimation[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2022, 32(7): 4841-4853.
[19]	YIN W, LIU Y F, SHEN C H. Virtual normal: enforcing geometric constraints for accurate and robust depth prediction[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44(10): 7282-7295.
[20]	XU X F, CHEN Z, YIN F L. Monocular depth estimation with multi-scale feature fusion[J]. IEEE Signal Processing Letters, 2021, 28: 678-682.
[21]	MANIMARAN G, SWAMINATHAN J. Focal-WNet: an architecture unifying convolution and attention for depth estimation[C]// 2022 IEEE 7th International Conference for Convergence in Technology. New York: IEEE Press, 2022: 1-7.
[22]	CHANG W Y, ZHANG Y Y, XIONG Z W. Transformer-based monocular depth estimation with attention supervision[EB/OL]. [2024-04-09]. https://api.semanticscholar.org/CorpusID:249011083.
[23]	LEE M, HWANG S, PARK C, et al. EdgeConv with attention module for monocular depth estimation[C]// 2022 IEEE/CVF Winter Conference on Applications of Computer Vision. New York: IEEE Press, 2022: 2364-2373.
[24]	FAROOQ BHAT S, ALHASHIM I, WONKA P. AdaBins: depth estimation using adaptive bins[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 4008-4017.
[25]	RAMAMONJISOA M, FIRMAN M, WATSON J, et al. Single image depth prediction with wavelet decomposition[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 11084-11093.
[26]	SU W, ZHANG H F, SU Y, et al. Monocular depth estimation with spatially coherent sliced network[J]. Image and Vision Computing, 2022, 124: 104487.
[27]	GAN Y K, XU X Y, SUN W X, et al. Monocular depth estimation with affinity, vertical pooling, and label enhancement[C]// Computer Vision-ECCV 2018: 15th European Conference. Cham: Springer, 2018: 232-247.
[28]	ALI U, BAYRAMLI B, ALSARHAN T, et al. A lightweight network for monocular depth estimation with decoupled body and edge supervision[J]. Image and Vision Computing, 2021, 113: 104261.
[29]	WU J P, JI R R, WANG Q, et al. Fast monocular depth estimation via side prediction aggregation with continuous spatial refinement[J]. IEEE Transactions on Multimedia, 2023, 25: 1204-1216.
[30]	JIANG H L, HUANG R. Hierarchical binary classification for monocular depth estimation[EB/OL]. [2023-07-09]. https://ieeexplore.ieee.org/abstract/document/8961430.
[31]	OCHS M, KRETZ A, MESTER R. SDNet: semantically guided depth estimation network[C]// German Conference on Pattern Recognition. Cham: Springer, 2019: 288-302.
[32]	FU H, GONG M M, WANG C H, et al. Deep ordinal regression network for monocular depth estimation[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 2002-2011.
[33]	ZHANG Z Y, CUI Z, XU C Y, et al. Pattern-affinitive propagation across depth, surface normal and semantic segmentation[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 4101-4110.
[34]	DÍAZ R, MARATHE A. Soft labels for ordinal regression[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 4733-4742.
[35]	REN H Y, EL-KHAMY M, LEE J. Deep robust single image depth estimation neural network using scene understanding[EB/OL]. [2023-04-12]. http://arxiv.org/abs/1906.03279.pdf.
[36]	LEE S, LEE J, KIM B, et al. Patch-wise attention network for monocular depth estimation[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2021, 35(3): 1873-1881.
[37]	YUAN W H, GU X D, DAI Z Z, et al. Neural window fully-connected CRFs for monocular depth estimation[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 3906-3915.