Journal of Graphics ›› 2023, Vol. 44 ›› Issue (4): 728-738. DOI: 10.11996/JG.j.2095-302X.2023040728
Monocular depth estimation based on Laplacian pyramid with attention fusion
YU Wei-qun, LIU Jia-tao, ZHANG Ya-ping
Received: 2022-11-22
Accepted: 2023-03-27
Online: 2023-08-31
Published: 2023-08-16
Contact: ZHANG Ya-ping (1979-), professor, Ph.D. Her main research interests cover computer vision and computer graphics.
About author: YU Wei-qun (1998-), master student. His main research interests cover computer vision and image processing. E-mail: yudalao888@163.com
YU Wei-qun, LIU Jia-tao, ZHANG Ya-ping. Monocular depth estimation based on Laplacian pyramid with attention fusion[J]. Journal of Graphics, 2023, 44(4): 728-738.
URL: http://www.txxb.com.cn/EN/10.11996/JG.j.2095-302X.2023040728
Encoder

| Block | Filter | Stride | Channel | In | Out | Input |
|---|---|---|---|---|---|---|
| layer1 | 7×7 | 2 | 3/64 | S | S/2 | Input RGB |
| Maxpool | 3×3 | 2 | 64/64 | S/2 | S/4 | F(layer1) |
| layer2 | 3×3 | 2 | 64/256 | S/4 | S/4 | F(Maxpool) |
| layer3 | 3×3 | 2 | 256/512 | S/8 | S/8 | F(layer2) |
| layer4 | 3×3 | 2 | 512/1024 | S/16 | S/16 | F(layer3) |

Decoder

| Block | Filter size | Up | Channel | In | Out | Input | Level |
|---|---|---|---|---|---|---|---|
| reduction | 1×1 | 1 | 1024/512 | S/16 | S/16 | F(layer4) | - |
| ASPP | 3×3 | 1 | 512/512 | S/16 | S/16 | F(reduction) | - |
| sa | 1×1 | 1 | 512/512 | S/16 | S/16 | F(ASPP) | - |
| dec5 | 3×3 | 1 | 512/1 | S/16 | S/16 | F(sa) | 5th |
| dec4up | 3×3 | 2 | 512/256 | S/16 | S/8 | F(sa) | 4th |
| dec4ca | 1×1 | 2 | 1024/512 | S/16 | S/8 | F(UP(CA(layer4)) © layer3) | 4th |
| dec4reduc | 1×1 | 1 | 768/252 | S/8 | S/8 | F(dec4ca © dec4up) | 4th |
| dec4upr | 3×3 | 2 | 2/1 | S/16 | S/8 | F(UP(R5) © UP(CA(R5))) | 4th |
| dec4bneck | 3×3 | 1 | 256/256 | S/8 | S/8 | F(dec4reduc © dec4upr © L4) | 4th |
| dec4 | 3×3 | 1 | 256/1 | S/8 | S/8 | F(dec4bneck) | 4th |
| dec3up | 3×3 | 2 | 256/128 | S/8 | S/4 | F(dec4bneck) | 3rd |
| dec3ca | 1×1 | 2 | 512/128 | S/8 | S/4 | F(UP(CA(layer3)) © layer2) | 3rd |
| dec3reduc | 1×1 | 1 | 384/124 | S/4 | S/4 | F(dec3ca © dec3up) | 3rd |
| dec3upr | 3×3 | 2 | 2/1 | S/8 | S/4 | F(UP(R4) © UP(CA(R4))) | 3rd |
| dec3bneck | 3×3 | 1 | 128/128 | S/4 | S/4 | F(dec3reduc © dec3upr © L3) | 3rd |
| dec3 | 3×3 | 1 | 128/1 | S/4 | S/4 | F(dec3bneck) | 3rd |
| dec2up | 3×3 | 2 | 128/64 | S/4 | S/2 | F(dec3bneck) | 2nd |
| dec2ca | 1×1 | 2 | 128/64 | S/4 | S/2 | F(UP(CA(layer2)) © Maxpool) | 2nd |
| dec2reduc | 1×1 | 1 | 128/60 | S/2 | S/2 | F(dec2ca © dec2up) | 2nd |
| dec2upr | 3×3 | 2 | 2/1 | S/4 | S/2 | F(UP(R3) © UP(CA(R3))) | 2nd |
| dec2bneck | 3×3 | 1 | 64/64 | S/2 | S/2 | F(dec2reduc © dec2upr © L2) | 2nd |
| dec2 | 3×3 | 1 | 64/1 | S/2 | S/2 | F(dec2bneck) | 2nd |
| dec1up | 3×3 | 2 | 64/60 | S/2 | S | F(dec2bneck) | 1st |
| dec1upr | 3×3 | 2 | 2/1 | S/2 | S | F(UP(R2) © UP(CA(R2))) | 1st |
| dec1bneck | 3×3 | 1 | 64/64 | S | S | F(dec1reduc © dec1upr © L1) | 1st |
| dec1 | 3×3 | 1 | 64/1 | S | S | F(dec1bneck) | 1st |
Table 1 Detailed structure of the network
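Reading note for Table 1: we read "©" as channel concatenation, UP(·) as 2× upsampling, CA(·) as the coordinate attention module of Ref. [10], and R5-R2 as the per-level depth residuals. The PyTorch-style sketch below wires up one pyramid level (the 4th-level rows, dec4up through dec4) the way we read the table; the module names, the nn.Identity stand-ins for CA, and the channel bookkeeping for the dec4ca branch are our assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def up2(x):
    # 2x bilinear upsampling, used for both feature maps and depth residuals
    return F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)


class DecoderLevel4(nn.Module):
    """One decoder pyramid level as we read the 4th-level rows of Table 1."""

    def __init__(self):
        super().__init__()
        self.ca_feat = nn.Identity()                        # placeholder for CA on the encoder-side skip branch
        self.ca_res = nn.Identity()                         # placeholder for CA on the depth-residual branch
        self.dec4up = nn.Conv2d(512, 256, 3, padding=1)     # 512/256, applied after 2x upsampling
        self.dec4ca = nn.Conv2d(1024, 512, 1)               # on concat(UP(CA(.)), layer3): 512 + 512 channels
        self.dec4reduc = nn.Conv2d(768, 252, 1)             # 512 + 256 -> 252
        self.dec4upr = nn.Conv2d(2, 1, 3, padding=1)        # concat(UP(R5), UP(CA(R5))) -> 1
        self.dec4bneck = nn.Conv2d(256, 256, 3, padding=1)  # 252 + 1 + 3 -> 256
        self.dec4 = nn.Conv2d(256, 1, 3, padding=1)         # per-level depth output

    def forward(self, sa_feat, feat16, layer3, r5, lap4):
        # sa_feat, feat16: 512 ch at S/16; layer3: 512 ch at S/8; r5: 1 ch at S/16; lap4: 3 ch at S/8.
        # The table writes "layer4" in the dec4ca row; we pass a 512-channel S/16 feature (feat16)
        # so that the listed 1024-channel input works out after concatenation with layer3.
        f_up = self.dec4up(up2(sa_feat))
        f_ca = self.dec4ca(torch.cat([up2(self.ca_feat(feat16)), layer3], dim=1))
        f_red = self.dec4reduc(torch.cat([f_ca, f_up], dim=1))
        r_up = self.dec4upr(torch.cat([up2(r5), up2(self.ca_res(r5))], dim=1))
        fused = torch.cat([f_red, r_up, lap4], dim=1)        # "dec4reduc © dec4upr © L4"
        feat = self.dec4bneck(fused)
        return feat, self.dec4(feat)                         # features for the next level, depth at this level
```

The 3rd, 2nd, and 1st levels repeat the same pattern with the channel counts listed in the table.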
Fig. 6 Comparison of predicted depth maps for multiple methods ((a) The input RGB image; (b) The ground truth; (c) Ref. [5]; (d) Ref. [18]; (e) Ref. [4]; (f) Ours)
Fig. 7 Comparison of depth maps under different depth caps ((a) Input RGB image; (b) Ref. [4] (cap = 50 m); (c) Ref. [4] (cap = 80 m); (d) Ours (cap = 50 m); (e) Ours (cap = 80 m))
(↑: higher value is better; ↓: lower value is better)

| Cap | Method | δ<1.25 ↑ | δ<1.25² ↑ | δ<1.25³ ↑ | Abs Rel ↓ | Sq Rel ↓ | RMSE ↓ | RMSE log ↓ | Total_iter (M) |
|---|---|---|---|---|---|---|---|---|---|
| Cap=80 m | Ref. [ ] | 0.916 | 0.980 | 0.994 | 0.085 | 0.584 | 3.938 | 0.135 | - |
| | Ref. [ ] | 0.932 | 0.984 | 0.994 | 0.072 | 0.307 | 2.727 | 0.120 | - |
| | Ref. [ ] | 0.950 | 0.993 | 0.999 | 0.064 | 0.254 | 2.815 | 0.100 | - |
| | Ref. [ ] | 0.962 | 0.994 | 0.999 | 0.059 | 0.212 | 2.446 | 0.091 | 0.734 |
| | Ours | 0.963 | 0.995 | 0.999 | 0.058 | 0.199 | 2.328 | 0.088 | 0.470 |
| Cap=50 m | Ref. [ ] | 0.861 | 0.949 | 0.976 | 0.114 | 0.898 | 4.935 | 0.206 | - |
| | Ref. [ ] | 0.936 | 0.985 | 0.995 | 0.071 | 0.268 | 2.271 | 0.116 | - |
| | Ref. [ ] | 0.959 | 0.994 | 0.999 | 0.060 | 0.182 | 2.005 | 0.092 | - |
| | Ref. [ ] | 0.967 | 0.995 | 0.999 | 0.056 | 0.161 | 1.830 | 0.086 | 0.734 |
| | Ours | 0.967 | 0.995 | 0.999 | 0.056 | 0.156 | 1.768 | 0.084 | 0.470 |
Table 2 Quantitative comparison of prediction results with other models
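The accuracy (δ) and error columns in Tables 2-5 are the standard single-image depth metrics introduced by Eigen et al. [17]. A minimal NumPy sketch of how they are conventionally computed under a depth cap is given below; the masking convention and variable names are our assumptions, not the paper's evaluation script.

```python
import numpy as np


def depth_metrics(pred, gt, cap=80.0, min_depth=1e-3):
    """Standard monocular depth metrics under a depth cap.

    pred, gt: arrays of predicted / ground-truth depth in metres.
    Only pixels with valid ground truth inside (min_depth, cap] are evaluated.
    """
    mask = (gt > min_depth) & (gt <= cap)
    pred = np.clip(pred[mask], min_depth, cap)
    gt = gt[mask]

    # threshold accuracies: fraction of pixels with max(gt/pred, pred/gt) below 1.25^k
    thresh = np.maximum(gt / pred, pred / gt)
    d1 = np.mean(thresh < 1.25)
    d2 = np.mean(thresh < 1.25 ** 2)
    d3 = np.mean(thresh < 1.25 ** 3)

    abs_rel = np.mean(np.abs(gt - pred) / gt)
    sq_rel = np.mean(((gt - pred) ** 2) / gt)
    rmse = np.sqrt(np.mean((gt - pred) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))

    return d1, d2, d3, abs_rel, sq_rel, rmse, rmse_log
```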
| Method | Param (M) | FLOPs (B) | δ<1.25 ↑ | δ<1.25² ↑ | δ<1.25³ ↑ | Abs Rel ↓ | Sq Rel ↓ | RMSE ↓ | RMSE log ↓ |
|---|---|---|---|---|---|---|---|---|---|
| InceptionV3 [ ] | 18.13 | 30.25 | 0.936 | 0.990 | 0.997 | 0.074 | 0.302 | 2.922 | 0.114 |
| ResNet101 [ ] | 44.11 | 98.60 | 0.960 | 0.993 | 0.999 | 0.063 | 0.203 | 2.424 | 0.095 |
| VGG19 [ ] | 14.75 | 104.30 | 0.959 | 0.994 | 0.999 | 0.060 | 0.202 | 2.361 | 0.092 |
| DenseNet161 [ ] | 34.19 | 104.59 | 0.960 | 0.995 | 0.999 | 0.059 | 0.202 | 2.374 | 0.090 |
| ResNeXt101 [ ] | 74.14 | 134.76 | 0.963 | 0.995 | 0.999 | 0.058 | 0.199 | 2.328 | 0.088 |
Table 3 Comparison of experimental results of various encoders (Cap=80 m)
Fig. 8 Ablation study on the location of the CA attention mechanism ((a) Feeding Li+1 to CA and then upsampling before fusing with Li; (b) Upsampling Li+1, feeding it to CA, then fusing with Li; (c) Upsampling Li+1 and fusing it with Li after Li passes through CA; (d) Upsampling Li+1, fusing it with Li, then feeding the result to CA; (e) Upsampling Li+1 and feeding it to CA, then fusing with Li after Li passes through CA)
Fig. 9 Ablation study on applying CA attention to depth maps versus depth residuals ((a) CA attention applied to the depth map Di; (b) CA attention applied to the depth residual Ri)
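To make the Fig. 9 comparison concrete, the snippet below contrasts the two placements. nn.Identity() replaces the coordinate attention module of Ref. [10] so the sketch is self-contained, and the tensor shapes are illustrative only; variant (b) matches the F(UP(R·) © UP(CA(R·))) inputs of the dec*upr rows in Table 1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

coord_att = nn.Identity()   # placeholder for the coordinate attention (CA) module of Ref. [10]


def up2(x):
    return F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)


d_i = torch.rand(1, 1, 22, 76)   # level-i depth map D_i (toy shape)
r_i = torch.rand(1, 1, 22, 76)   # level-i depth residual R_i (toy shape)

# Fig. 9(a): CA applied to the depth map D_i before upsampling and fusion
depth_branch = torch.cat([up2(d_i), up2(coord_att(d_i))], dim=1)

# Fig. 9(b): CA applied to the depth residual R_i, as in the dec*upr rows of Table 1
resid_branch = torch.cat([up2(r_i), up2(coord_att(r_i))], dim=1)
```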
| Method | Param (M) | FLOPs (B) | δ<1.25 ↑ | δ<1.25² ↑ | δ<1.25³ ↑ | Abs Rel ↓ | Sq Rel ↓ | RMSE ↓ | RMSE log ↓ |
|---|---|---|---|---|---|---|---|---|---|
| | 73.14 | 127.40 | 0.960 | 0.994 | 0.999 | 0.060 | 0.207 | 2.421 | 0.091 |
| | 73.14 | 127.40 | 0.961 | 0.994 | 0.999 | 0.059 | 0.198 | 2.371 | 0.091 |
| | 74.14 | 134.76 | 0.963 | 0.995 | 0.999 | 0.058 | 0.199 | 2.328 | 0.088 |
| | 74.40 | 134.79 | 0.963 | 0.994 | 0.999 | 0.058 | 0.202 | 2.334 | 0.089 |
| | 74.21 | 134.76 | 0.960 | 0.995 | 0.999 | 0.059 | 0.200 | 2.341 | 0.090 |
| | 74.21 | 134.76 | 0.963 | 0.995 | 0.999 | 0.059 | 0.207 | 2.347 | 0.089 |
| | 74.40 | 134.80 | 0.962 | 0.995 | 0.999 | 0.059 | 0.205 | 2.343 | 0.090 |
| Ours | 74.14 | 134.76 | 0.963 | 0.995 | 0.999 | 0.058 | 0.199 | 2.328 | 0.088 |
Table 4 Comparison of experimental results on the location of the attention mechanism in the Laplacian pyramid (Cap=80 m)
| Method | Param (M) | FLOPs (B) | δ<1.25 ↑ | δ<1.25² ↑ | δ<1.25³ ↑ | Abs Rel ↓ | Sq Rel ↓ | RMSE ↓ | RMSE log ↓ |
|---|---|---|---|---|---|---|---|---|---|
| SE [ ] | 76.02 | 130.44 | 0.964 | 0.995 | 0.999 | 0.058 | 0.198 | 2.372 | 0.089 |
| SE+CA | 77.09 | 137.81 | 0.962 | 0.995 | 0.999 | 0.059 | 0.202 | 2.376 | 0.090 |
| CA [ ] | 74.45 | 134.78 | 0.961 | 0.994 | 0.999 | 0.059 | 0.209 | 2.353 | 0.090 |
| SA [ ] | 73.14 | 127.39 | 0.961 | 0.993 | 0.999 | 0.060 | 0.216 | 2.368 | 0.090 |
| Triplet [ ] | 73.13 | 127.40 | 0.961 | 0.994 | 0.999 | 0.061 | 0.210 | 2.393 | 0.092 |
| Triplet+CA | 74.40 | 134.78 | 0.962 | 0.994 | 0.999 | 0.060 | 0.215 | 2.360 | 0.090 |
| Ours | 74.14 | 134.76 | 0.963 | 0.995 | 0.999 | 0.058 | 0.199 | 2.328 | 0.088 |
Table 5 Comparison of ablation results for different attention mechanism types (Cap=80 m)
[1] GODARD C, MAC AODHA O, BROSTOW G J. Unsupervised monocular depth estimation with left-right consistency[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 270-279.
[2] PU Z D, CHEN S, ZOU B J, et al. A self-supervised monocular depth estimation method based on high resolution convolutional neural network[J]. Journal of Computer-Aided Design & Computer Graphics, 2023, 35(1): 118-127 (in Chinese).
[3] ZHAO L, ZHAO Y, JIN J. A self-supervised monocular depth estimation algorithm based on local attention and iterative pose refinement[J]. Journal of Signal Processing, 2022, 38(5): 1088-1097 (in Chinese).
[4] SONG M, LIM S, KIM W. Monocular depth estimation using Laplacian pyramid-based depth residuals[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2021, 31(11): 4381-4393.
[5] FU H, GONG M M, WANG C H, et al. Deep ordinal regression network for monocular depth estimation[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 2002-2011.
[6] YANG M K, YU K, ZHANG C, et al. DenseASPP for semantic segmentation in street scenes[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 3684-3692.
[7] ZHANG T, ZHANG X L, REN Y. Monocular image depth estimation based on the fusion of transformer and CNN[J]. Journal of Harbin University of Science and Technology, 2022, 27(6): 88-94 (in Chinese).
[8] RANFTL R, BOCHKOVSKIY A, KOLTUN V. Vision transformers for dense prediction[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2022: 12179-12188.
[9] WOO S, PARK J, LEE J Y, et al. CBAM: convolutional block attention module[C]// European Conference on Computer Vision. Cham: Springer International Publishing, 2018: 3-19.
[10] HOU Q B, ZHOU D Q, FENG J S. Coordinate attention for efficient mobile network design[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 13713-13722.
[11] ZHANG Q L, YANG Y B. SA-net: shuffle attention for deep convolutional neural networks[C]// 2021 IEEE International Conference on Acoustics, Speech and Signal Processing. New York: IEEE Press, 2021: 2235-2239.
[12] XIE S N, GIRSHICK R, DOLLÁR P, et al. Aggregated residual transformations for deep neural networks[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 1492-1500.
[13] HUANG G, LIU Z, VAN DER MAATEN L, et al. Densely connected convolutional networks[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 2261-2269.
[14] RUSSAKOVSKY O, DENG J, SU H, et al. ImageNet large scale visual recognition challenge[J]. International Journal of Computer Vision, 2015, 115(3): 211-252.
[15] HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 770-778.
[16] SZEGEDY C, VANHOUCKE V, IOFFE S, et al. Rethinking the inception architecture for computer vision[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 2818-2826.
[17] EIGEN D, PUHRSCH C, FERGUS R. Depth map prediction from a single image using a multi-scale deep network[EB/OL]. [2022-06-15]. https://arxiv.org/abs/1406.2283.
[18] LEE J H, HAN M K, KO D W, et al. From big to small: multi-scale local planar guidance for monocular depth estimation[EB/OL]. [2022-06-15]. https://arxiv.org/abs/1907.10326.
[19] UHRIG J, SCHNEIDER N, SCHNEIDER L, et al. Sparsity invariant CNNs[C]// 2017 International Conference on 3D Vision (3DV). New York: IEEE Press, 2017: 11-20.
[20] PASZKE A, GROSS S, MASSA F, et al. PyTorch: an imperative style, high-performance deep learning library[EB/OL]. [2022-06-15]. https://arxiv.org/abs/1912.01703.
[21] LOSHCHILOV I, HUTTER F. Decoupled weight decay regularization[EB/OL]. [2022-06-15]. https://arxiv.org/abs/1711.05101.
[22] SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[EB/OL]. [2022-06-15]. https://arxiv.org/abs/1409.1556.
[23] HUANG G, LIU Z, VAN DER MAATEN L, et al. Densely connected convolutional networks[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 4700-4708.
[24] HU J, SHEN L, SUN G. Squeeze-and-excitation networks[C]// 2018 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 7132-7141.
[25] MISRA D, NALAMADA T, ARASANIPALAI A U, et al. Rotate to attend: convolutional triplet attention module[C]// 2021 IEEE/CVF Winter Conference on Applications of Computer Vision. New York: IEEE Press, 2021: 3139-3148.
[26] WANG Z, SIMONCELLI E P, BOVIK A C. Multiscale structural similarity for image quality assessment[C]// The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers. New York: IEEE Press, 2004: 1398-1402.