Monocular depth estimation based on Laplacian pyramid with attention fusion

Journal of Graphics, 2023, Vol. 44, Issue (4): 728-738. DOI: 10.11996/JG.j.2095-302X.2023040728

• Image Processing and Computer Vision •

YU Wei-qun, LIU Jia-tao, ZHANG Ya-ping

School of Information Science and Technology, Yunnan Normal University, Kunming Yunnan 650500, China

Received: 2022-11-22 Accepted: 2023-03-27 Online: 2023-08-31 Published: 2023-08-16
Contact: ZHANG Ya-ping (1979-), professor, Ph.D. Her main research interests cover computer vision and computer graphics. E-mail: zhangyp@ynnu.edu.cn
About author:
YU Wei-qun (1998-), master student. His main research interests cover computer vision and image processing. E-mail: yudalao888@163.com
Supported by:
National Natural Science Foundation of China (61863037); Ten Thousand Talent Plans for Young Top-Notch Talents of Yunnan Province

Abstract:

With the rapid development of deep neural networks, research on deep learning-based monocular depth estimation has centered on regressing depth through encoder-decoder structures and has yielded significant results. In most conventional methods, however, the decoder simply repeats upsampling operations and therefore fails to take full advantage of the encoder's features. To address this problem, this study proposes a dense feature decoding structure combined with an attention mechanism. Taking a single RGB image as input, the feature maps from every level of the encoder are fused into the branches of the Laplacian pyramid, which increases both the depth and the breadth of feature fusion. An attention mechanism is introduced into the decoder to further improve depth estimation accuracy. Finally, a data loss and a structural similarity loss are combined to improve the stability and convergence speed of training and to reduce the training cost of the model. Experiments on the KITTI dataset show that, relative to the advanced algorithm LapDepth, the root mean square error decreases by 4.8% and the training cost is reduced by 36%, a notable improvement in both depth estimation accuracy and convergence speed.

Key words: deep learning, monocular depth estimation, attention mechanism, Laplacian pyramid, Laplacian residuals
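
The two headline percentages can be checked against the Cap=80 m rows of the KITTI comparison table on this page. The sketch below assumes that "training cost" refers to the Total_iter (M) column and that both figures are plain relative reductions with respect to LapDepth; the helper name `relative_reduction` is hypothetical.

```python
def relative_reduction(baseline, ours):
    """Relative reduction of `ours` with respect to `baseline`, in percent."""
    return (baseline - ours) / baseline * 100.0


# Values copied from the Cap=80 m rows of the comparison table
# (LapDepth, listed as reference [4], versus the proposed method).
rmse_drop = relative_reduction(2.446, 2.328)   # RMSE
iter_drop = relative_reduction(0.734, 0.470)   # Total_iter (M), assumed to be the "training cost"

print(f"RMSE reduction: {rmse_drop:.1f}%")                # RMSE reduction: 4.8%
print(f"Training-iteration reduction: {iter_drop:.1f}%")  # Training-iteration reduction: 36.0%
```

Under these assumptions both results agree with the 4.8% and 36% quoted in the abstract.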

CLC Number: TP391

YU Wei-qun, LIU Jia-tao, ZHANG Ya-ping. Monocular depth estimation based on Laplacian pyramid with attention fusion[J]. Journal of Graphics, 2023, 44(4): 728-738.


References (26)

[1]	GODARD C, MAC AODHA O, BROSTOW G J. Unsupervised monocular depth estimation with left-right consistency[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 270-279.
[2]	PU Z D, CHEN S, ZOU B J, et al. A self-supervised monocular depth estimation method based on high resolution convolutional neural network[J]. Journal of Computer-Aided Design & Computer Graphics, 2023, 35(1): 118-127 (in Chinese).
[3]	ZHAO L, ZHAO Y, JIN J. A self-supervised monocular depth estimation algorithm based on local attention and iterative pose refinement[J]. Journal of Signal Processing, 2022, 38(5): 1088-1097 (in Chinese).
[4]	SONG M, LIM S, KIM W. Monocular depth estimation using Laplacian pyramid-based depth residuals[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2021, 31(11): 4381-4393.
[5]	FU H, GONG M M, WANG C H, et al. Deep ordinal regression network for monocular depth estimation[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 2002-2011.
[6]	YANG M K, YU K, ZHANG C, et al. DenseASPP for semantic segmentation in street scenes[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 3684-3692.
[7]	ZHANG T, ZHANG X L, REN Y. Monocular image depth estimation based on the fusion of transformer and CNN[J]. Journal of Harbin University of Science and Technology, 2022, 27(6): 88-94 (in Chinese).
[8]	RANFTL R, BOCHKOVSKIY A, KOLTUN V. Vision transformers for dense prediction[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2022: 12179-12188.
[9]	WOO S, PARK J, LEE J Y, et al. CBAM: convolutional block attention module[C]// European Conference on Computer Vision. Cham: Springer International Publishing, 2018: 3-19.
[10]	HOU Q B, ZHOU D Q, FENG J S. Coordinate attention for efficient mobile network design[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 13713-13722.
[11]	ZHANG Q L, YANG Y B. SA-net: shuffle attention for deep convolutional neural networks[C]// 2021 IEEE International Conference on Acoustics, Speech and Signal Processing. New York: IEEE Press, 2021: 2235-2239.
[12]	XIE S N, GIRSHICK R, DOLLÁR P, et al. Aggregated residual transformations for deep neural networks[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 1492-1500.
[13]	HUANG G, LIU Z, VAN DER MAATEN L, et al. Densely connected convolutional networks[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 2261-2269.
[14]	RUSSAKOVSKY O, DENG J, SU H, et al. ImageNet large scale visual recognition challenge[J]. International Journal of Computer Vision, 2015, 115(3): 211-252.
[15]	HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 770-778.
[16]	SZEGEDY C, VANHOUCKE V, IOFFE S, et al. Rethinking the inception architecture for computer vision[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 2818-2826.
[17]	EIGEN D, PUHRSCH C, FERGUS R. Depth map prediction from a single image using a multi-scale deep network[EB/OL]. [2022-06-15]. https://arxiv.org/abs/1406.2283.
[18]	LEE J H, HAN M K, KO D W, et al. From big to small: multi-scale local planar guidance for monocular depth estimation[EB/OL]. [2022-06-15]. https://arxiv.org/abs/1907.10326.
[19]	UHRIG J, SCHNEIDER N, SCHNEIDER L, et al. Sparsity invariant CNNs[C]// 2017 International Conference on 3D Vision (3DV). New York: IEEE Press, 2017: 11-20.
[20]	PASZKE A, GROSS S, MASSA F, et al. PyTorch: an imperative style, high-performance deep learning library[EB/OL]. [2022-06-15]. https://arxiv.org/abs/1912.01703.
[21]	LOSHCHILOV I, HUTTER F. Decoupled weight decay regularization[EB/OL]. [2022-06-15]. https://arxiv.org/abs/1711.05101.
[22]	SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[EB/OL]. [2022-06-15]. https://arxiv.org/abs/1409.1556.
[23]	HUANG G, LIU Z, VAN DER MAATEN L, et al. Densely connected convolutional networks[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 4700-4708.
[24]	HU J, SHEN L, SUN G. Squeeze-and-excitation networks[C]// 2018 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 7132-7141.
[25]	MISRA D, NALAMADA T, ARASANIPALAI A U, et al. Rotate to attend: convolutional triplet attention module[C]// 2021 IEEE/CVF Winter Conference on Applications of Computer Vision. New York: IEEE Press, 2021: 3139-3148.
[26]	WANG Z, SIMONCELLI E P, BOVIK A C. Multiscale structural similarity for image quality assessment[C]// The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers. New York: IEEE Press, 2004: 1398-1402.

Encoder
Block	Filter	Stride	Channel	In	Out	Input
layer1	7×7	2	3/64	S	S/2	Input RGB
Maxpool	3×3	2	64/64	S/2	S/4	F(layer1)
layer2	3×3	2	64/256	S/4	S/4	F(Maxpool)
layer3	3×3	2	256/512	S/8	S/8	F(layer2)
layer4	3×3	2	512/1024	S/16	S/16	F(layer3)
Decoder
Block	Filter size	Up	Channel	In	Out	Input	Lev
reduction	1×1	1	1024/512	S/16	S/16	F(layer4)	-
ASPP	3×3	1	512/512	S/16	S/16	F(reduction)	-
sa	1×1	1	512/512	S/16	S/16	F(ASPP)	-
dec5	3×3	1	512/1	S/16	S/16	F(sa)	5th
dec4up	3×3	2	512/256	S/16	S/8	F(sa)	4th
dec4ca	1×1	2	1024/512	S/16	S/8	F((UP(CA(layer4))©layer3))	4th
dec4reduc	1×1	1	768/252	S/8	S/8	F(dec4ca©dec4up)	4th
dec4upr	3×3	2	2/1	S/16	S/8	F(UP(R5) © UP(CA(R5)))	4th
dec4bneck	3×3	1	256/256	S/8	S/8	F(dec4reduc© dec4upr ©L4)	4th
dec4	3×3	1	256/1	S/8	S/8	F(dec4bneck)	4th
dec3up	3×3	2	256/128	S/8	S/4	F(dec4bneck)	3rd
dec3ca	1×1	2	512/128	S/8	S/4	F( (UP(CA(layer3))©layer2))	3rd
dec3reduc	1×1	1	384/124	S/4	S/4	F(dec3ca©dec3up)	3rd
dec3upr	3×3	2	2/1	S/8	S/4	F(UP(R4) © UP(CA(R4)))	3rd
dec3bneck	3×3	1	128/128	S/4	S/4	F(dec3reduc© dec3upr ©L3)	3rd
dec3	3×3	1	128/1	S/4	S/4	F(dec3bneck)	3rd
dec2up	3×3	2	128/64	S/4	S/2	F(dec3bneck)	2nd
dec2ca	1×1	2	128/64	S/4	S/2	F((UP(CA(layer2))©Maxpool))	2nd
dec2reduc	1×1	1	128/60	S/2	S/2	F(dec2ca©dec2up)	2nd
dec2upr	3×3	2	2/1	S/4	S/2	F(UP(R3) © UP(CA(R3)))	2nd
dec2bneck	3×3	1	64/64	S/2	S/2	F(dec2reduc© dec2upr ©L2)	2nd
dec2	3×3	1	64/1	S/2	S/2	F(dec2bneck)	2nd
dec1up	3×3	2	64/60	S/2	S	F(dec2bneck)	1st
dec1upr	3×3	2	2/1	S/2	S	F(UP(R2) © UP(CA(R2)))	1st
dec1bneck	3×3	1	64/64	S	S	F(dec1reduc© dec1upr ©L1)	1st
dec1	3×3	1	64/1	S	S	F(dec1bneck)	1st
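
The `UP(...)` entries and the per-level residual inputs (`L4` to `L1`) in the decoder table appear to follow the usual Laplacian pyramid recipe: each level stores what is lost by downsampling, and a full-resolution map is recovered by upsampling the coarser result and adding the residual back. The sketch below illustrates only that decomposition and reconstruction on a toy map. It is not the paper's network: the 2×2 average downsampling, the nearest-neighbour upsampling, and all function names are assumptions chosen to keep the example dependency-free.

```python
def downsample(img):
    """Halve the resolution by averaging each 2x2 block (sides must be even)."""
    h, w = len(img), len(img[0])
    return [[(img[y][x] + img[y][x + 1] + img[y + 1][x] + img[y + 1][x + 1]) / 4.0
             for x in range(0, w, 2)]
            for y in range(0, h, 2)]


def upsample(img):
    """Double the resolution by nearest-neighbour repetition (the UP step, assumed)."""
    out = []
    for row in img:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def subtract(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def build_laplacian_pyramid(img, levels):
    """Return the residuals from fine to coarse, followed by the coarsest map."""
    pyramid, current = [], img
    for _ in range(levels - 1):
        smaller = downsample(current)
        pyramid.append(subtract(current, upsample(smaller)))  # detail lost at this scale
        current = smaller
    pyramid.append(current)
    return pyramid


def reconstruct(pyramid):
    """Coarse-to-fine reconstruction: upsample, then add the residual of that level."""
    current = pyramid[-1]
    for residual in reversed(pyramid[:-1]):
        current = add(upsample(current), residual)
    return current


# Toy 4x4 "depth map" split into 3 levels: 4x4 residual, 2x2 residual, 1x1 coarsest map.
depth = [[1.0, 2.0, 3.0, 4.0],
         [2.0, 3.0, 4.0, 5.0],
         [5.0, 6.0, 7.0, 8.0],
         [6.0, 7.0, 8.0, 9.0]]
pyr = build_laplacian_pyramid(depth, levels=3)
restored = reconstruct(pyr)
```

In a learned decoder the residual at each level is predicted by the network rather than computed from a known map, but the coarse-to-fine accumulation has the same shape as `reconstruct`.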

		Higher value is better			Lower value is better
Cap	Method	δ<1.25	δ<1.25²	δ<1.25³	Abs Rel	Sq Rel	RMSE	RMSE log	Total_iter (M)
Cap=80 m	Ref. [1]	0.916	0.980	0.994	0.085	0.584	3.938	0.135	-
	Ref. [5]	0.932	0.984	0.994	0.072	0.307	2.727	0.120	-
	Ref. [18]	0.950	0.993	0.999	0.064	0.254	2.815	0.100	-
	Ref. [4]	0.962	0.994	0.999	0.059	0.212	2.446	0.091	0.734
	Ours	0.963	0.995	0.999	0.058	0.199	2.328	0.088	0.470
Cap=50 m	Ref. [1]	0.861	0.949	0.976	0.114	0.898	4.935	0.206	-
	Ref. [5]	0.936	0.985	0.995	0.071	0.268	2.271	0.116	-
	Ref. [18]	0.959	0.994	0.999	0.060	0.182	2.005	0.092	-
	Ref. [4]	0.967	0.995	0.999	0.056	0.161	1.830	0.086	0.734
	Ours	0.967	0.995	0.999	0.056	0.156	1.768	0.084	0.470
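
The column names in this table match the error measures commonly used for monocular depth evaluation. The sketch below shows the usual definitions, assuming that a "Cap" simply excludes ground-truth pixels beyond the stated distance and that predictions are strictly positive. The function name `depth_metrics` and the toy values are hypothetical, and the paper's exact evaluation protocol (cropping, clamping of predictions) is not given on this page.

```python
import math


def depth_metrics(pred, gt, cap=80.0):
    """Common depth-estimation metrics over valid pixels (0 < gt <= cap)."""
    pairs = [(p, g) for p, g in zip(pred, gt) if 0.0 < g <= cap]
    n = len(pairs)
    # Threshold accuracy uses the larger of the two ratios pred/gt and gt/pred.
    ratios = [max(p / g, g / p) for p, g in pairs]
    return {
        "delta1": sum(r < 1.25 for r in ratios) / n,
        "delta2": sum(r < 1.25 ** 2 for r in ratios) / n,
        "delta3": sum(r < 1.25 ** 3 for r in ratios) / n,
        "abs_rel": sum(abs(p - g) / g for p, g in pairs) / n,
        "sq_rel": sum((p - g) ** 2 / g for p, g in pairs) / n,
        "rmse": math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n),
        "rmse_log": math.sqrt(
            sum((math.log(p) - math.log(g)) ** 2 for p, g in pairs) / n),
    }


# Toy example in metres; the last pixel lies beyond the 80 m cap and is ignored.
gt = [10.0, 20.0, 40.0, 100.0]
pred = [11.0, 20.0, 30.0, 90.0]
m = depth_metrics(pred, gt, cap=80.0)
```

For the toy input, two of the three valid pixels fall within the 1.25 ratio, so `m["delta1"]` is 2/3 while `m["delta2"]` is 1.0.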

			Higher value is better			Lower value is better
Backbone	Param (M)	FLOPs (B)	δ<1.25	δ<1.25²	δ<1.25³	Abs Rel	Sq Rel	RMSE	RMSE log
InceptionV3 [12]	18.13	30.25	0.936	0.990	0.997	0.074	0.302	2.922	0.114
ResNet101 [11]	44.11	98.60	0.960	0.993	0.999	0.063	0.203	2.424	0.095
VGG19 [18]	14.75	104.30	0.959	0.994	0.999	0.060	0.202	2.361	0.092
DenseNet161 [19]	34.19	104.59	0.960	0.995	0.999	0.059	0.202	2.374	0.090
ResNeXt101 [9]	74.14	134.76	0.963	0.995	0.999	0.058	0.199	2.328	0.088
