
Journal of Graphics ›› 2024, Vol. 45 ›› Issue (3): 454-463. DOI: 10.11996/JG.j.2095-302X.2024030454

• Image Processing and Computer Vision •

Monocular depth estimation combining pyramid structure and attention mechanism

LI Tao, HU Ting, WU Dandan

  1. School of Electrical Engineering and Electronic Information, Xihua University, Chengdu, Sichuan 610039, China
  • Received: 2023-12-25 Accepted: 2024-02-06 Published: 2024-06-30 Online: 2024-06-06
  • First author: LI Tao (1983-), female, associate professor, Ph.D. Her main research interests cover image/video compression and restoration, image super-resolution reconstruction, and depth image restoration. E-mail: litao@mail.xhu.edu.cn
  • Supported by:
    The Department of Science and Technology of Sichuan Province (2021YJ0109); National Natural Science Foundation of China (61901392); National Natural Science Foundation of China (62041109)

Abstract:

Monocular depth estimation predicts a dense depth map from a single color image. To address the blurred boundaries and insufficient capture of contextual information in current monocular depth estimation algorithms, an algorithm combining a pyramid structure and attention mechanisms was proposed. The algorithm adopted an encoder-decoder framework. The encoder used the PVTv2 network, exploiting the strength of Transformer networks in modeling global information to obtain richer global semantic information. The decoder consisted of a main depth estimation branch and two pyramid sub-branches. The main branch used spatial and channel attention mechanisms to focus adaptively on the important feature regions and feature channels of the encoder and decoder features. The Laplacian pyramid sub-branch and the depth residual pyramid sub-branch learned rich local information from the color image and from the depth features of the main branch, and passed it to the main branch to further reduce missing details and disordered structures in the estimated depth. Experimental results showed that, compared with the advanced algorithm P3Depth, on the indoor public dataset NYU Depth V2 the threshold accuracy (δ < 1.25) increased by 1.22%, and the absolute error and root mean square error decreased by 5.8% and 2.8%, respectively. On the outdoor public dataset KITTI, the absolute error, root mean square logarithmic error, and root mean square error decreased by 8.5%, 3.9%, and 0.4%, respectively. The algorithm improved depth estimation accuracy and produced visually convincing depth maps.
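
The evaluation metrics named in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code. It assumes the definitions commonly used on NYU Depth V2 and KITTI, and in particular that "absolute error" means the mean absolute relative error. The function name `depth_metrics` and the flat-list inputs are hypothetical, and dataset-specific crops and depth caps are omitted.

```python
import math


def depth_metrics(pred, gt, threshold=1.25):
    """Compute common depth metrics over paired per-pixel depth values.

    Assumption: `pred` and `gt` are flat sequences of depths in the same unit.
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same length")
    # Keep only pixels where both depths are positive, so the ratios and
    # logarithms below are well defined.
    pairs = [(p, g) for p, g in zip(pred, gt) if p > 0 and g > 0]
    if not pairs:
        raise ValueError("no valid pixels to evaluate")
    n = len(pairs)

    # Threshold accuracy: share of pixels with max(p/g, g/p) < threshold.
    delta = sum(1 for p, g in pairs if max(p / g, g / p) < threshold) / n
    # Mean absolute relative error (assumed meaning of "absolute error").
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    # Root mean square error.
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    # Root mean square error of the log depths.
    rmse_log = math.sqrt(
        sum((math.log(p) - math.log(g)) ** 2 for p, g in pairs) / n
    )
    return {"delta": delta, "abs_rel": abs_rel, "rmse": rmse, "rmse_log": rmse_log}
```

Higher threshold accuracy is better, while lower is better for the three error metrics, which is why the abstract reports an increase for the first and decreases for the others.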

Key words: deep learning, monocular depth estimation, pyramid structure, attention mechanism, Transformer
