
Journal of Graphics ›› 2024, Vol. 45 ›› Issue (3): 472-481. DOI: 10.11996/JG.j.2095-302X.2024030472

• Image Processing and Computer Vision •


Orthogonal fusion image descriptor based on global attention

AI Liefu1, TAO Yong1,2, JIANG Changyu1

  1. School of Computer and Information, Anqing Normal University, Anqing, Anhui 246133, China
    2. School of Smart Transportation Modern Industry, Anhui Sanlian University, Hefei, Anhui 230601, China
  • Received: 2023-09-11  Accepted: 2023-12-29  Published: 2024-06-30  Online: 2024-06-11
  • First author: AI Liefu (1985-), associate professor, Ph.D. His main research interests cover content-based image retrieval and machine learning. E-mail: ailiefu@qq.com
  • Supported by:
    Natural Science Foundation of Anhui Province (1608085MF144); Natural Science Foundation of Anhui Province (1908085MF194); Key Natural Science Research Project of Universities in Anhui Province (KJ2020A0498)


Abstract:

Image descriptors are important research objects in computer vision and are widely applied in image classification, segmentation, recognition, and retrieval. Existing deep image descriptors lack correlation between the spatial and channel information of high-dimensional features in the local feature extraction branch, resulting in insufficient local feature representation. To address this, an image descriptor fusing local and global features was proposed. In the local feature extraction branch, multi-scale feature maps were extracted through dilated convolutions; the concatenated outputs were passed through a global attention mechanism containing a multilayer perceptron to capture correlated channel-spatial information, which was further processed to produce the final local features. The high-dimensional global branch generated a global feature vector through global pooling and full convolution. The component of the local features orthogonal to the global feature vector was then extracted and concatenated with the global features to form the final descriptor. In addition, as a feature constraint, an angular-margin loss function with sub-class centers was employed to enhance the robustness of the model on large-scale datasets. Experiments on the public Roxford5k and Rparis6k datasets showed that the mean average retrieval precision of the proposed descriptor reached 81.87% and 59.74% in medium and hard modes on Roxford5k, and 91.61% and 79.12% on Rparis6k, improvements of 1.70%, 1.56%, 2.00%, and 1.83%, respectively, over the deep orthogonal fusion descriptor. It exhibited superior retrieval accuracy over other image descriptors.
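The orthogonal fusion step described in the abstract can be sketched as follows. This is a minimal NumPy illustration under assumed shapes, not the paper's implementation: the function name `orthogonal_fusion` and the choice of average pooling over spatial positions are illustrative assumptions.

```python
import numpy as np

def orthogonal_fusion(local_feats, global_feat):
    """Fuse local and global features via orthogonal decomposition (sketch).

    local_feats: (C, H, W) feature map from the local branch (assumed shape).
    global_feat: (C,) feature vector from the global branch.
    Returns a (2C,) descriptor: pooled orthogonal component || global vector.
    """
    C, H, W = local_feats.shape
    l = local_feats.reshape(C, -1)          # (C, H*W): one column per location
    g = global_feat.reshape(C, 1)           # (C, 1)
    # Project every local feature column onto the global vector.
    proj = g @ (g.T @ l) / (np.dot(global_feat, global_feat) + 1e-12)
    orth = l - proj                         # component orthogonal to g
    pooled = orth.mean(axis=1)              # average-pool over spatial positions
    return np.concatenate([pooled, global_feat])
```

Because each column of `orth` is orthogonal to the global vector by construction, the pooled component carries only the local information not already captured by the global branch, which is the motivation for concatenating the two halves.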

Key words: image descriptor, dilated convolution, global attention, feature fusion, sub-center ArcFace loss
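The sub-center angular-margin loss named in the keywords can be sketched as follows. This is a minimal NumPy illustration of the logit computation only; the function name, the margin of 0.5, and the scale of 64 are assumptions for the sketch, not the paper's settings.

```python
import numpy as np

def subcenter_arcface_logits(x, W, K, margin=0.5, scale=64.0, target=None):
    """Sub-center ArcFace-style logits (sketch).

    x: (d,) embedding; W: (n_classes * K, d) weights, K sub-centers per class.
    For each class, keep the maximum cosine over its K sub-centers; if a
    target class is given, add the angular margin to it before scaling.
    """
    x = x / np.linalg.norm(x)
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    # Pool the K sub-center cosines of each class with a max.
    cos = (Wn @ x).reshape(-1, K).max(axis=1)       # (n_classes,)
    if target is not None:
        theta = np.arccos(np.clip(cos[target], -1.0, 1.0))
        cos[target] = np.cos(theta + margin)        # angular margin penalty
    return scale * cos
```

In training, a standard cross-entropy loss would be applied to these logits; the multiple sub-centers per class let noisy or multi-modal samples attach to different centers, which is what makes the constraint more robust on large-scale datasets.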
