
Journal of Graphics ›› 2024, Vol. 45 ›› Issue (3): 472-481. DOI: 10.11996/JG.j.2095-302X.2024030472

• Image Processing and Computer Vision •


Orthogonal fusion image descriptor based on global attention

AI Liefu1, TAO Yong1,2, JIANG Changyu1

  1. School of Computer and Information, Anqing Normal University, Anqing, Anhui 246133, China
    2. School of Smart Transportation Modern Industry, Anhui Sanlian University, Hefei, Anhui 230601, China
  • Received: 2023-09-11  Accepted: 2023-12-29  Published: 2024-06-30  Online: 2024-06-11
  • First author: AI Liefu (1985-), associate professor, Ph.D. His main research interests cover content-based image retrieval and machine learning. E-mail: ailiefu@qq.com
  • Supported by:
    Natural Science Foundation of Anhui Province (1608085MF144); Natural Science Foundation of Anhui Province (1908085MF194); Key Natural Science Research Project of Universities in Anhui Province (KJ2020A0498)


Abstract:

Image descriptors are important research objects in computer vision and are widely applied in image classification, segmentation, recognition, and retrieval. Existing deep image descriptors lack correlation between the spatial and channel information of high-dimensional features in the local feature extraction branch, resulting in insufficient local feature representation. To address this, an image descriptor fusing local and global features was proposed. In the local feature extraction branch, multi-scale feature maps were extracted through dilated convolutions; the concatenated outputs were passed through a global attention mechanism containing a multilayer perceptron to capture correlated channel-spatial information, which was further processed to produce the final local features. The high-dimensional global branch generated a global feature vector through global pooling and full convolution. The component of the local features orthogonal to the global feature vector was then extracted and concatenated with the global features to form the final descriptor. In addition, as a feature constraint, an angular-margin loss function with sub-class centers was employed to enhance the robustness of the model on large-scale datasets. Experiments on the public Roxford5k and Rparis6k datasets showed that the mean average retrieval precision of the proposed descriptor reached 81.87% and 59.74% in medium and hard modes on Roxford5k, and 91.61% and 79.12% on Rparis6k, improvements of 1.70%, 1.56%, 2.00%, and 1.83%, respectively, over the deep orthogonal fusion descriptor. It exhibited superior retrieval accuracy over other image descriptors.
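The orthogonal fusion step described in the abstract can be sketched as follows. This is a minimal NumPy illustration under assumed shapes, not the paper's implementation: the function name `orthogonal_fusion` and the choice of average pooling over spatial positions are illustrative assumptions.

```python
import numpy as np

def orthogonal_fusion(local_feats, global_feat):
    """Fuse local and global features via orthogonal decomposition (sketch).

    local_feats: (C, H, W) feature map from the local branch (assumed shape).
    global_feat: (C,) feature vector from the global branch.
    Returns a (2C,) descriptor: pooled orthogonal component || global vector.
    """
    C, H, W = local_feats.shape
    l = local_feats.reshape(C, -1)          # (C, H*W): one column per location
    g = global_feat.reshape(C, 1)           # (C, 1)
    # Project every local feature column onto the global vector.
    proj = g @ (g.T @ l) / (np.dot(global_feat, global_feat) + 1e-12)
    orth = l - proj                         # component orthogonal to g
    pooled = orth.mean(axis=1)              # average-pool over spatial positions
    return np.concatenate([pooled, global_feat])
```

Because each column of `orth` is orthogonal to the global vector by construction, the pooled component carries only the local information not already captured by the global branch, which is the motivation for concatenating the two halves.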

Key words: image descriptor, dilated convolution, global attention, feature fusion, sub-center ArcFace loss
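The sub-center angular-margin loss named in the keywords can be sketched as follows. This is a minimal NumPy illustration of the logit computation only; the function name, the margin of 0.5, and the scale of 64 are assumptions for the sketch, not the paper's settings.

```python
import numpy as np

def subcenter_arcface_logits(x, W, K, margin=0.5, scale=64.0, target=None):
    """Sub-center ArcFace-style logits (sketch).

    x: (d,) embedding; W: (n_classes * K, d) weights, K sub-centers per class.
    For each class, keep the maximum cosine over its K sub-centers; if a
    target class is given, add the angular margin to it before scaling.
    """
    x = x / np.linalg.norm(x)
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    # Pool the K sub-center cosines of each class with a max.
    cos = (Wn @ x).reshape(-1, K).max(axis=1)       # (n_classes,)
    if target is not None:
        theta = np.arccos(np.clip(cos[target], -1.0, 1.0))
        cos[target] = np.cos(theta + margin)        # angular margin penalty
    return scale * cos
```

In training, a standard cross-entropy loss would be applied to these logits; the multiple sub-centers per class let noisy or multi-modal samples attach to different centers, which is what makes the constraint more robust on large-scale datasets.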
