
Journal of Graphics (图学学报), 2023, Vol. 44, Issue (2): 271-279. DOI: 10.11996/JG.j.2095-302X.2023020271

• Image Processing and Computer Vision •

Flowers recognition based on lightweight visual transformer

XIONG Ju-ju1, XU Yang1,2, FAN Run-ze1, SUN Shao-cong1

  1. College of Big Data and Information Engineering, Guizhou University, Guiyang, Guizhou 550025, China
    2. Guiyang Aluminum-Magnesium Design and Research Institute Co., Ltd., Guiyang, Guizhou 550025, China
  • Received: 2022-09-02 Accepted: 2022-11-24 Online: 2023-04-30 Published: 2023-05-01
  • Contact: XU Yang (1980-), associate professor, Ph.D. His main research interests cover data collection, machine learning, etc. E-mail: xuy@gzu.edu.cn
  • About author: XIONG Ju-ju (2000-), master student. His main research interest covers digital image processing. E-mail: juxiong0416@163.com
  • Supported by:
    Science and Technology Plan Project of Guizhou Province(Qian Kehe [2021] General 176)

Abstract:

Due to the similarity between different kinds of flowers and the variation within the same kind, convolutional neural networks (CNN), which extract local feature information, fall short of ideal results in flower image recognition. Based on the Swin Transformer (Swin-T) network, this paper proposed a lightweight Transformer network named LWFormer. Firstly, the network introduced a shifted-window-based PoolFormer module into the first and second stages of the Swin-T network to make the network lightweight. Secondly, a dual-channel attention mechanism was introduced, in which two independent channels focused on the "location" and "content" of the feature map, respectively, improving the network's ability to extract global feature information. Finally, a contrastive loss function was employed to further optimize the performance of the network. The improved model was evaluated on two public datasets, Oxford 102 Flower Dataset and 104 Flowers Garden of Eden, and compared with other methods, achieving accuracies of 88.1% and 87.3%, respectively. Compared with the Swin-T network, the number of parameters was reduced by 33.45%, FLOPs were reduced by 28.89%, throughput was increased by 91.45%, and accuracy was increased by 1.8%. Experimental results showed that the proposed network improved accuracy while reducing the number of parameters, yielding gains in both speed and precision.
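As context for the PoolFormer blocks mentioned above: PoolFormer replaces the self-attention token mixer with simple spatial average pooling. The PyTorch sketch below illustrates that published design only; it is not the authors' LWFormer code, and the class names, pool size, and MLP ratio are assumptions for illustration.

```python
# Illustrative sketch of a PoolFormer-style block; not the LWFormer implementation.
import torch
import torch.nn as nn

class PoolingMixer(nn.Module):
    """Token mixer used by PoolFormer: average pooling minus identity."""
    def __init__(self, pool_size: int = 3):
        super().__init__()
        self.pool = nn.AvgPool2d(pool_size, stride=1,
                                 padding=pool_size // 2,
                                 count_include_pad=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W); subtracting the input keeps only the mixed residual
        return self.pool(x) - x

class PoolFormerBlock(nn.Module):
    """MetaFormer-style block: pooling token mixer followed by a channel MLP."""
    def __init__(self, dim: int, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.GroupNorm(1, dim)   # channel-wise normalization
        self.mixer = PoolingMixer()
        self.norm2 = nn.GroupNorm(1, dim)
        self.mlp = nn.Sequential(
            nn.Conv2d(dim, dim * mlp_ratio, 1),
            nn.GELU(),
            nn.Conv2d(dim * mlp_ratio, dim, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.mixer(self.norm1(x))   # token mixing without attention
        x = x + self.mlp(self.norm2(x))     # per-location channel MLP
        return x
```

Because the pooling mixer has no learned weights and costs far less than windowed self-attention, placing such blocks in the early, high-resolution stages is consistent with the reported reductions in parameters and FLOPs.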

Key words: flower recognition, lightweight, attention mechanism, dual-channel attention, contrastive loss function
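The abstract does not specify which contrastive formulation LWFormer uses. As one minimal possibility, the classic pairwise contrastive loss below pulls embeddings of the same flower species together and pushes different species apart; the function name, arguments, and margin value are illustrative assumptions rather than the paper's implementation.

```python
# Illustrative sketch of a classic pairwise contrastive loss; not the LWFormer code.
import torch
import torch.nn.functional as F

def pairwise_contrastive_loss(z1: torch.Tensor,
                              z2: torch.Tensor,
                              same_class: torch.Tensor,
                              margin: float = 1.0) -> torch.Tensor:
    """z1, z2: (B, D) embeddings of two images.
    same_class: (B,) 1.0 if the pair shows the same flower species, else 0.0."""
    d = F.pairwise_distance(z1, z2)                        # Euclidean distance per pair
    pos = same_class * d.pow(2)                            # pull same-species pairs together
    neg = (1.0 - same_class) * F.relu(margin - d).pow(2)   # push different species apart
    return (pos + neg).mean()
```

In practice such a term is usually added to the standard cross-entropy classification loss with a small weighting factor.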

CLC Number: