Journal of Graphics ›› 2023, Vol. 44 ›› Issue (2): 271-279.DOI: 10.11996/JG.j.2095-302X.2023020271
XIONG Ju-ju1, XU Yang1,2, FAN Run-ze1, SUN Shao-cong1
Received: 2022-09-02
Accepted: 2022-11-24
Online: 2023-04-30
Published: 2023-05-01
Contact: XU Yang (1980-), associate professor, Ph.D. His main research interests cover data collection, machine learning, etc.
About author: XIONG Ju-ju (2000-), master student. His main research interest covers image processing. E-mail: juxiong0416@163.com
XIONG Ju-ju, XU Yang, FAN Run-ze, SUN Shao-cong. Flowers recognition based on lightweight visual transformer[J]. Journal of Graphics, 2023, 44(2): 271-279.
URL: http://www.txxb.com.cn/EN/10.11996/JG.j.2095-302X.2023020271
| Method | Size | Params (M) | FLOPs (G) | Throughput (image/s) | Accuracy: 102 dataset (%) | Accuracy: 104 dataset (%) |
|---|---|---|---|---|---|---|
| RegNet-4G | 224² | 20.60 | 4.00 | 367.4 | 85.3 | 84.6 |
| EfficientNet-B4 | 380² | 19.30 | 4.20 | 410.9 | 87.9 | 87.5 |
| Inception-V3 | 299² | 27.16 | 6.00 | 96.5 | 80.3 | 79.8 |
| MobileNet-160 | 224² | 5.50 | 0.58 | 755.1 | 78.3 | 77.4 |
| ViT-B | 384² | 86.40 | 55.40 | 27.2 | 82.9 | 82.6 |
| DeiT-S | 224² | 22.10 | 4.60 | 298.7 | 84.8 | 84.3 |
| Swin-T | 224² | 29.00 | 4.50 | 239.9 | 86.3 | 85.9 |
| CSwin-T | 224² | 23.00 | 4.30 | 215.6 | 87.7 | 87.2 |
| Ours | 224² | 19.30 | 3.20 | 459.3 | 88.1 | 87.3 |

Table 1 Comparison of results with other methods on two datasets
| Method | Stage1 | Stage2 | Stage3 | Stage4 | Params (M) | FLOPs (G) | Accuracy (%) |
|---|---|---|---|---|---|---|---|
| Swin-T | × | × | × | × | 29.00 | 4.50 | 86.3 |
| Ours-1 | √ | × | × | × | 26.91 | 3.90 | 85.7 |
| Ours-2 | √ | √ | × | × | 19.05 | 2.97 | 84.9 |
| Ours-3 | √ | √ | √ | × | 17.26 | 2.64 | 83.1 |
| Ours-4 | √ | √ | √ | √ | 14.35 | 2.13 | 80.6 |

Table 2 Comparison of models with different numbers of PoolFormers replaced
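Table 2 swaps the window-attention blocks in the early Swin-T stages for PoolFormer blocks, whose token mixer is plain average pooling with no learned parameters, which is why the parameter count and FLOPs fall with each replaced stage. The paper's exact implementation is not reproduced here; the following is a minimal 1-D sketch of pooling-based token mixing in the PoolFormer formulation (pooled neighborhood minus the token itself, with the identity path left to the block's residual connection):

```python
def pool_token_mixer(tokens, k=3):
    """PoolFormer-style token mixing: average-pool each token's
    neighborhood, then subtract the token itself (the identity
    path is carried by the block's residual connection)."""
    half = k // 2
    mixed = []
    for i, t in enumerate(tokens):
        # Edge-clipped pooling window around token i.
        window = tokens[max(0, i - half): i + half + 1]
        mixed.append(sum(window) / len(window) - t)
    return mixed

print(pool_token_mixer([1.0, 2.0, 3.0, 4.0]))  # [0.5, 0.0, 0.0, -0.5]
```

Because the mixer is parameter-free, replacing an attention stage removes that stage's query/key/value projections entirely, matching the trend in the Params column above.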
| Method | Stage1 | Stage2 | Stage3 | Stage4 | Params (M) | FLOPs (G) | Accuracy (%) |
|---|---|---|---|---|---|---|---|
| Swin-T | - | - | - | - | 29.00 | 4.50 | 86.3 |
| Ours-2 | - | - | - | - | 19.05 | 2.97 | 84.9 |
| Ours-2 | √ | - | - | - | 19.12 | 3.03 | 85.2 |
| Ours-2 | - | √ | - | - | 19.19 | 3.11 | 86.1 |
| Ours-2 | - | - | √ | - | 19.30 | 3.20 | 87.6 |
| Ours-2 | - | - | - | √ | 19.48 | 3.32 | 86.6 |

Table 3 Network comparison of adding DCAM modules in different positions
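Table 3 studies where inserting a DCAM attention module helps most; the best trade-off comes from Stage3, at a cost of only ~0.25 M parameters over the plain Ours-2 model. DCAM's internal structure is not given in this excerpt, so the sketch below shows only the general channel-attention pattern such modules follow (squeeze each channel to a global statistic, gate it, rescale the channel); it is an illustrative stand-in, not the paper's module:

```python
import math

def channel_attention(feature_map):
    """Generic channel-attention sketch (squeeze-and-excitation style,
    used only to illustrate the idea -- DCAM's actual structure is
    defined in the paper): squeeze each channel to its global average,
    gate it with a sigmoid, and rescale the channel's activations."""
    weights = []
    for channel in feature_map:                        # channel: list of activations
        avg = sum(channel) / len(channel)              # global average pool (squeeze)
        weights.append(1.0 / (1.0 + math.exp(-avg)))   # sigmoid gate (excite)
    return [[v * w for v in ch] for ch, w in zip(feature_map, weights)]
```

The gating weights are computed per channel from the whole feature map, which is why such modules add very few parameters relative to the backbone.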
| Method | Lcross | Lcon | Lcross+Lcon | Accuracy (%) |
|---|---|---|---|---|
| Swin-T | √ | - | - | 86.3 |
| Swin-T | - | √ | - | 86.4 |
| Swin-T | - | - | √ | 86.6 |
| Ours | √ | - | - | 87.6 |
| Ours | - | √ | - | 87.8 |
| Ours | - | - | √ | 88.1 |

Table 4 Comparison of networks using different loss functions
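Table 4's best results come from the combined objective Lcross+Lcon. The paper's exact contrastive formulation is not shown in this excerpt; the sketch below assembles the combined loss under stated assumptions (the weight `lam` and the margin-based pairwise contrastive term are illustrative choices, not the paper's definitions):

```python
import math

def cross_entropy(logits, target):
    """Softmax cross-entropy for one sample (Lcross), computed stably
    via the log-sum-exp trick."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

def contrastive(emb_a, emb_b, same_class, margin=1.0):
    """Illustrative pairwise contrastive term (Lcon): pull same-class
    embeddings together, push different-class ones past a margin."""
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(emb_a, emb_b)))
    return d ** 2 if same_class else max(0.0, margin - d) ** 2

def total_loss(logits, target, emb_a, emb_b, same_class, lam=1.0):
    """Combined objective Lcross + lam * Lcon, as in Table 4's last column."""
    return cross_entropy(logits, target) + lam * contrastive(emb_a, emb_b, same_class)
```

The contrastive term supervises the embedding space directly, which is consistent with the small but consistent accuracy gains the combined loss shows for both Swin-T and the proposed model.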