Journal of Graphics ›› 2023, Vol. 44 ›› Issue (3): 473-481. DOI: 10.11996/JG.j.2095-302X.2023030473
Natural scene text detection based on attention mechanism and deep multi-scale feature fusion

LI Yu 1, YAN Tian-tian 1, ZHOU Dong-sheng 1,2, WEI Xiao-peng 2
Received: 2022-10-27
Accepted: 2023-01-12
Online: 2023-06-30
Published: 2023-06-30
Contact: ZHOU Dong-sheng (1978-), professor, Ph.D. His main research interests cover computer graphics and vision, human-robot interaction, etc. E-mail: zhouds@dlu.edu.cn
About author: LI Yu (1997-), master student. Her main research interest covers computer vision. E-mail: y18337275282@163.com
Abstract: To address the problem that existing scene text detection methods cannot deeply mine and fully fuse the discriminative features of multi-scale text instances, a natural scene text detection method based on an attention mechanism and deep multi-scale feature fusion was proposed. First, ResNeSt50 with attention enhancement was adopted as the backbone network to extract more discriminative feature representations of text instances at different scales. Then, a deep multi-scale feature fusion module was designed to exchange feature information across scales and to adaptively learn the weight matrix corresponding to each scale's feature map; these weights fuse the discriminative features of text instances across the multi-scale feature maps, yielding a more robust multi-scale fused feature map. Finally, an adaptive binarization post-processing module was employed to generate more accurate text region bounding boxes. To evaluate its effectiveness, extensive experiments were conducted on the ICDAR2015, ICDAR2013, and CTW1500 datasets. The results show that, compared with other state-of-the-art detection methods, the proposed method achieved competitive results and exhibited strong robustness and generalization ability.
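The adaptive fusion step described in the abstract can be illustrated with a minimal numerical sketch (the function and variable names here are our own, not from the paper): feature maps from different scales, resized to a common resolution, are combined using softmax-normalized per-scale weight matrices, so the weights at each spatial position sum to 1 over the scales.

```python
import numpy as np

def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_fuse(feats, weight_logits):
    """Fuse S same-sized feature maps with adaptively learned weights.

    feats:         list of S arrays of shape (C, H, W), already resized
                   to a common resolution.
    weight_logits: array (S, H, W) of learned per-scale, per-pixel scores.
    Returns the fused map of shape (C, H, W).
    """
    w = softmax(np.asarray(weight_logits), axis=0)   # (S, H, W), sums to 1 over S
    stacked = np.stack(feats)                        # (S, C, H, W)
    return (w[:, None, :, :] * stacked).sum(axis=0)  # weighted sum -> (C, H, W)
```

With all-zero logits the weights are uniform and the fusion reduces to a plain average of the scales; in the paper the weight matrices are learned, so discriminative scales dominate per position.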
LI Yu, YAN Tian-tian, ZHOU Dong-sheng, WEI Xiao-peng. Natural scene text detection based on attention mechanism and deep multi-scale feature fusion[J]. Journal of Graphics, 2023, 44(3): 473-481.
| ResNet50 | ResNeSt50 | MFCAM | AFM | P (%) | R (%) | F (%) |
|---|---|---|---|---|---|---|
| √ | - | - | - | 85.0 | 82.7 | 83.8 |
| - | √ | - | - | 87.2 | 83.2 | 85.2 |
| - | √ | √ | - | 90.6 | 84.0 | 87.2 |
| - | √ | - | √ | 88.3 | 84.5 | 86.4 |
| - | √ | √ | √ | 91.3 | 85.7 | 88.4 |

Table 1 Results of ablation experiments (the first four columns mark the modules added)
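In Tables 1-4, P, R, and F denote precision, recall, and F-measure; F is the harmonic mean of P and R, which can be checked directly against the reported rows:

```python
def f_measure(p, r):
    """F-measure: harmonic mean of precision p and recall r (in percent)."""
    return 2 * p * r / (p + r)
```

For example, the first and last rows of Table 1 give `round(f_measure(85.0, 82.7), 1)` → 83.8 and `round(f_measure(91.3, 85.7), 1)` → 88.4, matching the F column.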
| Method | P | R | F |
|---|---|---|---|
| TextBox++ | 87.2 | 76.7 | 81.7 |
| OPMP | 89.1 | 85.5 | 87.3 |
| ASBNet | 78.2 | 84.3 | 81.2 |
| ERFFC | 85.4 | 78.9 | 82.0 |
| PSENet | 86.9 | 84.5 | 85.7 |
| SADA | 88.8 | 82.6 | 85.6 |
| EFPN | 89.2 | 82.0 | 85.5 |
| TDMF | 79.9 | 85.3 | 82.5 |
| Quadbox | 88.7 | 81.8 | 85.1 |
| FDTA | 89.0 | 81.2 | 84.9 |
| TextSnake | 84.9 | 80.4 | 82.6 |
| DBNet++ | 90.9 | 83.9 | 87.3 |
| Ref. [ ] | 80.3 | 69.1 | 74.5 |
| Ours | 91.3 | 85.7 | 88.4 |

Table 2 Comparison results on ICDAR2015 dataset (%)
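The adaptive binarization post-processing mentioned in the abstract is in the spirit of the differentiable binarization of DBNet [14] (also the basis of DBNet++ in the table above): a per-pixel probability map P and a learned threshold map T yield an approximate binary map B = 1 / (1 + e^{-k(P-T)}). A minimal sketch (variable names are ours):

```python
import numpy as np

def differentiable_binarization(prob_map, thresh_map, k=50.0):
    """Approximate binary map B = 1 / (1 + exp(-k * (P - T))), as in DB [14].
    The amplification factor k makes the sigmoid nearly a step function,
    while remaining differentiable so it can be trained end to end."""
    return 1.0 / (1.0 + np.exp(-k * (prob_map - thresh_map)))
```

Pixels whose probability clearly exceeds the local threshold map to values near 1, and pixels clearly below it to values near 0; the resulting map is then traced into text region bounding boxes.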
| Method | P | R | F |
|---|---|---|---|
| OPMP | 85.1 | 80.8 | 82.9 |
| TS | 78.2 | 77.8 | 78.0 |
| Non-Local PAN | 78.9 | 83.8 | 81.3 |
| ASBNet | 85.1 | 75.5 | 80.0 |
| PSENet | 80.6 | 75.6 | 78.0 |
| SADA | 86.2 | 80.4 | 83.2 |
| TextSnake | 67.9 | 85.3 | 75.6 |
| TextRay | 80.4 | 82.8 | 81.6 |
| ATRR | 80.1 | 80.2 | 80.1 |
| Ours | 81.6 | 87.2 | 84.3 |

Table 3 Comparison results on CTW1500 dataset (%)
| Method | P | R | F |
|---|---|---|---|
| TextBox++ | 86.0 | 74.0 | 80.0 |
| Faster-RCNN | 71.0 | 75.0 | 73.0 |
| TS | 88.0 | 81.7 | 84.7 |
| Ref. [ ] | 80.8 | 69.1 | 74.5 |
| Ref. [ ] | 88.0 | 84.5 | 86.2 |
| TDMF | 83.2 | 68.4 | 73.0 |
| QuadBox | 88.0 | 81.0 | 84.0 |
| Ref. [ ] | 80.8 | 69.1 | 74.5 |
| TransDETR | 80.6 | 70.2 | 75.0 |
| SRMCA | 79.0 | 81.0 | 80.0 |
| Ours | 89.5 | 85.8 | 87.6 |

Table 4 Comparison results on ICDAR2013 dataset (%)
[1] WANG J X, WANG Z Y, TIAN X. Review of natural scene text detection and recognition based on deep learning[J]. Journal of Software, 2020, 31(5): 1465-1496. (in Chinese)
[2] LIU C Y, CHEN X X, LUO C J, et al. Deep learning methods for scene text detection and recognition[J]. Journal of Image and Graphics, 2021, 26(6): 1330-1367. (in Chinese)
[3] KIM K I, JUNG K, KIM J H. Texture-based approach for text detection in images using support vector machines and continuously adaptive mean shift algorithm[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2003, 25(12): 1631-1639.
[4] MINETTO R, THOME N, CORD M, et al. T-HOG: an effective gradient-based descriptor for single line text regions[J]. Pattern Recognition, 2013, 46(3): 1078-1090.
[5] LIAO M H, SHI B G, BAI X. TextBoxes++: a single-shot oriented scene text detector[J]. IEEE Transactions on Image Processing, 2018, 27(8): 3676-3690.
[6] ZHANG S, LIU Y L, JIN L W, et al. OPMP: an omnidirectional pyramid mask proposal network for arbitrary-shape scene text detection[J]. IEEE Transactions on Multimedia, 2021, 23: 454-467.
[7] REN S Q, HE K M, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149.
[8] LIU W, ANGUELOV D, ERHAN D, et al. SSD: single shot MultiBox detector[C]// European Conference on Computer Vision. Cham: Springer International Publishing, 2016: 21-37.
[9] YI Y H, HE J J, LU L Q, et al. Association of text and other objects for text detection with natural scene images[J]. Journal of Image and Graphics, 2020, 25(1): 126-135. (in Chinese)
[10] WANG C, ZHAO S, ZHU L, et al. Semi-supervised pixel-level scene text segmentation by mutually guided network[J]. IEEE Transactions on Image Processing, 2021, 30(5): 8212-8221.
[11] SHI G C, WU Y R. Arbitrary shape scene-text detection based on pixel aggregation and feature enhancement[J]. Journal of Image and Graphics, 2021, 26(7): 1614-1624. (in Chinese)
[12] LIAO M H, LYU P Y, HE M H, et al. Mask TextSpotter: an end-to-end trainable neural network for spotting text with arbitrary shapes[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, 43(2): 532-548.
[13] HE K M, GKIOXARI G, DOLLÁR P, et al. Mask R-CNN[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 42(2): 386-397.
[14] LIAO M H, ZOU Z S, WAN Z Y, et al. Real-time scene text detection with differentiable binarization[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(1): 919-931.
[15] HU J, SHEN L, ALBANIE S, et al. Squeeze-and-excitation networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 42(8): 2011-2023.
[16] WOO S, PARK J, LEE J Y, et al. CBAM: convolutional block attention module[C]// European Conference on Computer Vision. Cham: Springer International Publishing, 2018: 3-19.
[17] ZHANG L, LIU Y F, XIAO H, et al. Efficient scene text detection with textual attention tower[C]// ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing. New York: IEEE Press, 2020: 4272-4276.
[18] LIU X H, CHEN X K, KUANG H L, et al. A multi-level feature fusion network for scene text detection with text attention mechanism[C]// 2021 IEEE 4th Advanced Information Management, Communicates, Electronic and Automation Control Conference. New York: IEEE Press, 2021: 954-958.
[19] LIANG H R, YE L C, LIANG R H, et al. Text detection algorithm for natural scenes under attention supervision strategy[J]. Journal of Computer-Aided Design & Computer Graphics, 2022, 34(7): 1011-1019. (in Chinese)
[20] LI X Y, SONG Y H, YU T. Text detection in natural scene images based on enhanced receptive field and fully convolution network[J]. Acta Automatica Sinica, 2022, 48(3): 797-807. (in Chinese)
[21] YANG S Q, YI Y H, TANG Z W, et al. Text detection in natural scenes embedded attention mechanism[J]. Computer Engineering and Applications, 2021, 57(24): 185-191. (in Chinese)
[22] LI X, WANG W H, HOU W B, et al. Shape robust text detection with progressive scale expansion network[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 9328-9337.
[23] WANG Y X, XIE H T, ZHA Z J, et al. R-Net: a relationship network for efficient and accurate scene text detection[J]. IEEE Transactions on Multimedia, 2021, 23: 1316-1329.
[24] SHAO H L, JI Y, LI Y, et al. BDFPN: bi-direction feature pyramid network for scene text detection[C]// 2021 International Joint Conference on Neural Networks. New York: IEEE Press, 2021: 1-8.
[25] ZHANG H, WU C R, ZHANG Z Y, et al. ResNeSt: split-attention networks[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. New York: IEEE Press, 2022: 2735-2745.
[26] HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 770-778.
[27] KARATZAS D, GOMEZ-BIGORDA L, NICOLAOU A, et al. ICDAR 2015 competition on robust reading[C]// The 13th International Conference on Document Analysis and Recognition. New York: IEEE Press, 2015: 1156-1160.
[28] DAI P W, LI Y, ZHANG H, et al. Accurate scene text detection via scale-aware data augmentation and shape similarity constraint[J]. IEEE Transactions on Multimedia, 2021, 24: 1883-1895.
[29] LIAO M H, ZOU Z S, WAN Z Y, et al. Real-time scene text detection with differentiable binarization and adaptive scale fusion[EB/OL]. (2022-02-21) [2022-10-22]. https://arxiv.org/abs/2202.10304.
[30] CHEN Z, WANG G Y, LIU Q. Natural scene text detection algorithm combining multi-granularity feature fusion[J]. Computer Science, 2021, 48(12): 243-248. (in Chinese)
[31] KESERWANI P, DHANKHAR A, SAINI R, et al. Quadbox: quadrilateral bounding box based scene text detection using vector regression[J]. IEEE Access, 2021, 9: 36802-36818.
[32] CAO Y C, MA S S, PAN H C. FDTA: fully convolutional scene text detection with text attention[J]. IEEE Access, 2020, 8: 155441-155449.
[33] LONG S B, RUAN J Q, ZHANG W J, et al. TextSnake: a flexible representation for detecting text of arbitrary shapes[C]// European Conference on Computer Vision. Cham: Springer International Publishing, 2018: 19-35.
[34] SHAO H L, JI Y, LIU C P, et al. Scene text detection algorithm based on enhanced feature pyramid network[J]. Computer Science, 2022, 49(2): 248-255. (in Chinese)
[35] YANG J F, WANG R M, HE X, et al. Multi-oriented natural scene text detection algorithm based on FCN[J]. Computer Engineering and Applications, 2020, 56(2): 164-170. (in Chinese)
[36] WANG F F, CHEN Y F, WU F, et al. TextRay: contour-based geometric modeling for arbitrary-shaped scene text detection[C]// The 28th ACM International Conference on Multimedia. New York: ACM, 2020: 111-119.
[37] WANG X B, JIANG Y Y, LUO Z B, et al. Arbitrary shape scene text detection with adaptive text region representation[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 6442-6451.
[38] WU W J, ZHANG D B, FU Y, et al. End-to-end video text spotting with transformer[EB/OL]. [2022-05-10]. https://www.researchgate.net/publication/359390680_End-to-End_Video_Text_Spotting_with_Transformer.
[39] LIU S P, XIAN Y T, LI H F, et al. Text detection in natural scene images using morphological component analysis and Laplacian dictionary[J]. IEEE/CAA Journal of Automatica Sinica, 2020, 7(1): 214-222.