Multi-directional text detection based on the fusion of enhanced feature extraction network and semantic feature

Journal of Graphics ›› 2024, Vol. 45 ›› Issue (1): 56-64. DOI: 10.11996/JG.j.2095-302X.2024010056

• Image Processing and Computer Vision •

LV Ling¹, LI Hua¹, WANG Wu²

1. School of Computer Science and Technology, Changchun University of Science and Technology, Changchun Jilin 130000, China
2. North Navigation Control Technology Co., Ltd, Beijing 100000, China

Received: 2023-07-18 Accepted: 2023-10-25 Published: 2024-02-29 Online: 2024-02-29
Corresponding author: LI Hua (1977-), female, professor, Ph.D. Her main research interests cover computer vision and virtual reality technology. E-mail: lihua@cust.edu.cn
First author: LV Ling (1999-), female, master student. Her main research interest covers computer vision. E-mail: 15382351657@163.com
Supported by:
Jilin Natural Science Foundation (20210101412JC)

Abstract:

To address the variable length and oblique angles of text in natural scenes, a text detection method based on an enhanced feature extraction network and semantic feature fusion is proposed. First, an enhanced dilated residual module (EDRM) is designed by combining deformable convolution with atrous convolution, and it is applied to the conv4_x and conv5_x layers of ResNet18. The modified ResNet18 serves as the backbone network, which improves feature extraction while increasing feature map resolution and reducing the loss of spatial information. Second, because existing algorithms still extract text semantic features insufficiently, a bi-directional long short-term memory (BiLSTM) network is introduced into the feature fusion stage. This strengthens the fused feature map's representation of scene text and the correlation among feature sequences, and it improves the text localization ability of the model. The model is evaluated on the multi-directional text dataset ICDAR2015 and the long-text dataset MSRA-TD500. Compared with the efficient DBNet algorithm, the F-measure of the proposed algorithm is higher by 1.8 and 3.3 percentage points, respectively, showing strong competitiveness.

Professor LI Hua of Changchun University of Science and Technology, her student LV Ling, and co-authors propose a text detection method based on an enhanced feature extraction network and semantic feature fusion. By combining deformable convolution with atrous convolution, they design an enhanced dilated residual module and apply it to the conv4_x and conv5_x layers of ResNet18, which improves feature extraction and reduces the loss of spatial information. In addition, a BiLSTM network is introduced into the feature fusion stage to strengthen the fused feature map's representation of scene text and the correlation among feature sequences, improving text localization.

Key words: deformable convolution, atrous convolution, text detection, semantic feature, bi-directional long short-term memory
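
The claim that atrous convolution preserves feature-map resolution can be illustrated with the standard convolution output-size formula. The following is a minimal sketch, not the authors' implementation: the function names `conv_out` and `stage_sizes`, the 160-pixel input (a 640-pixel image after a stride-4 stem), and the specific stride and dilation settings are assumptions chosen for illustration. The abstract states only that EDRM is applied to conv4_x and conv5_x.

```python
def conv_out(size, kernel=3, stride=1, padding=1, dilation=1):
    """Output length of a convolution along one spatial axis (standard formula)."""
    effective_kernel = dilation * (kernel - 1) + 1
    return (size + 2 * padding - effective_kernel) // stride + 1


def stage_sizes(size, strides, dilations):
    """Feature-map size after each stage, one 3x3 convolution per stage.

    Padding is set equal to the dilation so that a stride-1 convolution
    keeps the spatial size unchanged.
    """
    sizes = []
    for stride, dilation in zip(strides, dilations):
        size = conv_out(size, kernel=3, stride=stride,
                        padding=dilation, dilation=dilation)
        sizes.append(size)
    return sizes


# Hypothetical settings for conv2_x .. conv5_x.
# Plain ResNet18 style: conv3_x, conv4_x and conv5_x each downsample by 2.
plain = stage_sizes(160, strides=[1, 2, 2, 2], dilations=[1, 1, 1, 1])

# Dilated variant (assumed): conv4_x and conv5_x use stride 1 with
# dilation 2 and 4 instead of downsampling.
dilated = stage_sizes(160, strides=[1, 2, 1, 1], dilations=[1, 1, 2, 4])

print("plain  :", plain)
print("dilated:", dilated)
```

Under these assumed settings, the dilated variant keeps the last two stages at the conv3_x resolution instead of halving it twice, which is the general mechanism by which dilated residual designs reduce spatial information loss.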

CLC Number:

TP391

LV Ling, LI Hua, WANG Wu. Multi-directional text detection based on the fusion of enhanced feature extraction network and semantic feature[J]. Journal of Graphics, 2024, 45(1): 56-64.

Figures/Tables 12

Fig. 1 Overall framework of our network model

Fig. 2 EDRM module structure

Fig. 3 Improved ResNet18 structure

Fig. 4 Detail of connection section

Table 1 Results of ablation experiment

Modules used				Evaluation results/%
ResNet18	EDRM-ResNet18	2-BiLSTM+FPN	3-BiLSTM+FPN	P	R	F
√	-	-	-	89.6	75.5	81.9
-	√	-	-	89.6	77.3	83.0
√	-	√	-	88.8	76.7	82.3
√	-	-	√	89.2	76.8	82.5
-	√	√	-	89.9	77.5	83.2
-	√	-	√	88.4	79.5	83.7
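
The F column in Table 1 can be checked against the P and R columns, assuming F is the usual harmonic mean of precision and recall, F = 2PR/(P + R). The page itself does not state the formula, so this is an assumption. The sketch below recomputes F for each ablation row; the function name `f_measure`, the row labels, and the 0.1 tolerance (to allow for the table's rounding to one decimal) are illustrative choices.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, F) values copied from Table 1; the labels are shorthand for the
# combination of modules marked in each row.
ablation = {
    "ResNet18": (89.6, 75.5, 81.9),
    "EDRM-ResNet18": (89.6, 77.3, 83.0),
    "ResNet18 + 2-BiLSTM+FPN": (88.8, 76.7, 82.3),
    "ResNet18 + 3-BiLSTM+FPN": (89.2, 76.8, 82.5),
    "EDRM-ResNet18 + 2-BiLSTM+FPN": (89.9, 77.5, 83.2),
    "EDRM-ResNet18 + 3-BiLSTM+FPN": (88.4, 79.5, 83.7),
}

for name, (p, r, f_reported) in ablation.items():
    f_recomputed = f_measure(p, r)
    # P and R are themselves rounded, so allow a small tolerance.
    consistent = abs(f_recomputed - f_reported) < 0.1
    print(f"{name}: recomputed F = {f_recomputed:.2f}, "
          f"reported F = {f_reported}, consistent = {consistent}")
```

Every row of Table 1 is consistent with this formula to within rounding, which supports reading the F column as the standard F-measure.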

Table 2 Comparison results on ICDAR2015 dataset/%

Method	P	R	F
CTPN^[8]	74.2	51.6	60.9
EAST^[17]	83.6	78.5	78.2
SegLink^[18]	73.1	76.8	75.0
TextBoxes++^[19]	87.2	76.7	81.7
PANNet^[11]	84.0	81.9	82.9
ATTR^[20]	85.8	79.7	82.6
Ref. [21]	82.6	81.9	82.2
PAN++^[22]	85.9	80.4	83.1
DBNet++^[23]	90.1	77.2	83.1
Ref. [24]	84.8	81.3	83.0
DBNet^[12]	89.6	75.5	81.9
Ours	88.4	79.5	83.7

Table 3 Comparison results on MSRA-TD500 dataset/%

Method	P	R	F
DeepReg^[25]	77.0	70.0	74.0
RRPN^[26]	82.0	68.0	74.0
EAST^[17]	87.3	67.4	76.1
SegLink^[18]	86.0	70.0	77.0
RRD^[27]	87.0	73.7	79.0
PixelLink^[28]	83.0	73.2	77.8
TextSnake^[29]	83.2	73.9	78.3
PAN++^[22]	81.6	80.3	80.9
DBNet++^[23]	89.7	76.5	82.6
DBNet^[12]	86.6	75.3	80.6
Ours	87.1	80.9	83.9
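
The improvements over DBNet quoted in the abstract follow directly from the DBNet and Ours rows of Tables 2 and 3. The sketch below, with the illustrative names `results` and `gain_over_dbnet`, computes the difference for each metric in percentage points; the numbers are copied from the tables, and nothing else about the evaluation is assumed.

```python
# (P, R, F) in percent, copied from Table 2 (ICDAR2015) and Table 3 (MSRA-TD500).
results = {
    "ICDAR2015": {"DBNet": (89.6, 75.5, 81.9), "Ours": (88.4, 79.5, 83.7)},
    "MSRA-TD500": {"DBNet": (86.6, 75.3, 80.6), "Ours": (87.1, 80.9, 83.9)},
}


def gain_over_dbnet(dataset):
    """Per-metric difference (Ours minus DBNet) in percentage points."""
    baseline = results[dataset]["DBNet"]
    ours = results[dataset]["Ours"]
    # Round to one decimal to match the precision of the tables.
    return {
        metric: round(new - old, 1)
        for metric, old, new in zip(("P", "R", "F"), baseline, ours)
    }


for dataset in results:
    print(dataset, gain_over_dbnet(dataset))
```

The F-measure differences come out at 1.8 and 3.3 points, matching the abstract. In both datasets the change in recall is larger than the change in precision, so the gain in F is driven mainly by recall.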

Fig. 5 Comparison of binary map loss

Fig. 6 Comparison of probability map loss

Fig. 7 Comparison of threshold map loss

Fig. 8 Comparison of detection in complex scenes ((a) Original image; (b) DBNet[12]; (c) DBNet++[23]; (d) Our results)

Fig. 9 Detection comparison of different forms of text ((a) Original image; (b) DBNet[12]; (c) DBNet++[23]; (d) Our results)

References 29

[1]	HOU J B. Research on text detection in complex scenes[D]. Beijing: University of Science and Technology Beijing, 2021 (in Chinese).
[2]	GREENHALGH J, MIRMEHDI M. Recognizing text-based traffic signs[J]. IEEE Transactions on Intelligent Transportation Systems, 2014, 16(3): 1360-1369.
[3]	CANNY J. A computational approach to edge detection[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1986, 8(6): 679-698.
[4]	SHIVAKUMARA P, PHAN T Q, TAN C L. A Laplacian approach to multi-oriented text detection in video[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011, 33(2): 412-419.
[5]	CHEN X R, YUILLE A L. Detecting and reading text in natural scenes[C]// 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR. New York: IEEE Press, 2004:II.
[6]	LIU Y L, JIN L W. Deep matching prior network: toward tighter multi-oriented text detection[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 3454-3461.
[7]	LIU W, ANGUELOV D, ERHAN D, et al. SSD: single shot MultiBox detector[C]// European Conference on Computer Vision. Cham: Springer, 2016: 21-37.
[8]	TIAN Z, HUANG W L, HE T, et al. Detecting text in natural image with connectionist text proposal network[C]// European Conference on Computer Vision. Cham: Springer, 2016: 56-72.
[9]	BAEK Y, LEE B, HAN D, et al. Character region awareness for text detection[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 9357-9366.
[10]	WANG W H, XIE E Z, LI X, et al. Shape robust text detection with progressive scale expansion network[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 9328-9337.
[11]	WANG W H, XIE E Z, SONG X G, et al. Efficient and accurate arbitrary-shaped text detection with pixel aggregation network[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2020: 8439-8448.
[12]	LIAO M H, WAN Z Y, YAO C, et al. Real-time scene text detection with differentiable binarization[C]// The AAAI conference on artificial intelligence. New York: AAAI, 2020, 34(7): 11474-11481.
[13]	DAI J F, QI H Z, XIONG Y W, et al. Deformable convolutional networks[C]// 2017 IEEE International Conference on Computer Vision. New York: IEEE Press, 2017: 764-773.
[14]	CHEN L C, PAPANDREOU G, KOKKINOS I, et al. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(4): 834-848.
[15]	KARATZAS D, GOMEZ-BIGORDA L, NICOLAOU A, et al. ICDAR 2015 competition on robust reading[C]// 2015 13th International Conference on Document Analysis and Recognition. New York: IEEE Press, 2015: 1156-1160.
[16]	YAO C, BAI X, LIU W Y, et al. Detecting texts of arbitrary orientations in natural images[C]// 2012 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2012: 1083-1090.
[17]	ZHOU X Y, YAO C, WEN H, et al. EAST: an efficient and accurate scene text detector[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 2642-2651.
[18]	SHI B G, BAI X, BELONGIE S. Detecting oriented text in natural images by linking segments[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 3482-3490.
[19]	LIAO M H, SHI B G, BAI X. TextBoxes++: a single-shot oriented scene text detector[J]. IEEE Transactions on Image Processing, 2018, 27(8): 3676-3690.
[20]	JIANG X F, XU S G, ZHANG S Q, et al. Arbitrary-shaped text detection with adaptive text region representation[J]. IEEE Access, 2020, 8: 102106-102118.
[21]	SHENG T, LIAN Z H. Bidirectional regression for arbitrary-shaped text detection[M]//Document Analysis and Recognition - ICDAR 2021. Cham: Springer International Publishing, 2021: 187-201.
[22]	WANG W H, XIE E Z, LI X, et al. PAN++: towards efficient and accurate end-to-end spotting of arbitrarily-shaped text[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44(9): 5349-5367.
[23]	LIAO M H, ZOU Z S, WAN Z Y, et al. Real-time scene text detection with differentiable binarization and adaptive scale fusion[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(1): 919-931.
[24]	XU J, GUO Z P, LIU X P, et al. Multi-directional text detection based on attention mechanism[J]. Journal of Optoelectronics·Laser, 2023: 166-173 (in Chinese).
[25]	HE W H, ZHANG X Y, YIN F, et al. Deep direct regression for multi-oriented scene text detection[C]// 2017 IEEE International Conference on Computer Vision. New York: IEEE Press, 2017: 745-753.
[26]	MA J Q, SHAO W Y, YE H, et al. Arbitrary-oriented scene text detection via rotation proposals[J]. IEEE Transactions on Multimedia, 2018, 20(11): 3111-3122.
[27]	LIAO M H, ZHU Z, SHI B G, et al. Rotation-sensitive regression for oriented scene text detection[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 5909-5918.
[28]	DENG D, LIU H F, LI X L, et al. PixelLink: detecting scene text via instance segmentation[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2018, 32(1): 6773-6780.
[29]	LONG S B, RUAN J Q, ZHANG W J, et al. TextSnake: a flexible representation for detecting text of arbitrary shapes[C]// European Conference on Computer Vision. Cham: Springer, 2018: 19-35.
