
Journal of Graphics ›› 2022, Vol. 43 ›› Issue (6): 1150-1158. DOI: 10.11996/JG.j.2095-302X.2022061150

• Image Processing and Computer Vision •


Multi-scale modality perception network for referring image segmentation

  1. School of Artificial Intelligence and Automation, Beijing University of Technology, Beijing 100124, China;  2. School of Mathematical Sciences, Dalian University of Technology, Dalian 116024, Liaoning, China
  • Online:2022-12-30 Published:2023-01-11
  • Supported by:
    The 7th National Postdoctoral Innovative Talent Support Program (BX20220025); The 70th Batch of National Post-Doctoral Research Grants (2021M700303) 


Abstract:

Referring image segmentation (RIS) is the task of parsing the instance referred to by a text description and segmenting that instance in the corresponding image; it is a popular research topic in computer vision and media. Most current RIS methods fuse text and image modality information at a single scale to perceive the location and semantics of the referred instance. However, single-scale modal information can hardly cover both the semantic and the structural context needed to locate instances of different sizes; this hinders the model from perceiving referents of arbitrary size and in turn degrades its segmentation of referents of different sizes. To address this problem, this paper designs a multi-scale visual-language interaction perception module and a multi-scale mask prediction module: the former strengthens the model's perception of instances at different scales and promotes effective alignment of semantics across modalities; the latter improves segmentation by fully capturing the semantic and structural information that instances of each scale require. Building on these two modules, this paper proposes a multi-scale modality perception network for referring image segmentation (MMPN-RIS). Experimental results show that MMPN-RIS achieves state-of-the-art oIoU scores on the three public datasets RefCOCO, RefCOCO+, and RefCOCOg, and that it also performs well when the text refers to instances of different scales.
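The two modules can be caricatured in a few lines of NumPy. The following is an illustrative sketch only, not the authors' code: the shapes, the sigmoid gating of visual features by the tiled text embedding, and the mean-over-channels "head" are all assumptions standing in for learned layers. It shows the overall flow — fuse a sentence embedding with image features at each pyramid scale, predict a mask per scale, then upsample and combine the per-scale masks.

```python
import numpy as np

def fuse_scale(visual, text):
    """Tile the sentence embedding over the spatial grid and gate the
    visual features with it (a stand-in for learned cross-modal fusion).
    visual: (H, W, C) feature map; text: (C,) sentence embedding."""
    h, w, c = visual.shape
    gate = 1.0 / (1.0 + np.exp(-np.broadcast_to(text, (h, w, c))))  # sigmoid
    return visual * gate

def predict_mask(fused):
    """Toy per-pixel score: mean over channels (stand-in for a conv head)."""
    return fused.mean(axis=-1)

def upsample(mask, factor):
    """Nearest-neighbour upsampling by an integer factor."""
    return np.kron(mask, np.ones((factor, factor)))

rng = np.random.default_rng(0)
C = 16                              # channel dimension (assumed)
text = rng.standard_normal(C)       # sentence embedding
scales = [8, 16, 32]                # pyramid resolutions (assumed)

masks = []
for s in scales:
    visual = rng.standard_normal((s, s, C))   # per-scale image features
    fused = fuse_scale(visual, text)          # multi-scale interaction
    masks.append(upsample(predict_mask(fused), 32 // s))

final_mask = np.mean(masks, axis=0)  # (32, 32) combined prediction
```

In the actual network, the gating and prediction steps would be learned layers and the per-scale masks would be combined by a trained head rather than a plain average; the sketch only conveys why every pyramid level contributes context at its own resolution.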

Key words: vision and language, referring image segmentation, multi-modal fusion and perception, feature pyramid network
