
Journal of Graphics ›› 2022, Vol. 43 ›› Issue (6): 1150-1158. DOI: 10.11996/JG.j.2095-302X.2022061150

• Image Processing and Computer Vision •


Multi-scale modality perception network for referring image segmentation

  1. School of Artificial Intelligence and Automation, Beijing University of Technology, Beijing 100124, China;  2. School of Mathematical Sciences, Dalian University of Technology, Dalian 116024, Liaoning, China
  • Online:2022-12-30 Published:2023-01-11
  • Supported by:
    The 7th National Postdoctoral Innovative Talent Support Program (BX20220025); The 70th Batch of National Post-Doctoral Research Grants (2021M700303) 


Abstract:

Referring image segmentation (RIS) is the task of parsing the instance referred to by a text description and segmenting that instance in the corresponding image; it is a popular research topic in computer vision and media. Most current RIS methods fuse text and image modality information at a single scale to perceive the location and semantics of the referred instance. However, single-scale modal information can hardly cover both the semantic and the structural context needed to locate instances of different sizes; this hinders the model from perceiving referents of arbitrary size and in turn degrades its segmentation of referents of different sizes. To address this problem, this paper designs a multi-scale visual-language interaction perception module and a multi-scale mask prediction module: the former strengthens the model's perception of instances at different scales and promotes effective alignment of semantics across modalities; the latter improves segmentation by fully capturing the semantic and structural information that instances of each scale require. Building on these two modules, this paper proposes a multi-scale modality perception network for referring image segmentation (MMPN-RIS). Experimental results show that MMPN-RIS achieves state-of-the-art oIoU scores on the three public datasets RefCOCO, RefCOCO+, and RefCOCOg, and that it also performs well when the text refers to instances of different scales.
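The two modules can be caricatured in a few lines of NumPy. The following is an illustrative sketch only, not the authors' code: the shapes, the sigmoid gating of visual features by the tiled text embedding, and the mean-over-channels "head" are all assumptions standing in for learned layers. It shows the overall flow — fuse a sentence embedding with image features at each pyramid scale, predict a mask per scale, then upsample and combine the per-scale masks.

```python
import numpy as np

def fuse_scale(visual, text):
    """Tile the sentence embedding over the spatial grid and gate the
    visual features with it (a stand-in for learned cross-modal fusion).
    visual: (H, W, C) feature map; text: (C,) sentence embedding."""
    h, w, c = visual.shape
    gate = 1.0 / (1.0 + np.exp(-np.broadcast_to(text, (h, w, c))))  # sigmoid
    return visual * gate

def predict_mask(fused):
    """Toy per-pixel score: mean over channels (stand-in for a conv head)."""
    return fused.mean(axis=-1)

def upsample(mask, factor):
    """Nearest-neighbour upsampling by an integer factor."""
    return np.kron(mask, np.ones((factor, factor)))

rng = np.random.default_rng(0)
C = 16                              # channel dimension (assumed)
text = rng.standard_normal(C)       # sentence embedding
scales = [8, 16, 32]                # pyramid resolutions (assumed)

masks = []
for s in scales:
    visual = rng.standard_normal((s, s, C))   # per-scale image features
    fused = fuse_scale(visual, text)          # multi-scale interaction
    masks.append(upsample(predict_mask(fused), 32 // s))

final_mask = np.mean(masks, axis=0)  # (32, 32) combined prediction
```

In the actual network, the gating and prediction steps would be learned layers and the per-scale masks would be combined by a trained head rather than a plain average; the sketch only conveys why every pyramid level contributes context at its own resolution.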

Key words: vision and language, referring image segmentation, multi-modal fusion and perception, feature pyramid network
