
Journal of Graphics ›› 2022, Vol. 43 ›› Issue (6): 1150-1158.DOI: 10.11996/JG.j.2095-302X.2022061150

• Image Processing and Computer Vision •

Multi-scale modality perception network for referring image segmentation

  

  1. School of Artificial Intelligence and Automation, Beijing University of Technology, Beijing 100124, China;  2. School of Mathematical Sciences, Dalian University of Technology, Dalian, Liaoning 116024, China
  • Online: 2022-12-30    Published: 2023-01-11
  • Supported by:
    The 7th National Postdoctoral Innovative Talent Support Program (BX20220025); The 70th Batch of National Post-Doctoral Research Grants (2021M700303) 

Abstract:

Referring image segmentation (RIS) is the task of parsing the instance referred to by a text description and segmenting that instance in the corresponding image; it is a popular research topic in computer vision and multimedia. Most current RIS methods fuse single-scale text/image modality information to perceive the location and semantics of the referred instance. However, single-scale modal information can hardly cover, at the same time, both the semantic and the structural context information required to locate instances of different sizes. This defect prevents the model from perceiving referred instances of arbitrary size, which degrades its segmentation of instances at different scales. To solve this problem, this paper proposes a multi-scale modality perception network for referring image segmentation (MMPN-RIS), built on two components: a multi-scale visual-language interaction perception module and a multi-scale mask prediction module. The former enhances the model's ability to perceive instances at different scales and promotes effective semantic alignment between modalities; the latter improves segmentation performance by fully capturing the semantic and structural information that instances of different scales require. Experimental results show that MMPN-RIS achieves state-of-the-art performance on the oIoU metric of three public datasets, RefCOCO, RefCOCO+, and RefCOCOg, and also performs well on referred instances of different scales.
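The abstract does not give the internal design of the two modules, but the general idea it describes (fusing a text embedding with FPN-style visual features at every scale, then aggregating per-scale predictions into one mask) can be illustrated with a minimal numpy sketch. All names, shapes, and the cosine-gating fusion used here are assumptions for illustration, not the paper's actual method:

```python
import numpy as np

def l2norm(x, axis=-1):
    # Normalize along an axis; the epsilon guards against division by zero.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

def fuse_scale(feat, text, W):
    """Hypothetical single-scale vision-language fusion: project the text
    embedding (D,) into the visual channel space with W (C x D), then gate
    each pixel of the visual map (C, H, W) by its cosine similarity to the
    projected text."""
    t = W @ text                                                 # (C,)
    sim = np.einsum('chw,c->hw', l2norm(feat, axis=0), l2norm(t))
    return feat * sim[None]                                      # (C, H, W)

def multiscale_predict(feats, text, Ws):
    """Hypothetical multi-scale mask prediction: fuse every scale, collapse
    channels, upsample each map to the finest resolution (nearest neighbour
    via np.kron), and average them into one mask-logit map (H, W)."""
    H, Wd = feats[0].shape[1:]
    maps = []
    for f, W in zip(feats, Ws):
        g = fuse_scale(f, text, W).mean(axis=0)      # (h, w)
        ry, rx = H // g.shape[0], Wd // g.shape[1]
        maps.append(np.kron(g, np.ones((ry, rx))))   # upsample to (H, Wd)
    return np.mean(maps, axis=0)
```

A coarse scale in this sketch contributes semantics with a large receptive field, while a fine scale contributes structural detail; averaging their upsampled predictions is the simplest possible stand-in for the aggregation the paper's mask prediction module performs.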

Key words: vision and language, referring image segmentation, multi-modality fusion and perception, feature pyramid network

CLC Number: