
Journal of Graphics, 2026, Vol. 47, Issue 2: 275-285. DOI: 10.11996/JG.j.2095-302X.2026020275

• Image Processing and Computer Vision •

Text-to-image person re-identification based on multi-granularity color learning

ZHOU Tenglong, YANG Wenjie, YIN Shaohua, YU Yuanlong

  1. College of Computer and Data Science, Fuzhou University, Fuzhou Fujian 350108, China
  • Received: 2025-10-18 Accepted: 2025-12-05 Online: 2026-04-30 Published: 2026-05-20
  • Contact: YANG Wenjie
  • Supported by:
    National Natural Science Foundation of China (62401153); National Natural Science Foundation of China (U21A20471); Fujian Provincial Natural Science Foundation (2024J01287); Fujian Provincial Science and Technology Plan Project (2025H6028)

Abstract:

Text-to-image person re-identification aims to retrieve a target person from an image database using a natural-language description, a task of considerable practical importance for video surveillance and public safety. Although existing methods have made significant progress in cross-modal fine-grained alignment, color remains underexplored as a key discriminative cue. The main obstacle is the semantic gap between discrete textual color descriptions and continuous visual color representations: this modality difference can mislead feature learning and ultimately limits retrieval accuracy. To address this problem, a text-to-image person re-identification framework based on Multi-Granularity Color Learning (MGCL) is proposed. The method uses a dual-tower vision-language model as the feature-extraction backbone and learns color information at three granularities: global, phrase, and word. This design captures and aligns color information in a coarse-to-fine manner, with the aim of improving both color perception and cross-modal alignment accuracy. At the global granularity, color-consistency modeling is introduced: a decoder with a cross-attention mechanism fuses grayscale-image embeddings with joint image-text embeddings to reconstruct the visual representation of the color image. This module guides the model to learn an implicit mapping from textual concepts to the continuous visual color space, thereby reducing the semantic gap between cross-modal color representations. At the phrase granularity, a color-phrase multi-label classification task is designed: the reconstructed visual representation of the color image and a pre-constructed feature library of color phrases are projected into a shared semantic space and aligned there, which strengthens the model’s understanding of “color-object” associations. At the word granularity, a color-aware replacement detection mechanism is proposed, which increases the model’s sensitivity to specific color words by masking them in the text and then training the model to predict whether they have been substituted. Experimental results show that MGCL achieves more precise cross-modal fine-grained alignment through multi-granularity color learning and obtains superior performance on three public datasets, CUHK-PEDES, ICFG-PEDES, and RSTPReid, which validates the effectiveness of the method for text-to-image person re-identification.
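
As an illustration of the word-granularity idea, the following minimal sketch shows how training examples for color-word replacement detection might be constructed. It is not the authors' implementation: the color vocabulary `COLOR_WORDS`, the function `replace_color_words`, the `replace_prob` parameter, and the label convention (1 = substituted, 0 = kept, -1 = ignored) are all assumptions made for this example. It also swaps color words directly rather than masking them first, and covers only data construction, not the detection model.

```python
import random

# Hypothetical color vocabulary; the abstract does not list the words actually used.
COLOR_WORDS = ["black", "white", "red", "blue", "green", "yellow",
               "gray", "brown", "pink", "purple", "orange"]
COLOR_SET = set(COLOR_WORDS)


def replace_color_words(tokens, replace_prob=0.5, rng=None):
    """Randomly swap color words and return per-token detection labels.

    Returns (new_tokens, labels), where labels[i] is
      1  if token i is a color word that was substituted,
      0  if token i is a color word that was kept,
     -1  if token i is not a color word (to be ignored by the loss).
    """
    rng = rng or random.Random()
    new_tokens, labels = [], []
    for tok in tokens:
        word = tok.lower()
        if word not in COLOR_SET:
            # Non-color tokens pass through unchanged.
            new_tokens.append(tok)
            labels.append(-1)
        elif rng.random() < replace_prob:
            # Substitute with a different color so the label is unambiguous.
            alternatives = [c for c in COLOR_WORDS if c != word]
            new_tokens.append(rng.choice(alternatives))
            labels.append(1)
        else:
            new_tokens.append(tok)
            labels.append(0)
    return new_tokens, labels


if __name__ == "__main__":
    caption = "a woman in a red coat and black pants".split()
    corrupted, labels = replace_color_words(caption, rng=random.Random(0))
    print(corrupted)
    print(labels)
```

In such a setup, a binary classifier over the text-encoder token outputs would be trained on the positions labeled 0 or 1, so that the model has to check each color word against the image content.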

Key words: text-to-image person re-identification, cross-modal fine-grained alignment, multi-granularity learning, vision-language model, multi-label classification
