Journal of Graphics ›› 2026, Vol. 47 ›› Issue (2): 275-285. DOI: 10.11996/JG.j.2095-302X.2026020275

• Image Processing and Computer Vision •

Text-to-image person re-identification based on multi-granularity color learning

ZHOU Tenglong, YANG Wenjie, YIN Shaohua, YU Yuanlong

  1. College of Computer and Data Science, Fuzhou University, Fuzhou, Fujian 350108, China
  • Received: 2025-10-18 Accepted: 2025-12-05 Published: 2026-04-30 Online: 2026-05-20
  • Contact: YANG Wenjie, E-mail: hokkien.ywj@gmail.com
  • Supported by:
    National Natural Science Foundation of China (62401153); National Natural Science Foundation of China (U21A20471); Fujian Provincial Natural Science Foundation (2024J01287); Fujian Provincial Science and Technology Plan Project (2025H6028)

Abstract:

Text-to-image person re-identification aims to retrieve a target person from an image database using a natural-language description, a task of considerable practical value in video surveillance and public safety. Although existing methods have made significant progress in cross-modal fine-grained alignment, they have not sufficiently explored color as a key discriminative cue, and they fail to bridge the significant semantic gap between the discrete nature of textual color descriptions and the continuous nature of visual color representations. This modality gap can mislead feature learning and limits retrieval accuracy. To address these problems, a text-to-image person re-identification method based on multi-granularity color learning (MGCL) is proposed. The method adopts a dual-tower vision-language model as the feature-extraction network and models color information at three granularities (global, phrase, and word) in order to capture and align color information from coarse to fine, thereby improving both the model’s color perception and its cross-modal alignment accuracy. At the global granularity, color-consistency modeling is introduced: a decoder with a cross-attention mechanism fuses grayscale-image embeddings with joint image-text embeddings to reconstruct the visual representation of the color image. This guides the model to learn an implicit mapping from textual concepts to the continuous visual color space, alleviating the semantic discrepancy in cross-modal color expression. At the phrase granularity, a color-phrase multi-label classification task is designed, in which the reconstructed visual representation of the color image and a pre-constructed color-phrase feature bank are projected into a shared semantic space and aligned, strengthening the model’s precise understanding of “color-object” associations. At the word granularity, a color-aware replacement detection mechanism is proposed: color words in the text are masked and reconstructed, and the model judges whether each color word has been replaced, which increases its sensitivity to color words. Experimental results show that MGCL achieves more precise cross-modal fine-grained alignment through multi-granularity color learning and obtains superior performance on three public datasets, CUHK-PEDES, ICFG-PEDES, and RSTPReid, validating the effectiveness of the method for text-to-image person re-identification.
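
The phrase-level and word-level supervision signals described above can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the color vocabulary `COLOR_WORDS`, the adjacency rule for finding "color object" phrases, the function names, and the example caption are all assumptions made for illustration. In particular, the word-level part substitutes a random color swap for the learned mask-and-reconstruct step that the abstract describes.

```python
import random

# Hypothetical color vocabulary; the abstract does not list the actual one.
COLOR_WORDS = ("black", "white", "red", "blue", "green",
               "yellow", "gray", "brown", "pink")


def color_phrase_targets(tokens, phrase_bank):
    """Build a multi-hot label vector over a bank of "color object" phrases.

    A phrase counts as present when a color word is immediately followed by
    another token (a deliberately naive adjacency rule, assumed here).
    """
    present = {
        f"{first} {second}"
        for first, second in zip(tokens, tokens[1:])
        if first in COLOR_WORDS
    }
    return [1 if phrase in present else 0 for phrase in phrase_bank]


def corrupt_color_words(tokens, replace_prob=0.5, seed=0):
    """Randomly swap color words and return (corrupted_tokens, labels).

    labels[i] is 1 when token i was replaced by a different color word,
    otherwise 0. Non-color tokens are never modified.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for token in tokens:
        if token in COLOR_WORDS and rng.random() < replace_prob:
            alternatives = [c for c in COLOR_WORDS if c != token]
            corrupted.append(rng.choice(alternatives))
            labels.append(1)
        else:
            corrupted.append(token)
            labels.append(0)
    return corrupted, labels


caption = "a woman in a red jacket and black pants carries a white bag".split()
bank = ["red jacket", "black pants", "blue jeans", "white bag"]

# Phrase granularity: which bank phrases does the caption contain?
print(color_phrase_targets(caption, bank))  # [1, 1, 0, 1]

# Word granularity: per-token "was this color word replaced?" labels.
corrupted, labels = corrupt_color_words(caption, replace_prob=1.0)
print(" ".join(corrupted))
print(labels)
```

In a full model, the multi-hot vector would plausibly serve as the target of a multi-label loss over similarities between the reconstructed visual representation and the phrase features, and the per-token labels would supervise a binary classifier on the text encoder's token outputs. Both uses are assumptions here, since the abstract does not specify the loss functions.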

Key words: text-to-image person re-identification, cross-modal fine-grained alignment, multi-granularity learning, vision-language model, multi-label classification

CLC Number: