Text-to-image person re-identification based on multi-granularity color learning

doi:10.11996/JG.j.2095-302X.2026020275

Abstract:

Text-to-image person re-identification aims to retrieve a target person from an image database using a natural-language description, a task of considerable practical value in video surveillance and public safety. Although existing methods have made significant progress in cross-modal fine-grained alignment, they have not sufficiently explored color as a key discriminative cue, and they have not effectively bridged the significant semantic gap between discrete textual color descriptions and continuous visual color representations. This modality difference can mislead the model’s feature learning and limits the final retrieval accuracy. To address these problems, a text-to-image person re-identification method based on multi-granularity color learning (MGCL) was proposed. A dual-tower vision-language model was employed as the feature-extraction network, and color information was modeled at three granularities (global, phrase, and word) so that it could be captured and aligned in a coarse-to-fine manner, thereby improving both the model’s color perception and its cross-modal alignment accuracy. At the global granularity, color-consistency modeling was introduced: a decoder with a cross-attention mechanism fused grayscale-image embeddings with joint image-text embeddings to reconstruct the visual representation of the color image. This guided the model to learn an implicit mapping from textual concepts to the continuous visual color space, thus alleviating the semantic discrepancy in cross-modal color expression. At the phrase granularity, a color-phrase multi-label classification task was designed, in which the reconstructed visual representation of the color image and a pre-constructed feature library of color phrases were projected into a shared semantic space for alignment, strengthening the model’s precise understanding of “color-object” associations. At the word granularity, a color-aware replacement detection mechanism was proposed, which masked color words in the text, reconstructed them, and judged whether each color word had been replaced, thereby enhancing the model’s sensitivity to color words. Experimental results demonstrated that MGCL achieved more precise cross-modal fine-grained alignment through multi-granularity color learning and obtained superior performance on three public datasets, CUHK-PEDES, ICFG-PEDES, and RSTPReid, validating the effectiveness of the method for text-to-image person re-identification.

Key words: text-to-image person re-identification, cross-modal fine-grained alignment, multi-granularity learning, vision-language model, multi-label classification
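The word-granularity idea in the abstract (detecting whether a color word in the caption has been replaced) can be illustrated with a small data-preparation sketch. This is only a sketch under stated assumptions: the color vocabulary `COLOR_WORDS`, the function `replace_color_words`, and the `replace_prob` parameter are hypothetical, and random substitution stands in here for the model's own reconstruction of masked color words, which the abstract does not specify in detail.

```python
import random

# Hypothetical color vocabulary; the paper's actual word list is not given here.
COLOR_WORDS = ["black", "white", "red", "blue", "green", "yellow", "gray", "pink"]


def replace_color_words(tokens, replace_prob=0.5, rng=None):
    """Randomly swap color words and return (new_tokens, labels).

    labels[i] is 1 if token i is a color word that was replaced,
    0 if it is a color word left unchanged, and -1 (ignored) otherwise.
    """
    rng = rng or random.Random()
    new_tokens, labels = [], []
    for tok in tokens:
        if tok.lower() in COLOR_WORDS:
            if rng.random() < replace_prob:
                # Pick a different color so the "replaced" label is unambiguous.
                candidates = [c for c in COLOR_WORDS if c != tok.lower()]
                new_tokens.append(rng.choice(candidates))
                labels.append(1)
            else:
                new_tokens.append(tok)
                labels.append(0)
        else:
            new_tokens.append(tok)
            labels.append(-1)
    return new_tokens, labels


caption = "a woman in a red coat and black trousers carries a white bag".split()
corrupted, labels = replace_color_words(caption, rng=random.Random(0))
print(" ".join(corrupted))
print(labels)
```

A detector trained on such pairs would receive the corrupted caption (together with the image) and predict the 0/1 label at each color-word position, ignoring positions labeled -1.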

CLC number:

ZHOU Tenglong, YANG Wenjie, YIN Shaohua, YU Yuanlong. Text-to-image person re-identification based on multi-granularity color learning[J]. Journal of Graphics, 2026, 47(2): 275-285.

Figures/Tables 13

Fig. 1 Illustration of the cross-modal color semantic gap

Fig. 2 Overall framework of multi-granularity color learning

Fig. 3 Schematic diagram of color-phrase multi-label classification (CPMC)

Fig. 4 Schematic diagram of color-aware replacement detection (CRD)

Table 1 Performance comparison between MGCL and state-of-the-art methods

Model	Source	CUHK-PEDES				ICFG-PEDES				RSTPReid			
		R@1/%	R@5/%	R@10/%	mAP/%	R@1/%	R@5/%	R@10/%	mAP/%	R@1/%	R@5/%	R@10/%	mAP/%
IRRA^[13]	CVPR23	73.38	89.93	93.71	66.13	63.46	80.25	85.82	38.06	63.46	80.25	85.82	38.06
PLOT^[16]	ECCV24	75.28	90.42	94.12	─	65.76	81.39	86.73	─	65.76	81.39	86.73	─
RaSa^[14]	IJCAI23	76.51	90.29	94.25	69.38	65.28	80.40	85.12	41.29	65.28	80.40	85.12	41.29
APTM^[15]	MM23	76.53	90.04	94.15	66.91	68.51	82.99	87.56	41.22	68.51	82.99	87.56	41.22
CFAM^[25]	CVPR24	75.60	90.53	94.36	67.27	65.38	81.17	86.35	39.42	62.45	83.50	91.10	49.50
CADA^[26]	TMM24	78.37	91.57	94.58	68.87	67.81	82.34	87.14	39.85	69.60	86.75	92.40	52.74
ICL^[27]	CVPR25	77.91	90.27	94.14	69.13	69.02	82.45	87.36	41.21	70.55	85.95	91.65	53.68
MGCL	Ours	78.68	91.08	94.75	70.16	70.31	83.58	87.86	44.78	70.84	87.88	93.70	55.32
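Table 1 reports R@K and mAP. As a rough illustration of how these retrieval metrics are commonly computed from a text-to-image similarity matrix, the sketch below may help. The function name `rank_k_and_map` and the toy similarity scores and identity labels are assumptions made for illustration; the paper's exact evaluation protocol may differ in detail.

```python
def rank_k_and_map(similarity, query_ids, gallery_ids, ks=(1, 5, 10)):
    """Compute Rank-K accuracy and mAP (both in %) for text-to-image retrieval.

    similarity[i][j] is the score between text query i and gallery image j.
    """
    hits = {k: 0 for k in ks}
    ap_sum = 0.0
    for scores, qid in zip(similarity, query_ids):
        # Gallery indices sorted by descending similarity.
        order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        matches = [gallery_ids[j] == qid for j in order]
        for k in ks:
            # A query counts as a hit if any of its top-k images is correct.
            hits[k] += any(matches[:k])
        # Average precision: mean of the precision at each correct position.
        found, precisions = 0, []
        for rank, is_match in enumerate(matches, start=1):
            if is_match:
                found += 1
                precisions.append(found / rank)
        ap_sum += sum(precisions) / len(precisions) if precisions else 0.0
    n = len(query_ids)
    recall = {k: 100.0 * hits[k] / n for k in ks}
    return recall, 100.0 * ap_sum / n


# Toy example: 3 text queries, 3 gallery images, 2 identities.
similarity = [
    [0.9, 0.2, 0.4],  # query of identity 0
    [0.5, 0.3, 0.8],  # query of identity 1
    [0.2, 0.9, 0.1],  # query of identity 0
]
recall, mean_ap = rank_k_and_map(similarity, [0, 1, 0], [0, 1, 1], ks=(1, 2))
print(recall, round(mean_ap, 2))
```

In the toy example two of the three queries rank a correct image first, so R@1 is about 66.67%, while R@2 is 100% and mAP is about 77.78%.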

Table 2 Ablation study on the key components of MGCL

No.	Components				CUHK-PEDES		ICFG-PEDES		RSTPReid	
	Bsl	CCM	CPMC	CRD	R@1/%	mAP/%	R@1/%	mAP/%	R@1/%	mAP/%
1	√				76.52	68.72	68.02	43.28	68.50	53.16
2	√	√			77.54	69.40	69.10	43.96	69.75	54.22
3	√		√		77.68	69.34	69.24	43.98	69.96	54.50
4	√			√	77.47	69.28	69.07	43.58	69.62	54.37
5	√	√	√	√	78.68	70.16	70.31	44.78	70.84	55.32

Fig. 5 Color jitter examples ((a) Original images; (b) Color-jittered images)
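For readers unfamiliar with color jitter, the following minimal sketch shows one way such a perturbation can be applied, by shifting hue and scaling saturation and brightness in HSV space. The functions `jitter_pixel` and `color_jitter` and the ranges `max_hue` and `max_scale` are hypothetical; the jitter settings actually used for Fig. 5 and Table 3 are not stated in this listing.

```python
import colorsys
import random


def jitter_pixel(rgb, hue_shift, sat_scale, val_scale):
    """Shift hue and scale saturation/brightness of one RGB pixel (0-255)."""
    r, g, b = (c / 255.0 for c in rgb)
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    h = (h + hue_shift) % 1.0                 # hue is cyclic in [0, 1)
    s = min(1.0, max(0.0, s * sat_scale))     # clamp to the valid range
    v = min(1.0, max(0.0, v * val_scale))
    r, g, b = colorsys.hsv_to_rgb(h, s, v)
    return tuple(round(c * 255) for c in (r, g, b))


def color_jitter(image, max_hue=0.5, max_scale=0.4, rng=None):
    """Jitter an image (list of rows of RGB tuples) with one random setting."""
    rng = rng or random.Random()
    hue_shift = rng.uniform(-max_hue, max_hue)
    sat_scale = rng.uniform(1 - max_scale, 1 + max_scale)
    val_scale = rng.uniform(1 - max_scale, 1 + max_scale)
    return [[jitter_pixel(px, hue_shift, sat_scale, val_scale) for px in row]
            for row in image]


# A one-third hue rotation turns pure red into pure green.
print(jitter_pixel((255, 0, 0), 1 / 3, 1.0, 1.0))  # -> (0, 255, 0)
```

Because a hue shift changes the apparent color of clothing while leaving shape and texture intact, it makes captions such as "red coat" inconsistent with the image, which is why it is a natural stress test for color-dependent retrieval.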

Table 3 Ablation study on color jitter

Model	Color jitter	R@1/%	R@5/%	R@10/%	mAP/%
MGCL	w/o	70.84	87.88	93.70	55.32
MGCL	w	45.05	71.30	79.55	26.26
Bsl	w/o	68.50	86.35	91.45	53.16
Bsl	w	41.90	67.55	76.90	24.23
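The effect of color jitter in Table 3 is easier to read as a drop in percentage points. The short sketch below computes it from the R@1 and mAP values in the table, whose no-jitter rows coincide with the RSTPReid results in Tables 1 and 2. The dictionary layout and the helper `jitter_drop` are illustrative choices, not part of the paper.

```python
# Values copied from Table 3 (all in %).
table3 = {
    "MGCL": {"w/o": {"R@1": 70.84, "mAP": 55.32}, "w": {"R@1": 45.05, "mAP": 26.26}},
    "Bsl": {"w/o": {"R@1": 68.50, "mAP": 53.16}, "w": {"R@1": 41.90, "mAP": 24.23}},
}


def jitter_drop(results):
    """Absolute drop (percentage points) caused by color jitter, per metric."""
    return {m: round(results["w/o"][m] - results["w"][m], 2) for m in results["w/o"]}


drops = {model: jitter_drop(res) for model, res in table3.items()}
for model, drop in drops.items():
    print(model, drop)
```

Both models lose roughly 26 points of R@1 (25.79 for MGCL, 26.60 for the baseline) and about 29 points of mAP, which is consistent with the abstract's premise that color is a key discriminative cue for this task.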

Fig. 6 t-SNE visualization of image and text features ((a) Baseline model; (b) MGCL model)

Fig. 7 Comparison of “color-object” understanding between the baseline and MGCL ((a) Baseline model; (b) MGCL model)

Fig. 8 Comparison of top-10 retrieval results between the baseline and MGCL ((a), (c), (e) Baseline model; (b), (d), (f) MGCL model)

Fig. 9 Impact of loss weights on model performance

Table 4 Model complexity and inference efficiency

Model	Parameters/M	Computation/GFLOPs	Inference time/ms
IRRA^[13]	194.5	13.0	13.4
RaSa^[14]	210.2	58.1	19.8
MGCL	285.6	63.5	22.5

References 28

[1]	GENG Y, TAN H C, LI J H, et al. Visual information accumulation network for person re-identification[J]. Journal of Graphics, 2022, 43(6): 1193-1200 (in Chinese).
[2]	ZHANG Y P, WANG H Y, ZHANG J, et al. One-shot video-based person re-identification based on neighborhood center iteration strategy[J]. Journal of Software, 2021, 32(12): 4025-4035 (in Chinese).
[3]	YANG W J, WANG W M, WANG Q Y, et al. Image retrieval method based on perceptual hash algorithm and bag of visual words[J]. Journal of Graphics, 2019, 40(3): 519-524 (in Chinese).
[4]	LI S, XIAO T, LI H S, et al. Person search with natural language description[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 1970-1979.
[5]	SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[EB/OL]. (2014-09-14) [2025-08-18]. https://arxiv.org/abs/1409.1556.
[6]	HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 770-778.
[7]	GRAVES A. Long short-term memory[M]//GRAVES A. Supervised Sequence Labelling with Recurrent Neural Networks. Heidelberg: Springer, 2012: 37-45.
[8]	DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[C]// 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Albuquerque: ACL, 2019: 4171-4186.
[9]	LI J N, SELVARAJU R R, GOTMARE A D, et al. Align before fuse: vision and language representation learning with momentum distillation[C]// The 35th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2021: 742.
[10]	LI J N, LI D X, XIONG C M, et al. BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation[J/OL]. [2025-08-17]. https://proceedings.mlr.press/v162/li22n.html.
[11]	ZHANG Y, LU H C. Deep cross-modal projection learning for image-text matching[C]// The 15th European Conference on Computer Vision. Cham: Springer, 2018: 686-701.
[12]	ZHENG Z D, ZHENG L, GARRETT M, et al. Dual-path convolutional image-text embeddings with instance loss[J]. ACM Transactions on Multimedia Computing, Communications, and Applications, 2020, 16(2): 51.
[13]	JIANG D, YE M. Cross-modal implicit relation reasoning and aligning for text-to-image person retrieval[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 2787-2797.
[14]	BAI Y, CAO M, GAO D M, et al. RaSa: relation and sensitivity aware representation learning for text-based person search[EB/OL]. [2025-08-17]. https://dl.acm.org/doi/10.24963/ijcai.2023/62.
[15]	YANG S Y, ZHOU Y N, ZHENG Z D, et al. Towards unified text-based person retrieval: a large-scale multi-attribute and language search benchmark[C]// The 31st ACM International Conference on Multimedia. New York: ACM, 2023: 4492-4501.
[16]	PARK J, KIM D, JEONG B, et al. PLOT: text-based person search with part slot attention for corresponding part discovery[C]// The 18th European Conference on Computer Vision. Cham: Springer, 2025: 474-490.
[17]	LOCATELLO F, WEISSENBORN D, UNTERTHINER T, et al. Object-centric learning with slot attention[C]// The 34th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2020: 967.
[18]	SWAIN M J, BALLARD D H. Color indexing[J]. International Journal of Computer Vision, 1991, 7(1): 11-32.
[19]	STRICKER M A, ORENGO M. Similarity of color images[C]// SPIE 2420, Storage and Retrieval for Image and Video Databases III. Bellingham: SPIE, 1995: 381-392.
[20]	ZHANG R, ISOLA P, EFROS A A. Colorful image colorization[C]// The 14th European Conference on Computer Vision. Cham: Springer, 2016: 649-666.
[21]	KANG X Y, YANG T, OUYANG W Q, et al. DDColor: towards photo-realistic image colorization via dual decoders[C]// 2023 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2023: 328-338.
[22]	GOMEZ-VILLA A, HERNÁNDEZ-CÁMARA P, BUTT M A, et al. Color names in vision-language models[EB/OL]. [2025-09-26]. https://arxiv.org/abs/2509.22524.
[23]	BAI J Z, BAI S, CHU Y F, et al. Qwen technical report[EB/OL]. [2025-09-28]. https://arxiv.org/abs/2309.16609.
[24]	WANG W H, BAO H B, HUANG S H, et al. MiniLMv2: multi-head self-attention relation distillation for compressing pretrained transformers[C]// Findings of the Association for Computational Linguistics. Albuquerque: ACL, 2021: 2140-2151.
[25]	ZUO J L, ZHOU H Y, NIE Y, et al. UFineBench: towards text-based person retrieval with ultra-fine granularity[C]// 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2024: 22010-22019.
[26]	LIN D X, PENG Y X, MENG J K, et al. Cross-modal adaptive dual association for text-to-image person retrieval[J]. IEEE Transactions on Multimedia, 2024, 26: 6609-6620.
[27]	QIN Y, CHEN C, FU Z H, et al. Human-centered interactive learning via MLLMs for text-to-image person re-identification[C]// 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2025: 14390-14399.
[28]	SELVARAJU R R, COGSWELL M, DAS A, et al. Grad-CAM: visual explanations from deep networks via gradient-based localization[C]// 2017 IEEE International Conference on Computer Vision. New York: IEEE Press, 2017: 618-626.