Text-to-image person re-identification based on multi-granularity color learning

doi:10.11996/JG.j.2095-302X.2026020275

Abstract

Text-to-image person re-identification aims to retrieve a target person from an image database using a natural-language description, a task of considerable practical importance for video surveillance and public safety. Although existing methods have made significant progress in cross-modal fine-grained alignment, color remains underexplored as a key discriminative cue. The main reason is the semantic gap between discrete textual color descriptions and continuous visual color representations: this modality difference can mislead feature learning and ultimately limits retrieval accuracy. To address this problem, a text-to-image person re-identification framework based on Multi-Granularity Color Learning (MGCL) was proposed. The method employed a dual-tower vision-language model as the feature-extraction backbone and learned color information at three granularities: global, phrase, and word. This design aimed to capture and align color information in a coarse-to-fine manner, thereby improving both color perception and cross-modal alignment accuracy. At the global granularity, color-consistency modeling was introduced: a decoder with a cross-attention mechanism fused grayscale-image embeddings with joint image-text embeddings to reconstruct the visual representation of the color image. This module guided the model to learn an implicit mapping from textual concepts to the continuous visual color space, thereby narrowing the semantic gap between cross-modal color representations. At the phrase granularity, a color-phrase multi-label classification task was designed, which aligned the reconstructed visual representation of the color image with a pre-constructed feature library of color phrases by projecting both into a shared semantic space. The objective was to strengthen the model’s understanding of “color-object” associations. At the word granularity, a color-aware replacement detection mechanism was proposed, which enhanced the model’s sensitivity to specific color words by masking them in the text and then training the model to predict whether they had been substituted. Experimental results demonstrated that MGCL achieved more precise cross-modal fine-grained alignment through multi-granularity color learning. On the three public datasets CUHK-PEDES, ICFG-PEDES, and RSTPReid, it reached R@1 accuracies of 78.68%, 70.31%, and 70.84%, respectively, exceeding the compared methods and validating its effectiveness for text-to-image person re-identification.
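
The word-granularity idea can be illustrated with a minimal sketch of how training samples for replacement detection might be built. This is not the authors' implementation: the color vocabulary `COLOR_WORDS`, the function `make_replacement_sample`, the whitespace tokenization, and the choice to swap a color word directly for another one (rather than mask it first) are all assumptions made for illustration.

```python
import random

# Hypothetical color vocabulary; the paper's actual word list is not given here.
COLOR_WORDS = ["black", "white", "red", "blue", "green",
               "yellow", "gray", "brown", "pink", "purple"]


def make_replacement_sample(caption, replace_prob=0.5, rng=None):
    """Randomly swap color words and return (tokens, labels).

    labels[i] is 1 if token i was replaced by a different color word,
    otherwise 0. A detector would be trained to predict these labels.
    """
    if rng is None:
        rng = random.Random()
    tokens = caption.lower().split()  # naive whitespace tokenization
    new_tokens, labels = [], []
    for tok in tokens:
        if tok in COLOR_WORDS and rng.random() < replace_prob:
            # Pick any color other than the original one.
            alternatives = [c for c in COLOR_WORDS if c != tok]
            new_tokens.append(rng.choice(alternatives))
            labels.append(1)
        else:
            new_tokens.append(tok)
            labels.append(0)
    return new_tokens, labels


if __name__ == "__main__":
    toks, labs = make_replacement_sample(
        "A man in a red shirt and black pants",
        replace_prob=1.0,
        rng=random.Random(0),
    )
    print(list(zip(toks, labs)))
```

With `replace_prob=1.0` every color word is swapped, so only the positions of "red" and "black" receive label 1. A real pipeline would use the text encoder's own tokenizer instead of `split()`.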

Key words: text-to-image person re-identification, cross-modal fine-grained alignment, multi-granularity learning, vision-language model, multi-label classification

ZHOU Tenglong, YANG Wenjie, YIN Shaohua, YU Yuanlong. Text-to-image person re-identification based on multi-granularity color learning[J]. Journal of Graphics, 2026, 47(2): 275-285.

References

[1]	GENG Y, TAN H C, LI J H, et al. Visual information accumulation network for person re-identification[J]. Journal of Graphics, 2022, 43(6): 1193-1200 (in Chinese).
[2]	ZHANG Y P, WANG H Y, ZHANG J, et al. One-shot video-based person re-identification based on neighborhood center iteration strategy[J]. Journal of Software, 2021, 32(12): 4025-4035 (in Chinese).
[3]	YANG W J, WANG W M, WANG Q Y, et al. Image retrieval method based on perceptual hash algorithm and bag of visual words[J]. Journal of Graphics, 2019, 40(3): 519-524 (in Chinese).
[4]	LI S, XIAO T, LI H S, et al. Person search with natural language description[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 1970-1979.
[5]	SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[EB/OL]. (2014-09-14) [2025-08-18]. https://arxiv.org/abs/1409.1556.
[6]	HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 770-778.
[7]	GRAVES A. Long short-term memory[M]//GRAVES A. Supervised Sequence Labelling with Recurrent Neural Networks. Heidelberg: Springer, 2012: 37-45.
[8]	DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[C]// 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Albuquerque: ACL, 2019: 4171-4186.
[9]	LI J N, SELVARAJU R R, GOTMARE A D, et al. Align before fuse: vision and language representation learning with momentum distillation[C]// The 35th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2021: 742.
[10]	LI J N, LI D X, XIONG C M, et al. BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation[J/OL]. [2025-08-17]. https://proceedings.mlr.press/v162/li22n.html.
[11]	ZHANG Y, LU H C. Deep cross-modal projection learning for image-text matching[C]// The 15th European Conference on Computer Vision. Cham: Springer, 2018: 686-701.
[12]	ZHENG Z D, ZHENG L, GARRETT M, et al. Dual-path convolutional image-text embeddings with instance loss[J]. ACM Transactions on Multimedia Computing, Communications, and Applications, 2020, 16(2): 51.
[13]	JIANG D, YE M. Cross-modal implicit relation reasoning and aligning for text-to-image person retrieval[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 2787-2797.
[14]	BAI Y, CAO M, GAO D M, et al. RaSa: relation and sensitivity aware representation learning for text-based person search[EB/OL]. [2025-08-17]. https://dl.acm.org/doi/10.24963/ijcai.2023/62.
[15]	YANG S Y, ZHOU Y N, ZHENG Z D, et al. Towards unified text-based person retrieval: a large-scale multi-attribute and language search benchmark[C]// The 31st ACM International Conference on Multimedia. New York: ACM, 2023: 4492-4501.
[16]	PARK J, KIM D, JEONG B, et al. PLOT: text-based person search with part slot attention for corresponding part discovery[C]// The 18th European Conference on Computer Vision. Cham: Springer, 2025: 474-490.
[17]	LOCATELLO F, WEISSENBORN D, UNTERTHINER T, et al. Object-centric learning with slot attention[C]// The 34th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2020: 967.
[18]	SWAIN M J, BALLARD D H. Color indexing[J]. International Journal of Computer Vision, 1991, 7(1): 11-32.
[19]	STRICKER M A, ORENGO M. Similarity of color images[C]// SPIE 2420, Storage and Retrieval for Image and Video Databases III. Bellingham: SPIE, 1995: 381-392.
[20]	ZHANG R, ISOLA P, EFROS A A. Colorful image colorization[C]// The 14th European Conference on Computer Vision. Cham: Springer, 2016: 649-666.
[21]	KANG X Y, YANG T, OUYANG W Q, et al. DDColor: towards photo-realistic image colorization via dual decoders[C]// 2023 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2023: 328-338.
[22]	GOMEZ-VILLA A, HERNÁNDEZ-CÁMARA P, BUTT M A, et al. Color names in vision-language models[EB/OL]. [2025-09-26]. https://arxiv.org/abs/2509.22524.
[23]	BAI J Z, BAI S, CHU Y F, et al. Qwen technical report[EB/OL]. [2025-09-28]. https://arxiv.org/abs/2309.16609.
[24]	WANG W H, BAO H B, HUANG S H, et al. MiniLMv2: multi-head self-attention relation distillation for compressing pretrained transformers[C]// Findings of the Association for Computational Linguistics. Albuquerque: ACL, 2021: 2140-2151.
[25]	ZUO J L, ZHOU H Y, NIE Y, et al. UFineBench: towards text-based person retrieval with ultra-fine granularity[C]// 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2024: 22010-22019.
[26]	LIN D X, PENG Y X, MENG J K, et al. Cross-modal adaptive dual association for text-to-image person retrieval[J]. IEEE Transactions on Multimedia, 2024, 26: 6609-6620.
[27]	QIN Y, CHEN C, FU Z H, et al. Human-centered interactive learning via MLLMs for text-to-image person re-identification[C]// 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2025: 14390-14399.
[28]	SELVARAJU R R, COGSWELL M, DAS A, et al. Grad-CAM: visual explanations from deep networks via gradient-based localization[C]// 2017 IEEE International Conference on Computer Vision. New York: IEEE Press, 2017: 618-626.

Model	Source	CUHK-PEDES				ICFG-PEDES				RSTPReid			
		R@1/%	R@5/%	R@10/%	mAP/%	R@1/%	R@5/%	R@10/%	mAP/%	R@1/%	R@5/%	R@10/%	mAP/%
IRRA^[13]	CVPR23	73.38	89.93	93.71	66.13	63.46	80.25	85.82	38.06	63.46	80.25	85.82	38.06
PLOT^[16]	ECCV24	75.28	90.42	94.12	─	65.76	81.39	86.73	─	65.76	81.39	86.73	─
RaSa^[14]	IJCAI23	76.51	90.29	94.25	69.38	65.28	80.40	85.12	41.29	65.28	80.40	85.12	41.29
APTM^[15]	MM23	76.53	90.04	94.15	66.91	68.51	82.99	87.56	41.22	68.51	82.99	87.56	41.22
CFAM^[25]	CVPR24	75.60	90.53	94.36	67.27	65.38	81.17	86.35	39.42	62.45	83.50	91.10	49.50
CADA^[26]	TMM24	78.37	91.57	94.58	68.87	67.81	82.34	87.14	39.85	69.60	86.75	92.40	52.74
ICL^[27]	CVPR25	77.91	90.27	94.14	69.13	69.02	82.45	87.36	41.21	70.55	85.95	91.65	53.68
MGCL	Ours	78.68	91.08	94.75	70.16	70.31	83.58	87.86	44.78	70.84	87.88	93.70	55.32
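
The R@K and mAP columns above are standard retrieval metrics. The following is a minimal sketch of how they are commonly computed from a text-to-image similarity matrix. The function name `rank_k_and_map` and its arguments are hypothetical, and the exact evaluation protocol used in the paper is not stated here, so this should be read as an illustration rather than the authors' evaluation code.

```python
def rank_k_and_map(sim, query_ids, gallery_ids, ks=(1, 5, 10)):
    """Compute R@K (%) and mAP (%) for text-to-image retrieval.

    sim[i][j]   : similarity between text query i and gallery image j
    query_ids   : identity label of each text query
    gallery_ids : identity label of each gallery image
    """
    hits = {k: 0 for k in ks}
    ap_sum = 0.0
    valid = 0
    for scores, qid in zip(sim, query_ids):
        # Rank gallery images from most to least similar.
        order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        matches = [gallery_ids[j] == qid for j in order]
        if not any(matches):
            continue  # skip queries with no ground-truth image in the gallery
        valid += 1
        first_hit = matches.index(True)
        for k in ks:
            if first_hit < k:
                hits[k] += 1
        # Average precision: mean of precision at each correct position.
        found = 0
        precisions = []
        for rank, is_match in enumerate(matches, start=1):
            if is_match:
                found += 1
                precisions.append(found / rank)
        ap_sum += sum(precisions) / len(precisions)
    if valid == 0:
        raise ValueError("no query has a matching gallery image")
    recall = {k: 100.0 * hits[k] / valid for k in ks}
    return recall, 100.0 * ap_sum / valid


if __name__ == "__main__":
    sim = [[0.9, 0.1, 0.8],
           [0.7, 0.2, 0.6]]
    recall, mean_ap = rank_k_and_map(sim, [1, 2], [1, 2, 1], ks=(1, 2, 3))
    print(recall, mean_ap)
```

In the toy example, the first query ranks both of its matching images on top (average precision 1), while the second query finds its only match at rank 3 (average precision 1/3).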

No.	Components				CUHK-PEDES		ICFG-PEDES		RSTPReid	
	Bsl	CCM	CPMC	CRD	R@1/%	mAP/%	R@1/%	mAP/%	R@1/%	mAP/%
1	√				76.52	68.72	68.02	43.28	68.50	53.16
2	√	√			77.54	69.40	69.10	43.96	69.75	54.22
3	√		√		77.68	69.34	69.24	43.98	69.96	54.50
4	√			√	77.47	69.28	69.07	43.58	69.62	54.37
5	√	√	√	√	78.68	70.16	70.31	44.78	70.84	55.32

Model	Color jitter	R@1/%	R@5/%	R@10/%	mAP/%
MGCL	w/o	70.84	87.88	93.70	55.32
MGCL	w	45.05	71.30	79.55	26.26
Bsl	w/o	68.50	86.35	91.45	53.16
Bsl	w	41.90	67.55	76.90	24.23
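
The effect of color jitter in the table above can be read off as an absolute drop in percentage points. The short sketch below only does that arithmetic on the R@1 and mAP values copied from the table. The names `results` and `jitter_drop` are hypothetical helpers, not part of the paper.

```python
# (R@1/%, mAP/%) values copied from the color-jitter table above.
results = {
    "MGCL": {"w/o": (70.84, 55.32), "w": (45.05, 26.26)},
    "Bsl": {"w/o": (68.50, 53.16), "w": (41.90, 24.23)},
}


def jitter_drop(model):
    """Return the (R@1, mAP) drop in percentage points caused by color jitter."""
    clean = results[model]["w/o"]
    jittered = results[model]["w"]
    return tuple(round(c - j, 2) for c, j in zip(clean, jittered))


if __name__ == "__main__":
    for name in results:
        print(name, jitter_drop(name))
```

Both models lose more than 25 points of R@1 once color is perturbed, which is consistent with color being a strong discriminative cue for this task.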