
Journal of Graphics ›› 2024, Vol. 45 ›› Issue (4): 670-682. DOI: 10.11996/JG.j.2095-302X.2024040670

• Image Processing and Computer Vision •

A network based on the homogeneous middle modality for cross-modality person re-identification

LUO Zhihui1, HU Haitao1,2, MA Xiaofeng1, CHENG Wengang1,2

    1. School of Control and Computer Engineering, North China Electric Power University, Beijing 102206, China
    2. Engineering Research Center of Intelligent Computing for Complex Energy Systems, Ministry of Education, Baoding Hebei 071003, China
  • Received: 2024-03-07 Accepted: 2024-06-20 Online: 2024-08-31 Published: 2024-09-03
  • Contact: CHENG Wengang
  • About author:

    LUO Zhihui (1999-), master student. His main research interest is cross-modality person re-identification. E-mail: zhluo@ncepu.edu.cn

  • Supported by:
    National Key R&D Program of China (2023YFB3812100); Project of the Education Management Information Center, Ministry of Education (MOE-CIEM-20240013)

Abstract:

Visible-infrared cross-modality person re-identification (VI-ReID) aims to retrieve and match visible and infrared images of the same person captured by different cameras. In addition to the intra-modality discrepancies caused by factors such as viewpoint, pose, and scale variations common to person re-identification, the modality discrepancy between visible and infrared images poses a significant challenge for VI-ReID. Existing methods usually only constrain the features of the two modalities to reduce modality differences, while ignoring the essential differences in the imaging mechanisms of cross-modality images. To address this, this paper attempted to narrow the discrepancy between modalities by jointly generating an intermediate modality from the two modalities and optimizing feature learning on a vision Transformer (ViT)-based network through the fusion of local and global features. A feature fusion network based on the homogeneous middle modality (H-modality) was proposed for VI-ReID. Firstly, an H-modality generator was designed with a parameter-sharing encoder-decoder structure, constrained by a distribution consistency loss to bring the generated images closer in feature space. By jointly generating H-modality images from visible and infrared images, the images of the three modalities were projected into a unified feature space for joint constraining, thereby reducing the discrepancy between the visible and infrared modalities and achieving image-level alignment. Furthermore, a Transformer-based VI-ReID method built on the H-modality was proposed, with an additional local branch to enhance the network's local perception capability. In global feature extraction, a head enrich module was introduced to push the multiple heads of the class token to capture diverse patterns in the last Transformer block. The method combined global features with local features, improving the model's discriminative ability. The effect of each improvement was investigated through ablation experiments, in which different combinations of the sliding window, H-modality, local feature, and global feature enhancements were applied to the baseline ViT model. The results indicated that each improvement led to performance gains, demonstrating the effectiveness of the proposed method. The proposed method achieved rank-1/mAP of 67.68%/64.37% and 86.16%/79.11% on the SYSU-MM01 and RegDB datasets, respectively, outperforming most state-of-the-art methods. The proposed H-modality effectively reduces the modality discrepancy between visible and infrared images, and the feature fusion network obtains more discriminative features. Extensive experiments on the SYSU-MM01 and RegDB datasets demonstrate the superior performance of the proposed network compared with state-of-the-art methods.
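
The abstract gives only a high-level description of the H-modality generator, so the following PyTorch-style sketch is merely illustrative: it assumes a small parameter-sharing encoder-decoder applied to both modalities and approximates the distribution consistency loss as matching per-channel statistics of the two generated images. The names HModalityGenerator and distribution_consistency_loss, as well as all layer sizes, are hypothetical and are not taken from the paper.

    # Hypothetical sketch of an H-modality generator: a parameter-sharing
    # encoder-decoder that maps a visible image and an infrared image into a
    # shared "homogeneous middle modality". Layer sizes are illustrative
    # assumptions, not the authors' implementation.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class HModalityGenerator(nn.Module):
        def __init__(self, channels: int = 3, hidden: int = 64):
            super().__init__()
            # The same encoder/decoder weights process both modalities
            # (parameter sharing), projecting them into one space.
            self.encoder = nn.Sequential(
                nn.Conv2d(channels, hidden, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(inplace=True),
            )
            self.decoder = nn.Sequential(
                nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(hidden, channels, 3, padding=1), nn.Sigmoid(),
            )

        def forward(self, x_vis: torch.Tensor, x_ir: torch.Tensor):
            # Jointly generate H-modality images from both inputs.
            h_vis = self.decoder(self.encoder(x_vis))
            h_ir = self.decoder(self.encoder(x_ir))
            return h_vis, h_ir

    def distribution_consistency_loss(h_vis: torch.Tensor, h_ir: torch.Tensor) -> torch.Tensor:
        # One possible distribution-consistency constraint: pull the per-channel
        # mean/std statistics of the two generated H-modality images together.
        mu_v, mu_i = h_vis.mean(dim=(2, 3)), h_ir.mean(dim=(2, 3))
        std_v, std_i = h_vis.std(dim=(2, 3)), h_ir.std(dim=(2, 3))
        return F.mse_loss(mu_v, mu_i) + F.mse_loss(std_v, std_i)

    if __name__ == "__main__":
        gen = HModalityGenerator()
        vis = torch.rand(2, 3, 256, 128)  # visible image batch
        ir = torch.rand(2, 3, 256, 128)   # infrared images replicated to 3 channels
        h_vis, h_ir = gen(vis, ir)
        print(distribution_consistency_loss(h_vis, h_ir).item())

In the paper's pipeline, the generated H-modality images together with the original visible and infrared images would then be fed into the ViT-based feature fusion network, where class-token (head enrich) and local-branch features are combined; that part is not sketched here.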

Key words: person re-identification, cross-modality, Transformer, middle modality, feature fusion

CLC Number: