
Journal of Graphics ›› 2024, Vol. 45 ›› Issue (4): 670-682. DOI: 10.11996/JG.j.2095-302X.2024040670

• Image Processing and Computer Vision •

A network based on the homogeneous middle modality for cross-modality person re-identification

LUO Zhihui1, HU Haitao1,2, MA Xiaofeng1, CHENG Wengang1,2

    1. School of Control and Computer Engineering, North China Electric Power University, Beijing 102206, China
    2. Engineering Research Center of Intelligent Computing for Complex Energy Systems, Ministry of Education, Baoding Hebei 071003, China
  • Received: 2024-03-07 Accepted: 2024-06-20 Online: 2024-08-31 Published: 2024-09-03
  • Contact: CHENG Wengang
  • About author:

    LUO Zhihui (1999-), master student. His main research interest is cross-modality person re-identification. E-mail: zhluo@ncepu.edu.cn

  • Supported by:
    National Key R&D Program of China (2023YFB3812100); Project of the Education Management Information Center, Ministry of Education (MOE-CIEM-20240013)

Abstract:

Visible-infrared cross-modality person re-identification (VI-ReID) aims to retrieve and match visible and infrared images of the same person captured by different cameras. In addition to the intra-modality discrepancies caused by factors such as viewpoint, pose, and scale variations common to person re-identification, the modality discrepancy between visible and infrared images poses a significant challenge for VI-ReID. Existing methods usually only constrain the features of the two modalities to reduce modality differences, while ignoring the essential differences in the imaging mechanisms of cross-modality images. To address this, this paper attempted to narrow the discrepancy between modalities by jointly generating an intermediate modality from the two modalities and optimizing feature learning on a vision Transformer (ViT)-based network through the fusion of local and global features. A feature fusion network based on the homogeneous middle modality (H-modality) was proposed for VI-ReID. Firstly, an H-modality generator was designed with a parameter-sharing encoder-decoder structure, constrained by a distribution consistency loss to bring the generated images closer in feature space. By jointly generating H-modality images from visible and infrared images, the images of the three modalities were projected into a unified feature space for joint constraining, thereby reducing the discrepancy between the visible and infrared modalities and achieving image-level alignment. Furthermore, a Transformer-based VI-ReID method built on the H-modality was proposed, with an additional local branch to enhance the network's local perception capability. In global feature extraction, a head enrich module was introduced to push the multiple heads of the class token to capture diverse patterns in the last Transformer block. The method combined global features with local features, improving the model's discriminative ability. The effect of each improvement was investigated through ablation experiments, in which different combinations of the sliding window, H-modality, local feature, and global feature enhancements were applied to the baseline ViT model. The results indicated that each improvement led to performance gains, demonstrating the effectiveness of the proposed method. The proposed method achieved rank-1/mAP of 67.68%/64.37% and 86.16%/79.11% on the SYSU-MM01 and RegDB datasets, respectively, outperforming most state-of-the-art methods. The proposed H-modality effectively reduces the modality discrepancy between visible and infrared images, and the feature fusion network obtains more discriminative features. Extensive experiments on the SYSU-MM01 and RegDB datasets demonstrate the superior performance of the proposed network compared with state-of-the-art methods.
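
The abstract gives only a high-level description of the H-modality generator, so the following PyTorch-style sketch is merely illustrative: it assumes a small parameter-sharing encoder-decoder applied to both modalities and approximates the distribution consistency loss as matching per-channel statistics of the two generated images. The names HModalityGenerator and distribution_consistency_loss, as well as all layer sizes, are hypothetical and are not taken from the paper.

    # Hypothetical sketch of an H-modality generator: a parameter-sharing
    # encoder-decoder that maps a visible image and an infrared image into a
    # shared "homogeneous middle modality". Layer sizes are illustrative
    # assumptions, not the authors' implementation.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class HModalityGenerator(nn.Module):
        def __init__(self, channels: int = 3, hidden: int = 64):
            super().__init__()
            # The same encoder/decoder weights process both modalities
            # (parameter sharing), projecting them into one space.
            self.encoder = nn.Sequential(
                nn.Conv2d(channels, hidden, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(inplace=True),
            )
            self.decoder = nn.Sequential(
                nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(hidden, channels, 3, padding=1), nn.Sigmoid(),
            )

        def forward(self, x_vis: torch.Tensor, x_ir: torch.Tensor):
            # Jointly generate H-modality images from both inputs.
            h_vis = self.decoder(self.encoder(x_vis))
            h_ir = self.decoder(self.encoder(x_ir))
            return h_vis, h_ir

    def distribution_consistency_loss(h_vis: torch.Tensor, h_ir: torch.Tensor) -> torch.Tensor:
        # One possible distribution-consistency constraint: pull the per-channel
        # mean/std statistics of the two generated H-modality images together.
        mu_v, mu_i = h_vis.mean(dim=(2, 3)), h_ir.mean(dim=(2, 3))
        std_v, std_i = h_vis.std(dim=(2, 3)), h_ir.std(dim=(2, 3))
        return F.mse_loss(mu_v, mu_i) + F.mse_loss(std_v, std_i)

    if __name__ == "__main__":
        gen = HModalityGenerator()
        vis = torch.rand(2, 3, 256, 128)  # visible image batch
        ir = torch.rand(2, 3, 256, 128)   # infrared images replicated to 3 channels
        h_vis, h_ir = gen(vis, ir)
        print(distribution_consistency_loss(h_vis, h_ir).item())

In the paper's pipeline, the generated H-modality images together with the original visible and infrared images would then be fed into the ViT-based feature fusion network, where class-token (head enrich) and local-branch features are combined; that part is not sketched here.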

Key words: person re-identification, cross-modality, Transformer, middle modality, feature fusion

CLC Number: