
Journal of Graphics ›› 2024, Vol. 45 ›› Issue (4): 670-682. DOI: 10.11996/JG.j.2095-302X.2024040670

• Image Processing and Computer Vision •

A network based on the homogeneous middle modality for cross-modality person re-identification

LUO Zhihui1, HU Haitao1,2, MA Xiaofeng1, CHENG Wengang1,2

  1. School of Control and Computer Engineering, North China Electric Power University, Beijing 102206, China
    2. Engineering Research Center of Intelligent Computing for Complex Energy Systems, Ministry of Education, Baoding, Hebei 071003, China
  • Received: 2024-03-07 Accepted: 2024-06-20 Published: 2024-08-31 Online: 2024-09-03
  • Contact: CHENG Wengang (1977-), associate professor, Ph.D. His main research interest is multimedia information processing. E-mail: wgcheng@ncepu.edu.cn
  • First author: LUO Zhihui (1999-), master's student. His main research interest is cross-modality person re-identification. E-mail: zhluo@ncepu.edu.cn
  • Supported by:
    National Key R&D Program of China (2023YFB3812100); Project of the Education Management Information Center, Ministry of Education (MOE-CIEM-20240013)

Abstract:

Visible-infrared cross-modality person re-identification (VI-ReID) aims to retrieve and match visible and infrared images of the same person captured by different cameras. Besides the intra-modality discrepancies caused by factors such as viewpoint, pose, and scale variations, which also affect single-modality person re-identification, the modality discrepancy between visible and infrared images is the main challenge of VI-ReID. Existing methods usually perform joint feature learning on the two modalities to reduce the modality discrepancy, while ignoring the essential differences in the imaging mechanisms of cross-modality images. To address this, this paper narrowed the discrepancy between the modalities by jointly generating an intermediate modality from both of them, and optimized feature embedding learning on a standard vision transformer (ViT) network through the fusion of local and global features. A feature fusion network based on the homogeneous middle modality (H-modality) was proposed for VI-ReID. First, an H-modality generator was designed: a parameter-sharing encoder-decoder constrained by a distribution consistency loss that draws the generated images closer in feature space. By jointly generating H-modality images from visible and infrared images, the images of all three modalities were projected into a unified feature space and jointly constrained, thereby reducing the discrepancy between the visible and infrared modalities and achieving image-level alignment. On this basis, an H-modality-based Transformer method for VI-ReID was proposed, in which ViT extracts global features and an additional local branch enhances the network's local perception capability. In global feature extraction, a head enrich module was introduced to push the multiple heads of the class token in the last Transformer block to aggregate diverse patterns. The method fused global and local features, improving the model's discriminative ability. The effect of each improvement was investigated through ablation experiments, in which different combinations of the sliding window, H-modality, local feature, and global feature enhancements were applied to the baseline ViT model; each improvement led to performance gains, demonstrating the effectiveness of the proposed design. The proposed method achieved rank-1/mAP of 67.68%/64.37% and 86.16%/79.11% on the SYSU-MM01 and RegDB datasets, respectively, outperforming most state-of-the-art methods. Overall, the H-modality effectively reduces the modality discrepancy between visible and infrared images, and the feature fusion network yields more discriminative features.
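
One way to picture the image-level alignment step described above is the following minimal PyTorch-style sketch of a parameter-sharing encoder-decoder generator with a simple distribution consistency loss. All names (HModalityGenerator, distribution_consistency_loss), the layer sizes, and the exact loss form are illustrative assumptions, not the authors' released code.

```python
# Hypothetical sketch of the H-modality generator: a single
# parameter-sharing encoder-decoder maps both a visible and an infrared
# image into one homogeneous middle modality, so the two generated
# images land in a shared image space.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HModalityGenerator(nn.Module):
    def __init__(self, channels: int = 3, hidden: int = 16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(
            nn.Conv2d(hidden, channels, kernel_size=3, padding=1),
            nn.Tanh(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The same weights process both modalities (parameter sharing).
        return self.decoder(self.encoder(x))

def distribution_consistency_loss(h_vis: torch.Tensor, h_ir: torch.Tensor) -> torch.Tensor:
    # One plausible form of a distribution consistency loss: pull the
    # channel-wise statistics of the two generated images together.
    mu_v, mu_i = h_vis.mean(dim=(2, 3)), h_ir.mean(dim=(2, 3))
    sd_v, sd_i = h_vis.std(dim=(2, 3)), h_ir.std(dim=(2, 3))
    return F.mse_loss(mu_v, mu_i) + F.mse_loss(sd_v, sd_i)

# Usage: infrared frames are single-channel, so replicate to three
# channels before the shared generator; the generated H-modality images
# are then fed to the backbone together with the original two modalities.
gen = HModalityGenerator()
vis = torch.randn(2, 3, 256, 128)                    # visible RGB batch
ir = torch.randn(2, 1, 256, 128).repeat(1, 3, 1, 1)  # IR replicated to 3 channels
h_vis, h_ir = gen(vis), gen(ir)
loss = distribution_consistency_loss(h_vis, h_ir)
```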
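
The local branch and the global-local fusion can likewise be sketched as stripe pooling over the ViT patch tokens followed by concatenation with the class token. The 16×8 patch grid, the four stripes, and the name fuse_global_local are assumptions made for illustration, not details taken from the paper.

```python
# Hypothetical sketch of a local branch over ViT outputs: split the
# patch tokens into horizontal stripes, average-pool each stripe into a
# part-level feature, and concatenate with the global class token.
import torch

def fuse_global_local(tokens: torch.Tensor, grid_h: int = 16, grid_w: int = 8,
                      num_stripes: int = 4) -> torch.Tensor:
    # tokens: (B, 1 + grid_h * grid_w, D) ViT output, class token first.
    b, _, d = tokens.shape
    cls, patches = tokens[:, 0], tokens[:, 1:]
    patches = patches.view(b, grid_h, grid_w, d)
    stripes = patches.chunk(num_stripes, dim=1)      # split along image height
    locals_ = [s.mean(dim=(1, 2)) for s in stripes]  # one (B, D) feature per stripe
    return torch.cat([cls] + locals_, dim=1)         # (B, (1 + num_stripes) * D)

# Usage with ViT-Base-like shapes:
tokens = torch.randn(2, 1 + 16 * 8, 768)
fused = fuse_global_local(tokens)                    # (2, 5 * 768)
```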
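
Finally, one plausible reading of the head enrich module is a regularizer that penalizes pairwise similarity between the per-head slices of the last block's class token, pushing different heads to aggregate different patterns. The formulation below (head_diversity_loss) is an assumed illustration; the paper's exact objective may differ.

```python
# Hypothetical head-diversity regularizer: the mean absolute pairwise
# cosine similarity between head slices of the class token, to be added
# to the ReID objective so that heads learn diverse patterns.
import torch
import torch.nn.functional as F

def head_diversity_loss(cls_token: torch.Tensor, num_heads: int) -> torch.Tensor:
    # cls_token: (B, D) class token of the last block, D = num_heads * head_dim.
    b, d = cls_token.shape
    heads = F.normalize(cls_token.view(b, num_heads, d // num_heads), dim=-1)
    sim = torch.matmul(heads, heads.transpose(1, 2))          # (B, H, H) cosine similarities
    off_diag = sim - torch.eye(num_heads, device=sim.device)  # zero out self-similarity
    return off_diag.abs().sum(dim=(1, 2)).mean() / (num_heads * (num_heads - 1))

# Usage with a ViT-Base-like class token (12 heads, 768 dimensions):
cls = torch.randn(4, 768)
reg = head_diversity_loss(cls, num_heads=12)
```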

Key words: person re-identification, cross-modality, Transformer, middle modality, feature fusion

CLC number: