
Journal of Graphics ›› 2026, Vol. 47 ›› Issue (1): 78-89. DOI: 10.11996/JG.j.2095-302X.2026010078

• Image Processing and Computer Vision •

Deep fusion of multimodal features for few-shot class-incremental 3D point cloud classification

ZHU Chenxi1, LU Yinan1, WU Tieru2, GONG Wenyong3, MA Rui2()

  1. College of Computer Science and Technology, Jilin University, Changchun, Jilin 130012, China
  2. School of Artificial Intelligence, Jilin University, Changchun, Jilin 130012, China
  3. College of Information Science and Technology, Jinan University, Guangzhou, Guangdong 510632, China
  • Received: 2025-06-30  Accepted: 2025-08-23  Published: 2026-02-28  Online: 2026-03-16
  • Corresponding author: MA Rui, E-mail: ruim@jlu.edu.cn
  • Supported by:
    National Natural Science Foundation of China (62202199)

Abstract:

Traditional 3D point-cloud classification methods tend to suffer from insufficient generalization and catastrophic forgetting in Few-Shot Class-Incremental Learning (FSCIL) scenarios. The pretrained vision-language model CLIP (Contrastive Language-Image Pre-training), which contains rich 2D shape priors, has been shown to effectively enhance 3D FSCIL performance. However, existing CLIP-based frameworks still lack flexibility and adaptability in multimodal feature extraction and fusion, which limits classification accuracy during incremental stages. To address these shortcomings, a 3D FSCIL approach with deeply fused multimodal features was proposed. An adaptive adapter based on gated units and residual blocks was introduced to achieve multi-scale feature alignment and redundancy suppression, and a multimodal global feature dynamic fusion module with self-attention was designed to adaptively adjust the weight allocation of the different feature streams according to sample characteristics, thereby obtaining more consistent and complementary fused representations. Specifically, point clouds were rendered into multi-view depth maps, and features were extracted using both the original CLIP visual encoder and a CLIP encoder pretrained on depth maps, combined with point-cloud geometric features. After processing through the adaptive adapter, these features were fed into the attention-based fusion module and aligned with semantic features extracted by the CLIP text encoder for classification. In addition, a contrastive learning loss, multi-view and geometric perturbation-based data augmentation strategies, and a memory-replay mechanism were incorporated to effectively mitigate overfitting and forgetting under few-shot conditions. Experiments on ShapeNet, ModelNet, and CO3D demonstrated that the proposed method consistently achieved higher accuracy across incremental stages compared with existing mainstream 3D FSCIL approaches, while significantly reducing both the relative accuracy drop rate and the maximum stage jump rate.
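The paper's adapter architecture is only summarized above, not specified. As a rough illustration of the general idea, a gated residual adapter over a global feature vector might look like the following numpy sketch; the shapes, the bottleneck MLP, and the sigmoid gating are all assumptions for illustration, not the authors' implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GatedResidualAdapter:
    """Toy adapter: a residual bottleneck MLP whose contribution is scaled
    by a learned per-dimension sigmoid gate, so redundant channels can be
    suppressed while the residual path preserves the original feature."""
    def __init__(self, dim, bottleneck, rng):
        # Small random weights stand in for trained parameters.
        self.w_down = rng.standard_normal((dim, bottleneck)) * 0.02
        self.w_up = rng.standard_normal((bottleneck, dim)) * 0.02
        self.w_gate = rng.standard_normal((dim, dim)) * 0.02

    def __call__(self, x):                                  # x: (batch, dim)
        h = np.maximum(x @ self.w_down, 0.0) @ self.w_up    # bottleneck MLP
        g = sigmoid(x @ self.w_gate)                        # per-dim gate in (0, 1)
        return x + g * h                                    # gated residual output

rng = np.random.default_rng(0)
adapter = GatedResidualAdapter(dim=512, bottleneck=64, rng=rng)
feat = rng.standard_normal((2, 512))                        # e.g. CLIP global features
out = adapter(feat)
print(out.shape)  # (2, 512)
```

The residual connection keeps the adapted feature close to the pretrained CLIP feature, which is one common way adapters avoid destroying the backbone's priors.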

Key words: 3D point cloud, incremental learning, few-shot learning, 3D classification, pre-trained model
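The "dynamic fusion" described in the abstract weights the image, depth-map, and geometric feature streams per sample. A minimal numpy sketch of such sample-dependent weighting is shown below; the single scoring vector and softmax weighting are a simplified stand-in for the paper's self-attention module, whose exact design is not given here:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_streams(streams, w_score):
    """Fuse M per-sample global features (each (batch, dim)) with
    sample-dependent convex weights -- a simplified illustration of
    attention-based multimodal fusion."""
    stack = np.stack(streams, axis=1)                   # (batch, M, dim)
    scores = stack @ w_score                            # (batch, M): one score per stream
    weights = softmax(scores, axis=1)                   # convex weights per sample
    fused = (weights[..., None] * stack).sum(axis=1)    # (batch, dim) weighted sum
    return fused, weights

rng = np.random.default_rng(1)
dim = 512
# Hypothetical streams: CLIP image, depth-CLIP, and geometric features.
streams = [rng.standard_normal((4, dim)) for _ in range(3)]
w_score = rng.standard_normal(dim) * 0.05               # stand-in for learned scoring
fused, weights = fuse_streams(streams, w_score)
print(fused.shape, weights.shape)  # (4, 512) (4, 3)
```

Because the weights are recomputed from each sample's own features, a sample whose depth-map rendering is uninformative can down-weight that stream, which is the intuition behind adapting the fusion per sample.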
