
Journal of Graphics ›› 2026, Vol. 47 ›› Issue (1): 78-89. DOI: 10.11996/JG.j.2095-302X.2026010078

• Image Processing and Computer Vision •

Deep fusion of multimodal features for few-shot class-incremental 3D point cloud classification

ZHU Chenxi1, LU Yinan1, WU Tieru2, GONG Wenyong3, MA Rui2

  1. College of Computer Science and Technology, Jilin University, Changchun Jilin 130012, China
  2. School of Artificial Intelligence, Jilin University, Changchun Jilin 130012, China
  3. College of Information Science and Technology, Jinan University, Guangzhou Guangdong 510632, China
  • Received: 2025-06-30 Accepted: 2025-08-23 Online: 2026-02-28 Published: 2026-03-16
  • Contact: MA Rui
  • Supported by:
    National Natural Science Foundation of China (62202199)

Abstract:

Traditional 3D point-cloud classification methods often suffer from poor generalization and catastrophic forgetting in Few-Shot Class-Incremental Learning (FSCIL) scenarios. The pretrained vision-language model CLIP (Contrastive Language-Image Pre-training), which encodes rich 2D shape priors, has been shown to effectively enhance 3D FSCIL performance. However, existing CLIP-based frameworks still lack flexibility and adaptability in multimodal feature extraction and fusion, which limits classification accuracy during the incremental stages. To address these shortcomings, a 3D FSCIL approach with deeply fused multimodal features was proposed. An adaptive adapter based on gated units and residual blocks was introduced to achieve multi-scale feature alignment and redundancy suppression, and a multimodal global feature dynamic fusion module based on self-attention was designed to adaptively adjust the weights assigned to the different feature streams according to sample characteristics, yielding more consistent and complementary fused representations. Specifically, point clouds were rendered into multi-view depth maps, and features were extracted using both the original CLIP visual encoder and a CLIP encoder pretrained on depth maps, combined with point-cloud geometric features. After passing through the adaptive adapter, these features were fed into the attention-based fusion module and aligned with semantic features extracted by the CLIP text encoder for classification. In addition, a contrastive learning loss, data augmentation strategies based on multi-view rendering and geometric perturbation, and a memory-replay mechanism were incorporated to effectively mitigate overfitting and forgetting under few-shot conditions. Experiments on ShapeNet, ModelNet, and CO3D demonstrated that the proposed method consistently achieved higher accuracy across incremental stages than existing 3D FSCIL approaches, while significantly reducing both the relative accuracy drop rate and the maximum stage-to-stage fluctuation.
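The gated residual adapter and the dynamic fusion of the three feature streams described in the abstract can be sketched roughly as follows. This is a minimal NumPy toy, not the authors' implementation: all shapes, weight matrices, and feature vectors are hypothetical, and the full self-attention fusion module is simplified here to a single learned scoring vector that produces softmax weights over the streams.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gated_residual_adapter(x, w_down, w_up, w_gate):
    """Residual bottleneck whose learned gate can suppress redundant channels."""
    h = np.maximum(w_down @ x, 0.0)   # down-project + ReLU
    g = sigmoid(w_gate @ x)           # per-channel gate in (0, 1)
    return x + g * (w_up @ h)         # gated residual update

def dynamic_fusion(streams, w_score):
    """Weight each modality stream by a sample-dependent softmax score."""
    feats = np.stack(streams)         # (n_streams, d)
    weights = softmax(feats @ w_score)  # one adaptive weight per stream
    return weights @ feats, weights   # fused (d,) representation

d, d_mid = 8, 4
x_img = rng.normal(size=d)  # CLIP visual-encoder feature (illustrative)
x_dep = rng.normal(size=d)  # depth-pretrained CLIP feature (illustrative)
x_geo = rng.normal(size=d)  # point-cloud geometric feature (illustrative)

w_down = rng.normal(size=(d_mid, d))
w_up = rng.normal(size=(d, d_mid))
w_gate = rng.normal(size=(d, d))
w_score = rng.normal(size=d)

adapted = [gated_residual_adapter(x, w_down, w_up, w_gate)
           for x in (x_img, x_dep, x_geo)]
fused, weights = dynamic_fusion(adapted, w_score)
print(fused.shape, weights)
```

In a faithful implementation the fused representation would then be matched against CLIP text-encoder embeddings of the class names for classification; the point of the sketch is only that the per-stream weights are computed from the sample's own features rather than fixed.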

Key words: 3D point cloud, incremental learning, few-shot learning, 3D classification, pre-trained model
