3D Scene-Graph Generation via Vision-Language Model Distillation and Large Language Model Parsing

doi:10.11996/JG.j.2095-302X.2026020360

Abstract

Point clouds alone express semantic relationships poorly in 3D scene-graph generation, so existing methods typically render corresponding images and fuse multimodal features, which adds computational overhead at inference. To address this problem, a 3D scene-graph generation method based on vision-language model (VL model) distillation and large language model (LLM) parsing is proposed. The method takes 3D point clouds as input, renders corresponding images, and aligns the two in feature space to distill knowledge from the VL model into a graph neural network (GNN). This establishes a mapping between point-cloud instances and their corresponding text, yielding a point-cloud-language model (PL model). The PL model uses an LLM to improve its understanding of complex semantic relationships and aggregates node features through the GNN, so it captures the semantic and spatial relationships of point clouds without additional image information and generates 3D scene graphs for indoor environments. Experimental results show that the method achieves robust understanding of 3D indoor environments in open-vocabulary tasks and, compared with end-to-end 3D scene-graph generation methods that rely on a VL model, significantly reduces computational overhead and time cost at inference, indicating good performance and practical value.

Key words: 3D scene understanding, 3D scene-graph generation, vision-language model, large language model, knowledge distillation
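The distillation step described in the abstract can be illustrated with a minimal sketch. The abstract states only that point-cloud features are aligned with VL-model image features in feature space; the cosine-distance loss, the function names, and the toy label lookup below are assumptions made for illustration, not the authors' implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def distillation_loss(student_feats, teacher_feats):
    """Mean cosine distance between paired student and teacher features.

    student_feats: per-instance features from the point-cloud branch (GNN).
    teacher_feats: features of the rendered images from the frozen VL model.
    ASSUMPTION: cosine distance is used; the source does not name the loss.
    """
    if not student_feats or len(student_feats) != len(teacher_feats):
        raise ValueError("need two non-empty feature lists of equal length")
    distances = [1.0 - cosine_similarity(s, t)
                 for s, t in zip(student_feats, teacher_feats)]
    return sum(distances) / len(distances)


def rank_labels(instance_feat, label_embeddings):
    """Rank candidate text labels by similarity to one instance feature.

    label_embeddings: dict mapping a label string to its text embedding
    (hypothetical; any text encoder sharing the feature space would do).
    """
    return sorted(
        label_embeddings,
        key=lambda label: cosine_similarity(instance_feat, label_embeddings[label]),
        reverse=True,
    )
```

Once the student features live in the teacher's feature space, an open-vocabulary query reduces to ranking text embeddings against an instance feature, which is what `rank_labels` sketches; no image is needed at that point.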

CLC number: TP391.41

LU Yaguang, SHEN Xukun, HU Yong. 3D scene-graph generation via vision-language model distillation and large language model parsing[J]. Journal of Graphics, 2026, 47(2): 360-367.

Figures/Tables: 9

References: 26

[1]	CHANG X J, REN P Z, XU P F, et al. A comprehensive survey of scene graphs: generation and application[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(1): 1-26.
[2]	JOHNSON J, KRISHNA R, STARK M, et al. Image retrieval using scene graphs[C]// 2015 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2015: 3668-3678.
[3]	WU S C, WALD J, TATENO K, et al. SceneGraphFusion: incremental 3D scene graph prediction from RGB-D sequences[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 7511-7521.
[4]	WU S C, TATENO K, NAVAB N, et al. Incremental 3D semantic scene graph prediction from RGB sequences[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 5064-5074.
[5]	LU Y G, HU Y, FENG H Y, et al. Generating reconstructable collaborative virtual environments via graph matching for mixed reality remote collaboration[J]. The Visual Computer, 2025, 41(8): 5935-5947.
[6]	DAHNERT M, HOU J, NIEßNER M, et al. Panoptic 3D scene reconstruction from a single RGB image[C]// The 35th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2021: 633.
[7]	WALD J, DHAMO H, NAVAB N, et al. Learning 3D semantic scene graphs from 3D indoor reconstructions[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 3960-3969.
[8]	WALD J, NAVAB N, TOMBARI F. Learning 3D semantic scene graphs with instance embeddings[J]. International Journal of Computer Vision, 2022, 130(3): 630-651.
[9]	KOCH S, HERMOSILLA P, VASKEVICIUS N, et al. SGRec3D: self-supervised 3D scene graph learning via object-level scene reconstruction[C]// 2024 IEEE/CVF Winter Conference on Applications of Computer Vision. New York: IEEE Press, 2024: 3392-3402.
[10]	ARMENI I, HE Z Y, ZAMIR A, et al. 3D scene graph: a structure for unified semantics, 3D space, and camera[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 5663-5672.
[11]	HUGHES N, CHANG Y, CARLONE L. Hydra: a real-time spatial perception system for 3D scene graph construction and optimization[EB/OL]. [2025-08-23]. https://dblp.org/db/conf/rss/rss2022.html#HughesCC22.
[12]	KOCH S, HERMOSILLA P, VASKEVICIUS N, et al. Lang3DSG: language-based contrastive pre-training for 3D scene graph prediction[C]// 2024 International Conference on 3D Vision. New York: IEEE Press, 2024: 1037-1047.
[13]	RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[EB/OL]. [2025-08-23]. http://proceedings.mlr.press/v139/radford21a.html.
[14]	LV C S, QI M S, LI X, et al. SGFormer: semantic graph transformer for point cloud-based 3D scene graph generation[C]// The 38th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2024: 4035-4043.
[15]	CHANG H N, KOWNDINYA B, LU S Y, et al. Context-aware entity grounding with open vocabulary 3D scene graphs[C]// The 7th Conference on Robot Learning. New York: PMLR Press, 2023: 1950-1974.
[16]	REIMERS N, GUREVYCH I. Sentence-BERT: sentence embeddings using Siamese BERT-networks[EB/OL]. [2025-08-23]. https://aclanthology.org/D19-1410/.
[17]	KOCH S, VASKEVICIUS N, COLOSI M, et al. Open3DSG: open-vocabulary 3D scene graphs from point clouds with queryable objects and open-set relationships[C]// 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2024: 14183-14193.
[18]	GHIASI G, GU X Y, CUI Y, et al. Scaling open-vocabulary image segmentation with image-level labels[C]// The 17th European Conference on Computer Vision. Cham: Springer, 2022: 540-557.
[19]	LI J N, LI D X, XIONG C M, et al. BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation[EB/OL]. [2025-08-23]. https://proceedings.mlr.press/v162/li22n.html.
[20]	LI J N, LI D X, SAVARESE S, et al. BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models[EB/OL]. [2025-08-23]. https://proceedings.mlr.press/v202/li23q.html.
[21]	DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[EB/OL]. [2025-08-23]. https://aclanthology.org/N19-1423/.
[22]	CHEN L G X, WANG X J, LU J L, et al. CLIP-driven open-vocabulary 3D scene graph generation via cross-modality contrastive learning[C]// 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2024: 27863-27873.
[23]	WANG Z Q, CHENG B W, ZHAO L C, et al. VL-SAT: visual-linguistic semantics assisted training for 3D semantic scene graph prediction in point cloud[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 21560-21569.
[24]	QI C R, YI L, SU H, et al. PointNet++: deep hierarchical feature learning on point sets in a metric space[EB/OL]. [2025-08-23]. https://proceedings.neurips.cc/paper_files/paper/2017/file/d8bf84be3800d12f74d8b05e9b89836f-Paper.pdf.
[25]	ARMENI I, SENER O, ZAMIR A R, et al. 3D semantic parsing of large-scale indoor spaces[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 1534-1543.
[26]	ZHAO L, TAO W B. JSNet: joint instance and semantic segmentation of 3D point clouds[C]// The 34th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2020: 12951-12958.

Method	Object R@5	Object R@10	Predicate R@3	Predicate R@5	Relationship R@50	Relationship R@100
3DSSG [7]	0.68	0.78	0.89	0.93	0.40	0.66
SGFN [3]	0.70	0.80	0.97	0.99	0.85	0.87
SGRec3D [9]	0.80	0.87	0.97	0.99	0.89	0.91
VL-SAT [23]	0.78	0.86	0.98	0.99	0.90	0.93
Open3DSG [17]	0.57	0.68	0.63	0.70	0.64	0.66
Ours	0.61	0.73	0.61	0.65	0.63	0.67
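The R@K columns above are recall-at-K scores. The exact evaluation protocol is not given in this section, so the following is a minimal sketch under the common top-K definition: a ground-truth label counts as recalled if it appears among the K highest-ranked predictions for that instance. The function name and the toy data in the usage note are hypothetical.

```python
def recall_at_k(ranked_predictions, ground_truth, k):
    """Fraction of instances whose true label is in the top-k predictions.

    ranked_predictions: one list per instance, labels sorted best-first.
    ground_truth: the true label of each instance, in the same order.
    ASSUMPTION: standard top-k recall; the source does not spell out
    its evaluation protocol.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not ground_truth or len(ranked_predictions) != len(ground_truth):
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(
        1
        for ranking, label in zip(ranked_predictions, ground_truth)
        if label in ranking[:k]
    )
    return hits / len(ground_truth)
```

For example, with rankings `[["chair", "table", "sofa"], ["wall", "floor", "door"]]` and true labels `["table", "door"]`, the score rises from 0.0 at K=1 to 0.5 at K=2 and 1.0 at K=3, which is why recall is always reported together with its K.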

Method	Metric	Head	Body	Tail	All
3DSSG [7]	Object R@5	0.88	0.45	0.06	0.30
SGRec3D [9]	Object R@5	0.92	0.78	0.24	0.45
VL-SAT [23]	Object R@5	0.92	0.73	0.31	0.46
Open3DSG [17]	Object R@5	0.60	0.50	0.42	0.45
Ours	Object R@5	0.71	0.52	0.40	0.47
3DSSG [7]	Predicate R@3	0.94	0.83	0.41	0.57
SGRec3D [9]	Predicate R@3	0.97	0.96	0.65	0.69
VL-SAT [23]	Predicate R@3	0.99	0.94	0.58	0.75
Open3DSG [17]	Predicate R@3	0.38	0.29	0.57	0.37
Ours	Predicate R@3	0.35	0.25	0.51	0.33

Method	Object R@5	Object mR@5	Predicate R@3	Predicate mR@3
2D	0.72	0.63	0.63	0.25
3D	0.43	0.22	0.58	0.31
2D-3D	0.75	0.58	0.61	0.35
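The mR@K columns are mean recall scores. As a hedged sketch, assuming the usual definition (recall computed separately for each ground-truth class and then averaged with equal weight per class), mR@K can be computed as below; the function name is hypothetical and the source does not confirm this exact protocol.

```python
from collections import defaultdict


def mean_recall_at_k(ranked_predictions, ground_truth, k):
    """Average of per-class recall@k, each class weighted equally.

    ASSUMPTION: the usual mR@K definition; classes are taken from the
    ground-truth labels that actually occur in the evaluated data.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not ground_truth or len(ranked_predictions) != len(ground_truth):
        raise ValueError("need two non-empty lists of equal length")
    hits = defaultdict(int)    # correct top-k retrievals per class
    totals = defaultdict(int)  # ground-truth instances per class
    for ranking, label in zip(ranked_predictions, ground_truth):
        totals[label] += 1
        if label in ranking[:k]:
            hits[label] += 1
    per_class = [hits[label] / totals[label] for label in totals]
    return sum(per_class) / len(per_class)
```

Because every class counts equally, mR@K is more sensitive than R@K to how well infrequent classes are recalled, so a large gap between the two columns usually points to uneven performance across classes.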