3D scene-graph generation via vision-language model distillation and large language model parsing

doi:10.11996/JG.j.2095-302X.2026020360

Abstract

Point clouds alone express semantic relationships poorly, so 3D scene-graph generation methods typically render corresponding images and fuse multimodal features, which adds computational overhead during inference. To address this limitation, a 3D scene-graph generation method based on vision-language model (VL model) distillation and a large language model (LLM) was proposed. The method took 3D point clouds as input, rendered the corresponding images, and aligned the two feature spaces to distill knowledge from the VL model into a graph neural network (GNN). This established a mapping between point-cloud instances and their textual descriptions and yielded a point-cloud-language model (PL model). The PL model leveraged an LLM to improve the understanding of complex semantic relationships and aggregated node features through the GNN. It captured both the semantic and spatial relationships of point clouds without additional image information at inference, enabling 3D scene-graph generation for indoor environments. Experimental results showed that the proposed method achieved robust understanding of 3D indoor environments in open-vocabulary tasks and significantly reduced computational overhead and inference time compared with end-to-end 3D scene-graph generation approaches that relied on vision-language models, which indicates its practical applicability.
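
The feature-space alignment described in the abstract can be illustrated with a minimal sketch. It assumes that distillation minimizes the cosine distance between the feature the GNN (student) produces for each point-cloud instance and the feature the VL model (teacher) extracts from the rendered image of the same instance. The abstract does not state the actual loss, so this choice is an assumption, and the names `cosine_similarity` and `distillation_loss` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def distillation_loss(student_feats, teacher_feats):
    """Mean cosine distance between paired student and teacher features.

    student_feats: per-instance features from the point-cloud GNN (assumed).
    teacher_feats: per-instance features from the VL model (assumed),
                   in the same order as student_feats.
    """
    if len(student_feats) != len(teacher_feats):
        raise ValueError("student and teacher must cover the same instances")
    if not student_feats:
        return 0.0
    total = sum(
        1.0 - cosine_similarity(s, t)
        for s, t in zip(student_feats, teacher_feats)
    )
    return total / len(student_feats)
```

Under this assumption the loss is 0 when every student feature points in the same direction as its teacher feature and grows toward 2 as they point in opposite directions, so the point-cloud branch is pulled into the VL model's embedding space and images are no longer needed once training is done.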

Key words: 3D scene understanding, 3D scene-graph generation, vision-language model, large language model, knowledge distillation

CLC Number: TP391.41

LU Yaguang, SHEN Xukun, HU Yong. 3D scene-graph generation via vision-language model distillation and large language model parsing[J]. Journal of Graphics, 2026, 47(2): 360-367.

References

[1]	CHANG X J, REN P Z, XU P F, et al. A comprehensive survey of scene graphs: generation and application[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(1): 1-26.
[2]	JOHNSON J, KRISHNA R, STARK M, et al. Image retrieval using scene graphs[C]// 2015 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2015: 3668-3678.
[3]	WU S C, WALD J, TATENO K, et al. SceneGraphFusion: incremental 3D scene graph prediction from RGB-D sequences[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 7511-7521.
[4]	WU S C, TATENO K, NAVAB N, et al. Incremental 3D semantic scene graph prediction from RGB sequences[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 5064-5074.
[5]	LU Y G, HU Y, FENG H Y, et al. Generating reconstructable collaborative virtual environments via graph matching for mixed reality remote collaboration[J]. The Visual Computer, 2025, 41(8): 5935-5947.
[6]	DAHNERT M, HOU J, NIEßNER M, et al. Panoptic 3D scene reconstruction from a single RGB image[C]// The 35th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2021: 633.
[7]	WALD J, DHAMO H, NAVAB N, et al. Learning 3D semantic scene graphs from 3D indoor reconstructions[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 3960-3969.
[8]	WALD J, NAVAB N, TOMBARI F. Learning 3D semantic scene graphs with instance embeddings[J]. International Journal of Computer Vision, 2022, 130(3): 630-651.
[9]	KOCH S, HERMOSILLA P, VASKEVICIUS N, et al. SGRec3D: self-supervised 3D scene graph learning via object-level scene reconstruction[C]// 2024 IEEE/CVF Winter Conference on Applications of Computer Vision. New York: IEEE Press, 2024: 3392-3402.
[10]	ARMENI I, HE Z Y, ZAMIR A, et al. 3D scene graph: a structure for unified semantics, 3D space, and camera[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 5663-5672.
[11]	HUGHES N, CHANG Y, CARLONE L. Hydra: a real-time spatial perception system for 3D scene graph construction and optimization[EB/OL]. [2025-08-23]. https://dblp.org/db/conf/rss/rss2022.html#HughesCC22.
[12]	KOCH S, HERMOSILLA P, VASKEVICIUS N, et al. Lang3DSG: language-based contrastive pre-training for 3D scene graph prediction[C]// 2024 International Conference on 3D Vision. New York: IEEE Press, 2024: 1037-1047.
[13]	RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[EB/OL]. [2025-08-23]. http://proceedings.mlr.press/v139/radford21a.html.
[14]	LV C S, QI M S, LI X, et al. SGFormer: semantic graph transformer for point cloud-based 3D scene graph generation[C]// The 38th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2024: 4035-4043.
[15]	CHANG H N, KOWNDINYA B, LU S Y, et al. Context-aware entity grounding with open vocabulary 3D scene graphs[C]// The 7th Conference on Robot Learning. New York: PMLR Press, 2023: 1950-1974.
[16]	REIMERS N, GUREVYCH I. Sentence-BERT: sentence embeddings using Siamese BERT-networks[EB/OL]. [2025-08-23]. https://aclanthology.org/D19-1410/.
[17]	KOCH S, VASKEVICIUS N, COLOSI M, et al. Open3DSG: open-vocabulary 3D scene graphs from point clouds with queryable objects and open-set relationships[C]// 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2024: 14183-14193.
[18]	GHIASI G, GU X Y, CUI Y, et al. Scaling open-vocabulary image segmentation with image-level labels[C]// The 17th European Conference on Computer Vision. Cham: Springer, 2022: 540-557.
[19]	LI J N, LI D X, XIONG C M, et al. BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation[EB/OL]. [2025-08-23]. https://proceedings.mlr.press/v162/li22n.html.
[20]	LI J N, LI D X, SAVARESE S, et al. BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models[EB/OL]. [2025-08-23]. https://proceedings.mlr.press/v202/li23q.html.
[21]	DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[EB/OL]. [2025-08-23]. https://aclanthology.org/N19-1423/.
[22]	CHEN L G X, WANG X J, LU J L, et al. CLIP-driven open-vocabulary 3D scene graph generation via cross-modality contrastive learning[C]// 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2024: 27863-27873.
[23]	WANG Z Q, CHENG B W, ZHAO L C, et al. VL-SAT: visual-linguistic semantics assisted training for 3D semantic scene graph prediction in point cloud[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 21560-21569.
[24]	QI C R, YI L, SU H, et al. PointNet++: deep hierarchical feature learning on point sets in a metric space[EB/OL]. [2025-08-23]. https://proceedings.neurips.cc/paper_files/paper/2017/file/d8bf84be3800d12f74d8b05e9b89836f-Paper.pdf.
[25]	ARMENI I, SENER O, ZAMIR A R, et al. 3D semantic parsing of large-scale indoor spaces[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 1534-1543.
[26]	ZHAO L, TAO W B. JSNet: joint instance and semantic segmentation of 3D point clouds[C]// The 34th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2020: 12951-12958.

Method	Object R@5	Object R@10	Predicate R@3	Predicate R@5	Relationship R@50	Relationship R@100
3DSSG^[7]	0.68	0.78	0.89	0.93	0.40	0.66
SGFN^[3]	0.70	0.80	0.97	0.99	0.85	0.87
SGRec3D^[9]	0.80	0.87	0.97	0.99	0.89	0.91
VL-SAT^[23]	0.78	0.86	0.98	0.99	0.90	0.93
Open3DSG^[17]	0.57	0.68	0.63	0.70	0.64	0.66
Ours	0.61	0.73	0.61	0.65	0.63	0.67

Method	Metric	Head	Body	Tail	All
3DSSG^[7]	Object R@5	0.88	0.45	0.06	0.30
SGRec3D^[9]	Object R@5	0.92	0.78	0.24	0.45
VL-SAT^[23]	Object R@5	0.92	0.73	0.31	0.46
Open3DSG^[17]	Object R@5	0.60	0.50	0.42	0.45
Ours	Object R@5	0.71	0.52	0.40	0.47
3DSSG^[7]	Predicate R@3	0.94	0.83	0.41	0.57
SGRec3D^[9]	Predicate R@3	0.97	0.96	0.65	0.69
VL-SAT^[23]	Predicate R@3	0.99	0.94	0.58	0.75
Open3DSG^[17]	Predicate R@3	0.38	0.29	0.57	0.37
Ours	Predicate R@3	0.35	0.25	0.51	0.33

Method	Object R@5	Object mR@5	Predicate R@3	Predicate mR@3
2D	0.72	0.63	0.63	0.25
3D	0.43	0.22	0.58	0.31
2D-3D	0.75	0.58	0.61	0.35
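
The R@K and mR@K columns in the tables above can be read through the following sketch. It assumes the common per-instance definition: R@K is the fraction of instances whose ground-truth label appears among the top K ranked predictions, and mR@K averages that recall over classes so that rare classes weigh as much as frequent ones. The exact evaluation protocol is not given in this section, the Relationship R@50 and R@100 columns may follow a different, triplet-level protocol that is not covered here, and the function names and toy labels are hypothetical.

```python
from collections import defaultdict


def recall_at_k(ranked_predictions, ground_truth, k):
    """Fraction of instances whose true label is in the top-k predictions."""
    if not ground_truth:
        return 0.0
    hits = sum(
        1 for ranked, gt in zip(ranked_predictions, ground_truth)
        if gt in ranked[:k]
    )
    return hits / len(ground_truth)


def mean_recall_at_k(ranked_predictions, ground_truth, k):
    """Recall@k computed per ground-truth class, then averaged over classes."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for ranked, gt in zip(ranked_predictions, ground_truth):
        totals[gt] += 1
        if gt in ranked[:k]:
            hits[gt] += 1
    if not totals:
        return 0.0
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy data: three "chair" instances and one "sofa" instance,
# each with predictions ranked from most to least confident.
predictions = [
    ["chair", "table"],
    ["chair", "sofa"],
    ["table", "chair"],
    ["lamp", "sofa"],
]
labels = ["chair", "chair", "chair", "sofa"]

r1 = recall_at_k(predictions, labels, 1)        # 2 of 4 instances hit -> 0.5
mr1 = mean_recall_at_k(predictions, labels, 1)  # (2/3 + 0/1) / 2 -> 1/3
```

In the toy data the frequent class lifts R@1 to 0.5 while the missed rare class pulls mR@1 down to about 0.33, which is the kind of gap between R@K and mR@K that the long-tail (Head, Body, Tail) breakdown is meant to expose.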