Journal of Graphics, 2026, Vol. 47, Issue (2): 360-367. DOI: 10.11996/JG.j.2095-302X.2026020360
• Computer Graphics and Virtual Reality •
LU Yaguang1, SHEN Xukun1,2, HU Yong1,2
Received: 2025-10-23
Accepted: 2025-12-15
Online: 2026-04-30
Published: 2026-05-20
Contact: HU Yong
LU Yaguang, SHEN Xukun, HU Yong. 3D scene-graph generation via vision-language model distillation and large language model parsing[J]. Journal of Graphics, 2026, 47(2): 360-367.
URL: http://www.txxb.com.cn/EN/10.11996/JG.j.2095-302X.2026020360
| Method | Object R@5 | Object R@10 | Predicate R@3 | Predicate R@5 | Relationship R@50 | Relationship R@100 |
|---|---|---|---|---|---|---|
| 3DSSG | 0.68 | 0.78 | 0.89 | 0.93 | 0.40 | 0.66 |
| SGFN | 0.70 | 0.80 | 0.97 | 0.99 | 0.85 | 0.87 |
| SGRec3D | 0.80 | 0.87 | 0.97 | 0.99 | 0.89 | 0.91 |
| VL-SAT | 0.78 | 0.86 | 0.98 | 0.99 | 0.90 | 0.93 |
| Open3DSG | 0.57 | 0.68 | 0.63 | 0.70 | 0.64 | 0.66 |
| Ours | 0.61 | 0.73 | 0.61 | 0.65 | 0.63 | 0.67 |

Table 1 Performance comparison results on the 3DSSG dataset
| Method | Category | Head | Body | Tail | All |
|---|---|---|---|---|---|
| 3DSSG | Object R@5 | 0.88 | 0.45 | 0.06 | 0.30 |
| SGRec3D | Object R@5 | 0.92 | 0.78 | 0.24 | 0.45 |
| VL-SAT | Object R@5 | 0.92 | 0.73 | 0.31 | 0.46 |
| Open3DSG | Object R@5 | 0.60 | 0.50 | 0.42 | 0.45 |
| Ours | Object R@5 | 0.71 | 0.52 | 0.40 | 0.47 |
| 3DSSG | Predicate R@3 | 0.94 | 0.83 | 0.41 | 0.57 |
| SGRec3D | Predicate R@3 | 0.97 | 0.96 | 0.65 | 0.69 |
| VL-SAT | Predicate R@3 | 0.99 | 0.94 | 0.58 | 0.75 |
| Open3DSG | Predicate R@3 | 0.38 | 0.29 | 0.57 | 0.37 |
| Ours | Predicate R@3 | 0.35 | 0.25 | 0.51 | 0.33 |

Table 2 Category balance evaluation results on the 3DSSG dataset
| Method | Object R@5 | Object mR@5 | Predicate R@3 | Predicate mR@3 |
|---|---|---|---|---|
| 2D | 0.72 | 0.63 | 0.63 | 0.25 |
| 3D | 0.43 | 0.22 | 0.58 | 0.31 |
| 2D-3D | 0.75 | 0.58 | 0.61 | 0.35 |

Table 3 Feature ablation experiment results on the 3DSSG dataset
| Module/Method | Time/ms (Scene) | FLOPs (Scene) | Params (Scene) | GPU Memory (Scene) |
|---|---|---|---|---|
| Image rendering | 90 012.61 | 0 | | |
| Point cloud preprocessing | 1.28 | 0 | | |
| CLIP (ViT-B/32) text encoder | 138.72 | ~2.00 G | ~43.00 M | 606.00 MB |
| CLIP (ViT-B/32) image encoder | 7.37 | ~4.50 G | ~86.00 M | |
| BLIP-2 (Opt-2.7b) image encoder | 11.08 | ~5.50 G | ~87.00 M | 14.24 GB |
| BLIP-2 (Opt-2.7b) Q-former | 106.12 | ~0.10 G | ~188.00 M | |
| BLIP-2 (Opt-2.7b) LLM | 2 241.45 | ~126.00 G | ~2.70 B | |
| Object-PointNet++ | 13.94 | 8.79 G | 2.20 M | 22.00 MB |
| Predicate-PointNet++ | 13.88 | 26.37 G | 2.20 M | 26.00 MB |
| GNN (+ feature-dimension conversion transformer) | 9.08 (+ 17.35) | 75.71 M (+ 1.49 G) | 7.50 M (+ 746.20 M) | 24.00 MB (+ 2.78 GB) |
| Post-processing LLM (Flan-T5-XL) | 1 416.10 | ~471.00 G | ~3.70 B | 10.62 GB |
| VL-model-based method | 93 933.45 | ~611.1 G | ~6.80 B | 25.47 GB |
| Ours | 3 957.92 | ~635.8 G | ~7.39 B | 28.32 GB |

Table 4 Performance evaluation results on the S3DIS dataset
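
The two summary rows at the bottom of Table 4 appear to be sums of the per-module times above them. The sketch below checks that arithmetic in Python. The latencies are copied from the table; the assignment of modules to each pipeline (`VL_BASED`, `OURS`) is an assumption inferred from which rows add up to the reported totals, not something the table states, and all identifier names are hypothetical.

```python
# Per-module latency in ms per scene, copied from Table 4.
TIME_MS = {
    "image_rendering": 90012.61,
    "point_cloud_preprocessing": 1.28,
    "clip_text_encoder": 138.72,
    "clip_image_encoder": 7.37,
    "blip2_image_encoder": 11.08,
    "blip2_qformer": 106.12,
    "blip2_llm": 2241.45,
    "object_pointnet2": 13.94,
    "predicate_pointnet2": 13.88,
    "gnn": 9.08,
    "feature_dim_transformer": 17.35,
    "postprocess_llm": 1416.10,
}

# ASSUMPTION: module groupings inferred from the reported totals,
# not stated explicitly in the table.
VL_BASED = [
    "image_rendering",
    "clip_text_encoder",
    "clip_image_encoder",
    "blip2_image_encoder",
    "blip2_qformer",
    "blip2_llm",
    "postprocess_llm",
]
OURS = [
    "point_cloud_preprocessing",
    "clip_text_encoder",
    "blip2_qformer",
    "blip2_llm",
    "object_pointnet2",
    "predicate_pointnet2",
    "gnn",
    "feature_dim_transformer",
    "postprocess_llm",
]


def total_ms(modules):
    """Sum the per-scene latency of the listed modules, rounded to 0.01 ms."""
    return round(sum(TIME_MS[name] for name in modules), 2)


vl_total = total_ms(VL_BASED)
ours_total = total_ms(OURS)
speedup = vl_total / ours_total

print(f"VL-model-based: {vl_total:.2f} ms")
print(f"Ours:           {ours_total:.2f} ms")
print(f"Speed-up:       {speedup:.1f}x")
```

Under this grouping both sums reproduce the totals reported in the table, and image rendering alone accounts for most of the time of the VL-model-based pipeline.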
| [1] | CHANG X J, REN P Z, XU P F, et al. A comprehensive survey of scene graphs: generation and application[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(1): 1-26. |
| [2] | JOHNSON J, KRISHNA R, STARK M, et al. Image retrieval using scene graphs[C]// 2015 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2015: 3668-3678. |
| [3] | WU S C, WALD J, TATENO K, et al. SceneGraphFusion: incremental 3D scene graph prediction from RGB-D sequences[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 7511-7521. |
| [4] | WU S C, TATENO K, NAVAB N, et al. Incremental 3D semantic scene graph prediction from RGB sequences[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 5064-5074. |
| [5] | LU Y G, HU Y, FENG H Y, et al. Generating reconstructable collaborative virtual environments via graph matching for mixed reality remote collaboration[J]. The Visual Computer, 2025, 41(8): 5935-5947. |
| [6] | DAHNERT M, HOU J, NIEßNER M, et al. Panoptic 3D scene reconstruction from a single RGB image[C]// The 35th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2021: 633. |
| [7] | WALD J, DHAMO H, NAVAB N, et al. Learning 3D semantic scene graphs from 3D indoor reconstructions[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 3960-3969. |
| [8] | WALD J, NAVAB N, TOMBARI F. Learning 3D semantic scene graphs with instance embeddings[J]. International Journal of Computer Vision, 2022, 130(3): 630-651. |
| [9] | KOCH S, HERMOSILLA P, VASKEVICIUS N, et al. SGRec3D: self-supervised 3D scene graph learning via object-level scene reconstruction[C]// 2024 IEEE/CVF Winter Conference on Applications of Computer Vision. New York: IEEE Press, 2024: 3392-3402. |
| [10] | ARMENI I, HE Z Y, ZAMIR A, et al. 3D scene graph: a structure for unified semantics, 3D space, and camera[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 5663-5672. |
| [11] | HUGHES N, CHANG Y, CARLONE L. Hydra: a real-time spatial perception system for 3D scene graph construction and optimization[EB/OL]. [2025-08-23]. https://dblp.org/db/conf/rss/rss2022.html#HughesCC22. |
| [12] | KOCH S, HERMOSILLA P, VASKEVICIUS N, et al. Lang3DSG: language-based contrastive pre-training for 3D scene graph prediction[C]// 2024 International Conference on 3D Vision. New York: IEEE Press, 2024: 1037-1047. |
| [13] | RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[EB/OL]. [2025-08-23]. http://proceedings.mlr.press/v139/radford21a.html. |
| [14] | LV C S, QI M S, LI X, et al. SGFormer: semantic graph transformer for point cloud-based 3D scene graph generation[C]// The 38th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2024: 4035-4043. |
| [15] | CHANG H N, KOWNDINYA B, LU S Y, et al. Context-aware entity grounding with open vocabulary 3D scene graphs[C]// The 7th Conference on Robot Learning. New York: PMLR Press, 2023: 1950-1974. |
| [16] | REIMERS N, GUREVYCH I. Sentence-BERT: sentence embeddings using Siamese BERT-networks[EB/OL]. [2025-08-23]. https://aclanthology.org/D19-1410/. |
| [17] | KOCH S, VASKEVICIUS N, COLOSI M, et al. Open3DSG: open-vocabulary 3D scene graphs from point clouds with queryable objects and open-set relationships[C]// 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2024: 14183-14193. |
| [18] | GHIASI G, GU X Y, CUI Y, et al. Scaling open-vocabulary image segmentation with image-level labels[C]// The 17th European Conference on Computer Vision. Cham: Springer, 2022: 540-557. |
| [19] | LI J N, LI D X, XIONG C M, et al. BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation[EB/OL]. [2025-08-23]. https://proceedings.mlr.press/v162/li22n.html. |
| [20] | LI J N, LI D X, SAVARESE S, et al. BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models[EB/OL]. [2025-08-23]. https://proceedings.mlr.press/v202/li23q.html. |
| [21] | DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[EB/OL]. [2025-08-23]. https://aclanthology.org/N19-1423/. |
| [22] | CHEN L G X, WANG X J, LU J L, et al. CLIP-driven open-vocabulary 3D scene graph generation via cross-modality contrastive learning[C]// 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2024: 27863-27873. |
| [23] | WANG Z Q, CHENG B W, ZHAO L C, et al. VL-SAT: visual-linguistic semantics assisted training for 3D semantic scene graph prediction in point cloud[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 21560-21569. |
| [24] | QI C R, YI L, SU H, et al. PointNet++: deep hierarchical feature learning on point sets in a metric space[EB/OL]. [2025-08-23]. https://proceedings.neurips.cc/paper_files/paper/2017/file/d8bf84be3800d12f74d8b05e9b89836f-Paper.pdf. |
| [25] | ARMENI I, SENER O, ZAMIR A R, et al. 3D semantic parsing of large-scale indoor spaces[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 1534-1543. |
| [26] | ZHAO L, TAO W B. JSNet: joint instance and semantic segmentation of 3D point clouds[C]// The 34th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2020: 12951-12958. |