Journal of Graphics ›› 2026, Vol. 47 ›› Issue (2): 360-367. DOI: 10.11996/JG.j.2095-302X.2026020360

• Computer Graphics and Virtual Reality •

3D scene-graph generation via vision-language model distillation and large language model parsing

LU Yaguang1, SHEN Xukun1,2, HU Yong1,2

  1. State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing 100191, China
  2. School of New Media Art and Design, Beihang University, Beijing 100191, China
  • Received: 2025-10-23 Accepted: 2025-12-15 Published: 2026-04-30 Online: 2026-05-20
  • Contact: HU Yong, E-mail: huyong@buaa.edu.cn

Abstract:

Point clouds alone express semantic relationships poorly in 3D scene-graph generation, so existing methods typically render corresponding images and fuse multimodal features, which introduces additional computational overhead during inference. To address this problem, a 3D scene-graph generation method based on vision-language model (VL model) distillation and a large language model (LLM) was proposed. The method took 3D point clouds as input, rendered corresponding images, and aligned the two in feature space to distill knowledge from the VL model into a graph neural network (GNN), thereby establishing a mapping between point-cloud instances and their textual descriptions and constructing a point-cloud-language model (PL model). The PL model leveraged the LLM to enhance the understanding of complex semantic relationships and aggregated node features through the GNN, capturing both the semantic and spatial relationships of point clouds without relying on additional image information and enabling 3D scene-graph generation for indoor environments. Experimental results demonstrated that the proposed method not only achieved robust understanding of 3D indoor environments in open-vocabulary tasks, but also significantly reduced computational overhead and inference time compared with end-to-end 3D scene-graph generation approaches that relied on VL models, indicating good performance and practical value.

Key words: 3D scene understanding, 3D scene-graph generation, vision-language model, large language model, knowledge distillation
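
The feature-space alignment mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the loss or the matching rule the authors use, so everything below is an assumption chosen for illustration: a cosine-based alignment loss between student (GNN) node features and teacher (VL model) image features, and nearest-text-embedding lookup for open-vocabulary labelling. The function names `cosine_distillation_loss` and `open_vocab_predict` are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row of x to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def cosine_distillation_loss(student_feats, teacher_feats):
    """Mean (1 - cosine similarity) over matched student/teacher rows.

    student_feats: (N, D) node features from the point-cloud branch (assumed).
    teacher_feats: (N, D) features of the rendered images from the VL model
                   (assumed to be precomputed and row-aligned with the nodes).
    The loss is 0 when every pair points in the same direction.
    """
    s = l2_normalize(student_feats)
    t = l2_normalize(teacher_feats)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))


def open_vocab_predict(node_feats, text_feats, labels):
    """Assign each node the label whose text embedding is most similar.

    Once the student lives in the teacher's embedding space, no image is
    needed at inference: only node features and text embeddings are compared.
    """
    sims = l2_normalize(node_feats) @ l2_normalize(text_feats).T  # (N, K)
    return [labels[i] for i in np.argmax(sims, axis=1)]
```

In a training loop, a loss of this kind would be minimized with respect to the student network's parameters while the teacher stays frozen; the rendered images are then only needed during training, which is consistent with the abstract's claim that inference does not rely on additional image information.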

CLC number: