
Journal of Graphics, 2026, Vol. 47, Issue (2): 360-367. DOI: 10.11996/JG.j.2095-302X.2026020360

• Computer Graphics and Virtual Reality •

3D scene-graph generation via vision-language model distillation and large language model parsing

LU Yaguang1, SHEN Xukun1,2, HU Yong1,2

  1. State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing 100191, China
  2. School of New Media Art and Design, Beihang University, Beijing 100191, China
  • Received: 2025-10-23; Accepted: 2025-12-15; Online: 2026-04-30; Published: 2026-05-20
  • Contact: HU Yong

Abstract:

Point clouds alone are limited in expressing semantic relationships, so 3D scene-graph generation typically requires rendering corresponding images and fusing multimodal features, which introduces additional computational overhead during inference. To address this limitation, a 3D scene-graph generation method based on vision-language model (VL model) distillation and a large language model (LLM) was proposed. The method took 3D point clouds as input, rendered the corresponding images, and aligned the two feature spaces to distill knowledge from the VL model into a graph neural network (GNN). This established a mapping between point-cloud instances and their textual descriptions and yielded a point-cloud-language model (PL model). The PL model leveraged an LLM to improve its understanding of complex semantic relationships and aggregated node features through the GNN. It captured both the semantic and the spatial relationships of point clouds without relying on additional image information, enabling 3D scene-graph generation for indoor environments. Experimental results demonstrated that the proposed method achieved robust understanding of 3D indoor environments in open-vocabulary tasks, and that it significantly reduced computational overhead and inference time compared with end-to-end 3D scene-graph generation approaches that relied on vision-language models.
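
The feature-space alignment described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the distillation objective, so the cosine-distance loss below is an assumption, as are all function names, the toy 3-dimensional embeddings, and the labels "chair" and "table". The sketch only shows the general idea: point-cloud features are pulled toward the VL model's image features, after which a point-cloud feature can be compared directly with text embeddings, with no image required.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def distillation_loss(student_feats, teacher_feats):
    """Mean cosine distance (1 - cosine similarity) over paired instances.

    student_feats: per-instance features from the point-cloud branch (hypothetical).
    teacher_feats: features of the rendered images from a frozen VL model (hypothetical).
    Cosine distance is an assumed objective, not the one stated by the paper.
    """
    distances = [1.0 - cosine(s, t) for s, t in zip(student_feats, teacher_feats)]
    return sum(distances) / len(distances)

def open_vocab_label(point_feat, text_feats):
    """Return the label whose text embedding is most similar to the point feature."""
    return max(text_feats, key=lambda name: cosine(point_feat, text_feats[name]))

# Toy embeddings, made up for illustration only.
teacher = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
aligned = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]     # same directions as the teacher
misaligned = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]  # orthogonal to the teacher

loss_aligned = distillation_loss(aligned, teacher)
loss_misaligned = distillation_loss(misaligned, teacher)

# After alignment, a point-cloud feature is matched against text embeddings directly.
text = {"chair": [1.0, 0.1, 0.0], "table": [0.0, 1.0, 0.2]}
label = open_vocab_label(aligned[0], text)
```

In this toy setting the aligned features give a lower loss than the misaligned ones, and the first aligned feature is matched to the "chair" text embedding. A real system would compute the student features with a GNN over point-cloud instances and minimize such a loss by gradient descent.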

Key words: 3D scene understanding, 3D scene-graph generation, vision-language model, large language model, knowledge distillation
