
Journal of Graphics ›› 2025, Vol. 46 ›› Issue (2): 312-321. DOI: 10.11996/JG.j.2095-302X.2025020312

• Computer Graphics and Virtual Reality •

• Corresponding author: HU Ruizhen (1988-), female, professor, Ph.D. Her main research interests cover computer graphics and embodied intelligence. E-mail: ruizhen.hu@szu.edu.cn

3D Gaussian splatting semantic segmentation and editing based on 2D feature distillation

LIU Gaoyi1, HU Ruizhen2, LIU Ligang1

  1. School of Mathematical Sciences, University of Science and Technology of China, Hefei, Anhui 230026, China
    2. College of Computer Science & Software Engineering, Shenzhen University, Shenzhen, Guangdong 518060, China
  • Received: 2024-08-22; Accepted: 2024-12-22; Published: 2025-04-30; Online: 2025-04-24
  • First author: LIU Gaoyi (1998-), master's student. His main research interests cover computer graphics. E-mail: liugaoyi@mail.ustc.edu.cn
  • Supported by:
    National Natural Science Foundation of China (62025207)


Abstract:

Semantic understanding of 3D scenes is one of the fundamental ways humans perceive the world. Semantic tasks such as open-vocabulary segmentation and semantic editing are important research areas in computer vision and computer graphics. However, the absence of large, diverse 3D open-vocabulary segmentation datasets makes it challenging to directly train a robust, generalizable model. To address this, 3D Gaussian splatting based on 2D feature distillation was proposed, a method that distills semantic embeddings from the large SAM and CLIP models into 3D Gaussians. For each scene, pixel-wise semantic features were obtained via SAM and CLIP, and training was conducted with differentiable 3D Gaussian rendering to produce a scene-specific semantic feature field. For the semantic segmentation task, to obtain an accurate segmentation boundary for each object in the scene, a multi-step segmentation-mask selection strategy was designed that yields accurate open-vocabulary semantic segmentation of novel-view images without requiring tedious hierarchical feature extraction and training. Through the explicit 3D Gaussian scene representation, the correspondence between text and 3D objects was effectively established, enabling semantic editing. Experiments demonstrated that the method achieved comparable or superior qualitative and quantitative results on semantic segmentation tasks relative to the compared methods, while enabling open-vocabulary semantic editing through the 3D Gaussian semantic feature field.
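The pipeline summarized above can be sketched in miniature. This is not the authors' code; the shapes, names (`composite_features`, `query_mask`), and threshold are illustrative assumptions. It shows (1) how a differentiable feature-field renderer alpha-composites per-Gaussian semantic features into a per-pixel feature, which during training would be supervised against the SAM/CLIP-derived pixel features, and (2) how open-vocabulary querying compares rendered pixel features with a text embedding by cosine similarity to form a segmentation mask.

```python
# Minimal sketch (not the authors' implementation); shapes and names are
# hypothetical, chosen only to illustrate the two ideas from the abstract.
import numpy as np

def composite_features(feats: np.ndarray, alphas: np.ndarray) -> np.ndarray:
    """Front-to-back alpha compositing of per-Gaussian features for one pixel.

    feats:  (N, D) semantic feature of each depth-sorted Gaussian
    alphas: (N,)   opacity contribution of each Gaussian at this pixel
    """
    pixel_feat = np.zeros(feats.shape[1])
    transmittance = 1.0  # fraction of the ray not yet absorbed
    for f, a in zip(feats, alphas):
        pixel_feat += transmittance * a * f
        transmittance *= 1.0 - a
    return pixel_feat

def query_mask(pixel_feats: np.ndarray, text_emb: np.ndarray,
               threshold: float = 0.5) -> np.ndarray:
    """Binary open-vocabulary mask: pixels whose rendered feature matches the text."""
    p = pixel_feats / (np.linalg.norm(pixel_feats, axis=1, keepdims=True) + 1e-8)
    t = text_emb / (np.linalg.norm(text_emb) + 1e-8)
    return (p @ t) > threshold

# Two Gaussians along one ray: the front one dominates the composited feature.
feats = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
alphas = np.array([0.6, 0.5])
pixel = composite_features(feats, alphas)   # -> [0.6, 0.2, 0.0]

# A query embedding aligned with the first feature selects only that pixel.
mask = query_mask(np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]),
                  np.array([1.0, 0.0, 0.0]))
```

In actual training, a distillation loss (e.g. an L1 or cosine distance between the composited pixel feature and the 2D teacher feature) would be backpropagated through the compositing to optimize the per-Gaussian features.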

Key words: 3D scene, 3D Gaussian splatting, semantic segmentation, feature field, open vocabulary semantic editing
