Journal of Graphics ›› 2025, Vol. 46 ›› Issue (2): 312-321. DOI: 10.11996/JG.j.2095-302X.2025020312
Corresponding author: HU Ruizhen (1988-), professor, Ph.D. Her main research interests cover computer graphics and embodied intelligence. E-mail: ruizhen.hu@szu.edu.cn
3D Gaussian splatting semantic segmentation and editing based on 2D feature distillation

LIU Gaoyi¹, HU Ruizhen², LIU Ligang¹
Received:
2024-08-22
Accepted:
2024-12-22
Published:
2025-04-30
Online:
2025-04-24
First author:
LIU Gaoyi (1998-), master's student. His main research interests cover computer graphics. E-mail: liugaoyi@mail.ustc.edu.cn
Abstract:
Semantic understanding of 3D scenes is one of the fundamental ways in which humans perceive the world. Semantic tasks such as open-vocabulary segmentation and semantic editing are important research areas in computer vision and computer graphics. Because large, diverse 3D open-vocabulary segmentation datasets are lacking, directly training a robust, generalizable model is far from trivial. To this end, this paper proposes 3D Gaussian splatting based on 2D feature distillation, a method that distills the semantic embeddings of the SAM and CLIP foundation models into 3D Gaussians. For each scene, per-pixel semantic features are obtained from SAM and CLIP, and a scene-specific semantic feature field is then trained through the differentiable rendering of 3D Gaussians. For semantic segmentation, a multi-step segmentation-mask selection strategy is designed to obtain precise segmentation boundaries for every object in the scene, yielding accurate open-vocabulary semantic segmentation of novel-view images without tedious hierarchical feature extraction and training. Exploiting the explicit 3D Gaussian scene representation, the method effectively establishes correspondences between text and 3D objects, enabling semantic editing. Experiments show that the method achieves comparable or better qualitative and quantitative results than the compared methods on semantic segmentation, while the 3D Gaussian semantic feature field additionally enables open-vocabulary semantic editing.
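To make the pipeline concrete, the following is a minimal sketch of the distillation step described above: precomputed per-pixel SAM/CLIP teacher features supervise a feature field rendered from the 3D Gaussians. The names `render_features` and `gaussians.semantic_features` are hypothetical; the actual rasterizer follows the differentiable splatting of 3DGS [28].

```python
import torch
import torch.nn.functional as F

def distillation_step(gaussians, camera, teacher_features, render_features, optimizer):
    """One training step: render a per-pixel feature map from the 3D
    Gaussians and regress it onto the precomputed SAM/CLIP teacher
    features for this view.

    gaussians.semantic_features : (N, D) learnable feature per Gaussian
    teacher_features            : (H, W, D) per-pixel teacher features
    render_features             : hypothetical differentiable rasterizer
                                  that alpha-blends per-Gaussian features,
                                  analogous to how 3DGS blends colors
    """
    rendered = render_features(gaussians, camera)  # (H, W, D)
    # Cosine loss: for CLIP-style embeddings only the feature direction
    # matters; cosine_similarity normalizes both sides internally.
    loss = 1.0 - F.cosine_similarity(rendered, teacher_features, dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```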
LIU Gaoyi, HU Ruizhen, LIU Ligang. 3D Gaussian splatting semantic segmentation and editing based on 2D feature distillation[J]. Journal of Graphics, 2025, 46(2): 312-321.
Table 1 mIoU comparison results on the 3D-OVS dataset / %

| Method | Bed | Bench | Room | Sofa | Lawn | Overall |
| --- | --- | --- | --- | --- | --- | --- |
| LSeg[26] | 56.0 | 6.0 | 19.2 | 4.5 | 17.5 | 20.6 |
| ODISE[52] | 52.6 | 24.1 | 52.5 | 48.3 | 39.8 | 43.5 |
| OV-Seg[53] | 78.9 | 89.9 | 71.4 | 66.1 | 81.2 | 77.5 |
| 3D-OVS[15] | 89.5 | 89.3 | 92.8 | 74.0 | 88.2 | 86.8 |
| LangSplat[27] | 92.5 | 94.2 | 94.1 | 90.0 | 96.1 | 93.4 |
| Ours | 95.0 | 93.8 | 93.5 | 95.5 | 96.3 | 94.8 |
Table 2 Accuracy comparison results on the 3D-OVS dataset / %

| Method | Bed | Bench | Room | Sofa | Lawn | Overall |
| --- | --- | --- | --- | --- | --- | --- |
| LSeg[26] | 87.6 | 42.7 | 46.1 | 16.5 | 77.5 | 54.1 |
| ODISE[52] | 86.5 | 39.0 | 59.7 | 35.4 | 82.5 | 60.6 |
| OV-Seg[53] | 40.4 | 89.2 | 49.1 | 69.6 | 92.1 | 68.1 |
| 3D-OVS[15] | 96.7 | 96.3 | 98.9 | 91.6 | 97.3 | 96.2 |
| LangSplat[27] | 99.2 | 98.6 | 99.3 | 97.9 | 99.4 | 98.9 |
| Ours | 99.3 | 97.8 | 99.1 | 99.1 | 99.5 | 99.0 |
Fig. 3 Visualization of segmentation results on the 3D-OVS dataset ((a) Bed; (b) Blue sofa; (c) Sofa; (d) Lawn)
Table 3 Comparison with the original 3DGS on rendered-image quality metrics

| Metrics | PSNR↑ | SSIM↑ | LPIPS↓ |
| --- | --- | --- | --- |
| Original 3DGS | 35.81 | 0.971 | 0.065 |
| Ours | 38.33 | 0.978 | 0.061 |
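The numbers in Table 3 use the standard PSNR, SSIM, and LPIPS definitions. A minimal evaluation sketch with torchmetrics follows; the exact import paths depend on the torchmetrics version, and the random tensors stand in for real rendered/ground-truth views.

```python
import torch
from torchmetrics.image import PeakSignalNoiseRatio, StructuralSimilarityIndexMeasure
from torchmetrics.image.lpips import LearnedPerceptualImagePatchSimilarity

# Rendered and ground-truth views as (B, 3, H, W) tensors in [0, 1].
pred = torch.rand(1, 3, 256, 256)
target = torch.rand(1, 3, 256, 256)

psnr = PeakSignalNoiseRatio(data_range=1.0)
ssim = StructuralSimilarityIndexMeasure(data_range=1.0)
# normalize=True tells the metric the inputs are in [0, 1] rather than
# [-1, 1]; the VGG backbone weights are downloaded on first use.
lpips = LearnedPerceptualImagePatchSimilarity(net_type="vgg", normalize=True)

print(f"PSNR:  {psnr(pred, target):.2f}")   # higher is better
print(f"SSIM:  {ssim(pred, target):.3f}")   # higher is better
print(f"LPIPS: {lpips(pred, target):.3f}")  # lower is better
```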
Fig. 4 Open-vocabulary editing through interaction with the 3D semantic Gaussians ((a) Extract 'Winnie-the-Pooh'; (b) Extract 'rabbit'; (c) Change the color of 'a red Switch'; (d) Change the color of 'Pikachu'; (e) Extract 'gerbera'; (f) Change the color of 'shrilling chicken'; (g) Change the color of 'gerbera'; (h) Change the color of 'Winnie-the-Pooh')
Fig. 5 3D Gaussian spheres for open-vocabulary instance objects ((a) Dove body wash; (b) A red gerbera; (c) A bottle of perfume; (d) A black Nike shoe; (e) Sunglasses; (f) A wooden ukulele)
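Because the scene representation is explicit, the edits shown in Figures 4 and 5 reduce to selecting the Gaussians whose distilled feature matches a CLIP text embedding, then extracting or recoloring them. A hedged sketch follows; `gaussians.semantic_features` and the similarity threshold are assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP package

@torch.no_grad()
def select_gaussians(gaussians, prompt, threshold=0.7, device="cuda"):
    """Boolean mask over Gaussians whose distilled semantic feature matches
    the text prompt. `gaussians.semantic_features` is the (N, D) field
    learned by distillation; D must match the CLIP text embedding width,
    and the threshold is a per-scene tuning knob (an assumption)."""
    model, _ = clip.load("ViT-B/32", device=device)
    tokens = clip.tokenize([prompt]).to(device)
    text = F.normalize(model.encode_text(tokens).float(), dim=-1)  # (1, D)
    feats = F.normalize(gaussians.semantic_features, dim=-1)       # (N, D)
    sim = (feats @ text.T).squeeze(1)                              # (N,)
    # Rescale to [0, 1] so one threshold behaves similarly across scenes.
    sim = (sim - sim.min()) / (sim.max() - sim.min() + 1e-8)
    return sim > threshold

# Extraction ((a), (b), (e) in Fig. 4): keep only the matched Gaussians.
#   mask = select_gaussians(gaussians, "Winnie-the-Pooh")
#   extracted = gaussians[mask]
# Recoloring ((c), (d), (f)-(h)): overwrite the color of the matched
# Gaussians, e.g. the DC term of their spherical-harmonic coefficients.
```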
Table 4 Ablation study: comparison of different semantic feature extraction modules / %

| Feature extraction module | mIoU | Accuracy |
| --- | --- | --- |
| patch_based | 89.5 | 91.2 |
| SAM | 94.8 | 99.0 |
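The "SAM" row in Table 4 corresponds to pooling CLIP features over SAM regions instead of over sliding-window patches. Below is a sketch of such a SAM-based extractor; it assumes the segment-anything `SamAutomaticMaskGenerator` interface, and the crop-then-encode details may differ from the paper's implementation.

```python
import torch
import torch.nn.functional as F
from PIL import Image

@torch.no_grad()
def pixelwise_features(image, mask_generator, clip_model, preprocess, device="cuda"):
    """Assign every pixel the CLIP embedding of the SAM region containing it.

    image          : (H, W, 3) uint8 numpy array
    mask_generator : segment-anything SamAutomaticMaskGenerator
    clip_model     : OpenAI CLIP model; preprocess is its paired transform
    Overlapping regions are resolved by write order here (an assumption).
    """
    H, W = image.shape[:2]
    D = clip_model.visual.output_dim
    features = torch.zeros(H, W, D, device=device)
    for m in mask_generator.generate(image):               # one dict per region
        seg = torch.from_numpy(m["segmentation"]).to(device)  # (H, W) bool
        x, y, w, h = m["bbox"]                             # XYWH bounding box
        crop = Image.fromarray(image[y:y + h, x:x + w])    # crop the region,
        emb = clip_model.encode_image(                     # encode with CLIP,
            preprocess(crop).unsqueeze(0).to(device))
        features[seg] = F.normalize(emb.float(), dim=-1).squeeze(0)  # write back
    return features
```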
Table 5 Ablation study: effect of the segmentation strategy on novel-view semantic segmentation accuracy / %

| Segmentation strategy | mIoU | Accuracy |
| --- | --- | --- |
| Similarity map Φ only | 89.1 | 90.8 |
| +ACF | 90.9 | 91.5 |
| +TF | 92.6 | 95.3 |
| Ours | 94.8 | 99.0 |
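The rows of Table 5 build the mask-selection strategy up from the base similarity map Φ, the per-pixel cosine similarity between the rendered feature map and each class's CLIP text embedding. A minimal sketch of Φ and the "similarity map only" baseline; the ACF and TF refinement steps that close the remaining gap are omitted.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def similarity_map(rendered_features, text_embedding):
    """Phi: per-pixel cosine similarity between the rendered semantic
    feature map (H, W, D) and one CLIP text embedding (D,)."""
    feats = F.normalize(rendered_features, dim=-1)
    text = F.normalize(text_embedding, dim=0)
    return feats @ text                                  # (H, W)

@torch.no_grad()
def segment(rendered_features, text_embeddings):
    """Per-pixel argmax over all candidate classes' similarity maps;
    this is the 'similarity map Phi only' row of Table 5."""
    phis = torch.stack([similarity_map(rendered_features, t)
                        for t in text_embeddings])       # (K, H, W)
    return phis.argmax(dim=0)                            # (H, W) label map
```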
[1] | SHEN W, YANG G, YU A, et al. Distilled feature fields enable few-shot language-guided manipulation[EB/OL]. [2024-06-19]. https://proceedings.mlr.press/v229/shen23a.html. |
[2] | WANG Y X, ZHANG M T, LI Z R, et al. D3Fields: dynamic 3D descriptor fields for zero-shot generalizable rearrangement[EB/OL]. [2024-06-19]. https://arxiv.org/abs/2309.16118. |
[3] | RASHID A, SHARMA S, KIM C M, et al. Language embedded radiance fields for zero-shot task-oriented grasping[EB/OL]. [2024-06-19]. https://lerftogo.github.io/desktop.html. |
[4] | KOBAYASHI S, MATSUMOTO E, SITZMANN V. Decomposing NeRF for editing via feature field distillation[C]// The 36th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2022: 1694. |
[5] | KERR J, KIM C M, GOLDBERG K, et al. LERF: language embedded radiance fields[C]// 2023 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2023: 19672-19682. |
[6] | JATAVALLABHULA K M, KUWAJERWALA A, GU Q, et al. ConceptFusion: open-set multimodal 3D mapping[EB/OL]. [2024-06-19]. https://arxiv.org/abs/2302.07241. |
[7] | BEHLEY J, GARBADE M, MILIOTO A, et al. Towards 3D LiDAR-based semantic scene understanding of 3D point cloud sequences: the SemanticKITTI Dataset[J]. The International Journal of Robotics Research, 2021, 40(8/9): 959-967. |
[8] | CAESAR H, BANKITI V, LANG A H, et al. nuScenes: a multimodal dataset for autonomous driving[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 11618-11628. |
[9] | NIU C G, LIU Y J, LI Z M, et al. 3D object recognition and model segmentation based on point cloud data[J]. Journal of Graphics, 2019, 40(2): 274-281 (in Chinese). |
[10] | HU Q Y, YANG B, FANG G C, et al. SQN: weakly-supervised semantic segmentation of large-scale 3D point clouds[C]// The 17th European Conference on Computer Vision. Cham: Springer, 2022: 600-619. |
[11] | HA H, SONG S R. Semantic abstraction: open-world 3D scene understanding from 2D vision-language models[EB/OL]. [2024-06-19]. https://proceedings.mlr.press/v205/ha23a.html. |
[12] | PENG S Y, GENOVA K, JIANG C Y, et al. OpenScene: 3D scene understanding with open vocabularies[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 815-824. |
[13] | MILDENHALL B, SRINIVASAN P P, TANCIK M, et al. NeRF: representing scenes as neural radiance fields for view synthesis[J]. Communications of the ACM, 2021, 65(1): 99-106. |
[14] | FAN T, YANG H, YIN W, et al. Multi-scale view synthesis based on neural radiance field[J]. Journal of Graphics, 2023, 44(6): 1140-1148 (in Chinese). |
[15] | LIU K H, ZHAN F N, ZHANG J H, et al. Weakly supervised 3D open-vocabulary segmentation[C]// The 37th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2023: 2325. |
[16] | RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[EB/OL]. [2024-06-19]. http://proceedings.mlr.press/v139/radford21a. |
[17] | CARON M, TOUVRON H, MISRA I, et al. Emerging properties in self-supervised vision transformers[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2021: 9630-9640. |
[18] | LIANG S N, LIU Y C, WU S Z, et al. ONeRF: unsupervised 3D object segmentation from multiple views[EB/OL]. [2024-06-19]. https://arxiv.org/abs/2211.12038. |
[19] | STELZNER K, KERSTING K, KOSIOREK A R. Decomposing 3D scenes into objects via unsupervised volume segmentation[EB/OL]. [2024-06-19]. https://arxiv.org/abs/2104.01148. |
[20] | ZARZAR J, ROJAS S, GIANCOLA S, et al. SegNeRF: 3D part segmentation with neural radiance fields[EB/OL]. [2024-06-19]. https://arxiv.org/abs/2211.11215. |
[21] | TSCHERNEZKI V, LARLUS D, VEDALDI A. NeuralDiff: segmenting 3D objects that move in egocentric videos[C]// 2021 International Conference on 3D Vision. New York: IEEE Press, 2021: 910-919. |
[22] | ZHI S F, LAIDLOW T, LEUTENEGGER S, et al. In-place scene labelling and understanding with implicit scene representation[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2021: 15818-15827. |
[23] | SIDDIQUI Y, PORZI L, BULÒ S R, et al. Panoptic lifting for 3D scene understanding with neural fields[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 9043-9052. |
[24] | GOEL R, SIRIKONDA D, SAINI S, et al. Interactive segmentation of radiance fields[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 4201-4211. |
[25] | TSCHERNEZKI V, LAINA I, LARLUS D, et al. Neural feature fusion fields: 3D distillation of self-supervised 2D image representations[C]// 2022 International Conference on 3D Vision. New York: IEEE Press, 2022: 443-453. |
[26] | LI B Y, WEINBERGER K Q, BELONGIE S, et al. Language-driven semantic segmentation[EB/OL]. [2024-06-19]. https://arxiv.org/abs/2201.03546. |
[27] | QIN M H, LI W H, ZHOU J W, et al. LangSplat: 3D language Gaussian splatting[C]// 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2024: 20051-20060. |
[28] | KERBL B, KOPANAS G, LEIMKUEHLER T, et al. 3D Gaussian splatting for real-time radiance field rendering[J]. ACM Transactions on Graphics, 2023, 42(4): 139. |
[29] | OQUAB M, DARCET T, MOUTAKANNI T, et al. DINOv2: learning robust visual features without supervision[EB/OL]. [2024-06-19]. https://arxiv.org/abs/2304.07193. |
[30] | JIA C, YANG Y F, XIA Y, et al. Scaling up visual and vision-language representation learning with noisy text supervision[EB/OL]. [2024-06-19]. https://proceedings.mlr.press/v139/jia21b.html. |
[31] | ROMBACH R, BLATTMANN A, LORENZ D, et al. High-resolution image synthesis with latent diffusion models[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 10674-10685. |
[32] | ZHOU C, LI Q, LI C, et al. A comprehensive survey on pretrained foundation models: a history from BERT to ChatGPT[EB/OL]. (2023-05-01) [2024-06-19]. https://link.springer.com/article/10.1007/s13042-024-02443-6. |
[33] | BOMMASANI R, HUDSON D A, ADELI E, et al. On the opportunities and risks of foundation models[EB/OL]. [2024-06-19]. https://arxiv.org/abs/2108.07258. |
[34] | KIRILLOV A, MINTUN E, RAVI N, et al. Segment anything[C]// 2023 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2023: 3992-4003. |
[35] | RAMESH A, DHARIWAL P, NICHOL A, et al. Hierarchical text-conditional image generation with CLIP latents[EB/OL]. [2024-06-19]. https://arxiv.org/abs/2204.06125. |
[36] | WANG Z Q, LU Y, LI Q, et al. CRIS: CLIP-driven referring image segmentation[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 11676-11685. |
[37] | SONG H Y, DONG L, ZHANG W N, et al. CLIP models are few-shot learners: empirical studies on VQA and visual entailment[EB/OL]. [2024-06-19]. https://arxiv.org/abs/2203.07190. |
[38] | GAO S H, LIN Z J, XIE X Y, et al. EditAnything: empowering unparalleled flexibility in image editing and generation[C]// The 31st ACM International Conference on Multimedia. New York: ACM, 2023: 9414-9416. |
[39] | YAO J F, WANG X G, YE L, et al. Matte anything: interactive natural image matting with segment anything model[J]. Image and Vision Computing, 2024, 147: 105067. |
[40] | CHENG Y M, LI L L, XU Y Y, et al. Segment and track anything[EB/OL]. [2024-06-19]. https://arxiv.org/abs/2305.06558. |
[41] | YANG J Y, GAO M Q, LI Z, et al. Track anything: segment anything meets videos[EB/OL]. [2024-06-19]. https://arxiv.org/abs/2304.11968. |
[42] | LUITEN J, KOPANAS G, LEIBE B, et al. Dynamic 3D Gaussians: tracking by persistent dynamic view synthesis[C]// 2024 International Conference on 3D Vision. New York: IEEE Press, 2024: 800-809. |
[43] | YANG Z Y, GAO X Y, ZHOU W, et al. Deformable 3D Gaussians for high-fidelity monocular dynamic scene reconstruction[C]// 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2024: 20331-20341. |
[44] | WU G J, YI T R, FANG J M, et al. 4D Gaussian splatting for real-time dynamic scene rendering[C]// 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2024: 20310-20320. |
[45] | ZHANG K, LUAN F J, WANG Q Q, et al. PhySG: inverse rendering with spherical Gaussians for physics-based material editing and relighting[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 5449-5458. |
[46] | TANG J X, REN J W, ZHOU H, et al. DreamGaussian: generative Gaussian splatting for efficient 3D content creation[EB/OL]. [2024-06-19]. https://arxiv.org/abs/2309.16653. |
[47] | CHEN Y W, CHEN R, LEI J B, et al. TANGO: text-driven photorealistic and robust 3D stylization via lighting decomposition[C]// The 36th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2022: 2242. |
[48] | SCHÖNBERGER J L, FRAHM J M. Structure-from-motion revisited[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 4104-4113. |
[49] | MAX N. Optical models for direct volume rendering[J]. IEEE Transactions on Visualization and Computer Graphics, 1995, 1(2): 99-108. |
[50] | DAI A, CHANG A X, SAVVA M, et al. ScanNet: richly-annotated 3D reconstructions of indoor scenes[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 2432-2443. |
[51] | STRAUB J, WHELAN T, MA L N, et al. The replica dataset: a digital replica of indoor spaces[EB/OL]. [2024-06-19]. https://arxiv.org/abs/1906.05797. |
[52] | XU J R, LIU S F, VAHDAT A, et al. Open-vocabulary panoptic segmentation with text-to-image diffusion models[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 2955-2966. |
[53] | LIANG F, WU B C, DAI X L, et al. Open-vocabulary semantic segmentation with mask-adapted CLIP[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 7061-7070. |