Journal of Graphics ›› 2025, Vol. 46 ›› Issue (2): 312-321. DOI: 10.11996/JG.j.2095-302X.2025020312
• Computer Graphics and Virtual Reality •
LIU Gaoyi1, HU Ruizhen2, LIU Ligang1
Received: 2024-08-22
Accepted: 2024-12-22
Online: 2025-04-30
Published: 2025-04-24
Contact: HU Ruizhen
About the first author: LIU Gaoyi (1998-), master student. His main research interest is computer graphics. E-mail: liugaoyi@mail.ustc.edu.cn
LIU Gaoyi, HU Ruizhen, LIU Ligang. 3D Gaussian splatting semantic segmentation and editing based on 2D feature distillation[J]. Journal of Graphics, 2025, 46(2): 312-321.
URL: http://www.txxb.com.cn/EN/10.11996/JG.j.2095-302X.2025020312
Table 1 mIoU comparison results on the 3D-OVS dataset/%

| Method | Bed | Bench | Room | Sofa | Lawn | Overall |
|---|---|---|---|---|---|---|
| LSeg[26] | 56.0 | 6.0 | 19.2 | 4.5 | 17.5 | 20.6 |
| ODISE[52] | 52.6 | 24.1 | 52.5 | 48.3 | 39.8 | 43.5 |
| OV-Seg[53] | 78.9 | 89.9 | 71.4 | 66.1 | 81.2 | 77.5 |
| 3D-OVS[15] | 89.5 | 89.3 | 92.8 | 74.0 | 88.2 | 86.8 |
| LangSplat[27] | 92.5 | 94.2 | 94.1 | 90.0 | 96.1 | 93.4 |
| Ours | 95.0 | 93.8 | 93.5 | 95.5 | 96.3 | 94.8 |
Table 2 Precision comparison results on the 3D-OVS dataset/%

| Method | Bed | Bench | Room | Sofa | Lawn | Overall |
|---|---|---|---|---|---|---|
| LSeg[26] | 87.6 | 42.7 | 46.1 | 16.5 | 77.5 | 54.1 |
| ODISE[52] | 86.5 | 39.0 | 59.7 | 35.4 | 82.5 | 60.6 |
| OV-Seg[53] | 40.4 | 89.2 | 49.1 | 69.6 | 92.1 | 68.1 |
| 3D-OVS[15] | 96.7 | 96.3 | 98.9 | 91.6 | 97.3 | 96.2 |
| LangSplat[27] | 99.2 | 98.6 | 99.3 | 97.9 | 99.4 | 98.9 |
| Ours | 99.3 | 97.8 | 99.1 | 99.1 | 99.5 | 99.0 |
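For readers reproducing these numbers, mIoU (Table 1) and precision (Table 2) are standard per-class metrics computed from predicted and ground-truth label maps. The sketch below is a minimal NumPy illustration, not the authors' evaluation code; the function name `miou_and_precision` and the convention of averaging over classes present in each map are our assumptions.

```python
import numpy as np

def miou_and_precision(pred, gt, num_classes):
    """Mean IoU and mean per-class precision for integer label maps.

    pred, gt: (H, W) arrays of class indices in [0, num_classes).
    Classes absent from both maps are skipped (an assumed convention).
    """
    ious, precisions = [], []
    for c in range(num_classes):
        pred_c, gt_c = pred == c, gt == c
        inter = np.logical_and(pred_c, gt_c).sum()   # true positives
        union = np.logical_or(pred_c, gt_c).sum()    # TP + FP + FN
        if union > 0:
            ious.append(inter / union)
        if pred_c.sum() > 0:
            precisions.append(inter / pred_c.sum())  # TP / (TP + FP)
    return float(np.mean(ious)), float(np.mean(precisions))
```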
Table 3 Comparison with the original 3D Gaussian model on rendered-image quality metrics

| Method | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|
| Original 3DGS | 35.81 | 0.971 | 0.065 |
| Ours | 38.33 | 0.978 | 0.061 |
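PSNR in Table 3 follows the standard definition PSNR = 10·log10(MAX²/MSE); a minimal NumPy sketch is given below. SSIM and LPIPS are typically taken from off-the-shelf implementations (e.g., `skimage.metrics.structural_similarity` and the `lpips` package); which implementations the authors used is not stated here.

```python
import numpy as np

def psnr(rendered, reference, max_val=1.0):
    """Peak signal-to-noise ratio in dB for float images in [0, max_val]."""
    diff = np.asarray(rendered, dtype=np.float64) - np.asarray(reference, dtype=np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```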
Fig. 4 Open vocabulary editing through interaction with 3D semantic Gaussians ((a) Extract the ‘Winnie-the-Pooh’; (b) Extract the ‘rabbit’; (c) Change the color of ‘a red Switch’; (d) Change the color of ‘Pikachu’; (e) Extract the ‘gerbera’; (f) Change the color of ‘shrilling chicken’; (g) Change the color of ‘gerbera’; (h) Change the color of ‘Winnie-the-Pooh’)
Fig. 5 3D Gaussian spheres for open vocabulary instance objects ((a) Dove body wash; (b) A red gerbera; (c) A bottle of perfume; (d) A black Nike shoe; (e) Sunglasses; (f) A wooden ukulele)
Table 4 Ablation study: comparison of different semantic feature extraction methods/%

| Feature extraction module | mIoU | Accuracy |
|---|---|---|
| patch_based | 89.5 | 91.2 |
| SAM | 94.8 | 99.0 |
Table 5 Ablation study: effect of the segmentation strategy on semantic segmentation accuracy for novel views/%

| Segmentation strategy | mIoU | Accuracy |
|---|---|---|
| Similarity map Φ only | 89.1 | 90.8 |
| +ACF | 90.9 | 91.5 |
| +TF | 92.6 | 95.3 |
| Ours | 94.8 | 99.0 |
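The "similarity map Φ only" baseline in Table 5 suggests the basic open-vocabulary query step: compare per-pixel language features rendered from the semantic Gaussians with the CLIP embedding of the text query, then threshold the resulting similarity map. The sketch below is a hedged illustration of that baseline under those assumptions; the function name, the cosine-similarity formulation, and the threshold value are not the paper's API, and the ACF/TF refinements are not modeled.

```python
import numpy as np

def query_segmentation(feature_map, text_embedding, threshold=0.5):
    """Segment one rendered view by cosine similarity to a text query.

    feature_map: (H, W, D) per-pixel language features rendered from the
                 semantic 3D Gaussians (assumed input).
    text_embedding: (D,) CLIP embedding of the open-vocabulary query.
    Returns the similarity map Φ and a binary mask.
    """
    f = feature_map / (np.linalg.norm(feature_map, axis=-1, keepdims=True) + 1e-8)
    t = text_embedding / (np.linalg.norm(text_embedding) + 1e-8)
    phi = f @ t             # (H, W) cosine-similarity map Φ
    mask = phi > threshold  # baseline: threshold Φ directly
    return phi, mask
```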
[1] | SHEN W, YANG G, YU A, et al. Distilled feature fields enable few-shot language-guided manipulation[EB/OL]. [2024-06-19]. https://proceedings.mlr.press/v229/shen23a.html. |
[2] | WANG Y X, ZHANG M T, LI Z R, et al. D3Fields: dynamic 3D descriptor fields for zero-shot generalizable rearrangement[EB/OL]. [2024-06-19]. https://arxiv.org/abs/2309.16118. |
[3] | RASHID A, SHARMA S, KIM C M, et al. Language embedded radiance fields for zero-shot task-oriented grasping[EB/OL]. [2024-06-19]. https://lerftogo.github.io/desktop.html. |
[4] | KOBAYASHI S, MATSUMOTO E, SITZMANN V. Decomposing NeRF for editing via feature field distillation[C]// The 36th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2022: 1694. |
[5] | KERR J, KIM C M, GOLDBERG K, et al. LERF: language embedded radiance fields[C]// 2023 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2023: 19672-19682. |
[6] | JATAVALLABHULA K M, KUWAJERWALA A, GU Q, et al. ConceptFusion: open-set multimodal 3D mapping[EB/OL]. [2024-06-19]. https://arxiv.org/abs/2302.07241. |
[7] | BEHLEY J, GARBADE M, MILIOTO A, et al. Towards 3D LiDAR-based semantic scene understanding of 3D point cloud sequences: the SemanticKITTI Dataset[J]. The International Journal of Robotics Research, 2021, 40(8/9): 959-967. |
[8] | CAESAR H, BANKITI V, LANG A H, et al. nuScenes: a multimodal dataset for autonomous driving[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 11618-11628. |
[9] | NIU C G, LIU Y J, LI Z M, et al. 3D object recognition and model segmentation based on point cloud data[J]. Journal of Graphics, 2019, 40(2): 274-281 (in Chinese). |
[10] | HU Q Y, YANG B, FANG G C, et al. SQN: weakly-supervised semantic segmentation of large-scale 3D point clouds[C]// The 17th European Conference on Computer Vision. Cham: Springer, 2022: 600-619. |
[11] | HA H, SONG S R. Semantic abstraction: open-world 3D scene understanding from 2D vision-language models[EB/OL]. [2024-06-19]. https://proceedings.mlr.press/v205/ha23a.html. |
[12] | PENG S Y, GENOVA K, JIANG C Y, et al. OpenScene: 3D scene understanding with open vocabularies[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 815-824. |
[13] | MILDENHALL B, SRINIVASAN P P, TANCIK M, et al. NeRF: representing scenes as neural radiance fields for view synthesis[J]. Communications of the ACM, 2021, 65(1): 99-106. |
[14] | FAN T, YANG H, YIN W, et al. Multi-scale view synthesis based on neural radiance field[J]. Journal of Graphics, 2023, 44(6): 1140-1148 (in Chinese). |
[15] | LIU K H, ZHAN F N, ZHANG J H, et al. Weakly supervised 3D open-vocabulary segmentation[C]// The 37th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2023: 2325. |
[16] | RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[EB/OL]. [2024-06-19]. http://proceedings.mlr.press/v139/radford21a. |
[17] | CARON M, TOUVRON H, MISRA I, et al. Emerging properties in self-supervised vision transformers[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2021: 9630-9640. |
[18] | LIANG S N, LIU Y C, WU S Z, et al. ONeRF: unsupervised 3D object segmentation from multiple views[EB/OL]. [2024-06-19]. https://arxiv.org/abs/2211.12038. |
[19] | STELZNER K, KERSTING K, KOSIOREK A R. Decomposing 3D scenes into objects via unsupervised volume segmentation[EB/OL]. [2024-06-19]. https://arxiv.org/abs/2104.01148. |
[20] | ZARZAR J, ROJAS S, GIANCOLA S, et al. SegNeRF: 3D part segmentation with neural radiance fields[EB/OL]. [2024-06-19]. https://arxiv.org/abs/2211.11215. |
[21] | TSCHERNEZKI V, LARLUS D, VEDALDI A. NeuralDiff: segmenting 3D objects that move in egocentric videos[C]// 2021 International Conference on 3D Vision. New York: IEEE Press, 2021: 910-919. |
[22] | ZHI S F, LAIDLOW T, LEUTENEGGER S, et al. In-place scene labelling and understanding with implicit scene representation[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2021: 15818-15827. |
[23] | SIDDIQUI Y, PORZI L, BULÒ S R, et al. Panoptic lifting for 3D scene understanding with neural fields[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 9043-9052. |
[24] | GOEL R, SIRIKONDA D, SAINI S, et al. Interactive segmentation of radiance fields[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 4201-4211. |
[25] | TSCHERNEZKI V, LAINA I, LARLUS D, et al. Neural feature fusion fields: 3D distillation of self-supervised 2D image representations[C]// 2022 International Conference on 3D Vision. New York: IEEE Press, 2022: 443-453. |
[26] | LI B Y, WEINBERGER K Q, BELONGIE S, et al. Language-driven semantic segmentation[EB/OL]. [2024-06-19]. https://arxiv.org/abs/2201.03546. |
[27] | QIN M H, LI W H, ZHOU J W, et al. LangSplat: 3D language Gaussian splatting[C]// 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2024: 20051-20060. |
[28] | KERBL B, KOPANAS G, LEIMKUEHLER T, et al. 3D Gaussian splatting for real-time radiance field rendering[J]. ACM Transactions on Graphics, 2023, 42(4): 139. |
[29] | OQUAB M, DARCET T, MOUTAKANNI T, et al. DINOv2: learning robust visual features without supervision[EB/OL]. [2024-06-19]. https://arxiv.org/abs/2304.07193. |
[30] | JIA C, YANG Y F, XIA Y, et al. Scaling up visual and vision-language representation learning with noisy text supervision[EB/OL]. [2024-06-19]. https://proceedings.mlr.press/v139/jia21b.html. |
[31] | ROMBACH R, BLATTMANN A, LORENZ D, et al. High-resolution image synthesis with latent diffusion models[C]// 2022 IEEE/CVF conference on computer vision and pattern recognition. New York: IEEE Press, 2022: 10674-10685. |
[32] | ZHOU C, LI Q, LI C, et al. A comprehensive survey on pretrained foundation models: a history from BERT to ChatGPT[EB/OL]. (2023-05-01) [2024-06-19]. https://link.springer.com/article/10.1007/s13042-024-02443-6. |
[33] | BOMMASANI R, HUDSON D A, ADELI E, et al. On the opportunities and risks of foundation models[EB/OL]. [2024-06-19]. https://arxiv.org/abs/2108.07258. |
[34] | KIRILLOV A, MINTUN E, RAVI N, et al. Segment anything[C]// 2023 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2023: 3992-4003. |
[35] | RAMESH A, DHARIWAL P, NICHOL A, et al. Hierarchical text-conditional image generation with CLIP latents[EB/OL]. [2024-06-19]. https://arxiv.org/abs/2204.06125. |
[36] | WANG Z Q, LU Y, LI Q, et al. CRIS: CLIP-driven referring image segmentation[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 11676-11685. |
[37] | SONG H Y, DONG L, ZHANG W N, et al. CLIP models are few-shot learners: empirical studies on VQA and visual entailment[EB/OL]. [2024-06-19]. https://arxiv.org/abs/2203.07190. |
[38] | GAO S H, LIN Z J, XIE X Y, et al. EditAnything: empowering unparalleled flexibility in image editing and generation[C]// The 31st ACM International Conference on Multimedia. New York: ACM, 2023: 9414-9416. |
[39] | YAO J F, WANG X G, YE L, et al. Matte anything: interactive natural image matting with segment anything model[J]. Image and Vision Computing, 2024, 147: 105067. |
[40] | CHENG Y M, LI L L, XU Y Y, et al. Segment and track anything[EB/OL]. [2024-06-19]. https://arxiv.org/abs/2305.06558. |
[41] | YANG J Y, GAO M Q, LI Z, et al. Track anything: segment anything meets videos[EB/OL]. [2024-06-19]. https://arxiv.org/abs/2304.11968. |
[42] | LUITEN J, KOPANAS G, LEIBE B, et al. Dynamic 3D Gaussians: tracking by persistent dynamic view synthesis[C]// 2024 International Conference on 3D Vision. New York: IEEE Press, 2024: 800-809. |
[43] | YANG Z Y, GAO X Y, ZHOU W, et al. Deformable 3D Gaussians for high-fidelity monocular dynamic scene reconstruction[C]// 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2024: 20331-20341. |
[44] | WU G J, YI T R, FANG J M, et al. 4D Gaussian splatting for real-time dynamic scene rendering[C]// 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2024: 20310-20320. |
[45] | ZHANG K, LUAN F J, WANG Q Q, et al. PhySG: inverse rendering with spherical Gaussians for physics-based material editing and relighting[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 5449-5458. |
[46] | TANG J X, REN J W, ZHOU H, et al. DreamGaussian: generative Gaussian splatting for efficient 3D content creation[EB/OL]. [2024-06-19]. https://arxiv.org/abs/2309.16653. |
[47] | CHEN Y W, CHEN R, LEI J B, et al. TANGO: text-driven photorealistic and robust 3D stylization via lighting decomposition[C]// The 36th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2022: 2242. |
[48] | SCHÖNBERGER J L, FRAHM J M. Structure-from-motion revisited[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 4104-4113. |
[49] | MAX N. Optical models for direct volume rendering[J]. IEEE Transactions on Visualization and Computer Graphics, 1995, 1(2): 99-108. |
[50] | DAI A, CHANG A X, SAVVA M, et al. ScanNet: richly-annotated 3D reconstructions of indoor scenes[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 2432-2443. |
[51] | STRAUB J, WHELAN T, MA L N, et al. The replica dataset: a digital replica of indoor spaces[EB/OL]. [2024-06-19]. https://arxiv.org/abs/1906.05797. |
[52] | XU J R, LIU S F, VAHDAT A, et al. Open-vocabulary panoptic segmentation with text-to-image diffusion models[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 2955-2966. |
[53] | LIANG F, WU B C, DAI X L, et al. Open-vocabulary semantic segmentation with mask-adapted CLIP[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 7061-7070. |