Journal of Graphics ›› 2025, Vol. 46 ›› Issue (2): 312-321. DOI: 10.11996/JG.j.2095-302X.2025020312
• Computer Graphics and Virtual Reality •
LIU Gaoyi1, HU Ruizhen2, LIU Ligang1
Received: 2024-08-22
Accepted: 2024-12-22
Online: 2025-04-30
Published: 2025-04-24
Contact: HU Ruizhen
About the first author: LIU Gaoyi (1998-), master student. His main research interest is computer graphics. E-mail: liugaoyi@mail.ustc.edu.cn
LIU Gaoyi, HU Ruizhen, LIU Ligang. 3D Gaussian splatting semantic segmentation and editing based on 2D feature distillation[J]. Journal of Graphics, 2025, 46(2): 312-321.
URL: http://www.txxb.com.cn/EN/10.11996/JG.j.2095-302X.2025020312
Table 1 mIoU comparison results on the 3D-OVS dataset/%

| Method | Bed | Bench | Room | Sofa | Lawn | Overall |
|---|---|---|---|---|---|---|
| LSeg[26] | 56.0 | 6.0 | 19.2 | 4.5 | 17.5 | 20.6 |
| ODISE[52] | 52.6 | 24.1 | 52.5 | 48.3 | 39.8 | 43.5 |
| OV-Seg[53] | 78.9 | 89.9 | 71.4 | 66.1 | 81.2 | 77.5 |
| 3D-OVS[15] | 89.5 | 89.3 | 92.8 | 74.0 | 88.2 | 86.8 |
| LangSplat[27] | 92.5 | 94.2 | 94.1 | 90.0 | 96.1 | 93.4 |
| Ours | 95.0 | 93.8 | 93.5 | 95.5 | 96.3 | 94.8 |
Table 2 Precision comparison results on the 3D-OVS dataset/%

| Method | Bed | Bench | Room | Sofa | Lawn | Overall |
|---|---|---|---|---|---|---|
| LSeg[26] | 87.6 | 42.7 | 46.1 | 16.5 | 77.5 | 54.1 |
| ODISE[52] | 86.5 | 39.0 | 59.7 | 35.4 | 82.5 | 60.6 |
| OV-Seg[53] | 40.4 | 89.2 | 49.1 | 69.6 | 92.1 | 68.1 |
| 3D-OVS[15] | 96.7 | 96.3 | 98.9 | 91.6 | 97.3 | 96.2 |
| LangSplat[27] | 99.2 | 98.6 | 99.3 | 97.9 | 99.4 | 98.9 |
| Ours | 99.3 | 97.8 | 99.1 | 99.1 | 99.5 | 99.0 |
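For readers reproducing these numbers, mIoU (Table 1) and precision (Table 2) are standard per-class metrics computed from predicted and ground-truth label maps. The sketch below is a minimal NumPy illustration, not the authors' evaluation code; the function name `miou_and_precision` and the convention of averaging over classes present in each map are our assumptions.

```python
import numpy as np

def miou_and_precision(pred, gt, num_classes):
    """Mean IoU and mean per-class precision for integer label maps.

    pred, gt: (H, W) arrays of class indices in [0, num_classes).
    Classes absent from both maps are skipped (an assumed convention).
    """
    ious, precisions = [], []
    for c in range(num_classes):
        pred_c, gt_c = pred == c, gt == c
        inter = np.logical_and(pred_c, gt_c).sum()   # true positives
        union = np.logical_or(pred_c, gt_c).sum()    # TP + FP + FN
        if union > 0:
            ious.append(inter / union)
        if pred_c.sum() > 0:
            precisions.append(inter / pred_c.sum())  # TP / (TP + FP)
    return float(np.mean(ious)), float(np.mean(precisions))
```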
Table 3 Comparison with the original 3D Gaussian model on rendered-image quality metrics

| Method | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|
| Original 3DGS | 35.81 | 0.971 | 0.065 |
| Ours | 38.33 | 0.978 | 0.061 |
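PSNR in Table 3 follows the standard definition PSNR = 10·log10(MAX²/MSE); a minimal NumPy sketch is given below. SSIM and LPIPS are typically taken from off-the-shelf implementations (e.g., `skimage.metrics.structural_similarity` and the `lpips` package); which implementations the authors used is not stated here.

```python
import numpy as np

def psnr(rendered, reference, max_val=1.0):
    """Peak signal-to-noise ratio in dB for float images in [0, max_val]."""
    diff = np.asarray(rendered, dtype=np.float64) - np.asarray(reference, dtype=np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```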
Fig. 4 Open vocabulary editing through interaction with 3D semantic Gaussians ((a) Extract the ‘Winnie-the-Pooh’; (b) Extract the ‘rabbit’; (c) Change the color of ‘a red Switch’; (d) Change the color of ‘Pikachu’; (e) Extract the ‘gerbera’; (f) Change the color of ‘shrilling chicken’; (g) Change the color of ‘gerbera’; (h) Change the color of ‘Winnie-the-Pooh’)
Fig. 5 3D Gaussian spheres for open vocabulary instance objects ((a) Dove body wash; (b) A red gerbera; (c) A bottle of perfume; (d) A black Nike shoe; (e) Sunglasses; (f) A wooden ukulele)
Table 4 Ablation study: comparison of different semantic feature extraction methods/%

| Feature extraction module | mIoU | Accuracy |
|---|---|---|
| patch_based | 89.5 | 91.2 |
| SAM | 94.8 | 99.0 |
Table 5 Ablation study: effect of the segmentation strategy on semantic segmentation accuracy for novel views/%

| Segmentation strategy | mIoU | Accuracy |
|---|---|---|
| Similarity map Φ only | 89.1 | 90.8 |
| +ACF | 90.9 | 91.5 |
| +TF | 92.6 | 95.3 |
| Ours | 94.8 | 99.0 |
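The "similarity map Φ only" baseline in Table 5 suggests the basic open-vocabulary query step: compare per-pixel language features rendered from the semantic Gaussians with the CLIP embedding of the text query, then threshold the resulting similarity map. The sketch below is a hedged illustration of that baseline under those assumptions; the function name, the cosine-similarity formulation, and the threshold value are not the paper's API, and the ACF/TF refinements are not modeled.

```python
import numpy as np

def query_segmentation(feature_map, text_embedding, threshold=0.5):
    """Segment one rendered view by cosine similarity to a text query.

    feature_map: (H, W, D) per-pixel language features rendered from the
                 semantic 3D Gaussians (assumed input).
    text_embedding: (D,) CLIP embedding of the open-vocabulary query.
    Returns the similarity map Φ and a binary mask.
    """
    f = feature_map / (np.linalg.norm(feature_map, axis=-1, keepdims=True) + 1e-8)
    t = text_embedding / (np.linalg.norm(text_embedding) + 1e-8)
    phi = f @ t             # (H, W) cosine-similarity map Φ
    mask = phi > threshold  # baseline: threshold Φ directly
    return phi, mask
```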
[1] | SHEN W, YANG G, YU A, et al. Distilled feature fields enable few-shot language-guided manipulation[EB/OL]. [2024-06-19]. https://proceedings.mlr.press/v229/shen23a.html. |
[2] | WANG Y X, ZHANG M T, LI Z R, et al. D3Fields: dynamic 3D descriptor fields for zero-shot generalizable rearrangement[EB/OL]. [2024-06-19]. https://arxiv.org/abs/2309.16118. |
[3] | RASHID A, SHARMA S, KIM C M, et al. Language embedded radiance fields for zero-shot task-oriented grasping[EB/OL]. [2024-06-19]. https://lerftogo.github.io/desktop.html. |
[4] | KOBAYASHI S, MATSUMOTO E, SITZMANN V. Decomposing NeRF for editing via feature field distillation[C]// The 36th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2022: 1694. |
[5] | KERR J, KIM C M, GOLDBERG K, et al. LERF: language embedded radiance fields[C]// 2023 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2023: 19672-19682. |
[6] | JATAVALLABHULA K M, KUWAJERWALA A, GU Q, et al. ConceptFusion: open-set multimodal 3D mapping[EB/OL]. [2024-06-19]. https://arxiv.org/abs/2302.07241. |
[7] | BEHLEY J, GARBADE M, MILIOTO A, et al. Towards 3D LiDAR-based semantic scene understanding of 3D point cloud sequences: the SemanticKITTI Dataset[J]. The International Journal of Robotics Research, 2021, 40(8/9): 959-967. |
[8] | CAESAR H, BANKITI V, LANG A H, et al. nuScenes: a multimodal dataset for autonomous driving[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 11618-11628. |
[9] | NIU C G, LIU Y J, LI Z M, et al. 3D object recognition and model segmentation based on point cloud data[J]. Journal of Graphics, 2019, 40(2): 274-281 (in Chinese). |
[10] | HU Q Y, YANG B, FANG G C, et al. SQN: weakly-supervised semantic segmentation of large-scale 3D point clouds[C]// The 17th European Conference on Computer Vision. Cham: Springer, 2022: 600-619. |
[11] | HA H, SONG S R. Semantic abstraction: open-world 3D scene understanding from 2D vision-language models[EB/OL]. [2024-06-19]. https://proceedings.mlr.press/v205/ha23a.html. |
[12] | PENG S Y, GENOVA K, JIANG C Y, et al. OpenScene: 3D scene understanding with open vocabularies[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 815-824. |
[13] | MILDENHALL B, SRINIVASAN P P, TANCIK M, et al. NeRF: representing scenes as neural radiance fields for view synthesis[J]. Communications of the ACM, 2021, 65(1): 99-106. |
[14] | FAN T, YANG H, YIN W, et al. Multi-scale view synthesis based on neural radiance field[J]. Journal of Graphics, 2023, 44(6): 1140-1148 (in Chinese). |
[15] | LIU K H, ZHAN F N, ZHANG J H, et al. Weakly supervised 3D open-vocabulary segmentation[C]// The 37th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2023: 2325. |
[16] | RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[EB/OL]. [2024-06-19]. http://proceedings.mlr.press/v139/radford21a. |
[17] | CARON M, TOUVRON H, MISRA I, et al. Emerging properties in self-supervised vision transformers[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2021: 9630-9640. |
[18] | LIANG S N, LIU Y C, WU S Z, et al. ONeRF: unsupervised 3D object segmentation from multiple views[EB/OL]. [2024-06-19]. https://arxiv.org/abs/2211.12038. |
[19] | STELZNER K, KERSTING K, KOSIOREK A R. Decomposing 3D scenes into objects via unsupervised volume segmentation[EB/OL]. [2024-06-19]. https://arxiv.org/abs/2104.01148. |
[20] | ZARZAR J, ROJAS S, GIANCOLA S, et al. SegNeRF: 3D part segmentation with neural radiance fields[EB/OL]. [2024-06-19]. https://arxiv.org/abs/2211.11215. |
[21] | TSCHERNEZKI V, LARLUS D, VEDALDI A. NeuralDiff: segmenting 3D objects that move in egocentric videos[C]// 2021 International Conference on 3D Vision. New York: IEEE Press, 2021: 910-919. |
[22] | ZHI S F, LAIDLOW T, LEUTENEGGER S, et al. In-place scene labelling and understanding with implicit scene representation[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2021: 15818-15827. |
[23] | SIDDIQUI Y, PORZI L, BULÒ S R, et al. Panoptic lifting for 3D scene understanding with neural fields[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 9043-9052. |
[24] | GOEL R, SIRIKONDA D, SAINI S, et al. Interactive segmentation of radiance fields[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 4201-4211. |
[25] | TSCHERNEZKI V, LAINA I, LARLUS D, et al. Neural feature fusion fields: 3D distillation of self-supervised 2D image representations[C]// 2022 International Conference on 3D Vision. New York: IEEE Press, 2022: 443-453. |
[26] | LI B Y, WEINBERGER K Q, BELONGIE S, et al. Language-driven semantic segmentation[EB/OL]. [2024-06-19]. https://arxiv.org/abs/2201.03546. |
[27] | QIN M H, LI W H, ZHOU J W, et al. LangSplat: 3D language Gaussian splatting[C]// 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2024: 20051-20060. |
[28] | KERBL B, KOPANAS G, LEIMKUEHLER T, et al. 3D Gaussian splatting for real-time radiance field rendering[J]. ACM Transactions on Graphics, 2023, 42(4): 139. |
[29] | OQUAB M, DARCET T, MOUTAKANNI T, et al. DINOv2: learning robust visual features without supervision[EB/OL]. [2024-06-19]. https://arxiv.org/abs/2304.07193. |
[30] | JIA C, YANG Y F, XIA Y, et al. Scaling up visual and vision-language representation learning with noisy text supervision[EB/OL]. [2024-06-19]. https://proceedings.mlr.press/v139/jia21b.html. |
[31] | ROMBACH R, BLATTMANN A, LORENZ D, et al. High-resolution image synthesis with latent diffusion models[C]// 2022 IEEE/CVF conference on computer vision and pattern recognition. New York: IEEE Press, 2022: 10674-10685. |
[32] | ZHOU C, LI Q, LI C, et al. A comprehensive survey on pretrained foundation models: a history from BERT to ChatGPT[EB/OL]. (2023-05-01) [2024-06-19]. https://link.springer.com/article/10.1007/s13042-024-02443-6. |
[33] | BOMMASANI R, HUDSON D A, ADELI E, et al. On the opportunities and risks of foundation models[EB/OL]. [2024-06-19]. https://arxiv.org/abs/2108.07258. |
[34] | KIRILLOV A, MINTUN E, RAVI N, et al. Segment anything[C]// 2023 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2023: 3992-4003. |
[35] | RAMESH A, DHARIWAL P, NICHOL A, et al. Hierarchical text-conditional image generation with CLIP latents[EB/OL]. [2024-06-19]. https://arxiv.org/abs/2204.06125. |
[36] | WANG Z Q, LU Y, LI Q, et al. CRIS: CLIP-driven referring image segmentation[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 11676-11685. |
[37] | SONG H Y, DONG L, ZHANG W N, et al. CLIP models are few-shot learners: empirical studies on VQA and visual entailment[EB/OL]. [2024-06-19]. https://arxiv.org/abs/2203.07190. |
[38] | GAO S H, LIN Z J, XIE X Y, et al. EditAnything: empowering unparalleled flexibility in image editing and generation[C]// The 31st ACM International Conference on Multimedia. New York: ACM, 2023: 9414-9416. |
[39] | YAO J F, WANG X G, YE L, et al. Matte anything: interactive natural image matting with segment anything model[J]. Image and Vision Computing, 2024, 147: 105067. |
[40] | CHENG Y M, LI L L, XU Y Y, et al. Segment and track anything[EB/OL]. [2024-06-19]. https://arxiv.org/abs/2305.06558. |
[41] | YANG J Y, GAO M Q, LI Z, et al. Track anything: segment anything meets videos[EB/OL]. [2024-06-19]. https://arxiv.org/abs/2304.11968. |
[42] | LUITEN J, KOPANAS G, LEIBE B, et al. Dynamic 3D Gaussians: tracking by persistent dynamic view synthesis[C]// 2024 International Conference on 3D Vision. New York: IEEE Press, 2024: 800-809. |
[43] | YANG Z Y, GAO X Y, ZHOU W, et al. Deformable 3D Gaussians for high-fidelity monocular dynamic scene reconstruction[C]// 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2024: 20331-20341. |
[44] | WU G J, YI T R, FANG J M, et al. 4D Gaussian splatting for real-time dynamic scene rendering[C]// 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2024: 20310-20320. |
[45] | ZHANG K, LUAN F J, WANG Q Q, et al. PhySG: inverse rendering with spherical Gaussians for physics-based material editing and relighting[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 5449-5458. |
[46] | TANG J X, REN J W, ZHOU H, et al. DreamGaussian: generative Gaussian splatting for efficient 3D content creation[EB/OL]. [2024-06-19]. https://arxiv.org/abs/2309.16653. |
[47] | CHEN Y W, CHEN R, LEI J B, et al. TANGO: text-driven photorealistic and robust 3D stylization via lighting decomposition[C]// The 36th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2022: 2242. |
[48] | SCHÖNBERGER J L, FRAHM J M. Structure-from-motion revisited[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 4104-4113. |
[49] | MAX N. Optical models for direct volume rendering[J]. IEEE Transactions on Visualization and Computer Graphics, 1995, 1(2): 99-108. |
[50] | DAI A, CHANG A X, SAVVA M, et al. ScanNet: richly-annotated 3D reconstructions of indoor scenes[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 2432-2443. |
[51] | STRAUB J, WHELAN T, MA L N, et al. The replica dataset: a digital replica of indoor spaces[EB/OL]. [2024-06-19]. https://arxiv.org/abs/1906.05797. |
[52] | XU J R, LIU S F, VAHDAT A, et al. Open-vocabulary panoptic segmentation with text-to-image diffusion models[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 2955-2966. |
[53] | LIANG F, WU B C, DAI X L, et al. Open-vocabulary semantic segmentation with mask-adapted CLIP[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 7061-7070. |