Journal of Graphics ›› 2025, Vol. 46 ›› Issue (2): 312-321. DOI: 10.11996/JG.j.2095-302X.2025020312
Corresponding author: HU Ruizhen (1988-), professor, Ph.D. Her main research interests cover computer graphics and embodied intelligence. E-mail: ruizhen.hu@szu.edu.cn
3D Gaussian splatting semantic segmentation and editing based on 2D feature distillation

LIU Gaoyi¹, HU Ruizhen², LIU Ligang¹
Received:
2024-08-22
Accepted:
2024-12-22
Published:
2025-04-30
Online:
2025-04-24
First author:
LIU Gaoyi (1998-), master's student. His main research interests cover computer graphics. E-mail: liugaoyi@mail.ustc.edu.cn
Abstract:
Semantic understanding of 3D scenes is one of the fundamental ways in which humans perceive the world. Semantic tasks such as open-vocabulary segmentation and semantic editing are important research areas in computer vision and computer graphics. Because large, diverse 3D open-vocabulary segmentation datasets are lacking, directly training a robust, generalizable model is far from trivial. To this end, this paper proposes 3D Gaussian splatting based on 2D feature distillation, a method that distills the semantic embeddings of the SAM and CLIP foundation models into 3D Gaussians. For each scene, per-pixel semantic features are obtained from SAM and CLIP, and a scene-specific semantic feature field is then trained through the differentiable rendering of 3D Gaussians. For semantic segmentation, a multi-step segmentation-mask selection strategy is designed to obtain precise segmentation boundaries for every object in the scene, yielding accurate open-vocabulary semantic segmentation of novel-view images without tedious hierarchical feature extraction and training. Exploiting the explicit 3D Gaussian scene representation, the method effectively establishes correspondences between text and 3D objects, enabling semantic editing. Experiments show that the method achieves comparable or better qualitative and quantitative results than the compared methods on semantic segmentation, while the 3D Gaussian semantic feature field additionally enables open-vocabulary semantic editing.
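To make the pipeline concrete, the following is a minimal sketch of the distillation step described above: precomputed per-pixel SAM/CLIP teacher features supervise a feature field rendered from the 3D Gaussians. The names `render_features` and `gaussians.semantic_features` are hypothetical; the actual rasterizer follows the differentiable splatting of 3DGS [28].

```python
import torch
import torch.nn.functional as F

def distillation_step(gaussians, camera, teacher_features, render_features, optimizer):
    """One training step: render a per-pixel feature map from the 3D
    Gaussians and regress it onto the precomputed SAM/CLIP teacher
    features for this view.

    gaussians.semantic_features : (N, D) learnable feature per Gaussian
    teacher_features            : (H, W, D) per-pixel teacher features
    render_features             : hypothetical differentiable rasterizer
                                  that alpha-blends per-Gaussian features,
                                  analogous to how 3DGS blends colors
    """
    rendered = render_features(gaussians, camera)  # (H, W, D)
    # Cosine loss: for CLIP-style embeddings only the feature direction
    # matters; cosine_similarity normalizes both sides internally.
    loss = 1.0 - F.cosine_similarity(rendered, teacher_features, dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```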
LIU Gaoyi, HU Ruizhen, LIU Ligang. 3D Gaussian splatting semantic segmentation and editing based on 2D feature distillation[J]. Journal of Graphics, 2025, 46(2): 312-321.
Table 1 mIoU comparison results on the 3D-OVS dataset / %

| Method | Bed | Bench | Room | Sofa | Lawn | Overall |
| --- | --- | --- | --- | --- | --- | --- |
| LSeg[26] | 56.0 | 6.0 | 19.2 | 4.5 | 17.5 | 20.6 |
| ODISE[52] | 52.6 | 24.1 | 52.5 | 48.3 | 39.8 | 43.5 |
| OV-Seg[53] | 78.9 | 89.9 | 71.4 | 66.1 | 81.2 | 77.5 |
| 3D-OVS[15] | 89.5 | 89.3 | 92.8 | 74.0 | 88.2 | 86.8 |
| LangSplat[27] | 92.5 | 94.2 | 94.1 | 90.0 | 96.1 | 93.4 |
| Ours | 95.0 | 93.8 | 93.5 | 95.5 | 96.3 | 94.8 |
Table 2 Accuracy comparison results on the 3D-OVS dataset / %

| Method | Bed | Bench | Room | Sofa | Lawn | Overall |
| --- | --- | --- | --- | --- | --- | --- |
| LSeg[26] | 87.6 | 42.7 | 46.1 | 16.5 | 77.5 | 54.1 |
| ODISE[52] | 86.5 | 39.0 | 59.7 | 35.4 | 82.5 | 60.6 |
| OV-Seg[53] | 40.4 | 89.2 | 49.1 | 69.6 | 92.1 | 68.1 |
| 3D-OVS[15] | 96.7 | 96.3 | 98.9 | 91.6 | 97.3 | 96.2 |
| LangSplat[27] | 99.2 | 98.6 | 99.3 | 97.9 | 99.4 | 98.9 |
| Ours | 99.3 | 97.8 | 99.1 | 99.1 | 99.5 | 99.0 |
Fig. 3 Visualization of segmentation results on the 3D-OVS dataset ((a) Bed; (b) Blue sofa; (c) Sofa; (d) Lawn)
Table 3 Comparison with the original 3DGS on rendered-image quality metrics

| Metrics | PSNR↑ | SSIM↑ | LPIPS↓ |
| --- | --- | --- | --- |
| Original 3DGS | 35.81 | 0.971 | 0.065 |
| Ours | 38.33 | 0.978 | 0.061 |
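The numbers in Table 3 use the standard PSNR, SSIM, and LPIPS definitions. A minimal evaluation sketch with torchmetrics follows; the exact import paths depend on the torchmetrics version, and the random tensors stand in for real rendered/ground-truth views.

```python
import torch
from torchmetrics.image import PeakSignalNoiseRatio, StructuralSimilarityIndexMeasure
from torchmetrics.image.lpips import LearnedPerceptualImagePatchSimilarity

# Rendered and ground-truth views as (B, 3, H, W) tensors in [0, 1].
pred = torch.rand(1, 3, 256, 256)
target = torch.rand(1, 3, 256, 256)

psnr = PeakSignalNoiseRatio(data_range=1.0)
ssim = StructuralSimilarityIndexMeasure(data_range=1.0)
# normalize=True tells the metric the inputs are in [0, 1] rather than
# [-1, 1]; the VGG backbone weights are downloaded on first use.
lpips = LearnedPerceptualImagePatchSimilarity(net_type="vgg", normalize=True)

print(f"PSNR:  {psnr(pred, target):.2f}")   # higher is better
print(f"SSIM:  {ssim(pred, target):.3f}")   # higher is better
print(f"LPIPS: {lpips(pred, target):.3f}")  # lower is better
```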
Fig. 4 Open-vocabulary editing through interaction with the 3D semantic Gaussians ((a) Extract 'Winnie-the-Pooh'; (b) Extract 'rabbit'; (c) Change the color of 'a red Switch'; (d) Change the color of 'Pikachu'; (e) Extract 'gerbera'; (f) Change the color of 'shrilling chicken'; (g) Change the color of 'gerbera'; (h) Change the color of 'Winnie-the-Pooh')
Fig. 5 3D Gaussian spheres for open-vocabulary instance objects ((a) Dove body wash; (b) A red gerbera; (c) A bottle of perfume; (d) A black Nike shoe; (e) Sunglasses; (f) A wooden ukulele)
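Because the scene representation is explicit, the edits shown in Figures 4 and 5 reduce to selecting the Gaussians whose distilled feature matches a CLIP text embedding, then extracting or recoloring them. A hedged sketch follows; `gaussians.semantic_features` and the similarity threshold are assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP package

@torch.no_grad()
def select_gaussians(gaussians, prompt, threshold=0.7, device="cuda"):
    """Boolean mask over Gaussians whose distilled semantic feature matches
    the text prompt. `gaussians.semantic_features` is the (N, D) field
    learned by distillation; D must match the CLIP text embedding width,
    and the threshold is a per-scene tuning knob (an assumption)."""
    model, _ = clip.load("ViT-B/32", device=device)
    tokens = clip.tokenize([prompt]).to(device)
    text = F.normalize(model.encode_text(tokens).float(), dim=-1)  # (1, D)
    feats = F.normalize(gaussians.semantic_features, dim=-1)       # (N, D)
    sim = (feats @ text.T).squeeze(1)                              # (N,)
    # Rescale to [0, 1] so one threshold behaves similarly across scenes.
    sim = (sim - sim.min()) / (sim.max() - sim.min() + 1e-8)
    return sim > threshold

# Extraction ((a), (b), (e) in Fig. 4): keep only the matched Gaussians.
#   mask = select_gaussians(gaussians, "Winnie-the-Pooh")
#   extracted = gaussians[mask]
# Recoloring ((c), (d), (f)-(h)): overwrite the color of the matched
# Gaussians, e.g. the DC term of their spherical-harmonic coefficients.
```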
Table 4 Ablation study: comparison of different semantic feature extraction modules / %

| Feature extraction module | mIoU | Accuracy |
| --- | --- | --- |
| patch_based | 89.5 | 91.2 |
| SAM | 94.8 | 99.0 |
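The "SAM" row in Table 4 corresponds to pooling CLIP features over SAM regions instead of over sliding-window patches. Below is a sketch of such a SAM-based extractor; it assumes the segment-anything `SamAutomaticMaskGenerator` interface, and the crop-then-encode details may differ from the paper's implementation.

```python
import torch
import torch.nn.functional as F
from PIL import Image

@torch.no_grad()
def pixelwise_features(image, mask_generator, clip_model, preprocess, device="cuda"):
    """Assign every pixel the CLIP embedding of the SAM region containing it.

    image          : (H, W, 3) uint8 numpy array
    mask_generator : segment-anything SamAutomaticMaskGenerator
    clip_model     : OpenAI CLIP model; preprocess is its paired transform
    Overlapping regions are resolved by write order here (an assumption).
    """
    H, W = image.shape[:2]
    D = clip_model.visual.output_dim
    features = torch.zeros(H, W, D, device=device)
    for m in mask_generator.generate(image):               # one dict per region
        seg = torch.from_numpy(m["segmentation"]).to(device)  # (H, W) bool
        x, y, w, h = m["bbox"]                             # XYWH bounding box
        crop = Image.fromarray(image[y:y + h, x:x + w])    # crop the region,
        emb = clip_model.encode_image(                     # encode with CLIP,
            preprocess(crop).unsqueeze(0).to(device))
        features[seg] = F.normalize(emb.float(), dim=-1).squeeze(0)  # write back
    return features
```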
Table 5 Ablation study: effect of the segmentation strategy on novel-view semantic segmentation accuracy / %

| Segmentation strategy | mIoU | Accuracy |
| --- | --- | --- |
| Similarity map Φ only | 89.1 | 90.8 |
| +ACF | 90.9 | 91.5 |
| +TF | 92.6 | 95.3 |
| Ours | 94.8 | 99.0 |
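The rows of Table 5 build the mask-selection strategy up from the base similarity map Φ, the per-pixel cosine similarity between the rendered feature map and each class's CLIP text embedding. A minimal sketch of Φ and the "similarity map only" baseline; the ACF and TF refinement steps that close the remaining gap are omitted.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def similarity_map(rendered_features, text_embedding):
    """Phi: per-pixel cosine similarity between the rendered semantic
    feature map (H, W, D) and one CLIP text embedding (D,)."""
    feats = F.normalize(rendered_features, dim=-1)
    text = F.normalize(text_embedding, dim=0)
    return feats @ text                                  # (H, W)

@torch.no_grad()
def segment(rendered_features, text_embeddings):
    """Per-pixel argmax over all candidate classes' similarity maps;
    this is the 'similarity map Phi only' row of Table 5."""
    phis = torch.stack([similarity_map(rendered_features, t)
                        for t in text_embeddings])       # (K, H, W)
    return phis.argmax(dim=0)                            # (H, W) label map
```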
[1] | SHEN W, YANG G, YU A, et al. Distilled feature fields enable few-shot language-guided manipulation[EB/OL]. [2024-06-19]. https://proceedings.mlr.press/v229/shen23a.html. |
[2] | WANG Y X, ZHANG M T, LI Z R, et al. D3Fields: dynamic 3D descriptor fields for zero-shot generalizable rearrangement[EB/OL]. [2024-06-19]. https://arxiv.org/abs/2309.16118. |
[3] | RASHID A, SHARMA S, KIM C M, et al. Language embedded radiance fields for zero-shot task-oriented grasping[EB/OL]. [2024-06-19]. https://lerftogo.github.io/desktop.html. |
[4] | KOBAYASHI S, MATSUMOTO E, SITZMANN V. Decomposing NeRF for editing via feature field distillation[C]// The 36th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2022: 1694. |
[5] | KERR J, KIM C M, GOLDBERG K, et al. LERF: language embedded radiance fields[C]// 2023 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2023: 19672-19682. |
[6] | JATAVALLABHULA K M, KUWAJERWALA A, GU Q, et al. ConceptFusion: open-set multimodal 3D mapping[EB/OL]. [2024-06-19]. https://arxiv.org/abs/2302.07241. |
[7] | BEHLEY J, GARBADE M, MILIOTO A, et al. Towards 3D LiDAR-based semantic scene understanding of 3D point cloud sequences: the SemanticKITTI Dataset[J]. The International Journal of Robotics Research, 2021, 40(8/9): 959-967. |
[8] | CAESAR H, BANKITI V, LANG A H, et al. nuScenes: a multimodal dataset for autonomous driving[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 11618-11628. |
[9] | NIU C G, LIU Y J, LI Z M, et al. 3D object recognition and model segmentation based on point cloud data[J]. Journal of Graphics, 2019, 40(2): 274-281 (in Chinese). |
[10] | HU Q Y, YANG B, FANG G C, et al. SQN: weakly-supervised semantic segmentation of large-scale 3D point clouds[C]// The 17th European Conference on Computer Vision. Cham: Springer, 2022: 600-619. |
[11] | HA H, SONG S R. Semantic abstraction: open-world 3D scene understanding from 2D vision-language models[EB/OL]. [2024-06-19]. https://proceedings.mlr.press/v205/ha23a.html. |
[12] | PENG S Y, GENOVA K, JIANG C Y, et al. OpenScene: 3D scene understanding with open vocabularies[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 815-824. |
[13] | MILDENHALL B, SRINIVASAN P P, TANCIK M, et al. NeRF: representing scenes as neural radiance fields for view synthesis[J]. Communications of the ACM, 2021, 65(1): 99-106. |
[14] | FAN T, YANG H, YIN W, et al. Multi-scale view synthesis based on neural radiance field[J]. Journal of Graphics, 2023, 44(6): 1140-1148 (in Chinese). |
[15] | LIU K H, ZHAN F N, ZHANG J H, et al. Weakly supervised 3D open-vocabulary segmentation[C]// The 37th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2023: 2325. |
[16] | RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[EB/OL]. [2024-06-19]. http://proceedings.mlr.press/v139/radford21a. |
[17] | CARON M, TOUVRON H, MISRA I, et al. Emerging properties in self-supervised vision transformers[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2021: 9630-9640. |
[18] | LIANG S N, LIU Y C, WU S Z, et al. ONeRF: unsupervised 3D object segmentation from multiple views[EB/OL]. [2024-06-19]. https://arxiv.org/abs/2211.12038. |
[19] | STELZNER K, KERSTING K, KOSIOREK A R. Decomposing 3D scenes into objects via unsupervised volume segmentation[EB/OL]. [2024-06-19]. https://arxiv.org/abs/2104.01148. |
[20] | ZARZAR J, ROJAS S, GIANCOLA S, et al. SegNeRF: 3D part segmentation with neural radiance fields[EB/OL]. [2024-06-19]. https://arxiv.org/abs/2211.11215. |
[21] | TSCHERNEZKI V, LARLUS D, VEDALDI A. NeuralDiff: segmenting 3D objects that move in egocentric videos[C]// 2021 International Conference on 3D Vision. New York: IEEE Press, 2021: 910-919. |
[22] | ZHI S F, LAIDLOW T, LEUTENEGGER S, et al. In-place scene labelling and understanding with implicit scene representation[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2021: 15818-15827. |
[23] | SIDDIQUI Y, PORZI L, BULÒ S R, et al. Panoptic lifting for 3D scene understanding with neural fields[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 9043-9052. |
[24] | GOEL R, SIRIKONDA D, SAINI S, et al. Interactive segmentation of radiance fields[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 4201-4211. |
[25] | TSCHERNEZKI V, LAINA I, LARLUS D, et al. Neural feature fusion fields: 3D distillation of self-supervised 2D image representations[C]// 2022 International Conference on 3D Vision. New York: IEEE Press, 2022: 443-453. |
[26] | LI B Y, WEINBERGER K Q, BELONGIE S, et al. Language-driven semantic segmentation[EB/OL]. [2024-06-19]. https://arxiv.org/abs/2201.03546. |
[27] | QIN M H, LI W H, ZHOU J W, et al. LangSplat: 3D language Gaussian splatting[C]// 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2024: 20051-20060. |
[28] | KERBL B, KOPANAS G, LEIMKUEHLER T, et al. 3D Gaussian splatting for real-time radiance field rendering[J]. ACM Transactions on Graphics, 2023, 42(4): 139. |
[29] | OQUAB M, DARCET T, MOUTAKANNI T, et al. DINOv2: learning robust visual features without supervision[EB/OL]. [2024-06-19]. https://arxiv.org/abs/2304.07193. |
[30] | JIA C, YANG Y F, XIA Y, et al. Scaling up visual and vision-language representation learning with noisy text supervision[EB/OL]. [2024-06-19]. https://proceedings.mlr.press/v139/jia21b.html. |
[31] | ROMBACH R, BLATTMANN A, LORENZ D, et al. High-resolution image synthesis with latent diffusion models[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 10674-10685. |
[32] | ZHOU C, LI Q, LI C, et al. A comprehensive survey on pretrained foundation models: a history from BERT to ChatGPT[EB/OL]. (2023-05-01) [2024-06-19]. https://link.springer.com/article/10.1007/s13042-024-02443-6. |
[33] | BOMMASANI R, HUDSON D A, ADELI E, et al. On the opportunities and risks of foundation models[EB/OL]. [2024-06-19]. https://arxiv.org/abs/2108.07258. |
[34] | KIRILLOV A, MINTUN E, RAVI N, et al. Segment anything[C]// 2023 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2023: 3992-4003. |
[35] | RAMESH A, DHARIWAL P, NICHOL A, et al. Hierarchical text-conditional image generation with CLIP latents[EB/OL]. [2024-06-19]. https://arxiv.org/abs/2204.06125. |
[36] | WANG Z Q, LU Y, LI Q, et al. CRIS: CLIP-driven referring image segmentation[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 11676-11685. |
[37] | SONG H Y, DONG L, ZHANG W N, et al. CLIP models are few-shot learners: empirical studies on VQA and visual entailment[EB/OL]. [2024-06-19]. https://arxiv.org/abs/2203.07190. |
[38] | GAO S H, LIN Z J, XIE X Y, et al. EditAnything: empowering unparalleled flexibility in image editing and generation[C]// The 31st ACM International Conference on Multimedia. New York: ACM, 2023: 9414-9416. |
[39] | YAO J F, WANG X G, YE L, et al. Matte anything: interactive natural image matting with segment anything model[J]. Image and Vision Computing, 2024, 147: 105067. |
[40] | CHENG Y M, LI L L, XU Y Y, et al. Segment and track anything[EB/OL]. [2024-06-19]. https://arxiv.org/abs/2305.06558. |
[41] | YANG J Y, GAO M Q, LI Z, et al. Track anything: segment anything meets videos[EB/OL]. [2024-06-19]. https://arxiv.org/abs/2304.11968. |
[42] | LUITEN J, KOPANAS G, LEIBE B, et al. Dynamic 3D Gaussians: tracking by persistent dynamic view synthesis[C]// 2024 International Conference on 3D Vision. New York: IEEE Press, 2024: 800-809. |
[43] | YANG Z Y, GAO X Y, ZHOU W, et al. Deformable 3D Gaussians for high-fidelity monocular dynamic scene reconstruction[C]// 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2024: 20331-20341. |
[44] | WU G J, YI T R, FANG J M, et al. 4D Gaussian splatting for real-time dynamic scene rendering[C]// 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2024: 20310-20320. |
[45] | ZHANG K, LUAN F J, WANG Q Q, et al. PhySG: inverse rendering with spherical Gaussians for physics-based material editing and relighting[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 5449-5458. |
[46] | TANG J X, REN J W, ZHOU H, et al. DreamGaussian: generative Gaussian splatting for efficient 3D content creation[EB/OL]. [2024-06-19]. https://arxiv.org/abs/2309.16653. |
[47] | CHEN Y W, CHEN R, LEI J B, et al. TANGO: text-driven photorealistic and robust 3D stylization via lighting decomposition[C]// The 36th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2022: 2242. |
[48] | SCHÖNBERGER J L, FRAHM J M. Structure-from-motion revisited[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 4104-4113. |
[49] | MAX N. Optical models for direct volume rendering[J]. IEEE Transactions on Visualization and Computer Graphics, 1995, 1(2): 99-108. |
[50] | DAI A, CHANG A X, SAVVA M, et al. ScanNet: richly-annotated 3D reconstructions of indoor scenes[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 2432-2443. |
[51] | STRAUB J, WHELAN T, MA L N, et al. The replica dataset: a digital replica of indoor spaces[EB/OL]. [2024-06-19]. https://arxiv.org/abs/1906.05797. |
[52] | XU J R, LIU S F, VAHDAT A, et al. Open-vocabulary panoptic segmentation with text-to-image diffusion models[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 2955-2966. |
[53] | LIANG F, WU B C, DAI X L, et al. Open-vocabulary semantic segmentation with mask-adapted CLIP[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 7061-7070. |