
Journal of Graphics ›› 2024, Vol. 45 ›› Issue (4): 834-844.DOI: 10.11996/JG.j.2095-302X.2024040834

• Computer Graphics and Virtual Reality •

A text-driven 3D scene editing method based on key views

ZHANG Ji1,2,3, CUI Wenshuai1, ZHANG Ronghua1,3, WANG Wenbin1, LI Yaqi1

  1. Department of Computer, North China Electric Power University, Baoding Hebei 071003, China
    2. Hebei Key Laboratory of Knowledge Computing for Energy & Power, Baoding Hebei 071003, China
    3. Engineering Research Center of Intelligent Computing for Complex Energy Systems, Ministry of Education, Baoding Hebei 071003, China
  • Received:2024-02-10 Accepted:2024-05-15 Online:2024-08-31 Published:2024-09-03
  • Contact: ZHANG Ronghua
  • About author:

    ZHANG Ji (1972-), associate professor, Ph.D. His main research interests cover intelligent information processing, deep learning, image processing, etc. E-mail: 72zhangji@163.com

  • Supported by:
    Hebei Provincial Science and Technology Program Funding(22310302D)

Abstract:

Zero-shot image editing methods based on denoising diffusion models have achieved remarkable results, and applying them to 3D scenes enables zero-shot, text-driven 3D scene editing. However, the 3D editing results are easily degraded by the diffusion model's lack of 3D consistency and by over-editing, leading to erroneous edits. To address these problems, a new text-driven 3D editing method was proposed. Starting from the dataset, it introduced a key-view-based data iteration scheme and a pixel-based abnormal-data masking module. The key-view data guided the editing of a 3D region, minimizing the influence of 3D-inconsistent data, while the masking module filtered anomalies out of the 2D input data. With this method, vivid, photo-realistic text-driven 3D scene editing could be achieved. Experiments demonstrated that, compared with several state-of-the-art text-driven 3D scene editing methods, erroneous edits in the 3D scene were greatly reduced, yielding more vivid and realistic editing results. In addition, the editing results produced by this method were more diverse and were generated more efficiently.
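The pixel-based masking idea described above can be illustrated with a minimal sketch. The paper does not specify its masking criterion; the snippet below assumes a simple per-pixel deviation test against a reference rendering (the function names `pixel_anomaly_mask` and `filter_edit` and the threshold value are illustrative, not from the paper): pixels of an edited 2D view that deviate too far from the reference are treated as anomalies and replaced, so only consistent edits feed back into the 3D scene optimization.

```python
import numpy as np

def pixel_anomaly_mask(edited, reference, threshold=0.3):
    """Return a boolean keep-mask of shape (H, W).

    edited, reference: float arrays in [0, 1] with shape (H, W, 3).
    A pixel is kept (True) when its mean per-channel deviation from
    the reference view stays within the threshold.
    """
    deviation = np.abs(edited - reference).mean(axis=-1)
    return deviation <= threshold

def filter_edit(edited, reference, threshold=0.3):
    """Replace anomalous pixels in the edited view with reference values."""
    mask = pixel_anomaly_mask(edited, reference, threshold)
    # Broadcast the (H, W) mask over the color channels.
    return np.where(mask[..., None], edited, reference)
```

In practice such a mask would be computed per training view before the edited images are used to supervise the 3D representation; the threshold controls how aggressively suspect pixels are discarded.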

Key words: diffusion model, text-driven, 3D scene editing, key views, data mask
