基于关键视图的文本驱动3D场景编辑方法

doi:10.11996/JG.j.2095-302X.2024040834

图学学报 ›› 2024, Vol. 45 ›› Issue (4): 834-844.DOI: 10.11996/JG.j.2095-302X.2024040834

• 计算机图形学与虚拟现实 • 上一篇下一篇

基于关键视图的文本驱动3D场景编辑方法

张冀¹^,²^,³(), 崔文帅¹, 张荣华¹^,³(), 王文彬¹, 李亚琦¹

1.华北电力大学计算机系，河北保定 071003
2.河北省能源电力知识计算重点实验室，河北保定 071003
3.复杂能源系统智能计算教育部工程研究中心，河北保定 071003

收稿日期:2024-02-10 接受日期:2024-05-15 出版日期:2024-08-31 发布日期:2024-09-03
通讯作者:张荣华(1973-)，男，高级工程师，硕士。主要研究方向为计算机图形学、3D AIGC和数字孪生等。E-mail：zronghua88@aliyun.com
第一作者:张冀(1972-)，男，副教授，博士。主要研究方向为智能信息处理、深度学习、图像处理等。E-mail：72zhangji@163.com
基金资助:
河北省科技计划资助项目(22310302D)

A text-driven 3D scene editing method based on key views

ZHANG Ji¹^,²^,³(), CUI Wenshuai¹, ZHANG Ronghua¹^,³(), WANG Wenbin¹, LI Yaqi¹

1. Department of Computer, North China Electric Power University, Baoding Hebei 071003, China
2. Hebei Key Laboratory of Knowledge Computing for Energy & Power, Baoding Hebei 071003, China
3. Engineering Research Center of Intelligent Computing for Complex Energy Systems, Ministry of Education, Baoding Hebei 071003, China

Received:2024-02-10 Accepted:2024-05-15 Published:2024-08-31 Online:2024-09-03
Contact: ZHANG Ronghua (1973-), senior engineer, master. His main research interests cover computer graphic, 3D AIGC and digital twin, etc. E-mail：zronghua88@aliyun.com
First author：ZHANG Ji (1972-), associate professor, Ph.D. His main research interests cover intelligent information processing, deep learning, image processing, etc. E-mail：72zhangji@163.com
Supported by:
Hebei Provincial Science and Technology Program Funding(22310302D)

摘要/Abstract

摘要：

基于去噪扩散模型的零样本图像编辑方法取得了瞩目的成就，将之应用于3D场景编辑可实现零样本的文本驱动3D场景编辑。然而，其3D编辑效果容易受扩散模型的3D连续性与过度编辑等问题影响，产生错误的编辑结果。针对这些问题，提出了一种新的文本驱动3D编辑方法，该方法从数据端着手，提出了基于关键视图的数据迭代方法与基于像素点的异常数据掩码模块。关键视图数据可以引导一个3D区域的编辑以减少3D不一致数据的影响，而数据掩码模块则可以过滤掉2D输入数据中的异常点。使用该方法，可以实现生动的照片级文本驱动3D场景编辑效果。实验证明，相较于一些目前先进的文本驱动3D场景编辑方法，可以大大减少3D场景中错误的编辑，实现更加生动的、更具真实感的3D编辑效果。此外，使用该方法生成的编辑结果更具多样性、编辑效率也更高。

关键词: 扩散模型, 文本驱动, 3D场景编辑, 关键视图, 数据掩码

Abstract:

The zero-shot image editing method based on denoising diffusion model has made remarkable achievements, and its application to 3D scene editing enables zero-shot text-driven 3D scene editing. However, its 3D editing results are easily affected by the 3D continuity of the diffusion model and over-editing, leading to erroneous editing results. To address these problems, a new text-driven 3D editing method was proposed, which started from the dataset and proposed key view-based data iteration and pixel-based abnormal data masking module. The key view data could guide the editing of a 3D area to minimize the effect of 3D inconsistent data, while the data masking module could filter out anomalies in the 2D input data. Using this method, vivid photo-quality text-driven 3D scene editing effects could be realized. Experiments have demonstrated that compared to some current advanced text-driven 3D scene editing methods, the erroneous editing in the 3D scenes could be greatly reduced, resulting in more vivid and realistic 3D editing effects. In addition, the editing results generated by the method in this paper were more diversified and more efficient.

Key words: diffusion model, text-driven, 3D scene editing, key views, data mask

中图分类号:

TP391

张冀, 崔文帅, 张荣华, 王文彬, 李亚琦. 基于关键视图的文本驱动3D场景编辑方法[J]. 图学学报, 2024, 45(4): 834-844.

ZHANG Ji, CUI Wenshuai, ZHANG Ronghua, WANG Wenbin, LI Yaqi. A text-driven 3D scene editing method based on key views[J]. Journal of Graphics, 2024, 45(4): 834-844.

图/表 13

图1 基于关键视图的文本驱动3D场景编辑方法管道图

Fig. 1 Text-driven 3D scene editing method pipeline diagram based on key views

图2 异常数据掩码效果图((a)未编辑的3D场景；(b)实时渲染图；(c)错误编辑图像；(d)掩码)

Fig. 2 Data mask effect ((a) The 3D scene before editing; (b) The real-time rendered images; (c) The images edited incorrectly; (d) Mask)

图3 总体实验结果((a)未编辑的3D场景；(b)~(d)编辑后的3D场景渲染图)

Fig. 3 Overall experimental results ((a) The 3D scenes before editing; (b)~(d) The rendered images of 3D scenes after editing)

图4 实验结果对比图((a)未编辑的3D场景；(b)关键视图；(c) IN2N[9]；(d)本文方法)

Fig. 4 Comparison of experimental results ((a) The 3D scenes before editing; (b) Key views; (c) IN2N[9]; (d) Ours)

图5 实验结果对比图((a)未编辑的3D场景；(b) NeRF-Art[23]；(c)本文方法)

Fig. 5 Comparison of experimental results ((a) The 3D scenes before editing; (b) NeRF-Art[23]; (c) Ours)

图6 消融实验结果图((a)关键视图模块+异常数据掩码；(b)仅关键视图模块；(c)仅异常数据掩码)

Fig. 6 Results of ablation experiment ((a) Key view module + abnormal data mask module; (b) Only key view module;(c) Only abnormal data mask module)

图7 关键视图图像定性实验((a)关键视图图像；(b)~(c)编辑结果)

Fig. 7 Qualitative experiments of key view images ((a) Different images of key view; (b)~(c) Results after being edited)

图8 关键视图定性实验((a)关键视图及其图像；(b)~(c)编辑结果)

Fig. 8 Qualitative experiments of key views ((a) Key view and its image; (b)~(c) Results after being edited)

图9 迭代编辑过程((a) 2 500批次；(b) 5 000批次；(c) 7 500批次；(d) 10 000批次)

Fig. 9 Iterative editing process ((a) 2 500 batches; (b) 5 000 batches; (c) 7 500 batches; (d) 10 000 batches)

图10 关键视图权重实验

Fig. 10 Results of different key view weigh ((a) n=100; (b) n=50; (c) n=10)

表1 不同方法在文本驱动3D场景编辑任务上的用户调查得分

Table 1 User study scores for different methods of text-driven 3D scene editing tasks

方法	语义符合	细节质量
NeRF-Art^[23]	6.5	6.3
IN2N^[9]	8.5	8.3
本文仅关键视图模块	8.8	9.1
本文仅数据掩码模块	8.7	8.4
本文	9.2	9.5

表2 不同方法的CLIP一致性分数得分

Table 2 The score of different methods using CLIP Direction Consistency Score

方法	CLIP一致性分数
IN2N^[9]	0.918 0
本文仅关键视图模块	0.919 5
本文仅数据掩码模块	0.919 7
本文	0.920 3

表3 不同方法在文本驱动3D场景编辑任务上的所需时间

Table 3 Time required by different methods for text-driven 3D scene editing tasks

方法	时间/min
NeRF-Art^[23]	280
IN2N^[9]	47
本文	32

参考文献 23

[1]	GOODFELLOW I J, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial nets[C]// The 27th International Conference on Neural Information Processing Systems. New York: ACM, 2014: 2672-2680.
[2]	SUN J X, WANG X, ZHANG Y, et al. Fenerf: face editing in neural radiance fields[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 7672-7682.
[3]	MILDENHALL B, SRINIVASAN P P, TANCIK M, et al. NeRF: representing scenes as neural radiance fields for view synthesis[EB/OL]. [2024-01-19]. http://arxiv.org/abs/2003.08934.
[4]	WANG S C, DUAN Y Q, DING H H, et al. Learning transferable human-object interaction detector with natural language supervision[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 939-948.
[5]	SAHARIA C, CHAN W, SAXENA S, et al. Photorealistic text-to-image diffusion models with deep language understanding[C]// The 36th International Conference on Neural Information Processing Systems. New York: ACM, 2022: 36479-36494.
[6]	BROOKS T, HOLYNSKI A, EFROS A A. Instructpix2pix: learning to follow image editing instructions[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 18392-18402.
[7]	KAMATA H, SAKUMA Y, HAYAKAWA A, et al. Instruct 3D-to-3D: text Instruction Guided 3D-to-3D conversion[EB/OL]. [2024-01-19]. http://arxiv.org/abs/2303.15780.
[8]	POOLE B, JAIN A, BARRON J T, et al. DreamFusion: text-to-3D using 2D Diffusion[EB/OL]. [2024-01-19]. http://arxiv.org/abs/2209.14988.
[9]	HAQUE A, TANCIK M, EFROS A A, et al. Instruct- NeRF2NeRF: editing 3D scenes with instructions[C]// 2023 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2023: 19740-19750.
[10]	ZHANG R, ISOLA P, EFROS A A, et al. The unreasonable effectiveness of deep features as a perceptual metric[C]// 2018 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2018: 586-595.
[11]	范腾, 杨浩, 尹稳, 等. 基于神经辐射场的多尺度视图合成研究[J]. 图学学报, 2023, 44(6): 1140-1148. DOI
	FAN T, YANG H, YIN W, et al. Multi-scale view synthesis based on neural radiance field[J]. Journal of Graphics, 2023, 44(6): 1140-1148 (in Chinese). DOI
[12]	TANCIK M, WEBER E, NG E, et al. Nerfstudio: a modular framework for neural radiance field development[C]// SIGGRAPH '23: ACM SIGGRAPH 2023 Conference Proceedings. New York: ACM, 2023: 1-12.
[13]	BARRON J T, MILDENHALL B, VERBIN D, et al. Mip-nerf 360: unbounded anti-aliased neural radiance fields[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 5470-5479.
[14]	岳明宇, 高希峰, 毕重科. 三维建筑模型的低模网格生成[J]. 图学学报, 2023, 44(4): 764-774. DOI
	YUE M Y, GAO X F, BI C K. 3D low-poly mesh generation for building models[J]. Journal of Graphics, 2023, 44(4): 764-774 (in Chinese).
[15]	KAWAR B, ZADA S, LANG O, et al. Imagic: text-based real image editing with diffusion models[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 6007-6017.
[16]	WINATA G I, MADOTTO A, LIN Z J, et al. Language models are few-shot multilingual learners[EB/OL]. [2024-01-19]. http://arxiv.org/abs/2109.07684.
[17]	TAKAGI Y, NISHIMOTO S. High-resolution image reco-nstruction with latent diffusion models from human brain activity[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 14453-14463.
[18]	BAO C, ZHANG Y, YANG B, et al. Sine: semantic-driven image-based nerf editing with prior-guided editing field[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 20919-20929.
[19]	王吉, 王森, 蒋智文, 等. 基于深度条件扩散模型的零样本文本驱动虚拟人生成方法[J]. 图学学报, 2023, 44(6): 1218-1226. DOI
	WANG J, WANG S, JIANG Z W, et al. Zero-shot text-driven avatar generation based on depth-conditioned diffusion model[J]. Journal of Graphics, 2023, 44(6): 1218-1226 (in Chinese). DOI
[20]	WANG Z Y, LU C, WANG Y K, et al. ProlificDreamer: high-fidelity and diverse text-to-3D generation with variational score distillation[EB/OL]. [2024-01-19]. http://arxiv.org/abs/2305.16213.
[21]	BHAT S F, BIRKL R, WOFK D, et al. ZoeDepth: zero-shot transfer by combining relative and metric depth[EB/OL]. [2024-01-19]. http://arxiv.org/abs/2302.12288.
[22]	REN S Q, HE K M, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149. DOI PMID
[23]	WANG C, JIANG R X, CHAI M L, et al. NeRF-art: text-driven neural radiance fields stylization[EB/OL]. [2024-01-19]. https://arxiv.org/abs/2212.08070.

基于关键视图的文本驱动3D场景编辑方法

A text-driven 3D scene editing method based on key views

RichHTML

PDF (PC)

可视化

摘要/Abstract

引用本文

使用本文

图/表 13

参考文献 23

相关文章 1

编辑推荐

Metrics

本文评价