
Journal of Graphics ›› 2023, Vol. 44 ›› Issue (6): 1218-1226. DOI: 10.11996/JG.j.2095-302X.2023061218

• Computer Graphics and Virtual Reality •


Zero-shot text-driven avatar generation based on depth-conditioned diffusion model

WANG Ji1, WANG Sen1, JIANG Zhi-wen1, XIE Zhi-feng1,2, LI Meng-tian1,2

  1. Department of Film and Television Engineering, Shanghai University, Shanghai 200072, China
    2. Shanghai Film Special Effects Engineering Technology Research Center, Shanghai 200072, China
  • Received: 2023-06-29 Accepted: 2023-09-07 Online: 2023-12-31 Published: 2023-12-17
  • Contact: XIE Zhi-feng (1982-), associate professor, Ph.D. His main research interests include graphics and image processing, and computer vision. E-mail: zhifeng_xie@shu.edu.cn
  • About author:

    WANG Ji (1999-), master's student. Her main research interests include computer vision and computer graphics. E-mail: wang_ji357@shu.edu.cn


Abstract:

Avatar generation holds significant implications for fields such as virtual reality and film production. To address the large data requirements and production costs of existing avatar generation methods, we proposed a zero-shot text-driven 3D avatar generation method based on a depth-conditioned diffusion model, comprising two stages: conditional human body generation and iterative texture refinement. In the first stage, a neural network was employed to initialize the implicit representation of a 3D human body; a depth-conditioned diffusion model driven by text prompts then guided the neural implicit field to generate the avatar model desired by the user. In the second stage, the diffusion model performed denoising to infer high-precision texture images from the texture prior provided by the first-stage body model, iteratively refining the avatar's texture representation to produce the final result. With this method, users can create a vivid avatar from an arbitrary text description without any reference photos. Experimental results demonstrated that the proposed method could generate high-quality, realistic, and vivid avatars conditioned on given text prompts.
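The first stage described above, where a text-conditioned diffusion model guides a neural implicit field through its rendered views, is commonly realized with a score-distillation-style update. The following is a minimal illustrative sketch of that guidance signal only, not the authors' implementation: the stub `denoiser` stands in for the depth-conditioned diffusion model, `render` for differentiable rendering of the implicit avatar field, and all names and the toy noise schedule are assumptions.

```python
# Illustrative sketch (not the paper's code): a score-distillation-style
# update in which a diffusion model's noise prediction supervises the
# parameters of a rendered 3D representation.
import numpy as np

rng = np.random.default_rng(0)

def render(params):
    # Stand-in for differentiable rendering of the neural implicit field;
    # here the "rendered image" is simply the parameter vector itself.
    return params

def denoiser(noisy, t, text_cond, depth_cond):
    # Stub for the depth-conditioned diffusion model's noise prediction.
    # A real model would condition on the text prompt and a rendered
    # depth map; this stub just pulls the sample toward `text_cond`.
    return noisy - text_cond

def sds_step(params, text_cond, depth_cond, lr=0.1):
    x = render(params)
    t = int(rng.integers(1, 1000))          # random diffusion timestep
    eps = rng.standard_normal(x.shape)      # injected Gaussian noise
    alpha = 1.0 - t / 1000.0                # toy noise schedule
    noisy = np.sqrt(alpha) * x + np.sqrt(1.0 - alpha) * eps
    eps_pred = denoiser(noisy, t, text_cond, depth_cond)
    # Score-distillation gradient: the residual (eps_pred - eps) is
    # backpropagated through the renderer (identity here).
    grad = eps_pred - eps
    return params - lr * grad

# Toy run: the parameters drift toward the "text condition".
params = np.zeros(4)
target = np.ones(4)   # stand-in embedding of the text prompt
for _ in range(200):
    params = sds_step(params, target, depth_cond=None)
```

In the actual method, the second stage would replace this per-pixel guidance with full denoising passes over texture images, using the first-stage result as a texture prior; the depth conditioning keeps the generated views geometrically consistent with the body model.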

Key words: diffusion model, avatar generation, zero-shot, text-driven generation, deep learning
