Zero-shot text-driven avatar generation based on depth-conditioned diffusion model

doi:10.11996/JG.j.2095-302X.2023061218

Abstract

Avatar generation has significant implications for fields such as virtual reality and film production. To address the large data requirements and high production costs of existing avatar generation methods, we propose a zero-shot text-driven avatar generation method based on a depth-conditioned diffusion model. The method comprises two stages: conditional human body generation and iterative texture refinement. In the first stage, a neural network establishes an implicit representation of the avatar, and a depth-conditioned diffusion model then guides this neural implicit field to generate the avatar model described by the user's input. In the second stage, the diffusion model generates high-precision inference texture images from the texture prior obtained in the first stage. The texture representation of the avatar model is then enhanced through an iterative optimization scheme. With this method, users can create realistic avatars with vivid characteristics from text descriptions alone. Experimental results demonstrate the effectiveness of the proposed method, showing that it yields high-quality, realistic avatars that match the given text prompts.

Key words: diffusion model, avatar generation, zero-shot, text-driven generation, deep learning

CLC Number:

TP391

WANG Ji, WANG Sen, JIANG Zhi-wen, XIE Zhi-feng, LI Meng-tian. Zero-shot text-driven avatar generation based on depth-conditioned diffusion model[J]. Journal of Graphics, 2023, 44(6): 1218-1226.

Figures/Tables 9

Fig. 1 Network architecture of the avatar generation framework based on the depth-conditioned diffusion model

Fig. 2 Method comparison ((a) DreamFusion[8]; (b) SJC[24]; (c) AvatarCLIP[7]; (d) Final result of our method; (e) First-stage result of our method; (f) Geometry result of our method)

Fig. 3 Case comparison ((a) AvatarCLIP[7]; (b) Ours)

Table 1 User survey scores of different methods on the avatar generation task

Method	Consistency (↑)	Geometric quality (↑)	Texture quality (↑)
DreamFusion[8]	2.8	3.2	4.2
SJC[24]	2.9	3.6	4.4
AvatarCLIP[7]	4.2	4.1	3.2
Ours	4.6	4.2	4.6
Average	3.6	3.8	4.1
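
The final row of Table 1 can be checked directly against the per-method scores. The short Python sketch below recomputes the column means from the four method rows; the dictionary layout and the names `scores` and `column_means` are illustrative choices, not part of the paper.

```python
# User-study scores copied from Table 1, in column order:
# (consistency, geometric quality, texture quality).
scores = {
    "DreamFusion": (2.8, 3.2, 4.2),
    "SJC": (2.9, 3.6, 4.4),
    "AvatarCLIP": (4.2, 4.1, 3.2),
    "Ours": (4.6, 4.2, 4.6),
}


def column_means(table):
    """Return the mean of each score column, rounded to one decimal place."""
    # zip(*rows) regroups the per-method rows into per-criterion columns.
    columns = zip(*table.values())
    return [round(sum(col) / len(col), 1) for col in columns]


means = column_means(scores)
print(means)  # [3.6, 3.8, 4.1]
```

The recomputed means match the reported averages. The proposed method scores above the mean on all three criteria, whereas AvatarCLIP falls below it on texture quality.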

Table 2 Comparison of CLIP scores of different methods

Method	CLIP score
DreamFusion[8]	31.03
SJC[24]	31.59
AvatarCLIP[7]	32.18
Ours	32.37

Table 3 Comparison of running time of different methods

Method	Generation time (↓)
DreamFusion[8]	52 min
SJC[24]	30 min
AvatarCLIP[7]	1 h 40 min
Ours	1 h 15 min

Fig. 4 Interpolation procedure of styles under different training batches

Fig. 5 Ablation studies

Fig. 6 Iterative texture refinement process ((a) Updates of the texture guide map over iterations; (b) Rendering results of the optimized model at the corresponding iterations)

References 27

[1]	REED S, AKATA Z, YAN X C, et al. Generative adversarial text to image synthesis[EB/OL]. [2023-01-23]. https://arxiv.org/pdf/1605.05396.pdf.
[2]	CHEN X, JIANG T J, SONG J, et al. gDNA: towards generative detailed neural avatars[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 20427-20437.
[3]	HONG F Z, CHEN Z X, LAN Y S, et al. EVA3D: compositional 3D human generation from 2D image collections[EB/OL]. [2023-01-23]. https://arxiv.org/abs/2210.04888.
[4]	RAMESH A, PAVLOV M, GOH G, et al. Zero-shot text-to-image generation[EB/OL]. [2023-01-20]. https://arxiv.org/abs/2102.12092v1.
[5]	SAHARIA C, CHAN W, SAXENA S, et al. Photorealistic text-to-image diffusion models with deep language understanding[EB/OL]. [2023-01-23]. https://arxiv.org/abs/2205.11487.
[6]	ROMBACH R, BLATTMANN A, LORENZ D, et al. High-resolution image synthesis with latent diffusion models[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 10684-10695.
[7]	HONG F Z, ZHANG M Y, PAN L A, et al. AvatarCLIP: zero-shot text-driven generation and animation of 3D avatars[J]. ACM Transactions on Graphics, 2022, 41(4): 1-19.
[8]	POOLE B, JAIN A, BARRON J T, et al. DreamFusion: text-to-3D using 2D diffusion[EB/OL]. [2023-01-23]. https://www.aminer.cn/pub/63365e7f90e50fcafd1a3612/.
[9]	WANG P, LIU L J, LIU Y, et al. NeuS: learning neural implicit surfaces by volume rendering for multi-view reconstruction[EB/OL]. [2023-01-23]. https://arxiv.org/abs/2106.10689.
[10]	LOPER M, MAHMOOD N, ROMERO J, et al. SMPL: a skinned multi-person linear model[J]. ACM Transactions on Graphics, 2015, 34(6): 248:1-248:16.
[11]	CAI X Q, HUO Y Q, LI F J, et al. Human pose estimation and similarity calculation for Tai Chi learning[J]. Journal of Graphics, 2022, 43(4): 695-706 (in Chinese).
[12]	WANG Y P, ZENG Y, LI S H, et al. A Transformer-based 3D human pose estimation method[J]. Journal of Graphics, 2023, 44(1): 139-145 (in Chinese).
[13]	ZHANG X M, FANG X Y, WANG L B, et al. Human body reconstruction based on improved piecewise hinge transformation[J]. Journal of Graphics, 2020, 41(1): 108-115 (in Chinese).
[14]	BHATNAGAR B L, SMINCHISESCU C, THEOBALT C, et al. Combining implicit function learning and parametric models for 3D human reconstruction[C]// European Conference on Computer Vision. Cham: Springer, 2020: 311-329.
[15]	MILDENHALL B, SRINIVASAN P P, TANCIK M, et al. NeRF: representing scenes as neural radiance fields for view synthesis[C]// European Conference on Computer Vision. Cham: Springer, 2020: 405-421.
[16]	GROPP A, YARIV L, HAIM N, et al. Implicit geometric regularization for learning shapes[C]// The 37th International Conference on Machine Learning. New York: ACM, 2020: 3789-3799.
[17]	GRIGOREV A, ISKAKOV K, IANINA A, et al. StylePeople: a generative model of fullbody human avatars[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 5151-5160.
[18]	ABDAL R, QIN Y P, WONKA P. Image2StyleGAN: how to embed images into the StyleGAN latent space?[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 4432-4441.
[19]	CORONA E, PUMAROLA A, ALENYA G, et al. SMPLicit: topology-aware generative model for clothed people[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 11875-11885.
[20]	JAIN A, MILDENHALL B, BARRON J T, et al. Zero-shot text-guided object generation with dream fields[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 857-866.
[21]	KHALID N M, XIE T H, BELILOVSKY E, et al. CLIP-Mesh: generating textured meshes from text using pretrained image-text models[C]// SA'22: SIGGRAPH Asia 2022 Conference Papers. New York: ACM, 2022: 1-8.
[22]	LEE H H, CHANG A X. Understanding pure CLIP guidance for voxel grid NeRF models[EB/OL]. (2022-09-30) [2023-02-22]. https://arxiv.org/abs/2209.15172.
[23]	LIN C H, GAO J, TANG L, et al. Magic3D: high resolution text-to-3D content creation[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 300-309.
[24]	WANG H C, DU X D, LI J H, et al. Score Jacobian chaining: lifting pretrained 2D diffusion models for 3D generation[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 12619-12629.
[25]	Stability AI. Stable diffusion[EB/OL]. (2022-08-22) [2023-02-22]. https://stability.ai/blog/stable-diffusion-public-release.
[26]	WANG Y Q, SKOROKHODOV I, WONKA P. HF-NeuS: improved surface reconstruction using high-frequency details[EB/OL]. [2023-01-23]. https://arxiv.org/abs/2206.07850.
[27]	KINGMA D P, BA J. Adam: a method for stochastic optimization[EB/OL]. [2023-02-12]. https://arxiv.org/pdf/1412.6980.pdf.