Journal of Graphics, 2023, Vol. 44, Issue (6): 1218-1226. DOI: 10.11996/JG.j.2095-302X.2023061218

• Computer Graphics and Virtual Reality •

Zero-shot text-driven avatar generation based on depth-conditioned diffusion model

WANG Ji1, WANG Sen1, JIANG Zhi-wen1, XIE Zhi-feng1,2, LI Meng-tian1,2

  1. Department of Film and Television Engineering, Shanghai University, Shanghai 200072, China
  2. Shanghai Film Special Effects Engineering Technology Research Center, Shanghai 200072, China
  • Received: 2023-06-29 Accepted: 2023-09-07 Online: 2023-12-31 Published: 2023-12-17
  • Contact: XIE Zhi-feng (1982-), associate professor, Ph.D. His main research interests cover graphic image processing, computer vision, etc. E-mail: zhifeng_xie@shu.edu.cn
  • About author: WANG Ji (1999-), master's student. Her main research interests cover computer vision and computer graphics. E-mail: wang_ji357@shu.edu.cn


Avatar generation has important applications in fields such as virtual reality and film production. To address the data-volume and production-cost challenges of existing avatar generation methods, we propose a zero-shot, text-driven avatar generation method based on a depth-conditioned diffusion model. The method comprises two stages: conditional human body generation and iterative texture refinement. In the first stage, a neural network establishes an implicit representation of the avatar, and a depth-conditioned diffusion model then guides this neural implicit field to generate the required avatar model based on user input. In the second stage, the diffusion model generates high-precision inference texture images from the texture prior obtained in the first stage, and the texture representation of the avatar model is enhanced through an iterative optimization scheme. With this method, users can create realistic avatars with vivid characteristics from text descriptions alone. Experimental results show that the proposed method is effective and generates high-quality, realistic avatars from given text prompts.

Key words: diffusion model, avatar generation, zero-shot, text-driven generation, deep learning
