
Journal of Graphics ›› 2023, Vol. 44 ›› Issue (6): 1218-1226. DOI: 10.11996/JG.j.2095-302X.2023061218

• Computer Graphics and Virtual Reality •


Zero-shot text-driven avatar generation based on depth-conditioned diffusion model

WANG Ji1, WANG Sen1, JIANG Zhi-wen1, XIE Zhi-feng1,2, LI Meng-tian1,2

  1. Department of Film and Television Engineering, Shanghai University, Shanghai 200072, China
    2. Shanghai Film Special Effects Engineering Technology Research Center, Shanghai 200072, China
  • Received: 2023-06-29 Accepted: 2023-09-07 Online: 2023-12-31 Published: 2023-12-17
  • Contact: XIE Zhi-feng (1982-), associate professor, Ph.D. His main research interests include graphics and image processing, and computer vision. E-mail: zhifeng_xie@shu.edu.cn
  • About author:

    WANG Ji (1999-), master's student. Her main research interests include computer vision and computer graphics. E-mail: wang_ji357@shu.edu.cn


Abstract:

Avatar generation holds significant implications for fields such as virtual reality and film production. To address the large data requirements and production costs of existing avatar generation methods, we proposed a zero-shot text-driven 3D avatar generation method based on a depth-conditioned diffusion model, comprising two stages: conditional human body generation and iterative texture refinement. In the first stage, a neural network was employed to initialize the implicit representation of a 3D human body; a depth-conditioned diffusion model driven by text prompts then guided the neural implicit field to generate the avatar model desired by the user. In the second stage, the diffusion model performed denoising to infer high-precision texture images from the texture prior provided by the first-stage body model, iteratively refining the avatar's texture representation to produce the final result. With this method, users can create a vivid avatar from an arbitrary text description without any reference photos. Experimental results demonstrated that the proposed method could generate high-quality, realistic, and vivid avatars conditioned on given text prompts.
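The first stage described above, where a text-conditioned diffusion model guides a neural implicit field through its rendered views, is commonly realized with a score-distillation-style update. The following is a minimal illustrative sketch of that guidance signal only, not the authors' implementation: the stub `denoiser` stands in for the depth-conditioned diffusion model, `render` for differentiable rendering of the implicit avatar field, and all names and the toy noise schedule are assumptions.

```python
# Illustrative sketch (not the paper's code): a score-distillation-style
# update in which a diffusion model's noise prediction supervises the
# parameters of a rendered 3D representation.
import numpy as np

rng = np.random.default_rng(0)

def render(params):
    # Stand-in for differentiable rendering of the neural implicit field;
    # here the "rendered image" is simply the parameter vector itself.
    return params

def denoiser(noisy, t, text_cond, depth_cond):
    # Stub for the depth-conditioned diffusion model's noise prediction.
    # A real model would condition on the text prompt and a rendered
    # depth map; this stub just pulls the sample toward `text_cond`.
    return noisy - text_cond

def sds_step(params, text_cond, depth_cond, lr=0.1):
    x = render(params)
    t = int(rng.integers(1, 1000))          # random diffusion timestep
    eps = rng.standard_normal(x.shape)      # injected Gaussian noise
    alpha = 1.0 - t / 1000.0                # toy noise schedule
    noisy = np.sqrt(alpha) * x + np.sqrt(1.0 - alpha) * eps
    eps_pred = denoiser(noisy, t, text_cond, depth_cond)
    # Score-distillation gradient: the residual (eps_pred - eps) is
    # backpropagated through the renderer (identity here).
    grad = eps_pred - eps
    return params - lr * grad

# Toy run: the parameters drift toward the "text condition".
params = np.zeros(4)
target = np.ones(4)   # stand-in embedding of the text prompt
for _ in range(200):
    params = sds_step(params, target, depth_cond=None)
```

In the actual method, the second stage would replace this per-pixel guidance with full denoising passes over texture images, using the first-stage result as a texture prior; the depth conditioning keeps the generated views geometrically consistent with the body model.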

Key words: diffusion model, avatar generation, zero-shot, text-driven generation, deep learning
