
Journal of Graphics (图学学报), 2022, Vol. 43, Issue (2): 181-188. DOI: 10.11996/JG.j.2095-302X.2022020181

• Review •

音频驱动跨模态视觉生成算法综述

  

  1. 广东技术师范大学音乐学院,广东 广州 510665;
    2. 大连理工大学计算机科学与技术学院,辽宁 大连 116024;
    3. 大连大学软件学院,辽宁 大连 116622
  • 出版日期:2022-04-30 发布日期:2022-05-07
  • 基金资助:
    国家自然科学基金委-辽宁联合基金项目(U1908214);中央高校基本科研基金项目(DUT21TD107,DUT20RC(3)039);辽宁省兴辽人才计划项目(XLYC2008017);辽宁省重点研发计划项目(2019JH2/10100030);CCF-腾讯犀牛鸟基金项目(IAGR20210116)

Literature review of audio-driven cross-modal visual generation algorithms

  1. Conservatory of Music, Guangdong Polytechnic Normal University, Guangzhou, Guangdong 510665, China;
    2. School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning 116024, China;
    3. School of Software, Dalian University, Dalian, Liaoning 116622, China
  • Online: 2022-04-30 Published: 2022-05-07
  • Supported by:
    NSFC-Liaoning Province United Foundation (U1908214); Fundamental Research Funds for the Central Universities (DUT21TD107, DUT20RC(3)039); Liaoning Revitalization Talents Program (XLYC2008017); Liaoning Key Research and Development Program (2019JH2/10100030); CCF-Tencent Open Fund (IAGR20210116)

摘要: 由于音频驱动的跨模态视觉生成算法具有广泛的应用场景,近年来已得到产业界和科研界的广泛关注。音频和视觉为人们日常生活中最重要和常见的 2 种模态,然而设计一种能够创意地想象出与音频相对应的视觉场景的算法一直是一个巨大挑战,目前关于音频驱动的跨模态视觉生成问题在已有文献中尚未得到系统而全面的研究。针对现有音频驱动的跨模态视觉生成算法进行概述,并将其分为音频到图像、音频到肢体动作视频和音频到说话人脸视频 3 类。首先阐述其具体应用领域与主流算法流程,并对涉及框架技术进行解析,然后按照技术推进的顺序对相关算法的核心内容与优劣势进行阐述,并解释其生成表现效果,最后对目前领域内所面临的机遇和挑战进行讨论,给出未来研究方向。

关键词: 跨模态生成, 音频, 视觉, 深度学习, 综述

Abstract: Audio-driven cross-modal visual generation algorithms have a wide range of application scenarios and have attracted broad attention from industry and academia in recent years. Audio and vision are the two most important and common modalities in daily life, yet designing an algorithm that can creatively imagine the visual scene corresponding to a piece of audio remains a great challenge, and the existing literature has not studied audio-driven cross-modal visual generation systematically and comprehensively. This paper surveys the existing algorithms for audio-driven cross-modal visual generation and divides them into three categories: audio to image, audio to body motion video, and audio to talking face video. For each category, it first describes the specific application fields and the mainstream algorithm pipelines and analyzes the framework technologies involved. It then presents the core ideas, advantages, and disadvantages of the related algorithms in order of technical progress and explains the quality of the results they generate. Finally, it discusses the opportunities and challenges the field currently faces and suggests directions for future research.

Key words: cross-modal generation, audio, vision, deep learning, review
