
Journal of Graphics ›› 2022, Vol. 43 ›› Issue (2): 181-188. DOI: 10.11996/JG.j.2095-302X.2022020181

• Review

Literature review of audio-driven cross-modal visual generation algorithms

  

  1. Conservatory of Music, Guangdong Polytechnic Normal University, Guangzhou Guangdong 510665, China;
  2. School of Computer Science and Technology, Dalian University of Technology, Dalian Liaoning 116024, China;
  3. School of Software, Dalian University, Dalian Liaoning 116622, China
  • Online: 2022-04-30 Published: 2022-05-07
  • Supported by:
    NSFC-Liaoning Province United Foundation (U1908214); Fundamental Research Funds for the Central Universities (DUT21TD107, DUT20RC(3)039); Liaoning Revitalization Talents Program (XLYC2008017); Liaoning Key Research and Development Program (2019JH2/10100030); CCF-Tencent Open Fund (IAGR20210116)

Abstract: Audio-driven cross-modal visual generation algorithms have been widely applied in many fields and
have attracted attention from both industry and academia in recent years. Audio and vision are the most important and
common modalities in people’s daily life, yet creatively generating a visual scene that corresponds to a given audio
signal remains a great challenge. The existing literature has not studied audio-driven cross-modal visual generation
systematically and comprehensively. This paper summarizes the existing algorithms for audio-driven
cross-modal visual generation and divides them into three categories: audio to image, audio to body motion video, and
audio to talking face video. For each category, we first describe its specific application fields and the pipelines of
mainstream algorithms, and analyze the framework technologies involved. We then describe the core ideas, advantages, and
disadvantages of the related algorithms in order of technological advancement, and explain their generation results
and performance. Finally, we discuss the opportunities and challenges in the field and offer suggestions for
future research.

Key words: cross-modal generation, audio, vision, deep learning, review
