Literature review of audio-driven cross-modal visual generation algorithms

Journal of Graphics, 2022, Vol. 43, Issue (2): 181-188. DOI: 10.11996/JG.j.2095-302X.2022020181

Literature review of audio-driven cross-modal visual generation algorithms

1. Conservatory of Music, Guangdong Polytechnic Normal University, Guangzhou Guangdong 510665, China;
2. School of Computer Science and Technology, Dalian University of Technology, Dalian Liaoning 116024, China;
3. School of Software, Dalian University, Dalian Liaoning 116622, China

Online: 2022-04-30 Published: 2022-05-07
Supported by:
NSFC-Liaoning Province United Foundation (U1908214); Fundamental Research Funds for the Central Universities (DUT21TD107, DUT20RC(3)039); Liaoning Revitalization Talents Program (XLYC2008017); Liaoning Key Research and Development Program (2019JH2/10100030); CCF-Tencent Open Fund (IAGR20210116)

Abstract

Abstract: Audio-driven cross-modal visual generation algorithms have a wide range of applications and have attracted growing attention from industry and academia in recent years. Audio and vision are the two most important and common modalities in daily life, yet designing an algorithm that can creatively imagine a visual scene corresponding to a given audio signal remains a great challenge, and the existing literature has not studied the problem systematically and comprehensively. This paper surveys existing audio-driven cross-modal visual generation algorithms and divides them into three categories: audio to image, audio to body motion video, and audio to talking face video. For each category, it first describes the specific application fields and the mainstream algorithm pipelines, and analyzes the framework technologies involved. It then presents the core ideas, advantages, and disadvantages of the related algorithms in order of technical development, and explains the quality of their generated results. Finally, it discusses the opportunities and challenges currently facing the field and suggests directions for future research.

Keywords: cross-modal generation, audio, vision, deep learning, review

CLC number: TP 391

JIANG Lai, YU Zhen, WANG Peng-fei, ZHOU Dong-sheng, HOU Ya-qing. Literature review of audio-driven cross-modal visual generation algorithms[J]. Journal of Graphics, 2022, 43(2): 181-188.
