
Journal of Graphics ›› 2021, Vol. 42 ›› Issue (1): 8-14. DOI: 10.11996/JG.j.2095-302X.2021010008

• Image Processing and Computer Vision •


Multimodal sentiment analysis of short videos based on attention

  1. (1. College of Computer, Nanjing University of Posts and Telecommunications, Nanjing Jiangsu 210003, China;  2. Jiangsu High Technology Research Key Laboratory for Wireless Sensor Networks, Nanjing University of Posts and Telecommunications, Nanjing Jiangsu 210003, China;  3. College of Computer and Information Engineering, Henan University, Kaifeng Henan 475001, China) 
  • Online:2021-02-28 Published:2021-01-29
  • Supported by:
    National Natural Science Foundation of China (61873131, 61702284); Anhui Science and Technology Department Foundation (1908085MF207); Postdoctoral Research Fund of Jiangsu Province (2018K009B)



Abstract: Existing sentiment analysis methods do not fully exploit the information available in short videos, which leads to inappropriate sentiment analysis results. To address this, we proposed the audio-visual multimodal sentiment analysis (AV-MSA) model, which performs sentiment analysis on short videos using the visual features of frame images and the audio information in the videos. The model is divided into two branches, a visual branch and an audio branch. In the audio branch, a convolutional neural network (CNN) architecture was employed to extract emotional features from audio spectrograms; in the visual branch, we utilized 3D convolution operations to increase the temporal correlation of the visual features. In addition, to highlight emotion-related features, we added an attention mechanism on top of ResNet to enhance the model's sensitivity to informative features. Finally, a cross-voting mechanism was designed to fuse the results of the visual and audio branches into the final sentiment analysis result. The proposed AV-MSA was evaluated on the IEMOCAP and Weibo audio-visual (WB-AV) datasets. Experimental results show that, compared with current short video sentiment analysis methods, AV-MSA achieves a considerable improvement in classification accuracy.

Key words: multimodal sentiment analysis, ResNet, 3D convolutional neural networks, attention, decision fusion 
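The abstract names a cross-voting mechanism for decision-level fusion of the two branches but does not specify it on this page. As a minimal illustrative sketch only, the code below assumes a confidence-weighted vote over the per-class probabilities produced by the audio and visual branches; the function names and the weighting scheme are hypothetical, not the paper's actual mechanism.

```python
# Hedged sketch of decision-level fusion for a two-branch (audio/visual)
# sentiment classifier. Assumption: each branch votes with a weight equal
# to its own confidence (its maximum class probability), so the more
# certain branch dominates the fused decision. This is one common
# decision-fusion scheme, not necessarily the paper's cross-voting rule.

def fuse_predictions(audio_probs, visual_probs):
    """Fuse per-class probability lists from the two branches.

    Returns a fused probability distribution over the same classes.
    """
    w_a = max(audio_probs)     # audio branch confidence
    w_v = max(visual_probs)    # visual branch confidence
    total = w_a + w_v
    return [
        (w_a * a + w_v * v) / total
        for a, v in zip(audio_probs, visual_probs)
    ]

def predict(audio_probs, visual_probs):
    """Return the index of the class with the highest fused probability."""
    fused = fuse_predictions(audio_probs, visual_probs)
    return max(range(len(fused)), key=fused.__getitem__)
```

For example, if the audio branch is highly confident in one class while the visual branch is only mildly confident in another, the fused decision follows the audio branch; with equally confident branches the scheme reduces to simple probability averaging.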
