
Journal of Graphics ›› 2021, Vol. 42 ›› Issue (1): 8-14. DOI: 10.11996/JG.j.2095-302X.2021010008

• Image Processing and Computer Vision •


Multimodal sentiment analysis of short videos based on attention

  1. (1. College of Computer, Nanjing University of Posts and Telecommunications, Nanjing Jiangsu 210003, China;  2. Jiangsu High Technology Research Key Laboratory for Wireless Sensor Networks, Nanjing University of Posts and Telecommunications, Nanjing Jiangsu 210003, China;  3. College of Computer and Information Engineering, Henan University, Kaifeng Henan 475001, China) 
  • Online:2021-02-28 Published:2021-01-29
  • Supported by:
    National Natural Science Foundation of China (61873131, 61702284); Anhui Science and Technology Department Foundation (1908085MF207); Postdoctoral Research Fund of Jiangsu Province (2018K009B)



Abstract: Existing sentiment analysis methods do not fully exploit the information available in short videos, which leads to inappropriate sentiment analysis results. To address this, we proposed the audio-visual multimodal sentiment analysis (AV-MSA) model, which performs sentiment analysis on short videos using the visual features of frame images and the audio information in the videos. The model is divided into two branches, a visual branch and an audio branch. In the audio branch, a convolutional neural network (CNN) architecture was employed to extract emotional features from audio spectrograms; in the visual branch, we utilized 3D convolution operations to increase the temporal correlation of the visual features. In addition, to highlight emotion-related features, we added an attention mechanism on top of ResNet to enhance the model's sensitivity to informative features. Finally, a cross-voting mechanism was designed to fuse the results of the visual and audio branches into the final sentiment analysis result. The proposed AV-MSA was evaluated on the IEMOCAP and Weibo audio-visual (WB-AV) datasets. Experimental results show that, compared with current short video sentiment analysis methods, AV-MSA achieves a considerable improvement in classification accuracy.

Key words: multimodal sentiment analysis, ResNet, 3D convolutional neural networks, attention, decision fusion 
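The abstract names a cross-voting mechanism for decision-level fusion of the two branches but does not specify it on this page. As a minimal illustrative sketch only, the code below assumes a confidence-weighted vote over the per-class probabilities produced by the audio and visual branches; the function names and the weighting scheme are hypothetical, not the paper's actual mechanism.

```python
# Hedged sketch of decision-level fusion for a two-branch (audio/visual)
# sentiment classifier. Assumption: each branch votes with a weight equal
# to its own confidence (its maximum class probability), so the more
# certain branch dominates the fused decision. This is one common
# decision-fusion scheme, not necessarily the paper's cross-voting rule.

def fuse_predictions(audio_probs, visual_probs):
    """Fuse per-class probability lists from the two branches.

    Returns a fused probability distribution over the same classes.
    """
    w_a = max(audio_probs)     # audio branch confidence
    w_v = max(visual_probs)    # visual branch confidence
    total = w_a + w_v
    return [
        (w_a * a + w_v * v) / total
        for a, v in zip(audio_probs, visual_probs)
    ]

def predict(audio_probs, visual_probs):
    """Return the index of the class with the highest fused probability."""
    fused = fuse_predictions(audio_probs, visual_probs)
    return max(range(len(fused)), key=fused.__getitem__)
```

For example, if the audio branch is highly confident in one class while the visual branch is only mildly confident in another, the fused decision follows the audio branch; with equally confident branches the scheme reduces to simple probability averaging.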
