
Journal of Graphics ›› 2021, Vol. 42 ›› Issue (1): 8-14. DOI: 10.11996/JG.j.2095-302X.2021010008

• Image Processing and Computer Vision •

Multimodal sentiment analysis of short videos based on attention

  

  1. College of Computer, Nanjing University of Posts and Telecommunications, Nanjing, Jiangsu 210003, China; 2. Jiangsu High Technology Research Key Laboratory for Wireless Sensor Networks, Nanjing University of Posts and Telecommunications, Nanjing, Jiangsu 210003, China; 3. College of Computer and Information Engineering, Henan University, Kaifeng, Henan 475001, China
  • Online: 2021-02-28 Published: 2021-01-29
  • Supported by:
    National Natural Science Foundation of China (61873131, 61702284); Anhui Science and Technology Department Foundation (1908085MF207); Postdoctoral Fund of Jiangsu Province (2018K009B)

Abstract: Existing sentiment analysis methods do not make full use of the information in short videos, which leads to inaccurate sentiment predictions. To address this, we proposed the audio-visual multimodal sentiment analysis (AV-MSA) model, which analyzes the sentiment of short videos using visual features from frame images together with the audio information in the videos. The model consists of two branches, a visual branch and an audio branch. In the audio branch, a convolutional neural network (CNN) architecture was employed to extract emotional features from audio spectrograms; in the visual branch, 3D convolution was used to strengthen the temporal correlation of visual features. In addition, to highlight emotion-related features, we added an attention mechanism on top of ResNet to increase the model's sensitivity to informative features. Finally, a cross-voting mechanism was designed to fuse the outputs of the visual and audio branches into the final sentiment prediction. The proposed AV-MSA was evaluated on the IEMOCAP and Weibo audio-visual (WB-AV) datasets. Experimental results show that, compared with current short-video sentiment analysis methods, AV-MSA substantially improves classification accuracy.
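
The abstract does not say how the cross-voting mechanism combines the two branches. As a rough illustration of decision-level fusion in general, the sketch below uses weighted soft voting over per-class probabilities. This is a stand-in technique, not the authors' mechanism. The function names, the `visual_weight` parameter, and the three-class example are assumptions made for illustration only.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw branch scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_decisions(
    visual_probs: Sequence[float],
    audio_probs: Sequence[float],
    visual_weight: float = 0.5,  # hypothetical parameter, not from the paper
) -> Tuple[int, List[float]]:
    """Weighted soft voting over the class probabilities of two branches.

    Returns the index of the winning class and the fused probabilities.
    """
    if len(visual_probs) != len(audio_probs):
        raise ValueError("both branches must score the same set of classes")
    if not 0.0 <= visual_weight <= 1.0:
        raise ValueError("visual_weight must lie in [0, 1]")

    audio_weight = 1.0 - visual_weight
    fused = [
        visual_weight * v + audio_weight * a
        for v, a in zip(visual_probs, audio_probs)
    ]
    # The predicted class is the one with the highest fused probability.
    label = max(range(len(fused)), key=fused.__getitem__)
    return label, fused
```

With equal weights, `fuse_decisions([0.7, 0.2, 0.1], [0.2, 0.5, 0.3])` selects class 0, because the visual branch is more confident than the audio branch. Lowering `visual_weight` to 0.2 shifts the decision to class 1, the class the audio branch prefers.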

Key words: multimodal sentiment analysis, ResNet, 3D convolutional neural networks, attention, decision fusion 
