
Journal of Graphics ›› 2022, Vol. 43 ›› Issue (6): 1159-1169. DOI: 10.11996/JG.j.2095-302X.2022061159

• Image Processing and Computer Vision •


Multimodal emotion recognition with action features

  1. Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
  • Online: 2022-12-30  Published: 2023-01-11
  • Supported by:
    Tsinghua University Initiative Scientific Research Program (20211080093); China Postdoctoral Science Foundation (2021M701891); National Natural Science Foundation of China (62202257, 61725204) 


Abstract: In recent years, emotion recognition based on multimodal data has become an important research direction in natural human-computer interaction and artificial intelligence. Emotion recognition work that uses visual-modality information usually focuses on facial features, rarely considering action features or multimodal features fused with them. Although action is closely related to emotion, extracting effective action information from the visual modality for emotion recognition is difficult. Starting from the relationship between action and emotion, this paper introduces visual-modality action data into MELD, a classic multimodal emotion recognition dataset. Body action features are extracted with the ST-GCN network model and applied to single-modal emotion recognition based on an LSTM network model. Building on the text and audio features of the MELD dataset, the body action features further improve the accuracy of the LSTM-based multimodal fusion model, and combining text features with body action features improves the text single-modal accuracy of a context memory model over using text features alone. The experiments show that although body action features alone cannot match traditional text and audio features in single-modal emotion recognition accuracy, they play an important role in multimodal emotion recognition. The single-modal and multimodal experiments verify that human body actions carry emotional information, and that using body action features for multimodal emotion recognition has great development potential.

Key words: action features, emotion recognition, multimodality, action and emotion, visual modality

CLC number: