
Journal of Graphics (图学学报) ›› 2023, Vol. 44 ›› Issue (6): 1191-1201. DOI: 10.11996/JG.j.2095-302X.2023061191

• Image Processing and Computer Vision •


Video captioning based on semantic guidance

SHI Jia-hao1, YAO Li1,2

  1. School of Computer Science and Engineering, Southeast University, Nanjing, Jiangsu 211189, China
    2. Key Laboratory of Computer Network and Information Integration (Southeast University), Nanjing, Jiangsu 211189, China
  • Received: 2023-06-29 Accepted: 2023-09-26 Online: 2023-12-31 Published: 2023-12-17
  • Contact: YAO Li (1977-), professor, Ph.D. Her main research interests include computer graphics and computer vision. E-mail: yao.li@seu.edu.cn
  • About author:

    SHI Jia-hao (1998-), master's student. His main research interest is computer vision. E-mail: sjh143446@163.com

  • Supported by:
    Major Science and Technology Projects in Nanjing (202209003)


Abstract:

Video captioning aims to automatically generate a sentence summarizing the events in a given input video. The technology finds application in fields such as video retrieval, short-video title generation, assistance for visually impaired individuals, and security monitoring. However, existing methods tend to overlook the role of semantic information in description generation, resulting in an insufficient ability to describe key information. To address this issue, a video captioning model based on semantic guidance was designed. The model adopted an encoder-decoder framework. In the encoding stage, a semantic enhancement module was first employed to generate key entities and predicates; a semantic fusion module was then used to produce the overall semantic representation. In the decoding stage, a word selection module was adopted to select appropriate word vectors, guiding description generation so that semantic information was leveraged efficiently and the results focused on key semantics. Finally, experiments demonstrated that the model achieved CIDEr scores of 107.0% and 52.4% on two widely used datasets, MSVD and MSR-VTT, respectively, outperforming state-of-the-art models. User studies and visualization results further confirmed that the descriptions generated by the model aligned well with human comprehension.
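The abstract describes two semantic-guidance steps: fusing visual and semantic features in the encoder, and selecting guiding word vectors in the decoder. The following is a minimal, illustrative sketch of both ideas; the sigmoid-gate formulation, the cosine-similarity selection criterion, and all function names and dimensions are assumptions made for illustration, not the paper's actual modules.

```python
import math
import random

# Illustrative sketch only: a gated fusion of visual and semantic feature
# vectors, followed by a nearest-embedding word selection step. The gate
# weights and the similarity criterion are assumptions, not the paper's design.

def semantic_fusion(visual, semantic, w_gate):
    """Elementwise convex combination of the two features, controlled by a
    sigmoid gate computed from both inputs."""
    d = len(visual)
    fused = []
    for i in range(d):
        z = sum((visual[j] + semantic[j]) * w_gate[j][i] for j in range(d))
        g = 1.0 / (1.0 + math.exp(-z))  # gate value in (0, 1)
        fused.append(g * visual[i] + (1.0 - g) * semantic[i])
    return fused

def select_word(hidden, candidates):
    """Index of the candidate word embedding closest (by cosine similarity)
    to the decoder state `hidden`."""
    def cos(a, b):
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return sum(x * y for x, y in zip(a, b)) / (na * nb)
    sims = [cos(hidden, c) for c in candidates]
    return sims.index(max(sims))

random.seed(0)
d = 8
visual = [random.gauss(0, 1) for _ in range(d)]
semantic = [random.gauss(0, 1) for _ in range(d)]
w_gate = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]

fused = semantic_fusion(visual, semantic, w_gate)
# hypothetical entity/predicate embeddings from the semantic enhancement stage
candidates = [[random.gauss(0, 1) for _ in range(d)] for _ in range(5)]
idx = select_word(fused, candidates)
```

Because each fused component is a convex combination of the corresponding visual and semantic components, the fusion stays within the range spanned by its inputs; the selected index then points at the word vector that would guide the next decoding step.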

Key words: video captioning, semantic guidance, Transformer, feature fusion, semantic enhancement

CLC number: