
Journal of Graphics ›› 2023, Vol. 44 ›› Issue (6): 1191-1201. DOI: 10.11996/JG.j.2095-302X.2023061191


Video captioning based on semantic guidance

SHI Jia-hao1, YAO Li1,2

  1. School of Computer Science and Engineering, Southeast University, Nanjing, Jiangsu 211189, China
    2. Key Laboratory of Computer Network and Information Integration (Southeast University), Nanjing, Jiangsu 211189, China
  • Received: 2023-06-29   Accepted: 2023-09-26   Online: 2023-12-31   Published: 2023-12-17
  • Contact: YAO Li (1977-), professor, Ph.D. Her main research interests include computer graphics and computer vision. E-mail: yao.li@seu.edu.cn
  • About author:

    SHI Jia-hao (1998-), master student. His main research interest is computer vision. E-mail: sjh143446@163.com

  • Supported by:
    Major Science and Technology Projects in Nanjing (202209003)

Abstract:

Video captioning aims to automatically generate a sentence of text for a given input video, summarizing the events it depicts. This technology finds application in various fields, including video retrieval, short-video title generation, assistance for visually impaired individuals, and security monitoring. However, existing methods tend to overlook the role of semantic information in description generation, leaving the model unable to adequately describe key information. To address this issue, a video captioning model based on semantic guidance was designed. The model adopted an overall encoder-decoder framework. In the encoding stage, a semantic enhancement module was employed to generate key entities and predicates, and a semantic fusion module was then utilized to produce the overall semantic representation. In the decoding stage, a word selection module was adopted to select the appropriate word vector, guiding description generation to leverage semantic information efficiently and to strengthen attention to the key semantics in the results. Finally, experiments demonstrated that the model achieved CIDEr scores of 107.0% and 52.4% on two widely used datasets, MSVD and MSR-VTT, respectively, outperforming state-of-the-art models. User studies and visualization results corroborated that the descriptions generated by the model aligned well with human comprehension.
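Since the abstract names the three modules but gives no implementation details, the following is a minimal PyTorch sketch of how such a semantic-guided encoder-decoder pipeline might be wired together. All class names, dimensions, and the gating scheme (SemanticEnhancement, SemanticFusion, WordSelection, d_model=512, etc.) are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of a semantic-guided video captioning model.
# Module names, dimensions, and the gating mechanism are assumptions
# for illustration; they do not reproduce the paper's implementation.
import torch
import torch.nn as nn

class SemanticEnhancement(nn.Module):
    """Predicts key entity/predicate probabilities from encoded video features."""
    def __init__(self, d_model: int, n_entities: int, n_predicates: int):
        super().__init__()
        self.entity_head = nn.Linear(d_model, n_entities)
        self.predicate_head = nn.Linear(d_model, n_predicates)

    def forward(self, enc: torch.Tensor):
        pooled = enc.mean(dim=1)  # average over video frames: (B, d_model)
        return (torch.sigmoid(self.entity_head(pooled)),
                torch.sigmoid(self.predicate_head(pooled)))

class SemanticFusion(nn.Module):
    """Fuses entity and predicate predictions into one semantic representation."""
    def __init__(self, n_entities: int, n_predicates: int, d_model: int):
        super().__init__()
        self.proj = nn.Linear(n_entities + n_predicates, d_model)

    def forward(self, ent: torch.Tensor, pred: torch.Tensor) -> torch.Tensor:
        return torch.tanh(self.proj(torch.cat([ent, pred], dim=-1)))  # (B, d_model)

class WordSelection(nn.Module):
    """Gates decoder states with the semantic vector before vocabulary projection."""
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.gate = nn.Linear(2 * d_model, d_model)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, dec: torch.Tensor, sem: torch.Tensor) -> torch.Tensor:
        sem = sem.unsqueeze(1).expand_as(dec)  # broadcast over time steps
        g = torch.sigmoid(self.gate(torch.cat([dec, sem], dim=-1)))
        return self.out(g * dec + (1 - g) * sem)  # (B, T, vocab)

class SemanticGuidedCaptioner(nn.Module):
    def __init__(self, d_feat=2048, d_model=512, vocab_size=10000,
                 n_entities=300, n_predicates=100):
        super().__init__()
        self.feat_proj = nn.Linear(d_feat, d_model)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(d_model, nhead=8, batch_first=True)
        self.enhance = SemanticEnhancement(d_model, n_entities, n_predicates)
        self.fuse = SemanticFusion(n_entities, n_predicates, d_model)
        self.select = WordSelection(d_model, vocab_size)

    def forward(self, video_feats: torch.Tensor, captions: torch.Tensor):
        src = self.feat_proj(video_feats)        # (B, T_v, d_model)
        enc = self.transformer.encoder(src)      # encoding stage
        ent, pred = self.enhance(enc)            # semantic enhancement
        sem = self.fuse(ent, pred)               # semantic fusion
        tgt = self.embed(captions)               # (B, T_w, d_model)
        mask = self.transformer.generate_square_subsequent_mask(tgt.size(1))
        dec = self.transformer.decoder(tgt, enc, tgt_mask=mask)
        return self.select(dec, sem)             # word selection in decoding

# Usage: random tensors stand in for pre-extracted video features and tokens.
model = SemanticGuidedCaptioner()
logits = model(torch.randn(2, 20, 2048), torch.randint(0, 10000, (2, 15)))
print(logits.shape)  # torch.Size([2, 15, 10000])
```

The gate lets the decoder decide, per time step, how much to draw on the fused semantic vector versus its own hidden state; this is one plausible reading of "selecting the appropriate word vector" to guide description generation.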

Key words: video captioning, semantic guidance, Transformer, feature fusion, semantic enhancement
