图学学报 (Journal of Graphics) ›› 2023, Vol. 44 ›› Issue (6): 1191-1201. DOI: 10.11996/JG.j.2095-302X.2023061191
Received: 2023-06-29
Accepted: 2023-09-26
Online: 2023-12-31
Published: 2023-12-17
Contact: YAO Li (1977-), professor, Ph.D. Her main research interests cover computer graphics, computer vision, etc.
About author: SHI Jia-hao (1998-), master student. His main research interest covers computer vision. E-mail: sjh143446@163.com
Abstract: Video captioning aims to automatically generate a sentence that summarizes the events occurring in a given input video, and it can be applied to video retrieval, short-video title generation, assistance for the visually impaired, security surveillance, and other fields. Existing methods neglect the role of semantic information in caption generation, so their ability to describe key information is insufficient. To address this problem, a semantic-guided video captioning model is designed. The model adopts an encoder-decoder framework. In the encoding stage, a semantic enhancement module first generates key entities and predicates, after which a semantic fusion module produces an overall semantic representation; in the decoding stage, a word selection module chooses suitable word vectors to guide caption generation, so that semantic information is exploited efficiently and the results focus more on key semantics. Experiments show that the model achieves CIDEr scores of 107.0% and 52.4% on the two widely used datasets MSVD and MSR-VTT, respectively, outperforming state-of-the-art models. A user study and visualization results further demonstrate that the generated captions accord with human understanding.
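The pipeline described in the abstract (semantic enhancement → semantic fusion → word selection during decoding) can be made concrete with a minimal PyTorch sketch. This is only an illustrative reading of the described architecture, not the authors' implementation: the class name, feature dimensions, soft concept weighting, and greedy word-selection step below are all assumptions.

```python
# Illustrative sketch only -- NOT the authors' implementation.
# Module names, dimensions, and fusion/selection details are assumptions.
import torch
import torch.nn as nn

class SemanticGuidedCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, hid_dim=512, vocab_size=10000, num_concepts=300):
        super().__init__()
        # Semantic enhancement: score key entity/predicate concepts from video features
        self.concept_head = nn.Linear(feat_dim, num_concepts)
        self.concept_embed = nn.Embedding(num_concepts, hid_dim)
        # Semantic fusion: merge visual features with the weighted concept embeddings
        self.visual_proj = nn.Linear(feat_dim, hid_dim)
        self.fuse = nn.Linear(2 * hid_dim, hid_dim)
        # Decoder with a simple word-selection step over the vocabulary
        self.word_embed = nn.Embedding(vocab_size, hid_dim)
        self.decoder = nn.LSTMCell(2 * hid_dim, hid_dim)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, feats, max_len=15):
        # feats: (B, T, feat_dim) frame-level features; mean-pool over time
        video = feats.mean(dim=1)                           # (B, feat_dim)
        concept_logits = self.concept_head(video)           # (B, num_concepts)
        # Soft "semantic enhancement": weight concept embeddings by their scores
        concept_vec = torch.softmax(concept_logits, -1) @ self.concept_embed.weight
        fused = torch.tanh(self.fuse(
            torch.cat([self.visual_proj(video), concept_vec], dim=-1)))
        h = torch.zeros_like(fused)
        c = torch.zeros_like(fused)
        word = self.word_embed(torch.zeros(feats.size(0), dtype=torch.long))  # <bos> id 0
        outputs = []
        for _ in range(max_len):
            h, c = self.decoder(torch.cat([word, fused], dim=-1), (h, c))
            next_id = self.out(h).argmax(-1)                 # greedy word selection
            outputs.append(next_id)
            word = self.word_embed(next_id)
        return torch.stack(outputs, dim=1)                   # (B, max_len) token ids

# Toy usage: 2 videos, 8 frames, 2048-d features.
ids = SemanticGuidedCaptioner()(torch.randn(2, 8, 2048))
```

The sketch only mirrors the data flow at the coarsest level; in the paper's actual model the semantic enhancement module predicts key entities and predicates, and the word selection module operates over word vectors during decoding.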
石佳豪, 姚莉. 基于语义引导的视频描述生成[J]. 图学学报, 2023, 44(6): 1191-1201.
SHI Jia-hao, YAO Li. Video captioning based on semantic guidance[J]. Journal of Graphics, 2023, 44(6): 1191-1201.
表1 在MSVD数据集上的测试结果对比
Table 1 Comparison of test results on the MSVD dataset
Method | BLEU-4 | METEOR | ROUGE | CIDEr
---|---|---|---|---
ORG-TRL(2020)[30] | 54.3 | 36.4 | 73.9 | 95.2
SAAT(2020)[16] | 46.5 | 33.5 | 69.4 | 81.0
RMN(2020)[38] | 54.6 | 36.5 | 73.4 | 94.4
MDT(2021)[39] | 49.0 | 35.3 | 72.2 | 92.5
MGRMP(2021)[40] | 53.2 | 35.4 | 73.5 | 90.7
SGN(2021)[41] | 52.8 | 35.5 | 72.9 | 94.3
NACF(2021)[42] | 55.6 | 36.2 | 73.9 | 96.3
HMN(2022)[17] | 59.2 | 37.7 | 75.1 | 104.0
Nasib's(2022)[43] | 53.3 | 36.5 | 74.0 | 99.9
SMRE(2022)[44] | 55.5 | 35.6 | 72.6 | 95.2
TVRD(2022)[45] | 50.6 | 34.5 | 71.7 | 84.3
Ours | 54.7 | 36.7 | 74.1 | 107.0
表2 在MSR-VTT数据集上的测试结果对比
Table 2 Comparison of test results on the MSR-VTT dataset
Method | BLEU-4 | METEOR | ROUGE | CIDEr
---|---|---|---|---
ORG-TRL(2020)[30] | 43.6 | 28.8 | 62.1 | 50.9
SAAT(2020)[16] | 40.5 | 28.2 | 60.9 | 49.1
RMN(2020)[38] | 42.5 | 28.4 | 61.6 | 49.6
MDT(2021)[39] | 40.2 | 28.2 | 61.1 | 47.3
MGRMP(2021)[40] | 42.1 | 28.8 | 61.4 | 50.1
SGN(2021)[41] | 40.8 | 28.3 | 60.8 | 49.5
NACF(2021)[42] | 42.0 | 28.7 | 62.2 | 51.4
HMN(2022)[17] | 41.9 | 28.7 | 61.8 | 51.1
Nasib's(2022)[43] | 41.1 | 28.9 | 61.9 | 51.7
SMRE(2022)[44] | 41.4 | 28.1 | 61.4 | 49.7
TVRD(2022)[45] | 43.0 | 28.7 | 62.2 | 51.8
Ours | 42.8 | 28.3 | 61.8 | 52.4
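The BLEU-4, METEOR, ROUGE, and CIDEr columns in Tables 1 and 2 are standard corpus-level captioning metrics, reported as percentages. Below is a minimal sketch of how such scores are typically computed, assuming the commonly used pycocoevalcap package (METEOR additionally requires a Java runtime); this is not tied to the authors' evaluation code.

```python
# Hedged sketch: standard caption metrics via pycocoevalcap (pip install pycocoevalcap).
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

# gts: all reference captions per video id; res: exactly one generated caption per video id
gts = {"video1": ["a man is playing a guitar", "a person plays the guitar"],
       "video2": ["a woman is cooking in the kitchen"]}
res = {"video1": ["a man plays a guitar"],
       "video2": ["a woman is cooking food"]}

bleu, _ = Bleu(4).compute_score(gts, res)     # bleu = [BLEU-1, BLEU-2, BLEU-3, BLEU-4]
meteor, _ = Meteor().compute_score(gts, res)  # needs Java on the PATH
rouge, _ = Rouge().compute_score(gts, res)    # ROUGE-L
cider, _ = Cider().compute_score(gts, res)    # x100 matches the scale used in the tables
print(bleu[3] * 100, meteor * 100, rouge * 100, cider * 100)
```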
表3 用户实验的测试结果(%)
Table 3 Test results of user experiments (%)
Method | Top-1 | Top-2 | Top-3 | Overall
---|---|---|---|---
SAAT(2020)[16] | 15 | 12.1 | 16.1 | 16.1
SGN(2021)[41] | 15 | 22.0 | 25.8 | 21.7
HMN(2022)[17] | 10 | 22.0 | 25.8 | 20.8
Ours | 60 | 43.9 | 32.3 | 41.4
图5 MSVD数据集上的可视化结果((a)场景1;(b)场景2;(c)场景3;(d)场景4)
Fig. 5 Visualization results on the MSVD dataset ((a) Scenario 1; (b) Scenario 2; (c) Scenario 3; (d) Scenario 4)
图6 MSR-VTT数据集上的可视化结果((a)场景1;(b)场景2;(c)场景3;(d)场景4)
Fig. 6 Visualization results on the MSR-VTT dataset ((a) Scenario 1; (b) Scenario 2; (c) Scenario 3; (d) Scenario 4)
表4 MSVD数据集上的消融实验
Table 4 Ablation experiments on the MSVD dataset
Method | BLEU-4 | METEOR | ROUGE | CIDEr
---|---|---|---|---
Model-FULL | 54.7 | 36.7 | 74.1 | 107.0
Model-NF | 52.6 | 34.5 | 72.1 | 89.1
Model-NC | 55.0 | 36.1 | 73.5 | 99.4
Model-NE | 49.7 | 33.6 | 70.4 | 75.9
Model-NP | 53.1 | 36.1 | 73.2 | 98.6
表5 MSR-VTT数据集上的消融实验
Table 5 Ablation experiments on the MSR-VTT dataset
Method | BLEU-4 | METEOR | ROUGE | CIDEr
---|---|---|---|---
Model-FULL | 42.8 | 28.3 | 61.8 | 52.4
Model-NF | 40.5 | 27.8 | 60.9 | 48.9
Model-NC | 42.5 | 28.3 | 61.6 | 51.0
Model-NE | 40.1 | 27.4 | 60.0 | 46.6
Model-NP | 41.6 | 27.7 | 61.2 | 49.4
[1] | ZHANG Z Q, CHEN Y X, MA Z Y, et al. CREATE: a benchmark for Chinese short video retrieval and title generation[EB/OL]. [2023-01-12]. https://arxiv.org/abs/2203.16763.pdf. |
[2] | NIE L Q, QU L G, MENG D, et al. Search-oriented micro-video captioning[C]// The 30th ACM International Conference on Multimedia. New York: ACM, 2022: 3234-3243. |
[3] | XU H J, HE K, PLUMMER B A, et al. Multilevel language and vision integration for text-to-clip retrieval[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2019, 33(1): 9062-9069. |
[4] | WRAY M, DOUGHTY H, DAMEN D M. On semantic similarity in video retrieval[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 3649-3659. |
[5] | CAMPOS V P, ARAÚJO T M U, SOUZA FILHO G L, et al. CineAD: a system for automated audio description script generation for the visually impaired[J]. Universal Access in the Information Society, 2020, 19(1): 99-111. |
[6] | SULTANI W, CHEN C, SHAH M. Real-world anomaly detection in surveillance videos[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 6479-6488. |
[7] | CARION N, MASSA F, SYNNAEVE G, et al. End-to-end object detection with transformers[C]// European Conference on Computer Vision. Cham: Springer, 2020: 213-229. |
[8] | GRAVES A. Long short-term memory[M]// Supervised Sequence Labelling with Recurrent Neural Networks. Heidelberg: Springer, 2012: 37-45. |
[9] | VENUGOPALAN S, ROHRBACH M, DONAHUE J, et al. Sequence to sequence: video to text[C]// 2015 IEEE International Conference on Computer Vision. New York: IEEE Press, 2016: 4534-4542. |
[10] | YAO L, TORABI A, CHO K, et al. Describing videos by exploiting temporal structure[C]// 2015 IEEE International Conference on Computer Vision. New York: IEEE Press, 2016: 4507-4515. |
[11] | LI X L, ZHAO B, LU X Q. MAM-RNN: multi-level attention model based RNN for video captioning[C]// IJCAI'17: the 26th International Joint Conference on Artificial Intelligence. New York: ACM, 2017: 2208-2214. |
[12] | ZHANG J C, PENG Y X. Object-aware aggregation with bidirectional temporal graph for video captioning[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 8319-8328. |
[13] | PAN B X, CAI H Y, HUANG D A, et al. Spatio-temporal graph for video captioning with knowledge distillation[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 10867-10876. |
[14] | ZHANG Z Q, QI Z A, YUAN C F, et al. Open-book video captioning with retrieve-copy-generate network[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 9832-9841. |
[15] | WANG B R, MA L, ZHANG W, et al. Controllable video captioning with POS sequence guidance based on gated fusion network[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2020: 2641-2650. |
[16] | ZHENG Q, WANG C Y, TAO D C. Syntax-aware action targeting for video captioning[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 13093-13102. |
[17] | YE H H, LI G R, QI Y K, et al. Hierarchical modular network for video captioning[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 17918-17927. |
[18] | VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]// The 31st International Conference on Neural Information Processing Systems. New York: ACM, 2017: 6000-6010. |
[19] | ZHAO H, CHEN Z W, GUO L, et al. Video captioning based on vision transformer and reinforcement learning[J]. PeerJ Computer Science, 2022, 8: e916. |
[20] | JIN T, HUANG S Y, CHEN M, et al. SBAT: video captioning with sparse boundary-aware transformer[EB/OL]. [2023-01-12]. https://arxiv.org/abs/2007.11888.pdf. |
[21] | LIN K, LI L J, LIN C C, et al. SwinBERT: end-to-end transformers with sparse attention for video captioning[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 17928-17937. |
[22] | CHEN J, GUO H, YI K, et al. VisualGPT: data-efficient adaptation of pretrained language models for image captioning[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 18009-18019. |
[23] | RADFORD A, WU J, CHILD R, et al. Language models are unsupervised multitask learners[J]. OpenAI blog, 2019, 1(8): 9. |
[24] | TSIMPOUKELLI M, MENICK J, CABI S, et al. Multimodal few-shot learning with frozen language models[EB/OL]. [2023-01-13]. https://arxiv.org/abs/2106.13884.pdf. |
[25] | LI J N, LI D X, SAVARESE S, et al. BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models[EB/OL]. [2023-03-12]. https://arxiv.org/abs/2301.12597.pdf. |
[26] | HARA K, KATAOKA H, SATOH Y. Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet?[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 6546-6555. |
[27] | REN S Q, HE K M, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149. |
[28] | SZEGEDY C, IOFFE S, VANHOUCKE V, et al. Inception-v4, inception-ResNet and the impact of residual connections on learning[C]// The 31st AAAI Conference on Artificial Intelligence. New York: ACM, 2017: 4278-4284. |
[29] | JANG E, GU S X, POOLE B. Categorical reparameterization with gumbel-softmax[EB/OL]. [2023-01-13]. https://arxiv.org/abs/1611.01144.pdf. |
[30] | ZHANG Z Q, SHI Y Y, YUAN C F, et al. Object relational graph with teacher-recommended learning for video captioning[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 13275-13285. |
[31] | REIMERS N, GUREVYCH I. Sentence-BERT: sentence embeddings using Siamese BERT-networks[EB/OL]. [2023-02-01]. https://arxiv.org/abs/1908.10084.pdf. |
[32] | CHEN D L, DOLAN W B. Collecting highly parallel data for paraphrase evaluation[C]// The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1. New York: ACM, 2011: 190-200. |
[33] | XU J, MEI T, YAO T, et al. MSR-VTT: a large video description dataset for bridging video and language[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 5288-5296. |
[34] | PAPINENI K, ROUKOS S, WARD T, et al. BLEU: a method for automatic evaluation of machine translation[C]// The 40th Annual Meeting on Association for Computational Linguistics - ACL '02. Morristown: Association for Computational Linguistics, 2002: 311-318. |
[35] | BANERJEE S, LAVIE A. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments[EB/OL]. [2023-01-12]. https://www.xueshufan.com/publication/2123301721. |
[36] | VEDANTAM R, ZITNICK C L, PARIKH D. CIDEr: consensus-based image description evaluation[C]// 2015 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2015: 4566-4575. |
[37] | LIN C Y. ROUGE: a package for automatic evaluation of summaries[EB/OL]. [2023-01-12]. https://www.doc88.com/p-4951618522651.html. |
[38] | TAN G C, LIU D Q, WANG M, et al. Learning to discretely compose reasoning module networks for video captioning[EB/OL]. [2023-01-12]. https://arxiv.org/abs/2007.09049.pdf. |
[39] | ZHAO W, WU X, LUO J. Multi-modal dependency tree for video captioning[J]. Advances in Neural Information Processing Systems, 2021, 34: 6634-6645. |
[40] | CHEN S X, JIANG Y G. Motion guided region message passing for video captioning[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2022: 1523-1532. |
[41] | RYU H, KANG S, KANG H, et al. Semantic grouping network for video captioning[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2021, 35(3): 2514-2522. |
[42] | YANG B, ZOU Y X, LIU F L, et al. Non-autoregressive coarse-to-fine video captioning[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2021, 35(4): 3119-3127. |
[43] | ULLAH N, MOHANTA P P. Thinking hallucination for video captioning[C]// Asian Conference on Computer Vision. Cham: Springer, 2023: 623-640. |
[44] | CHEN X Y, SONG J K, ZENG P P, et al. Support-set based multi-modal representation enhancement for video captioning[C]// 2022 IEEE International Conference on Multimedia and Expo. New York: IEEE Press, 2022: 1-6. |
[45] | WU B F, NIU G C, YU J, et al. Towards knowledge-aware video captioning via transitive visual relationship detection[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2022, 32(10): 6753-6765. |