Journal of Graphics ›› 2023, Vol. 44 ›› Issue (6): 1191-1201. DOI: 10.11996/JG.j.2095-302X.2023061191
Received: 2023-06-29
Accepted: 2023-09-26
Online: 2023-12-31
Published: 2023-12-17
Contact: YAO Li (1977-), professor, Ph.D. Her main research interests cover computer graphics, computer vision, etc.
About author: SHI Jia-hao (1998-), master student. His main research interest covers computer vision. E-mail: sjh143446@163.com
SHI Jia-hao, YAO Li. Video captioning based on semantic guidance[J]. Journal of Graphics, 2023, 44(6): 1191-1201.
URL: http://www.txxb.com.cn/EN/10.11996/JG.j.2095-302X.2023061191
| Method | BLEU-4 | METEOR | ROUGE | CIDEr |
|---|---|---|---|---|
| ORG-TRL(2020)[30] | 54.3 | 36.4 | 73.9 | 95.2 |
| SAAT(2020)[16] | 46.5 | 33.5 | 69.4 | 81.0 |
| RMN(2020)[38] | 54.6 | 36.5 | 73.4 | 94.4 |
| MDT(2021)[39] | 49.0 | 35.3 | 72.2 | 92.5 |
| MGRMP(2021)[40] | 53.2 | 35.4 | 73.5 | 90.7 |
| SGN(2021)[41] | 52.8 | 35.5 | 72.9 | 94.3 |
| NACF(2021)[42] | 55.6 | 36.2 | 73.9 | 96.3 |
| HMN(2022)[17] | 59.2 | 37.7 | 75.1 | 104.0 |
| Nasib's(2022)[43] | 53.3 | 36.5 | 74.0 | 99.9 |
| SMRE(2022)[44] | 55.5 | 35.6 | 72.6 | 95.2 |
| TVRD(2022)[45] | 50.6 | 34.5 | 71.7 | 84.3 |
| Ours | 54.7 | 36.7 | 74.1 | 107.0 |

Table 1 Comparison of test results on the MSVD dataset
| Method | BLEU-4 | METEOR | ROUGE | CIDEr |
|---|---|---|---|---|
| ORG-TRL(2020)[30] | 43.6 | 28.8 | 62.1 | 50.9 |
| SAAT(2020)[16] | 40.5 | 28.2 | 60.9 | 49.1 |
| RMN(2020)[38] | 42.5 | 28.4 | 61.6 | 49.6 |
| MDT(2021)[39] | 40.2 | 28.2 | 61.1 | 47.3 |
| MGRMP(2021)[40] | 42.1 | 28.8 | 61.4 | 50.1 |
| SGN(2021)[41] | 40.8 | 28.3 | 60.8 | 49.5 |
| NACF(2021)[42] | 42.0 | 28.7 | 62.2 | 51.4 |
| HMN(2022)[17] | 41.9 | 28.7 | 61.8 | 51.1 |
| Nasib's(2022)[43] | 41.1 | 28.9 | 61.9 | 51.7 |
| SMRE(2022)[44] | 41.4 | 28.1 | 61.4 | 49.7 |
| TVRD(2022)[45] | 43.0 | 28.7 | 62.2 | 51.8 |
| Ours | 42.8 | 28.3 | 61.8 | 52.4 |

Table 2 Comparison of test results on the MSR-VTT dataset
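The BLEU-4, METEOR, ROUGE and CIDEr columns in Tables 1 and 2 follow the standard captioning evaluation metrics[34-37]. As a hedged illustration only, the sketch below shows how such scores are commonly computed with the open-source pycocoevalcap toolkit; this is an assumption about tooling, not the authors' own evaluation script, and the example captions are hypothetical.

```python
# Minimal sketch: scoring a generated caption against references with the
# commonly used pycocoevalcap toolkit (an assumed tool, not necessarily the
# one used by the authors). Inputs are pre-tokenized, lower-cased strings.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor  # requires a local Java runtime
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

# Hypothetical data: video id -> reference captions / generated caption.
gts = {"video1": ["a man is playing a guitar", "someone plays the guitar"]}
res = {"video1": ["a man plays a guitar"]}

scorers = [("BLEU", Bleu(4)), ("METEOR", Meteor()),
           ("ROUGE-L", Rouge()), ("CIDEr", Cider())]
for name, scorer in scorers:
    score, _ = scorer.compute_score(gts, res)
    if isinstance(score, list):        # Bleu returns [BLEU-1, ..., BLEU-4]
        print(f"BLEU-4: {score[3]:.3f}")
    else:
        print(f"{name}: {float(score):.3f}")
```

By convention these scores are reported multiplied by 100, so a toolkit CIDEr of 1.07 corresponds to the 107.0 listed for "Ours" in Table 1.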
| Method | Top-1 | Top-2 | Top-3 | Overall |
|---|---|---|---|---|
| SAAT(2020)[16] | 15 | 12.1 | 16.1 | 16.1 |
| SGN(2021)[41] | 15 | 22.0 | 25.8 | 21.7 |
| HMN(2022)[17] | 10 | 22.0 | 25.8 | 20.8 |
| Ours | 60 | 43.9 | 32.3 | 41.4 |

Table 3 Test results of user experiments (%)
| Method | BLEU-4 | METEOR | ROUGE | CIDEr |
|---|---|---|---|---|
| Model-FULL | 54.7 | 36.7 | 74.1 | 107.0 |
| Model-NF | 52.6 | 34.5 | 72.1 | 89.1 |
| Model-NC | 55.0 | 36.1 | 73.5 | 99.4 |
| Model-NE | 49.7 | 33.6 | 70.4 | 75.9 |
| Model-NP | 53.1 | 36.1 | 73.2 | 98.6 |

Table 4 Ablation experiments on the MSVD dataset
| Method | BLEU-4 | METEOR | ROUGE | CIDEr |
|---|---|---|---|---|
| Model-FULL | 42.8 | 28.3 | 61.8 | 52.4 |
| Model-NF | 40.5 | 27.8 | 60.9 | 48.9 |
| Model-NC | 42.5 | 28.3 | 61.6 | 51.0 |
| Model-NE | 40.1 | 27.4 | 60.0 | 46.6 |
| Model-NP | 41.6 | 27.7 | 61.2 | 49.4 |

Table 5 Ablation experiments on the MSR-VTT dataset
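Reading Tables 4 and 5 as relative drops against Model-FULL makes the contribution of each component easier to compare across datasets. The snippet below is a small worked example over the CIDEr values reported above; the variant labels (NF, NC, NE, NP) refer to components defined in the body of the paper and are used here only as dictionary keys.

```python
# Relative CIDEr degradation of each ablated variant w.r.t. Model-FULL,
# using only the values reported in Tables 4 (MSVD) and 5 (MSR-VTT).
cider = {
    "MSVD":    {"FULL": 107.0, "NF": 89.1, "NC": 99.4, "NE": 75.9, "NP": 98.6},
    "MSR-VTT": {"FULL": 52.4,  "NF": 48.9, "NC": 51.0, "NE": 46.6, "NP": 49.4},
}
for dataset, scores in cider.items():
    full = scores["FULL"]
    for variant in ("NF", "NC", "NE", "NP"):
        drop = 100.0 * (full - scores[variant]) / full
        print(f"{dataset:8s} Model-{variant}: CIDEr {scores[variant]:6.1f} "
              f"({drop:4.1f}% below FULL)")
```

On both datasets Model-NE shows the largest relative CIDEr drop (about 29% on MSVD and 11% on MSR-VTT) and Model-NC the smallest.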
[1] | ZHANG Z Q, CHEN Y X, MA Z Y, et al. CREATE: a benchmark for Chinese short video retrieval and title generation[EB/OL]. [2023-01-12]. https://arxiv.org/abs/2203.16763.pdf. |
[2] | NIE L Q, QU L G, MENG D, et al. Search-oriented micro-video captioning[C]// The 30th ACM International Conference on Multimedia. New York: ACM, 2022: 3234-3243. |
[3] | XU H J, HE K, PLUMMER B A, et al. Multilevel language and vision integration for text-to-clip retrieval[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2019, 33(1): 9062-9069. |
[4] | WRAY M, DOUGHTY H, DAMEN D M. On semantic similarity in video retrieval[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 3649-3659. |
[5] | CAMPOS V P, ARAÚJO T M U, SOUZA FILHO G L, et al. CineAD: a system for automated audio description script generation for the visually impaired[J]. Universal Access in the Information Society, 2020, 19(1): 99-111. |
[6] | SULTANI W, CHEN C, SHAH M. Real-world anomaly detection in surveillance videos[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 6479-6488. |
[7] | CARION N, MASSA F, SYNNAEVE G, et al. End-to-end object detection with transformers[C]// European Conference on Computer Vision. Cham: Springer, 2020: 213-229. |
[8] | GRAVES A. Long short-term memory[M]// Supervised Sequence Labelling with Recurrent Neural Networks. Heidelberg: Springer, 2012: 37-45. |
[9] | VENUGOPALAN S, ROHRBACH M, DONAHUE J, et al. Sequence to sequence: video to text[C]// 2015 IEEE International Conference on Computer Vision. New York: IEEE Press, 2016: 4534-4542. |
[10] | YAO L, TORABI A, CHO K, et al. Describing videos by exploiting temporal structure[C]// 2015 IEEE International Conference on Computer Vision. New York: IEEE Press, 2016: 4507-4515. |
[11] | LI X L, ZHAO B, LU X Q. MAM-RNN: multi-level attention model based RNN for video captioning[C]// IJCAI'17: the 26th International Joint Conference on Artificial Intelligence. New York: ACM, 2017: 2208-2214. |
[12] | ZHANG J C, PENG Y X. Object-aware aggregation with bidirectional temporal graph for video captioning[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 8319-8328. |
[13] | PAN B X, CAI H Y, HUANG D A, et al. Spatio-temporal graph for video captioning with knowledge distillation[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 10867-10876. |
[14] | ZHANG Z Q, QI Z A, YUAN C F, et al. Open-book video captioning with retrieve-copy-generate network[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 9832-9841. |
[15] | WANG B R, MA L, ZHANG W, et al. Controllable video captioning with POS sequence guidance based on gated fusion network[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2020: 2641-2650. |
[16] | ZHENG Q, WANG C Y, TAO D C. Syntax-aware action targeting for video captioning[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 13093-13102. |
[17] | YE H H, LI G R, QI Y K, et al. Hierarchical modular network for video captioning[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 17918-17927. |
[18] | VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]// The 31st International Conference on Neural Information Processing Systems. New York: ACM, 2017: 6000-6010. |
[19] | ZHAO H, CHEN Z W, GUO L, et al. Video captioning based on vision transformer and reinforcement learning[J]. PeerJ Computer Science, 2022, 8: e916. |
[20] | JIN T, HUANG S Y, CHEN M, et al. SBAT: video captioning with sparse boundary-aware transformer[EB/OL]. [2023-01-12]. https://arxiv.org/abs/2007.11888.pdf. |
[21] | LIN K, LI L J, LIN C C, et al. SwinBERT: end-to-end transformers with sparse attention for video captioning[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 17928-17937. |
[22] | CHEN J, GUO H, YI K, et al. VisualGPT: data-efficient adaptation of pretrained language models for image captioning[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 18009-18019. |
[23] | RADFORD A, WU J, CHILD R, et al. Language models are unsupervised multitask learners[J]. OpenAI blog, 2019, 1(8): 9. |
[24] | TSIMPOUKELLI M, MENICK J, CABI S, et al. Multimodal few-shot learning with frozen language models[EB/OL]. [2023-01-13]. https://arxiv.org/abs/2106.13884.pdf. |
[25] | LI J N, LI D X, SAVARESE S, et al. BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models[EB/OL]. [2023-03-12]. https://arxiv.org/abs/2301.12597.pdf. |
[26] | HARA K, KATAOKA H, SATOH Y. Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet?[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 6546-6555. |
[27] | REN S Q, HE K M, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149. |
[28] | SZEGEDY C, IOFFE S, VANHOUCKE V, et al. Inception-v4, Inception-ResNet and the impact of residual connections on learning[C]// The 31st AAAI Conference on Artificial Intelligence. New York: ACM, 2017: 4278-4284. |
[29] | JANG E, GU S X, POOLE B. Categorical reparameterization with gumbel-softmax[EB/OL]. [2023-01-13]. https://arxiv.org/abs/1611.01144.pdf. |
[30] | ZHANG Z Q, SHI Y Y, YUAN C F, et al. Object relational graph with teacher-recommended learning for video captioning[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 13275-13285. |
[31] | REIMERS N, GUREVYCH I. Sentence-BERT: sentence embeddings using Siamese BERT-networks[EB/OL]. [2023-02-01]. https://arxiv.org/abs/1908.10084.pdf. |
[32] | CHEN D L, DOLAN W B. Collecting highly parallel data for paraphrase evaluation[C]// The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1. New York: ACM, 2011: 190-200. |
[33] | XU J, MEI T, YAO T, et al. MSR-VTT: a large video description dataset for bridging video and language[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 5288-5296. |
[34] | PAPINENI K, ROUKOS S, WARD T, et al. BLEU: a method for automatic evaluation of machine translation[C]// The 40th Annual Meeting on Association for Computational Linguistics - ACL '02. Morristown: Association for Computational Linguistics, 2002: 311-318. |
[35] | BANERJEE S, LAVIE A. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments[EB/OL]. [2023-01-12]. https://www.xueshufan.com/publication/2123301721. |
[36] | VEDANTAM R, ZITNICK C L, PARIKH D. CIDEr: consensus-based image description evaluation[C]// 2015 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2015: 4566-4575. |
[37] | LIN C Y. Rouge: a package for automatic evaluation of summaries[EB/OL]. [2023-01-12]. https://www.doc88.com/p-4951618522651.html. |
[38] | TAN G C, LIU D Q, WANG M, et al. Learning to discretely compose reasoning module networks for video captioning[EB/OL]. [2023-01-12]. https://arxiv.org/abs/2007.09049.pdf. |
[39] | ZHAO W, WU X, LUO J. Multi-modal dependency tree for video captioning[J]. Advances in Neural Information Processing Systems, 2021, 34: 6634-6645. |
[40] | CHEN S X, JIANG Y G. Motion guided region message passing for video captioning[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2022: 1523-1532. |
[41] | RYU H, KANG S, KANG H, et al. Semantic grouping network for video captioning[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2021, 35(3): 2514-2522. |
[42] | YANG B, ZOU Y X, LIU F L, et al. Non-autoregressive coarse-to-fine video captioning[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2021, 35(4): 3119-3127. |
[43] | ULLAH N, MOHANTA P P. Thinking hallucination for video captioning[C]// Asian Conference on Computer Vision. Cham: Springer, 2023: 623-640. |
[44] | CHEN X Y, SONG J K, ZENG P P, et al. Support-set based multi-modal representation enhancement for video captioning[C]// 2022 IEEE International Conference on Multimedia and Expo. New York: IEEE Press, 2022: 1-6. |
[45] | WU B F, NIU G C, YU J, et al. Towards knowledge-aware video captioning via transitive visual relationship detection[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2022, 32(10): 6753-6765. |