Journal of Graphics ›› 2024, Vol. 45 ›› Issue (6): 1231-1242. DOI: 10.11996/JG.j.2095-302X.2024061231
ZHENG Hongyan1, WANG Hui2, LIU Hao1, ZHANG Zhiping1, YANG Xiaojuan3, SUN Tao1
Received: 2024-06-21
Accepted: 2024-08-17
Published: 2024-12-31
Online: 2024-12-24
Contact: SUN Tao (1974-), professor, Ph.D. His main research interests cover natural language processing and computer vision. E-mail: sunt@qlu.edu.cn
First author: ZHENG Hongyan (1999-), master student. His main research interest covers visual question answering. E-mail: 1679436540@qq.com
Abstract: Knowledge-based visual question answering (KB-VQA) requires not only image and question information but also relevant knowledge acquired from a knowledge source to answer a question. Existing methods typically use a retriever to fetch external knowledge from a knowledge base, or obtain implicit knowledge directly from a large model, but the available image-text information alone is often insufficient for acquiring the relevant knowledge. An enhanced retrieval strategy is therefore proposed that targets both the query and the external knowledge in the retrieval stage. On the query side, implicit knowledge from a large model is used to augment the existing image and question information, and the augmented image-text information helps the retriever locate more accurate external knowledge in the knowledge base. On the external-knowledge side, a pre-simulated interaction module (PIM) is proposed to enhance the external knowledge: it generates a new lightweight vector for each knowledge vector, and the pre-interaction between the two lets the retriever simulate the interaction between the query and knowledge passages in advance, so as to better capture their semantic relationship. Experimental results show that the improved model achieves 61.3% accuracy on the OK-VQA dataset while retrieving only a small number of knowledge passages.
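As a rough illustration of the query-side enhancement described in the abstract, the sketch below appends implicit knowledge from a large model to the image-text query before dense retrieval. This is a minimal sketch under stated assumptions, not the paper's implementation: `generate_caption`, `implicit_knowledge`, and the `bert-base-uncased` encoder are hypothetical placeholders for the OFA captioner, GPT-3 prompting, and the actual retriever encoder.

```python
import torch
from transformers import AutoModel, AutoTokenizer

def generate_caption(image) -> str:
    # Hypothetical stand-in for an OFA-style captioner.
    return "a man riding a horse on a beach"

def implicit_knowledge(question: str, caption: str) -> str:
    # Hypothetical stand-in for prompting GPT-3 with the caption and the
    # question, keeping its answer candidate / supporting evidence.
    return "horses are commonly ridden on beaches for recreation"

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder retriever encoder
encoder = AutoModel.from_pretrained("bert-base-uncased")

def encode(text: str) -> torch.Tensor:
    """CLS-pooled dense embedding, DPR-style."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
    with torch.no_grad():
        return encoder(**inputs).last_hidden_state[:, 0]  # (1, hidden)

def build_enhanced_query(image, question: str) -> torch.Tensor:
    caption = generate_caption(image)
    hint = implicit_knowledge(question, caption)
    # Enhanced query = original image-text information + implicit knowledge,
    # giving the retriever more signal to match against knowledge passages.
    return encode(f"question: {question} caption: {caption} knowledge: {hint}")

def retrieve(query_vec: torch.Tensor, passage_vecs: torch.Tensor, k: int = 5) -> torch.Tensor:
    # Plain inner-product search over pre-encoded knowledge passages.
    scores = query_vec @ passage_vecs.T  # (1, num_passages)
    return scores.topk(min(k, passage_vecs.shape[0])).indices.squeeze(0)
```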
ZHENG Hongyan, WANG Hui, LIU Hao, ZHANG Zhiping, YANG Xiaojuan, SUN Tao. Research on KB-VQA knowledge retrieval strategy based on implicit knowledge enhancement[J]. Journal of Graphics, 2024, 45(6): 1231-1242.
Fig. 1 Comparison of our method with existing methods ((a) Using a retriever to fetch knowledge from a knowledge base; (b) Treating a large model as the answer generator; (c) Treating a large model as an implicit knowledge base and using it together with an external knowledge base; (d) Our method)
Fig. 2 A concrete example of prompt information enhancing image-text information to assist knowledge retrieval
Fig. 5 Examples from the OK-VQA dataset ((a) Vehicles and transport; (b) Brands, companies and products; (c) Goods, materials and services)
| Method | Venue | Ptrain | Ptest | External knowledge source | PRRecall/% | EM/% | ACC/% |
|---|---|---|---|---|---|---|---|
| 1. BAN+KG+AUG | MM (2020) | - | - | Wikipedia+ConceptNet | - | - | 26.7 |
| 2. ConceptBERT | EMNLP (2020) | - | - | ConceptNet | - | - | 33.7 |
| 3. KRISP | CVPR (2021) | - | - | Wikipedia+ConceptNet | - | - | 38.4 |
| 4. Vis-DPR | EMNLP (2021) | - | - | Google Search | - | - | 39.2 |
| 5. KAT-T5 | NAACL (2021) | 40 | 40 | Wikipedia | - | - | 44.2 |
| 6. VRR | EMNLP (2021) | 100 | 100 | Google Search | - | - | 45.0 |
| 7. RAG | NeurIPS (2020) | 5 | 5 | Google Search | 82.3 | 52.5 | 48.2 |
| 8. TRIG | CVPR (2022) | 100 | 100 | Wikipedia | - | 53.5 | 49.3 |
| 9. RR-VEL | ICME (2023) | 5 | 5 | ConceptNet+Ascent+hasPart | - | 55.6 | 49.4 |
| 10. RA-VQA | EMNLP (2022) | 5 | 5 | Google Search | 82.8 | 58.7 | 53.8 |
| 11. ReVeal | CVPR (2023) | - | - | WIT+CC12M+Wikipedia+VQAv2 | - | - | 59.1 |
| 12. PICa | AAAI (2022) | - | - | GPT-3 | - | - | 48.0 |
| 13. PromptCap | ICCV (2023) | - | - | GPT-3 | - | - | 60.4 |
| 14. Prophet | CVPR (2023) | - | - | GPT-3 | - | - | 61.1 |
| 15. PaLI-15B | ICLR (2023) | - | - | PaLI (15B) | - | - | 56.5 |
| 16. InstructBLIP-7B | NeurIPS (2023) | - | - | InstructBLIP (7B) | - | - | 57.6 |
| 17. PaLM-E-12B | ICML (2023) | - | - | PaLM-E (12B) | - | - | 60.1 |
| 18. REVIVE | NeurIPS (2022) | 45 | 45 | Wikipedia+GPT-3 | - | - | 58.0 |
| 19. RASP | ACL (2023) | - | - | Wikipedia+Codex | - | - | 58.5 |
| 20. Two | ACL (2023) | 75 | 75 | Wikipedia+GPT-3+OFA+VQAv2 | 85.2 | - | 58.7 |
| Ours | - | 5 | 5 | Google Search+GPT-3+OFA | 89.2 | 66.1 | 61.3 |
Table 1 Comparison with existing methods on the OK-VQA dataset
| Model | PRRecall@5 | EM | ACC |
|---|---|---|---|
| w/o all | 82.8 | 58.7 | 53.8 |
| w/o OFA+GPT-3 | 84.7 | 61.5 | 55.0 |
| w/o OFA | 85.8 | 63.7 | 58.9 |
| w/o GPT-3 | 87.1 | 65.1 | 59.8 |
| w/o PIM+R | 82.8 | 62.9 | 58.3 |
| w/o PIM | 87.8 | 65.8 | 60.5 |
| Ours | 89.2 | 66.1 | 61.3 |
Table 2 Ablation study on the model structure/%
| Method | 5 passages | 500 passages | 1 000 passages |
|---|---|---|---|
| Dual encoder | 1 | 75 | 151 |
| PIM | 1 | 75 | 154 |
Table 3 Time consumed by the pre-simulated interaction module when retrieving external knowledge/ms
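One way to read Table 3, as a hedged sketch rather than a reproduction of the module's internals: if the lightweight vector is produced offline alongside each full passage vector, online PIM scoring adds only one extra low-dimensional inner product per passage, which keeps its latency nearly identical to a plain dual encoder. The projection `proj` below is a hypothetical stand-in for however the paper derives the lightweight vectors.

```python
import torch

def index_passages(passage_vecs: torch.Tensor, proj: torch.nn.Linear):
    # Offline: alongside each full passage vector, store a lightweight
    # companion vector (d_light << d); both are indexed in advance.
    return passage_vecs, proj(passage_vecs)

def dual_encoder_score(q: torch.Tensor, p_full: torch.Tensor) -> torch.Tensor:
    return q @ p_full.T  # one matrix product at query time

def pim_score(q: torch.Tensor, q_light: torch.Tensor,
              p_full: torch.Tensor, p_light: torch.Tensor) -> torch.Tensor:
    # Dual-encoder term plus a pre-simulated interaction term: still only
    # matrix products at query time, hence the near-identical timings.
    return q @ p_full.T + q_light @ p_light.T

# Usage sketch: 1 000 passages, 768-d full vectors, 64-d light vectors.
d, d_light, n = 768, 64, 1000
proj = torch.nn.Linear(d, d_light, bias=False)
with torch.no_grad():
    p_full, p_light = index_passages(torch.randn(n, d), proj)
    q, q_light = torch.randn(1, d), proj(torch.randn(1, d))
    scores = pim_score(q, q_light, p_full, p_light)  # (1, 1000)
```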
[1] MARINO K, RASTEGARI M, FARHADI A, et al. OK-VQA: a visual question answering benchmark requiring external knowledge[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 3195-3204.
[2] ANTOL S, AGRAWAL A, LU J S, et al. VQA: visual question answering[C]// 2015 IEEE International Conference on Computer Vision. New York: IEEE Press, 2015: 2425-2433.
[3] SPEER R, CHIN J, HAVASI C. ConceptNet 5.5: an open multilingual graph of general knowledge[EB/OL]. [2024-04-01]. https://ojs.aaai.org/index.php/AAAI/article/view/11164.
[4] NARASIMHAN M, LAZEBNIK S, SCHWING A G. Out of the box: reasoning with graph convolution nets for factual visual question answering[C]// The 32nd International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2018: 2659-2670.
[5] WU J L, LU J S, SABHARWAL A, et al. Multi-modal answer validation for knowledge-based VQA[C]// The 36th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2022: 2712-2721.
[6] GAO F, PING Q, THATTAI G, et al. Transform-retrieve-generate: natural language-centric outside-knowledge visual question answering[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 5067-5077.
[7] GUI L K, WANG B R, HUANG Q Y, et al. KAT: a knowledge augmented transformer for vision-and-language[EB/OL]. [2024-04-01]. https://aclanthology.org/2022.naacl-main.70/.
[8] LUO M, ZENG Y K, BANERJEE P, et al. Weakly-supervised visual-retriever-reader for knowledge-based question answering[EB/OL]. [2024-04-01]. https://aclanthology.org/2021.emnlp-main.517/.
[9] QU C, ZAMANI H, YANG L, et al. Passage retrieval for outside-knowledge visual question answering[C]// The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM, 2021: 1753-1757.
[10] YANG Z Y, GAN Z, WANG J F, et al. An empirical study of GPT-3 for few-shot knowledge-based VQA[C]// The 36th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2022: 3081-3089.
[11] SHAO Z W, YU Z, WANG M, et al. Prompting large language models with answer heuristics for knowledge-based visual question answering[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 14974-14983.
[12] LIN Y Z, XIE Y J, CHEN D D, et al. REVIVE: regional visual representation matters in knowledge-based visual question answering[C]// The 36th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2022: 767.
[13] BROWN T B, MANN B, RYDER N, et al. Language models are few-shot learners[C]// The 34th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2020: 159.
[14] KARPUKHIN V, OĞUZ B, MIN S, et al. Dense passage retrieval for open-domain question answering[EB/OL]. [2024-04-01]. https://aclanthology.org/2020.emnlp-main.550/.
[15] NOGUEIRA R, CHO K. Passage re-ranking with BERT[EB/OL]. [2024-03-01]. https://arxiv.org/abs/1901.04085.
[16] HUMEAU S, SHUSTER K, LACHAUX M A, et al. Poly-encoders: transformer architectures and pre-training strategies for fast and accurate multi-sentence scoring[EB/OL]. [2024-03-01]. https://arxiv.org/abs/1905.01969.
[17] KHATTAB O, ZAHARIA M. ColBERT: efficient and effective passage search via contextualized late interaction over BERT[C]// The 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM, 2020: 39-48.
[18] YE W W, LIU Y D, ZOU L X, et al. Fast semantic matching via flexible contextualized interaction[C]// The 15th ACM International Conference on Web Search and Data Mining. New York: ACM, 2022: 1275-1283.
[19] QU Y Q, DING Y C, LIU J, et al. RocketQA: an optimized training approach to dense passage retrieval for open-domain question answering[EB/OL]. [2024-03-01]. https://aclanthology.org/2021.naacl-main.466/.
[20] YANG Y F, JIN N, LIN K, et al. Neural retrieval for question answering with cross-attention supervised data augmentation[EB/OL]. [2024-03-01]. https://aclanthology.org/2021.acl-short.35/.
[21] GAO L Y, DAI Z Y, CALLAN J. COIL: revisit exact lexical match in information retrieval with contextualized inverted list[EB/OL]. [2024-03-01]. https://aclanthology.org/2021.naacl-main.241/.
[22] LUAN Y, EISENSTEIN J, TOUTANOVA K, et al. Sparse, dense, and attentional representations for text retrieval[J]. Transactions of the Association for Computational Linguistics, 2021, 9: 329-345.
[23] REN R Y, QU Y Q, LIU J, et al. RocketQAv2: a joint training method for dense passage retrieval and passage re-ranking[EB/OL]. [2024-03-01]. https://aclanthology.org/2021.emnlp-main.224/.
[24] THAKUR N, REIMERS N, DAXENBERGER J, et al. Augmented SBERT: data augmentation method for improving bi-encoders for pairwise sentence scoring tasks[EB/OL]. [2024-03-01]. https://aclanthology.org/2021.naacl-main.28/.
[25] REN R Y, LV S W, QU Y Q, et al. PAIR: leveraging passage-centric similarity relation for improving dense passage retrieval[EB/OL]. [2024-04-01]. https://aclanthology.org/2021.findings-acl.191/.
[26] YANG W J, WANG W M, WANG Q Y, et al. Image retrieval method based on perceptual hash algorithm and bag of visual words[J]. Journal of Graphics, 2019, 40(3): 519-524. (in Chinese)
[27] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]// The 31st International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2017: 6000-6010.
[28] LIU Y H, OTT M, GOYAL N, et al. RoBERTa: a robustly optimized BERT pretraining approach[EB/OL]. [2024-03-01]. https://arxiv.org/abs/1907.11692.
[29] XIONG L, XIONG C Y, LI Y, et al. Approximate nearest neighbor negative contrastive learning for dense text retrieval[EB/OL]. [2024-03-01]. https://arxiv.org/abs/2007.00808.
[30] LU Y X, LIU Y D, LIU J X, et al. ERNIE-search: bridging cross-encoder with dual-encoder via self on-the-fly distillation for dense passage retrieval[EB/OL]. [2024-04-01]. https://arxiv.org/abs/2205.09153.
[31] SANTHANAM K, KHATTAB O, SAAD-FALCON J, et al. ColBERTv2: effective and efficient retrieval via lightweight late interaction[EB/OL]. [2024-04-01]. https://aclanthology.org/2022.naacl-main.272/.
[32] GUO Y Y, NIE L Q, WONG Y, et al. A unified end-to-end retriever-reader framework for knowledge-based VQA[C]// The 30th ACM International Conference on Multimedia. New York: ACM, 2022: 2061-2069.
[33] WU J L, MOONEY R. Entity-focused dense passage retrieval for outside-knowledge visual question answering[EB/OL]. [2024-03-01]. https://aclanthology.org/2022.emnlp-main.551/.
[34] YOU J X, YANG Z G, LI Q, et al. A retriever-reader framework with visual entity linking for knowledge-based visual question answering[C]// 2023 IEEE International Conference on Multimedia and Expo. New York: IEEE Press, 2023: 13-18.
[35] LIN W Z, BYRNE B. Retrieval augmented visual question answering with outside knowledge[EB/OL]. [2024-03-01]. https://aclanthology.org/2022.emnlp-main.772/.
[36] LI Z H, YANG N, WANG L, et al. Learning diverse document representations with deep query interactions for dense retrieval[EB/OL]. [2024-03-01]. https://arxiv.org/abs/2208.04232.
[37] |
何柳, 安然, 刘姝妍, 等. 基于知识图谱的航空多模态数据组织与知识发现技术研究[J]. 图学学报, 2024, 45(2): 300-307.
DOI |
HE L, AN R, LIU S Y, et al. Research on knowledge graph-based aviation multi-modal data organization and discovery method[J]. Journal of Graphics, 2024, 45(2): 300-307. (in Chinese)
DOI |
|
[38] ZHANG P C, LI X J, HU X W, et al. VinVL: revisiting visual representations in vision-language models[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 5579-5588.
[39] HU Y S, HUA H, YANG Z Y, et al. PromptCap: prompt-guided task-aware image captioning[EB/OL]. [2024-03-01]. https://arxiv.org/abs/2211.09699.
[40] TIONG A M H, LI J N, LI B Y, et al. Plug-and-play VQA: zero-shot VQA by conjoining large pretrained models with zero training[EB/OL]. [2024-03-01]. https://aclanthology.org/2022.findings-emnlp.67/.
[41] KHASHABI D, MIN S, KHOT T, et al. UnifiedQA: crossing format boundaries with a single QA system[EB/OL]. [2024-04-01]. https://aclanthology.org/2020.findings-emnlp.171.pdf.
[42] GUO J X, LI J N, LI D X, et al. From images to textual prompts: zero-shot visual question answering with frozen large language models[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 10867-10877.
[43] LI J N, LI D X, XIONG C M, et al. BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation[EB/OL]. [2024-04-01]. https://proceedings.mlr.press/v162/li22n.html.
[44] FU X Y, ZHANG S, KWON G, et al. Generate then select: open-ended visual question answering guided by world knowledge[EB/OL]. [2024-04-01]. https://aclanthology.org/2023.findings-acl.147/.
[45] SI Q Y, MO Y C, LIN Z, et al. Combo of thinking and observing for outside-knowledge VQA[EB/OL]. [2024-08-01]. https://aclanthology.org/2023.acl-long.614/.
[46] WANG P, YANG A, MEN R, et al. OFA: unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework[EB/OL]. [2024-04-01]. https://proceedings.mlr.press/v162/wang22al.html.
[47] GOYAL Y, KHOT T, SUMMERS-STAY D, et al. Making the V in VQA matter: elevating the role of image understanding in visual question answering[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 6904-6913.