Journal of Graphics ›› 2024, Vol. 45 ›› Issue (6): 1231-1242. DOI: 10.11996/JG.j.2095-302X.2024061231
ZHENG Hongyan1, WANG Hui2, LIU Hao1, ZHANG Zhiping1, YANG Xiaojuan3, SUN Tao1
Received: 2024-06-21
Accepted: 2024-08-17
Published: 2024-12-31
Online: 2024-12-24
Contact: SUN Tao (1974-), professor, Ph.D. His main research interests cover natural language processing and computer vision. E-mail: sunt@qlu.edu.cn
First author: ZHENG Hongyan (1999-), master student. His main research interest covers visual question answering. E-mail: 1679436540@qq.com
Abstract: Knowledge-based visual question answering (KB-VQA) requires not only image and question information but also relevant knowledge acquired from a knowledge source to answer a question. Existing methods typically use a retriever to fetch external knowledge from a knowledge base, or obtain implicit knowledge directly from a large model, but the available image-text information alone is often insufficient for acquiring the relevant knowledge. An enhanced retrieval strategy is therefore proposed that targets both the query and the external knowledge in the retrieval stage. On the query side, implicit knowledge from a large model is used to augment the existing image and question information, and the augmented image-text information helps the retriever locate more accurate external knowledge in the knowledge base. On the external-knowledge side, a pre-simulated interaction module (PIM) is proposed to enhance the external knowledge: it generates a new lightweight vector for each knowledge vector, and the pre-interaction between the two lets the retriever simulate the interaction between the query and knowledge passages in advance, so as to better capture their semantic relationship. Experimental results show that the improved model achieves 61.3% accuracy on the OK-VQA dataset while retrieving only a small number of knowledge passages.
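As a rough illustration of the query-side enhancement described in the abstract, the sketch below appends implicit knowledge from a large model to the image-text query before dense retrieval. This is a minimal sketch under stated assumptions, not the paper's implementation: `generate_caption`, `implicit_knowledge`, and the `bert-base-uncased` encoder are hypothetical placeholders for the OFA captioner, GPT-3 prompting, and the actual retriever encoder.

```python
import torch
from transformers import AutoModel, AutoTokenizer

def generate_caption(image) -> str:
    # Hypothetical stand-in for an OFA-style captioner.
    return "a man riding a horse on a beach"

def implicit_knowledge(question: str, caption: str) -> str:
    # Hypothetical stand-in for prompting GPT-3 with the caption and the
    # question, keeping its answer candidate / supporting evidence.
    return "horses are commonly ridden on beaches for recreation"

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder retriever encoder
encoder = AutoModel.from_pretrained("bert-base-uncased")

def encode(text: str) -> torch.Tensor:
    """CLS-pooled dense embedding, DPR-style."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
    with torch.no_grad():
        return encoder(**inputs).last_hidden_state[:, 0]  # (1, hidden)

def build_enhanced_query(image, question: str) -> torch.Tensor:
    caption = generate_caption(image)
    hint = implicit_knowledge(question, caption)
    # Enhanced query = original image-text information + implicit knowledge,
    # giving the retriever more signal to match against knowledge passages.
    return encode(f"question: {question} caption: {caption} knowledge: {hint}")

def retrieve(query_vec: torch.Tensor, passage_vecs: torch.Tensor, k: int = 5) -> torch.Tensor:
    # Plain inner-product search over pre-encoded knowledge passages.
    scores = query_vec @ passage_vecs.T  # (1, num_passages)
    return scores.topk(min(k, passage_vecs.shape[0])).indices.squeeze(0)
```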
ZHENG Hongyan, WANG Hui, LIU Hao, ZHANG Zhiping, YANG Xiaojuan, SUN Tao. Research on KB-VQA knowledge retrieval strategy based on implicit knowledge enhancement[J]. Journal of Graphics, 2024, 45(6): 1231-1242.
Fig. 1 Comparison of our method with existing methods ((a) Using a retriever to fetch knowledge from a knowledge base; (b) Treating a large model as the answer generator; (c) Treating a large model as an implicit knowledge base and using it together with an external knowledge base; (d) Our method)
Fig. 2 A concrete example of prompt information enhancing image-text information to assist knowledge retrieval
Fig. 5 Examples from the OK-VQA dataset ((a) Vehicles and transport; (b) Brands, companies and products; (c) Goods, materials and services)
| Method | Venue | Ptrain | Ptest | External knowledge source | PRRecall/% | EM/% | ACC/% |
|---|---|---|---|---|---|---|---|
| 1. BAN+KG+AUG | MM (2020) | - | - | Wikipedia+ConceptNet | - | - | 26.7 |
| 2. ConceptBERT | EMNLP (2020) | - | - | ConceptNet | - | - | 33.7 |
| 3. KRISP | CVPR (2021) | - | - | Wikipedia+ConceptNet | - | - | 38.4 |
| 4. Vis-DPR | EMNLP (2021) | - | - | Google Search | - | - | 39.2 |
| 5. KAT-T5 | NAACL (2021) | 40 | 40 | Wikipedia | - | - | 44.2 |
| 6. VRR | EMNLP (2021) | 100 | 100 | Google Search | - | - | 45.0 |
| 7. RAG | NeurIPS (2020) | 5 | 5 | Google Search | 82.3 | 52.5 | 48.2 |
| 8. TRIG | CVPR (2022) | 100 | 100 | Wikipedia | - | 53.5 | 49.3 |
| 9. RR-VEL | ICME (2023) | 5 | 5 | ConceptNet+Ascent+hasPart | - | 55.6 | 49.4 |
| 10. RA-VQA | EMNLP (2022) | 5 | 5 | Google Search | 82.8 | 58.7 | 53.8 |
| 11. ReVeal | CVPR (2023) | - | - | WIT+CC12M+Wikipedia+VQAv2 | - | - | 59.1 |
| 12. PICa | AAAI (2022) | - | - | GPT-3 | - | - | 48.0 |
| 13. PromptCap | ICCV (2023) | - | - | GPT-3 | - | - | 60.4 |
| 14. Prophet | CVPR (2023) | - | - | GPT-3 | - | - | 61.1 |
| 15. PaLI-15B | ICLR (2023) | - | - | PaLI (15B) | - | - | 56.5 |
| 16. InstructBLIP-7B | NeurIPS (2023) | - | - | InstructBLIP (7B) | - | - | 57.6 |
| 17. PaLM-E-12B | ICML (2023) | - | - | PaLM-E (12B) | - | - | 60.1 |
| 18. REVIVE | NeurIPS (2022) | 45 | 45 | Wikipedia+GPT-3 | - | - | 58.0 |
| 19. RASP | ACL (2023) | - | - | Wikipedia+Codex | - | - | 58.5 |
| 20. Two | ACL (2023) | 75 | 75 | Wikipedia+GPT-3+OFA+VQAv2 | 85.2 | - | 58.7 |
| Ours | - | 5 | 5 | Google Search+GPT-3+OFA | 89.2 | 66.1 | 61.3 |
Table 1 Comparison with existing methods on the OK-VQA dataset
| Model | PRRecall@5 | EM | ACC |
|---|---|---|---|
| w/o all | 82.8 | 58.7 | 53.8 |
| w/o OFA+GPT-3 | 84.7 | 61.5 | 55.0 |
| w/o OFA | 85.8 | 63.7 | 58.9 |
| w/o GPT-3 | 87.1 | 65.1 | 59.8 |
| w/o PIM+R | 82.8 | 62.9 | 58.3 |
| w/o PIM | 87.8 | 65.8 | 60.5 |
| Ours | 89.2 | 66.1 | 61.3 |
Table 2 Ablation study on the model structure/%
| Method | 5 passages | 500 passages | 1 000 passages |
|---|---|---|---|
| Dual encoder | 1 | 75 | 151 |
| PIM | 1 | 75 | 154 |
Table 3 Time consumed by the pre-simulated interaction module when retrieving external knowledge/ms
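One way to read Table 3, as a hedged sketch rather than a reproduction of the module's internals: if the lightweight vector is produced offline alongside each full passage vector, online PIM scoring adds only one extra low-dimensional inner product per passage, which keeps its latency nearly identical to a plain dual encoder. The projection `proj` below is a hypothetical stand-in for however the paper derives the lightweight vectors.

```python
import torch

def index_passages(passage_vecs: torch.Tensor, proj: torch.nn.Linear):
    # Offline: alongside each full passage vector, store a lightweight
    # companion vector (d_light << d); both are indexed in advance.
    return passage_vecs, proj(passage_vecs)

def dual_encoder_score(q: torch.Tensor, p_full: torch.Tensor) -> torch.Tensor:
    return q @ p_full.T  # one matrix product at query time

def pim_score(q: torch.Tensor, q_light: torch.Tensor,
              p_full: torch.Tensor, p_light: torch.Tensor) -> torch.Tensor:
    # Dual-encoder term plus a pre-simulated interaction term: still only
    # matrix products at query time, hence the near-identical timings.
    return q @ p_full.T + q_light @ p_light.T

# Usage sketch: 1 000 passages, 768-d full vectors, 64-d light vectors.
d, d_light, n = 768, 64, 1000
proj = torch.nn.Linear(d, d_light, bias=False)
with torch.no_grad():
    p_full, p_light = index_passages(torch.randn(n, d), proj)
    q, q_light = torch.randn(1, d), proj(torch.randn(1, d))
    scores = pim_score(q, q_light, p_full, p_light)  # (1, 1000)
```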
[1] MARINO K, RASTEGARI M, FARHADI A, et al. OK-VQA: a visual question answering benchmark requiring external knowledge[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 3195-3204.
[2] ANTOL S, AGRAWAL A, LU J S, et al. VQA: visual question answering[C]// 2015 IEEE International Conference on Computer Vision. New York: IEEE Press, 2015: 2425-2433.
[3] SPEER R, CHIN J, HAVASI C. ConceptNet 5.5: an open multilingual graph of general knowledge[EB/OL]. [2024-04-01]. https://ojs.aaai.org/index.php/AAAI/article/view/11164.
[4] NARASIMHAN M, LAZEBNIK S, SCHWING A G. Out of the box: reasoning with graph convolution nets for factual visual question answering[C]// The 32nd International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2018: 2659-2670.
[5] WU J L, LU J S, SABHARWAL A, et al. Multi-modal answer validation for knowledge-based VQA[C]// The 36th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2022: 2712-2721.
[6] GAO F, PING Q, THATTAI G, et al. Transform-retrieve-generate: natural language-centric outside-knowledge visual question answering[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 5067-5077.
[7] GUI L K, WANG B R, HUANG Q Y, et al. KAT: a knowledge augmented transformer for vision-and-language[EB/OL]. [2024-04-01]. https://aclanthology.org/2022.naacl-main.70/.
[8] LUO M, ZENG Y K, BANERJEE P, et al. Weakly-supervised visual-retriever-reader for knowledge-based question answering[EB/OL]. [2024-04-01]. https://aclanthology.org/2021.emnlp-main.517/.
[9] QU C, ZAMANI H, YANG L, et al. Passage retrieval for outside-knowledge visual question answering[C]// The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM, 2021: 1753-1757.
[10] YANG Z Y, GAN Z, WANG J F, et al. An empirical study of GPT-3 for few-shot knowledge-based VQA[C]// The 36th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2022: 3081-3089.
[11] SHAO Z W, YU Z, WANG M, et al. Prompting large language models with answer heuristics for knowledge-based visual question answering[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 14974-14983.
[12] LIN Y Z, XIE Y J, CHEN D D, et al. REVIVE: regional visual representation matters in knowledge-based visual question answering[C]// The 36th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2022: 767.
[13] BROWN T B, MANN B, RYDER N, et al. Language models are few-shot learners[C]// The 34th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2020: 159.
[14] KARPUKHIN V, OĞUZ B, MIN S, et al. Dense passage retrieval for open-domain question answering[EB/OL]. [2024-04-01]. https://aclanthology.org/2020.emnlp-main.550/.
[15] NOGUEIRA R, CHO K. Passage re-ranking with BERT[EB/OL]. [2024-03-01]. https://arxiv.org/abs/1901.04085.
[16] HUMEAU S, SHUSTER K, LACHAUX M A, et al. Poly-encoders: transformer architectures and pre-training strategies for fast and accurate multi-sentence scoring[EB/OL]. [2024-03-01]. https://arxiv.org/abs/1905.01969.
[17] KHATTAB O, ZAHARIA M. ColBERT: efficient and effective passage search via contextualized late interaction over BERT[C]// The 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM, 2020: 39-48.
[18] YE W W, LIU Y D, ZOU L X, et al. Fast semantic matching via flexible contextualized interaction[C]// The 15th ACM International Conference on Web Search and Data Mining. New York: ACM, 2022: 1275-1283.
[19] QU Y Q, DING Y C, LIU J, et al. RocketQA: an optimized training approach to dense passage retrieval for open-domain question answering[EB/OL]. [2024-03-01]. https://aclanthology.org/2021.naacl-main.466/.
[20] YANG Y F, JIN N, LIN K, et al. Neural retrieval for question answering with cross-attention supervised data augmentation[EB/OL]. [2024-03-01]. https://aclanthology.org/2021.acl-short.35/.
[21] GAO L Y, DAI Z Y, CALLAN J. COIL: revisit exact lexical match in information retrieval with contextualized inverted list[EB/OL]. [2024-03-01]. https://aclanthology.org/2021.naacl-main.241/.
[22] LUAN Y, EISENSTEIN J, TOUTANOVA K, et al. Sparse, dense, and attentional representations for text retrieval[J]. Transactions of the Association for Computational Linguistics, 2021, 9: 329-345.
[23] REN R Y, QU Y Q, LIU J, et al. RocketQAv2: a joint training method for dense passage retrieval and passage re-ranking[EB/OL]. [2024-03-01]. https://aclanthology.org/2021.emnlp-main.224/.
[24] THAKUR N, REIMERS N, DAXENBERGER J, et al. Augmented SBERT: data augmentation method for improving bi-encoders for pairwise sentence scoring tasks[EB/OL]. [2024-03-01]. https://aclanthology.org/2021.naacl-main.28/.
[25] REN R Y, LV S W, QU Y Q, et al. PAIR: leveraging passage-centric similarity relation for improving dense passage retrieval[EB/OL]. [2024-04-01]. https://aclanthology.org/2021.findings-acl.191/.
[26] YANG W J, WANG W M, WANG Q Y, et al. Image retrieval method based on perceptual hash algorithm and bag of visual words[J]. Journal of Graphics, 2019, 40(3): 519-524. (in Chinese)
[27] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]// The 31st International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2017: 6000-6010.
[28] LIU Y H, OTT M, GOYAL N, et al. RoBERTa: a robustly optimized BERT pretraining approach[EB/OL]. [2024-03-01]. https://arxiv.org/abs/1907.11692.
[29] XIONG L, XIONG C Y, LI Y, et al. Approximate nearest neighbor negative contrastive learning for dense text retrieval[EB/OL]. [2024-03-01]. https://arxiv.org/abs/2007.00808.
[30] LU Y X, LIU Y D, LIU J X, et al. ERNIE-search: bridging cross-encoder with dual-encoder via self on-the-fly distillation for dense passage retrieval[EB/OL]. [2024-04-01]. https://arxiv.org/abs/2205.09153.
[31] SANTHANAM K, KHATTAB O, SAAD-FALCON J, et al. ColBERTv2: effective and efficient retrieval via lightweight late interaction[EB/OL]. [2024-04-01]. https://aclanthology.org/2022.naacl-main.272/.
[32] GUO Y Y, NIE L Q, WONG Y, et al. A unified end-to-end retriever-reader framework for knowledge-based VQA[C]// The 30th ACM International Conference on Multimedia. New York: ACM, 2022: 2061-2069.
[33] WU J L, MOONEY R. Entity-focused dense passage retrieval for outside-knowledge visual question answering[EB/OL]. [2024-03-01]. https://aclanthology.org/2022.emnlp-main.551/.
[34] YOU J X, YANG Z G, LI Q, et al. A retriever-reader framework with visual entity linking for knowledge-based visual question answering[C]// 2023 IEEE International Conference on Multimedia and Expo. New York: IEEE Press, 2023: 13-18.
[35] LIN W Z, BYRNE B. Retrieval augmented visual question answering with outside knowledge[EB/OL]. [2024-03-01]. https://aclanthology.org/2022.emnlp-main.772/.
[36] LI Z H, YANG N, WANG L, et al. Learning diverse document representations with deep query interactions for dense retrieval[EB/OL]. [2024-03-01]. https://arxiv.org/abs/2208.04232.
[37] |
何柳, 安然, 刘姝妍, 等. 基于知识图谱的航空多模态数据组织与知识发现技术研究[J]. 图学学报, 2024, 45(2): 300-307.
DOI |
HE L, AN R, LIU S Y, et al. Research on knowledge graph-based aviation multi-modal data organization and discovery method[J]. Journal of Graphics, 2024, 45(2): 300-307. (in Chinese)
DOI |
|
[38] ZHANG P C, LI X J, HU X W, et al. VinVL: revisiting visual representations in vision-language models[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 5579-5588.
[39] HU Y S, HUA H, YANG Z Y, et al. PromptCap: prompt-guided task-aware image captioning[EB/OL]. [2024-03-01]. https://arxiv.org/abs/2211.09699.
[40] TIONG A M H, LI J N, LI B Y, et al. Plug-and-play VQA: zero-shot VQA by conjoining large pretrained models with zero training[EB/OL]. [2024-03-01]. https://aclanthology.org/2022.findings-emnlp.67/.
[41] KHASHABI D, MIN S, KHOT T, et al. UnifiedQA: crossing format boundaries with a single QA system[EB/OL]. [2024-04-01]. https://aclanthology.org/2020.findings-emnlp.171.pdf.
[42] GUO J X, LI J N, LI D X, et al. From images to textual prompts: zero-shot visual question answering with frozen large language models[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 10867-10877.
[43] LI J N, LI D X, XIONG C M, et al. BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation[EB/OL]. [2024-04-01]. https://proceedings.mlr.press/v162/li22n.html.
[44] FU X Y, ZHANG S, KWON G, et al. Generate then select: open-ended visual question answering guided by world knowledge[EB/OL]. [2024-04-01]. https://aclanthology.org/2023.findings-acl.147/.
[45] SI Q Y, MO Y C, LIN Z, et al. Combo of thinking and observing for outside-knowledge VQA[EB/OL]. [2024-08-01]. https://aclanthology.org/2023.acl-long.614/.
[46] WANG P, YANG A, MEN R, et al. OFA: unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework[EB/OL]. [2024-04-01]. https://proceedings.mlr.press/v162/wang22al.html.
[47] GOYAL Y, KHOT T, SUMMERS-STAY D, et al. Making the V in VQA matter: elevating the role of image understanding in visual question answering[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 6904-6913.