Journal of Graphics ›› 2024, Vol. 45 ›› Issue (6): 1231-1242.DOI: 10.11996/JG.j.2095-302X.2024061231
• Special Topic on “Large Models and Graphics Technology and Applications” •
ZHENG Hongyan1, WANG Hui2, LIU Hao1, ZHANG Zhiping1, YANG Xiaojuan3, SUN Tao1
Received: 2024-06-21
Accepted: 2024-08-17
Online: 2024-12-31
Published: 2024-12-24
Contact: SUN Tao
About author: ZHENG Hongyan (1999-), master student. His main research interest is visual question answering. E-mail: 1679436540@qq.com
ZHENG Hongyan, WANG Hui, LIU Hao, ZHANG Zhiping, YANG Xiaojuan, SUN Tao. Research on KB-VQA knowledge retrieval strategy based on implicit knowledge enhancement[J]. Journal of Graphics, 2024, 45(6): 1231-1242.
URL: http://www.txxb.com.cn/EN/10.11996/JG.j.2095-302X.2024061231
Fig. 1 Comparison between the proposed method and existing methods ((a) Using a retriever to fetch knowledge from an external knowledge base; (b) Treating large models as answer generators; (c) Treating large models as implicit knowledge bases used in conjunction with external knowledge bases; (d) The proposed method)
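To make setups (a)-(c) concrete, the following is a minimal, hypothetical Python sketch of how an image caption, passages retrieved from an external knowledge base, and implicit knowledge produced by a large language model could be assembled into one answer-generation prompt; all function and field names are illustrative and are not taken from the paper's implementation.

```python
# Hypothetical prompt assembly for retrieval-augmented VQA (not the paper's code).

def build_vqa_prompt(question: str,
                     caption: str,
                     retrieved_passages: list[str],
                     implicit_knowledge: list[str],
                     max_passages: int = 5) -> str:
    """Concatenate visual context, external knowledge, and implicit knowledge."""
    lines = [f"Image description: {caption}"]
    for i, passage in enumerate(retrieved_passages[:max_passages], start=1):
        lines.append(f"External knowledge {i}: {passage}")
    for i, fact in enumerate(implicit_knowledge, start=1):
        lines.append(f"Implicit knowledge {i}: {fact}")
    lines.append(f"Question: {question}")
    lines.append("Answer:")
    return "\n".join(lines)


if __name__ == "__main__":
    prompt = build_vqa_prompt(
        question="What sport can be played with this item?",
        caption="A wooden bat lying on green grass.",
        retrieved_passages=["A baseball bat is used to hit the ball in baseball."],
        implicit_knowledge=["Bats of this shape are typically used in baseball."],
    )
    print(prompt)
```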
| Method | Venue | Ptrain | Ptest | External knowledge source | PRRecall/% | EM/% | ACC/% |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1.BAN+KG+AUG | MM(2020) | - | - | Wikipedia+ConceptNet | | | 26.7 |
| 2.ConceptBERT | EMNLP(2020) | - | - | ConceptNet | | | 33.7 |
| 3.KRISP | CVPR(2021) | - | - | Wikipedia+ConceptNet | | | 38.4 |
| 4.Vis-DPR | EMNLP(2021) | - | - | Google Search | | | 39.2 |
| 5.KAT-T5 | NAACL(2021) | 40 | 40 | Wikipedia | | | 44.2 |
| 6.VRR | EMNLP(2021) | 100 | 100 | Google Search | | | 45.0 |
| 7.RAG | NeurIPS(2020) | 5 | 5 | Google Search | 82.3 | 52.5 | 48.2 |
| 8.TRIG | CVPR(2022) | 100 | 100 | Wikipedia | | 53.5 | 49.3 |
| 9.RR-VEL | ICME(2023) | 5 | 5 | ConceptNet+Ascent+hasPart | | 55.6 | 49.4 |
| 10.RA-VQA | EMNLP(2022) | 5 | 5 | Google Search | 82.8 | 58.7 | 53.8 |
| 11.ReVeal | CVPR(2023) | - | - | WIT+CC12M+Wikipedia+VQAv2 | | | 59.1 |
| 12.PICA | AAAI(2022) | - | - | GPT-3 | | | 48.0 |
| 13.PromptCap | ICCV(2023) | - | - | GPT-3 | | | 60.4 |
| 14.Prophet | CVPR(2023) | - | - | GPT-3 | | | 61.1 |
| 15.PaLI-15B | ICLR(2023) | - | - | PaLI(15B) | | | 56.5 |
| 16.InstructBLIP-7B | NeurIPS(2023) | - | - | InstructBLIP(7B) | | | 57.6 |
| 17.PaLM-E-12B | ICML(2023) | - | - | PaLM-E(12B) | | | 60.1 |
| 18.REVIVE | NeurIPS(2022) | 45 | 45 | Wikipedia+GPT-3 | | | 58.0 |
| 19.RASP | ACL(2023) | - | - | Wikipedia+Codex | | | 58.5 |
| 20.Two | ACL(2023) | 75 | 75 | Wikipedia+GPT-3+OFA+VQAv2 | 85.2 | | 58.7 |
| Ours | - | 5 | 5 | Google Search+GPT-3+OFA | 89.2 | 66.1 | 61.3 |
Table 1 Comparison with existing methods on the OK-VQA dataset
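The sketch below follows the common definitions of EM, soft VQA accuracy, and pseudo-relevance recall (PRRecall@k) used in the OK-VQA and retrieval-augmented VQA literature; the paper's own evaluation scripts (e.g., answer normalization, tie handling) may differ in detail.

```python
# Hedged sketch of the metrics in Table 1, using their standard literature definitions.

def exact_match(prediction: str, gold_answers: list[str]) -> float:
    """EM: 1 if the prediction matches any annotated answer, else 0."""
    pred = prediction.strip().lower()
    return float(any(pred == g.strip().lower() for g in gold_answers))


def vqa_accuracy(prediction: str, gold_answers: list[str]) -> float:
    """Soft VQA accuracy: min(#annotations matching the prediction / 3, 1)."""
    pred = prediction.strip().lower()
    matches = sum(pred == g.strip().lower() for g in gold_answers)
    return min(matches / 3.0, 1.0)


def pseudo_relevance_recall(passages: list[str], gold_answers: list[str], k: int = 5) -> float:
    """PRRecall@k: 1 if any of the top-k passages contains a gold answer string."""
    top_k = [p.lower() for p in passages[:k]]
    return float(any(g.strip().lower() in p for p in top_k for g in gold_answers))
```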
| Model | PRRecall@5 | EM | ACC |
| --- | --- | --- | --- |
| w/o all | 82.8 | 58.7 | 53.8 |
| w/o OFA+GPT | 84.7 | 61.5 | 55.0 |
| w/o OFA | 85.8 | 63.7 | 58.9 |
| w/o GPT-3 | 87.1 | 65.1 | 59.8 |
| w/o PIM+R | 82.8 | 62.9 | 58.3 |
| w/o PIM | 87.8 | 65.8 | 60.5 |
| Ours | 89.2 | 66.1 | 61.3 |
Table 2 Ablation study on the model structure/%
| Method | Number: 5 | Number: 500 | Number: 1 000 |
| --- | --- | --- | --- |
| Dual-encoder | 1 | 75 | 151 |
| PIM | 1 | 75 | 154 |
Table 3 Time consumed by the pre-simulated interaction module for retrieving external knowledge/ms
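For context on Table 3, the snippet below is a minimal NumPy sketch of the dual-encoder baseline it compares against: passage embeddings are pre-computed offline, so scoring N candidates at query time reduces to a single matrix-vector product. It does not implement the paper's pre-simulated interaction module (PIM); shapes and names are hypothetical.

```python
# Minimal dual-encoder scoring baseline (illustrative only, not the paper's PIM).
import numpy as np


def score_dual_encoder(query_vec: np.ndarray, passage_matrix: np.ndarray) -> np.ndarray:
    """Dot-product relevance scores for one query against N pre-encoded passages.

    query_vec:      (d,)   query embedding
    passage_matrix: (N, d) pre-computed passage embeddings
    returns:        (N,)   relevance scores
    """
    return passage_matrix @ query_vec


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, n = 768, 1000                      # embedding size and candidate count
    passages = rng.standard_normal((n, d)).astype(np.float32)
    query = rng.standard_normal(d).astype(np.float32)
    scores = score_dual_encoder(query, passages)
    top5 = np.argsort(-scores)[:5]        # indices of the 5 best passages
    print(top5)
```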
[1] | MARINO K, RASTEGARI M, FARHADI A, et al. OK-VQA: a visual question answering benchmark requiring external knowledge[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 3195-3204. |
[2] | ANTOL S, AGRAWAL A, LU J S, et al. VQA: visual question answering[C]// 2015 IEEE International Conference on Computer Vision. New York: IEEE Press, 2015: 2425-2433. |
[3] | SPEER R, CHIN J, HAVASI C. ConceptNet 5.5: an open multilingual graph of general knowledge[EB/OL]. [2024-04-01]. https://ojs.aaai.org/index.php/AAAI/article/view/11164. |
[4] | NARASIMHAN M, LAZEBNIK S, SCHWING A G. Out of the box: reasoning with graph convolution nets for factual visual question answering[C]// The 32nd International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2018: 2659-2670. |
[5] | WU J L, LU J S, SABHARWAL A, et al. Multi-modal answer validation for knowledge-based VQA[C]// The 36th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2022: 2712-2721. |
[6] | GAO F, PING Q, THATTAI G, et al. Transform-retrieve-generate: natural language-centric outside-knowledge visual question answering[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 5067-5077. |
[7] | GUI L K, WANG B R, HUANG Q Y, et al. KAT: a knowledge augmented transformer for vision-and-language[EB/OL]. [2024-04-01]. https://aclanthology.org/2022.naacl-main.70/. |
[8] | LUO M, ZENG Y K, BANERJEE P, et al. Weakly-supervised visual-retriever-reader for knowledge-based question answering[EB/OL]. [2024-04-01]. https://aclanthology.org/2021.emnlp-main.517/. |
[9] | QU C, ZAMANI H, YANG L, et al. Passage retrieval for outside-knowledge visual question answering[C]// The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM, 2021: 1753-1757. |
[10] | YANG Z Y, GAN Z, WANG J F, et al. An empirical study of GPT-3 for few-shot knowledge-based VQA[C]// The 36th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2022: 3081-3089. |
[11] | SHAO Z W, YU Z, WANG M, et al. Prompting large language models with answer heuristics for knowledge-based visual question answering[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 14974-14983. |
[12] | LIN Y Z, XIE Y J, CHEN D D, et al. REVIVE: regional visual representation matters in knowledge-based visual question answering[C]// The 36th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2022: 767. |
[13] | BROWN T B, MANN B, RYDER N, et al. Language models are few-shot learners[C]// The 34th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2020: 159. |
[14] | KARPUKHIN V, OĞUZ B, MIN S, et al. Dense passage retrieval for open-domain question answering[EB/OL]. [2024-04-01]. https://aclanthology.org/2020.emnlp-main.550/. |
[15] | NOGUEIRA R, CHO K. Passage Re-ranking with BERT[EB/OL]. [2024-03-01]. https://arxiv.org/abs/1901.04085. |
[16] | HUMEAU S, SHUSTER K, LACHAUX M A, et al. Poly-encoders: transformer architectures and pre-training strategies for fast and accurate multi-sentence scoring[EB/OL]. [2024-03-01]. https://arxiv.org/abs/1905.01969. |
[17] | KHATTAB O, ZAHARIA M. ColBERT: efficient and effective passage search via contextualized late interaction over BERT[C]// The 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM, 2020: 39-48. |
[18] | YE W W, LIU Y D, ZOU L X, et al. Fast semantic matching via flexible contextualized interaction[C]// The 15th ACM International Conference on Web Search and Data Mining. New York: ACM, 2022: 1275-1283. |
[19] | QU Y Q, DING Y C, LIU J, et al. RocketQA: an optimized training approach to dense passage retrieval for open-domain question answering[EB/OL]. [2024-03-01]. https://aclanthology.org/2021.naacl-main.466/. |
[20] | YANG Y F, JIN N, LIN K, et al. Neural retrieval for question answering with cross-attention supervised data augmentation[EB/OL]. [2024-03-01]. https://aclanthology.org/2021.acl-short.35/. |
[21] | GAO L Y, DAI Z Y, CALLAN J. COIL: revisit exact lexical match in information retrieval with contextualized inverted list[EB/OL]. [2024-03-01]. https://aclanthology.org/2021.naacl-main.241/. |
[22] | LUAN Y, EISENSTEIN J, TOUTANOVA K, et al. Sparse, dense, and attentional representations for text retrieval[J]. Transactions of the Association for Computational Linguistics, 2021, 9: 329-345. |
[23] | REN R Y, QU Y Q, LIU J, et al. RocketQAv2: a joint training method for dense passage retrieval and passage re-ranking[EB/OL]. [2024-03-01]. https://aclanthology.org/2021.emnlp-main.224/. |
[24] | THAKUR N, REIMERS N, DAXENBERGER J, et al. Augmented SBERT: data augmentation method for improving bi-encoders for pairwise sentence scoring tasks[EB/OL]. [2024-03-01]. https://aclanthology.org/2021.naacl-main.28/. |
[25] | REN R Y, LV S W, QU Y Q, et al. PAIR: leveraging passage-centric similarity relation for improving dense passage retrieval[EB/OL]. [2024-04-01]. https://aclanthology.org/2021.findings-acl.191/. |
[26] | YANG W J, WANG W M, WANG Q Y, et al. Image retrieval method based on perceptual hash algorithm and bag of visual words[J]. Journal of Graphics, 2019, 40(3): 519-524. (in Chinese) |
[27] | VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]// The 31st International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2017: 6000-6010. |
[28] | LIU Y H, OTT M, GOYAL N, et al. RoBERTa: a robustly optimized BERT pretraining approach[EB/OL]. [2024-03-01]. https://arxiv.org/abs/1907.11692. |
[29] | XIONG L, XIONG C Y, LI Y, et al. Approximate nearest neighbor negative contrastive learning for dense text retrieval[EB/OL]. [2024-03-01]. https://arxiv.org/abs/2007.00808. |
[30] | LU Y X, LIU Y D, LIU J X, et al. ERNIE-search: bridging cross-encoder with dual-encoder via self on-the-fly distillation for dense passage retrieval[EB/OL]. [2024-04-01]. https://arxiv.org/abs/2205.09153. |
[31] | SANTHANAM K, KHATTAB O, SAAD-FALCON J, et al. ColBERTv2: effective and efficient retrieval via lightweight late interaction[EB/OL]. [2024-04-01]. https://aclanthology.org/2022.naacl-main.272/. |
[32] | GUO Y Y, NIE L Q, WONG Y, et al. A unified end-to-end retriever-reader framework for knowledge-based VQA[C]// The 30th ACM International Conference on Multimedia. New York: ACM, 2022: 2061-2069. |
[33] | WU J L, MOONEY R. Entity-focused dense passage retrieval for outside-knowledge visual question answering[EB/OL]. [2024-03-01]. https://aclanthology.org/2022.emnlp-main.551/. |
[34] | YOU J X, YANG Z G, LI Q, et al. A retriever-reader framework with visual entity linking for knowledge-based visual question answering[C]// 2023 IEEE International Conference on Multimedia and Expo. New York: IEEE Press, 2023: 13-18. |
[35] | LIN W Z, BYRNE B. Retrieval augmented visual question answering with outside knowledge[EB/OL]. [2024-03-01]. https://aclanthology.org/2022.emnlp-main.772/. |
[36] | LI Z H, YANG N, WANG L, et al. Learning diverse document representations with deep query interactions for dense retrieval[EB/OL]. [2024-03-01]. https://arxiv.org/abs/2208.04232. |
[37] | HE L, AN R, LIU S Y, et al. Research on knowledge graph-based aviation multi-modal data organization and discovery method[J]. Journal of Graphics, 2024, 45(2): 300-307. (in Chinese) |
[38] | ZHANG P C, LI X J, HU X W, et al. VinVL: revisiting visual representations in vision-language models[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 5579-5588. |
[39] | HU Y S, HUA H, YANG Z Y, et al. PromptCap: prompt-guided task-aware image captioning[EB/OL]. [2024-03-01]. https://arxiv.org/abs/2211.09699. |
[40] | TIONG A M H, LI J N, LI B Y, et al. Plug-and-play VQA: zero-shot VQA by conjoining large pretrained models with zero training[EB/OL]. [2024-03-01]. https://aclanthology.org/2022.findings-emnlp.67/. |
[41] | KHASHABI D, MIN S, KHOT T, et al. UnifiedQA: crossing format boundaries with a single QA system[EB/OL]. [2024-04-01]. https://aclanthology.org/2020.findings-emnlp.171.pdf. |
[42] | GUO J X, LI J N, LI D X, et al. From images to textual prompts: zero-shot visual question answering with frozen large language models[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 10867-10877. |
[43] | LI J N, LI D X, XIONG C M, et al. BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation[EB/OL]. [2024-04-01]. https://proceedings.mlr.press/v162/li22n.html. |
[44] | FU X Y, ZHANG S, KWON G, et al. Generate then select: open-ended visual question answering guided by world knowledge[EB/OL]. [2024-04-01]. https://aclanthology.org/2023.findings-acl.147/. |
[45] | SI Q Y, MO Y C, LIN Z, et al. Combo of thinking and observing for outside-knowledge VQA[EB/OL]. [2024-08-01]. https://aclanthology.org/2023.acl-long.614/. |
[46] | WANG P, YANG A, MEN R, et al. OFA: unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework[EB/OL]. [2024-04-01]. https://proceedings.mlr.press/v162/wang22al.html. |
[47] | GOYAL Y, KHOT T, SUMMERS-STAY D, et al. Making the V in VQA matter: elevating the role of image understanding in visual question answering[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 6904-6913. |