
Journal of Graphics (图学学报) ›› 2024, Vol. 45 ›› Issue (6): 1231-1242. DOI: 10.11996/JG.j.2095-302X.2024061231

• Special Topic: "Large Models and Graphics Technology and Applications" •


Research on KB-VQA knowledge retrieval strategy based on implicit knowledge enhancement

ZHENG Hongyan1, WANG Hui2, LIU Hao1, ZHANG Zhiping1, YANG Xiaojuan3, SUN Tao1

  1. Department of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan, Shandong 250353, China
    2. Affiliated Middle School of Shandong Normal University, Jinan, Shandong 250014, China
    3. Faculty of Education, Shandong Normal University, Jinan, Shandong 250014, China
  • Received: 2024-06-21   Accepted: 2024-08-17   Published: 2024-12-31   Online: 2024-12-24
  • Corresponding author: SUN Tao (1974-), professor, Ph.D. His main research interests include natural language processing and computer vision. E-mail: sunt@qlu.edu.cn
  • First author: ZHENG Hongyan (1999-), master's student. His main research interest is visual question answering. E-mail: 1679436540@qq.com
  • Supported by:
    Major Innovation Project of the Pilot Program for the Integration of Science, Education, and Industry, Qilu University of Technology (Shandong Academy of Sciences) (2024ZDZX08); Shandong Provincial Natural Science Foundation General Program (ZR202211190244); Shandong Province Science and Technology SME Innovation Capability Enhancement Project (2023TSGC0212)


Abstract:

Knowledge-based visual question answering (KB-VQA) requires not only image and question information but also relevant knowledge from external sources to answer questions accurately. Existing methods typically use a retriever to fetch external knowledge from a knowledge base, or rely on implicit knowledge drawn directly from large models; however, the available image and text information alone often proves insufficient for acquiring the relevant knowledge. To address this issue, an enhanced retrieval strategy was proposed for both the query and the external knowledge. On the query side, implicit knowledge from a large model was used to enrich the existing image and question information, helping the retriever locate more accurate external knowledge in the knowledge base. On the external knowledge side, a pre-simulation interaction module was introduced to enhance the external knowledge: this module generates a new lightweight vector for each knowledge vector, and through the advance interaction of the two, the retriever can pre-simulate the interaction between the query and the knowledge passage, thus better capturing their semantic relationship. Experimental results demonstrate that the improved model achieves an accuracy of 61.3% on the OK-VQA dataset while retrieving only a small amount of knowledge.
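The two-sided strategy described in the abstract can be sketched schematically. The sketch below is illustrative only, not the paper's implementation: the encoder is a deterministic stand-in, and the names `enhance_query`, `presimulate`, and `retrieve`, along with the mixing weight `alpha`, are hypothetical choices made for this example.

```python
# Illustrative sketch of the two-sided retrieval strategy (NOT the paper's code).
# Query side: the question is enriched with the image caption and implicit
# knowledge from a large model before encoding. Knowledge side: each passage
# embedding is augmented with a precomputed lightweight vector that stands in
# for a pre-simulated query-passage interaction.
import zlib

import numpy as np

DIM = 8  # toy embedding size


def embed(text: str) -> np.ndarray:
    """Stand-in encoder: a deterministic pseudo-embedding seeded by the text."""
    g = np.random.default_rng(zlib.crc32(text.encode("utf-8")))
    v = g.standard_normal(DIM)
    return v / np.linalg.norm(v)


def enhance_query(question: str, caption: str, implicit_knowledge: str) -> np.ndarray:
    """Concatenate question, image caption, and LLM-provided implicit knowledge."""
    return embed(" ".join([question, caption, implicit_knowledge]))


def presimulate(passage_vec: np.ndarray, proj: np.ndarray, alpha: float = 0.1) -> np.ndarray:
    """Add a lightweight projected vector to the passage embedding, mimicking
    query-passage interaction ahead of retrieval time."""
    v = passage_vec + alpha * (proj @ passage_vec)
    return v / np.linalg.norm(v)


def retrieve(query_vec: np.ndarray, passage_vecs: np.ndarray, k: int = 1) -> np.ndarray:
    """Rank passages by dot-product similarity and return the top-k indices."""
    scores = passage_vecs @ query_vec
    return np.argsort(-scores)[:k]
```

In a real system, `enhance_query` would feed the enriched text (and image features) through the retriever's query tower, and the projection in `presimulate` would be learned so that passage vectors anticipate their interaction with queries.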

Key words: visual question answering, knowledge retrieval, text-image enhancement, pre-simulated interaction, multi-modal
