
Journal of Graphics, 2025, Vol. 46, Issue (5): 919-930. DOI: 10.11996/JG.j.2095-302X.2025050919

• Review •

Three realms of Visual Turing: progress and prospects of visual intelligence in the era of large models

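HUANG Kaiqi1,2,3, WU Meiqi1,2, CHEN Honghao1, FENG Xiaokun1,3, ZHANG Dailing1

  1 Center for Research on Intelligent System and Engineering & Key Laboratory of Complex System Intelligent Control and Decision, Institute of Automation, Chinese Academy of Sciences, Beijing 100190
  2 School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049
  3 School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049
  • Received: 2025-07-07 Accepted: 2025-08-20 Published: 2025-10-30 Online: 2025-09-10
  • First author: HUANG Kaiqi (1977-), male, professor, Ph.D. His main research interests are computer vision and cognitive decision-making. E-mail: kaiqi.huang@nlpr.ia.ac.cn
  • Supported by:
    National Science and Technology Major Project on New Generation Artificial Intelligence (2022ZD0116403)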

The three realms of Visual Turing: from seeing to imagining in the LLM era

HUANG Kaiqi1,2,3, WU Meiqi1,2, CHEN Honghao1, FENG Xiaokun1,3, ZHANG Dailing1

  1 Center for Research on Intelligent System and Engineering & Key Laboratory of Complex System Intelligent Control and Decision, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
  2 School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China
  3 School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2025-07-07 Accepted: 2025-08-20 Published: 2025-10-30 Online: 2025-09-10
  • First author: HUANG Kaiqi (1977-), professor, Ph.D. His main research interests cover computer vision and cognitive decision-making. E-mail: kaiqi.huang@nlpr.ia.ac.cn
  • Supported by:
    National Science and Technology Major Project (2022ZD0116403)

Abstract:

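Visual Turing evaluates computer vision models by means of Turing-style tests, providing a human-like evaluation benchmark for the development of computer vision. With the arrival of the large model era, the rapid development of computer vision technology has greatly improved visual capabilities, with particularly strong performance in image classification, object detection and segmentation, and video understanding. Compared with human vision, however, these algorithms still show significant gaps in adaptability, cross-scene generalization, and high-level cognitive reasoning. Starting from the three realms of Visual Turing (Seeing the Visible, Seeing the Cognized, and Seeing the Conceived), this paper reviews the development of visual intelligence, organizes and analyzes the bottlenecks and challenges facing intelligent technologies in the large model era, and describes the capability progression of visual intelligence from perception of the physical world, through semantic understanding and cognition, to modeling of subjective mental states. It thereby offers ideas for bringing computer vision closer to human visual perception and cognition.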

Keywords: three realms of Visual Turing, Visual Turing, multimodal large models, visual intelligence, human-like intelligence

Abstract:

Visual Turing evaluates computer vision models through Turing-style assessment, offering a human-aligned benchmark for advancing visual intelligence. With the advent of large language models (LLMs), computer vision technologies have advanced rapidly, achieving remarkable performance in tasks such as image classification, object detection and segmentation, and video understanding. Despite these achievements, a significant gap remains between current algorithms and human visual cognition in adaptability and generalization. The evolution of visual intelligence was revisited from the perspective of its three progressive levels (Seeing the Visible, Seeing the Cognized, and Seeing the Conceived), and the limitations and challenges of current technologies were systematically examined. The objective was to drive computer vision toward a more human-like capacity for perception and cognition.

Key words: Visual Turing three realms, Visual Turing test, MLLMs, visual intelligence, human-like intelligence

CLC number: