The three realms of Visual Turing: from seeing to imagining in the LLM era

Journal of Graphics, 2025, Vol. 46, Issue (5): 919-930. DOI: 10.11996/JG.j.2095-302X.2025050919

HUANG Kaiqi¹,²,³, WU Meiqi¹,², CHEN Honghao¹, FENG Xiaokun¹,³, ZHANG Dailing¹

¹ Center for Research on Intelligent System and Engineering & Key Laboratory of Complex System Intelligent Control and Decision, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
² School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China
³ School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China

Received: 2025-07-07 Accepted: 2025-08-20 Published: 2025-10-30 Online: 2025-09-10
First author: HUANG Kaiqi (1977-), male, professor, Ph.D. His main research interests cover computer vision and cognitive decision-making. E-mail: kaiqi.huang@nlpr.ia.ac.cn
Supported by:
National Science and Technology Major Project on New Generation Artificial Intelligence (2022ZD0116403)

Abstract:

Visual Turing evaluates computer vision models through Turing-style assessment, providing a human-aligned benchmark for the development of computer vision. With the advent of the era of large models, including large language models (LLMs), computer vision has advanced rapidly and now performs strongly in image classification, object detection and segmentation, and video understanding. Compared with human vision, however, current algorithms still show significant gaps in adaptability, cross-scenario generalization, and high-level cognitive reasoning. This paper reviews the development of visual intelligence from the perspective of the three realms of Visual Turing (Seeing the Visible, Seeing the Cognized, and Seeing the Conceived), and analyzes the bottlenecks and challenges that intelligent technologies face in the large-model era. It describes the path along which visual intelligence progresses from perception of the physical world, to semantic understanding and cognition, and then to modeling of subjective mental states. The aim is to offer ideas for bringing computer vision closer to human visual perception and cognition.

Key words: three realms of Visual Turing, Visual Turing test, multimodal large language models (MLLMs), visual intelligence, human-like intelligence

CLC number: TP391.41

HUANG Kaiqi, WU Meiqi, CHEN Honghao, FENG Xiaokun, ZHANG Dailing. The three realms of Visual Turing: from seeing to imagining in the LLM era[J]. Journal of Graphics, 2025, 46(5): 919-930.

Figures/Tables: 4

Fig. 1 Visual Turing framework: the three levels of visual intelligence

Fig. 2 Representative tasks in the “Seeing the Visible” stage ((a) Image classification; (b) Image detection; (c) Image segmentation)

Fig. 3 Representative tasks in the “Seeing the Cognized” stage ((a) Image captioning; (b) Visual question answering; (c) Visual reasoning)

Fig. 4 Representative tasks in the “Seeing the Conceived” stage ((a) Sand tray projective test; (b) Drawing projective test)

