The three realms of Visual Turing: from seeing to imagining in the LLM era

doi:10.11996/JG.j.2095-302X.2025050919

Abstract

Visual Turing evaluates computer vision models through a Turing-style assessment, offering a human-aligned benchmark for advancing visual intelligence. With the advent of large language models (LLMs), computer vision technologies have advanced rapidly, achieving remarkable performance in tasks such as image classification, object detection, segmentation, and video understanding. Despite these technical achievements, a significant gap remains between current algorithms and human visual cognition in adaptability and generalization. This paper revisits the evolution of visual intelligence through three progressive levels (Seeing the Visible, Seeing the Cognized, and Seeing the Conceived) and systematically examines the limitations and challenges of current technologies. The objective is to drive computer vision toward a more human-like capacity for perception and cognition.

Key words: Visual Turing three realms, Visual Turing test, MLLMs, visual intelligence, human-like intelligence

CLC Number: TP391.41

HUANG Kaiqi, WU Meiqi, CHEN Honghao, FENG Xiaokun, ZHANG Dailing. The three realms of Visual Turing: from seeing to imagining in the LLM era[J]. Journal of Graphics, 2025, 46(5): 919-930.

Figures/Tables 4

Fig. 1 Visual Turing framework: the three levels of visual intelligence

Fig. 2 Representative tasks in “Seeing the Visible” ((a) Image classification; (b) Image detection; (c) Image segmentation)

Fig. 3 Representative tasks in “Seeing the Cognized” ((a) Image captioning; (b) Visual question answering; (c) Visual reasoning)

Fig. 4 Representative tasks in “Seeing the Conceived” ((a) Sand tray projective test; (b) Drawing projective test)
