
Journal of Graphics ›› 2025, Vol. 46 ›› Issue (3): 558-567. DOI: 10.11996/JG.j.2095-302X.2025030558

• Image Processing and Computer Vision •

Research on multimodal text-visual large model for robotic terrain perception algorithm

SUN Hao1, XIE Tao1, HE Long2, GUO Wenzhong3, YU Yongfang2, WU Qijun2, WANG Jianwei2, DONG Hui4,5

1. School of Mechanical Engineering and Automation, Fuzhou University, Fuzhou Fujian 350108, China
    2. Hangzhou Zhiyuan Research Institute Co., Ltd, Hangzhou Zhejiang 310008, China
    3. School of Computer Science and Big Data, Fuzhou University, Fuzhou Fujian 350108, China
    4. School of Mechatronics Engineering, Harbin Institute of Technology, Harbin Heilongjiang 150001, China
    5. State Key Laboratory of Robotics and System, Harbin Institute of Technology, Harbin Heilongjiang 150001, China
  • Received: 2024-08-01  Accepted: 2025-01-22  Online: 2025-06-30  Published: 2025-06-13
  • About author:
    SUN Hao (1986-), professor, Ph.D. His main research interests cover sensors and artificial intelligence. E-mail: sunnice@hit.edu.cn

  • Supported by:
    National Natural Science Foundation of China (T2388101)

Abstract:

A terrain segmentation algorithm based on the fusion of information from multimodal text-visual large models was proposed to enhance the intelligent perception capability of robots in dynamic and complex environments. The algorithm integrated simple linear iterative clustering (SLIC) for image data preprocessing, contrastive language-image pre-training (CLIP) and the segment anything model (SAM) for mask generation, and the Dice coefficient for post-processing. First, the original input image was preprocessed with SLIC to obtain superpixel segmentation blocks, and prompt points were added to improve the quality of subsequent masks, which significantly enhanced terrain classification accuracy. Next, the CLIP large model, pre-trained on text-image data, was used to match the input visual images with predefined terrain text descriptions, leveraging its interpretability and zero-shot learning capabilities to generate sets of terrain prompt points. The SAM large model then generated mask data with semantic labels from these point sets, and the Dice coefficient was applied in post-processing to select usable masks. Using the Cityscapes dataset as the terrain segmentation benchmark, the proposed algorithm was shown to outperform mainstream segmentation algorithms under both supervised and unsupervised learning frameworks. Without the need for labeled data, the algorithm achieved a mask generation rate of 76.58% and an intersection over union (IoU) of 90.14%. For the terrain perception task of a quadruped robot, a U-Net encoder-decoder network was added as a quantitative validation module. Using the generated masks as a dataset, a lightweight terrain segmentation model was constructed, deployed on the edge computing device of the quadruped robot, and evaluated in real-world terrain segmentation experiments. The experimental results demonstrated that the two mask optimization methods proposed in this paper improved the model's mean IoU (MIoU) by 2.36% and 2.56%, respectively, with the final lightweight model achieving an MIoU of 96.34%, indicating reliable terrain segmentation accuracy. The segmentation algorithm effectively guided the robot to navigate quickly and safely from the starting point to the target location while avoiding non-geometric obstacles such as grass.
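As a rough illustration of the SLIC-plus-CLIP stage described above, the following Python sketch segments an image into superpixels and assigns each one a terrain label by CLIP zero-shot matching. The label set, the choice of superpixel centroids as prompt points, and the patch-based scoring are illustrative assumptions, not the authors' implementation.

```python
# Sketch only: TERRAIN_LABELS, centroid prompt points, and crop-based
# CLIP scoring are assumptions for illustration, not the paper's method.
import numpy as np
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
from PIL import Image
from skimage.segmentation import slic

TERRAIN_LABELS = ["road", "sidewalk", "grass"]  # hypothetical terrain texts

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
text_tokens = clip.tokenize([f"a photo of {t}" for t in TERRAIN_LABELS]).to(device)

def terrain_prompt_points(image: np.ndarray, n_segments: int = 200):
    """image: RGB uint8 array. Returns [((x, y), label), ...], one per superpixel."""
    segments = slic(image, n_segments=n_segments, compactness=10)
    points = []
    for seg_id in np.unique(segments):
        ys, xs = np.nonzero(segments == seg_id)
        # The superpixel centroid serves as the candidate prompt point for SAM.
        point = (int(xs.mean()), int(ys.mean()))
        # Classify a bounding-box crop of the superpixel with CLIP (zero-shot).
        crop = image[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
        patch = preprocess(Image.fromarray(crop)).unsqueeze(0).to(device)
        with torch.no_grad():
            logits_per_image, _ = model(patch, text_tokens)
        points.append((point, TERRAIN_LABELS[int(logits_per_image.argmax())]))
    return points
```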
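The Dice-coefficient post-processing step can likewise be read as a simple overlap filter. Below is a minimal sketch, assuming each labeled SAM mask is scored against a paired reference region and that 0.5 is an acceptable cutoff; the abstract does not specify the pairing or the threshold.

```python
# Sketch only: the mask/region pairing and the 0.5 threshold are assumptions.
import numpy as np

def dice_coefficient(a: np.ndarray, b: np.ndarray) -> float:
    """Dice = 2 * |A intersect B| / (|A| + |B|) for two boolean masks of equal shape."""
    total = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / total if total > 0 else 0.0

def select_usable_masks(labeled_masks, reference_regions, threshold=0.5):
    """Keep (mask, label) pairs whose Dice score against the paired region
    reaches the (assumed) threshold."""
    return [(mask, label)
            for (mask, label), region in zip(labeled_masks, reference_regions)
            if dice_coefficient(mask, region) >= threshold]
```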

Key words: deep learning, text-visual large models, quadruped robots, terrain perception, computer vision
