
Journal of Graphics ›› 2025, Vol. 46 ›› Issue (3): 558-567. DOI: 10.11996/JG.j.2095-302X.2025030558

• Image Processing and Computer Vision •

Research on multimodal text-visual large model for robotic terrain perception algorithm

SUN Hao1, XIE Tao1, HE Long2, GUO Wenzhong3, YU Yongfang2, WU Qijun2, WANG Jianwei2, DONG Hui4,5

1. School of Mechanical Engineering and Automation, Fuzhou University, Fuzhou Fujian 350108, China
    2. Hangzhou Zhiyuan Research Institute Co., Ltd, Hangzhou Zhejiang 310008, China
    3. School of Computer Science and Big Data, Fuzhou University, Fuzhou Fujian 350108, China
    4. School of Mechatronics Engineering, Harbin Institute of Technology, Harbin Heilongjiang 150001, China
    5. State Key Laboratory of Robotics and System, Harbin Institute of Technology, Harbin Heilongjiang 150001, China
  • Received: 2024-08-01  Accepted: 2025-01-22  Online: 2025-06-30  Published: 2025-06-13
  • About author:
    SUN Hao (1986-), professor, Ph.D. His main research interests cover sensors and artificial intelligence. E-mail: sunnice@hit.edu.cn

  • Supported by:
    National Natural Science Foundation of China (T2388101)

Abstract:

A terrain segmentation algorithm based on the fusion of information from multimodal text-visual large models was proposed to enhance the intelligent perception capability of robots in dynamic and complex environments. The algorithm integrated simple linear iterative clustering (SLIC) for image data preprocessing, contrastive language-image pre-training (CLIP) and the segment anything model (SAM) for mask generation, and the Dice coefficient for post-processing. First, the original input image was preprocessed with SLIC to obtain superpixel segmentation blocks, and prompt points were added to improve the quality of subsequent masks, which significantly enhanced terrain classification accuracy. Next, the CLIP large model, pre-trained on text-image data, was used to match the input visual images with predefined terrain text descriptions, leveraging its interpretability and zero-shot learning capabilities to generate sets of terrain prompt points. The SAM large model then generated mask data with semantic labels from these point sets, and the Dice coefficient was applied in post-processing to select usable masks. Using the Cityscapes dataset as the terrain segmentation benchmark, the proposed algorithm was shown to outperform mainstream segmentation algorithms under both supervised and unsupervised learning frameworks. Without the need for labeled data, the algorithm achieved a mask generation rate of 76.58% and an intersection over union (IoU) of 90.14%. For the terrain perception task of a quadruped robot, a U-Net encoder-decoder network was added as a quantitative validation module. Using the generated masks as a dataset, a lightweight terrain segmentation model was constructed, deployed on the edge computing device of the quadruped robot, and evaluated in real-world terrain segmentation experiments. The experimental results demonstrated that the two mask optimization methods proposed in this paper improved the model's mean IoU (MIoU) by 2.36% and 2.56%, respectively, with the final lightweight model achieving an MIoU of 96.34%, indicating reliable terrain segmentation accuracy. The segmentation algorithm effectively guided the robot to navigate quickly and safely from the starting point to the target location while avoiding non-geometric obstacles such as grass.
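As a rough illustration of the SLIC-plus-CLIP stage described above, the following Python sketch segments an image into superpixels and assigns each one a terrain label by CLIP zero-shot matching. The label set, the choice of superpixel centroids as prompt points, and the patch-based scoring are illustrative assumptions, not the authors' implementation.

```python
# Sketch only: TERRAIN_LABELS, centroid prompt points, and crop-based
# CLIP scoring are assumptions for illustration, not the paper's method.
import numpy as np
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
from PIL import Image
from skimage.segmentation import slic

TERRAIN_LABELS = ["road", "sidewalk", "grass"]  # hypothetical terrain texts

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
text_tokens = clip.tokenize([f"a photo of {t}" for t in TERRAIN_LABELS]).to(device)

def terrain_prompt_points(image: np.ndarray, n_segments: int = 200):
    """image: RGB uint8 array. Returns [((x, y), label), ...], one per superpixel."""
    segments = slic(image, n_segments=n_segments, compactness=10)
    points = []
    for seg_id in np.unique(segments):
        ys, xs = np.nonzero(segments == seg_id)
        # The superpixel centroid serves as the candidate prompt point for SAM.
        point = (int(xs.mean()), int(ys.mean()))
        # Classify a bounding-box crop of the superpixel with CLIP (zero-shot).
        crop = image[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
        patch = preprocess(Image.fromarray(crop)).unsqueeze(0).to(device)
        with torch.no_grad():
            logits_per_image, _ = model(patch, text_tokens)
        points.append((point, TERRAIN_LABELS[int(logits_per_image.argmax())]))
    return points
```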
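The Dice-coefficient post-processing step can likewise be read as a simple overlap filter. Below is a minimal sketch, assuming each labeled SAM mask is scored against a paired reference region and that 0.5 is an acceptable cutoff; the abstract does not specify the pairing or the threshold.

```python
# Sketch only: the mask/region pairing and the 0.5 threshold are assumptions.
import numpy as np

def dice_coefficient(a: np.ndarray, b: np.ndarray) -> float:
    """Dice = 2 * |A intersect B| / (|A| + |B|) for two boolean masks of equal shape."""
    total = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / total if total > 0 else 0.0

def select_usable_masks(labeled_masks, reference_regions, threshold=0.5):
    """Keep (mask, label) pairs whose Dice score against the paired region
    reaches the (assumed) threshold."""
    return [(mask, label)
            for (mask, label), region in zip(labeled_masks, reference_regions)
            if dice_coefficient(mask, region) >= threshold]
```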

Key words: deep learning, text-visual large models, quadruped robots, terrain perception, computer vision
