
图学学报 ›› 2025, Vol. 46 ›› Issue (3): 558-567. DOI: 10.11996/JG.j.2095-302X.2025030558

• 图像处理与计算机视觉 •

多模态文本视觉大模型机器人地形感知算法研究

孙浩1, 谢滔1, 何龙2, 郭文忠3, 虞永方2, 吴其军2, 王建伟2, 东辉4,5

  1.福州大学机械工程及自动化学院,福建 福州 350108
    2.杭州智元研究院有限公司,浙江 杭州 310008
    3.福州大学计算机与大数据学院,福建 福州 350108
    4.哈尔滨工业大学机电工程学院,黑龙江 哈尔滨 150001
    5.哈尔滨工业大学机器人技术与系统全国重点实验室,黑龙江 哈尔滨 150001
  • 收稿日期:2024-08-01 接受日期:2025-01-22 出版日期:2025-06-30 发布日期:2025-06-13
  • 第一作者:孙浩(1986-),男,教授,博士。主要研究方向为传感器与人工智能。E-mail:sunnice@hit.edu.cn
  • 基金资助:
    国家自然科学基金(T2388101)

Research on multimodal text-visual large model for robotic terrain perception algorithm

SUN Hao1, XIE Tao1, HE Long2, GUO Wenzhong3, YU Yongfang2, WU Qijun2, WANG Jianwei2, DONG Hui4,5

  1. School of Mechanical Engineering and Automation, Fuzhou University, Fuzhou Fujian 350108, China
    2. Hangzhou Zhiyuan Research Institute Co., Ltd, Hangzhou Zhejiang 310008, China
    3. School of Computer Science and Big Data, Fuzhou University, Fuzhou Fujian 350108, China
    4. School of Mechatronics Engineering, Harbin Institute of Technology, Harbin Heilongjiang 150001, China
    5. State Key Laboratory of Robotics and System, Harbin Institute of Technology, Harbin Heilongjiang 150001, China
  • Received: 2024-08-01  Accepted: 2025-01-22  Published: 2025-06-30  Online: 2025-06-13
  • First author: SUN Hao (1986-), professor, Ph.D. His main research interests include sensors and artificial intelligence. E-mail: sunnice@hit.edu.cn
  • Supported by:
    National Natural Science Foundation of China (T2388101)

摘要:

为提升机器人在动态复杂环境下对地形的智能感知能力,提出了一种基于多模态文本视觉大模型信息融合的地形分割算法,集成了SLIC图像数据预处理、CLIP和SAM掩码生成模块、Dice系数后处理。首先,对原始输入图像进行SLIC预处理,得到图像分割子块,通过增加提示点提高后续掩码质量,可显著提高地形分类准确度。然后,通过文本-图像预训练大模型CLIP,将输入视觉图像和预设地形文本信息进行匹配,并借助其可解释性和零次学习,生成各地形提示点集合。由SAM大模型接受上述集合生成带有语义标签的掩码数据,并通过Dice系数后处理筛选可用掩码。以Cityscapes数据集为地形分割样本,验证了该算法相较于监督和无监督学习框架下主流分割算法的优越性,在无需标记数据的情况下,实现了76.58%的有效掩码生成率,IoU达到90.14%。针对四足机器人地形感知任务,添加U-net编/解码器网络量化验证模块。以生成掩码作为数据集,构建轻量化地形分割模型,部署在四足机器人的边缘计算设备上,并在真实环境中开展地形分割实验。实验结果表明,2种掩码优化方法分别使模型MIoU提升了2.36%和2.56%,最终轻量化模型MIoU达到96.34%,地形分割精度可靠,该算法有效指导了机器人快速地从起点安全行进到目标地,并有效避开草地等非几何障碍物。
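
The abstract above describes SLIC superpixel preprocessing whose image sub-blocks supply prompt points for the later mask-generation stage. The snippet below is a minimal sketch of that idea, not the authors' code, assuming scikit-image's slic implementation; the function name superpixel_prompt_points and the parameter values n_segments/compactness are placeholders.

# Minimal illustrative sketch of the SLIC preprocessing step: split the RGB image
# into superpixel sub-blocks and take each sub-block's centroid as a candidate
# prompt point for the downstream mask generator. Parameter values are assumed.
import numpy as np
from skimage.segmentation import slic

def superpixel_prompt_points(image, n_segments=200, compactness=10.0):
    """image: H x W x 3 uint8 RGB array -> (superpixel label map, centroid list)."""
    labels = slic(image, n_segments=n_segments, compactness=compactness)
    points = []
    for sp_id in np.unique(labels):
        ys, xs = np.nonzero(labels == sp_id)                  # pixels of one superpixel
        points.append((float(xs.mean()), float(ys.mean())))   # (x, y) centroid
    return labels, points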

关键词: 深度学习, 文本视觉大模型, 足式机器人, 地形感知, 计算机视觉

Abstract:

A terrain segmentation algorithm based on the fusion of information from multimodal text-visual large models was proposed to enhance the intelligent terrain perception capability of robots in dynamic and complex environments. The algorithm integrated simple linear iterative clustering (SLIC) for image data preprocessing, contrastive language-image pre-training (CLIP) and the segment anything model (SAM) for mask generation, and the Dice coefficient for post-processing. First, the original input image was preprocessed with SLIC to obtain image sub-blocks, and prompt points were added to improve the quality of the subsequent masks, which significantly enhanced terrain classification accuracy. Next, the CLIP large model, pre-trained on text-image pairs, was used to match the input visual images with predefined terrain text descriptions, leveraging its interpretability and zero-shot learning capability to generate a set of prompt points for each terrain type. The SAM large model then took these point sets as input and generated mask data with semantic labels, and the Dice coefficient was applied in post-processing to select usable masks. Using the Cityscapes dataset as the terrain segmentation sample, the superiority of the proposed algorithm over mainstream segmentation algorithms under both supervised and unsupervised learning frameworks was validated: without any labeled data, the algorithm achieved an effective mask generation rate of 76.58% and an IoU (intersection over union) of 90.14%. For the terrain perception task of a quadruped robot, a quantitative validation module based on a U-net encoder/decoder network was added. Using the generated masks as the dataset, a lightweight terrain segmentation model was constructed, deployed on the quadruped robot's edge computing device, and evaluated in terrain segmentation experiments in a real-world environment. The experimental results showed that the two mask optimization methods improved the model's mean IoU (MIoU) by 2.36% and 2.56%, respectively, and the final lightweight model achieved an MIoU of 96.34%, indicating reliable terrain segmentation accuracy. The segmentation algorithm effectively guided the robot to travel quickly and safely from the starting point to the target location while avoiding non-geometric obstacles such as grass.
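
As a rough sketch of the CLIP-to-SAM stage summarized above, the code below scores each superpixel crop against predefined terrain text prompts with a pre-trained CLIP model (zero-shot) and feeds the resulting per-class prompt points to SAM to obtain semantically labelled masks. It assumes the Hugging Face transformers CLIP interface and Meta's segment-anything package; the terrain prompt list, checkpoint names, and helper functions are illustrative assumptions rather than details from the paper.

import numpy as np
import torch
from transformers import CLIPModel, CLIPProcessor
from segment_anything import sam_model_registry, SamPredictor

# Assumed terrain vocabulary; the paper's actual prompt set is not given here.
TERRAIN_PROMPTS = ["a photo of a paved road", "a photo of a sidewalk",
                   "a photo of grass", "a photo of gravel"]

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")  # local SAM weights
predictor = SamPredictor(sam)

def classify_crop(crop):
    """Zero-shot CLIP score of one RGB superpixel crop against the terrain prompts."""
    inputs = clip_proc(text=TERRAIN_PROMPTS, images=crop,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = clip_model(**inputs).logits_per_image.softmax(dim=-1)[0]
    return int(probs.argmax()), float(probs.max())

def masks_from_points(image, points_per_class):
    """Run SAM once per terrain class, using that class's centroids as point prompts."""
    predictor.set_image(image)                        # H x W x 3 uint8 RGB array
    class_masks = {}
    for cls_id, pts in points_per_class.items():
        coords = np.array(pts, dtype=np.float32)      # (N, 2) in (x, y) pixel coordinates
        labels = np.ones(len(pts), dtype=np.int32)    # 1 = foreground prompt
        masks, _, _ = predictor.predict(point_coords=coords,
                                        point_labels=labels,
                                        multimask_output=False)
        class_masks[cls_id] = masks[0]                # boolean H x W mask
    return class_masks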

Key words: deep learning, text-visual large models, quadruped robots, terrain perception, computer vision
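
The Dice-coefficient post-processing can be pictured as an overlap check between each generated mask and the superpixel region its prompt points came from, discarding masks that agree poorly before they are used as pseudo-labels for the lightweight U-net model. The sketch below is an assumed formulation; the 0.8 threshold and the choice of comparison region are illustrative, not values reported in the paper.

import numpy as np

def dice_coefficient(mask_a, mask_b):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two boolean masks of equal shape."""
    inter = np.logical_and(mask_a, mask_b).sum()
    total = mask_a.sum() + mask_b.sum()
    return 2.0 * inter / total if total > 0 else 0.0

def filter_masks(class_masks, class_regions, threshold=0.8):
    """Keep only masks that overlap sufficiently with the CLIP-labelled superpixel regions."""
    return {cls: m for cls, m in class_masks.items()
            if dice_coefficient(m, class_regions[cls]) >= threshold}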
