
图学学报 ›› 2025, Vol. 46 ›› Issue (3): 558-567. DOI: 10.11996/JG.j.2095-302X.2025030558

• 图像处理与计算机视觉 •

多模态文本视觉大模型机器人地形感知算法研究

孙浩1, 谢滔1, 何龙2, 郭文忠3, 虞永方2, 吴其军2, 王建伟2, 东辉4,5

  1.福州大学机械工程及自动化学院,福建 福州 350108
    2.杭州智元研究院有限公司,浙江 杭州 310008
    3.福州大学计算机与大数据学院,福建 福州 350108
    4.哈尔滨工业大学机电工程学院,黑龙江 哈尔滨 150001
    5.哈尔滨工业大学机器人技术与系统全国重点实验室,黑龙江 哈尔滨 150001
  • 收稿日期:2024-08-01 接受日期:2025-01-22 出版日期:2025-06-30 发布日期:2025-06-13
  • 第一作者:孙浩(1986-),男,教授,博士。主要研究方向为传感器与人工智能。E-mail:sunnice@hit.edu.cn
  • 基金资助:
    国家自然科学基金(T2388101)

Research on multimodal text-visual large model for robotic terrain perception algorithm

SUN Hao1, XIE Tao1, HE Long2, GUO Wenzhong3, YU Yongfang2, WU Qijun2, WANG Jianwei2, DONG Hui4,5

  1. School of Mechanical Engineering and Automation, Fuzhou University, Fuzhou Fujian 350108, China
    2. Hangzhou Zhiyuan Research Institute Co., Ltd, Hangzhou Zhejiang 310008, China
    3. School of Computer Science and Big Data, Fuzhou University, Fuzhou Fujian 350108, China
    4. School of Mechatronics Engineering, Harbin Institute of Technology, Harbin Heilongjiang 150001, China
    5. State Key Laboratory of Robotics and System, Harbin Institute of Technology, Harbin Heilongjiang 150001, China
  • Received: 2024-08-01  Accepted: 2025-01-22  Published: 2025-06-30  Online: 2025-06-13
  • First author: SUN Hao (1986-), professor, Ph.D. His main research interests include sensors and artificial intelligence. E-mail: sunnice@hit.edu.cn
  • Supported by:
    National Natural Science Foundation of China (T2388101)

摘要:

为提升机器人在动态复杂环境下对地形的智能感知能力,提出了一种基于多模态文本视觉大模型信息融合的地形分割算法,集成了SLIC图像数据预处理、CLIP和SAM掩码生成模块、Dice系数后处理。首先,对原始输入图像进行SLIC预处理,得到图像分割子块,通过增加提示点提高后续掩码质量,可显著提高地形分类准确度。然后,通过文本-图像预训练大模型CLIP,将输入视觉图像和预设地形文本信息进行匹配,并借助其可解释性和零次学习,生成各地形提示点集合。由SAM大模型接受上述集合生成带有语义标签的掩码数据,并通过Dice系数后处理筛选可用掩码。以Cityscapes数据集为地形分割样本,验证了该算法相较于监督和无监督学习框架下主流分割算法的优越性,在无需标记数据的情况下,实现了76.58%的有效掩码生成率,IoU达到90.14%。针对四足机器人地形感知任务,添加U-net编/解码器网络量化验证模块。以生成掩码作为数据集,构建轻量化地形分割模型,部署在四足机器人的边缘计算设备上,并在真实环境中开展地形分割实验。实验结果表明,2种掩码优化方法分别使模型MIoU提升了2.36%和2.56%,最终轻量化模型MIoU达到96.34%,地形分割精度可靠,该算法有效指导了机器人快速地从起点安全行进到目标地,并有效避开草地等非几何障碍物。
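
The abstract above describes SLIC superpixel preprocessing whose image sub-blocks supply prompt points for the later mask-generation stage. The snippet below is a minimal sketch of that idea, not the authors' code, assuming scikit-image's slic implementation; the function name superpixel_prompt_points and the parameter values n_segments/compactness are placeholders.

# Minimal illustrative sketch of the SLIC preprocessing step: split the RGB image
# into superpixel sub-blocks and take each sub-block's centroid as a candidate
# prompt point for the downstream mask generator. Parameter values are assumed.
import numpy as np
from skimage.segmentation import slic

def superpixel_prompt_points(image, n_segments=200, compactness=10.0):
    """image: H x W x 3 uint8 RGB array -> (superpixel label map, centroid list)."""
    labels = slic(image, n_segments=n_segments, compactness=compactness)
    points = []
    for sp_id in np.unique(labels):
        ys, xs = np.nonzero(labels == sp_id)                  # pixels of one superpixel
        points.append((float(xs.mean()), float(ys.mean())))   # (x, y) centroid
    return labels, points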

关键词: 深度学习, 文本视觉大模型, 足式机器人, 地形感知, 计算机视觉

Abstract:

A terrain segmentation algorithm based on the fusion of information from multimodal text-visual large models was proposed to enhance the intelligent terrain perception capability of robots in dynamic and complex environments. The algorithm integrated simple linear iterative clustering (SLIC) for image data preprocessing, contrastive language-image pre-training (CLIP) and the segment anything model (SAM) for mask generation, and the Dice coefficient for post-processing. First, the original input image was preprocessed with SLIC to obtain image sub-blocks, and prompt points were added to improve the quality of the subsequent masks, which significantly enhanced terrain classification accuracy. Next, the CLIP large model, pre-trained on text-image pairs, was used to match the input visual images with predefined terrain text descriptions, leveraging its interpretability and zero-shot learning capability to generate a set of prompt points for each terrain type. The SAM large model then took these point sets as input and generated mask data with semantic labels, and the Dice coefficient was applied in post-processing to select usable masks. Using the Cityscapes dataset as the terrain segmentation sample, the superiority of the proposed algorithm over mainstream segmentation algorithms under both supervised and unsupervised learning frameworks was validated: without any labeled data, the algorithm achieved an effective mask generation rate of 76.58% and an IoU (intersection over union) of 90.14%. For the terrain perception task of a quadruped robot, a quantitative validation module based on a U-net encoder/decoder network was added. Using the generated masks as the dataset, a lightweight terrain segmentation model was constructed, deployed on the quadruped robot's edge computing device, and evaluated in terrain segmentation experiments in a real-world environment. The experimental results showed that the two mask optimization methods improved the model's mean IoU (MIoU) by 2.36% and 2.56%, respectively, and the final lightweight model achieved an MIoU of 96.34%, indicating reliable terrain segmentation accuracy. The segmentation algorithm effectively guided the robot to travel quickly and safely from the starting point to the target location while avoiding non-geometric obstacles such as grass.
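
As a rough sketch of the CLIP-to-SAM stage summarized above, the code below scores each superpixel crop against predefined terrain text prompts with a pre-trained CLIP model (zero-shot) and feeds the resulting per-class prompt points to SAM to obtain semantically labelled masks. It assumes the Hugging Face transformers CLIP interface and Meta's segment-anything package; the terrain prompt list, checkpoint names, and helper functions are illustrative assumptions rather than details from the paper.

import numpy as np
import torch
from transformers import CLIPModel, CLIPProcessor
from segment_anything import sam_model_registry, SamPredictor

# Assumed terrain vocabulary; the paper's actual prompt set is not given here.
TERRAIN_PROMPTS = ["a photo of a paved road", "a photo of a sidewalk",
                   "a photo of grass", "a photo of gravel"]

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")  # local SAM weights
predictor = SamPredictor(sam)

def classify_crop(crop):
    """Zero-shot CLIP score of one RGB superpixel crop against the terrain prompts."""
    inputs = clip_proc(text=TERRAIN_PROMPTS, images=crop,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = clip_model(**inputs).logits_per_image.softmax(dim=-1)[0]
    return int(probs.argmax()), float(probs.max())

def masks_from_points(image, points_per_class):
    """Run SAM once per terrain class, using that class's centroids as point prompts."""
    predictor.set_image(image)                        # H x W x 3 uint8 RGB array
    class_masks = {}
    for cls_id, pts in points_per_class.items():
        coords = np.array(pts, dtype=np.float32)      # (N, 2) in (x, y) pixel coordinates
        labels = np.ones(len(pts), dtype=np.int32)    # 1 = foreground prompt
        masks, _, _ = predictor.predict(point_coords=coords,
                                        point_labels=labels,
                                        multimask_output=False)
        class_masks[cls_id] = masks[0]                # boolean H x W mask
    return class_masks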

Key words: deep learning, text-visual large models, quadruped robots, terrain perception, computer vision
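
The Dice-coefficient post-processing can be pictured as an overlap check between each generated mask and the superpixel region its prompt points came from, discarding masks that agree poorly before they are used as pseudo-labels for the lightweight U-net model. The sketch below is an assumed formulation; the 0.8 threshold and the choice of comparison region are illustrative, not values reported in the paper.

import numpy as np

def dice_coefficient(mask_a, mask_b):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two boolean masks of equal shape."""
    inter = np.logical_and(mask_a, mask_b).sum()
    total = mask_a.sum() + mask_b.sum()
    return 2.0 * inter / total if total > 0 else 0.0

def filter_masks(class_masks, class_regions, threshold=0.8):
    """Keep only masks that overlap sufficiently with the CLIP-labelled superpixel regions."""
    return {cls: m for cls, m in class_masks.items()
            if dice_coefficient(m, class_regions[cls]) >= threshold}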
