欢迎访问《图学学报》 分享到:

图学学报 ›› 2025, Vol. 46 ›› Issue (5): 960-968.DOI: 10.11996/JG.j.2095-302X.2025050960

• 图像处理与计算机视觉 • 上一篇    下一篇

基于大型视觉语言模型的施工现场安全监控研究

冷烁(), 王玮(), 欧家勇, 薛志刚, 宋英龙, 莫斯钧   

  1. 广州地铁建设管理有限公司广东 广州 510335
  • 收稿日期:2024-12-18 接受日期:2025-03-03 出版日期:2025-10-30 发布日期:2025-09-10
  • 通讯作者:王玮(1980-),男,高级工程师,学士。主要研究方向为城市轨道交通数字化建造技术。E-mail:wangwei1@gzmtr.com
  • 第一作者:冷烁(1996-),男,博士。主要研究方向为数据分析与图像识别在工程建设中的应用。E-mail:lengshuo@gzmtr.com

On-Site construction safety monitoring based on large vision language models

LENG Shuo(), WANG Wei(), OU Jiayong, XUE Zhigang, SONG Yinglong, MO Sijun   

  1. Guangzhou Metro Construction Management Co., Ltd., Guangzhou Guangdong 510335, China
  • Received:2024-12-18 Accepted:2025-03-03 Published:2025-10-30 Online:2025-09-10
  • First author:LENG Shuo (1996-), Ph.D. His main research interests cover the application of data analysis and image recognition in construction engineering. E-mail:lengshuo@gzmtr.com

摘要:

针对施工安全监控过程中,传统视觉模型构建成本高、应用范围窄等问题,提出一种基于大型视觉语言模型(LVLM)的全新解决方案。基于开源预训练LVLM,提出包括文本提示、图像附加信息、图像样本提示等多类适用于施工安全监控任务的提示词策略,实现LVLM对施工监控图像的理解与推理,并设计了基于LVLM的智能监控工作流程与系统架构。研究成果被应用于管理人员离岗识别、危险区域侵入识别、以及违规施工行为识别等多项典型施工安全监控场景。实际数据验证表明,通过合适的提示词策略,LVLM无需数据标注与模型训练,便可实现接近主流深度学习模型的识别准确率,同时具有构建成本低、落地速度快、任务适应灵活等优势,在图像识别与智能监控领域具有应用潜力。

关键词: 大型视觉语言模型, 计算机视觉, 施工安全, 智能监控, 提示词工程

Abstract:

To address the challenges of high development cost and limited applicability of traditional vision models in construction safety monitoring, an original solution based on large vision language model (LVLM) was proposed. Based on an open-source pretrained LVLM, various types of prompt strategies suitable for construction safety monitoring tasks were designed, including text prompts, image prompts with supplementary information, and image exemplar prompts. These strategies enable the LVLM to effectively comprehend and reason about construction site imagery. Moreover, an intelligent monitoring workflow and system architecture based on LVLM were developed. The proposed method has been applied to three representative construction safety monitoring scenarios, including supervisor absence detection, hazardous zone intrusion identification, and non-compliant behavior recognition. Empirical data validation demonstrated that with appropriate prompting strategies, the LVLM can achieve satisfactory recognition accuracy close to that of mainstream deep learning models without requiring data annotation and model training. The proposed approach has the advantages of low development cost, fast implementation speed, and flexible task adaptation, revealing application potential in the fields of image recognition and intelligent monitoring.

Key words: large vision language model, computer vision, construction safety, intelligent monitoring, prompt engineering

中图分类号: