On-Site Construction Safety Monitoring Based on Large Vision Language Models

doi:10.11996/JG.j.2095-302X.2025050960

Abstract

To address the high development cost and narrow applicability of traditional vision models in construction safety monitoring, a new solution based on a large vision language model (LVLM) is proposed. Building on an open-source pretrained LVLM, several types of prompt strategies suited to construction safety monitoring tasks are designed, including text prompts, image prompts with supplementary information, and image exemplar prompts. These strategies enable the LVLM to comprehend and reason about construction site imagery. An intelligent monitoring workflow and system architecture based on the LVLM are also designed. The method has been applied to representative construction safety monitoring scenarios, including supervisor absence detection, hazardous zone intrusion identification, and non-compliant construction behavior recognition. Validation on real-world data shows that, with appropriate prompt strategies, the LVLM achieves recognition accuracy close to that of mainstream deep learning models without data annotation or model training. The approach also offers low development cost, fast deployment, and flexible task adaptation, indicating application potential in image recognition and intelligent monitoring.

Keywords: large vision language model, computer vision, construction safety, intelligent monitoring, prompt engineering

LENG Shuo, WANG Wei, OU Jiayong, XUE Zhigang, SONG Yinglong, MO Sijun. On-site construction safety monitoring based on large vision language models[J]. Journal of Graphics, 2025, 46(5): 960-968.

References

[1]	HU Z Z, ZHANG J P, ZHANG X L. 4D construction safety information model-based safety analysis approach for scaffold system during construction[J]. Engineering Mechanics, 2010, 27(12): 192-200 (in Chinese).
[2]	ZHU Y, LING Z G, ZHANG Y Q. Research progress and prospect of machine vision technology[J]. Journal of Graphics, 2020, 41(6): 871-890 (in Chinese).
[3]	LU M, ZHANG Y, ZHANG J P, et al. Integration of four-dimensional computer-aided design modeling and three-dimensional animation of operations simulation for visualizing construction of the main stadium for the Beijing 2008 Olympic games[J]. Canadian Journal of Civil Engineering, 2009, 36(3): 473-479.
[4]	YANG X J, YU Z, GANG J. Application progress of image sensing technology in smart construction sites[J]. Sichuan Architecture, 2021, 41(S1): 41-44 (in Chinese).
[5]	XIE G B, XIAO F, LIN Z Y, et al. Method for detecting reflective vests and safety helmets in complex operational environments[J]. Journal of Safety and Environment, 2024, 24(9): 3513-3521 (in Chinese).
[6]	CUI K B, GENG J C. A multi-scene fire sign detection algorithm based on EE-YOLOv8s[J]. Journal of Graphics, 2025, 46(1): 13-27 (in Chinese).
[7]	ZHENG X B, YAO G D, SHI F Y, et al. Intelligent video analysis model for large-scale construction machinery supervision system[J]. Railway Computer Application, 2024, 33(4): 23-29 (in Chinese).
[8]	ZHAO S X, YIN L, SU S M, et al. Construction safety monitoring method based on multiscale feature attention network[J]. SCIENTIA SINICA Technologica, 2023, 53(7): 1241-1252 (in Chinese).
[9]	SHI W K. Research on worker violation recognition system based on object detection[D]. Fuxin: Liaoning Technical University, 2023 (in Chinese).
[10]	GAN W X, ZHANG Y X, GENG J, et al. Application of improved PoseConv3D model in recognition of unsafe behaviors of construction workers near the edge[J]. Journal of Safety and Environment, 2024, 24(7): 2712-2720 (in Chinese).
[11]	ZHANG Q, ZHANG R M, CHEN B. Research review of image recognition technology based on deep learning[J]. Journal of the Hebei Academy of Sciences, 2019, 36(3): 28-36 (in Chinese).
[12]	JIANG C, ZHENG Z, LIANG X, et al. A new interaction paradigm for building design driven by large language model: proof of concept with Rhino7[J]. Journal of Graphics, 2024, 45(3): 594-600 (in Chinese).
[13]	OpenCompass. OpenCompass multi-modal academic leaderboard[EB/OL]. [2024-12-17]. https://rank.opencompass.org.cn/leaderboard-multimodal.
[14]	OpenAI, ACHIAM J, ADLER S, et al. GPT-4 technical report[EB/OL]. [2025-01-17]. https://arxiv.org/abs/2303.08774.
[15]	ANTHROPIC. The Claude 3 model family: opus, sonnet, haiku[EB/OL]. [2024-12-17]. https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf.
[16]	BAI J Z, BAI S, YANG S S, et al. Qwen-VL: a frontier large vision-language model with versatile abilities[EB/OL]. [2024-12-17]. https://arxiv.org/abs/2308.12966.
[17]	LI B, ZHANG Y H, GUO D, et al. LLaVA-OneVision: easy visual task transfer[EB/OL]. [2024-12-17]. https://arxiv.org/abs/2408.03326.
[18]	YAO Y, YU T Y, ZHANG A, et al. MiniCPM-V: a GPT-4V level MLLM on your phone[EB/OL]. [2024-12-17]. https://arxiv.org/abs/2408.01800.
[19]	AGRAWAL P, ANTONIAK S, HANNA E B, et al. Pixtral 12B[EB/OL]. [2025-01-23]. https://arxiv.org/abs/2410.07073.
[20]	Team GLM. ChatGLM: a family of large language models from GLM-130B to GLM-4 all tools[EB/OL]. [2024-12-17]. https://arxiv.org/abs/2406.12793.
[21]	JIN C. Remote sensing images detection algorithm research based on visual-language model[D]. Hangzhou: Hangzhou Dianzi University, 2024 (in Chinese).
[22]	CHEN Y C, ZHANG Q, HUANG Y Q, et al. CLAML: adaptive meta-learning for ferrography images under vision-language models[J]. Journal of Guangdong University of Petrochemical Technology, 2024, 34(4): 93-99 (in Chinese).
[23]	XU X Z, JIANG Y Q, CHEN W H, et al. DAMO-YOLO: a report on real-time object detection design[EB/OL]. [2024-12-17]. https://arxiv.org/abs/2211.15444.
[24]	DAI X Y, CHEN Y P, XIAO B, et al. Dynamic head: unifying object detection heads with attentions[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 7369-7378.
[25]	HUANG X. Smart_Construction: base on YOLOv5 head person helmet detection on construction sites[EB/OL]. [2024-12-17]. https://github.com/PeterH0323/Smart_Construction.
[26]	WU Z Y, CHEN X K, PAN Z Z, et al. DeepSeek-VL2: mixture-of-experts vision-language models for advanced multimodal understanding[EB/OL]. [2024-12-17]. https://arxiv.org/abs/2412.10302.

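Study	Recognition model	Recognition target	Dataset size
Ref. [5]	YOLOX	Safety helmets, reflective vests	1,083 images
Ref. [6]	YOLOv8s	Fire signs	2,286 images
Ref. [7]	YOLOv6	10 types of large construction machinery	3,600 images
Ref. [8]	Mask R-CNN	Personnel intrusion into warning zones	43,000 images
Ref. [9]	YOLOv5	2 types of violations: smoking and phone use	15,368 images
Ref. [10]	YOLOv3 + PoseConv3D	6 types of violations, such as leaning on guardrails	2,132 videos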

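Model	Parameters	Performance score	Characteristics and application scenarios
GPT-4o-20241120 [14]	Unknown	72.0	Online commercial model, suited to computation on public, non-sensitive data
Claude 3.5-Sonnet [15]	Unknown	67.9	Online commercial model, suited to computation on public, non-sensitive data
Qwen2-VL-72B [16]	73.4 billion	67.1	Open-source large model, suited to centralized computation on the central side
LLaVA-OneVision [17]	73.0 billion	68.1	Open-source large model, suited to centralized computation on the central side
MiniCPM-V [18]	8.0 billion	65.2	Open-source small model, suited to on-device edge computing
Pixtral-12B [19]	13.0 billion	61.0
GLM-4V-9B [20]	9.0 billion	59.1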

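Strategy	Example recognition task	Example input image	Example input text
Text prompt strategy	People counting	(original image fed directly to the model)	You are an AI assistant skilled in image analysis. Your task is to identify the total number of people in a video surveillance image.
Image supplementary information prompt strategy	Hazardous zone intrusion identification	(original image) (image with supplementary information added)	You are an AI assistant skilled in image analysis. Your task is to determine whether anyone is inside the marked zone. The zone is shown in the image as a polygon with a red border.
Image exemplar prompt strategy	Construction machinery recognition	(exemplar image) (image to be assessed)	You are an AI assistant skilled in image analysis. Image 1 shows you an example of a concrete mixer truck. Please determine whether a concrete mixer truck is present in Image 2.
Formatted output strategy	Used together with other strategies	-	Please output strictly in the following JSON format: {expected JSON format}. Do not output anything else, and do not explain the output.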
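
The formatted output strategy in the table above is meant to be combined with a task prompt so that the model's reply can be consumed by a program. The sketch below shows one possible way to do this in Python; it is not the authors' implementation. The function names `build_prompt` and `parse_reply`, the `TASKS` keys, and the `person_count` schema are all hypothetical, and the model call itself is omitted because the source does not specify an inference API. The reply string used here is hand-written for illustration.

```python
import json

# Role preamble and task wording follow the example prompts in the table.
ROLE = "You are an AI assistant skilled in image analysis."

# Hypothetical task registry: the dictionary keys are illustrative only.
TASKS = {
    "people_count": (
        "Your task is to identify the total number of people "
        "in a video surveillance image."
    ),
    "zone_intrusion": (
        "Your task is to determine whether anyone is inside the marked zone. "
        "The zone is shown in the image as a polygon with a red border."
    ),
}


def build_prompt(task_key: str, schema: dict) -> str:
    """Combine a task prompt with the formatted-output instruction."""
    format_rule = (
        "Please output strictly in the following JSON format: "
        + json.dumps(schema)
        + ". Do not output anything else, and do not explain the output."
    )
    return " ".join([ROLE, TASKS[task_key], format_rule])


def parse_reply(reply: str, schema: dict) -> dict:
    """Parse a model reply and check it has exactly the expected keys and types."""
    data = json.loads(reply)  # raises json.JSONDecodeError (a ValueError) on non-JSON text
    if not isinstance(data, dict) or set(data) != set(schema):
        raise ValueError(f"unexpected fields in reply: {reply!r}")
    for key, example in schema.items():
        if type(data[key]) is not type(example):
            raise ValueError(f"field {key!r} has the wrong type")
    return data


# Hypothetical schema: the example value only indicates the expected type.
COUNT_SCHEMA = {"person_count": 0}

prompt = build_prompt("people_count", COUNT_SCHEMA)
# Hand-written stand-in for a model reply (assumption, not real model output).
result = parse_reply('{"person_count": 3}', COUNT_SCHEMA)
print(result["person_count"])
```

Rejecting any reply that is not exactly the requested JSON object is a deliberate choice in this sketch: a monitoring workflow can then retry or flag the frame instead of acting on free-form text.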