Journal of Graphics ›› 2025, Vol. 46 ›› Issue (5): 960-968. DOI: 10.11996/JG.j.2095-302X.2025050960
• Image Processing and Computer Vision •
LENG Shuo, WANG Wei, OU Jiayong, XUE Zhigang, SONG Yinglong, MO Sijun
Received: 2024-12-18
Accepted: 2025-03-03
Online: 2025-10-30
Published: 2025-09-10
Contact: WANG Wei
About author: First author contact: LENG Shuo (1996-), Ph.D. His main research interests cover the application of data analysis and image recognition in construction engineering. E-mail: lengshuo@gzmtr.com
LENG Shuo, WANG Wei, OU Jiayong, XUE Zhigang, SONG Yinglong, MO Sijun. On-Site construction safety monitoring based on large vision language models[J]. Journal of Graphics, 2025, 46(5): 960-968.
URL: http://www.txxb.com.cn/EN/10.11996/JG.j.2095-302X.2025050960
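Study | Recognition model | Recognition target | Dataset size |
---|---|---|---|
Ref. [5] | YOLO X | Safety helmets, reflective vests | 1,083 images |
Ref. [6] | YOLO v8s | Fire signs | 2,286 images |
Ref. [7] | YOLO v6 | 10 classes of large construction machinery | 3,600 images |
Ref. [8] | Mask R-CNN | Personnel intrusion into warning zones | 43,000 images |
Ref. [9] | YOLO v5 | 2 classes of violations: smoking and phone calls | 15,368 images |
Ref. [10] | YOLO v3+PoseConv3D | 6 classes of violations, such as leaning on guardrails | 2,132 videos |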
Table 1 Typical construction safety monitoring analysis research in recent years
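Model | Parameters | Performance score | Characteristics and application scenarios |
---|---|---|---|
GPT-4o-20241120 | Unknown | 72.0 | Online commercial model; suited to computation on publicly shareable, non-sensitive data |
Claude 3.5-Sonnet | Unknown | 67.9 | Online commercial model; suited to computation on publicly shareable, non-sensitive data |
Qwen2-VL-72B | 73.4 billion | 67.1 | Large open-source model; suited to centralized computation on the central side |
LLaVA-OneVision | 73 billion | 68.1 | Large open-source model; suited to centralized computation on the central side |
MiniCPM-V | 8 billion | 65.2 | Small open-source model; suited to on-device edge computing |
Pixtral-12B | 13 billion | 61.0 | Small open-source model; suited to on-device edge computing |
GLM-4V-9B | 9 billion | 59.1 | Small open-source model; suited to on-device edge computing |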
Table 2 Performance and characteristics of recent mainstream LVLMs
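Strategy | Example recognition task | Example image input | Example text input |
---|---|---|---|
Text prompting | People counting | Original image fed directly to the model | You are an AI assistant skilled in image analysis. Your task is to identify the total number of people in the video surveillance image. |
Image with added information prompting | Hazardous-area intrusion recognition | Original image; image with the added information | You are an AI assistant skilled in image analysis. Your task is to determine whether anyone is inside the indicated area. The area is shown in the image as a polygon with a red border. |
Image sample prompting | Construction machinery recognition | Sample image; image to be judged | You are an AI assistant skilled in image analysis. Image 1 shows you an example of a concrete mixer truck. Please determine whether a concrete mixer truck is present in Image 2. |
Formatted output | Used together with the other strategies | - | Please output strictly in the following JSON format: {expected JSON format}. Do not output anything else, and do not explain the output. |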
Table 3 Examples of LVLM prompting strategies
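The text prompting and formatted-output strategies in Table 3 can be combined into one prompt, and the reply can then be parsed as JSON. The following is a minimal Python sketch under stated assumptions: the helper names `build_prompt` and `parse_reply`, the schema `{"person_count": <integer>}`, and the sample reply are hypothetical illustrations, not taken from the paper, and no real model API is called.

```python
import json

# Prompt fragments translated from Table 3.
ROLE = "You are an AI assistant skilled in image analysis."
TASK = ("Your task is to identify the total number of people "
        "in the video surveillance image.")
# Hypothetical schema; the paper only says "{expected JSON format}".
SCHEMA = '{"person_count": <integer>}'


def build_prompt(role: str, task: str, schema: str) -> str:
    """Join the role, the task, and the formatted-output instruction."""
    return (
        f"{role} {task} "
        f"Please output strictly in the following JSON format: {schema}. "
        "Do not output anything else, and do not explain the output."
    )


def parse_reply(reply: str):
    """Parse the text between the first '{' and the last '}' as JSON.

    Returns None when no valid JSON object is found, so the caller can
    retry or discard the frame.
    """
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return None


prompt = build_prompt(ROLE, TASK, SCHEMA)
# Hypothetical model reply, including stray text around the JSON object.
sample_reply = 'Result: {"person_count": 3}'
result = parse_reply(sample_reply)
```

Tolerating stray text around the JSON object is a design choice of this sketch: models do not always follow a "JSON only" instruction exactly, and returning `None` keeps a malformed reply from being mistaken for a detection.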
Model | Precision/% | Recall/% | Frames per second |
---|---|---|---|
This paper | 94.2 | 97.5 | 0.83 |
Ref. [23] | 92.5 | 99.1 | 35.00 |
Ref. [24] | 95.8 | 98.3 | 41.00 |
Table 4 Model performance on the off-duty recognition task
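The precision and recall columns in the result tables are not defined on this page. Assuming the standard definitions (precision = TP / (TP + FP), recall = TP / (TP + FN)), a minimal Python sketch follows; the function name `precision_recall` and the counts used are hypothetical and are not the paper's data.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, recall) in percent from detection counts.

    tp: true positives, fp: false positives, fn: false negatives.
    An undefined ratio (zero denominator) is reported as 0.0.
    """
    precision = 100.0 * tp / (tp + fp) if (tp + fp) else 0.0
    recall = 100.0 * tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts, chosen only to illustrate the arithmetic.
p, r = precision_recall(tp=90, fp=10, fn=30)
```

High recall with lower precision means few missed events at the cost of more false alarms.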
Model | Precision/% | Recall/% | Frames per second |
---|---|---|---|
This paper | 87.8 | 89.0 | 0.81 |
Ref. [25] | 92.3 | 93.5 | 39.00 |
Table 5 Model performance on the region intrusion task
Task | Precision/% | Recall/% | Frames per second |
---|---|---|---|
Phone use recognition | 93.8 | 94.5 | 0.77 |
Sleeping recognition | 80.3 | 95.3 | |
Table 6 Model performance on the behavior recognition task
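The frames-per-second figures for the proposed method in Tables 4 to 6 convert to per-frame latency by taking the reciprocal. A short Python sketch of that arithmetic follows; the function name `seconds_per_frame` and the dictionary layout are hypothetical, while the frame rates are those reported in the tables.

```python
def seconds_per_frame(fps: float) -> float:
    """Convert a throughput in frames per second to seconds per frame."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1.0 / fps


# Frame rates reported for the proposed method in Tables 4, 5, and 6.
lvlm_fps = {
    "off-duty recognition": 0.83,
    "region intrusion": 0.81,
    "behavior recognition": 0.77,
}
latency = {task: round(seconds_per_frame(f), 2) for task, f in lvlm_fps.items()}

# Throughput gap on the off-duty task: Ref. [23] (35.00 fps) vs. this paper.
speed_ratio = 35.00 / 0.83
```

Each frame therefore takes a little over one second with the LVLM approach, roughly 40 times slower than the detector of Ref. [23] on the off-duty task.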
[1] | HU Z Z, ZHANG J P, ZHANG X L. 4D construction safety information model-based safety analysis approach for scaffold system during construction[J]. Engineering Mechanics, 2010, 27(12): 192-200 (in Chinese). |
[2] | ZHU Y, LING Z G, ZHANG Y Q. Research progress and prospect of machine vision technology[J]. Journal of Graphics, 2020, 41(6): 871-890 (in Chinese). |
[3] | LU M, ZHANG Y, ZHANG J P, et al. Integration of four-dimensional computer-aided design modeling and three-dimensional animation of operations simulation for visualizing construction of the main stadium for the Beijing 2008 Olympic games[J]. Canadian Journal of Civil Engineering, 2009, 36(3): 473-479. |
[4] | YANG X J, YU Z, GANG J. Application progress of image sensing technology in smart construction sites[J]. Sichuan Architecture, 2021, 41(S1): 41-44 (in Chinese). |
[5] | XIE G B, XIAO F, LIN Z Y, et al. Method for detecting reflective vests and safety helmets in complex operational environments[J]. Journal of Safety and Environment, 2024, 24(9): 3513-3521 (in Chinese). |
[6] | CUI K B, GENG J C. A multi-scene fire sign detection algorithm based on EE-YOLOv8s[J]. Journal of Graphics, 2025, 46(1): 13-27 (in Chinese). |
[7] | ZHENG X B, YAO G D, SHI F Y, et al. Intelligent video analysis model for large-scale construction machinery supervision system[J]. Railway Computer Application, 2024, 33(4): 23-29 (in Chinese). |
[8] | ZHAO S X, YIN L, SU S M, et al. Construction safety monitoring method based on multiscale feature attention network[J]. SCIENTIA SINICA Technologica, 2023, 53(7): 1241-1252 (in Chinese). |
[9] | SHI W K. Research on worker violation recognition system based on object detection[D]. Fuxin: Liaoning Technical University, 2023 (in Chinese). |
[10] | GAN W X, ZHANG Y X, GENG J, et al. Application of improved PoseConv3D model in recognition of unsafe behaviors of construction workers near the edge[J]. Journal of Safety and Environment, 2024, 24(7): 2712-2720 (in Chinese). |
[11] | ZHANG Q, ZHANG R M, CHEN B. Research review of image recognition technology based on deep learning[J]. Journal of the Hebei Academy of Sciences, 2019, 36(3): 28-36 (in Chinese). |
[12] | JIANG C, ZHENG Z, LIANG X, et al. A new interaction paradigm for building design driven by large language model: proof of concept with Rhino7[J]. Journal of Graphics, 2024, 45(3): 594-600 (in Chinese). |
[13] | OpenCompass. OpenCompass multi-modal academic leaderboard[EB/OL]. [2024-12-17]. https://rank.opencompass.org.cn/leaderboard-multimodal. |
[14] | OpenAI, ACHIAM J, ADLER S, et al. GPT-4 technical report[EB/OL]. [2025-01-17]. https://arxiv.org/abs/2303.08774. |
[15] | ANTHROPIC. The Claude 3 model family: opus, sonnet, haiku[EB/OL]. [2024-12-17]. https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf. |
[16] | BAI J Z, BAI S, YANG S S, et al. Qwen-VL: a frontier large vision-language model with versatile abilities[EB/OL]. [2024-12-17]. https://arxiv.org/abs/2308.12966. |
[17] | LI B, ZHANG Y H, GUO D, et al. LLaVA-OneVision: easy visual task transfer[EB/OL]. [2024-12-17]. https://arxiv.org/abs/2408.03326. |
[18] | YAO Y, YU T Y, ZHANG A, et al. MiniCPM-V: a GPT-4V level MLLM on your phone[EB/OL]. [2024-12-17]. https://arxiv.org/abs/2408.01800. |
[19] | AGRAWAL P, ANTONIAK S, HANNA E B, et al. Pixtral 12B[EB/OL]. [2025-01-23]. https://arxiv.org/abs/2410.07073. |
[20] | Team GLM. ChatGLM: a family of large language models from GLM-130B to GLM-4 all tools[EB/OL]. [2024-12-17]. https://arxiv.org/abs/2406.12793. |
[21] | JIN C. Remote sensing images detection algorithm research based on visual-language model[D]. Hangzhou: Hangzhou Dianzi University, 2024 (in Chinese). |
[22] | CHEN Y C, ZHANG Q, HUANG Y Q, et al. CLAML: adaptive meta-learning for ferrography images under vision-language models[J]. Journal of Guangdong University of Petrochemical Technology, 2024, 34(4): 93-99 (in Chinese). |
[23] | XU X Z, JIANG Y Q, CHEN W H, et al. DAMO-YOLO: a report on real-time object detection design[EB/OL]. [2024-12-17]. https://arxiv.org/abs/2211.15444. |
[24] | DAI X Y, CHEN Y P, XIAO B, et al. Dynamic head: unifying object detection heads with attentions[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 7369-7378. |
[25] | HUANG X. Smart_Construction: base on YOLOv5 head person helmet detection on construction sites[EB/OL]. [2024-12-17]. https://github.com/PeterH0323/Smart_Construction. |
[26] | WU Z Y, CHEN X K, PAN Z Z, et al. DeepSeek-VL2: mixture-of-experts vision-language models for advanced multimodal understanding[EB/OL]. [2024-12-17]. https://arxiv.org/abs/2412.10302. |