Journal of Graphics ›› 2024, Vol. 45 ›› Issue (6): 1266-1276. DOI: 10.11996/JG.j.2095-302X.2024061266
WU Jingyi1, JING Jun2, HE Yifan1, ZHANG Shiyu1, KANG Yunfeng1, TANG Wei2, KONG Delan2, LIU Xiangdong2
Received: 2024-08-05
Accepted: 2024-10-15
Published: 2024-12-31
Online: 2024-12-24
Contact: JING Jun (1977-), researcher, Ph.D. His main research interests include digital transportation, smart highways, and highway informatization. E-mail: signal926@163.com
First author: WU Jingyi (2002-), master's student. His main research interests include cross-modal understanding and generative large language models. E-mail: goldfish_42@163.com
Abstract: To address the inability of existing traffic anomaly detection systems to perceive events in depth, and the high cost of manually reviewing alarm events, this work investigates a traffic anomaly event analysis method for highway scenes that incorporates multimodal large language models (MLLM), designing and validating three MLLM-based tasks: first, automatically generating detailed work order descriptions of anomalous events to deepen event perception; second, using the MLLM to re-review alarm events, reducing false positives and improving detection accuracy; third, generating video descriptions of anomalous events with the MLLM to enhance event interpretability. Experimental results show that the MLLM-based work order description method improves the completeness and accuracy of work order information through the construction of a visual instruction tuning dataset and model fine-tuning. In alarm event review, the MLLM can effectively screen out false positives caused by low image quality, spurious alarms, and category errors, reducing manual review costs. In addition, the MLLM-based video description method achieves efficient analysis of anomalous events by sampling and describing frames from event videos, improving event interpretability. Although the open-source model is slightly inferior to the closed-source model in certain scenarios, both demonstrate the ability to review multiple types of false positives, confirming the application potential of MLLMs in anomaly event review. This study provides a new solution for intelligent traffic surveillance systems, raising the automation level and practicality of anomaly event handling.
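To make the frame-sampling step mentioned above concrete, the following minimal Python sketch uniformly samples frames from an event clip before they are handed to the MLLM; the clip path, sample count, and prompt are illustrative placeholders rather than the paper's actual settings.

```python
import cv2  # OpenCV, used here only for video decoding

def sample_frames(video_path: str, num_frames: int = 8):
    """Uniformly sample num_frames frames from an event video clip."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices across the whole clip
    indices = [int(i * total / num_frames) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames

# The sampled frames, together with a prompt such as "Describe the anomaly
# event in this clip", would then be passed to the MLLM for description.
frames = sample_frames("event_clip.mp4", num_frames=8)
```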
WU Jingyi, JING Jun, HE Yifan, ZHANG Shiyu, KANG Yunfeng, TANG Wei, KONG Delan, LIU Xiangdong. Traffic anomaly event analysis method for highway scenes based on multimodal large language models[J]. Journal of Graphics, 2024, 45(6): 1266-1276.
Fig. 2 Workflow for building the visual instruction tuning dataset for work order descriptions. The work order information includes the event number, time, location, type, description, and appearance characteristics of anomalous objects
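As a rough illustration of what one record in such a dataset might look like, the sketch below builds a LLaVA-style instruction tuning sample that pairs an event snapshot with its work order answer; the file names, field values, and prompt wording are assumptions for illustration, not taken from the paper's dataset.

```python
import json

# One visual instruction tuning record pairing an event snapshot with its
# work order answer; every concrete value below is illustrative.
sample = {
    "image": "events/000123.jpg",
    "conversations": [
        {
            "from": "human",
            "value": "<image>\nGenerate a work order for this traffic anomaly "
                     "event, covering the event number, time, location, type, "
                     "description, and the appearance of the anomalous object.",
        },
        {
            "from": "gpt",
            "value": "Event number: 000123. Time: 2024-05-01 14:32. "
                     "Location: K1024+300, lane 2. Type: illegal parking. "
                     "Description: a vehicle has stopped in a through lane. "
                     "Appearance: a yellow sedan.",
        },
    ],
}

# Records are accumulated into a JSONL file used for supervised fine-tuning.
with open("work_order_sft.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```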
Fig. 3 Types of false positives in traffic event analysis systems based on traditional small models ((a) low image quality; (b) spurious alarm; (c) incorrect category; (d) off-highway false positive)
Fig. 6 Results of work order descriptions for traffic anomaly events (MiniCPM+SFT denotes the MLLM after supervised fine-tuning; compared with MiniCPM, MiniCPM+SFT improves the BLEU-1 and CIDEr scores by 78.9% and 47.6%, respectively)
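This page does not spell out the supervised fine-tuning recipe behind MiniCPM+SFT; one common parameter-efficient approach is LoRA, sketched below with the Hugging Face transformers and peft libraries. The checkpoint id, target modules, and hyperparameters are placeholders, not the paper's configuration.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder checkpoint id; the paper fine-tunes MiniCPM, but the exact
# checkpoint and training setup are not specified on this page.
model = AutoModelForCausalLM.from_pretrained(
    "your-org/minicpm-checkpoint", trust_remote_code=True
)

lora_cfg = LoraConfig(
    r=8,                                  # adapter rank (illustrative)
    lora_alpha=16,                        # scaling factor (illustrative)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)

# Wrap the base model so that only the small LoRA adapters are trained,
# which keeps fine-tuning on work order data cheap.
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()
```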
Fig. 7 MLLM-based work order description of a traffic anomaly event (parking event). The descriptions generated by MiniCPM+SFT are more accurate than those generated by MiniCPM, especially for details such as the event number and location
Fig. 8 MLLM-based work order description of a traffic anomaly event (pedestrian intrusion event). MiniCPM+SFT generates more detailed appearance descriptions of the event objects
Fig. 9 MLLM-based work order description of a traffic anomaly event (abandoned object event). MiniCPM+SFT perceives that the scattered material is debris, and its tone is more assertive
Fig. 10 MLLM-based secondary review of a traffic alarm event (image distortion). Qwen 2.5 and MiniCPM+SFT are compared; both can filter out alarms triggered by poor image quality. Qwen 2.5 has stronger instruction-following ability and gives the requested concise answer
Fig. 11 MLLM-based secondary review of a traffic alarm event (spurious alarm). The MLLM judges whether a false positive has occurred by assessing the image quality and whether an object is present in the detection box. Qwen 2.5 answers both steps correctly, whereas MiniCPM+SFT errs in the second step
Fig. 12 MLLM-based secondary review of a traffic alarm event (off-highway false positive). The MLLM reviews the alarm event through a four-step questioning process. Qwen 2.5 errs in the final step, when judging whether the anomalous object is on the road, whereas MiniCPM+SFT answers all four steps correctly and successfully identifies the false positive
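A minimal sketch of how such a step-by-step review could be scripted is shown below, assuming a hypothetical `ask(image, question)` helper that queries the MLLM and returns a yes/no verdict; the four questions paraphrase the review steps illustrated in Figs. 10-12, not the paper's exact prompts.

```python
def review_alarm(image, ask):
    """Step-by-step secondary review of one alarm snapshot.

    `ask(image, question)` is a hypothetical helper that sends a single
    question about the snapshot to the MLLM and returns True for "yes".
    The questions paraphrase the review steps; they are not the paper's
    exact prompts.
    """
    steps = [
        "Is the image quality good enough to judge the scene?",
        "Is there an object inside the detection box?",
        "Does the object match the reported event category?",
        "Is the object located on the roadway itself?",
    ]
    for question in steps:
        if not ask(image, question):
            return "false positive"  # any failed check rejects the alarm
    return "confirmed"               # all checks passed, keep the alarm
```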
Fig. 14 Examples of incorrect descriptions and hallucinations in the work order description task. The illegally parked vehicle is a yellow car, but the model describes a white truck and misidentifies the JAC emblem as a license plate number