Journal of Graphics ›› 2024, Vol. 45 ›› Issue (6): 1266-1276. DOI: 10.11996/JG.j.2095-302X.2024061266
• Special Topic on “Large Models and Graphics Technology and Applications” •

Traffic anomaly event analysis method for highway scenes based on multimodal large language models

WU Jingyi1, JING Jun2, HE Yifan1, ZHANG Shiyu1, KANG Yunfeng1, TANG Wei2, KONG Delan2, LIU Xiangdong2
Received: 2024-08-05
Accepted: 2024-10-15
Online: 2024-12-31
Published: 2024-12-24
Contact: JING Jun
About the first author: WU Jingyi (2002-), master's student. His main research interests cover cross-modal understanding and generative large language models. E-mail: goldfish_42@163.com
WU Jingyi, JING Jun, HE Yifan, ZHANG Shiyu, KANG Yunfeng, TANG Wei, KONG Delan, LIU Xiangdong. Traffic anomaly event analysis method for highway scenes based on multimodal large language models[J]. Journal of Graphics, 2024, 45(6): 1266-1276.
URL: http://www.txxb.com.cn/EN/10.11996/JG.j.2095-302X.2024061266
Fig. 2 Workflow for building a visual instruction-tuning dataset from work-order descriptions. The work-order information includes the event number, time, location, type, description, and appearance characteristics of anomalous objects
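For illustration, below is a minimal Python sketch of how one work-order record with the fields listed above might be converted into a visual instruction-tuning sample. The field values, the LLaVA-style conversation schema, and the prompt wording are assumptions made for this example, not the paper's exact format.

```python
import json

# Hypothetical work-order record with the fields named in the Fig. 2 caption.
work_order = {
    "event_number": "EV-20240001",            # assumed identifier format
    "time": "2024-03-12 08:45:00",
    "location": "K125+300, G50 expressway",   # assumed stake-mark style location
    "type": "illegal parking",
    "description": "A vehicle stopped on the emergency lane.",
    "appearance": "yellow sedan",
}

def to_instruction_sample(order, image_path):
    """Turn one work order into an (image, instruction, answer) sample
    in a LLaVA-style conversation layout (assumed, not the paper's schema)."""
    answer = (
        f"Event {order['event_number']} at {order['time']}, {order['location']}: "
        f"{order['type']}. {order['description']} "
        f"Object appearance: {order['appearance']}."
    )
    return {
        "image": image_path,
        "conversations": [
            {"role": "user",
             "content": "<image>\nGenerate a work-order description of the traffic anomaly in this frame."},
            {"role": "assistant", "content": answer},
        ],
    }

sample = to_instruction_sample(work_order, "frames/EV-20240001.jpg")
print(json.dumps(sample, ensure_ascii=False, indent=2))
```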
Fig. 3 Types of false positives in traffic event analysis systems based on traditional small models ((a) Low image quality; (b) False alarm; (c) Incorrect category; (d) Off-highway false alarm)
Fig. 6 Results of work-order description generation for traffic anomaly events (MiniCPM+SFT denotes the MLLM after supervised fine-tuning. Compared to MiniCPM, MiniCPM+SFT improves BLEU-1 and CIDEr scores by 78.9% and 47.6%, respectively)
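As a rough sketch of how the BLEU-1 and CIDEr numbers above could be computed for a generated work-order description, the snippet below assumes the nltk and pycocoevalcap packages as the evaluation backend; the paper does not state which implementation it uses, and the reference/hypothesis strings are made up. CIDEr is only meaningful over the full test corpus (its TF-IDF weights come from the reference set), so the single pair here only shows the call pattern.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from pycocoevalcap.cider.cider import Cider  # assumed evaluation backend

reference = "Event EV-20240001: illegal parking of a yellow sedan on the emergency lane."
hypothesis = "Event EV-20240001: a yellow sedan is illegally parked on the emergency lane."

# BLEU-1: put all n-gram weight on unigrams.
bleu1 = sentence_bleu(
    [reference.split()], hypothesis.split(),
    weights=(1.0, 0, 0, 0),
    smoothing_function=SmoothingFunction().method1,
)

# CIDEr expects dicts mapping a sample id to a list of captions.
gts = {"0": [reference]}
res = {"0": [hypothesis]}
cider_score, _ = Cider().compute_score(gts, res)

print(f"BLEU-1: {bleu1:.3f}  CIDEr: {cider_score:.3f}")
```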
Fig. 7 MLLM-based traffic anomaly event work order description (parking event). The descriptions generated by MiniCPM+SFT are more accurate compared to those generated by MiniCPM alone, especially regarding details like event numbers and locations
Fig. 8 MLLM-based traffic anomaly event work order description (pedestrian intrusion event). The appearance descriptions of event objects generated by MiniCPM+SFT are more detailed
Fig. 9 MLLM-based traffic anomaly event work order description (abandoned object event). MiniCPM+SFT perceives the scattered material as debris, and its tone is more assertive
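MiniCPM+SFT in Figs. 6-9 refers to MiniCPM after supervised fine-tuning on the work-order instruction data. The captions do not say how the fine-tuning was configured, so the following is only a sketch of one common parameter-efficient setup (LoRA via Hugging Face peft); the checkpoint name, target modules, and adapter hyperparameters are assumptions rather than the paper's settings.

```python
from transformers import AutoModel, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Hypothetical checkpoint; the exact MiniCPM variant used in the paper is not given here.
base_id = "openbmb/MiniCPM-V-2_6"

tokenizer = AutoTokenizer.from_pretrained(base_id, trust_remote_code=True)
model = AutoModel.from_pretrained(base_id, trust_remote_code=True)

# Attach low-rank adapters to the attention projections of the language backbone;
# rank, alpha, dropout, and module names are illustrative values.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights remain trainable
```

Training would then proceed as ordinary supervised fine-tuning on the (image, instruction, work-order answer) samples of the kind built in Fig. 2.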
Fig. 10 MLLM-based secondary review traffic alarm event (image distortion). A comparison was made between Qwen 2.5 and MiniCPM+SFT. Both are capable of filtering out event alarms caused by poor image quality. Qwen 2.5 has stronger instruction-following capabilities and provides the requested concise responses
Fig. 11 MLLM-based secondary review traffic alarm event (false alarm). MLLM determines whether a false alarm has occurred by assessing image quality and the presence of objects within the detection box. Qwen 2.5 correctly performs both steps of the evaluation, whereas MiniCPM+SFT makes an error in the second step
Fig. 12 MLLM-based secondary review traffic alarm event (off-highway false alarm). MLLM reviews alarm events through a four-step questioning process. The comparison in the image shows that Qwen 2.5 makes an error in the final step when determining whether the anomalous object is on the road, whereas MiniCPM+SFT correctly performs all four steps and successfully identifies false alarm events
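Figs. 10-12 review alarms by asking the MLLM a short chain of questions and discarding the alarm as soon as one check fails. The captions name only some of the checks (image quality, presence of an object in the detection box, and whether the object is on the road), so the four questions and the ask_mllm helper below are illustrative assumptions about how such a review chain could be wired up, not the authors' prompts.

```python
# Hypothetical secondary-review chain. ask_mllm(image, question) is assumed to wrap
# whatever chat interface the deployed model (e.g., Qwen 2.5 or MiniCPM+SFT) exposes
# and to return a short natural-language answer beginning with "yes" or "no".
REVIEW_STEPS = [
    "Is the image clear enough to judge (no severe blur, distortion, or noise)?",
    "Is there actually an object inside the marked detection box?",
    "Does the object match the alarmed event type (stopped vehicle, pedestrian, debris)?",
    "Is the object located on the highway carriageway rather than outside the road?",
]

def review_alarm(image, ask_mllm):
    """Return (is_valid_alarm, failed_step); the alarm is kept only if every check passes."""
    for step, question in enumerate(REVIEW_STEPS, start=1):
        answer = ask_mllm(image, question).strip().lower()
        if not answer.startswith("yes"):
            return False, step   # filtered out as a false alarm at this step
    return True, None
```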
Fig. 14 Examples of incorrect descriptions and hallucinations in the task of generating work order descriptions. The vehicle that violated parking regulations is a yellow car; however, the model output incorrectly identifies it as a white truck and misinterprets the car emblem JAC as a license plate number