Journal of Graphics ›› 2024, Vol. 45 ›› Issue (6): 1266-1276. DOI: 10.11996/JG.j.2095-302X.2024061266
• Special Topic on “Large Models and Graphics Technology and Applications” •

Traffic anomaly event analysis method for highway scenes based on multimodal large language models

WU Jingyi1, JING Jun2, HE Yifan1, ZHANG Shiyu1, KANG Yunfeng1, TANG Wei2, KONG Delan2, LIU Xiangdong2
Received: 2024-08-05
Accepted: 2024-10-15
Online: 2024-12-31
Published: 2024-12-24
Contact: JING Jun
About the first author: WU Jingyi (2002-), master's student. His main research interests cover cross-modal understanding and generative large language models. E-mail: goldfish_42@163.com
WU Jingyi, JING Jun, HE Yifan, ZHANG Shiyu, KANG Yunfeng, TANG Wei, KONG Delan, LIU Xiangdong. Traffic anomaly event analysis method for highway scenes based on multimodal large language models[J]. Journal of Graphics, 2024, 45(6): 1266-1276.
URL: http://www.txxb.com.cn/EN/10.11996/JG.j.2095-302X.2024061266
Fig. 2 Workflow for building a visual instruction-tuning dataset from work-order descriptions. The work-order information includes the event number, time, location, type, description, and appearance characteristics of anomalous objects
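For illustration, below is a minimal Python sketch of how one work-order record with the fields listed above might be converted into a visual instruction-tuning sample. The field values, the LLaVA-style conversation schema, and the prompt wording are assumptions made for this example, not the paper's exact format.

```python
import json

# Hypothetical work-order record with the fields named in the Fig. 2 caption.
work_order = {
    "event_number": "EV-20240001",            # assumed identifier format
    "time": "2024-03-12 08:45:00",
    "location": "K125+300, G50 expressway",   # assumed stake-mark style location
    "type": "illegal parking",
    "description": "A vehicle stopped on the emergency lane.",
    "appearance": "yellow sedan",
}

def to_instruction_sample(order, image_path):
    """Turn one work order into an (image, instruction, answer) sample
    in a LLaVA-style conversation layout (assumed, not the paper's schema)."""
    answer = (
        f"Event {order['event_number']} at {order['time']}, {order['location']}: "
        f"{order['type']}. {order['description']} "
        f"Object appearance: {order['appearance']}."
    )
    return {
        "image": image_path,
        "conversations": [
            {"role": "user",
             "content": "<image>\nGenerate a work-order description of the traffic anomaly in this frame."},
            {"role": "assistant", "content": answer},
        ],
    }

sample = to_instruction_sample(work_order, "frames/EV-20240001.jpg")
print(json.dumps(sample, ensure_ascii=False, indent=2))
```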
Fig. 3 Types of false positives in traffic event analysis systems based on traditional small models ((a) Low image quality; (b) False alarm; (c) Incorrect category; (d) Off-highway false alarm)
Fig. 6 Results of work-order description generation for traffic anomaly events (MiniCPM+SFT denotes the MLLM after supervised fine-tuning. Compared to MiniCPM, MiniCPM+SFT improves BLEU-1 and CIDEr scores by 78.9% and 47.6%, respectively)
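As a rough sketch of how the BLEU-1 and CIDEr numbers above could be computed for a generated work-order description, the snippet below assumes the nltk and pycocoevalcap packages as the evaluation backend; the paper does not state which implementation it uses, and the reference/hypothesis strings are made up. CIDEr is only meaningful over the full test corpus (its TF-IDF weights come from the reference set), so the single pair here only shows the call pattern.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from pycocoevalcap.cider.cider import Cider  # assumed evaluation backend

reference = "Event EV-20240001: illegal parking of a yellow sedan on the emergency lane."
hypothesis = "Event EV-20240001: a yellow sedan is illegally parked on the emergency lane."

# BLEU-1: put all n-gram weight on unigrams.
bleu1 = sentence_bleu(
    [reference.split()], hypothesis.split(),
    weights=(1.0, 0, 0, 0),
    smoothing_function=SmoothingFunction().method1,
)

# CIDEr expects dicts mapping a sample id to a list of captions.
gts = {"0": [reference]}
res = {"0": [hypothesis]}
cider_score, _ = Cider().compute_score(gts, res)

print(f"BLEU-1: {bleu1:.3f}  CIDEr: {cider_score:.3f}")
```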
Fig. 7 MLLM-based traffic anomaly event work order description (parking event). The descriptions generated by MiniCPM+SFT are more accurate compared to those generated by MiniCPM alone, especially regarding details like event numbers and locations
Fig. 8 MLLM-based traffic anomaly event work order description (pedestrian intrusion event). The appearance descriptions of event objects generated by MiniCPM+SFT are more detailed
Fig. 9 MLLM-based traffic anomaly event work order description (abandoned object event). MiniCPM+SFT perceives the scattered material as debris, and its tone is more assertive
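MiniCPM+SFT in Figs. 6-9 refers to MiniCPM after supervised fine-tuning on the work-order instruction data. The captions do not say how the fine-tuning was configured, so the following is only a sketch of one common parameter-efficient setup (LoRA via Hugging Face peft); the checkpoint name, target modules, and adapter hyperparameters are assumptions rather than the paper's settings.

```python
from transformers import AutoModel, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Hypothetical checkpoint; the exact MiniCPM variant used in the paper is not given here.
base_id = "openbmb/MiniCPM-V-2_6"

tokenizer = AutoTokenizer.from_pretrained(base_id, trust_remote_code=True)
model = AutoModel.from_pretrained(base_id, trust_remote_code=True)

# Attach low-rank adapters to the attention projections of the language backbone;
# rank, alpha, dropout, and module names are illustrative values.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights remain trainable
```

Training would then proceed as ordinary supervised fine-tuning on the (image, instruction, work-order answer) samples of the kind built in Fig. 2.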
Fig. 10 MLLM-based secondary review traffic alarm event (image distortion). A comparison was made between Qwen 2.5 and MiniCPM+SFT. Both are capable of filtering out event alarms caused by poor image quality. Qwen 2.5 has stronger instruction-following capabilities and provides the requested concise responses
Fig. 11 MLLM-based secondary review traffic alarm event (false alarm). MLLM determines whether a false alarm has occurred by assessing image quality and the presence of objects within the detection box. Qwen 2.5 correctly performs both steps of the evaluation, whereas MiniCPM+SFT makes an error in the second step
Fig. 12 MLLM-based secondary review traffic alarm event (off-highway false alarm). MLLM reviews alarm events through a four-step questioning process. The comparison in the image shows that Qwen 2.5 makes an error in the final step when determining whether the anomalous object is on the road, whereas MiniCPM+SFT correctly performs all four steps and successfully identifies false alarm events
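Figs. 10-12 review alarms by asking the MLLM a short chain of questions and discarding the alarm as soon as one check fails. The captions name only some of the checks (image quality, presence of an object in the detection box, and whether the object is on the road), so the four questions and the ask_mllm helper below are illustrative assumptions about how such a review chain could be wired up, not the authors' prompts.

```python
# Hypothetical secondary-review chain. ask_mllm(image, question) is assumed to wrap
# whatever chat interface the deployed model (e.g., Qwen 2.5 or MiniCPM+SFT) exposes
# and to return a short natural-language answer beginning with "yes" or "no".
REVIEW_STEPS = [
    "Is the image clear enough to judge (no severe blur, distortion, or noise)?",
    "Is there actually an object inside the marked detection box?",
    "Does the object match the alarmed event type (stopped vehicle, pedestrian, debris)?",
    "Is the object located on the highway carriageway rather than outside the road?",
]

def review_alarm(image, ask_mllm):
    """Return (is_valid_alarm, failed_step); the alarm is kept only if every check passes."""
    for step, question in enumerate(REVIEW_STEPS, start=1):
        answer = ask_mllm(image, question).strip().lower()
        if not answer.startswith("yes"):
            return False, step   # filtered out as a false alarm at this step
    return True, None
```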
Fig. 14 Examples of incorrect descriptions and hallucinations in the task of generating work order descriptions. The vehicle that violated parking regulations is a yellow car; however, the model output incorrectly identifies it as a white truck and misinterprets the car emblem JAC as a license plate number