Video attractiveness assessment method for scenic live stream recommendations

doi:10.11996/JG.j.2095-302X.2026020264

Abstract

With the spread of 5G, cloud computing, and audio-video technologies, live video streaming has become an important form of online cultural tourism. However, mainstream multi-camera “slow live broadcasts” have no human director and no script, so their content is highly random and traditional recommendation methods based on user preference and popularity perform poorly. To address this, a video attractiveness assessment method is proposed that predicts how strongly content will attract viewers by evaluating how well each of several video sources stimulates viewer attention and emotional resonance. Compared with traditional preference- and popularity-based recommendation, this method is better suited to scenic-area live streaming. The work is organized around the concept of “video attractiveness”. First, a video description generation method guided by multi-perspective cues is constructed: a Large Vision-Language Model (LVLM) performs key information extraction, structured content representation, and emotional-semantic reasoning on the video, and the results are integrated into readable description text and attractiveness factors. Second, a video attractiveness assessment method based on multimodal features is established: cross-attention, dynamic saliency, and negative sample augmentation are introduced into a contrastive learning network, which outputs an attractiveness score and key factors. Finally, a prototype of an attractiveness-driven scenic-area live-streaming system is implemented, with channel recommendation, attractiveness visualization, and AI-guided tours. On the TVSum50 dataset, the relevance between video descriptions and video content was verified, with a 7.00% improvement over the original video descriptions and a 6.00% improvement in cross-task generalization experiments. On a self-built scenic-area live-streaming dataset, the multimodal attractiveness assessment method improved attractiveness evaluation by 24.00% over the unimodal baselines.

Keywords: video attractiveness analysis, multimodal fusion, scenic-area live streaming, intelligent recommendation, large vision-language model

CLC number:

G206

ZHOU Qiang, HUANG Yaoqiu, SHI Weimin, ZHOU Zhong. Video attractiveness assessment method for scenic live stream recommendations[J]. Journal of Graphics, 2026, 47(2): 264-274.

Figures/Tables: 11

Video text information	VT	VU	GA	MS	PK	PR	FM	BK	BT	DS	mAP
Video title	41.00	47.20	42.00	45.70	54.60	63.20	48.30	49.80	58.00	36.60	48.64
MiniCPM-V (ours)	48.30	48.10	53.50	38.40	54.30	58.50	54.60	58.90	70.80	47.40	53.28
LLaVA (ours)	49.90	60.40	50.80	46.40	42.70	49.70	44.10	51.60	64.00	39.40	49.90
Gemma 3 (ours)	69.40	54.40	51.60	43.70	57.70	53.00	37.50	56.90	62.20	48.90	53.50
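
The mAP column of the table above can be cross-checked against the per-category columns. The sketch below assumes that mAP here is the unweighted mean of the ten category scores (VT through DS), which this excerpt does not state explicitly. The names `AP_BY_TEXT_SOURCE`, `REPORTED_MAP`, `mean_ap`, and `recompute_map` are illustrative and not taken from the paper.

```python
# AP values (%) transcribed from the table above.
# Column order: VT, VU, GA, MS, PK, PR, FM, BK, BT, DS.
AP_BY_TEXT_SOURCE = {
    "Video title": [41.00, 47.20, 42.00, 45.70, 54.60, 63.20, 48.30, 49.80, 58.00, 36.60],
    "MiniCPM-V": [48.30, 48.10, 53.50, 38.40, 54.30, 58.50, 54.60, 58.90, 70.80, 47.40],
    "LLaVA": [49.90, 60.40, 50.80, 46.40, 42.70, 49.70, 44.10, 51.60, 64.00, 39.40],
    "Gemma 3": [69.40, 54.40, 51.60, 43.70, 57.70, 53.00, 37.50, 56.90, 62.20, 48.90],
}

# mAP column as reported in the table.
REPORTED_MAP = {"Video title": 48.64, "MiniCPM-V": 53.28, "LLaVA": 49.90, "Gemma 3": 53.50}


def mean_ap(ap_values):
    """Unweighted mean of per-category AP values (assumed definition of mAP)."""
    return sum(ap_values) / len(ap_values)


def recompute_map(table):
    """Recompute the mAP for every row of a {row name: AP list} table."""
    return {name: mean_ap(values) for name, values in table.items()}


if __name__ == "__main__":
    for name, value in recompute_map(AP_BY_TEXT_SOURCE).items():
        print(f"{name}: recomputed {value:.2f}, reported {REPORTED_MAP[name]:.2f}")
```

Under this assumption the recomputed means agree with the reported mAP column to within 0.05 percentage points for every row.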

Network	Video text information	VT	VU	GA	MS	PK	PR	FM	BK	BT	DS	mAP
FlashVTG	Video title	49.50	71.40	71.50	68.50	65.60	66.80	71.30	88.10	85.00	62.40	70.01
FlashVTG	Video description	55.80	70.70	70.90	71.20	72.80	81.90	62.10	67.40	83.10	68.20	70.40
UVCOM	Video title	82.60	92.40	90.90	79.00	83.50	88.90	75.40	80.00	86.50	79.90	83.90
UVCOM	Video description	87.30	94.10	92.60	68.70	88.00	84.40	74.90	90.80	88.50	82.70	85.20
QDDETR	Video title	75.30	94.10	84.50	78.00	88.00	80.80	73.40	84.80	88.80	60.00	80.77
QDDETR	Video description	87.60	93.30	84.40	80.60	88.30	87.40	75.10	92.00	88.70	78.50	85.59
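
To read the table above as a title-versus-description comparison, the mAP gain of each network can be computed directly from the last column. This is a minimal sketch that only subtracts the reported values. The names `MAP_BY_NETWORK`, `description_gain`, and `gains_by_network` are illustrative, and the gains are absolute percentage points, not relative improvements.

```python
# Reported mAP (%) per network, using video titles vs. generated descriptions as text.
MAP_BY_NETWORK = {
    "FlashVTG": {"title": 70.01, "description": 70.40},
    "UVCOM": {"title": 83.90, "description": 85.20},
    "QDDETR": {"title": 80.77, "description": 85.59},
}


def description_gain(scores):
    """Absolute mAP gain (percentage points) of descriptions over titles."""
    return scores["description"] - scores["title"]


def gains_by_network(table):
    """Gain per network, rounded to the two decimals used in the table."""
    return {network: round(description_gain(scores), 2) for network, scores in table.items()}


if __name__ == "__main__":
    for network, gain in gains_by_network(MAP_BY_NETWORK).items():
        print(f"{network}: {gain:+.2f} points")
```

All three networks show a positive gain in the mAP column, although individual categories (for example FM and BK with FlashVTG) move in the opposite direction.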

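Unimodal feature	Natural landscape	Cultural heritage site	Themed entertainment	All scenic areas	Average mAP
I3D visual features	62.60	55.50	72.80	65.50	64.10
CLIP visual features	61.20	58.70	70.10	68.40	64.60
CLIP text features	66.40	64.60	74.60	71.50	69.28
BERT text features	65.70	66.10	73.40	72.10	69.33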
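
The unimodal table above can be summarized by asking which single feature scores highest in each scene category. The sketch below does this and also recomputes the last column, assuming that "average mAP" is the plain mean of the four preceding columns (including the "all scenic areas" column). The names `SCENES`, `MAP_BY_FEATURE`, `row_average`, and `best_feature_per_scene` are illustrative.

```python
# mAP (%) per unimodal feature, transcribed from the table above.
SCENES = ["Natural landscape", "Cultural heritage site", "Themed entertainment", "All scenic areas"]
MAP_BY_FEATURE = {
    "I3D visual": [62.60, 55.50, 72.80, 65.50],
    "CLIP visual": [61.20, 58.70, 70.10, 68.40],
    "CLIP text": [66.40, 64.60, 74.60, 71.50],
    "BERT text": [65.70, 66.10, 73.40, 72.10],
}


def row_average(values):
    """Plain mean over the four scene columns (assumed definition of the last column)."""
    return sum(values) / len(values)


def best_feature_per_scene(table, scenes):
    """For each scene column, return the feature with the highest mAP."""
    best = {}
    for index, scene in enumerate(scenes):
        best[scene] = max(table, key=lambda name: table[name][index])
    return best


if __name__ == "__main__":
    for scene, feature in best_feature_per_scene(MAP_BY_FEATURE, SCENES).items():
        print(f"{scene}: {feature}")
    for feature, values in MAP_BY_FEATURE.items():
        print(f"{feature}: average {row_average(values):.3f}")
```

In every scene column the best single feature is a text feature, which is consistent with the higher average mAP of the two text rows.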