Video attractiveness assessment method for scenic live stream recommendations

doi:10.11996/JG.j.2095-302X.2026020264

Abstract

With the proliferation of 5G, cloud computing, and audio-video technologies, live streaming has emerged as a pivotal medium for online cultural tourism. However, mainstream multi-camera “slow live broadcasts” lack human-guided narration and scripting, so their content is highly random, which undermines traditional recommendation methods based on user preferences or video popularity. To address this limitation, a video attractiveness assessment method was proposed that predicts audience engagement by evaluating how multi-source video content stimulates viewer attention and emotional resonance. This approach is more suitable for scenic-area live streaming scenarios than conventional methods. First, centered on video attractiveness, a multi-perspective guided video-description generation method was developed, which leveraged a Large Vision-Language Model (LVLM) to extract key information, structure content representations, and infer emotional semantics, and synthesized the results into readable descriptive texts and attractiveness factors. Second, a multimodal feature fusion-based attractiveness assessment method integrated cross-attention mechanisms, dynamic saliency, and negative sample augmentation within a contrastive-learning network to output attractiveness scores and critical factors. Finally, an attractiveness-driven live-streaming system prototype for scenic areas was implemented, featuring channel recommendation, attractiveness visualization, and AI-guided navigation. Validation on the TVSum50 dataset demonstrated a 7.00% improvement in video-description relevance over raw descriptions and a 6.00% gain in cross-task generalization. On a self-built scenic live streaming dataset, the multimodal attractiveness evaluation method achieved 24.00% higher accuracy than unimodal baselines.
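
The cross-attention fusion mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' network: it assumes plain scaled dot-product attention with no learned projections, random stand-in features, and a hypothetical sigmoid read-out named `score`. The shapes and the names `text_tokens` and `frame_feats` are likewise assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_tokens, frame_feats):
    """Text tokens (queries) attend over video frame features (keys and values)."""
    d = text_tokens.shape[-1]
    scores = text_tokens @ frame_feats.T / np.sqrt(d)  # (n_text, n_frames)
    weights = softmax(scores, axis=-1)                 # each row sums to 1
    fused = weights @ frame_feats                      # (n_text, d)
    return fused, weights

# Random stand-ins for description-token and frame embeddings (assumed shapes).
rng = np.random.default_rng(0)
text_tokens = rng.normal(size=(4, 16))
frame_feats = rng.normal(size=(10, 16))

fused, weights = cross_attention(text_tokens, frame_feats)

# Hypothetical read-out: squash the mean fused activation into (0, 1).
score = float(1.0 / (1.0 + np.exp(-fused.mean())))
print(fused.shape, weights.shape, round(score, 3))
```

In a trained model the queries, keys, and values would pass through learned linear projections, and the read-out would be a learned head rather than a fixed sigmoid of the mean.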

Key words: video attraction analysis, multimodal fusion, scenic spot live-streaming, intelligent recommendation, large vision-language model

CLC Number: G206

ZHOU Qiang, HUANG Yaoqiu, SHI Weimin, ZHOU Zhong. Video attractiveness assessment method for scenic live stream recommendations[J]. Journal of Graphics, 2026, 47(2): 264-274.

Figures/Tables 11

Video text information	VT	VU	GA	MS	PK	PR	FM	BK	BT	DS	mAP
Video title	41.00	47.20	42.00	45.70	54.60	63.20	48.30	49.80	58.00	36.60	48.64
MiniCPM-V (ours)	48.30	48.10	53.50	38.40	54.30	58.50	54.60	58.90	70.80	47.40	53.28
LLaVA (ours)	49.90	60.40	50.80	46.40	42.70	49.70	44.10	51.60	64.00	39.40	49.90
Gemma 3 (ours)	69.40	54.40	51.60	43.70	57.70	53.00	37.50	56.90	62.20	48.90	53.50

Network	Video text information	VT	VU	GA	MS	PK	PR	FM	BK	BT	DS	mAP
FlashVTG	Video title	49.50	71.40	71.50	68.50	65.60	66.80	71.30	88.10	85.00	62.40	70.01
FlashVTG	Video description	55.80	70.70	70.90	71.20	72.80	81.90	62.10	67.40	83.10	68.20	70.40
UVCOM	Video title	82.60	92.40	90.90	79.00	83.50	88.90	75.40	80.00	86.50	79.90	83.90
UVCOM	Video description	87.30	94.10	92.60	68.70	88.00	84.40	74.90	90.80	88.50	82.70	85.20
QDDETR	Video title	75.30	94.10	84.50	78.00	88.00	80.80	73.40	84.80	88.80	60.00	80.77
QDDETR	Video description	87.60	93.30	84.40	80.60	88.30	87.40	75.10	92.00	88.70	78.50	85.59
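
To make the comparison in the table above easier to read, the sketch below computes the absolute and relative mAP change for each network when the text input switches from the video title to the generated video description. The mAP values are copied from the table. The helper name `gains` and the rounding to two decimals are assumptions for illustration, not part of the paper.

```python
# mAP (%) per network, copied from the table: (video title, video description).
map_by_network = {
    "FlashVTG": (70.01, 70.40),
    "UVCOM": (83.90, 85.20),
    "QDDETR": (80.77, 85.59),
}

def gains(title_map, desc_map):
    """Return (absolute gain in points, relative gain in percent), rounded."""
    absolute = desc_map - title_map
    relative = 100.0 * absolute / title_map
    return round(absolute, 2), round(relative, 2)

for name, (title_map, desc_map) in map_by_network.items():
    abs_gain, rel_gain = gains(title_map, desc_map)
    print(f"{name}: +{abs_gain:.2f} points ({rel_gain:.2f}% relative)")
```

All three networks score higher with descriptions than with titles, and the gap is largest for QDDETR.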

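Unimodal feature	Natural landscape	Cultural heritage site	Themed entertainment	All scenic areas	Mean mAP
I3D visual features	62.60	55.50	72.80	65.50	64.10
CLIP visual features	61.20	58.70	70.10	68.40	64.60
CLIP text features	66.40	64.60	74.60	71.50	69.28
BERT text features	65.70	66.10	73.40	72.10	69.33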
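
The last column of the table above appears to be the arithmetic mean of the four preceding columns. That reading is inferred from the numbers, not stated in the source. The sketch below checks it against the reported values to within rounding. The dictionary keys and the helper `mean_map` are illustrative names.

```python
# Per-scene mAP (%) and the reported mean, copied from the table.
rows = {
    "I3D visual": ([62.60, 55.50, 72.80, 65.50], 64.10),
    "CLIP visual": ([61.20, 58.70, 70.10, 68.40], 64.60),
    "CLIP text": ([66.40, 64.60, 74.60, 71.50], 69.28),
    "BERT text": ([65.70, 66.10, 73.40, 72.10], 69.33),
}

def mean_map(values):
    """Arithmetic mean of the per-scene mAP values."""
    return sum(values) / len(values)

for name, (values, reported) in rows.items():
    computed = mean_map(values)
    # The reported figures are rounded to two decimals, so allow 0.01 slack.
    consistent = abs(computed - reported) < 0.01
    print(f"{name}: computed {computed:.3f}, reported {reported:.2f}, consistent={consistent}")

# Feature with the highest reported mean mAP among the unimodal baselines.
best = max(rows, key=lambda name: rows[name][1])
print("best unimodal feature:", best)
```

Under this reading, the two text features lead the two visual features by roughly five points of mean mAP.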