Journal of Graphics, 2026, Vol. 47, Issue (2): 264-274. DOI: 10.11996/JG.j.2095-302X.2026020264
ZHOU Qiang1, HUANG Yaoqiu2, SHI Weimin1, ZHOU Zhong1
Received: 2025-09-08
Accepted: 2025-11-25
Published: 2026-04-30
Online: 2026-05-20
Contact: ZHOU Zhong, E-mail: zz@buaa.edu.cn
Abstract:
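With the spread of 5G, cloud computing, and audio-video technologies, live video streaming has become an important form of online cultural tourism. However, mainstream multi-camera "slow live streams" have no human director or script, and their content is highly random, so traditional recommendation methods based on preference and popularity are largely ineffective. To address this, a video attractiveness assessment method is proposed that predicts how strongly content will attract viewers by evaluating the ability of multiple video sources to capture viewer attention and evoke emotional resonance. Compared with traditional preference- and popularity-based recommendation, video attractiveness assessment is better suited to scenic-area live streaming. Centered on the concept of "video attractiveness," the work first constructs a video description generation method guided by multi-perspective cues, which uses a large vision-language model (LVLM) to perform key information extraction, structured content representation, and affective semantic reasoning on videos, and integrates the results into readable description text and attractiveness factors. Second, it establishes a video attractiveness assessment method based on multimodal features, which introduces cross-attention, dynamic saliency, and negative sample augmentation into a contrastive learning network and outputs an attractiveness score together with key factors. Finally, a prototype scenic-area live streaming system driven by video attractiveness is implemented, with functions including channel recommendation, attractiveness visualization, and AI-guided tours. On the TVSum50 dataset, the relevance between the generated video descriptions and the video content was verified: it improved by 7.00% over the original video descriptions and by 6.00% in the cross-task generalization experiment. On a self-built scenic-area live streaming dataset, the multimodal-feature-based video attractiveness method improved attractiveness assessment by 24.00% over the baseline.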
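The abstract names cross-attention and negative sample augmentation inside a contrastive learning network but gives no architectural detail. The following is only a generic sketch of those two building blocks, not the authors' network: the function names, feature shapes, the temperature value, and the use of an InfoNCE-style loss are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(video_feats, text_feats):
    """Scaled dot-product cross-attention (no learned projections).

    video_feats: (n_clips, d) queries; text_feats: (n_tokens, d) keys/values.
    Returns the text-informed clip features and the attention weights.
    """
    d = video_feats.shape[-1]
    scores = video_feats @ text_feats.T / np.sqrt(d)   # (n_clips, n_tokens)
    weights = softmax(scores, axis=-1)                 # each row sums to 1
    return weights @ text_feats, weights

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for one anchor, one positive, and k negatives.

    anchor, positive: (d,); negatives: (k, d). Adding more negatives
    (one possible reading of "negative sample augmentation") enlarges
    the denominator and so makes the task harder.
    """
    def unit(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)
    a, p, n = unit(anchor), unit(positive), unit(negatives)
    logits = np.concatenate([[a @ p], n @ a]) / temperature
    logits = logits - logits.max()
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

# Hypothetical toy inputs: 4 clips and 6 description tokens, 8-dim features.
rng = np.random.default_rng(0)
fused, attn = cross_attention(rng.normal(size=(4, 8)), rng.normal(size=(6, 8)))
loss = info_nce(fused[0], fused[0] + 0.01, rng.normal(size=(3, 8)))
```

In a trained model the queries, keys, and values would pass through learned projections, and the negatives would come from a sampling strategy that the abstract does not describe.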
ZHOU Qiang, HUANG Yaoqiu, SHI Weimin, ZHOU Zhong. Video attractiveness assessment method for scenic live stream recommendations[J]. Journal of Graphics, 2026, 47(2): 264-274.
| Video text information | VT | VU | GA | MS | PK | PR | FM | BK | BT | DS | mAP |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Video title | 41.00 | 47.20 | 42.00 | 45.70 | 54.60 | 63.20 | 48.30 | 49.80 | 58.00 | 36.60 | 48.64 |
| MiniCPM-V (ours) | 48.30 | 48.10 | 53.50 | 38.40 | 54.30 | 58.50 | 54.60 | 58.90 | 70.80 | 47.40 | 53.28 |
| LLaVA (ours) | 49.90 | 60.40 | 50.80 | 46.40 | 42.70 | 49.70 | 44.10 | 51.60 | 64.00 | 39.40 | 49.90 |
| Gemma 3 (ours) | 69.40 | 54.40 | 51.60 | 43.70 | 57.70 | 53.00 | 37.50 | 56.90 | 62.20 | 48.90 | 53.50 |
Table 1 Experimental results on the correlation between videos and their descriptions (%)
| Network | Video text information | VT | VU | GA | MS | PK | PR | FM | BK | BT | DS | mAP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FlashVTG | Video title | 49.50 | 71.40 | 71.50 | 68.50 | 65.60 | 66.80 | 71.30 | 88.10 | 85.00 | 62.40 | 70.01 |
| FlashVTG | Video description | 55.80 | 70.70 | 70.90 | 71.20 | 72.80 | 81.90 | 62.10 | 67.40 | 83.10 | 68.20 | 70.40 |
| UVCOM | Video title | 82.60 | 92.40 | 90.90 | 79.00 | 83.50 | 88.90 | 75.40 | 80.00 | 86.50 | 79.90 | 83.90 |
| UVCOM | Video description | 87.30 | 94.10 | 92.60 | 68.70 | 88.00 | 84.40 | 74.90 | 90.80 | 88.50 | 82.70 | 85.20 |
| QDDETR | Video title | 75.30 | 94.10 | 84.50 | 78.00 | 88.00 | 80.80 | 73.40 | 84.80 | 88.80 | 60.00 | 80.77 |
| QDDETR | Video description | 87.60 | 93.30 | 84.40 | 80.60 | 88.30 | 87.40 | 75.10 | 92.00 | 88.70 | 78.50 | 85.59 |
Table 2 Experimental results on the generalization of video description generation (%)
| Single-modality feature | Natural landscape | Cultural heritage sites | Themed entertainment | All scenic areas | Mean mAP |
|---|---|---|---|---|---|
| I3D visual features | 62.60 | 55.50 | 72.80 | 65.50 | 64.10 |
| CLIP visual features | 61.20 | 58.70 | 70.10 | 68.40 | 64.60 |
| CLIP text features | 66.40 | 64.60 | 74.60 | 71.50 | 69.28 |
| BERT text features | 65.70 | 66.10 | 73.40 | 72.10 | 69.33 |
Table 3 Comparison results of video attractiveness evaluation (%)
| Model variant | Natural landscape | Cultural heritage sites | Themed entertainment | All scenic areas | Mean mAP |
|---|---|---|---|---|---|
| Base model | 60.80 | 58.90 | 70.60 | 62.60 | 63.23 |
| + Cross-modal attention | 64.30 | 63.40 | 73.40 | 68.80 | 67.48 |
| + Video description generation | 69.60 | 67.50 | 78.80 | 72.20 | 72.03 |
| + Dynamic saliency marking | 72.40 | 71.70 | 80.90 | 76.10 | 75.28 |
| + Negative sample augmentation (proposed method) | 76.20 | 74.10 | 83.40 | 80.70 | 78.60 |
Table 4 Ablation study of video attractiveness evaluation (%)
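The mean mAP column in Table 4 can be reproduced as the arithmetic mean of the four per-category columns. The abstract's 24.00% gain over the baseline also appears consistent with a relative (not absolute) improvement of the final row over the first row. That reading is an inference from the numbers, not something the source states. A small check, with variable and function names chosen here for illustration:

```python
# Per-category mAP values (%) copied from the first and last rows of Table 4.
base_model = [60.80, 58.90, 70.60, 62.60]
full_method = [76.20, 74.10, 83.40, 80.70]

def mean(values):
    # Arithmetic mean of a non-empty list.
    return sum(values) / len(values)

def relative_gain(new, old):
    # Relative improvement of `new` over `old`, in percent.
    return (new - old) / old * 100.0

base_avg = mean(base_model)      # table reports 63.23
full_avg = mean(full_method)     # table reports 78.60
absolute_points = full_avg - base_avg
relative_percent = relative_gain(full_avg, base_avg)

print(f"baseline mean mAP: {base_avg:.2f}")
print(f"full method mean mAP: {full_avg:.2f}")
print(f"absolute gain: {absolute_points:.2f} points")
print(f"relative gain: {relative_percent:.2f}%")
```

Under this reading the absolute difference is about 15 percentage points, while the relative gain is a little over 24%, which is close to the figure quoted in the abstract.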
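| Scenic area type | Scenic area name | Video keyframe | Ground-truth score | Predicted score |
|---|---|---|---|---|
| Natural landscape | Hainan Ring Island Tourism Highway | (image) | 4.7000 | 4.8000 |
| Natural landscape | Hainan Ring Island Tourism Highway | (image) | 4.5350 | 4.9000 |
| Cultural heritage sites | Palace Museum, Beijing | (image) | 1.1000 | 1.0000 |
| Cultural heritage sites | Palace Museum, Beijing | (image) | 4.4950 | 4.8000 |
| Themed entertainment | Hainan Tropical Wildlife Park and Botanical Garden | (image) | 1.6000 | 1.0000 |
| Themed entertainment | Hainan Tropical Wildlife Park and Botanical Garden | (image) | 4.1325 | 4.0000 |
Table 5 Example prediction results for video attractiveness evaluation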