Journal of Graphics ›› 2026, Vol. 47 ›› Issue (2): 264-274. DOI: 10.11996/JG.j.2095-302X.2026020264

• Image Processing and Computer Vision •

Video attractiveness assessment method for scenic live stream recommendations

ZHOU Qiang1, HUANG Yaoqiu2, SHI Weimin1, ZHOU Zhong1

  1. School of Computer Science and Engineering, Beihang University, Beijing 100191, China
  2. School of Software, Beihang University, Beijing 100191, China
  • Received: 2025-09-08 Accepted: 2025-11-25 Published: 2026-04-30 Online: 2026-05-20
  • Contact: ZHOU Zhong, E-mail: zz@buaa.edu.cn
  • Supported by:
    National Natural Science Foundation of China (62272018); National Natural Science Foundation of China (62206184); Science and Technology Project of Hainan Provincial Department of Transportation (HNJTT-KXC2024-3-22-02)

Abstract:

With the proliferation of 5G, cloud computing, and audio-video technologies, live video streaming has become an important form of online cultural tourism. However, mainstream multi-camera “slow live broadcasts” lack human direction and scripting, and their content is highly random, which undermines traditional recommendation methods based on user preferences or video popularity. To address this limitation, a video attractiveness assessment method is proposed that predicts how strongly content will attract viewers by evaluating how well each of multiple video sources stimulates viewer attention and emotional resonance. This approach is better suited to scenic-area live streaming than conventional preference- and popularity-based methods. First, a multi-perspective cue-guided video description generation method is developed, which uses a Large Vision-Language Model (LVLM) to extract key information, build structured content representations, and infer emotional semantics, then integrates them into readable descriptive text and attractiveness factors. Second, a multimodal feature-based attractiveness assessment method is established, which introduces cross-attention, dynamic saliency, and negative sample augmentation into a contrastive-learning network and outputs attractiveness scores and key factors. Finally, an attractiveness-driven live-streaming system prototype for scenic areas is implemented, featuring channel recommendation, attractiveness visualization, and AI-guided tours. On the TVSum50 dataset, the relevance between the generated descriptions and the video content improved by 7.00% over the raw video descriptions, with a 6.00% improvement in cross-task generalization experiments. On a self-built scenic live-streaming dataset, the multimodal attractiveness assessment method achieved 24.00% higher accuracy than unimodal baselines.
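
The abstract names cross-attention and contrastive learning with negative samples but gives no implementation details. The following is a minimal NumPy sketch of those two generic building blocks only, not the authors' network: the feature sizes, the temperature, the mean pooling, and all function names are assumptions made for illustration, and dynamic saliency and negative sample augmentation are not modeled.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    # Scaled dot-product attention: each query row (e.g. a description
    # token) attends over all key/value rows (e.g. video frames).
    d = queries.shape[-1]
    weights = softmax(queries @ keys.T / np.sqrt(d), axis=-1)
    return weights @ values, weights

def l2_normalize(x, eps=1e-12):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def info_nce(anchor, positive, negatives, temperature=0.1):
    # InfoNCE loss for one anchor, one positive, and k negatives,
    # using cosine similarity. Index 0 of the candidates is the positive.
    a = l2_normalize(anchor)
    candidates = l2_normalize(np.vstack([positive[None, :], negatives]))
    logits = candidates @ a / temperature
    log_probs = logits - logits.max()
    log_probs = log_probs - np.log(np.exp(log_probs).sum())
    return -log_probs[0]

# Hypothetical toy features: 8 frames and 5 description tokens, 16-dim each.
rng = np.random.default_rng(0)
frames = rng.normal(size=(8, 16))
tokens = rng.normal(size=(5, 16))

# Text tokens query the video frames; the fused rows are mean-pooled
# into a single clip embedding (pooling choice is an assumption).
fused, weights = cross_attention(tokens, frames, frames)
clip_embedding = fused.mean(axis=0)

# A slightly perturbed copy plays the positive; random vectors play negatives.
positive = clip_embedding + 0.01 * rng.normal(size=16)
negatives = rng.normal(size=(4, 16))
loss = info_nce(clip_embedding, positive, negatives)
print(weights.shape, fused.shape, float(loss))
```

In a trained model the embeddings would come from learned video and text encoders, and a separate scoring head would map the fused representation to an attractiveness score; both are outside the scope of this sketch.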

Key words: video attraction analysis, multimodal fusion, scenic spot live-streaming, intelligent recommendation, large vision-language model
