
Journal of Graphics ›› 2025, Vol. 46 ›› Issue (3): 491-501. DOI: 10.11996/JG.j.2095-302X.2025030491

• Image Processing and Computer Vision •

Data-efficient video retrieval based on contrastive learning

LING Fei1, YU Jingtao2, ZHU Zheyan1, LUO Jian1, ZHU Jixiang2, CHEN Xianke2, DONG Jianfeng2,3

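  1. Department of Digital Information Technology, Zhejiang Technical Institute of Economics, Hangzhou, Zhejiang 310018, China
    2. Department of Computer Science and Technology, Zhejiang Gongshang University, Hangzhou, Zhejiang 310018, China
    3. Zhejiang Key Laboratory of Big Data and Future E-Commerce Technology, Hangzhou, Zhejiang 310018, China
  • Received: 2024-09-24 Accepted: 2025-01-14 Published: 2025-06-30 Online: 2025-06-13
  • Corresponding authors: CHEN Xianke (1998-), male, Ph.D. candidate. His main research interests cover graphics and image processing and computer vision. E-mail: a397283164@163.com; DONG Jianfeng (1991-), male, researcher, Ph.D. His main research interests cover multimedia analysis and retrieval and multimodal learning. E-mail: dongjf24@gmail.com
  • First author: LING Fei (1990-), male, lecturer, master's degree. His main research interests cover video understanding and cross-modal retrieval. E-mail: lingfei@zjtie.edu.cn
  • Supported by:
    ‘Pioneer’ and ‘Leading Goose’ R&D Program of Zhejiang (2023C01212); Public Welfare Technology Research Project of Zhejiang Province (LGF21F020010); Fundamental Research Funds for the Provincial Universities of Zhejiang (FR2402ZD); Scientific Research Fund of Zhejiang Provincial Education Department (Y202351804)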

Data-efficient video retrieval with contrastive learning

LING Fei1, YU Jingtao2, ZHU Zheyan1, LUO Jian1, ZHU Jixiang2, CHEN Xianke2, DONG Jianfeng2,3

  1. Department of Digital Information Technology, Zhejiang Technical Institute of Economics, Hangzhou, Zhejiang 310018, China
    2. Department of Computer Science and Technology, Zhejiang Gongshang University, Hangzhou, Zhejiang 310018, China
    3. Zhejiang Key Laboratory of Big Data and Future E-Commerce Technology, Hangzhou, Zhejiang 310018, China
  • Received: 2024-09-24 Accepted: 2025-01-14 Published: 2025-06-30 Online: 2025-06-13
  • First author: LING Fei (1990-), lecturer, master's degree. His main research interests cover video understanding and cross-modal retrieval. E-mail: lingfei@zjtie.edu.cn
  • Supported by:
    ‘Pioneer’ and ‘Leading Goose’ R&D Program of Zhejiang (2023C01212); Public Welfare Technology Research Project of Zhejiang Province (LGF21F020010); Fundamental Research Funds for the Provincial Universities of Zhejiang (FR2402ZD); Scientific Research Fund of Zhejiang Provincial Education Department (Y202351804)

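Abstract:

The performance of video retrieval systems depends heavily on annotated data, so a key problem is how to improve performance while reducing reliance on costly manual annotation. To this end, a data-efficient video retrieval method based on contrastive learning is proposed, comprising two key optimization strategies. First, to construct more diverse and effective training data, a content-aware feature-level data augmentation is proposed, which uses a K-nearest-neighbor algorithm based on inter-frame similarity to capture deep semantic information and reduce dependence on annotated data. Second, a long-short dynamic sampling strategy is designed: by extracting a long clip and a short clip inside it from each video, it constructs positive sample pairs with multi-scale information for more effective contrastive learning, and it dynamically adjusts the sampling lengths to improve data utilization. Experimental results on the SVD and UCF101 datasets show that the method significantly outperforms existing retrieval models. Extensive ablation experiments show that content-aware feature-level data augmentation improves model adaptability, and that long-short dynamic sampling is not only suitable for self-supervised learning but also improves the performance of semi-supervised models.

Key words: contrastive learning, content awareness, feature augmentation, video retrieval, video representation learning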
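
The abstract does not spell out how the content-aware feature-level augmentation is computed, so the following is only a minimal sketch of one plausible reading: each frame feature is blended with the mean of its k most similar frames from the same video, using cosine similarity. The function names `cosine` and `knn_augment` and the parameters `k` and `alpha` are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_augment(frames, k=2, alpha=0.5):
    """Return one augmented feature per frame.

    Assumption (not from the paper): the augmented feature is
    alpha * frame + (1 - alpha) * mean of the k most similar other frames.
    """
    augmented = []
    for i, frame in enumerate(frames):
        # Rank every other frame of the same video by similarity to this one.
        sims = [(cosine(frame, other), j)
                for j, other in enumerate(frames) if j != i]
        sims.sort(reverse=True)
        neighbours = [frames[j] for _, j in sims[:k]]
        if not neighbours:
            # A single-frame video has no neighbours; keep the feature as is.
            augmented.append(list(frame))
            continue
        mean = [sum(col) / len(neighbours) for col in zip(*neighbours)]
        augmented.append([alpha * a + (1 - alpha) * m
                          for a, m in zip(frame, mean)])
    return augmented
```

Because the neighbours are chosen by feature similarity instead of by temporal position, the blended feature stays close to the frame's own content, which is one way to read "content-aware" here.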

Abstract:

The performance of video retrieval systems largely depends on annotated data, and a key challenge is to reduce reliance on expensive manual annotation while improving performance. To address this issue, a data-efficient video retrieval method based on contrastive learning was proposed, which incorporated two key optimization strategies. First, to construct more diverse and effective learning data, a content-aware feature-level data augmentation method was introduced, which used a K-nearest-neighbor algorithm based on inter-frame similarity to capture deep semantic information and reduce dependence on annotated data. Second, a long-short dynamic sampling strategy was designed: long segments and the short segments inside them were extracted from videos to construct positive sample pairs with multi-scale information for more effective contrastive learning, while the sampling lengths of the long and short segments were dynamically adjusted to improve data utilization. Experimental results on the SVD and UCF101 datasets demonstrated that the proposed method significantly outperformed existing retrieval models. Extensive ablation studies confirmed that content-aware feature-level data augmentation enhanced model adaptability, and that long-short dynamic sampling not only benefited self-supervised learning but also improved the performance of semi-supervised models.

Key words: contrastive learning, content awareness, feature enhancement, video retrieval, video representation learning
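
The long-short sampling idea described in the abstract can be illustrated with a small sketch. It assumes, purely for illustration, that the "dynamic" lengths are drawn at random from fixed ranges for each sample and that the positive pair is scored with an InfoNCE-style loss; the abstract names neither the length schedule nor the loss, and `sample_long_short`, `info_nce`, `long_range`, `short_ratio` and `temperature` are hypothetical names.

```python
import math
import random


def sample_long_short(num_frames, long_range=(16, 32),
                      short_ratio=(0.25, 0.5), rng=random):
    """Pick a long clip and a short clip nested inside it.

    Returns ((long_start, long_end), (short_start, short_end)) as half-open
    frame index ranges. Lengths are redrawn on every call (an assumption
    standing in for the paper's dynamic length adjustment).
    """
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    long_len = min(num_frames, rng.randint(*long_range))
    long_start = rng.randint(0, num_frames - long_len)
    short_len = max(1, int(long_len * rng.uniform(*short_ratio)))
    short_start = rng.randint(long_start, long_start + long_len - short_len)
    return ((long_start, long_start + long_len),
            (short_start, short_start + short_len))


def info_nce(anchor, candidates, positive_index, temperature=0.1):
    """InfoNCE loss for one anchor embedding against candidate embeddings.

    Similarity is a plain dot product, so embeddings are assumed to be
    L2-normalised beforehand.
    """
    logits = [sum(a * c for a, c in zip(anchor, cand)) / temperature
              for cand in candidates]
    top = max(logits)  # subtract the max for numerical stability
    log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_denominator - logits[positive_index]
```

In such a setup, the embedding of the long clip would act as the anchor, the embedding of its nested short clip as the positive candidate, and clips from other videos in the batch as negatives.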

CLC number: