
Journal of Graphics ›› 2025, Vol. 46 ›› Issue (3): 491-501.DOI: 10.11996/JG.j.2095-302X.2025030491

• Image Processing and Computer Vision •

Data-efficient video retrieval with contrastive learning

LING Fei1, YU Jingtao2, ZHU Zheyan1, LUO Jian1, ZHU Jixiang2, CHEN Xianke2, DONG Jianfeng2,3

  1. Department of Digital Information Technology, Zhejiang Technical Institute of Economics, Hangzhou Zhejiang 310018, China
    2. Department of Computer Science and Technology, Zhejiang Gongshang University, Hangzhou Zhejiang 310018, China
    3. Zhejiang Key Laboratory of Big Data and Future E-Commerce Technology, Hangzhou Zhejiang 310018, China
  • Received:2024-09-24 Accepted:2025-01-14 Online:2025-06-30 Published:2025-06-13
  • Contact: CHEN Xianke, DONG Jianfeng
  • About author (first author contact):

    LING Fei (1990-), lecturer, master. His main research interests cover video understanding and cross-modal retrieval. E-mail: lingfei@zjtie.edu.cn

  • Supported by:
    ‘Pioneer’ and ‘Leading Goose’ R&D Program of Zhejiang(2023C01212);Public Welfare Technology Research Project of Zhejiang Province(LGF21F020010);Fundamental Research Funds for the Provincial Universities of Zhejiang(FR2402ZD);Scientific Research Fund of Zhejiang Provincial Education Department(Y202351804)

Abstract:

The performance of video retrieval systems largely depends on annotated data, and a key challenge is to reduce reliance on expensive manual annotation while enhancing performance. To address this issue, a data-efficient video retrieval method based on contrastive learning was proposed, incorporating two key optimization strategies. First, to construct more diverse and effective learning data, a content-aware feature-level data augmentation method was introduced, which used a frame-similarity-based K-nearest neighbor algorithm to capture deep semantic information and reduce dependence on annotated data. Second, by extracting long segments and their internal short segments from videos, a long-short dynamic sampling strategy was designed to construct positive sample pairs carrying multi-scale information for more effective contrastive learning, with the sampling lengths of long and short segments adjusted dynamically to improve data utilization. Experimental results on the SVD and UCF101 datasets demonstrated that the proposed method significantly outperformed existing retrieval models. Extensive ablation studies confirmed that the content-aware feature-level data augmentation enhanced model adaptability, and that long-short dynamic sampling benefited not only self-supervised learning but also the performance of semi-supervised models.
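To make the first strategy concrete, the idea of content-aware feature-level augmentation via frame similarity can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the function name `knn_feature_augment`, the cosine-similarity neighbor search, and the linear mixing coefficient `alpha` are all assumptions chosen to show the general pattern of augmenting a frame feature with one of its K most similar frames.

```python
import numpy as np

def knn_feature_augment(features, k=3, alpha=0.5, seed=0):
    """Content-aware feature-level augmentation (illustrative sketch).

    For each frame feature, find its k most similar frames by cosine
    similarity and mix the feature with one randomly chosen neighbour,
    producing a semantically plausible augmented view without extra labels.
    """
    rng = np.random.default_rng(seed)
    # L2-normalise so dot products equal cosine similarities.
    norm = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = norm @ norm.T
    np.fill_diagonal(sim, -np.inf)  # exclude self-matches

    augmented = np.empty_like(features)
    for i in range(len(features)):
        nn_idx = np.argsort(sim[i])[-k:]  # indices of the k nearest frames
        j = rng.choice(nn_idx)            # pick one neighbour at random
        # Interpolate in feature space; the neighbour contributes (1 - alpha).
        augmented[i] = alpha * features[i] + (1 - alpha) * features[j]
    return augmented
```

Because the neighbors are retrieved by content similarity rather than random perturbation, the augmented views stay semantically close to the originals, which is what lets the method reduce its dependence on annotated data.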

Key words: contrastive learning, content awareness, feature enhancement, video retrieval, video representation learning
