图学学报 ›› 2025, Vol. 46 ›› Issue (3): 491-501. DOI: 10.11996/JG.j.2095-302X.2025030491
凌非1, 余京涛2, 朱哲燕1, 罗剑1, 朱继祥2, 陈先客2, 董建锋2,3
收稿日期: 2024-09-24
接受日期: 2025-01-14
出版日期: 2025-06-30
发布日期: 2025-06-13
通讯作者: 陈先客(1998-),男,博士研究生。主要研究方向为图形图像处理和计算机视觉。E-mail: a397283164@163.com;董建锋(1991-),男,研究员,博士。主要研究方向为多媒体分析与检索和多模态学习。E-mail: dongjf24@gmail.com
第一作者: 凌非(1990-),男,讲师,硕士。主要研究方向为视频理解和跨模态检索。E-mail: lingfei@zjtie.edu.cn
LING Fei1, YU Jingtao2, ZHU Zheyan1, LUO Jian1, ZHU Jixiang2, CHEN Xianke2, DONG Jianfeng2,3
Received: 2024-09-24
Accepted: 2025-01-14
Published: 2025-06-30
Online: 2025-06-13
First author: LING Fei (1990-), lecturer, master. His main research interests cover video understanding and cross-modal retrieval. E-mail: lingfei@zjtie.edu.cn
摘要: 视频检索系统的性能很大程度上依赖标注数据,如何在提高性能的同时减少对高昂人工标注的依赖是一个关键问题。为此,提出了一种基于对比学习的数据高效视频检索方法,包含2个关键的优化策略。首先,为构建更加多样且有效的学习数据,提出了基于内容感知的特征级数据增强,利用基于帧间相似度的K-近邻算法捕获深层语义信息,减少对标注数据的依赖。其次,设计了长-短动态采样策略,从视频中提取长片段及其内部的短片段,构造具有多尺度信息的正样本对以进行更有效的对比学习,并通过动态调整采样长度提高数据利用率。在SVD和UCF101数据集上的实验结果表明,该方法显著优于现有检索模型。大量消融实验证明:基于内容感知的特征级数据增强能提升模型适应性;长-短动态采样不仅适用于自监督学习,还能提升半监督模型的性能。
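为便于理解摘要中基于内容感知的特征级数据增强(CAFE),下面给出一个最小示意实现。代码仅依据摘要的描述(帧间相似度的K-近邻检索与特征级插值)编写,函数名cafe_augment、插值系数lam等均为示意性假设,并非论文原始实现:

```python
import torch
import torch.nn.functional as F

def cafe_augment(feats: torch.Tensor, k: int = 1, lam: float = 0.5) -> torch.Tensor:
    """基于内容感知的特征级增强的最小示意(假设性实现)。

    feats为(N, D)的帧级特征:先用帧间余弦相似度做K-近邻检索,
    再将每个特征与其近邻特征的均值线性插值,得到语义一致的增强样本。
    """
    x = F.normalize(feats, dim=1)                 # 计算余弦相似度前先做L2归一化
    sim = x @ x.t()                               # (N, N)帧间相似度矩阵
    sim.fill_diagonal_(float("-inf"))             # 排除自身,避免检索到自己
    _, idx = sim.topk(k, dim=1)                   # 每个特征的K个近邻下标
    neighbors = feats[idx].mean(dim=1)            # 近邻特征取均值,形状(N, D)
    return lam * feats + (1.0 - lam) * neighbors  # 特征级插值得到增强特征
```

表5的消融结果显示K=1时性能最佳,对应此处的近邻数k。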
凌非, 余京涛, 朱哲燕, 罗剑, 朱继祥, 陈先客, 董建锋. 基于对比学习的数据高效视频检索[J]. 图学学报, 2025, 46(3): 491-501.
LING Fei, YU Jingtao, ZHU Zheyan, LUO Jian, ZHU Jixiang, CHEN Xianke, DONG Jianfeng. Data-efficient video retrieval with contrastive learning[J]. Journal of Graphics, 2025, 46(3): 491-501.
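对应摘要中的长-短动态采样(LSDS),以下是一个最小示意:先从视频帧序列中随机截取长片段,再在其内部截取短片段,二者构成多尺度正样本对;函数名与参数均为假设,非论文原始代码:

```python
import random

def sample_long_short(num_frames: int, long_len: int, short_ratio: float = 0.5):
    """长-短动态采样的最小示意(假设性实现)。

    返回两个[start, end)形式的帧区间:先随机截取一个长片段,
    再在其内部随机截取一个短片段,作为对比学习的正样本对;
    long_len可在训练过程中动态调整,以提高数据利用率。
    """
    long_len = min(long_len, num_frames)
    long_start = random.randint(0, num_frames - long_len)
    short_len = max(1, int(long_len * short_ratio))
    short_start = random.randint(long_start, long_start + long_len - short_len)
    return (long_start, long_start + long_len), (short_start, short_start + short_len)
```

两个片段各自编码后,可送入InfoNCE一类的对比损失(参见文献[28])进行训练。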
模型 | 骨干网络 | Top-100 mAP | Top mAP |
---|---|---|---|
CNN-V[ | VGG16 | 0.251 | 0.191 |
CTE[ | VGG16 | - | 0.510 |
CNN-L[ | VGG16 | 0.610 | 0.555 |
HDML+[ | VGG16 | 0.786 | - |
ITQ+[ | VGG16 | 0.789 | - |
IsoH+[ | VGG16 | 0.790 | - |
DML[ | VGG16 | 0.813 | 0.784 |
EVR-CL (本文方法) | VGG16 | 0.836 | 0.812 |
LAMV[ | ResNet34 | - | 0.781 |
VRL[ | ResNet50 | 0.860 | - |
clip-level SVRTN[ | ResNet50 | 0.860 | - |
frame-level ViSiL[ | ResNet50 | 0.869 | - |
frame-level SVRTN[ | ResNet50 | 0.871 | - |
Barlow Twins[ | ResNet50 | 0.872 | 0.848 |
MoCo-V3[ | ResNet50 | 0.873 | 0.855 |
EVR-CL (本文方法) | ResNet50 | 0.889 | 0.867 |
表1 在SVD数据集上的半监督视频检索性能比较
Table 1 Comparison of semi-supervised video retrieval performance on the SVD dataset
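表中的Top-100 mAP通常理解为:对每个查询只取前100个检索结果计算平均精度(AP),再对全部查询取均值。下面给出按此理解的示意代码(假设性实现,不保证与论文评测脚本逐项一致):

```python
def average_precision(ranked_rels, cutoff=None):
    """单个查询的平均精度(AP)示意:ranked_rels为按相似度降序排列的
    0/1相关性序列;cutoff=100时即为Top-100 AP。"""
    if cutoff is not None:
        ranked_rels = ranked_rels[:cutoff]
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(ranked_rels, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # 每个命中位置处的精度
    return precision_sum / hits if hits else 0.0
```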
模型 | R@1 | R@5 | R@10 |
---|---|---|---|
VCOP[ | 14.10 | 30.30 | 40.40 |
VCP[ | 18.60 | 33.60 | 42.50 |
Pace Pred[ | 23.80 | 38.10 | 46.40 |
Var. PSP[ | 24.60 | 41.90 | 51.30 |
TempTrans[ | 26.40 | 48.50 | 59.40 |
Mem DPC[ | 20.20 | 40.40 | 52.40 |
MFO[ | 39.60 | 57.60 | 69.20 |
RSPNet[ | 41.10 | 59.40 | 68.40 |
MaMico[ | 42.90 | - | 70.30 |
EVR-CL(本文方法) | 50.12 | 66.45 | 73.45 |
表2 在UCF101数据集上的无监督视频检索性能比较
Table 2 Comparison of unsupervised video retrieval performance on the UCF101 dataset
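表2中的R@K通常按"前K个检索结果中是否命中同类样本"统计,再对全部查询取均值;以下为示意实现(假设性代码,非论文评测脚本):

```python
def recall_at_k(all_ranked_rels, k):
    """R@K的示意计算:all_ranked_rels为每个查询按相似度降序的
    0/1相关性序列组成的列表,返回前K个结果命中的查询比例。"""
    hits = sum(1 for rels in all_ranked_rels if any(rels[:k]))
    return hits / len(all_ranked_rels)
```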
CAFE | LSDS | Top-100 mAP | Top mAP | R@1 | R@5 | R@10 |
---|---|---|---|---|---|---|
- | - | 0.852 | 0.832 | 76.7 | 85.9 | 86.4 |
- | √ | 0.883 | 0.857 | 79.1 | 89.8 | 92.2 |
√ | - | 0.861 | 0.853 | 78.6 | 78.6 | 87.9 |
√ | √ | 0.889 | 0.867 | 82.5 | 90.3 | 92.7 |
表3 在SVD数据集上的消融实验对比
Table 3 Ablation study on the SVD dataset
数据增强方案 | Top-100 mAP | Top mAP | 训练时间增加/% |
---|---|---|---|
基准模型 | 0.852 | 0.832 | - |
Mixup | 0.876 | 0.857 | 17.40 |
CAFE | 0.878 | 0.860 | 17.52 |
表4 数据增强方案对模型检索性能的影响
Table 4 The effect of data augmentation schemes on retrieval performance
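表4中与CAFE对比的Mixup(文献[40])在特征层面的一种常见用法如下:混合系数从Beta分布采样,与随机打乱后的同批特征线性插值。此处仅作示意,alpha取值为假设,并非论文实验设置:

```python
import torch

def mixup_features(feats: torch.Tensor, alpha: float = 0.2) -> torch.Tensor:
    """特征级Mixup的最小示意(假设性实现)。"""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()  # 混合系数
    perm = torch.randperm(feats.size(0))                          # 随机配对
    return lam * feats + (1.0 - lam) * feats[perm]                # 线性插值
```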
K值 | Top-100 mAP | Top mAP |
---|---|---|
1 | 0.885 | 0.861 |
2 | 0.883 | 0.851 |
4 | 0.870 | 0.848 |
表5 CAFE中K-近邻搜索不同K值对模型性能的影响
Table 5 The impact of varying K values in K-nearest neighbors within the CAFE module
图7 CAFE模块和Mixup分别与基准模型的t-SNE可视化对比((a) CAFE与基准模型相比;(b) Mixup与基准模型相比)
Fig. 7 Comparison of t-SNE visualizations of CAFE and Mixup against the baseline model ((a) CAFE vs baseline model; (b) Mixup vs baseline model)
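图7中的t-SNE可视化可用scikit-learn复现,示意如下(feats假设为(N, D)的特征矩阵,labels为对应类别;参数为常用取值,非论文设置):

```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_tsne(feats, labels, title):
    """对特征做二维t-SNE降维并按类别着色的最小示意。"""
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(feats)
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=5, cmap="tab10")
    plt.title(title)
    plt.show()
```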
[1] | MA Z, DONG J F, JI S L, et al. Let all be whitened: multi-teacher distillation for efficient visual retrieval[C]// The 38th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2024: 4126-4135. |
[2] | NOROUZI M, FLEET D J, SALAKHUTDINOV R. Hamming distance metric learning[C]// The 26th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2012: 1061-1069. |
[3] | 毕春艳, 刘越. 基于深度学习的视频人体动作识别综述[J]. 图学学报, 2023, 44(4): 625-639. BI C Y, LIU Y. A survey of video human action recognition based on deep learning[J]. Journal of Graphics, 2023, 44(4): 625-639 (in Chinese). |
[4] | HE S F, YANG X D, JIANG C, et al. A large-scale comprehensive dataset and copy-overlap aware evaluation protocol for segment-level video copy detection[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 21086-21095. |
[5] | HE X T, PAN Y L, TANG M Q, et al. Learn from unlabeled videos for near-duplicate video retrieval[C]// The 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM, 2022: 1002-1011. |
[6] | WEI Y W, WANG X, LI Q, et al. Contrastive learning for cold-start recommendation[C]// The 29th ACM International Conference on Multimedia. New York: ACM, 2021: 5382-5390. |
[7] | HE X N, ZHANG Y, FENG F L, et al. Addressing confounding feature issue for causal recommendation[J]. ACM Transactions on Information Systems, 2023, 41(3): 53. |
[8] | REVAUD J, ALMAZÁN J, REZENDE R S, et al. Learning with average precision: training image retrieval with a listwise loss[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 5107-5116. |
[9] | SHORTEN C, KHOSHGOFTAAR T M. A survey on image data augmentation for deep learning[J]. Journal of Big Data, 2019, 6(1): 60. |
[10] | JIANG Q Y, HE Y, LI G, et al. SVD: a large-scale short video dataset for near-duplicate video retrieval[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 5281-5289. |
[11] | XU D J, XIAO J, ZHAO Z, et al. Self-supervised spatiotemporal learning via video clip order prediction[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 10334-10343. |
[12] | 冯尊登, 王洪元, 林龙, 等. 基于多分支注意网络与相似度学习策略的无监督行人重识别[J]. 图学学报, 2023, 44(2): 280-290. FENG Z D, WANG H Y, LIN L, et al. Unsupervised person re-identification with multi-branch attention network and similarity learning strategy[J]. Journal of Graphics, 2023, 44(2): 280-290 (in Chinese). |
[13] | SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[EB/OL]. [2025-01-06]. https://arxiv.org/abs/1409.1556. |
[14] | KORDOPATIS-ZILOS G, PAPADOPOULOS S, PATRAS I, et al. Near-duplicate video retrieval with deep metric learning[C]// 2017 IEEE International Conference on Computer Vision Workshops. New York: IEEE Press, 2017: 347-356. |
[15] | LUO D Z, LIU C, ZHOU Y, et al. Video cloze procedure for self-supervised spatio-temporal learning[C]// The 34th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2020: 11701-11708. |
[16] | WANG J L, JIAO J B, LIU Y H. Self-supervised video representation learning by pace prediction[C]// The 16th European Conference on Computer Vision. Cham: Springer, 2020: 504-521. |
[17] | ZHANG B W, HU H X, LEE J, et al. A hierarchical multi-modal encoder for moment localization in video corpus[EB/OL]. [2025-01-06]. https://arxiv.org/abs/2011.09046. |
[18] | CHO H, KIM T, CHANG H J, et al. Self-supervised visual learning by variable playback speeds prediction of a video[J]. IEEE Access, 2021, 9: 79562-79571. |
[19] | LI B Y, WU F, LIM S N, et al. On feature normalization and data augmentation[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 12383-12392. |
[20] | WANG Y L, HUANG G, SONG S J, et al. Regularizing deep networks with semantic data augmentation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44(7): 3733-3748. |
[21] | JENNI S, MEISHVILI G, FAVARO P. Video representation learning by recognizing temporal transformations[EB/OL]. [2025-01-06]. https://arxiv.org/abs/2007.10730. |
[22] | KORDOPATIS-ZILOS G, PAPADOPOULOS S, PATRAS I, et al. Near-duplicate video retrieval by aggregating intermediate CNN layers[C]// The 23rd International Conference on MultiMedia Modeling. Cham: Springer, 2017: 251-263. |
[23] | HAN T D, XIE W D, ZISSERMAN A. Memory-augmented dense predictive coding for video representation learning[C]// The 16th European Conference on Computer Vision. Cham: Springer, 2020: 312-329. |
[24] | WANG H Y, LI B, WU S, et al. Rethinking the learning paradigm for dynamic facial expression recognition[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 17958-17968. |
[25] | GONG Y C, LAZEBNIK S, GORDO A, et al. Iterative quantization: a procrustean approach to learning binary codes for large-scale image retrieval[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(12): 2916-2929. |
[26] | KONG W, LI W J. Isotropic hashing[C]// The 26th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2012: 1646-1654. |
[27] | BARALDI L, DOUZE M, CUCCHIARA R, et al. LAMV: learning to align and match videos with kernelized temporal layers[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 7804-7813. |
[28] | LU Y W, ZHANG G J, SUN S, et al. f-MICL: understanding and generalizing InfoNCE-based contrastive learning[EB/OL]. [2025-01-06]. https://arxiv.org/abs/2402.10150. |
[29] | HE X T, PAN Y L, TANG M Q, et al. Self-supervised video retrieval transformer network[EB/OL]. [2025-01-06]. https://arxiv.org/abs/2104.07993. |
[30] | KORDOPATIS-ZILOS G, PAPADOPOULOS S, PATRAS I, et al. ViSiL: fine-grained spatio-temporal video similarity learning[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 6351-6360. |
[31] | JO W, LIM G, LEE G, et al. VVS: video-to-video retrieval with irrelevant frame suppression[C]// The 38th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2024: 2679-2687. |
[32] | QIAN R, LI Y X, LIU H B, et al. Enhancing self-supervised video representation learning via multi-level feature optimization[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2021: 7990-8001. |
[33] | CHEN X L, XIE S N, HE K M. An empirical study of training self-supervised vision transformers[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2021: 9640-9649. |
[34] | ZBONTAR J, JING L, MISRA I, et al. Barlow twins: self-supervised learning via redundancy reduction[EB/OL]. [2024-07-23]. https://dblp.uni-trier.de/db/conf/icml/icml2021.html#ZbontarJMLD21. |
[35] | CHEN P H, HUANG D, HE D L, et al. RSPNet: relative speed perception for unsupervised video representation learning[C]// The 35th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2021: 1045-1053. |
[36] | FANG B, WU W H, LIU C, et al. MaMiCo: macro-to-micro semantic correspondence for self-supervised video representation learning[C]// The 30th ACM International Conference on Multimedia. New York: ACM, 2022: 1348-1357. |
[37] | SOOMRO K, ZAMIR A R, SHAH M. UCF101: a dataset of 101 human actions classes from videos in the wild[EB/OL]. [2025-01-06]. https://arxiv.org/abs/1212.0402. |
[38] | GHIASI G, CUI Y, SRINIVAS A, et al. Simple copy-paste is a strong data augmentation method for instance segmentation[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 2918-2928. |
[39] | BOCHKOVSKIY A, WANG C Y, LIAO H Y M. YOLOv4: optimal speed and accuracy of object detection[EB/OL]. [2025-01-06]. https://arxiv.org/abs/2004.10934. |
[40] | ZHANG H Y, CISSE M, DAUPHIN Y N, et al. Mixup: beyond empirical risk minimization[EB/OL]. [2025-01-06]. https://arxiv.org/abs/1710.09412. |
[41] | GUO P K, YANG H Y, SANO A. Empirical study of mix-based data augmentation methods in physiological time series data[C]// 2023 IEEE 11th International Conference on Healthcare Informatics. New York: IEEE Press, 2023: 206-213. |
[42] | HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 770-778. |
[43] | DING K Z, XU Z, TONG H H, et al. Data augmentation for deep graph learning: a survey[J]. ACM SIGKDD Explorations Newsletter, 2022, 24(2): 61-77. |