
Journal of Graphics ›› 2025, Vol. 46 ›› Issue (3): 491-501.DOI: 10.11996/JG.j.2095-302X.2025030491

• Image Processing and Computer Vision •

Data-efficient video retrieval with contrastive learning

LING Fei1, YU Jingtao2, ZHU Zheyan1, LUO Jian1, ZHU Jixiang2, CHEN Xianke2, DONG Jianfeng2,3

  1. Department of Digital Information Technology, Zhejiang Technical Institute of Economics, Hangzhou Zhejiang 310018, China
    2. Department of Computer Science and Technology, Zhejiang Gongshang University, Hangzhou Zhejiang 310018, China
    3. Zhejiang Key Laboratory of Big Data and Future E-Commerce Technology, Hangzhou Zhejiang 310018, China
  • Received:2024-09-24 Accepted:2025-01-14 Online:2025-06-30 Published:2025-06-13
  • Contact: CHEN Xianke, DONG Jianfeng
  • About author (first author contact):

    LING Fei (1990-), lecturer, master. His main research interests cover video understanding and cross-modal retrieval. E-mail: lingfei@zjtie.edu.cn

  • Supported by:
    ‘Pioneer’ and ‘Leading Goose’ R&D Program of Zhejiang(2023C01212);Public Welfare Technology Research Project of Zhejiang Province(LGF21F020010);Fundamental Research Funds for the Provincial Universities of Zhejiang(FR2402ZD);Scientific Research Fund of Zhejiang Provincial Education Department(Y202351804)

Abstract:

The performance of video retrieval systems largely depends on annotated data, and a key challenge is to reduce reliance on expensive manual annotation while enhancing performance. To address this issue, a data-efficient video retrieval method based on contrastive learning was proposed, incorporating two key optimization strategies. First, to construct more diverse and effective learning data, a content-aware feature-level data augmentation method was introduced, which used a frame-similarity-based K-nearest neighbor algorithm to capture deep semantic information and reduce dependence on annotated data. Second, by extracting long segments and their internal short segments from videos, a long-short dynamic sampling strategy was designed to construct positive sample pairs carrying multi-scale information for more effective contrastive learning, with the sampling lengths of long and short segments adjusted dynamically to improve data utilization. Experimental results on the SVD and UCF101 datasets demonstrated that the proposed method significantly outperformed existing retrieval models. Extensive ablation studies confirmed that the content-aware feature-level data augmentation enhanced model adaptability, and that long-short dynamic sampling benefited not only self-supervised learning but also the performance of semi-supervised models.
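To make the first strategy concrete, the idea of content-aware feature-level augmentation via frame similarity can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the function name `knn_feature_augment`, the cosine-similarity neighbor search, and the linear mixing coefficient `alpha` are all assumptions chosen to show the general pattern of augmenting a frame feature with one of its K most similar frames.

```python
import numpy as np

def knn_feature_augment(features, k=3, alpha=0.5, seed=0):
    """Content-aware feature-level augmentation (illustrative sketch).

    For each frame feature, find its k most similar frames by cosine
    similarity and mix the feature with one randomly chosen neighbour,
    producing a semantically plausible augmented view without extra labels.
    """
    rng = np.random.default_rng(seed)
    # L2-normalise so dot products equal cosine similarities.
    norm = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = norm @ norm.T
    np.fill_diagonal(sim, -np.inf)  # exclude self-matches

    augmented = np.empty_like(features)
    for i in range(len(features)):
        nn_idx = np.argsort(sim[i])[-k:]  # indices of the k nearest frames
        j = rng.choice(nn_idx)            # pick one neighbour at random
        # Interpolate in feature space; the neighbour contributes (1 - alpha).
        augmented[i] = alpha * features[i] + (1 - alpha) * features[j]
    return augmented
```

Because the neighbors are retrieved by content similarity rather than random perturbation, the augmented views stay semantically close to the originals, which is what lets the method reduce its dependence on annotated data.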

Key words: contrastive learning, content awareness, feature enhancement, video retrieval, video representation learning
