图学学报 ›› 2025, Vol. 46 ›› Issue (3): 491-501. DOI: 10.11996/JG.j.2095-302X.2025030491
凌非1, 余京涛2, 朱哲燕1, 罗剑1, 朱继祥2, 陈先客2, 董建锋2,3
收稿日期: 2024-09-24
接受日期: 2025-01-14
出版日期: 2025-06-30
发布日期: 2025-06-13
通讯作者: 陈先客(1998-),男,博士研究生。主要研究方向为图形图像处理和计算机视觉。E-mail: a397283164@163.com;董建锋(1991-),男,研究员,博士。主要研究方向为多媒体分析与检索和多模态学习。E-mail: dongjf24@gmail.com
第一作者: 凌非(1990-),男,讲师,硕士。主要研究方向为视频理解和跨模态检索。E-mail: lingfei@zjtie.edu.cn
LING Fei1, YU Jingtao2, ZHU Zheyan1, LUO Jian1, ZHU Jixiang2, CHEN Xianke2, DONG Jianfeng2,3
Received: 2024-09-24
Accepted: 2025-01-14
Published: 2025-06-30
Online: 2025-06-13
First author: LING Fei (1990-), lecturer, master. His main research interests cover video understanding and cross-modal retrieval. E-mail: lingfei@zjtie.edu.cn
摘要: 视频检索系统的性能很大程度上依赖标注数据,如何在提高性能的同时减少对高昂人工标注的依赖是一个关键问题。为此,提出了一种基于对比学习的数据高效视频检索方法,包含2个关键的优化策略。首先,为构建更加多样且有效的学习数据,提出了基于内容感知的特征级数据增强,利用基于帧间相似度的K-近邻算法捕获深层语义信息,减少对标注数据的依赖。其次,设计了长-短动态采样策略,从视频中提取长片段及其内部的短片段,构造具有多尺度信息的正样本对以进行更有效的对比学习,并通过动态调整采样长度提高数据利用率。在SVD和UCF101数据集上的实验结果表明,该方法显著优于现有检索模型。大量消融实验证明:基于内容感知的特征级数据增强能提升模型适应性;长-短动态采样不仅适用于自监督学习,还能提升半监督模型的性能。
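为便于理解摘要中基于内容感知的特征级数据增强(CAFE),下面给出一个最小示意实现。代码仅依据摘要的描述(帧间相似度的K-近邻检索与特征级插值)编写,函数名cafe_augment、插值系数lam等均为示意性假设,并非论文原始实现:

```python
import torch
import torch.nn.functional as F

def cafe_augment(feats: torch.Tensor, k: int = 1, lam: float = 0.5) -> torch.Tensor:
    """基于内容感知的特征级增强的最小示意(假设性实现)。

    feats为(N, D)的帧级特征:先用帧间余弦相似度做K-近邻检索,
    再将每个特征与其近邻特征的均值线性插值,得到语义一致的增强样本。
    """
    x = F.normalize(feats, dim=1)                 # 计算余弦相似度前先做L2归一化
    sim = x @ x.t()                               # (N, N)帧间相似度矩阵
    sim.fill_diagonal_(float("-inf"))             # 排除自身,避免检索到自己
    _, idx = sim.topk(k, dim=1)                   # 每个特征的K个近邻下标
    neighbors = feats[idx].mean(dim=1)            # 近邻特征取均值,形状(N, D)
    return lam * feats + (1.0 - lam) * neighbors  # 特征级插值得到增强特征
```

表5的消融结果显示K=1时性能最佳,对应此处的近邻数k。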
凌非, 余京涛, 朱哲燕, 罗剑, 朱继祥, 陈先客, 董建锋. 基于对比学习的数据高效视频检索[J]. 图学学报, 2025, 46(3): 491-501.
LING Fei, YU Jingtao, ZHU Zheyan, LUO Jian, ZHU Jixiang, CHEN Xianke, DONG Jianfeng. Data-efficient video retrieval with contrastive learning[J]. Journal of Graphics, 2025, 46(3): 491-501.
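对应摘要中的长-短动态采样(LSDS),以下是一个最小示意:先从视频帧序列中随机截取长片段,再在其内部截取短片段,二者构成多尺度正样本对;函数名与参数均为假设,非论文原始代码:

```python
import random

def sample_long_short(num_frames: int, long_len: int, short_ratio: float = 0.5):
    """长-短动态采样的最小示意(假设性实现)。

    返回两个[start, end)形式的帧区间:先随机截取一个长片段,
    再在其内部随机截取一个短片段,作为对比学习的正样本对;
    long_len可在训练过程中动态调整,以提高数据利用率。
    """
    long_len = min(long_len, num_frames)
    long_start = random.randint(0, num_frames - long_len)
    short_len = max(1, int(long_len * short_ratio))
    short_start = random.randint(long_start, long_start + long_len - short_len)
    return (long_start, long_start + long_len), (short_start, short_start + short_len)
```

两个片段各自编码后,可送入InfoNCE一类的对比损失(参见文献[28])进行训练。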
模型 | 骨干网络 | Top-100 mAP | Top mAP |
---|---|---|---|
CNN-V[ | VGG16 | 0.251 | 0.191 |
CTE[ | VGG16 | - | 0.510 |
CNN-L[ | VGG16 | 0.610 | 0.555 |
HDML+[ | VGG16 | 0.786 | - |
ITQ+[ | VGG16 | 0.789 | - |
IsoH+[ | VGG16 | 0.790 | - |
DML[ | VGG16 | 0.813 | 0.784 |
EVR-CL (本文方法) | VGG16 | 0.836 | 0.812 |
LAMV[ | ResNet34 | - | 0.781 |
VRL[ | ResNet50 | 0.860 | - |
clip-level SVRTN[ | ResNet50 | 0.860 | - |
frame-level ViSiL[ | ResNet50 | 0.869 | - |
frame-level SVRTN[ | ResNet50 | 0.871 | - |
Barlow Twins[ | ResNet50 | 0.872 | 0.848 |
MoCo-V3[ | ResNet50 | 0.873 | 0.855 |
EVR-CL (本文方法) | ResNet50 | 0.889 | 0.867 |
表1 在SVD数据集上的半监督视频检索性能比较
Table 1 Comparison of semi-supervised video retrieval performance on the SVD dataset
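表中的Top-100 mAP通常理解为:对每个查询只取前100个检索结果计算平均精度(AP),再对全部查询取均值。下面给出按此理解的示意代码(假设性实现,不保证与论文评测脚本逐项一致):

```python
def average_precision(ranked_rels, cutoff=None):
    """单个查询的平均精度(AP)示意:ranked_rels为按相似度降序排列的
    0/1相关性序列;cutoff=100时即为Top-100 AP。"""
    if cutoff is not None:
        ranked_rels = ranked_rels[:cutoff]
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(ranked_rels, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # 每个命中位置处的精度
    return precision_sum / hits if hits else 0.0
```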
模型 | R@1 | R@5 | R@10 |
---|---|---|---|
VCOP[ | 14.10 | 30.30 | 40.40 |
VCP[ | 18.60 | 33.60 | 42.50 |
Pace Pred[ | 23.80 | 38.10 | 46.40 |
Var. PSP[ | 24.60 | 41.90 | 51.30 |
TempTrans[ | 26.40 | 48.50 | 59.40 |
Mem DPC[ | 20.20 | 40.40 | 52.40 |
MFO[ | 39.60 | 57.60 | 69.20 |
RSPNet[ | 41.10 | 59.40 | 68.40 |
MaMico[ | 42.90 | - | 70.30 |
EVR-CL(本文方法) | 50.12 | 66.45 | 73.45 |
表2 在UCF101数据集上的无监督视频检索性能比较
Table 2 Comparison of unsupervised video retrieval performance on the UCF101 dataset
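表2中的R@K通常按"前K个检索结果中是否命中同类样本"统计,再对全部查询取均值;以下为示意实现(假设性代码,非论文评测脚本):

```python
def recall_at_k(all_ranked_rels, k):
    """R@K的示意计算:all_ranked_rels为每个查询按相似度降序的
    0/1相关性序列组成的列表,返回前K个结果命中的查询比例。"""
    hits = sum(1 for rels in all_ranked_rels if any(rels[:k]))
    return hits / len(all_ranked_rels)
```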
CAFE | LSDS | Top-100 mAP | Top mAP | R@1 | R@5 | R@10 |
---|---|---|---|---|---|---|
- | - | 0.852 | 0.832 | 76.7 | 85.9 | 86.4 |
- | √ | 0.883 | 0.857 | 79.1 | 89.8 | 92.2 |
√ | - | 0.861 | 0.853 | 78.6 | 78.6 | 87.9 |
√ | √ | 0.889 | 0.867 | 82.5 | 90.3 | 92.7 |
表3 在SVD数据集上的消融实验对比
Table 3 Ablation study on the SVD dataset
数据增强方案 | Top-100 mAP | Top mAP | 训练时间增加/% |
---|---|---|---|
基准模型 | 0.852 | 0.832 | - |
Mixup | 0.876 | 0.857 | 17.40 |
CAFE | 0.878 | 0.860 | 17.52 |
表4 数据增强方案对模型检索性能的影响
Table 4 The effect of data augmentation schemes on retrieval performance
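表4中与CAFE对比的Mixup(文献[40])在特征层面的一种常见用法如下:混合系数从Beta分布采样,与随机打乱后的同批特征线性插值。此处仅作示意,alpha取值为假设,并非论文实验设置:

```python
import torch

def mixup_features(feats: torch.Tensor, alpha: float = 0.2) -> torch.Tensor:
    """特征级Mixup的最小示意(假设性实现)。"""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()  # 混合系数
    perm = torch.randperm(feats.size(0))                          # 随机配对
    return lam * feats + (1.0 - lam) * feats[perm]                # 线性插值
```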
K值 | Top-100 mAP | Top mAP |
---|---|---|
1 | 0.885 | 0.861 |
2 | 0.883 | 0.851 |
4 | 0.870 | 0.848 |
表5 CAFE中K-近邻搜索不同K值对模型性能的影响
Table 5 The impact of varying K values in K-nearest neighbors within the CAFE module
图7 CAFE模块和Mixup分别与基准模型的t-SNE可视化对比((a) CAFE与基准模型相比;(b) Mixup与基准模型相比)
Fig. 7 Comparison of t-SNE visualizations of CAFE and Mixup against the baseline model ((a) CAFE vs baseline model; (b) Mixup vs baseline model)
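图7中的t-SNE可视化可用scikit-learn复现,示意如下(feats假设为(N, D)的特征矩阵,labels为对应类别;参数为常用取值,非论文设置):

```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_tsne(feats, labels, title):
    """对特征做二维t-SNE降维并按类别着色的最小示意。"""
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(feats)
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=5, cmap="tab10")
    plt.title(title)
    plt.show()
```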
[1] | MA Z, DONG J F, JI S L, et al. Let all be whitened: multi-teacher distillation for efficient visual retrieval[C]// The 38th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2024: 4126-4135. |
[2] | NOROUZI M, FLEET D J, SALAKHUTDINOV R. Hamming distance metric learning[C]// The 26th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2012: 1061-1069. |
[3] | 毕春艳, 刘越. 基于深度学习的视频人体动作识别综述[J]. 图学学报, 2023, 44(4): 625-639. BI C Y, LIU Y. A survey of video human action recognition based on deep learning[J]. Journal of Graphics, 2023, 44(4): 625-639 (in Chinese). |
[4] | HE S F, YANG X D, JIANG C, et al. A large-scale comprehensive dataset and copy-overlap aware evaluation protocol for segment-level video copy detection[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 21086-21095. |
[5] | HE X T, PAN Y L, TANG M Q, et al. Learn from unlabeled videos for near-duplicate video retrieval[C]// The 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM, 2022: 1002-1011. |
[6] | WEI Y W, WANG X, LI Q, et al. Contrastive learning for cold-start recommendation[C]// The 29th ACM International Conference on Multimedia. New York: ACM, 2021: 5382-5390. |
[7] | HE X N, ZHANG Y, FENG F L, et al. Addressing confounding feature issue for causal recommendation[J]. ACM Transactions on Information Systems, 2023, 41(3): 53. |
[8] | REVAUD J, ALMAZÁN J, REZENDE R S, et al. Learning with average precision: training image retrieval with a listwise loss[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 5107-5116. |
[9] | SHORTEN C, KHOSHGOFTAAR T M. A survey on image data augmentation for deep learning[J]. Journal of Big Data, 2019, 6(1): 60. |
[10] | JIANG Q Y, HE Y, LI G, et al. SVD: a large-scale short video dataset for near-duplicate video retrieval[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 5281-5289. |
[11] | XU D J, XIAO J, ZHAO Z, et al. Self-supervised spatiotemporal learning via video clip order prediction[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 10334-10343. |
[12] | 冯尊登, 王洪元, 林龙, 等. 基于多分支注意网络与相似度学习策略的无监督行人重识别[J]. 图学学报, 2023, 44(2): 280-290. FENG Z D, WANG H Y, LIN L, et al. Unsupervised person re-identification with multi-branch attention network and similarity learning strategy[J]. Journal of Graphics, 2023, 44(2): 280-290 (in Chinese). |
[13] | SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[EB/OL]. [2025-01-06]. https://arxiv.org/abs/1409.1556. |
[14] | KORDOPATIS-ZILOS G, PAPADOPOULOS S, PATRAS I, et al. Near-duplicate video retrieval with deep metric learning[C]// 2017 IEEE International Conference on Computer Vision Workshops. New York: IEEE Press, 2017: 347-356. |
[15] | LUO D Z, LIU C, ZHOU Y, et al. Video cloze procedure for self-supervised spatio-temporal learning[C]// The 34th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2020: 11701-11708. |
[16] | WANG J L, JIAO J B, LIU Y H. Self-supervised video representation learning by pace prediction[C]// The 16th European Conference on Computer Vision. Cham: Springer, 2020: 504-521. |
[17] | ZHANG B W, HU H X, LEE J, et al. A hierarchical multi-modal encoder for moment localization in video corpus[EB/OL]. [2025-01-06]. https://arxiv.org/abs/2011.09046. |
[18] | CHO H, KIM T, CHANG H J, et al. Self-supervised visual learning by variable playback speeds prediction of a video[J]. IEEE Access, 2021, 9: 79562-79571. |
[19] | LI B Y, WU F, LIM S N, et al. On feature normalization and data augmentation[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 12383-12392. |
[20] | WANG Y L, HUANG G, SONG S J, et al. Regularizing deep networks with semantic data augmentation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44(7): 3733-3748. |
[21] | JENNI S, MEISHVILI G, FAVARO P. Video representation learning by recognizing temporal transformations[EB/OL]. [2025-01-06]. https://arxiv.org/abs/2007.10730. |
[22] | KORDOPATIS-ZILOS G, PAPADOPOULOS S, PATRAS I, et al. Near-duplicate video retrieval by aggregating intermediate CNN layers[C]// The 23rd International Conference on MultiMedia Modeling. Cham: Springer, 2017: 251-263. |
[23] | HAN T D, XIE W D, ZISSERMAN A. Memory-augmented dense predictive coding for video representation learning[C]// The 16th European Conference on Computer Vision. Cham: Springer, 2020: 312-329. |
[24] | WANG H Y, LI B, WU S, et al. Rethinking the learning paradigm for dynamic facial expression recognition[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 17958-17968. |
[25] | GONG Y C, LAZEBNIK S, GORDO A, et al. Iterative quantization: a procrustean approach to learning binary codes for large-scale image retrieval[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(12): 2916-2929. |
[26] | KONG W, LI W J. Isotropic hashing[C]// The 26th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2012: 1646-1654. |
[27] | BARALDI L, DOUZE M, CUCCHIARA R, et al. LAMV: learning to align and match videos with kernelized temporal layers[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 7804-7813. |
[28] | LU Y W, ZHANG G J, SUN S, et al. f-MICL: understanding and generalizing InfoNCE-based contrastive learning[EB/OL]. [2025-01-06]. https://arxiv.org/abs/2402.10150. |
[29] | HE X T, PAN Y L, TANG M Q, et al. Self-supervised video retrieval transformer network[EB/OL]. [2025-01-06]. https://arxiv.org/abs/2104.07993. |
[30] | KORDOPATIS-ZILOS G, PAPADOPOULOS S, PATRAS I, et al. ViSiL: fine-grained spatio-temporal video similarity learning[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 6351-6360. |
[31] | JO W, LIM G, LEE G, et al. VVS: video-to-video retrieval with irrelevant frame suppression[C]// The 38th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2024: 2679-2687. |
[32] | QIAN R, LI Y X, LIU H B, et al. Enhancing self-supervised video representation learning via multi-level feature optimization[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2021: 7990-8001. |
[33] | CHEN X L, XIE S N, HE K M. An empirical study of training self-supervised vision transformers[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2021: 9640-9649. |
[34] | ZBONTAR J, JING L, MISRA I, et al. Barlow twins: self-supervised learning via redundancy reduction[EB/OL]. [2024-07-23]. https://dblp.uni-trier.de/db/conf/icml/icml2021.html#ZbontarJMLD21. |
[35] | CHEN P H, HUANG D, HE D L, et al. RSPNet: relative speed perception for unsupervised video representation learning[C]// The 35th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2021: 1045-1053. |
[36] | FANG B, WU W H, LIU C, et al. MaMiCo: macro-to-micro semantic correspondence for self-supervised video representation learning[C]// The 30th ACM International Conference on Multimedia. New York: ACM, 2022: 1348-1357. |
[37] | SOOMRO K, ZAMIR A R, SHAH M. UCF101: a dataset of 101 human actions classes from videos in the wild[EB/OL]. [2025-01-06]. https://arxiv.org/abs/1212.0402. |
[38] | GHIASI G, CUI Y, SRINIVAS A, et al. Simple copy-paste is a strong data augmentation method for instance segmentation[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 2918-2928. |
[39] | BOCHKOVSKIY A, WANG C Y, LIAO H Y M. YOLOv4: optimal speed and accuracy of object detection[EB/OL]. [2025-01-06]. https://arxiv.org/abs/2004.10934. |
[40] | ZHANG H Y, CISSE M, DAUPHIN Y N, et al. Mixup: beyond empirical risk minimization[EB/OL]. [2025-01-06]. https://arxiv.org/abs/1710.09412. |
[41] | GUO P K, YANG H Y, SANO A. Empirical study of mix-based data augmentation methods in physiological time series data[C]// 2023 IEEE 11th International Conference on Healthcare Informatics. New York: IEEE Press, 2023: 206-213. |
[42] | HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 770-778. |
[43] | DING K Z, XU Z, TONG H H, et al. Data augmentation for deep graph learning: a survey[J]. ACM SIGKDD Explorations Newsletter, 2022, 24(2): 61-77. |