Self-supervised active label cleaning

Journal of Graphics ›› 2024, Vol. 45 ›› Issue (3): 495-504. DOI: 10.11996/JG.j.2095-302X.2024030495

• Image Processing and Computer Vision •

LIN Xiao¹,²,³, ZHANG Qiuyang¹, ZHENG Xiaomei¹,², YANG Qizhe¹

1. The College of Information, Mechanical and Electrical Engineering, Shanghai Normal University, Shanghai 200234, China
2. Shanghai Engineering Research Center of Intelligent Education and Big Data, Shanghai Normal University, Shanghai 200234, China
3. The Research Base of Online Education for Shanghai Middle and Primary Schools, Shanghai 200234, China

Received: 2023-07-21 Accepted: 2023-11-22 Published: 2024-06-30 Online: 2024-06-11
Corresponding author: YANG Qizhe (1994-), male, lecturer, Ph.D. His main research interest covers artificial intelligence. E-mail: qzyang@shnu.edu.cn
First author: LIN Xiao (1978-), female, professor, Ph.D. Her main research interest covers image processing. E-mail: lin6008@shnu.edu.cn
Supported by: Shanghai Municipal Special Project for Promoting High-Quality Development of Industries (2211106)

Abstract:

Active label cleaning applies active learning to label noise processing in order to lower the cost of manual annotation. Existing active label cleaning methods nevertheless incur a high cost of extra manual annotation, because a large proportion of the samples they flag as suspicious are in fact correctly labeled. To alleviate this problem, a self-supervised active label cleaning method based on core-sets is proposed. First, a self-supervised task is used for representation learning, and the data are mapped into the feature space. Suspicious samples are then selected with a greedy K-Center set-covering method, and finally the samples with noisy labels are screened out by uncertainty for re-labeling. Because the method considers both the representativeness and the uncertainty of samples, it effectively lowers the proportion of correct samples among the suspicious ones. Experimental results on public datasets with varying proportions of label noise show that the method markedly reduces the cost of extra manual annotation in every iteration round, while also mitigating the cold-start problem to some extent. In addition, ablation experiments validate the effectiveness of the self-supervised core-set sampling module and the uncertainty prediction module.

Key words: active learning, self-supervised learning, label noise, label cleaning, cost of extra manual annotation
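The two selection steps named in the abstract, greedy K-Center covering in feature space and an uncertainty score, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. It assumes the feature vectors have already been produced by some encoder, uses Euclidean distance, and uses predictive entropy as the uncertainty score. This page does not specify any of these choices, and the names `greedy_k_center`, `predictive_entropy` and `seed_index` are hypothetical.

```python
import numpy as np


def greedy_k_center(features, k, seed_index=0):
    """Pick k indices by farthest-first traversal (greedy K-Center).

    Starting from `seed_index` (an assumed starting point), repeatedly add
    the point whose distance to its nearest chosen center is largest.
    """
    features = np.asarray(features, dtype=float)
    n = features.shape[0]
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the number of samples")
    centers = [seed_index]
    # Distance from every point to its nearest chosen center so far.
    min_dist = np.linalg.norm(features - features[seed_index], axis=1)
    while len(centers) < k:
        nxt = int(np.argmax(min_dist))
        centers.append(nxt)
        dist_to_new = np.linalg.norm(features - features[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return centers


def predictive_entropy(probs, eps=1e-12):
    """Shannon entropy of each row of class probabilities.

    Assumed uncertainty score; the paper's own measure may differ.
    """
    probs = np.asarray(probs, dtype=float)
    return -np.sum(probs * np.log(probs + eps), axis=1)
```

On five 2-D points forming two tight pairs and one isolated point, `greedy_k_center(points, k=3)` returns one index from each group, which is the covering behaviour the core-set idea relies on. Candidates could then be ranked by `predictive_entropy` of a classifier's softmax outputs, again as an assumption rather than the published procedure.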

CLC number: TP391

LIN Xiao, ZHANG Qiuyang, ZHENG Xiaomei, YANG Qizhe. Self-supervised active label cleaning[J]. Journal of Graphics, 2024, 45(3): 495-504.

Figures and Tables (9)

Fig. 1 Traditional active learning framework and active label cleaning framework

Fig. 2 Overall structure diagram of self-supervised active label cleaning

Fig. 3 Cumulative class distribution heat map of the data extracted in each cycle on the CIFAR10N (50% label noise) dataset ((a) Random sampling; (b) Uniform sampling; (c) Self-supervised core-set sampling)

Fig. 4 Label cleaning performance under different self-supervised pretext tasks on the CIFAR10N (50% label noise) dataset

Fig. 5 Label cleaning performance on the CIFAR10N (50% label noise) dataset ((a) ResNet-18; (b) VGG19)

Fig. 6 Cumulative extra manual annotation cost on the CIFAR10N (20%, 40%, 50% label noise) dataset

Table 1 Final label cleaning rates on the CIFAR10N and Fashion MNIST_N datasets

Dataset	Method	Noise ratio/%	K	Label accuracy/%
CIFAR10N	Ours	20	1 000	96.60
CIFAR10N	ALC	20	1 000	95.84
CIFAR10N	Bernhardt	20	1 000	96.32
CIFAR10N	Ours	50	2 000	98.90
CIFAR10N	ALC	50	2 000	98.87
CIFAR10N	Bernhardt	50	2 000	98.85
Fashion MNIST_N	Ours	20	1 000	96.01
Fashion MNIST_N	ALC	20	1 000	95.98
Fashion MNIST_N	Bernhardt	20	1 000	95.64
Fashion MNIST_N	Ours	50	2 000	98.90
Fashion MNIST_N	ALC	50	2 000	98.32
Fashion MNIST_N	Bernhardt	50	2 000	98.82
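The two quantities reported here, the label accuracy in Table 1 and the extra manual annotation cost in Fig. 6, can be illustrated with a small sketch. It assumes that label accuracy means the percentage of samples whose current label matches the ground truth, and that extra annotation cost counts the selected suspicious samples whose labels were already correct, as the abstract describes. The function names and the toy data are hypothetical; the paper's exact definitions may differ.

```python
import numpy as np


def label_accuracy(current_labels, true_labels):
    """Percentage of samples whose current label matches the ground truth."""
    current_labels = np.asarray(current_labels)
    true_labels = np.asarray(true_labels)
    return 100.0 * float(np.mean(current_labels == true_labels))


def extra_annotation_cost(selected_indices, noisy_labels, true_labels):
    """Count selected samples whose label was already correct.

    Each such sample is sent to an annotator without improving the
    dataset, so it is counted as extra annotation effort.
    """
    selected = np.asarray(selected_indices, dtype=int)
    noisy = np.asarray(noisy_labels)[selected]
    true = np.asarray(true_labels)[selected]
    return int(np.sum(noisy == true))


# Toy data (made up for illustration): samples 2 and 3 carry wrong labels.
true_labels = [0, 1, 2, 3, 4]
noisy_labels = [0, 1, 0, 0, 4]
selected = [1, 2, 3]  # samples flagged as suspicious

cost = extra_annotation_cost(selected, noisy_labels, true_labels)

# Re-labeling the selected samples with the annotator's (true) labels.
cleaned = np.array(noisy_labels)
cleaned[selected] = np.array(true_labels)[selected]
```

In this toy case one of the three flagged samples was already correct, so the extra cost is 1, and the label accuracy rises from 60% to 100% after re-labeling.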

Fig. 7 Some of the samples drawn on CIFAR10N (50% label noise), visualized ((a) Clear samples that are obviously mislabeled; (b)~(c) Blurry samples that are hard to predict and mislabeled)

Fig. 8 Precision comparison results on the CIFAR10N (50% label noise) dataset

References (23)

[1]	HAN B, YAO Q M, YU X R, et al. Co-teaching: robust training of deep neural networks with extremely noisy labels[C]// The 32nd International Conference on Neural Information Processing Systems. New York: ACM, 2018: 8536-8546.
[2]	DENG J, DONG W, SOCHER R, et al. ImageNet: a large-scale hierarchical image database[C]// 2009 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2009: 248-255.
[3]	ZHU X Q, WU X D. Class noise vs. attribute noise: a quantitative study[J]. Artificial Intelligence Review, 2004, 22(3): 177-210.
[4]	KIM Y, YIM J, YUN J, et al. NLNL: negative learning for noisy labels[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 101-110.
[5]	TONEVA M, SORDONI A, DES COMBES R T, et al. An empirical study of example forgetting during deep neural network learning[EB/OL]. (2019-11-15) [2023-03-20]. https://arxiv.org/abs/1812.05159.pdf.
[6]	HUANG J C, QU L, JIA R F, et al. O₂U-net: a simple noisy label detection approach for deep neural networks[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 3326-3334.
[7]	REBBAPRAGADA U D. Strategic targeting of outliers for expert review[EB/OL]. (2010-06-01) [2023-03-20]. blob: https://www.proquest.com/c4d84676-07c3-4eb0-84a0-89e32ab6bcbe.
[8]	EKAMBARAM R, FEFILATYEV S, SHREVE M, et al. Active cleaning of label noise[J]. Pattern Recognition, 2016, 51: 463-480.
[9]	BERNHARDT M, CASTRO D C, TANNO R, et al. Active label cleaning for improved dataset quality under resource constraints[J]. Nature Communications, 2022, 13(1): 1161.
[10]	JAISWAL A, BABU A R, ZADEH M Z, et al. A survey on contrastive self-supervised learning[EB/OL]. (2021-02-07) [2023-03-20]. http://arxiv.org/abs/2011.00362.pdf.
[11]	SENER O, SAVARESE S. Active learning for convolutional neural networks: a core-set approach[EB/OL]. (2018-06-01) [2023-03-20]. http://arxiv.org/abs/1708.00489.pdf.
[12]	HOULSBY N, HUSZÁR F, GHAHRAMANI Z, et al. Bayesian active learning for classification and preference learning[EB/OL]. (2011-12-24) [2023-03-20]. http://arxiv.org/abs/1112.5745.pdf.
[13]	GAL Y, GHAHRAMANI Z. Dropout as a Bayesian approximation: representing model uncertainty in deep learning[C]// The 33rd International Conference on Machine Learning - Volume 48. New York: ACM, 2016: 1050-1059.
[14]	KIRSCH A, VAN AMERSFOORT J, GAL Y. BatchBALD: efficient and diverse batch acquisition for deep Bayesian active learning[EB/OL]. (2019-10-28) [2023-03-20]. http://arxiv.org/abs/1906.08158.pdf.
[15]	BORSOS Z, MUTNÝ M, KRAUSE A. Coresets via bilevel optimization for continual learning and streaming[EB/OL]. (2020-10-22) [2023-03-20]. http://arxiv.org/abs/2006.03875.pdf.
[16]	PINSLER R, GORDON J, NALISNICK E, et al. Bayesian batch active learning as sparse subset approximation[EB/OL]. (2021-02-08) [2023-03-20]. http://arxiv.org/abs/1908.02144.pdf.
[17]	DOERSCH C, GUPTA A, EFROS A A. Unsupervised visual representation learning by context prediction[C]// 2015 IEEE International Conference on Computer Vision. New York: IEEE Press, 2015: 1422-1430.
[18]	ZHANG R, ISOLA P, EFROS A A. Colorful image colorization[C]// European Conference on Computer Vision. Cham: Springer, 2016: 649-666.
[19]	PATHAK D, KRÄHENBÜHL P, DONAHUE J, et al. Context encoders: feature learning by inpainting[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 2536-2544.
[20]	GIDARIS S, SINGH P, KOMODAKIS N. Unsupervised representation learning by predicting image rotations[EB/OL]. (2018-03-21) [2023-03-20]. http://arxiv.org/abs/1803.07728.pdf.
[21]	ALGAN G, ULUSOY I. Image classification with deep learning in the presence of noisy labels: a survey[J]. Knowledge-Based Systems, 2021, 215: 106771.
[22]	MANWANI N, SASTRY P S. Noise tolerance under risk minimization[J]. IEEE Transactions on Cybernetics, 2013, 43(3): 1146-1151.
[23]	KRIZHEVSKY A, HINTON G. Learning multiple layers of features from tiny images[EB/OL]. (2009-04-08) [2023-03-20]. https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf.

