
Journal of Graphics ›› 2024, Vol. 45 ›› Issue (3): 495-504. DOI: 10.11996/JG.j.2095-302X.2024030495

• Image Processing and Computer Vision •

Self-supervised active label cleaning

LIN Xiao1,2,3, ZHANG Qiuyang1, ZHENG Xiaomei1,2, YANG Qizhe1

  1. The College of Information, Mechanical and Electrical Engineering, Shanghai Normal University, Shanghai 200234, China
  2. Shanghai Engineering Research Center of Intelligent Education and Big Data, Shanghai Normal University, Shanghai 200234, China
  3. The Research Base of Online Education for Shanghai Middle and Primary Schools, Shanghai 200234, China
  • Received: 2023-07-21 Accepted: 2023-11-22 Published: 2024-06-30 Online: 2024-06-11
  • Corresponding author: YANG Qizhe (1994-), male, lecturer, Ph.D. His main research interest covers artificial intelligence. E-mail: qzyang@shnu.edu.cn
  • First author: LIN Xiao (1978-), female, professor, Ph.D. Her main research interest covers image processing. E-mail: lin6008@shnu.edu.cn
  • Supported by:
    Shanghai Municipal Special Project for Promoting High-Quality Development of Industries (2211106)

Abstract:

Active label cleaning applies active learning to label noise processing in order to lower the cost of manual annotation. Existing active label cleaning methods still incur a high cost of extra manual annotation: a large proportion of the samples they flag as suspicious are in fact correctly labeled. To alleviate this problem, a self-supervised active label cleaning method based on core-sets is proposed. First, self-supervised tasks are used for representation learning, and the data are mapped into a feature space. Suspicious samples are then selected with a greedy K-Center set covering method, and finally label noise samples are filtered from them by uncertainty and sent for re-labeling. Because the method considers both the representativeness and the uncertainty of samples, it can effectively lower the proportion of correctly labeled samples among the suspicious ones. Experimental results on public datasets with varying proportions of label noise show that the proposed method clearly reduces the cost of extra manual annotation in each iteration, while also mitigating the cold start problem to some extent. In addition, ablation experiments validate the effectiveness of the self-supervised core-set sampling module and the uncertainty prediction module.

Key words: active learning, self-supervised learning, label noise, label cleaning, cost of extra manual annotation
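
The selection pipeline summarized in the abstract (greedy K-Center selection in feature space, followed by uncertainty-based filtering) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the Euclidean distance, the fixed starting point, and the use of predictive entropy as the uncertainty score are all assumptions made here for illustration; the abstract does not specify them. The feature matrix is assumed to come from some self-supervised encoder, and the class probabilities from some classifier.

```python
import numpy as np


def greedy_k_center(features, k, first=0):
    """Greedy K-Center: repeatedly add the point farthest from the chosen centers.

    `first` (the starting center) is a hypothetical parameter for this sketch.
    """
    n = features.shape[0]
    if not 0 < k <= n:
        raise ValueError("k must satisfy 1 <= k <= number of samples")
    selected = [first]
    # Distance from every point to its nearest selected center so far.
    min_dist = np.linalg.norm(features - features[first], axis=1)
    while len(selected) < k:
        nxt = int(np.argmax(min_dist))  # farthest point from current centers
        selected.append(nxt)
        dist = np.linalg.norm(features - features[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist)
    return selected


def predictive_entropy(probs, eps=1e-12):
    """Entropy of each row of class probabilities (assumed uncertainty score)."""
    return -np.sum(probs * np.log(probs + eps), axis=1)


def pick_for_relabeling(features, probs, k, m):
    """Select k suspicious samples by K-Center, then keep the m most uncertain."""
    candidates = greedy_k_center(features, k)
    entropy = predictive_entropy(probs[candidates])
    order = np.argsort(-entropy)[:m]  # highest entropy first
    return [candidates[i] for i in order]
```

Here `k` controls how many suspicious samples cover the feature space, and `m` controls how many of them are sent to the annotator. Filtering by uncertainty after the coverage step is what is meant to keep correctly labeled samples out of the re-labeling queue, which is the cost the abstract targets.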
