1. Introduction
Deep neural networks (DNNs) have achieved remarkable success on various computer vision tasks, such as image classification, segmentation, and object detection [7]. The most widely used paradigm for DNN training is end-to-end supervised learning, whose performance largely relies on massive amounts of high-quality annotated data. However, collecting large-scale datasets with fully precise annotations (i.e., clean labels) is usually expensive and time-consuming, and sometimes even impossible. Noisy labels, which are systematically corrupted from ground-truth labels, are ubiquitous in many real-world applications, such as online queries [19], crowdsourcing [1], adversarial attacks [30], and medical image analysis [12]. On the other hand, it is well known that over-parameterized neural networks have enough capacity to memorize large-scale data even with completely random labels, leading to poor generalization [28], [1], [11]. Therefore, robust learning with noisy labels has become an important and challenging task in computer vision [25], [8], [12], [31].