I. Introduction
Machine learning models are widely used in real-world applications such as machine translation [1], image recognition [2], and recommender systems [3]. In practice, a model is typically trained offline and then deployed to serve the samples that arrive during the testing period. The deployed model thus makes predictions for all testing samples indiscriminately, even though the samples can differ considerably. For instance, for some samples [see Fig. 1(b)] it is hard to make a confident prediction. This behavior differs from that of human students during testing (e.g., an examination), who double-check their answers to hard questions. Lacking such a double check, current models suffer sharp performance drops on low-confidence samples [4], [5].
Fig. 1. Examples of (a) a normal sample and (b) a hard sample in the dog class, with the corresponding model predictions.