I. Introduction
Modern visual recognition models, built on complex deep neural networks [1], [2], often overfit the training data, which leads to overconfident and unreliable predicted scores [3], [4], [5]. This problem severely limits their application in high-risk scenarios such as self-driving [6], [7] and medical diagnosis [8], [9]. To address it, numerous works [3], [10], [11] have been proposed for confidence calibration, which aims to produce predicted confidence scores that accurately reflect the probability of correctness. Despite impressive progress, these efforts predominantly concentrate on single-label settings, where each image is associated with a single category. They can hardly be applied to multi-label settings, which better reflect real-world conditions where images often contain objects from multiple categories [12], [13], [14]. Our work targets the multi-label confidence calibration (MLCC) task, seeking to extend and enhance calibration techniques for these more complex and practical scenarios.