I. Introduction
Instance segmentation aims to predict a pixel-wise mask for each object of interest, and it plays an important role in applications such as autonomous driving, robotic manipulation, image editing, and cell segmentation. Benefiting from the strong learning capacity of advanced CNN [1], [2], [3] and transformer [4], [5], [6] architectures, instance segmentation has achieved remarkable progress in recent years. However, many existing instance segmentation models [7], [8], [9], [10], [11], [12], [13] are trained in a fully supervised manner, which depends heavily on pixel-wise instance mask annotations and incurs expensive and tedious labeling costs.