I. Introduction
Models based on convolutional neural networks have become popular for boundary detection and semantic segmentation in computer vision. One of their main shortcomings, however, is that they require a large number of manually annotated images. Pixel-level annotation for tasks such as semantic labeling and instance segmentation is labor-intensive and inefficient. To reduce annotation cost, bounding-box-level and image-level labeling have been proposed as cheaper and easier alternatives to pixel-level masks. However, existing techniques that train from bounding-box labels produce noisy segmentation results and generalize poorly to unseen categories. Figure 1 shows a pixel-level segmentation mask and a rectangle mask; Figure 1(b) can be regarded as a particular kind of label for weakly supervised learning.