I. Introduction
Object semantic segmentation (OSS) aims at predicting the class label of each pixel. Deep neural networks have achieved tremendous success on the OSS tasks, such as U-net [1], FCN [2], and Mask R-CNN [3]. However, these algorithms trained with full annotations require many investments to expensive labeling tasks. To reduce the budget, a promising alternative approach is to apply weak annotations for learning a decent network of segmentation. For example, previous works have implemented image-level labels [4]–[6]; scribbles [7]–[9]; bounding boxes [10], [11]; and points [12]–[14] as cheaper supervision information whereas the main disadvantage of these weakly supervised methods is the lack of the ability for generalizing the learned models to unseen classes. For instance, if a network is trained to segment dogs using thousands of images containing various breeds of dogs, it will not be able to segment bikes without retraining the network using many images containing bikes.