1. Introduction
Benefiting from large-scale pixel-level training data and advanced convolutional neural network (CNN) architectures, fully-supervised semantic segmentation approaches, such as [4, 20, 22, 42, 38], have made great progress recently. However, constructing a large-scale pixel-accurate dataset is fairly expensive and requires considerable human effort and time. To reduce the labeling cost, researchers have proposed to learn semantic segmentation from weak supervision, such as bounding boxes [27], points [2], and even image-level annotations [26]. Among these forms of weak supervision, image-level annotations are the easiest to obtain. Thus, in this paper, we focus on semantic segmentation under image-level supervision.
Figure: Observation of our proposed approach. (a) Source images; (b-d) intermediate attention maps produced by a classification network at different training stages; (e) cumulative attention maps produced by combining the attention maps in (b), (c), and (d) through a simple element-wise maximum operation. It can be easily observed that the discriminative regions continuously shift across different parts of the semantic objects. Compared to (b), (c), and (d), the fused attention maps in (e) cover most of the semantic regions. Best viewed in color.
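The element-wise maximum fusion described in the caption can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the array shapes and the attention maps themselves are placeholders, assuming each map is a single-channel response normalized to [0, 1].

```python
import numpy as np

# Placeholder attention maps from three training stages (H x W, values in [0, 1]);
# in practice these would come from the classification network's response maps.
rng = np.random.default_rng(0)
stage_maps = [rng.random((4, 4)) for _ in range(3)]

# Cumulative attention map: element-wise maximum across stages, so a pixel
# stays highlighted if it was discriminative at any stage of training.
cumulative = np.maximum.reduce(stage_maps)
```

By construction, the fused map dominates every individual stage map at every pixel, which is why it covers more of the object than any single snapshot.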