1. Introduction
Semantic segmentation is a fundamental problem in visual recognition systems. Regardless of tremendous progress for static image segmentation [3], [6], [8], [56], [63], video semantic segmentation (VSS) remains as a challenging problem mainly because of two reasons. First of all, processing a sheer amount of data in real time becomes non-trivial for some practical applications due to resource constraints. Secondly and more importantly, the segmentation predictions need to be temporally consistent in order to avoid the so-called "flickering" problem [31],[41].