1. Introduction
There are two main challenges in semantic video segmentation: accuracy and temporal consistency. While the accuracy challenge is essentially shared with the corresponding task for still images, temporal consistency/stability is a challenge unique to video footage. These two challenges are interconnected; for example, perfectly accurate segmentation masks are, by definition, also perfectly consistent. The opposite, however, doesn’t always hold; e.g. a sequence of empty masks is perfectly consistent, but usually far from accurate.