1. Introduction
This paper addresses action segmentation, that is, labeling every video frame with an action class, under set-level weak supervision in training. Set-supervised training means that the ground truth specifies only the set of actions present in a training video; their temporal ordering and the number of their occurrences remain unknown. This is an important problem arising from the proliferation of large video datasets, for which providing detailed annotations of the temporal ordering of actions is prohibitively expensive. One example application is action segmentation of videos that have been retrieved from a dataset based on word captions [6], [20], where the captions do not describe temporal relationships of actions.
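To make the supervision setting concrete, the following minimal sketch contrasts frame-level ground truth with the set-level label available in training. The action names, the two toy videos, and the helper `set_level_label` are hypothetical and serve only as illustration; they are not taken from any dataset or from our method.

```python
from typing import List, Set


def set_level_label(frame_labels: List[str]) -> Set[str]:
    """Collapse per-frame action labels into the unordered set of actions present.

    This discards temporal ordering, segment durations, and the number of
    times each action occurs, which is all that set supervision retains.
    """
    return set(frame_labels)


# Hypothetical frame-level ground truth for two videos (illustrative only).
video_a = ["pour", "pour", "stir", "stir", "pour"]
video_b = ["stir", "pour", "pour", "pour", "stir", "stir"]

# The two videos differ in ordering, segment lengths, and occurrence counts,
# yet they receive the same set-level label.
label_a = set_level_label(video_a)
label_b = set_level_label(video_b)
```

In this toy case both videos map to the same set of two actions, so the training signal alone cannot distinguish their frame-level segmentations.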