1. Introduction
One of the challenges in human-machine interaction is the automatic, vision-based understanding of human actions in instructional videos. These videos depict a series of low-level actions that collectively accomplish a high-level task, such as preparing a meal or assembling an object. However, labeling each frame of these videos is arduous, requiring significant manual effort to annotate the start and end times of every action segment. Consequently, there has been a surge of research interest in weakly-supervised methods for learning actions. In particular, such methods address the setting of weakly-labeled instructional videos, where only the ordered sequence of action labels (the transcript) is provided, without any information about the duration of each action.
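To make the supervision gap concrete, the following minimal Python sketch (ours, not the paper's; all action names and frame counts are hypothetical) contrasts fully supervised frame-level labels with the transcript-only weak labels described above. A frame labeling is consistent with a transcript exactly when collapsing consecutive repeated labels reproduces the transcript.

```python
# Fully supervised labels: one action label per frame (expensive to annotate).
# Action names and durations below are purely illustrative.
frame_labels = (
    ["pour_water"] * 120   # frames 0-119
    + ["stir"] * 80        # frames 120-199
    + ["pour_milk"] * 100  # frames 200-299
)

# Weak supervision: only the ordered transcript of actions is given;
# the per-action durations (120, 80, 100 frames) are unknown at training time.
transcript = ["pour_water", "stir", "pour_milk"]

def is_consistent(frame_labels, transcript):
    """Return True if the frame-level labeling realizes the transcript,
    i.e., collapsing consecutive repeated labels yields the transcript."""
    collapsed = [frame_labels[0]]
    for label in frame_labels[1:]:
        if label != collapsed[-1]:
            collapsed.append(label)
    return collapsed == transcript

# The dense labeling above is one of many segmentations consistent with
# this transcript; weakly-supervised methods must infer the boundaries.
assert is_consistent(frame_labels, transcript)
```

Many different frame-level segmentations satisfy the same transcript, which is precisely why inferring action durations is the core difficulty in this setting.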