I. Introduction
Automatically predicting and executing a sequence of actions given a specific task is an ability that is quite expected for intelligent robots [1], [2]. For example, to complete the task of “make tea” under the scene shown in Fig. 1, an agent needs to plan and successively execute a number of steps, e.g., “move to the tea box” and “grasp the tea box”. In this paper, we aim to train a neural network model to enable this capability, which has rarely been addressed in computer vision research.