I. Introduction
Imitation learning (IL) empowers agents to learn from expert data instead of requiring an explicitly designed reward function [1], and has achieved remarkable successes in graphics [2], online games [3], and robotic manipulation [4]. Expert data in IL can be divided into two categories [5], [6]: demonstrations and observations. Demonstrations contain both the states and actions from experts' experiences, whereas observations consist of states only. In real-world applications, the state is the proprioceptive state of an expert, which can be hard to access and record. By contrast, intelligent creatures grasp knowledge or skills by observing how their peers accomplish tasks, without knowing those peers' proprioceptive states [7]. In other words, intelligent creatures generally learn from visual inputs rather than state inputs. This learning scheme is more practical, but it has been less studied in the IL community.