I. Introduction
An ongoing trend in robotics research is the development of robots that can jointly understand human intention and action and execute manipulation tasks in human domains. A key component of such robots is a knowledge representation that allows a robot to understand its actions in a way that mirrors how humans communicate about action [1]. Inspired by the theory of affordance [2] and prior work on joint object-action representation [3], the functional object-oriented network (FOON) was introduced as a knowledge graph representation for service robots [4], [5]. A FOON describes object-oriented manipulation actions through its nodes and edges and aims to serve as a high-level planning abstraction closer to human language and understanding. FOONs can be automatically created from video demonstrations [6], and a set of FOONs can be merged into a single network from which knowledge can be quickly retrieved as plan sequences called task trees [4].
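To make the merging and task-tree retrieval steps concrete, the following is a minimal sketch, not the FOON authors' implementation: the class layout, node names ("sliced tomato", "salad", etc.), and the naive backward search are assumptions made purely for illustration.

```python
# Illustrative sketch only: names and retrieval logic are assumptions,
# not the FOON authors' actual data structures or algorithms.
from collections import namedtuple

# A functional unit pairs input/output object states with one motion node.
FunctionalUnit = namedtuple("FunctionalUnit", ["inputs", "motion", "outputs"])

class FOON:
    def __init__(self, units=()):
        self.units = list(units)

    def merge(self, other):
        """Union of functional units, skipping exact duplicates."""
        merged = FOON(self.units)
        for u in other.units:
            if u not in merged.units:
                merged.units.append(u)
        return merged

    def task_tree(self, goal, available):
        """Backward search: find a unit producing the goal, then
        recursively satisfy any input not already available."""
        for u in self.units:
            if goal in u.outputs:
                plan = []
                for ingredient in u.inputs:
                    if ingredient not in available:
                        plan += self.task_tree(ingredient, available)
                return plan + [u]
        return []

# Two small demonstration graphs that share an object ("sliced tomato").
slice_unit = FunctionalUnit(("tomato", "knife"), "slice", ("sliced tomato",))
mix_unit = FunctionalUnit(("sliced tomato", "lettuce", "bowl"), "mix", ("salad",))
g1, g2 = FOON([slice_unit]), FOON([mix_unit])

# Merging yields a single network; retrieval yields an ordered task tree.
universal = g1.merge(g2)
plan = universal.task_tree("salad",
                           available={"tomato", "knife", "lettuce", "bowl"})
print([u.motion for u in plan])  # → ['slice', 'mix']
```

The sketch captures the key property the text describes: knowledge contributed by separate demonstrations (slicing, mixing) becomes jointly retrievable only after the networks are merged, since the mixing unit depends on an object state produced by the slicing unit.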