I. Introduction
As human beings, we discriminate among concepts by forming “iconic representations” of them. Judgments of resemblance or difference between concepts are based on comparisons of these iconic representations [1]. We also interpret the meaning of concepts through language. In this sense, we ground the meaning of language in its perceptual context. For a robot to be more like a human, it must recognize the sound patterns of words and understand their meanings. It must ground language in its world as mediated by its perceptual, motor, and cognitive capacities. In such a scenario, the robot must analyze the current scene along with the associated utterance, integrate the extracted information, and finally acquire the meanings of the words in that utterance.