I. Introduction
Referring expressions are commonly used when people talking with each other specifying some particular objects in the scene. For example, “the cup next to the computer”, “the brown bag on the chair”, etc. By understanding the referring expression, the target object can be localized in the scene given natural language description. Different from traditional visual perception tasks which have predefined object labels, the referring expression task is faced with more complex language and visual semantics making it a more challenging task. Currently, the referring expression task has aroused the attention from both the computer vision and natural language communities. And various methods and datasets for referring expression tasks are proposed [1]–[4].