1. Introduction
Referring image segmentation is a challenging problem at the intersection of computer vision and natural language processing. Given an image and a natural language expression, the goal is to produce a segmentation mask in the image corresponding to the entities referred to by the expression (see Fig. 4 for some examples). It is worth noting that the referring expression is not limited to specifying object categories (e.g. "person", "cat"). It can be any free-form language description, which may contain appearance attributes (e.g. "red", "long"), actions (e.g. "standing", "hold"), relative relationships (e.g. "left", "above"), etc. Referring image segmentation can potentially be used in a wide range of applications, such as interactive photo editing and human-robot interaction.