1. Introduction
Referring Expression Segmentation (RES) is one of the most important tasks in multi-modal information processing. Given an image and a natural language expression that describes an object in the image, RES aims to locate the target object and generate a segmentation mask for it. It has great potential in applications such as video production, human-machine interaction, and robotics. Most existing methods follow the RES rules defined by the popular datasets ReferIt [20] and RefCOCO [34], [47], and have made considerable progress in recent years.