1. Introduction
Visual grounding is a fundamental task at the intersection of computer vision and natural language processing, where the goal is to find correspondences between image content and textual descriptions. It has broad applications in image captioning [18], [53], visual question answering [43], [66], vision-language navigation [11], and beyond. Collecting detailed grounding annotations to train specialist models, however, is cumbersome. Zero-shot visual grounding [30], [38], [54] is therefore an attractive alternative.
Figure: Illustration of how we disambiguate visual entities based on their interactions with other entities. The same entity or relationship in the image and the caption is shown in the same color.