1. Introduction
Vision-Language (VL) tasks are a growing research area in both the Natural Language Processing (NLP) and Computer Vision communities, and the majority of techniques rely on object proposal generation as a pre-processing step [2], [51]. Object proposals are a set of regions, or bounding boxes, that a detector deems likely to contain an object. Object proposal generation offers an explainable, efficient, and highly effective bridge between raw images and VL tasks.