1. Introduction
Content-based image retrieval has been a fundamental computer vision task for decades. More recently, this task has evolved toward letting users provide additional forms of interaction (e.g., sentences, attributes, and clicks) along with the query image. Interactive image retrieval [11], [15], [57] is relevant in the context of online shopping, specifically for product categories in which appearance is one of the pre-eminent factors for selection, such as fashion items. In this context, it is necessary not only to learn expressive visual representations of images [22], [33], [16], [42], [46], but also to give the model the ability to understand the user's interactions and modify the search results accordingly.