1. Introduction
The state-of-the-art in image captioning is defined by increasingly large-scale models trained on increasingly large-scale datasets [11], [18], [39], [42]. Scaling up leads to higher computational demands for model pretraining and finetuning on downstream tasks. This becomes especially relevant when numerous model versions may be needed for different visual domains [1] and end-users in practical applications, e.g., image captioning for the visually impaired [10].