1. Introduction
Identifying the salient regions of an image, i.e. those most likely to hold visual attention, is a long-standing and ill-defined problem [59] that relies heavily on carefully annotated data [51], [5], [54]. Recently, self-supervised learning (SSL) methods built on large-scale pre-trained backbones [9], [6], [22], such as DINO [7], have shown improved capability in segmenting images [21], [30] and extracting foreground objects [41], [39], [54], [4], [42].