1. Introduction
An explosion of self-supervised learning techniques, including adversarial [23], [31], [32], contrastive [11], [12], [26], [72], reconstructive [34], [66], and denoising [29], [60] approaches, combined with a focus on training large-scale foundation models [4] on vast collections of image data, has produced deep neural networks with dramatic new capabilities. Recent examples of such models include CLIP [51], DINO [8], MAE [27], and Stable Diffusion [53]. As training is no longer primarily driven by annotated data, there is a critical need to understand what these models have learned, to provide interpretable insight into how they work, and to develop techniques for adapting their learned representations to new tasks.
Figure 1. Our novel optimization procedure, resembling spectral clustering, leverages features throughout the layers of a pre-trained model to extract dense structural representations of images. Shown are results of applying our method to Stable Diffusion [53]. Left: analyzing internal feature affinity for a single input image yields region groupings. Right: extending the affinity graph across images yields coherent dataset-level segmentation and reveals ‘what’ (object identity) and ‘where’ (spatial location) pathways, depending on the feature source.
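To make the spectral-clustering analogy concrete, the sketch below groups pixels into regions from a dense feature map via a standard spectral clustering pipeline: a cosine-similarity affinity graph over per-pixel features, a normalized graph Laplacian, and k-means on the leading eigenvectors. This is a generic illustration under stated assumptions, not our full optimization procedure; in particular, the extraction of the `features` array from a pre-trained model (e.g., Stable Diffusion activations) is assumed and omitted here.

```python
# Minimal sketch: spectral clustering over dense per-pixel features.
# Assumes `features` is an (H*W, C) array from some pre-trained model;
# the feature-extraction step itself is not shown.
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans


def spectral_regions(features: np.ndarray, n_regions: int = 5) -> np.ndarray:
    """Group pixels into regions by spectral clustering of feature affinities."""
    # Cosine-similarity affinity between all pairs of pixel features.
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    affinity = np.clip(f @ f.T, 0.0, None)  # keep edge weights non-negative

    # Normalized graph Laplacian: L = I - D^{-1/2} W D^{-1/2}.
    d = affinity.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d + 1e-8)
    laplacian = np.eye(len(d)) - (d_inv_sqrt[:, None] * affinity * d_inv_sqrt[None, :])

    # Eigenvectors of the smallest eigenvalues embed pixels for partitioning.
    _, vecs = eigh(laplacian, subset_by_index=[0, n_regions - 1])

    # Discretize the spectral embedding into region labels with k-means.
    return KMeans(n_clusters=n_regions, n_init=10).fit_predict(vecs)


# Example usage with random stand-in features for a 16x16 map, 64 channels.
labels = spectral_regions(np.random.randn(16 * 16, 64), n_regions=5)
print(labels.reshape(16, 16))
```

Real feature maps would replace the random stand-in, and the resulting label map corresponds to the single-image region groupings shown on the left of Figure 1; extending the affinity matrix to span pixels from multiple images yields the dataset-level segmentation on the right.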