1. Introduction
Vision Transformers (ViTs) [8] have achieved state-of-the-art performance in a number of vision tasks such as image classification [39], object detection [6], and semantic segmentation [5]. In ViT-based models, input images are split into patches and projected into an embedding space. A series of repeated transformer encoder layers, each consisting of alternating Multi-head Self-Attention (MSA) and Multi-Layer Perceptron (MLP) blocks, extracts feature representations from the embedded tokens for downstream tasks (e.g., classification). Recent studies have demonstrated the effectiveness of ViTs in learning uniform local and global spatial dependencies [31]. In addition, ViTs are well suited to learning pretext tasks and scale to distributed, collaborative, or federated learning scenarios. In this work, we study the vulnerability introduced by sharing ViT gradients in these settings.
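For concreteness, a minimal PyTorch sketch of such a patch-embedding-plus-encoder pipeline might look as follows. The class names (TinyViT, EncoderBlock) and all hyperparameters are illustrative placeholders and do not correspond to any particular ViT variant studied in this paper.

```python
# Minimal sketch of a ViT-style classifier: patchify -> linear projection ->
# stacked MSA/MLP encoder blocks -> class logits. Hyperparameters are arbitrary.
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One transformer encoder layer: MSA block then MLP block,
    each with pre-layer normalization and a residual connection."""
    def __init__(self, dim, heads, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # MSA + residual
        x = x + self.mlp(self.norm2(x))                     # MLP + residual
        return x

class TinyViT(nn.Module):
    """Patch embedding followed by a stack of encoder blocks and a linear head."""
    def __init__(self, img_size=224, patch=16, dim=192, depth=4, heads=3, num_classes=1000):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # patch embedding
        num_patches = (img_size // patch) ** 2
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        self.blocks = nn.ModuleList(EncoderBlock(dim, heads) for _ in range(depth))
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        x = self.proj(x).flatten(2).transpose(1, 2)        # (B, N, dim) patch tokens
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        for blk in self.blocks:
            x = blk(x)
        return self.head(self.norm(x)[:, 0])               # classify from the CLS token
```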
Figure: Inverting gradients for image recovery. We show that vision transformer gradients encode a surprising amount of information, such that high-fidelity, high-resolution original image batches can be recovered; see the 112×112 pixel MS-Celeb-1M and 224×224 pixel ImageNet1K sample recoveries above, and more in the experiments. Our method, GradViT, is the first successful attempt to invert ViT gradients, which is not achievable by previous state-of-the-art methods. We demonstrate that ViTs, despite lacking BatchNorm layers, suffer even more data leakage than CNNs. As insights, we show that ViT gradients (i) encode an uneven amount of original information across layers, and (ii) attention is all that reveals.
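For intuition about what "inverting gradients" means, the sketch below shows a generic gradient-matching baseline in the spirit of earlier gradient leakage attacks; it is not the GradViT objective itself. The names model, shared_grads, and true_label are assumed inputs (the victim network, the gradients observed from a participant, and a known or estimated label), and the plain L2 matching loss is an illustrative choice.

```python
# Illustrative gradient-matching inversion loop (generic baseline, not GradViT).
import torch
import torch.nn.functional as F

def invert_gradients(model, shared_grads, true_label, img_shape=(1, 3, 224, 224),
                     steps=2000, lr=0.1):
    # true_label is assumed to be a LongTensor of shape (batch,).
    dummy = torch.randn(img_shape, requires_grad=True)      # random image initialization
    opt = torch.optim.Adam([dummy], lr=lr)
    params = [p for p in model.parameters() if p.requires_grad]
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(model(dummy), true_label)
        grads = torch.autograd.grad(loss, params, create_graph=True)
        # Gradient-matching objective: L2 distance between the dummy image's
        # gradients and the shared (observed) gradients.
        match = sum(((g - sg) ** 2).sum() for g, sg in zip(grads, shared_grads))
        match.backward()
        opt.step()
    return dummy.detach()
```

In practice such attacks are typically combined with image priors and regularizers; the point of the sketch is only that the shared gradients alone define an optimization target over the input.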