1. Introduction
Vision Transformers (ViTs) [8] have achieved state-of-the-art performance in a number of vision tasks such as image classification [39], object detection [6], and semantic segmentation [5]. In ViT-based models, input images are split into patches and projected into an embedding space. A series of repeated transformer encoder layers, each consisting of alternating Multi-head Self-Attention (MSA) and Multi-Layer Perceptron (MLP) blocks, extracts feature representations from the embedded tokens for downstream tasks (e.g., classification). Recent studies have demonstrated the effectiveness of ViTs in learning uniform local and global spatial dependencies [31]. In addition, ViTs are well suited to learning pretext tasks and scale to distributed, collaborative, or federated learning scenarios. In this work, we study the vulnerability introduced by sharing ViT gradients in these settings.
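For concreteness, a minimal PyTorch sketch of such a patch-embedding-plus-encoder pipeline might look as follows. The class names (TinyViT, EncoderBlock) and all hyperparameters are illustrative placeholders and do not correspond to any particular ViT variant studied in this paper.

```python
# Minimal sketch of a ViT-style classifier: patchify -> linear projection ->
# stacked MSA/MLP encoder blocks -> class logits. Hyperparameters are arbitrary.
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One transformer encoder layer: MSA block then MLP block,
    each with pre-layer normalization and a residual connection."""
    def __init__(self, dim, heads, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # MSA + residual
        x = x + self.mlp(self.norm2(x))                     # MLP + residual
        return x

class TinyViT(nn.Module):
    """Patch embedding followed by a stack of encoder blocks and a linear head."""
    def __init__(self, img_size=224, patch=16, dim=192, depth=4, heads=3, num_classes=1000):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # patch embedding
        num_patches = (img_size // patch) ** 2
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        self.blocks = nn.ModuleList(EncoderBlock(dim, heads) for _ in range(depth))
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        x = self.proj(x).flatten(2).transpose(1, 2)        # (B, N, dim) patch tokens
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        for blk in self.blocks:
            x = blk(x)
        return self.head(self.norm(x)[:, 0])               # classify from the CLS token
```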
Figure: Inverting gradients for image recovery. We show that vision transformer gradients encode a surprising amount of information, such that high-fidelity, high-resolution original image batches can be recovered; see the 112×112 pixel MS-Celeb-1M and 224×224 pixel ImageNet1K sample recoveries above, and more in the experiments. Our method, GradViT, is the first successful attempt to invert ViT gradients, which is not achievable by previous state-of-the-art methods. We demonstrate that ViTs, despite lacking BatchNorm layers, suffer even more data leakage than CNNs. As insights, we show that ViT gradients (i) encode an uneven amount of original information across layers, and (ii) attention is all that reveals.
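For intuition about what "inverting gradients" means, the sketch below shows a generic gradient-matching baseline in the spirit of earlier gradient leakage attacks; it is not the GradViT objective itself. The names model, shared_grads, and true_label are assumed inputs (the victim network, the gradients observed from a participant, and a known or estimated label), and the plain L2 matching loss is an illustrative choice.

```python
# Illustrative gradient-matching inversion loop (generic baseline, not GradViT).
import torch
import torch.nn.functional as F

def invert_gradients(model, shared_grads, true_label, img_shape=(1, 3, 224, 224),
                     steps=2000, lr=0.1):
    # true_label is assumed to be a LongTensor of shape (batch,).
    dummy = torch.randn(img_shape, requires_grad=True)      # random image initialization
    opt = torch.optim.Adam([dummy], lr=lr)
    params = [p for p in model.parameters() if p.requires_grad]
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(model(dummy), true_label)
        grads = torch.autograd.grad(loss, params, create_graph=True)
        # Gradient-matching objective: L2 distance between the dummy image's
        # gradients and the shared (observed) gradients.
        match = sum(((g - sg) ** 2).sum() for g, sg in zip(grads, shared_grads))
        match.backward()
        opt.step()
    return dummy.detach()
```

In practice such attacks are typically combined with image priors and regularizers; the point of the sketch is only that the shared gradients alone define an optimization target over the input.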