
Emerging Properties in Self-Supervised Vision Transformers


Abstract:

In this paper, we question if self-supervised learning provides new properties to Vision Transformer (ViT) [16] that stand out compared to convolutional networks (convnets). Beyond the fact that adapting self-supervised methods to this architecture works particularly well, we make the following observations: first, self-supervised ViT features contain explicit information about the semantic segmentation of an image, which does not emerge as clearly with supervised ViTs, nor with convnets. Second, these features are also excellent k-NN classifiers, reaching 78.3% top-1 on ImageNet with a small ViT. Our study also underlines the importance of momentum encoder [26], multi-crop training [9], and the use of small patches with ViTs. We implement our findings into a simple self-supervised method, called DINO, which we interpret as a form of self-distillation with no labels. We show the synergy between DINO and ViTs by achieving 80.1% top-1 on ImageNet in linear evaluation with ViT-Base.
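The abstract describes DINO as a form of self-distillation with no labels: a student network is trained to match the output of a momentum (EMA) teacher [26] on different augmented views of the same image. The following is a minimal, illustrative sketch of that idea in PyTorch; the toy backbone, noise-based "views", temperatures, momentum values, and training loop are assumptions for illustration only, not the paper's actual configuration (which uses ViT backbones, multi-crop augmentation [9], and tuned schedules).

```python
# Minimal sketch of label-free self-distillation in the spirit of DINO.
# All hyperparameters and the toy MLP backbone are illustrative assumptions.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

# stand-in for a ViT backbone + projection head
student = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.GELU(),
                        nn.Linear(256, 4096))
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad = False                              # teacher receives no gradients

center = torch.zeros(1, 4096)                            # running center of teacher outputs
opt = torch.optim.SGD(student.parameters(), lr=0.03)
t_s, t_t, m_teacher, m_center = 0.1, 0.04, 0.996, 0.9    # assumed temperatures / momenta


def dino_loss(student_out, teacher_out):
    """Cross-entropy between the centered, sharpened teacher and the student."""
    t = F.softmax((teacher_out - center) / t_t, dim=-1).detach()
    s = F.log_softmax(student_out / t_s, dim=-1)
    return -(t * s).sum(dim=-1).mean()


for _ in range(10):                                       # toy training loop on random data
    x = torch.randn(8, 3, 32, 32)
    # additive noise as a crude stand-in for the paper's image crops
    view1, view2 = x + 0.1 * torch.randn_like(x), x + 0.1 * torch.randn_like(x)

    with torch.no_grad():
        t1, t2 = teacher(view1), teacher(view2)
    s1, s2 = student(view1), student(view2)

    # each view of the student predicts the teacher's output on the other view
    loss = 0.5 * (dino_loss(s1, t2) + dino_loss(s2, t1))
    opt.zero_grad()
    loss.backward()
    opt.step()

    with torch.no_grad():                                 # EMA updates of teacher and center
        for ps, pt in zip(student.parameters(), teacher.parameters()):
            pt.mul_(m_teacher).add_((1 - m_teacher) * ps)
        center.mul_(m_center).add_(
            (1 - m_center) * torch.cat([t1, t2]).mean(0, keepdim=True))
```

In the paper, centering (subtracting a running mean of teacher outputs) and sharpening (a low teacher temperature) are what prevent collapse to a trivial constant output without negative pairs; the momentum encoder [26] corresponds to the EMA teacher update at the end of the loop.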
Date of Conference: 10-17 October 2021
Date Added to IEEE Xplore: 28 February 2022
Conference Location: Montreal, QC, Canada
References
1. Rohan Anil, Gabriel Pereyra, Alexandre Passos, Robert Ormandi, George E. Dahl and Geoffrey E. Hinton, "Large scale distributed neural network training through online distillation", 2018.
2. Yuki Markus Asano, Christian Rupprecht and Andrea Vedaldi, "Self-labelling via simultaneous clustering and representation learning", ICLR, 2020.
3. Dzmitry Bahdanau, Kyunghyun Cho and Yoshua Bengio, "Neural machine translation by jointly learning to align and translate", 2014.
4. Maxim Berman, Hervé Jégou, Andrea Vedaldi, Iasonas Kokkinos and Matthijs Douze, "MultiGrain: a unified image embedding for classes and instances", 2019.
5. Piotr Bojanowski and Armand Joulin, "Unsupervised learning by predicting noise", ICML, 2017.
6. Cristian Buciluǎ, Rich Caruana and Alexandru Niculescu-Mizil, "Model compression", SIGKDD, 2006.
7. Mathilde Caron, Piotr Bojanowski, Armand Joulin and Matthijs Douze, "Deep clustering for unsupervised learning of visual features", ECCV, 2018.
8. Mathilde Caron, Piotr Bojanowski, Julien Mairal and Armand Joulin, "Unsupervised pre-training of image features on non-curated data", ICCV, 2019.
9. Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski and Armand Joulin, "Unsupervised learning of visual features by contrasting cluster assignments", NeurIPS, 2020.
10. Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Niki Parmar, Mike Schuster, Zhifeng Chen et al., "The best of both worlds: Combining recent advances in neural machine translation", 2018.
11. Ting Chen, Simon Kornblith, Mohammad Norouzi and Geoffrey Hinton, "A simple framework for contrastive learning of visual representations", 2020.
12. Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi and Geoffrey Hinton, "Big self-supervised models are strong semi-supervised learners", NeurIPS, 2020.
13. Xinlei Chen, Haoqi Fan, Ross Girshick and Kaiming He, "Improved baselines with momentum contrastive learning", 2020.
14. Xinlei Chen and Kaiming He, "Exploring simple siamese representation learning", 2020.
15. Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding", 2018.
16. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly et al., "An image is worth 16x16 words: Transformers for image recognition at scale", 2020.
17. Alexey Dosovitskiy, Philipp Fischer, Jost Tobias Springenberg, Martin Riedmiller and Thomas Brox, "Discriminative unsupervised feature learning with exemplar convolutional neural networks", TPAMI, 2016.
18. Matthijs Douze, Hervé Jégou, Harsimrat Sandhawalia, Laurent Amsaleg and Cordelia Schmid, "Evaluation of GIST descriptors for web-scale image search", CIVR, 2009.
19. Aleksandr Ermolov, Aliaksandr Siarohin, Enver Sangineto and Nicu Sebe, "Whitening for self-supervised representation learning", 2020.
20. Zhiyuan Fang, Jianfeng Wang, Lijuan Wang, Lei Zhang, Yezhou Yang and Zicheng Liu, "SEED: Self-supervised distillation for visual representation", 2021.
21. Spyros Gidaris, Andrei Bursuc, Gilles Puy, Nikos Komodakis, Matthieu Cord and Patrick Pérez, "Online bag-of-visual-words generation for unsupervised representation learning", 2020.
22. Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, et al., "Accurate, large minibatch SGD: Training ImageNet in 1 hour", 2017.
23. Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, et al., "Bootstrap your own latent: A new approach to self-supervised learning", NeurIPS, 2020.
24. Shir Gur, Ameen Ali and Lior Wolf, "Visualization of supervised and self-supervised neural networks via attribution guided factorization", 2020.
25. Michael Gutmann and Aapo Hyvärinen, "Noise-contrastive estimation: A new estimation principle for unnormalized statistical models", International Conference on Artificial Intelligence and Statistics, 2010.
26. Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie and Ross Girshick, "Momentum contrast for unsupervised visual representation learning", CVPR, 2020.
27. Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun, "Deep residual learning for image recognition", CVPR, 2016.
28. Geoffrey Hinton, Oriol Vinyals and Jeff Dean, "Distilling the knowledge in a neural network", 2015.
29. Jiabo Huang, Qi Dong, Shaogang Gong and Xiatian Zhu, "Unsupervised deep learning by neighbourhood discovery", ICML, 2019.
30. Allan Jabri, Andrew Owens and Alexei A. Efros, "Space-time correspondence as a contrastive random walk", 2020.