1. Introduction
Pre-trained convolutional neural networks, or convnets, are important components of image recognition applications [7, 8, 38, 46]. They improve the generalization of models trained on a limited amount of data [39] and speed up training in applications where annotated data is abundant [20]. Convnets produce good generic representations when they are pre-trained on large supervised datasets like ImageNet [11]. However, designing such fully-annotated datasets has required a significant effort from the research community in terms of data cleansing and manual labeling. Scaling up the annotation process to datasets that are orders of magnitude larger poses significant difficulties. Using raw metadata as an alternative has been shown to perform comparatively well [23, 41], even surpassing ImageNet pre-training when models are trained on billions of images [30]. However, metadata are not always available, and when they are, they do not necessarily cover the full extent of a dataset. These difficulties motivate the design of methods that learn transferable features without using any annotation.