I. Introduction
Recent years have witnessed great progress in content-based image retrieval (CBIR) [1]. However, significant challenges remain. One challenging problem in CBIR is the semantic gap between low-level visual features and high-level human perception [2]. Traditional methods are mainly based on low-level features [3], [4], such as color features [5] and texture features [6], and usually cannot achieve satisfactory results. Unlike traditional methods, the convolutional neural network (CNN) can learn hierarchical features, including high-level features, and has contributed greatly to feature extraction for CBIR in recent years. However, in the training stage a CNN is typically trained only to classify images well, without considering intra-class and inter-class distances, whereas in the retrieval stage the task is based almost entirely on a distance metric. Retrieval performance degrades when, for some samples, intra-class distances are larger than inter-class distances. Motivated by this, the Triplet-CNN, which contains three identical CNNs sharing all weights and biases, is employed [7]. The network is trained by "telling" the machine which samples belong to the same class and which belong to different classes, and penalizing it whenever a sample from a different class is judged more similar to the anchor than a sample from the same class.
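The penalty described above can be sketched as a standard margin-based triplet loss: the model is penalized only when a different-class (negative) sample is not at least a margin farther from the anchor than a same-class (positive) sample. This is a minimal illustrative sketch, not the exact formulation of [7]; the function name, margin value, and toy embeddings below are assumptions for illustration.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Margin-based triplet loss (illustrative sketch): penalize the model
    when the anchor is closer to a different-class sample (negative) than
    to a same-class sample (positive), up to a margin."""
    d_pos = np.linalg.norm(anchor - positive)  # intra-class distance
    d_neg = np.linalg.norm(anchor - negative)  # inter-class distance
    return max(d_pos - d_neg + margin, 0.0)

# Toy embeddings (hypothetical): the positive lies much closer to the
# anchor than the negative, so this triplet incurs no penalty; swapping
# the roles produces a violating triplet with a positive loss.
a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])
n = np.array([2.0, 0.0])
print(triplet_loss(a, p, n))  # 0.0: well-separated triplet
print(triplet_loss(a, n, p))  # positive penalty: violating triplet
```

In training, gradients of this loss pull same-class embeddings together and push different-class embeddings apart, which directly optimizes the distance metric used at retrieval time.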