1. Introduction
Visual tracking is a fundamental computer vision problem with a wide range of applications. Given a target object specified by a bounding box in the first frame, visual tracking aims to locate the target in subsequent frames. This is challenging because target objects often undergo significant appearance changes over time and may temporarily leave the field of view. Conventional trackers, prior to the advances of deep learning, mainly consist of a feature extraction module and a decision-making mechanism. Recent state-of-the-art deep trackers often use deep models pre-trained for the object recognition task to extract features, while putting more emphasis on designing effective decision-making modules. While various decision models, such as correlation filters [15], regressors [14], [35], [38], [37], and classifiers [16], [29], [32], have been extensively explored, considerably less attention has been paid to learning more discriminative deep features.