I. Introduction
Supervised classification of very high spatial resolution (VHSR) images remains an open research topic in the remote sensing (RS) field. Monitoring urbanization trends has become a crucial objective, and automatic RS classification techniques are therefore in high demand. In recent years, advanced methodologies have contributed significantly to solving the VHSR classification problem. Most of the proposed methods build on the bag-of-visual-words (BoVW) approach, which learns a dictionary for representing image content in an unsupervised manner by combining well-established feature descriptors (HOG, SIFT, etc.) with clustering algorithms. Such approaches include the spatial pyramid matching kernel (SPMK) [1], the spatial pyramid co-occurrence kernel (SPCK++) [2], min-tree kd-tree [3], and sparse coding [4] methods. However, the main drawback of all these techniques is the assumption that a general-purpose, manually designed feature descriptor, built on expert knowledge, can adequately represent complex image structures.
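For readers unfamiliar with the BoVW pipeline mentioned above, the following is a minimal sketch of its two generic steps: learning a dictionary by clustering local descriptors, and encoding an image as a histogram of visual words. It is not the implementation of any of the cited methods [1]-[4]. The function names (`kmeans`, `bovw_histogram`, `toy_descriptors`) are hypothetical, plain k-means is assumed as the clustering algorithm, and random toy vectors stand in for real HOG or SIFT descriptors.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(centers, d):
    """Index of the visual word (cluster center) closest to descriptor d."""
    return min(range(len(centers)), key=lambda i: sq_dist(centers[i], d))


def kmeans(descriptors, k, iters=20, seed=0):
    """Learn a k-word dictionary with plain Lloyd's k-means (assumed here)."""
    rng = random.Random(seed)
    centers = [list(c) for c in rng.sample(descriptors, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for d in descriptors:
            groups[nearest(centers, d)].append(d)
        for i, g in enumerate(groups):
            if g:  # keep the previous center if a cluster receives no points
                centers[i] = [sum(col) / len(g) for col in zip(*g)]
    return centers


def bovw_histogram(descriptors, centers):
    """Encode one image as a normalized histogram of visual-word counts."""
    hist = [0.0] * len(centers)
    for d in descriptors:
        hist[nearest(centers, d)] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist


# Toy stand-in for real local descriptors (an assumption for illustration):
# two well-separated groups of 4-D vectors.
_rng = random.Random(1)


def toy_descriptors(mean, n=50, dim=4):
    return [[_rng.gauss(mean, 0.1) for _ in range(dim)] for _ in range(n)]


training = toy_descriptors(0.0) + toy_descriptors(5.0)
dictionary = kmeans(training, k=2)
image_hist = bovw_histogram(toy_descriptors(5.0, n=10), dictionary)
```

In a full pipeline, the resulting histogram would be passed to a supervised classifier. The quality of the representation depends entirely on the hand-crafted descriptor fed into the clustering step, which is the limitation raised in the paragraph above.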