I. Introduction
Convolutional Neural Networks (CNNs) can extract expressive representations from high-dimensional data, achieving impressive performance on current visual benchmarks [1]–[3]. The weight-sharing mechanism, in which the convolution filter parameters are shared across all spatial positions, helps extract common features regardless of how the input images are translated. Nevertheless, current convolutional models remain ineffective at handling other transformations such as rotation and reflection [4], [5]. Lacking an internal mechanism for affine transformations, CNNs show limited generalization across the varied poses of real-world objects: the number of replicated feature detectors or labeled training images required grows exponentially with the dimensionality of the affine transformations [6], [7].
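To make the weight-sharing property concrete, the following minimal NumPy sketch (the `conv2d` helper, Sobel filter, and test pattern are illustrative choices, not from the paper) checks that a plain convolution is equivariant to translation but not, in general, to rotation:

```python
import numpy as np

def conv2d(img, kern):
    """Valid-mode 2D cross-correlation with a single shared filter."""
    kh, kw = kern.shape
    H, W = img.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kern)
    return out

# Asymmetric (Sobel-like) filter, so rotation matters.
kern = np.array([[1., 0., -1.],
                 [2., 0., -2.],
                 [1., 0., -1.]])

# A localized random pattern in an otherwise empty image.
rng = np.random.default_rng(0)
img = np.zeros((10, 10))
img[2:5, 2:5] = rng.standard_normal((3, 3))

# Translation equivariance: shifting the input shifts the response map.
shifted = np.roll(img, shift=(3, 3), axis=(0, 1))
assert np.allclose(np.roll(conv2d(img, kern), (3, 3), axis=(0, 1)),
                   conv2d(shifted, kern))

# No rotation equivariance: rotating the input does NOT simply
# rotate the response map when the filter is asymmetric.
rotated = np.rot90(img)
assert not np.allclose(np.rot90(conv2d(img, kern)),
                       conv2d(rotated, kern))
```

Because the shared filter commutes with translation but not rotation, a standard CNN must learn (or be shown) each rotated pose separately, which is the exponential cost noted above.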