I. Introduction
Deep neural networks (DNNs) have become the dominant algorithms in machine learning [1], largely owing to their superiority in addressing challenging real-world applications [2], [3]. Generally, the performance of a DNN relies on two deciding factors: 1) the architecture of the DNN and 2) the weights associated with that architecture. A DNN can deliver promising performance on the corresponding problem only when its architecture and weights are jointly optimized. Commonly, once the architecture of a DNN is determined, the optimal weights can be obtained by formulating the loss as a continuous function and then employing exact optimization algorithms to minimize it. In practice, gradient-based optimization algorithms are the most popular choice for minimizing the loss function, although they cannot theoretically guarantee the global optimum [4]. On the other hand, obtaining the optimal architecture is not a trivial task because the architecture cannot be directly optimized in the way the weights can. In practice, most, if not all, prevalent state-of-the-art DNN architectures are manually designed based on extensive human expertise, such as ResNet [5] and DenseNet [6], among others.
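As a concrete illustration of the gradient-based weight optimization mentioned above, the standard update rule can be sketched as follows, assuming a differentiable loss function L(w), a learning rate \eta, and a weight vector w (notation introduced here purely for illustration, not taken from the cited works):

% Minimal sketch of one gradient-descent step on the loss L(w);
% \eta is the learning rate and \nabla_{w} L the gradient with respect to the weights.
\[
  w^{(t+1)} = w^{(t)} - \eta \, \nabla_{w} L\bigl(w^{(t)}\bigr)
\]

Iterating this update drives the weights toward a local minimum of the loss, which is why such methods cannot, in general, certify that the global optimum has been reached.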