Introduction
Recently, deep learning has become a significant part of information science research [1]–[4], and it has achieved outstanding performance in many fields, such as image classification [5]–[7], action recognition [8], [9], image captioning [10], and target localization [11]–[13]. The performance of a deep neural network is mainly determined by its model structure and the corresponding optimization algorithm. Therefore, a good optimization algorithm can improve the performance of a deep neural network under a fixed network architecture.
At present, first-order optimization algorithms play an important role in deep learning on account of their efficiency and effectiveness in dealing with large-scale optimization problems [14]. Among these algorithms, stochastic gradient descent (SGD) [15]–[19] is the representative method for training deep neural networks: it iteratively moves in the direction of the negative gradient to update the parameters until convergence. Subsequently, several variants of SGD were proposed that automatically adjust the learning rate by dividing it by the square root of some average of the squared elements of the past gradients. Adagrad [20], the first of these variants, works well for sparse data. However, it accumulates all past squared gradients, which can cause the learning rate to shrink too quickly. To resolve this problem, algorithms such as Adadelta [21], RMSprop [22], and Adam [23] use an exponential moving average of the past squared gradients instead. Although these optimization methods have achieved good performance in many applications, they can exhibit undesirable convergence behavior. AMSGrad [24] was proposed to guarantee convergence by employing long-term memory of the past gradients. However, when the new gradients oscillate, AMSGrad may produce undesirable estimates. The defects of these existing methods motivate us to find a new way to resolve these problems and further improve optimization performance.
In this paper, we present a new first-order gradient descent optimization algorithm with a more flexible weighting mechanism, which resolves the undesirable convergence behavior and improves upon the performance of Adam and AMSGrad. Unlike most existing methods, our weighting mechanism uses a dynamic exponential decay rate for the second moment estimate instead of a preconfigured, fixed one. In addition, our method can easily adjust the degree to which the past gradients are weighted in the estimation. The proposed exponential moving average variant is built on the idea of placing more weight on the past gradients than on the recent gradients. Our analysis indicates that the method guarantees convergence, and the experimental results show that it further improves performance. In the sequel, our method will be referred to as NWM-Adam.
The remainder of this paper is arranged as follows. In Section II, we review related work, including the most common optimization algorithms. In Section III, we describe the proposed NWM-Adam algorithm. In Section IV, we provide the convergence analysis of our method. In Section V, we describe the experimental design and discuss the results. Finally, in Section VI, we conclude our work and discuss future directions.
Related Work
Optimization is one of the most important research directions in computational mathematics [25]. In the field of deep learning, the choice of optimization algorithm is a top priority when building a model. Even if the datasets and model architectures are identical, different optimization algorithms can lead to dramatically different training results. In this section, we outline some optimization algorithms that are widely used in deep learning.
Gradient descent minimizes an objective function $J(\theta)$ parameterized by the model parameters $\theta$ by repeatedly updating the parameters in the direction opposite to the gradient ${\nabla _\theta }J(\theta)$, with the learning rate $\eta$ determining the size of each step.
Batch gradient descent, i.e., vanilla gradient descent, employs the whole training dataset to compute the gradient of the objective function with regard to the parameters, whereas stochastic gradient descent uses a single training example $({x^{(i)}},{y^{(i)}})$ and mini-batch gradient descent uses a mini-batch of $n$ training examples:\begin{align*} \theta=&\theta - \eta \cdot {\nabla _\theta }J(\theta) \qquad (\mathrm {Batch}) \tag{1}\\ \theta=&\theta - \eta \cdot {\nabla _\theta }J(\theta;{x^{(i)}};{y^{(i)}}) \qquad (\mathrm {Stochastic}) \tag{2}\\ \theta=&\theta - \eta \cdot {\nabla _\theta }J(\theta;{x^{(i:i + n)}};{y^{(i:i + n)}}) \qquad (\mathrm {Mini-batch})\tag{3}\end{align*}
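To make the distinction concrete, the following is a minimal NumPy sketch of the three update rules in Eqs. (1)–(3); the quadratic toy objective, the function names, and the sampling scheme are illustrative assumptions rather than part of the original formulation.

```python
import numpy as np

def grad(theta, X, y):
    """Gradient of a toy least-squares objective J(theta) = 0.5*||X @ theta - y||^2 / len(y)."""
    return X.T @ (X @ theta - y) / len(y)

def batch_gd(theta, X, y, eta):
    # Eq. (1): use the whole training set for one update.
    return theta - eta * grad(theta, X, y)

def sgd(theta, X, y, eta, rng):
    # Eq. (2): use a single randomly chosen example per update.
    i = rng.integers(len(y))
    return theta - eta * grad(theta, X[i:i + 1], y[i:i + 1])

def minibatch_gd(theta, X, y, eta, rng, n=32):
    # Eq. (3): use a mini-batch of n examples per update.
    idx = rng.choice(len(y), size=min(n, len(y)), replace=False)
    return theta - eta * grad(theta, X[idx], y[idx])
```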
SGD is prone to oscillations when crossing ravines [18]. Therefore, momentum [26] is employed, which accelerates SGD in the relevant direction and dampens oscillations.\begin{align*} {m_{t}}=&\gamma {m_{t - 1}} + \eta \cdot {\nabla _\theta }J(\theta) \tag{4}\\ \theta=&\theta - {m_{t}}\tag{5}\end{align*}
On the basis of the original step size, SGD-Momentum adds a fraction $\gamma$ (the momentum term) of the update vector of the past time step to the current update vector, so that directions whose gradients keep the same sign are accelerated while directions whose gradients change sign are damped.
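A minimal sketch of the momentum update in Eqs. (4)–(5), taking the current gradient $g$ as an argument; the default value of `gamma` is an assumption.

```python
def sgd_momentum(theta, m, g, eta, gamma=0.9):
    # Eqs. (4)-(5): accumulate a decaying sum of past gradients and step along it.
    m = gamma * m + eta * g
    theta = theta - m
    return theta, m
```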
Deep learning models often involve a large number of parameters, and different parameters are updated with different frequencies. It is therefore desirable to perform larger updates for infrequently updated parameters and smaller updates for frequently updated parameters. Adagrad [20] achieves this effect by introducing the second moment.\begin{equation*} {G_{t}} = {\mathrm{diag}}\left({\sum \limits _{i = 1}^{t} {g_{i,1}^{2}},\sum \limits _{i = 1}^{t} {g_{i,2}^{2}}, \cdots,\sum \limits _{i = 1}^{t} {g_{i,d}^{2}} }\right)\tag{6}\end{equation*}
The per-parameter update rule and the corresponding gradient are \begin{equation*} {\theta _{t,i}} = {\theta _{t - 1,i}} - \frac {\eta }{{\sqrt {{G_{t - 1,ii}} + \varepsilon } }} \cdot {g_{t - 1,i}}\tag{7}\end{equation*}
\begin{equation*} {g_{t - 1,i}} = {\nabla _{\theta _{i}} }J({\theta _{t-1}})\tag{8}\end{equation*}
Adagrad adapts the learning rate to the parameters, so there is no need to tune the learning rate manually. However, the continual accumulation of squared gradients in the denominator causes the learning rate to keep shrinking. Eventually, the learning rate becomes vanishingly small, at which point the model can no longer learn anything new.
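A minimal vectorized sketch of the Adagrad update in Eqs. (6)–(7), keeping only the diagonal of $G_t$ as an accumulator; the variable names are illustrative.

```python
import numpy as np

def adagrad(theta, G_diag, g, eta, eps=1e-8):
    # Eq. (6): accumulate squared gradients per parameter (diagonal of G_t).
    G_diag = G_diag + g ** 2
    # Eq. (7): scale each parameter's step by the inverse root of its accumulator.
    theta = theta - eta * g / np.sqrt(G_diag + eps)
    return theta, G_diag
```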
Adadelta [21] modifies the calculation of the second moment, which addresses Adagrad's aggressively decaying learning rate. Adadelta replaces the accumulation with an exponentially decaying average of past squared gradients. The update rule of Adadelta is:\begin{align*} \Delta {\theta _{t - 1}}=&- \frac {{\sqrt {E{{\left [{ {\Delta {\theta ^{2}}} }\right]}_{t - 2}} + \varepsilon } }}{{\sqrt {E{{\left [{ {g^{2}} }\right]}_{t - 1}} + \varepsilon } }}{g_{t - 1}} \tag{9}\\ {\theta _{t}}=&{\theta _{t - 1}} + \Delta {\theta _{t - 1}} \tag{10}\\ E{\left [{ {g^{2}} }\right]_{t - 1}}=&\gamma E{\left [{ {g^{2}} }\right]_{t - 2}} + (1 - \gamma)g_{t - 1}^{2} \tag{11}\\ E{\left [{ {\Delta {\theta ^{2}}} }\right]_{t - 2}}=&\gamma E{\left [{ {\Delta {\theta ^{2}}} }\right]_{t - 3}} + (1 - \gamma)\Delta \theta _{t - 2}^{2}\tag{12}\end{align*}
RMSprop [22] also confines the window of accumulated past squared gradients to a fixed effective size instead of accumulating all past squared gradients, which is similar to Adadelta. This algorithm divides the learning rate by the square root of an exponentially decaying average of past squared gradients.\begin{align*} E{\left [{ {g^{2}} }\right]_{t - 1}}=&0.9E{\left [{ {g^{2}} }\right]_{t - 2}} + 0.1g_{t - 1}^{2} \tag{13}\\ {\theta _{t}}=&{\theta _{t - 1}} - \frac {\eta }{{\sqrt {E{{\left [{ {g^{2}} }\right]}_{t - 1}} + \varepsilon } }}{g_{t - 1}}\tag{14}\end{align*}
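A minimal sketch of the RMSprop update in Eqs. (13)–(14); the decay constant 0.9 follows the equations above, and the remaining defaults are assumptions.

```python
import numpy as np

def rmsprop(theta, Eg2, g, eta, eps=1e-8):
    # Eq. (13): exponentially decaying average of squared gradients.
    Eg2 = 0.9 * Eg2 + 0.1 * g ** 2
    # Eq. (14): divide the learning rate by the root of that average.
    theta = theta - eta * g / np.sqrt(Eg2 + eps)
    return theta, Eg2
```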
Adam [23] can be considered a combination of RMSprop and momentum. In addition to using an exponentially decaying average of past squared gradients like RMSprop, Adam also keeps an exponentially decaying average of past gradients like momentum.\begin{align*} {m_{t}}=&{\beta _{1}}{m_{t - 1}} + (1 - {\beta _{1}}){g_{t}} \tag{15}\\ {v_{t}}=&{\beta _{2}}{v_{t - 1}} + (1 - {\beta _{2}})g_{t}^{2}\tag{16}\end{align*}
The bias-corrected first and second moment estimates are \begin{align*} {\hat m_{t}}=&\frac {m_{t}}{1 - \beta _{1}^{t}} \tag{17}\\ {\hat v_{t}}=&\frac {v_{t}}{1 - \beta _{2}^{t}}\tag{18}\end{align*}
Finally, the update rule of Adam is:\begin{equation*} {\theta _{t + 1}} = \theta {}_{t} - \frac {\eta }{{\sqrt {\hat v_{t}} + \varepsilon }}{\hat m_{t}}\tag{19}\end{equation*}
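A minimal sketch of the Adam update in Eqs. (15)–(19); the default decay rates $\beta_1 = 0.9$ and $\beta_2 = 0.999$ are the values recommended in [23], and the variable names are illustrative.

```python
import numpy as np

def adam(theta, m, v, g, t, eta, beta1=0.9, beta2=0.999, eps=1e-8):
    # Eqs. (15)-(16): decaying averages of gradients and squared gradients.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    # Eqs. (17)-(18): bias correction for the zero initialization of m and v.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Eq. (19): parameter update.
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```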
Zhang et al. [27] designed an adaptive exponential decay rate ${\beta _{t}}$ that is adjusted at every time step:\begin{align*} {\beta _{t}}=&1 - \frac {1}{\iota _{t}} \tag{20}\\ {\iota _{t}}=&\left({1 - \frac {{{m_{t - 1}}}}{{\sqrt {{v_{t - 1}}} }}}\right){\iota _{t - 1}} + 1\tag{21}\end{align*}
Reddi et al. [24] pointed out a problem in the proof of convergence of the Adam algorithm. At the same time, the authors presented an example of a convex optimization problem on which Adam fails to converge to an optimal solution. In order to resolve this issue, the authors proposed to modify the Adam algorithm by employing the idea of "long-term memory" of past gradients. The modified Adam is termed AMSGrad. The main difference between AMSGrad and Adam is that AMSGrad applies the "max" operation to all second moment estimates up to the current time step, i.e., it normalizes the update by the maximum of all past $v_t$ rather than by the current $v_t$, so that the effective learning rate is non-increasing.
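A minimal sketch of the AMSGrad modification, which only changes the normalizer of the Adam update; the default hyper-parameter values and the added $\varepsilon$ term are assumptions for illustration.

```python
import numpy as np

def amsgrad(theta, m, v, v_hat_max, g, eta, beta1=0.9, beta2=0.999, eps=1e-8):
    # Same moment estimates as Adam.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    # "Long-term memory": keep the element-wise maximum of all past v_t.
    v_hat_max = np.maximum(v_hat_max, v)
    # Normalize by the maximum instead of the current v_t.
    theta = theta - eta * m / (np.sqrt(v_hat_max) + eps)
    return theta, m, v, v_hat_max
```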
NWM-Adam
In this section, we describe our proposed NWM-Adam algorithm. NWM-Adam uses a dynamic exponential decay rate for the second moment estimate instead of a preconfigured, fixed one. First, we provide a universal update rule that covers many gradient descent optimization algorithms.\begin{align*} {\theta _{t}}=&{\theta _{t - 1}} - \frac {{{\eta _{t - 1}}}}{{\sqrt {{V_{t - 1}}} }}{m_{t - 1}} \tag{22}\\ {m_{t - 1}}=&{\phi _{t - 1}}(\nabla {J_{1}}({\theta _{1}}),\ldots,\nabla {J_{t - 1}}({\theta _{t - 1}})) \tag{23}\\ {V_{t - 1}}=&{\psi _{t - 1}}(\nabla {J_{1}}({\theta _{1}}),\ldots,\nabla {J_{t - 1}}({\theta _{t - 1}}))\tag{24}\end{align*}
Reddi et al. [24] proved that the Adam algorithm can fail to converge to an optimal solution even in simple one-dimensional convex settings, which refutes the convergence analysis of Adam given in [23]. In fact, this fundamental flaw also exists in other widely used exponential moving average methods such as RMSprop and Adadelta. The main problem is related to the following quantity.\begin{equation*} {\Gamma _{t}} = \frac {{\sqrt {V_{t}} }}{\eta _{t}} - \frac {{\sqrt {{V_{t - 1}}} }}{{{\eta _{t - 1}}}}\tag{25}\end{equation*}
The above quantity denotes the change of the inverse of the effective learning rate at each iteration. The update rules of SGD and Adagrad result in non-increasing learning rates. On the basis of the update rules of SGD and Adagrad in the previous section, we can observe that ${\Gamma _{t}} \succeq 0$ for every $t$, whereas the exponential moving average used in Adam and RMSprop can make ${\Gamma _{t}}$ indefinite, which is the root cause of their possible non-convergence.
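For instance, substituting the usual step sizes into Eq. (25) makes the sign of ${\Gamma _{t}}$ explicit; the decayed step size ${\eta _{t}} = \eta /\sqrt {t}$ for SGD is a standard illustrative choice.\begin{align*} \Gamma _{t}=&\frac {\sqrt {t}}{\eta } - \frac {\sqrt {t - 1}}{\eta } \ge 0 \qquad (\mathrm {SGD},\ {\eta _{t}} = \eta /\sqrt {t},\ {V_{t}} = I)\\ \Gamma _{t,i}=&\frac {1}{\eta }\left({\sqrt {\sum \limits _{p = 1}^{t} {g_{p,i}^{2}}} - \sqrt {\sum \limits _{p = 1}^{t - 1} {g_{p,i}^{2}}} }\right) \ge 0 \qquad (\mathrm {Adagrad},\ {\eta _{t}} = \eta)\end{align*}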
In order to resolve the aforementioned convergence problem, Reddi et al. [24] proposed to apply the idea of long-term memory of past gradients to the original Adam algorithm. In this paper, NWM-Adam, i.e., the proposed new exponential moving average variant, is also developed on the basis of this idea and puts more memory on the past gradients than on the recent gradients. In other words, our method places larger weights on the past gradients than on the recent gradients. Our proposed NWM-Adam keeps the quantity ${\Gamma _{t}}$ positive semi-definite, so that the effective learning rate does not increase over time and the convergence issue described above is avoided.
The following presents the details of our proposed NWM-Adam. Eqs. (22), (23) and (24) are the main components of our algorithm. Let $g_t$ denote the gradient of the objective function at time step $t$. The first moment estimate and the second moment estimate of NWM-Adam are computed as \begin{align*} {m_{t}}=&{\beta _{1}}{m_{t - 1}} + (1 - {\beta _{1}}){g_{t}} \tag{26}\\ {v_{t}}=&{\beta _{2,t}}{v_{t - 1}} + (1 - {\beta _{2,t}})g_{t}^{2}\tag{27}\end{align*}
where the dynamic exponential decay rate is \begin{equation*} {\beta _{2,t}} = \frac {{({t^\lambda } - 1)}}{t^\lambda }\tag{28}\end{equation*} in which $\lambda$ is a positive hyper-parameter.
Therefore, NWM-Adam can easily adjust, through $\lambda$, the degree to which the past gradients are weighted in the estimation. The key difference between NWM-Adam and Adam is that NWM-Adam employs a changing exponential decay rate ${\beta _{2,t}}$ for the second moment estimate instead of the fixed ${\beta _{2}}$ used in Adam.
Algorithm 1: NWM-Adam, our proposed algorithm for stochastic optimization. $g_{t}$ denotes the gradient and $g_{t}^{2}$ represents the element-wise square ${g_{t}} \odot {g_{t}}$. $\varepsilon$ is set to ${10^{ - 8}}$, which avoids a zero denominator.

Require: step size $\eta$, decay rate $\beta_{1}$, hyper-parameter $\lambda$, initial parameters $\theta_{0}$; initialize $m_{0} = 0$, $v_{0} = 0$, $t = 0$
For $t = 1, 2, \ldots$ do
  Get the gradient $g_{t}$ with regard to the objective function at time step $t$
  Get the hyper-parameter $\beta_{2,t} = ({t^\lambda } - 1)/{t^\lambda }$
  Update the first moment estimate $m_{t} = \beta_{1} m_{t-1} + (1 - \beta_{1}) g_{t}$
  Update the second raw moment estimate $v_{t} = \beta_{2,t} v_{t-1} + (1 - \beta_{2,t}) g_{t}^{2}$
  Update the parameters $\theta_{t} = \theta_{t-1} - \eta\, m_{t}/(\sqrt{v_{t}} + \varepsilon)$
End
Until the stopping criterion is met
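The following is a minimal NumPy sketch of one NWM-Adam step, written directly from Eqs. (26)–(28) and the update rule in Eq. (22); the use of $\sqrt{v_t} + \varepsilon$ in the denominator and the default values of `beta1` and `lam` are assumptions for illustration.

```python
import numpy as np

def nwm_adam_step(theta, m, v, g, t, eta, beta1=0.9, lam=1.0, eps=1e-8):
    """One NWM-Adam update; t starts at 1."""
    # Eq. (28): dynamic exponential decay rate for the second moment.
    beta2_t = (t ** lam - 1.0) / (t ** lam)
    # Eq. (26): first moment estimate (same form as Adam).
    m = beta1 * m + (1 - beta1) * g
    # Eq. (27): second moment estimate with the dynamic decay rate.
    v = beta2_t * v + (1 - beta2_t) * g ** 2
    # Eq. (22): parameter update; eps guards against a zero denominator.
    theta = theta - eta * m / (np.sqrt(v) + eps)
    return theta, m, v
```

Calling `nwm_adam_step` inside a training loop, with `t` incremented at every iteration, reproduces the behavior of Algorithm 1.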
Next, we give an explanation of the weighting scheme of the gradients in NWM-Adam. Let us revisit Eq. (22); this equation can also be rewritten as follows.\begin{equation*} {\theta _{t}} = {\theta _{t - 1}} - \frac {{{\eta _{t - 1}}}}{{\sqrt {\rm E\left [{ {g^{2}} }\right]} }}{\mathrm{ E}}\left [{ g }\right]\tag{29}\end{equation*}
With regard to the estimation of ${\mathrm{E}}[g^2]$, NWM-Adam assigns larger weights to the past squared gradients than to the recent ones, and the hyper-parameter $\lambda$ controls how quickly the weight of each gradient decays as new gradients arrive.
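As a worked example of this weighting, take $\lambda = 1$, so that ${\beta _{2,t}} = (t - 1)/t$; unrolling Eq. (27) shows that every past squared gradient then receives the same weight $1/t$:\begin{align*} v_{t} = \sum \limits _{p = 1}^{t} (1 - {\beta _{2,p}})\prod \limits _{q = p + 1}^{t} {\beta _{2,q}}\, g_{p}^{2} = \sum \limits _{p = 1}^{t} \frac {1}{p} \cdot \frac {p}{t}\, g_{p}^{2} = \frac {1}{t}\sum \limits _{p = 1}^{t} g_{p}^{2}\end{align*} i.e., the second moment estimate becomes the uniform average of all past squared gradients, so early gradients are never forgotten, in line with the idea of placing more memory on the past gradients.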
Convergence Analysis
In this section, we give the convergence analysis of our proposed method. Suppose the unknown sequence of convex objective functions is ${J_{1}}(\theta),{J_{2}}(\theta),\ldots,{J_{T}}(\theta)$. Following the online learning framework, we evaluate the algorithm by its regret, i.e., the accumulated difference between the online decision ${J_{t}}({\theta _{t}})$ and the best fixed decision ${J_{t}}({\theta ^{*}})$ in hindsight:\begin{equation*} R(T) = \sum \nolimits _{t = 1}^{T} {\left [{ {J_{t}({\theta _{t}}) - {J_{t}}({\theta ^{*}})} }\right]}\tag{30}\end{equation*}
The changing schedules of the step size ${\eta _{t}}$ and the decay rates are assumed to satisfy the following conditions for every dimension $i$ and some constant $\delta > 0$:\begin{align*} \eta _{t - 1}^{ - 1}\sqrt {{v_{t - 1,i}}}\le&\eta _{t}^{ - 1}\sqrt {{v_{t,i}}},\quad t \in \{2,3,\ldots,T\} \\ {\delta ^{ - 1}}{\left({\sum \limits _{p = 1}^{t} {g_{p,i}^{2}} }\right)^{0.5}}\le&\eta _{T}^{ - 1}{\left({\sum \limits _{p = 1}^{t} {\prod \limits _{q = 1}^{t - p} {\beta _{2,(t - q + 1)}}(1 - {\beta _{2,p}})g_{p,i}^{2}} }\right)^{0.5}},\quad t \in \left [{ T }\right]\end{align*}
In order to show that our optimization algorithm has a regret bound, we start with the projection step:\begin{align*} {\theta _{t + 1}}=&{\prod _{F,\sqrt {V_{t}} }}({\theta _{t}} - {\eta _{t}}{V_{t}}^{ - 0.5}{m_{t}}) \\=&\mathop {\arg \min }\limits _{\theta \in F} \left \|{ {V_{t}^{0.25}(\theta - ({\theta _{t}} - {\eta _{t}}{V_{t}}^{ - 0.5}{m_{t}}))} }\right \|\tag{31}\end{align*}
Based on the theory in [28], we can obtain the following:\begin{align*}&\hspace {-2pc}{\left \|{ {V_{t}^{0.25}({\theta _{t + 1}} - {\theta ^{*}})} }\right \|^{2}} \\\le&{\left \|{ {V_{t}^{0.25}({\theta _{t}} - {\theta ^{*}})} }\right \|^{2}} + \eta _{t}^{2}{\left \|{ {V_{t}^{ - 0.25}{m_{t}}} }\right \|^{2}} \\&-\, 2{\eta _{t}}\left \langle{ {{\beta _{1,t}}{m_{t - 1}} + (1 - {\beta _{1,t}}){g_{t}},{\theta _{t}} - {\theta ^{*}}} }\right \rangle\tag{32}\end{align*}
Then, we can get the following:\begin{align*}&\hspace {-2pc}\left \langle{ {g_{t},{\theta _{t}} - {\theta ^{*}}} }\right \rangle \\\le&0.5{\eta _{t}}{(1 - {\beta _{1,t}})^{ - 1}}{\left \|{ {V_{t}^{ - 0.25}{m_{t}}} }\right \|^{2}} \\&+\, 0.5\eta _{t}^{ - 1}{(1 - {\beta _{1,t}})^{ - 1}}[{\left \|{ {V_{t}^{0.25}({\theta _{t}} - {\theta ^{*}})} }\right \|^{2}} \\&-\, {\left \|{ {V_{t}^{0.25}({\theta _{t + 1}} - {\theta ^{*}})} }\right \|^{2}}] \\&+\, 0.5{\beta _{1,t}}{(1 - {\beta _{1,t}})^{ - 1}}\eta _{t}^{ - 1}{\left \|{ {V_{t}^{0.25}({\theta _{t}} - {\theta ^{*}})} }\right \|^{2}} \\&+\, 0.5{\beta _{1,t}}{(1 - {\beta _{1,t}})^{ - 1}}{\eta _{t}}{\left \|{ {V_{t}^{ - 0.25}{m_{t - 1}}} }\right \|^{2}}\tag{33}\\&\hspace {-2pc}\sum \limits _{t = 1}^{T} {J_{t}({\theta _{t}}) - } {J_{t}}({\theta ^{*}}) \\\le&\sum \limits _{t = 1}^{T} {\left \langle{ {g_{t},{\theta _{t}} - {\theta ^{*}}} }\right \rangle } \\\le&\sum \limits _{t = 1}^{T} [0.5{\eta _{t}}{(1 - {\beta _{1,t}})^{ - 1}}{\left \|{ {V_{t}^{ - 0.25}{m_{t}}} }\right \|^{2}} \\&+\, 0.5\eta _{t}^{ - 1}{(1 - {\beta _{1,t}})^{ - 1}}[{\left \|{ {V_{t}^{0.25}({\theta _{t}} - {\theta ^{*}})} }\right \|^{2}} \\&-\, {\left \|{ {V_{t}^{0.25}({\theta _{t + 1}} - {\theta ^{*}})} }\right \|^{2}}] \\&+\, 0.5{\beta _{1,t}}{(1 - {\beta _{1,t}})^{ - 1}}\eta _{t}^{ - 1}{\left \|{ {V_{t}^{0.25}({\theta _{t}} - {\theta ^{*}})} }\right \|^{2}} \\&+\, 0.5{\beta _{1,t}}{(1 - {\beta _{1,t}})^{ - 1}}{\eta _{t}}{\left \|{ {V_{t}^{ - 0.25}{m_{t - 1}}} }\right \|^{2}}]\tag{34}\end{align*}
We need some intermediate results for further bounding this inequality.\begin{align*} \sum \limits _{t = 1}^{T} {\eta _{t}{{\left \|{ {V_{t}^{ - 0.25}{m_{t}}} }\right \|}^{2}}}=&\sum \limits _{t = 1}^{T - 1} {\eta _{t}{{\left \|{ {V_{t}^{ - 0.25}{m_{t}}} }\right \|}^{2}}} \\&+\, {\eta _{T}}\sum \limits _{i = 1}^{d} {m_{T,i}^{2}{{({v_{T,i}})}^{ - 0.5}}}\tag{35}\end{align*}
Then, we can get the following inequality.\begin{align*} \sum \limits _{t = 1}^{T} {\eta _{t}{{\left \|{ {V_{t}^{ - 0.25}{m_{t}}} }\right \|}^{2}}}\le&\sum \limits _{t = 1}^{T - 1} {\eta _{t}{{\left \|{ {V_{t}^{ - 0.25}{m_{t}}} }\right \|}^{2}}} + \delta {(1 - {\beta _{1}})^{ - 1}} \\&\,\sum \limits _{i = 1}^{d} {\sum \limits _{j = 1}^{T} {\beta _{1}^{T - j}g_{j,i}^{2}} {{\left({\sum \limits _{j = 1}^{T} {g_{j,i}^{2}} }\right)}^{-0.5}}} \\\le&2\delta {(1 - {\beta _{1}})^{ - 2}}\sum \limits _{i = 1}^{d} {{{\left \|{ {{g_{1:T,i}}} }\right \|}_{2}}}\tag{36}\end{align*}
Next, we can get the following:\begin{align*}&\hspace{-0.6pc}\sum \limits _{t = 1}^{T} {J_{t}({\theta _{t}}) - } {J_{t}}({\theta ^{*}}) \le 2\delta {(1 - {\beta _{1}})^{ - 3}}\sum \limits _{i = 1}^{d} {{{\left \|{ {{g_{1:T,i}}} }\right \|}_{2}}} \\&+\, 0.5\eta _{1}^{ - 1}{(1 - {\beta _{1}})^{ - 1}}\sum \limits _{i = 1}^{d} {\sqrt {{v_{1,i}}} {{({\theta _{1,i}} - \theta _{i}^{*})}^{2}}} \\&+\, 0.5{(1 - {\beta _{1}})^{ - 1}}\sum \limits _{t = 2}^{T} {\sum \limits _{i = 1}^{d} {{{({\theta _{t,i}} - \theta _{i}^{*})}^{2}}} } [\eta _{t}^{ - 1}\sqrt {{v_{t,i}}} \\&-\, \eta _{t - 1}^{ - 1}\sqrt {{v_{t - 1,i}}}] \\&+\, 0.5{(1 - {\beta _{1}})^{ - 1}}\sum \limits _{t = 1}^{T} {\sum \limits _{i = 1}^{d} {{\beta _{1,t}}{{({\theta _{t,i}} - \theta _{i}^{*})}^{2}}\eta _{t}^{ - 1}\sqrt {{v_{t,i}}} } }\tag{37}\end{align*}
When the feasible region $F$ has bounded diameter, i.e., ${\left \|{ {\theta _{m}} - {\theta _{n}}} }\right \|_\infty } \le {D_\infty }$ for all ${\theta _{m}},{\theta _{n}} \in F$, and the step size is set to ${\eta _{t}} = \eta /\sqrt {t}$, we obtain the final regret bound \begin{align*} R(T)\le&2{(1 - {\beta _{1}})^{ - 3}}\delta \sum \limits _{i = 1}^{d} {{{\left \|{ {{g_{1:T,i}}} }\right \|}_{2}}} \\&+\, \frac {1}{2}D_\infty ^{2}\sqrt {T} {\eta ^{ - 1}}{(1 - {\beta _{1}})^{ - 1}}\sum \limits _{i = 1}^{d} {\sqrt {{v_{T,i}}} } \\&+\, \frac {1}{2}D_\infty ^{2}{(1 - {\beta _{1}})^{ - 1}}\sum \limits _{t = 1}^{T} {\sum \limits _{i = 1}^{d} {\eta _{t}^{ - 1}{\beta _{1,t}}\sqrt {{v_{t,i}}} } }\tag{38}\end{align*} When the gradients are bounded and ${\beta _{1,t}}$ decays over time, all terms on the right-hand side grow no faster than $O(\sqrt {T})$, so the average regret $R(T)/T$ vanishes as $T \to \infty$, which establishes the convergence of NWM-Adam.
Experiments
In this section, in order to empirically evaluate the effectiveness of the proposed NWM-Adam algorithm, we test it on three popular machine learning models, i.e., logistic regression, multi-layer fully connected neural networks, and deep convolutional neural networks. These models are trained on the MNIST dataset [29] and the CIFAR-10 dataset [30]. Using these models and datasets, we demonstrate that NWM-Adam can effectively solve practical deep learning problems.
The experiments in this part compare our proposed NWM-Adam algorithm with other popular gradient descent optimization algorithms, i.e., Momentum, Adagrad, RMSprop, Adam, and AMSGrad. The same parameter initialization is used for all optimization algorithms, and the results are reported with the best hyper-parameters for each method. In this section, we first introduce the details of the datasets and then present the experimental results and analyze the performance of our proposed method.
A. Description of the Utilized Datasets
1) MNIST
MNIST is a large database of handwritten digits that is widely used for training and testing in the field of machine learning. The digits were collected from American Census Bureau employees and American high school students. The training set contains 60000 grayscale images and the test set contains 10000 grayscale images. The handwritten digits have been preprocessed, resized, and centered, and the size of each image is fixed at $28\times 28$ pixels.
2) CIFAR-10
CIFAR-10 is a color image dataset whose categories are closer to everyday objects. The dataset was established by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. It contains 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. The dataset consists of 60000 color images of size $32\times 32$, with 6000 images per class; 50000 images are used for training and 10000 for testing.
B. Logistic Regression
For the first experiment, we evaluate our proposed algorithm on a logistic regression problem, which investigates the performance of the algorithm on convex problems. We perform this experiment on the MNIST dataset and compare the results with some popular gradient descent optimization algorithms, where the logistic regression model predicts one of the 10 class labels directly from the 784-dimensional image vectors. The step size in this experiment is decayed over the course of training.
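A minimal PyTorch sketch of this convex benchmark; the batch size, learning rate, number of epochs, and data path are illustrative assumptions rather than the paper's exact settings, and NWM-Adam would be plugged in as a custom optimizer implementing Eqs. (26)–(28).

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Logistic regression on MNIST: a single linear layer over the 784-dim image vectors.
model = nn.Linear(784, 10)
criterion = nn.CrossEntropyLoss()

# Any of the compared optimizers can be substituted here; a custom NWM-Adam
# optimizer following Eqs. (26)-(28) would take the same place.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

train_loader = DataLoader(
    datasets.MNIST("./data", train=True, download=True,
                   transform=transforms.ToTensor()),
    batch_size=128, shuffle=True)

for epoch in range(5):                        # illustrative number of epochs
    for images, labels in train_loader:
        logits = model(images.view(-1, 784))  # flatten 28x28 images
        loss = criterion(logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```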
C. Multi-Layer Fully Connected Neural Networks
Multi-layer fully connected neural networks are powerful models with non-convex objective functions. In this experiment, we use two neural network models to test the effectiveness of our proposed method on the multiclass classification problem of the MNIST dataset.
The first neural network model has one fully connected hidden layer with 100 hidden units. The second neural network model has two fully connected hidden layers, with 1024 hidden units in the first hidden layer and 512 hidden units in the second hidden layer. ReLU is used as the activation function, and softmax is used as the classifier acting on the outputs of the last hidden layer. Furthermore, a constant step size is utilized throughout this experiment.
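A minimal PyTorch sketch of the two fully connected architectures described above; the use of `nn.Sequential` and the omission of an explicit softmax layer (it is folded into the cross-entropy loss) are implementation choices, not details from the paper.

```python
import torch.nn as nn

# Model 1: one hidden layer with 100 units.
mlp_small = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 100), nn.ReLU(),
    nn.Linear(100, 10),   # logits; softmax is applied inside CrossEntropyLoss
)

# Model 2: two hidden layers with 1024 and 512 units.
mlp_large = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 1024), nn.ReLU(),
    nn.Linear(1024, 512), nn.ReLU(),
    nn.Linear(512, 10),
)
```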
The training loss curves of the two fully connected neural network models under different optimization algorithms are shown in Fig. 2 and Fig. 3. For a clearer comparison, we plot the training loss on a base-10 logarithmic scale. Both Fig. 2 and Fig. 3 show that the NWM-Adam algorithm outperforms the other optimization algorithms, i.e., Momentum, Adagrad, RMSprop, Adam, and AMSGrad.
The training loss of fully connected neural networks (100 hidden units) using different optimization algorithms.
The training loss of fully connected neural networks (1024-512 hidden units) using different optimization algorithms.
The classification accuracy over ten rounds is presented in Table 2. From the table, we observe that our method outperforms all the other methods in terms of classification accuracy. Furthermore, the good performance of NWM-Adam is relatively stable.
The good performance of our algorithm comes from the proper utilization of the past gradients. Traditional gradient descent optimization algorithms use a fixed exponential decay rate for the second moment estimate, while our algorithm employs a dynamic exponential decay rate for this estimate. Furthermore, our algorithm can easily adjust the degree to which the past gradients are weighted in the estimation. This is why our algorithm is superior to the others.
D. Deep Convolutional Neural Networks
In the last experiment, we use deep convolutional neural networks (DCNNs) to evaluate our method on the standard CIFAR-10 dataset. The residual architecture (ResNet) is adopted as the DCNN model in this part. In this architecture, the first layer is a convolutional layer, which is followed by a stack of residual blocks and a global average pooling layer before the final fully connected softmax classifier.
In this experiment, we use the 20-layer ResNet and the 110-layer ResNet as the network models, which correspond to the results in Fig. 4 and Fig. 5, respectively.
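A minimal PyTorch sketch of the standard CIFAR-style residual block used to build such networks; the 3x3 kernels, batch normalization, and identity shortcuts follow the commonly used ResNet design for CIFAR-10 and are assumptions rather than details taken from the paper.

```python
import torch.nn as nn
import torch.nn.functional as F

class BasicBlock(nn.Module):
    """Two 3x3 convolutions with an identity (or 1x1 projection) shortcut."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.shortcut = nn.Identity()
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch))

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + self.shortcut(x))

# A 20-layer CIFAR ResNet stacks 3 such blocks per stage with 16, 32, and 64 channels;
# a 110-layer variant stacks 18 blocks per stage.
```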
The training loss curves for this problem are shown in Fig. 4 and Fig. 5. NWM-Adam performs considerably better than the other optimization methods, which is consistent with the conclusions of the fully connected neural network experiment. We also report the classification accuracy over ten rounds for the different optimization methods in Table 3. From the table, we can see that our optimization method helps the neural network further improve its classification accuracy, which demonstrates once again the superiority of the proposed optimization method.
The training loss of convolutional neural networks (20-layer ResNet) using different optimization algorithms.
The training loss of convolutional neural networks (110-layer ResNet) using different optimization algorithms.
Conclusion
In this paper, we have designed NWM-Adam, a more flexible first-order gradient descent optimization algorithm, to resolve the undesirable convergence behavior of optimization methods that use a fixed-size window of past gradients to scale the gradient updates, and to improve upon the performance of Adam and AMSGrad. The idea behind NWM-Adam is to place more memory on the past gradients than on the recent gradients. In addition, our optimization algorithm can easily control the degree to which the past gradients are weighted in the estimation. The experimental results have demonstrated that NWM-Adam performs better than other popular gradient descent optimization algorithms on several convex and non-convex machine learning problems. The optimization idea in this paper can also be applied to other kinds of neural networks, such as memristor-based neural networks [31] and discontinuous neural networks [32]. In future work, we plan to develop an improved method that can implement updates for each weight individually.