Recently, deep learning has become a significant part of information science research [1]–[4], and it has achieved outstanding performance in many fields, such as image classification [5]–[7], action recognition [8], [9], image captioning [10], and target localization [11]–[13]. The performance of a deep neural network is determined mainly by its model structure and the corresponding optimization algorithm. Therefore, a good optimization algorithm can improve the performance of a deep neural network under a fixed network architecture.
At present, first-order optimization algorithms play an important role in deep learning on account of their efficiency and effectiveness in dealing with large-scale optimization problems [14]. Among these algorithms, stochastic gradient descent (SGD) [15]–[19] is the representative method for training deep neural networks: it iteratively moves in the direction of the negative gradient to update the parameters until convergence. Subsequently, several variants of SGD were proposed that automatically adjust the learning rate by taking the square root of some form of average of the squared elements of past gradients. Adagrad [20], the first such variant, works well for sparse data. However, it uses all past gradients, which can cause the learning rate to shrink rapidly. To resolve this problem, algorithms such as Adadelta [21], RMSprop [22], and Adam [23] were proposed, which use an exponential moving average of past squared gradients. Although these optimization methods have achieved good performance in many applications, they can exhibit undesirable convergence behavior. AMSGrad [24] was proposed to guarantee convergence by keeping a long-term memory of past gradients. However, when the new gradients oscillate, AMSGrad may produce an undesirable estimate. The defects of these existing methods motivate us to find a new way to resolve these problems and further improve optimization performance.
In this paper, we present a new first-order gradient descent optimization algorithm with a more flexible weighting mechanism, which resolves the undesirable convergence behavior and improves upon the performance of Adam and AMSGrad. Unlike most existing methods, the proposed weighting mechanism uses a dynamic exponential decay rate for the second moment estimate instead of a preconfigured, fixed one. In addition, our method can easily adjust how much weight the past gradients carry in the estimate. The proposed exponential moving average variant is built on the idea of placing more weight on past gradients than on recent gradients. Our analysis indicates that the method guarantees convergence, and the experimental results show that it further improves performance. In the sequel, our method is referred to as NWM-Adam.
The remainder of this paper is arranged as follows. Section II presents related work, including the most common optimization algorithms. Section III describes the proposed NWM-Adam algorithm. Section IV provides the convergence analysis of our method. Section V describes the experimental design and discusses the results. Finally, Section VI concludes the paper and discusses future work.
Optimization is one of the most important research directions in computational mathematics [25]. In deep learning, the choice of optimization algorithm remains a top priority when building a model: even with identical datasets and model architectures, different optimization algorithms are likely to produce dramatically different training results. In this section, we outline some optimization algorithms that are widely used in deep learning.
The gradient descent method minimizes an objective function J(\theta) by updating the model's parameters \theta \in \mathbb {R}^{d} in the direction of the negative gradient of the objective function, - {\nabla _\theta }J(\theta), with respect to the parameters \theta. The size of the update at each step is determined by the learning rate \eta.
Batch gradient descent, i.e., vanilla gradient descent, uses the whole training dataset to compute the gradient of the objective function with respect to the parameters \theta. However, this method performs only one update per pass over the data and therefore cannot update the parameters online. SGD uses each individual training example to perform a parameter update and can thus be used for online learning. Of course, the frequent updates of SGD may cause heavy fluctuation of the objective function, which can prevent SGD from converging to the exact minimum. Mini-batch gradient descent is the typical method used to train neural networks today, in which an update is performed for every mini-batch of the training dataset.\begin{align*} \theta=&\theta - \eta \cdot {\nabla _\theta }J(\theta) \qquad (\mathrm {Batch}) \tag{1}\\ \theta=&\theta - \eta \cdot {\nabla _\theta }J(\theta;{x^{(i)}};{y^{(i)}}) \qquad (\mathrm {Stochastic}) \tag{2}\\ \theta=&\theta - \eta \cdot {\nabla _\theta }J(\theta;{x^{(i:i + n)}};{y^{(i:i + n)}}) \qquad (\mathrm {Mini-batch})\tag{3}\end{align*}
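To make the three update rules concrete, the following is a minimal Python sketch; the gradient function grad_fn, the data arrays, and the learning rate lr are placeholders for whatever model and dataset are being trained.

```python
def batch_step(theta, grad_fn, X, y, lr):
    # Batch: one update from the gradient over the whole training set (Eq. 1)
    return theta - lr * grad_fn(theta, X, y)

def sgd_step(theta, grad_fn, x_i, y_i, lr):
    # Stochastic: one update per single training example (Eq. 2)
    return theta - lr * grad_fn(theta, x_i, y_i)

def minibatch_step(theta, grad_fn, X_batch, y_batch, lr):
    # Mini-batch: one update per mini-batch of n examples (Eq. 3)
    return theta - lr * grad_fn(theta, X_batch, y_batch)
```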
SGD is prone to oscillations when traversing ravines [18]. Therefore, momentum [26] is employed, which speeds up SGD in the relevant direction and dampens oscillations.\begin{align*} {m_{t}}=&\gamma {m_{t - 1}} + \eta \cdot {\nabla _\theta }J(\theta) \tag{4}\\ \theta=&\theta - {m_{t}}\tag{5}\end{align*}
In addition to the current step, SGD with momentum incorporates the accumulated update of the previous time step, \gamma {m_{t - 1}}, where the coefficient \gamma is usually set to around 0.9. In this algorithm, the direction of the parameter update is determined not only by the current gradient but also by the previous descent direction. When momentum is used with SGD, dimensions whose gradient directions are consistent receive faster updates, while dimensions whose gradient directions vary greatly receive smaller updates. Therefore, this method speeds up convergence and reduces oscillations.
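For illustration, a minimal sketch of one SGD-with-momentum step in Python; the gradient grad is assumed to come from backpropagation, and the sign convention follows Eqs. 4 and 5.

```python
def momentum_step(theta, m, grad, lr=0.01, gamma=0.9):
    """One SGD-with-momentum step (Eqs. 4-5)."""
    m = gamma * m + lr * grad   # decaying accumulation of past update directions
    theta = theta - m           # move against the accumulated direction
    return theta, m
```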
Deep learning models often involve a large number of parameters, and different parameters are updated with different frequencies; ideally, infrequently updated parameters should receive larger update steps and frequently updated parameters smaller ones. Adagrad [20] is an algorithm that achieves this effect. It introduces the second moment.\begin{equation*} {G_{t}} = {\mathrm{diag}}\left({\sum \limits _{i = 1}^{t} {g_{i,1}^{2}},\sum \limits _{i = 1}^{t} {g_{i,2}^{2}}, \cdots,\sum \limits _{i = 1}^{t} {g_{i,d}^{2}} }\right)\tag{6}\end{equation*}
where {G_{t}}\in \mathbb {R}^{d \times d} is a diagonal matrix whose element {G_{t,ii}} is the sum of the squared gradients with respect to {\theta _{i}} from the initial moment up to time t. The update for every parameter {\theta _{i}} at each time step t is:\begin{equation*} {\theta _{t,i}} = {\theta _{t - 1,i}} - \frac {\eta }{{\sqrt {{G_{t - 1,ii}} + \varepsilon } }} \cdot {g_{t - 1,i}}\tag{7}\end{equation*}
where \varepsilon is usually set to {10^{ - 8}} to avoid division by zero, and {g_{t - 1,i}} is the gradient of the objective function with respect to the parameter {\theta _{i}} at time step t - 1:\begin{equation*} {g_{t - 1,i}} = {\nabla _\theta }J({\theta _{i}})\tag{8}\end{equation*}
Adagrad adapts the learning rate to the parameters, which removes the need to tune the learning rate manually. However, the continual accumulation of squared gradients in the denominator causes the learning rate to keep shrinking; eventually it becomes so small that the model can no longer acquire any new knowledge.
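A minimal per-step sketch of the Adagrad update (Eqs. 6 and 7) is shown below; G is stored as a vector holding the diagonal of {G_{t}}, and the placement of \varepsilon outside the square root follows a common implementation choice rather than the letter of Eq. 7.

```python
import numpy as np

def adagrad_step(theta, G, grad, lr=0.01, eps=1e-8):
    """One Adagrad step; G accumulates the per-parameter sum of squared gradients."""
    G = G + grad ** 2
    theta = theta - lr * grad / (np.sqrt(G) + eps)
    return theta, G
```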
Adadelta [21] modifies the calculation of the second moment, which resolves the aggressive decay of Adagrad's learning rate. In Adadelta, the accumulation is implemented as an exponentially decaying average of past squared gradients. The update rule of Adadelta is:\begin{align*} \Delta {\theta _{t - 1}}=&- \frac {{\sqrt {E{{\left [{ {\Delta {\theta ^{2}}} }\right]}_{t - 2}} + \varepsilon } }}{{\sqrt {E{{\left [{ {g^{2}} }\right]}_{t - 1}} + \varepsilon } }}{g_{t - 1}} \tag{9}\\ {\theta _{t}}=&{\theta _{t - 1}} + \Delta {\theta _{t - 1}} \tag{10}\\ E{\left [{ {g^{2}} }\right]_{t - 1}}=&\gamma E{\left [{ {g^{2}} }\right]_{t - 2}} + (1 - \gamma)g_{t - 1}^{2} \tag{11}\\ E{\left [{ {\Delta {\theta ^{2}}} }\right]_{t - 2}}=&\gamma E{\left [{ {\Delta {\theta ^{2}}} }\right]_{t - 3}} + (1 - \gamma)\Delta \theta _{t - 2}^{2}\tag{12}\end{align*}
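A minimal sketch of one Adadelta step (Eqs. 9-12); the two running averages are kept as arrays, and no global learning rate is required.

```python
import numpy as np

def adadelta_step(theta, E_g2, E_dx2, grad, gamma=0.9, eps=1e-6):
    """One Adadelta step; E_g2 and E_dx2 are decaying averages of g^2 and delta_theta^2."""
    E_g2 = gamma * E_g2 + (1 - gamma) * grad ** 2                # Eq. 11
    delta = -np.sqrt(E_dx2 + eps) / np.sqrt(E_g2 + eps) * grad   # Eq. 9
    theta = theta + delta                                        # Eq. 10
    E_dx2 = gamma * E_dx2 + (1 - gamma) * delta ** 2             # Eq. 12
    return theta, E_g2, E_dx2
```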
RMSprop [22] also confines the window of accumulated past squared gradients to some fixed size instead of accumulating all past squared gradients, which is similar to Adadelta. This algorithm divides the learning rate by an exponentially decaying average of past squared gradients.\begin{align*} E{\left [{ {g^{2}} }\right]_{t - 1}}=&0.9E{\left [{ {g^{2}} }\right]_{t - 2}} + 0.1g_{t - 1}^{2} \tag{13}\\ {\theta _{t}}=&{\theta _{t - 1}} - \frac {\eta }{{\sqrt {E{{\left [{ {g^{2}} }\right]}_{t - 1}} + \varepsilon } }}{g_{t - 1}}\tag{14}\end{align*}
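Correspondingly, a minimal sketch of one RMSprop step with the decay rate fixed at 0.9 as in Eq. 13:

```python
import numpy as np

def rmsprop_step(theta, E_g2, grad, lr=0.001, eps=1e-8):
    """One RMSprop step (Eqs. 13-14)."""
    E_g2 = 0.9 * E_g2 + 0.1 * grad ** 2
    theta = theta - lr * grad / np.sqrt(E_g2 + eps)
    return theta, E_g2
```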
Adam [23] can be considered a combination of RMSprop and momentum. In addition to using an exponentially decaying average of past squared gradients like RMSprop, Adam also applies an exponentially decaying average of past gradients like momentum.\begin{align*} {m_{t}}=&{\beta _{1}}{m_{t - 1}} + (1 - {\beta _{1}}){g_{t}} \tag{15}\\ {v_{t}}=&{\beta _{2}}{v_{t - 1}} + (1 - {\beta _{2}})g_{t}^{2}\tag{16}\end{align*}
where {\beta _{1}} and {\beta _{2}} are usually set to 0.9 and 0.999, respectively. During the initial stage, {m_{t}} and {v_{t}} are easily biased towards their starting values. Therefore, bias corrections are computed for the first and second moments.\begin{align*} {\hat m_{t}}=&\frac {m_{t}}{1 - \beta _{1}^{t}} \tag{17}\\ {\hat v_{t}}=&\frac {v_{t}}{1 - \beta _{2}^{t}}\tag{18}\end{align*}
Finally, the update rule of Adam is:\begin{equation*} {\theta _{t + 1}} = {\theta _{t}} - \frac {\eta }{{\sqrt {\hat v_{t}} + \varepsilon }}{\hat m_{t}}\tag{19}\end{equation*}
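Putting Eqs. 15-19 together, one Adam step can be sketched as follows, where t is the 1-based iteration counter needed for the bias correction:

```python
import numpy as np

def adam_step(theta, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step with bias-corrected first and second moment estimates."""
    m = beta1 * m + (1 - beta1) * grad                    # Eq. 15
    v = beta2 * v + (1 - beta2) * grad ** 2               # Eq. 16
    m_hat = m / (1 - beta1 ** t)                          # Eq. 17
    v_hat = v / (1 - beta2 ** t)                          # Eq. 18
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)   # Eq. 19
    return theta, m, v
```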
Zhang et al. [27] designed an adaptive exponential decay rate {\beta _{t}} for Adam, i.e., AEDR-Adam, in which the exponential decay rate for the first moment estimate and the second raw moment estimate increases when a large step is taken and decreases when a small step is taken.\begin{align*} {\beta _{t}}=&1 - \frac {1}{\iota _{t}} \tag{20}\\ {\iota _{t}}=&\left({1 - \frac {{{m_{t - 1}}}}{{\sqrt {{v_{t - 1}}} }}}\right){\iota _{t - 1}} + 1\tag{21}\end{align*}
Reddi et al. [24] pointed out a problem in the proof of convergence of the Adam algorithm and gave an example of a convex optimization problem on which Adam fails to converge to the optimal solution. To resolve this issue, they proposed to modify Adam by employing the idea of “long-term memory” of past gradients; the modified algorithm is termed AMSGrad. The main difference between AMSGrad and Adam is that AMSGrad applies a “max” operation to all {v_{t}} up to the present time step and uses this maximum value, rather than the original {v_{t}} of Adam, to normalize the running average of the gradient.
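In code, the difference amounts to one extra line that maintains the running maximum; a minimal sketch (omitting the optional bias correction) might look as follows.

```python
import numpy as np

def amsgrad_step(theta, m, v, v_max, grad, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad step; v_max is the element-wise maximum of all past v_t."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    v_max = np.maximum(v_max, v)                   # "long-term memory" of past squared gradients
    theta = theta - lr * m / (np.sqrt(v_max) + eps)
    return theta, m, v, v_max
```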
In this section, we describe the proposed NWM-Adam algorithm. NWM-Adam uses a dynamic exponential decay rate for the second moment estimate instead of a preconfigured, fixed one. First, we provide a universal update rule that covers many gradient descent optimization algorithms.\begin{align*} {\theta _{t}}=&{\theta _{t - 1}} - \frac {{{\eta _{t - 1}}}}{{\sqrt {{V_{t - 1}}} }}{m_{t - 1}} \tag{22}\\ {m_{t - 1}}=&{\phi _{t - 1}}(\nabla {J_{1}}({\theta _{1}}),\ldots,\nabla {J_{t - 1}}({\theta _{t - 1}})) \tag{23}\\ {V_{t - 1}}=&{\psi _{t - 1}}(\nabla {J_{1}}({\theta _{1}}),\ldots,\nabla {J_{t - 1}}({\theta _{t - 1}}))\tag{24}\end{align*}
where {\phi _{t - 1}} and {\psi _{t - 1}} denote the averaging functions, {\eta _{t - 1}} is the step size, and \frac {{{\eta _{t - 1}}}}{{\sqrt {{V_{t - 1}}} }} is the learning rate.
Reddi et al. [24] proved that the Adam algorithm can fail to converge to an optimal solution even in simple one-dimensional convex settings, which refutes the convergence result for Adam given in [23]. In fact, this fundamental flaw also exists in other widely used exponential moving average methods, e.g., RMSprop and Adadelta. The main problem is related to the following quantity.\begin{equation*} {\Gamma _{t}} = \frac {{\sqrt {V_{t}} }}{\eta _{t}} - \frac {{\sqrt {{V_{t - 1}}} }}{{{\eta _{t - 1}}}}\tag{25}\end{equation*}
This quantity describes how the inverse of the learning rate changes from one iteration to the next. The update rules of SGD and Adagrad yield non-increasing learning rates; from their update rules in the previous section, we can observe that {\Gamma _{t}}\succeq 0 for all t \in \left [{ T }\right]. However, the positive semi-definiteness of {\Gamma _{t}} cannot be guaranteed for algorithms that employ an exponential moving average to estimate the squared gradient, e.g., RMSprop, Adadelta, and Adam. This problem can result in non-convergence.
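The effect is easy to reproduce numerically. In the toy sequence below (one large gradient followed by small ones, a constant step size, and values chosen purely for illustration), {\Gamma _{t}} computed as in Eq. 25 becomes negative from the second step on, i.e., the effective learning rate grows again.

```python
import numpy as np

beta2, eta = 0.999, 0.01
grads = [10.0] + [0.1] * 5        # one large gradient, then small ones
v, sqrt_v_prev = 0.0, 0.0
for t, g in enumerate(grads, start=1):
    v = beta2 * v + (1 - beta2) * g ** 2              # exponential moving average of g^2
    gamma_t = np.sqrt(v) / eta - sqrt_v_prev / eta    # Eq. 25 with a constant step size
    print(f"t={t}, Gamma_t={gamma_t:+.4f}")           # negative for t >= 2
    sqrt_v_prev = np.sqrt(v)
```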
To resolve the aforementioned convergence problem, Reddi et al. [24] proposed to incorporate long-term memory of past gradients into the original Adam algorithm. In this paper, the proposed exponential moving average variant NWM-Adam is developed on the basis of the same idea: it keeps more memory of the past gradients than of the recent gradients, i.e., it places larger weights on the past gradients than on the recent ones. The proposed NWM-Adam makes the quantity {\Gamma _{t}} positive semi-definite and guarantees convergence, and at the same time it further improves on the performance of Adam and AMSGrad.
The following presents the details of the proposed NWM-Adam. Eqs. 22, 23, and 24 form the main component of the algorithm. Suppose {m_{t}} is the first moment estimate of the gradient and {v_{t}} is the second moment estimate of the gradient. These two estimates are computed as follows.\begin{align*} {m_{t}}=&{\beta _{1}}{m_{t - 1}} + (1 - {\beta _{1}}){g_{t}} \tag{26}\\ {v_{t}}=&{\beta _{2,t}}{v_{t - 1}} + (1 - {\beta _{2,t}})g_{t}^{2}\tag{27}\end{align*}
where {g_{t}} denotes the gradient of the objective function J, for which cross entropy is used in this work. In this algorithm, {\beta _{2,t}} changes with the time step t and is designed as follows.\begin{equation*} {\beta _{2,t}} = \frac {{({t^\lambda } - 1)}}{t^\lambda }\tag{28}\end{equation*}
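As a quick check on how this schedule weights history (an illustrative special case, not a setting prescribed by the algorithm), taking \lambda = 1 gives {\beta _{2,t}} = (t-1)/t, and unrolling Eq. 27 with {v_{0}} = 0 shows that {v_{t}} reduces to the uniform average of all past squared gradients; larger values of \lambda push {\beta _{2,t}} toward 1 more quickly and weight the early gradients even more heavily.\begin{equation*} v_{t} = \frac {t - 1}{t}v_{t - 1} + \frac {1}{t}g_{t}^{2} = \frac {1}{t}\sum \limits _{p = 1}^{t} g_{p}^{2} \qquad (\lambda = 1)\end{equation*}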
Therefore, NWM-Adam can easily adjust how much weight the past gradients carry in the estimate. The key difference of NWM-Adam from Adam is that it employs a time-varying {\beta _{2,t}} to control the exponential decay rate of the moving average of the squared gradient, instead of the constant {\beta _{2}} used in Adam. In this way, the proposed NWM-Adam algorithm ensures that {\Gamma _{t}} is positive semi-definite, which resolves the convergence issue of Adam. The pseudo-code of the proposed NWM-Adam is presented in Algorithm 1.
Algorithm 1 NWM-Adam, Our Proposed Algorithm for Stochastic Optimization. {g_{t}} Denotes the Gradient and g_{t}^{2} Represents the Element-Wise Square {g_{t}} \odot{g_{t}}. \varepsilon is Set to {10^{ - 8}} to Avoid a Zero Denominator.
{\beta _{1}}: hyper-parameter controlling the exponential decay rate of the moving average of the gradient {m_{t}}
{\beta _{2,t}}: hyper-parameter controlling the exponential decay rate of the moving average of the squared gradient {v_{t}}
{m_{0}} \leftarrow 0: initialize the {1^{{\mathrm{st}}}} moment vector
{v_{0}} \leftarrow 0: initialize the {2^{{\mathrm{nd}}}} moment vector
For t \in \left [{ {1,2,\ldots,T} }\right]:
Get the gradient of the objective function at time step t: {g_{t}} = {\nabla _\theta }{J_{t}}({\theta _{t}})
Get the hyper-parameter {\beta _{2,t}}: {\beta _{2,t}} = ({t^\lambda } - 1)/{t^\lambda }
Update the first moment estimate: {m_{t}} \leftarrow {\beta _{1}}{m_{t - 1}} + (1 - {\beta _{1}}){g_{t}}
Update the second raw moment estimate: {v_{t}} \leftarrow {\beta _{2,t}}{v_{t - 1}} + (1 - {\beta _{2,t}})g_{t}^{2}
Update the parameters: {\theta _{t + 1}} \leftarrow {\theta _{t}} - \eta {m_{t}}/(\sqrt {v_{t}} + \varepsilon)
Until the stopping criterion is met
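To make Algorithm 1 concrete, the following is an illustrative NumPy reading of the pseudo-code (a sketch, not the authors' reference implementation); the objective grad_fn, the choice of \lambda, and the fixed number of steps standing in for the stopping criterion are all placeholders.

```python
import numpy as np

def nwm_adam(grad_fn, theta0, eta=0.001, beta1=0.9, lam=2.0, eps=1e-8, num_steps=1000):
    """Illustrative sketch of Algorithm 1 (NWM-Adam)."""
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)                         # 1st moment vector
    v = np.zeros_like(theta)                         # 2nd moment vector
    for t in range(1, num_steps + 1):                # stand-in for the stopping criterion
        g = grad_fn(theta)                           # gradient of the objective at theta_t
        beta2_t = (t ** lam - 1.0) / (t ** lam)      # Eq. 28: dynamic decay rate
        m = beta1 * m + (1 - beta1) * g              # first moment estimate
        v = beta2_t * v + (1 - beta2_t) * g ** 2     # second raw moment estimate
        theta = theta - eta * m / (np.sqrt(v) + eps)
    return theta

# Toy usage: minimize J(theta) = 0.5 * ||theta||^2, whose gradient is theta itself.
theta_min = nwm_adam(lambda th: th, theta0=np.ones(3), num_steps=2000)
```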
Next, we give an explanation of the weighting scheme of the gradients in NWM-Adam. Revisiting Eq. 22, the update can also be rewritten as follows.\begin{equation*} {\theta _{t}} = {\theta _{t - 1}} - \frac {{{\eta _{t - 1}}}}{{\sqrt {\rm E\left [{ {g^{2}} }\right]} }}{\mathrm{ E}}\left [{ g }\right]\tag{29}\end{equation*}
With regard to the estimation of {\mathrm{ E}}\left [{ {g^{2}} }\right], methods based on an exponential moving average may produce an unstable estimate when the gradient magnitude varies greatly from batch to batch. The weighting strategy in NWM-Adam provides a more stable estimate than AMSGrad and Adam; that is, the weighting scheme of the proposed NWM-Adam algorithm stabilizes the estimate when the new gradients oscillate.
IV. Convergence Analysis
In this section, we give the convergence analysis of the proposed method. Consider an unknown sequence of convex objective functions {J_{1}}(\theta), {J_{2}}(\theta), \dots, {J_{T}}(\theta). At each time step t, our goal is to predict the parameter {\theta _{t}} and evaluate it on the objective function {J_{t}}. Because the sequence is unknown in advance, we evaluate our method using the regret, i.e., the sum over all previous steps of the difference between the prediction {J_{t}}({\theta _{t}}) and the best fixed value {J_{t}}({\theta ^{*}}). The regret is defined as follows.\begin{equation*} R(T) = \sum \nolimits _{t = 1}^{T} {\left [{ {J_{t}({\theta _{t}}) - {J_{t}}({\theta ^{*}})} }\right]}\tag{30}\end{equation*}
where {\theta ^{*}} = \arg \min \sum \nolimits _{t = 1}^{T} {J_{t}} (\theta). Assume that {\left \|{ {\nabla {J_{t}}(\theta)} }\right \|_\infty } \le {G_\infty } and {\left \|{ {\theta _{n} - {\theta _{m}}} }\right \|_\infty } \le {D_\infty } for all t \in \left [{ T }\right] and all m,n \in \{1,2,\ldots,T\}. At the same time, let {\eta _{t}} = \eta /\sqrt {t} and {\beta _{1,t}} \le {\beta _{1}} for all t \in \left [{ T }\right]. We use {g_{1:t,i}} = \left [{ {{g_{1,i}},{g_{2,i}},\ldots,{g_{t,i}}} }\right] to denote the vector formed by the {i^{th}} dimension of the gradient sequence over all steps up to time step t.
The changing schedule of {\beta _{2,t}} satisfies the following two conditions:
1) \eta _{t - 1}^{ - 1}\sqrt {{v_{t - 1,i}}} \le \eta _{t}^{ - 1}\sqrt {{v_{t,i}}} for t \in \{2,3,\ldots,T\};
2) {\delta ^{ - 1}}{\left({\sum \limits _{p = 1}^{t} {g_{p,i}^{2}} }\right)^{0.5}} \le \eta _{T}^{ - 1}{\left({\sum \limits _{p = 1}^{t} {\prod \limits _{q = 1}^{t - p} {\beta _{2,(t - q + 1)}}(1 - {\beta _{2,p}})g_{p,i}^{2}} }\right)^{0.5}} for t \in \left [{ T }\right].
To show that our optimization algorithm has a regret bound, we start with the following:\begin{align*} {\theta _{t + 1}}=&{\prod _{F,\sqrt {V_{t}} }}({\theta _{t}} - {\eta _{t}}{V_{t}}^{ - 0.5}{m_{t}}) \\=&\mathop {\arg \min }\limits _{\theta \in F} \left \|{ {V_{t}^{0.25}(\theta - ({\theta _{t}} - {\eta _{t}}{V_{t}}^{ - 0.5}{m_{t}}))} }\right \|\tag{31}\end{align*}
Based on the theory in [28], we can obtain the following:\begin{align*}&\hspace {-2pc}{\left \|{ {V_{t}^{0.25}({\theta _{t + 1}} - {\theta ^{*}})} }\right \|^{2}} \\\le&{\left \|{ {V_{t}^{0.25}({\theta _{t}} - {\theta ^{*}})} }\right \|^{2}} + \eta _{t}^{2}{\left \|{ {V_{t}^{ - 0.25}{m_{t}}} }\right \|^{2}} \\&-\, 2{\eta _{t}}\left \langle{ {{\beta _{1,t}}{m_{t - 1}} + (1 - {\beta _{1,t}}){g_{t}},{\theta _{t}} - {\theta ^{*}}} }\right \rangle\tag{32}\end{align*}
Then, we can get the following:\begin{align*}&\hspace {-2pc}\left \langle{ {g_{t},{\theta _{t}} - {\theta ^{*}}} }\right \rangle \\\le&0.5{\eta _{t}}{(1 - {\beta _{1,t}})^{ - 1}}{\left \|{ {V_{t}^{ - 0.25}{m_{t}}} }\right \|^{2}} \\&+\, 0.5\eta _{t}^{ - 1}{(1 - {\beta _{1,t}})^{ - 1}}[{\left \|{ {V_{t}^{0.25}({\theta _{t}} - {\theta ^{*}})} }\right \|^{2}} \\&-\, {\left \|{ {V_{t}^{0.25}({\theta _{t + 1}} - {\theta ^{*}})} }\right \|^{2}}] \\&+\, 0.5{\beta _{1,t}}{(1 - {\beta _{1,t}})^{ - 1}}\eta _{t}^{ - 1}{\left \|{ {V_{t}^{0.25}({\theta _{t}} - {\theta ^{*}})} }\right \|^{2}} \\&+\, 0.5{\beta _{1,t}}{(1 - {\beta _{1,t}})^{ - 1}}{\eta _{t}}{\left \|{ {V_{t}^{ - 0.25}{m_{t - 1}}} }\right \|^{2}}\tag{33}\\&\hspace {-2pc}\sum \limits _{t = 1}^{T} {J_{t}({\theta _{t}}) - } {J_{t}}({\theta ^{*}}) \\\le&\sum \limits _{t = 1}^{T} {\left \langle{ {g_{t},{\theta _{t}} - {\theta ^{*}}} }\right \rangle } \\\le&\sum \limits _{t = 1}^{T} [0.5{\eta _{t}}{(1 - {\beta _{1,t}})^{ - 1}}{\left \|{ {V_{t}^{ - 0.25}{m_{t}}} }\right \|^{2}} \\&+\, 0.5\eta _{t}^{ - 1}{(1 - {\beta _{1,t}})^{ - 1}}[{\left \|{ {V_{t}^{0.25}({\theta _{t}} - {\theta ^{*}})} }\right \|^{2}} \\&-\, {\left \|{ {V_{t}^{0.25}({\theta _{t + 1}} - {\theta ^{*}})} }\right \|^{2}}] \\&+\, 0.5{\beta _{1,t}}{(1 - {\beta _{1,t}})^{ - 1}}\eta _{t}^{ - 1}{\left \|{ {V_{t}^{0.25}({\theta _{t}} - {\theta ^{*}})} }\right \|^{2}} \\&+\, 0.5{\beta _{1,t}}{(1 - {\beta _{1,t}})^{ - 1}}{\eta _{t}}{\left \|{ {V_{t}^{ - 0.25}{m_{t - 1}}} }\right \|^{2}}]\tag{34}\end{align*}
We need some intermediate results for further bounding this inequality.\begin{align*} \sum \limits _{t = 1}^{T} {\eta _{t}{{\left \|{ {V_{t}^{ - 0.25}{m_{t}}} }\right \|}^{2}}}=&\sum \limits _{t = 1}^{T - 1} {\eta _{t}{{\left \|{ {V_{t}^{ - 0.25}{m_{t}}} }\right \|}^{2}}} \\&+\, {\eta _{T}}\sum \limits _{i = 1}^{d} {m_{T,i}^{2}{{({v_{T,i}})}^{ - 0.5}}}\tag{35}\end{align*}
Then, we can get the following inequality.\begin{align*} \sum \limits _{t = 1}^{T} {\eta _{t}{{\left \|{ {V_{t}^{ - 0.25}{m_{t}}} }\right \|}^{2}}}\le&\sum \limits _{t = 1}^{T - 1} {\eta _{t}{{\left \|{ {V_{t}^{ - 0.25}{m_{t}}} }\right \|}^{2}}} + \delta {(1 - {\beta _{1}})^{ - 1}} \\&\,\sum \limits _{i = 1}^{d} {\sum \limits _{j = 1}^{T} {\beta _{1}^{T - j}g_{j,i}^{2}} {{\left({\sum \limits _{j = 1}^{T} {g_{j,i}^{2}} }\right)}^{-0.5}}} \\\le&2\delta {(1 - {\beta _{1}})^{ - 2}}\sum \limits _{i = 1}^{d} {{{\left \|{ {{g_{1:T,i}}} }\right \|}_{2}}}\tag{36}\end{align*}
Next, we can get the following:\begin{align*}&\hspace{-0.6pc}\sum \limits _{t = 1}^{T} {J_{t}({\theta _{t}}) - } {J_{t}}({\theta ^{*}}) \le 2\delta {(1 - {\beta _{1}})^{ - 3}}\sum \limits _{i = 1}^{d} {{{\left \|{ {{g_{1:T,i}}} }\right \|}_{2}}} \\&+\, 0.5\eta _{1}^{ - 1}{(1 - {\beta _{1}})^{ - 1}}\sum \limits _{i = 1}^{d} {\sqrt {{v_{1,i}}} {{({\theta _{1,i}} - \theta _{i}^{*})}^{2}}} \\&+\, 0.5{(1 - {\beta _{1}})^{ - 1}}\sum \limits _{t = 2}^{T} {\sum \limits _{i = 1}^{d} {{{({\theta _{t,i}} - \theta _{i}^{*})}^{2}}} } [\eta _{t}^{ - 1}\sqrt {{v_{t,i}}} \\&-\, \eta _{t - 1}^{ - 1}\sqrt {{v_{t - 1,i}}}] \\&+\, 0.5{(1 - {\beta _{1}})^{ - 1}}\sum \limits _{t = 1}^{T} {\sum \limits _{i = 1}^{d} {{\beta _{1,t}}{{({\theta _{t,i}} - \theta _{i}^{*})}^{2}}\eta _{t}^{ - 1}\sqrt {{v_{t,i}}} } }\tag{37}\end{align*}
When \eta _{t}^{ - 1}\sqrt {{v_{t,i}}} - \eta _{t - 1}^{ - 1}\sqrt {{v_{t - 1,i}}} is non-negative for every coordinate i, we obtain the following regret bound for the sequence {\theta _{t}} generated by the proposed NWM-Adam algorithm.\begin{align*} R(T)\le&2{(1 - {\beta _{1}})^{ - 3}}\delta \sum \limits _{i = 1}^{d} {{{\left \|{ {{g_{1:T,i}}} }\right \|}_{2}}} \\&+\, \frac {1}{2}D_\infty ^{2}\sqrt {T} {\eta ^{ - 1}}{(1 - {\beta _{1}})^{ - 1}}\sum \limits _{i = 1}^{d} {\sqrt {{v_{T,i}}} } \\&+\, \frac {1}{2}D_\infty ^{2}{(1 - {\beta _{1}})^{ - 1}}\sum \limits _{t = 1}^{T} {\sum \limits _{i = 1}^{d} {\eta _{t}^{ - 1}{\beta _{1,t}}\sqrt {{v_{t,i}}} } }\tag{38}\end{align*}
In this section, to empirically evaluate the effectiveness of the proposed NWM-Adam algorithm, we test it on three different popular machine learning models, i.e., logistic regression, multi-layer fully connected neural networks, and deep convolutional neural networks. These models are evaluated on the MNIST dataset [29] and the CIFAR-10 dataset [30]. With these models and datasets, we demonstrate that NWM-Adam can effectively handle practical deep learning problems.
The experiments in this part compare the proposed NWM-Adam algorithm with other popular gradient descent optimization algorithms, i.e., Momentum, Adagrad, RMSprop, Adam, and AMSGrad. The same parameter initialization is used for all optimization algorithms, and the results are reported with the best hyper-parameters. In this section, we first introduce the details of the datasets used; we then present the experimental results and analyze the performance of the proposed method.
A. Description of the Utilized Datasets
1) MNIST
MNIST is a large database of handwritten digits that is widely used for training and testing in the field of machine learning. The digits were collected from American Census Bureau employees and American high school students. The training set contains 60000 grayscale images and the test set contains 10000 grayscale images. The handwritten digits have been preprocessed, resized, and centered, and the image size is fixed at 28 \times 28. The images are divided into 10 classes, corresponding to the digits 0 to 9.
2) CIFAR-10
CIFAR-10 is a color image dataset whose content is closer to everyday objects. The dataset was created by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. It contains 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. The dataset has 50000 training images and 10000 test images, each of size 32 \times 32. Compared with handwritten digits, CIFAR-10 contains real objects from the real world, which are not only noisy but also vary in scale and appearance, making recognition considerably more difficult.
B. Logistic Regression
For the first experiment, we evaluate the proposed algorithm on a logistic regression problem, which allows us to investigate its performance on convex problems. We perform this experiment on the MNIST dataset and compare the results with popular gradient descent optimization algorithms, where the logistic regression model predicts one of the 10 class labels directly from the 784-dimensional image vectors. The step size in this experiment is decayed as 1/\sqrt {t}. The minibatch size is set to 128. We report the training loss with respect to epochs in Fig. 1. From Fig. 1, we observe that the proposed NWM-Adam algorithm yields convergence similar to that of AMSGrad, and both perform better than Adam, RMSprop, Adagrad, and Momentum. We also report the classification accuracy over ten rounds in Table 1, from which we can see that NWM-Adam is slightly better than AMSGrad.
C. Multi-Layer Fully Connected Neural Networks
Multi-layer fully connected neural networks are powerful models with non-convex objective functions. In this experiment, we use two neural network models to test the effectiveness of the proposed method on the multiclass classification problem on the MNIST dataset.
The first neural network model has one fully connected hidden layer with 100 hidden units. The second neural network model has two fully connected hidden layers, with 1024 hidden units in the first hidden layer and 512 hidden units in the second. ReLU is used as the activation function, and a softmax classifier acts on the outputs of the last hidden layer. Furthermore, a constant step size is used throughout this experiment.
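The paper does not tie these experiments to a particular deep learning framework; as one possible realization, a hypothetical PyTorch sketch of the second (two-hidden-layer) network is shown below, with the softmax left to the cross entropy loss as is customary.

```python
import torch.nn as nn

# Hypothetical sketch of the two-hidden-layer MNIST model (1024 and 512 ReLU units).
model = nn.Sequential(
    nn.Flatten(),               # 28x28 grayscale image -> 784-dimensional vector
    nn.Linear(28 * 28, 1024),   # first fully connected hidden layer
    nn.ReLU(),
    nn.Linear(1024, 512),       # second fully connected hidden layer
    nn.ReLU(),
    nn.Linear(512, 10),         # 10 class scores; softmax is applied inside the loss
)
```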
The training loss curves of these two fully connected neural network models under different optimization algorithms are shown in Fig. 2 and Fig. 3. For a clearer comparison, we plot the training loss on a base-10 logarithmic scale. Both Fig. 2 and Fig. 3 show that the NWM-Adam algorithm outperforms the other optimization algorithms, i.e., Momentum, Adagrad, RMSprop, Adam, and AMSGrad.
The classification accuracy over ten rounds is presented in Table 2. From the table, we observe that our method outperforms all the other methods in terms of classification accuracy. Furthermore, the good performance of NWM-Adam is relatively stable.
The good performance of our algorithm comes from its proper utilization of past gradients. Traditional gradient descent optimization algorithms use a fixed exponential decay rate for the second moment estimate, whereas our algorithm employs a dynamic one. Furthermore, our algorithm can easily adjust how much weight the past gradients carry in the estimate. This is the reason why our algorithm is superior to the others.
D. Deep Convolutional Neural Networks
In the last experiment, we use deep convolutional neural networks (DCNNs) to evaluate our method on the standard CIFAR-10 dataset. A residual architecture is used as the DCNN model in this part. In this architecture, the first layer is a 3 \times 3 convolution, followed by a stack of residual blocks comprising 3m convolutional layers in total, with an equal number of layers for each feature map size. Finally, the architecture ends with global average pooling, a 10-way fully connected layer, and softmax.
In this experiment, we use the 20-layer ResNet (m = 6) and the 110-layer ResNet (m = 36). As in the previous experiments, the minibatch size is set to 128. Furthermore, we employ a constant step size throughout this experiment.
The training loss curves for this problem are shown in Fig. 4 and Fig. 5. NWM-Adam performs considerably better than the other optimization methods, which is in line with the conclusion of the multi-layer fully connected neural network experiment. We also provide the classification accuracy over ten rounds for the different optimization methods in Table 3. From the table, we can see that our optimization method helps the neural network further improve its classification accuracy, which once again demonstrates the superiority of the proposed optimization method.
In this paper, NWM-Adam has been designed as a more flexible first-order gradient descent optimization algorithm that resolves the undesirable convergence behavior of optimization algorithms which use a fixed-size window of past gradients to scale the gradient updates, and that improves upon the performance of Adam and AMSGrad. The idea behind NWM-Adam is to place more memory on the past gradients than on the recent gradients. In addition, the algorithm can easily control how much weight the past gradients carry in the estimate. The experimental results have demonstrated that NWM-Adam performs better than other popular gradient descent optimization algorithms on several convex and non-convex machine learning problems. The optimization idea in this paper can also be adopted for other kinds of neural networks, such as memristor-based neural networks [31] and discontinuous neural networks [32]. In future work, we plan to develop an improved method that implements updates tailored to each individual weight.