The results in [1] indicate that singular regions exist in the parameter spaces of almost all learning machines; these singularities are the subspaces where the Fisher information matrix degenerates [2], [3]. As a typical type of learning machine, feedforward neural networks have been widely used in many fields [4]–[7]. Due to the existence of the singularities, the learning dynamics of neural networks often exhibit strange behaviors. For example, the learning process may become very slow and the plateau phenomenon often occurs (an example is shown in Fig. 1) [8]. Also because of the singularities, the standard statistical paradigm of the Cramér-Rao theorem does not hold [9], [10], and classical model selection criteria, such as the Akaike information criterion (AIC), the Bayesian information criterion (BIC) and the minimum description length (MDL), often fail to determine an appropriate network structure [11].
Many researchers have investigated the learning dynamics near the singularities of feedforward neural networks, such as multilayer perceptrons (MLPs) [12]–[15], radial basis function (RBF) networks [16], [17], Gaussian mixtures [18], etc., and plenty of results have been obtained. These theoretical results all indicate that the singularities seriously affect the learning dynamics of feedforward neural networks and deepen our understanding of the essence of the singularities.
In recent years, deep learning has become a very hot topic in the machine learning community [19]. Deep neural networks are designed based on traditional neural networks [20]–[22]. Due to the much larger number of hidden layers and the larger architecture size, training deep neural networks also faces many challenges [23], [24]. Reference [25] investigates deep linear neural networks and finds that the error does not change under a scaling transformation, which causes a training difficulty called scaling symmetry in [26] and [27]. References [27] and [28] investigate the influence of singularities in deep neural networks. These results all indicate that the singularities also seriously affect the learning dynamics of deep neural networks.
Although many results about the learning dynamics near the singularities have been obtained, the size of the influence area of the singularities is still unknown. Reference [29] gives a general mathematical analysis of the learning dynamics near singularities in layered networks and obtains the common learning trajectories near overlap singularities, which are represented in a very simple form. The theoretical learning trajectories indicate that when the learning process is affected by the overlap singularity, the student parameters always arrive at the singularity precisely, namely, the two units finally overlap exactly. However, for MLPs we find a different situation in practical simulation experiments: the influence area of the overlap singularity is larger than the theoretical analysis suggests.
In this paper, we aim to investigate the influence area of the overlap singularity in MLPs by analyzing the generalization error surface. The remainder of the paper is organized as follows. In Section 2, we give a more detailed introduction to the motivation of this paper. The generalization error surface of the MLPs near the overlap singularity is analyzed in Section 3. Section 4 is devoted to the simulations and Section 5 states conclusions and discussions.
SECTION II.
Learning Paradigm
Here, we first introduce a typical learning paradigm of MLPs using the standard gradient descent algorithm to minimize the mean square error loss function. A typical MLP with a single hidden layer accepts an input vector \boldsymbol {x} and gives a scalar output, i.e.:\begin{equation*} f(\boldsymbol {x},\boldsymbol {\theta })=\sum \limits _{i=1}^{k}{w_{i}\phi (\boldsymbol {x},\boldsymbol {J}_{i})},\tag{1}\end{equation*}
where k denotes the number of hidden units; \boldsymbol {J}_{i}\in \mathcal {R}^{n}, i=1,\cdots,k, denotes the weight from the input layer to the i-th hidden unit and w_{i}\in \mathcal {R} denotes the weight from the i-th hidden unit to the output layer; n denotes the number of input nodes and \phi (\boldsymbol {x},\boldsymbol {J}_{i})=\phi (\boldsymbol {J}_{i}^{T}\boldsymbol {x}) denotes the activation function. \boldsymbol {\theta }=\{\boldsymbol {J}_{1}, \cdots,\boldsymbol {J}_{k},{w}_{1}, \cdots, {w}_{k}\} represents all the parameters of model (1).
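For concreteness, a minimal NumPy sketch of the forward pass of model (1) is given below; the log-sigmoid used here is only a placeholder activation (the analysis later adopts the error function), and all weight values are illustrative.

```python
import numpy as np

def mlp_forward(x, J, w, phi=lambda a: 1.0 / (1.0 + np.exp(-a))):
    """Model (1): f(x, theta) = sum_i w_i * phi(J_i^T x).

    x   : (n,) input vector
    J   : (k, n) input-to-hidden weights, one row per hidden unit
    w   : (k,) hidden-to-output weights
    phi : scalar activation applied elementwise (a log-sigmoid placeholder here)
    """
    return float(np.dot(w, phi(J @ x)))

# Illustrative usage with k = 2 hidden units and n = 3 inputs
J = np.array([[0.5, -0.2, 0.1], [0.3, 0.4, -0.6]])
w = np.array([0.7, -0.9])
x = np.array([1.0, 2.0, -1.0])
print(mlp_forward(x, J, w))
```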
Now we introduce two types of singularities [8], [29]. If two hidden units i and j overlap, i.e. \boldsymbol {J}_{i}=\boldsymbol {J}_{j}, then w_{i}\phi (\boldsymbol {x},\boldsymbol {J}_{i})+w_{j}\phi (\boldsymbol {x},\boldsymbol {J}_{j})=(w_{i}+w_{j})\phi (\boldsymbol {x},\boldsymbol {J}_{i}) remains the same as long as w_{i}+w_{j} takes a fixed value, regardless of the particular values of w_{i} and w_{j}. Therefore, only the sum w=w_{i}+w_{j} is identifiable, while w_{i} and w_{j} themselves remain unidentifiable. Similarly, when w_{i} = 0, w_{i}\phi (\boldsymbol {x},\boldsymbol {J}_{i})=0 whatever value \boldsymbol {J}_{i} takes. So there are mainly two types of singular regions in the parameter space of MLPs with unipolar activation functions [30]:
Overlap singularity:\begin{equation*} \mathcal {R}_{1}=\{\boldsymbol {\theta }|\boldsymbol {J}_{i}=\boldsymbol {J}_{j}\},\tag{2}\end{equation*}
Elimination singularity:\begin{equation*} \mathcal {R}_{2}=\{\boldsymbol {\theta }|w_{i}=0\}.\tag{3}\end{equation*}
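The non-identifiability on these two regions can be checked numerically. The following sketch (with a sigmoid stand-in activation and randomly chosen, purely illustrative weights) shows that on \mathcal {R}_{1} the output depends only on the sum w_{i}+w_{j}, while on \mathcal {R}_{2} it is independent of \boldsymbol {J}_{i}.

```python
import numpy as np

phi = lambda a: 1.0 / (1.0 + np.exp(-a))      # placeholder unipolar activation

def mlp_forward(x, J, w):
    # Model (1) with k hidden units stored as the rows of J
    return float(w @ phi(J @ x))

rng = np.random.default_rng(0)
x = rng.normal(size=3)
Jc = rng.normal(size=3)                        # common weight vector J_i = J_j

# Overlap singularity R_1: only w_i + w_j matters
J = np.stack([Jc, Jc])
print(mlp_forward(x, J, np.array([0.9, 0.1])),
      mlp_forward(x, J, np.array([0.3, 0.7])))     # identical outputs (both sums equal 1.0)

# Elimination singularity R_2: w_i = 0 makes J_i irrelevant
Ja = np.stack([rng.normal(size=3), Jc])
Jb = np.stack([rng.normal(size=3), Jc])
w = np.array([0.0, 0.5])
print(mlp_forward(x, Ja, w), mlp_forward(x, Jb, w))  # identical outputs
```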
In the case of regression, we have a set of observed data (\boldsymbol {x}_{1},y_{1}), \ldots, (\boldsymbol {x}_{t}, y_{t}), which are generated by:\begin{equation*} y=f_{0}(\boldsymbol {x})+\varepsilon,\tag{4}\end{equation*}
where \boldsymbol {x}\in \mathcal {R}^{n}, y\in \mathcal {R}, and f_{0}(\boldsymbol {x}) is an unknown true generating function (called the teacher function). \varepsilon is additive noise, usually subject to a Gaussian distribution with zero mean. f_{0}(\boldsymbol {x}) can be approximated by an MLP, which is the student model with the form of model (1).
Since MLPs have the universal approximation ability, we can also assume that the teacher model is described by an MLP with s hidden units:\begin{equation*} y=f_{0}(\boldsymbol {x})+\varepsilon =f_{0}(\boldsymbol {x},\boldsymbol {\theta }_{0})+\varepsilon =\sum _{i=1}^{s}v_{i}\phi (\boldsymbol {x},\boldsymbol {t}_{i})+\varepsilon,\tag{5}\end{equation*}
where \boldsymbol {t}_{i}\in \mathcal {R}^{n} and v_{i}\in \mathcal {R} denote the weight parameters connected to the i-th hidden unit, and \boldsymbol {\theta }_{0}=(\boldsymbol {t}_{1}, \cdots, \boldsymbol {t}_{s},v_{1},\cdots, v_{s}) is the teacher parameter.
The training input is subject to a Gaussian distribution with zero mean and identity covariance matrix \mathbf {I}_{n}:\begin{equation*} q(\boldsymbol {x})=(\sqrt {2\pi })^{-n}\exp \left ({-\frac {\|\boldsymbol {x}\|^{2}}{2}}\right),\tag{6}\end{equation*}
and the loss function is defined as:\begin{equation*} l(y,\boldsymbol {x},\boldsymbol {\theta })=\frac {1}{2}(y-f(\boldsymbol {x},\boldsymbol {\theta }))^{2}.\tag{7}\end{equation*}
Then, by using the gradient descent method to minimize the above loss, the training process can be carried out and the learning trajectories can be obtained.
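A minimal sketch of this training paradigm is given below: plain batch gradient descent on the loss (7) over data generated by a teacher of the form (5). The activation, teacher parameters, sample size and learning rate are illustrative placeholders, not the settings used in the experiments of Section IV.

```python
import numpy as np

rng = np.random.default_rng(1)
phi = lambda a: np.tanh(a)                    # placeholder activation
dphi = lambda a: 1.0 - np.tanh(a) ** 2

# Teacher of the form (5) with s = 2 hidden units and n = 2 inputs (illustrative values)
t = np.array([[1.0, -0.5], [0.3, 0.8]]); v = np.array([0.6, 0.2])
X = rng.normal(size=(500, 2))                        # inputs ~ N(0, I_n), Eq. (6)
y = phi(X @ t.T) @ v + 0.01 * rng.normal(size=500)   # Eq. (4)/(5) with small noise

# Student of the form (1) with k = 2 hidden units, trained by gradient descent on Eq. (7)
J = rng.normal(size=(2, 2)); w = rng.normal(size=2)
eta = 0.05
for epoch in range(2000):
    A = X @ J.T                          # pre-activations, shape (N, k)
    e = phi(A) @ w - y                   # residuals; the loss (7) averages e**2 / 2
    w -= eta * (phi(A).T @ e) / len(y)                       # dL/dw_i
    J -= eta * ((dphi(A) * w * e[:, None]).T @ X) / len(y)   # dL/dJ_i
print("final loss:", 0.5 * np.mean((phi(X @ J.T) @ w - y) ** 2))
```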
From the results in [29], the theoretical learning trajectories near \mathcal {R}_{1} are:\begin{equation*} h=\frac {2 w^{*}}{3}\log \frac {(z^{2}+3)^{2}}{|z|}+C,\tag{8}\end{equation*}
where C is a constant depending on the initial model parameters (h^{(0)}, z^{(0)}) and \begin{align*} h=&\frac {1}{2}\boldsymbol {u}^{T}\boldsymbol {u},\tag{9}\\ \boldsymbol {u}=&\boldsymbol {J}_{i}-\boldsymbol {J}_{j},\tag{10}\\ z=&\frac {w_{i}-w_{j}}{w_{i}+w_{j}}.\tag{11}\end{align*}
The trajectories are shown in Fig. 2.
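The coordinates (h, z) of Eqs. (9)–(11) and the theoretical trajectory (8) are straightforward to evaluate; a small sketch follows, in which the value of w^{*} and the initial point (h^{(0)}, z^{(0)}) are illustrative.

```python
import numpy as np

def hz(J_i, J_j, w_i, w_j):
    """Map student parameters to the (h, z) coordinates of Eqs. (9)-(11)."""
    u = J_i - J_j                        # Eq. (10)
    h = 0.5 * float(u @ u)               # Eq. (9)
    z = (w_i - w_j) / (w_i + w_j)        # Eq. (11)
    return h, z

def trajectory_h(z, w_star, C):
    """Theoretical learning trajectory near the overlap singularity, Eq. (8)."""
    return (2.0 * w_star / 3.0) * np.log((z ** 2 + 3.0) ** 2 / np.abs(z)) + C

# Fix the constant C from an (illustrative) initial point (h0, z0), then
# evaluate h along the trajectory for a range of z values.
w_star, h0, z0 = 0.8, 1.0, 0.5
C = h0 - (2.0 * w_star / 3.0) * np.log((z0 ** 2 + 3.0) ** 2 / np.abs(z0))
zs = np.linspace(-0.99, 0.99, 9)
zs = zs[np.abs(zs) > 1e-6]               # avoid z = 0, where the log diverges
print(trajectory_h(zs, w_star, C))
```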
For the overlap singularity, \boldsymbol {J}_{1}=\boldsymbol {J}_{2}, we have \boldsymbol {u}=0, namely h=0. For the elimination singularity, w_{1}=0 (or w_{2}=0), we have z=-1 (or z=+1). Thus the line h=0 represents the overlap singularity, and the lines z=\pm 1 represent the elimination singularity. The overlap singularity is partially stable, where the stable area (the thick black area in Fig. 2(a) and Fig. 2(b), respectively) is determined by H(w^{*},\boldsymbol {J}^{*}) = \dfrac {1}{4}w^{*}\left \langle{ e(y,\boldsymbol {x},\boldsymbol {\theta })\frac {\partial ^{2}\phi (\boldsymbol {x},\boldsymbol {J})}{\partial \boldsymbol {J}\partial \boldsymbol {J}^{T}}}\right \rangle \mid _{\boldsymbol {\theta }=\boldsymbol {\theta }^{*}} [29].
From the trajectories in the h\sim z plane shown in Fig. 2, which are obtained by theoretical analysis, for learning processes affected by the overlap singularity the student parameters always arrive at the line h=0, namely the two hidden units finally overlap exactly. However, the practical situation is different: the influence area might be larger than the theoretical analysis suggests. Next we first analyse the generalization error L(\boldsymbol {\theta }).
SECTION III.
Theoretical Analysis of Error Surface Near Overlap Singularity
In this section, we analyse the generalization error of MLPs near the overlap singularity and show that the generalization error surface is much flatter there. Just as pointed out in [16], it is enough to investigate the model with two hidden units to capture the essence of the learning dynamics near the singularities. Without loss of generality, we analyse the case in which both the teacher model and the student model have two hidden units, namely the teacher model and the student model have the following forms, respectively:\begin{equation*} f(\boldsymbol {x},\boldsymbol {\theta }_{0}) = v_{1}\phi (\boldsymbol {x},\boldsymbol {t}_{1})+v_{2}\phi (\boldsymbol {x},\boldsymbol {t}_{2}),\tag{12}\end{equation*}
and \begin{equation*} f(\boldsymbol {x},\boldsymbol {\theta }) = w_{1}\phi (\boldsymbol {x},\boldsymbol {J}_{1})+w_{2}\phi (\boldsymbol {x},\boldsymbol {J}_{2}).\tag{13}\end{equation*}
For a given sample set, the corresponding error surface can be obtained by calculating the loss function l(y,\boldsymbol {x},\boldsymbol {\theta }). However, this error surface is inevitably disturbed by the particular samples. To overcome this problem, we instead investigate the generalization error L(\boldsymbol {\theta }) of the MLPs:\begin{equation*} L(\boldsymbol {\theta })=\left \langle{ l(y,\boldsymbol {x},\boldsymbol {\theta })}\right \rangle,\tag{14}\end{equation*}
where \langle \cdot \rangle denotes the average over (y_{t}, \boldsymbol {x}_{t}) with respect to the teacher distribution \begin{equation*} p_{0}(y,\boldsymbol {x})=q(\boldsymbol {x})\frac {1}{\sqrt {2\pi }}\exp \left ({-\frac {1}{2}(y-f_{0}(\boldsymbol {x}))^{2}}\right).\tag{15}\end{equation*}
Since the overlap singularity is essentially related to the weights \boldsymbol {J}_{i}, i=1,\,\,2, we mainly focus on the weights \boldsymbol {J}_{i} rather than w_{i}. Thus, in order to quantitatively analyse and visualize the generalization error, without loss of generality, we investigate the case in which the student output weights are fixed to the teacher output weights and the dimension of the input is 1, namely we set w_{1}=v_{1} and w_{2}=v_{2}. The teacher and student models are then of the following forms:\begin{equation*} f({x},\boldsymbol {\theta }_{0}) = v_{1}\phi (x,{t}_{1})+v_{2}\phi ({x},{t}_{2})+\varepsilon,\tag{16}\end{equation*}
and \begin{equation*} f({x},\boldsymbol {\theta }) = v_{1}\phi ({x},{J}_{1})+v_{2}\phi ({x},{J}_{2}),\tag{17}\end{equation*}
respectively.
As the output weights have been set to their optimal values, the two parameters v_{1} and v_{2} do not participate in the learning process and only J_{1} and J_{2} need to be modified. Thus the system parameters become \boldsymbol {\theta }=[J_{1},\,\,J_{2}]^{T}.
Next, we quantitatively analyse the generalization error surface of the MLPs. Suppose the student parameters arrive at the overlap singularity \mathcal {R}^{*}=\{\boldsymbol {\theta }^{*}|J_{1}=J_{2}=J^{*}\}. Then an arbitrary student parameter \hat {\boldsymbol {\theta }}=[\hat {J}_{1},\hat {J}_{2}]^{T} near the overlap singularity \boldsymbol {\theta }^{*}=[J^{*},\,\,J^{*}]^{T} can be seen as adding a bias term to \boldsymbol {\theta }^{*}, namely \hat {\boldsymbol {\theta }} = \boldsymbol {\theta }^{*}+\Delta \boldsymbol {\theta }^{*}, where \Delta \boldsymbol {\theta }^{*}=\hat {\boldsymbol {\theta }}-\boldsymbol {\theta }^{*}=[\hat {J}_{1}-J^{*},\,\,\hat {J}_{2}-J^{*}]^{T}.
Then by taking the Taylor expansion of L(\hat {\boldsymbol {\theta }}) at \boldsymbol {\theta }^{*}, we have:\begin{align*}&\hspace {-0.8pc}L(\hat {\boldsymbol {\theta }}) = L(\boldsymbol {\theta }^{*})+(\Delta \boldsymbol {\theta }^{*})^{T}\frac {\partial L(\boldsymbol {\theta }^{*})}{\partial \boldsymbol {\theta }^{*}} \\&\qquad \qquad +\,\frac {1}{2}(\Delta \boldsymbol {\theta }^{*})^{T}\frac {\partial ^{2} L(\boldsymbol {\theta }^{*})}{\partial {\boldsymbol {\theta }^{*}}\partial {{\boldsymbol {\theta }^{*}}^{T}}}\Delta \boldsymbol {\theta }^{*}+O(\|\Delta \boldsymbol {\theta }^{*}\|^{3}).\tag{18}\end{align*}
As shown in [30], for the overlap singularity \boldsymbol {\theta }^{*} we have \frac {\partial L(\boldsymbol {\theta }^{*})}{\partial \boldsymbol {\theta }^{*}}=\boldsymbol {0}, so Eq. (18) can be rewritten as:\begin{align*} L(\hat {\boldsymbol {\theta }})- L(\boldsymbol {\theta }^{*})=&\frac {1}{2}(\Delta \boldsymbol {\theta }^{*})^{T}\frac {\partial ^{2} L(\boldsymbol {\theta }^{*})}{\partial {\boldsymbol {\theta }^{*}}\partial {{\boldsymbol {\theta }^{*}}^{T}}}\Delta \boldsymbol {\theta }^{*}+O(\|\Delta \boldsymbol {\theta }^{*}\|^{3}) \\=&\frac {1}{2}{\Delta \boldsymbol {\theta }^{*}}^{T} H(\boldsymbol {\theta }^{*})\Delta \boldsymbol {\theta }^{*}+O(\|\Delta \boldsymbol {\theta }^{*}\|^{3}),\tag{19}\end{align*}
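The stationarity condition \frac {\partial L(\boldsymbol {\theta }^{*})}{\partial \boldsymbol {\theta }^{*}}=\boldsymbol {0} can be spot-checked numerically. The sketch below uses the closed-form generalization error for error-function units given later in Eqs. (26)–(27), locates the best approximation on the overlap line by a one-dimensional minimization, and evaluates the two-dimensional gradient there by central differences; the teacher values are those of Experiment 1, and everything else is an illustrative choice.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def P1(t, J):
    # Eq. (27): <phi(x,t) phi(x,J)> for error-function units and Gaussian input
    return np.arcsin(J * t / np.sqrt((1 + J**2) * (1 + t**2))) / (2 * np.pi) + 0.25

def gen_error(J1, J2, t, v):
    # Eq. (26) with the student output weights fixed to the teacher weights v
    J = np.array([J1, J2]); L = 0.0
    for i in range(2):
        for j in range(2):
            L += 0.5 * v[i] * v[j] * (P1(t[i], t[j]) + P1(J[i], J[j])) - v[i] * v[j] * P1(t[i], J[j])
    return L

t = np.array([0.48, -0.85]); v = np.array([0.60, 0.20])      # Experiment 1 teacher
# Best approximation on the overlap line J_1 = J_2 (expected near J* = 0.1647)
res = minimize_scalar(lambda J: gen_error(J, J, t, v), bounds=(-2.0, 2.0), method="bounded")
J_star, eps = res.x, 1e-5
grad = np.array([
    (gen_error(J_star + eps, J_star, t, v) - gen_error(J_star - eps, J_star, t, v)) / (2 * eps),
    (gen_error(J_star, J_star + eps, t, v) - gen_error(J_star, J_star - eps, t, v)) / (2 * eps),
])
print("J* =", J_star, "  gradient at the overlap point:", grad)   # gradient should be ~0
```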
where H(\boldsymbol {\theta })=\dfrac {\partial ^{2} L(\boldsymbol {\theta })}{\partial {\boldsymbol {\theta }}\partial {\boldsymbol {\theta }}^{T}} is the Hessian matrix of the MLP.
For Eq. (19), we have:\begin{align*} \Delta {\boldsymbol {\theta }^{*}}^{T} H(\boldsymbol {\theta }^{*})\Delta \boldsymbol {\theta }^{*}=&\Delta {\boldsymbol {\theta }^{*}}^{T} \left [{\begin{matrix}\dfrac {\partial ^{2} L(\boldsymbol {\theta }^{*})}{\partial {J^{*}}^{2}} &\quad \dfrac {\partial ^{2} L(\boldsymbol {\theta }^{*})}{\partial {J^{*}}^{2}}\\ \dfrac {\partial ^{2} L(\boldsymbol {\theta }^{*})}{\partial {J^{*}}^{2}} &\quad \dfrac {\partial ^{2} L(\boldsymbol {\theta }^{*})}{\partial {J^{*}}^{2}}\end{matrix}}\right]\Delta \boldsymbol {\theta }^{*} \\=&\Delta {\boldsymbol {\theta }^{*}}^{T} \left [{\begin{matrix}1 &\quad 1\\ 1 &\quad 1\end{matrix}}\right]\Delta \boldsymbol {\theta }^{*}\dfrac {\partial ^{2} L(\boldsymbol {\theta }^{*})}{\partial {J^{*}}^{2}},\tag{20}\end{align*}
Moreover, \begin{align*}&\hspace {-1.2pc}\Delta {\boldsymbol {\theta }^{*}}^{T} \left [{\begin{matrix}1 & 1\\ 1 & 1\end{matrix}}\right]\Delta \boldsymbol {\theta }^{*} \\=&\left [{\hat {J}_{1}-J^{*},~\hat {J}_{2}-J^{*} }\right] \left [{\begin{matrix}1 & 1\\ 1 & 1\end{matrix}}\right]\left [{\hat {J}_{1}-J^{*},~\hat {J}_{2}-J^{*} }\right]^{T} \\=&[(\hat {J}_{1}-J^{*})+(\hat {J}_{2}-J^{*}),(\hat {J}_{1}-J^{*})+(\hat {J}_{2}-J^{*})]\left [{\!\begin{matrix}\hat {J}_{1}-J^{*}\\ \hat {J}_{2}-J^{*}\end{matrix} \!}\right] \\=&(\hat {J}_{1}-J^{*})^{2}+2(\hat {J}_{1}-J^{*})(\hat {J}_{2}-J^{*})+(\hat {J}_{2}-J^{*})^{2} \\\leq&2((\hat {J}_{1}-J^{*})^{2}+(\hat {J}_{2}-J^{*})^{2}) \\=&2\|\Delta \boldsymbol {\theta }^{*}\|^{2}.\tag{21}\end{align*}
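The last step in (21) is the elementary inequality (a+b)^{2}\leq 2(a^{2}+b^{2}); a one-line symbolic check:

```python
import sympy as sp

a, b = sp.symbols('a b', real=True)
# 2(a^2 + b^2) - (a + b)^2 factors as (a - b)^2 >= 0, which yields Eq. (21)
print(sp.factor(2 * (a**2 + b**2) - (a + b)**2))   # -> (a - b)**2
```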
Then we have:\begin{equation*} L(\hat {\boldsymbol {\theta }})- L(\boldsymbol {\theta }^{*})\leq \dfrac {\partial ^{2} L(\boldsymbol {\theta }^{*})}{\partial {J^{*}}^{2}}\|\Delta \boldsymbol {\theta }^{*}\|^{2}+O(\|\Delta \boldsymbol {\theta }^{*}\|^{3}).\tag{22}\end{equation*}
From Eq. (14), it is obvious that L(\boldsymbol {\theta })\geq 0. Given that the points in the overlap singularity are local minima [29], [31], for an arbitrary point near the overlap singularity we have L(\hat {\boldsymbol {\theta }})> L(\boldsymbol {\theta }^{*}), i.e. L(\hat {\boldsymbol {\theta }})- L(\boldsymbol {\theta }^{*})>0.
After some calculations, we obtain the following result:
Theorem 1:
The difference of the generalization error between a point \hat {\boldsymbol {\theta }} around the overlap singularity and the overlap singularity \boldsymbol {\theta }^{*} satisfies the following inequality:\begin{align*} L(\hat {\boldsymbol {\theta }})- L(\boldsymbol {\theta }^{*}) < \frac {2.31}{\pi }(|v_{1}|+|v_{2}|)^{2}\|\Delta \boldsymbol {\theta }^{*}\|^{2}+O(\|\Delta \boldsymbol {\theta }^{*}\|^{3}). \\\tag{23}\end{align*}
Proof:
The calculation process is shown in the Appendix.
From Theorem 1, we can see that the generalization error near the overlap singularity changes on the order of O(\|\Delta \boldsymbol {\theta }^{*}\|^{2}), where the coefficient is smaller than \frac {2.31}{\pi }(|v_{1}|+|v_{2}|)^{2}. The nearer the student parameters are to the overlap singularity, the smaller the distance between the two hidden units. In particular, when \|\Delta \boldsymbol {\theta }^{*}\|^{2} < 1, the difference between L(\hat {\boldsymbol {\theta }}) and L(\boldsymbol {\theta }^{*}) is much less than the distance between \hat {\boldsymbol {\theta }} and \boldsymbol {\theta }^{*}, which implies that the generalization error surface near the overlap singularity is much flatter. Thus, when the student parameters arrive in the region near the overlap singularity, the variation of the generalization error is very small, so the parameters change only slightly. Even though the two hidden units are not precisely equal to each other, the learning process is still affected by the overlap singularity. Hence the influence area of the overlap singularity in MLPs is larger than in the theoretical results, where it is only the subspace \mathcal {R}^{*}=\{\boldsymbol {\theta }^{*}|\boldsymbol {J}_{1} = \boldsymbol {J}_{2}\}.
To verify the above analysis, we carry out experiments involving different cases in the simulation part.
SECTION IV.
Simulation Part
In this section, we conduct three experiments to verify the above analytical results. In Experiment 1, we focus on the artificial case in which the teacher model is described by an MLP. In Experiments 2 and 3, we consider the practical case in which two real datasets are approximated by MLPs. The simulation results illustrate the correctness of Theorem 1.
A. Generalization Error Surface of the MLPs
In this experiment, we consider the case in which the teacher model and the student model are of the forms in Eq. (16) and Eq. (17), respectively. Since only J_{1} and J_{2} are variable parameters, the generalization error surface can be shown visually once the corresponding generalization errors are obtained. On the generalization error surface, the influence of the overlap singularity can be observed directly and clearly.
In order to obtain the generalization error surface, we first need the analytical form of the generalization error. However, this is hard to obtain because of the non-integrability of the traditional log-sigmoid function f(x) = \cfrac {1}{1+\text {e}^{-\lambda x}}. To overcome this problem, we adopt the error function f(x)=\cfrac {1}{\sqrt {2\pi }}\displaystyle {\int }_{-\infty }^{x}\exp \left({-\frac {t^{2}}{2}}\right)\mathrm {d}t as the activation function and use the following averaged learning equation (ALE) to investigate the learning dynamics of MLPs [31], [32]:\begin{equation*} \dot {J_{i}} =-\eta \frac {\partial {L(\boldsymbol {\theta })}}{\partial J_{i}},\tag{24}\end{equation*}
where i=1,\,\,2, and \eta denotes the learning rate.
By using Eq. (24) together with the results of Eqs. (18)–(21) in [15], we have:\begin{align*} {\dot {J}}_{i}=\eta {v_{i}}\left({\sum \limits _{j=1}^{2}{v_{j}{P}_{2}({t}_{j},{J}_{i})}-\sum \limits _{j=1}^{2}{v_{j}{P}_{2}({J}_{j},{J}_{i})}}\right),\quad \text {for}~i=1,~2 \\[-5pt]\tag{25}\end{align*}
and the analytical form of generalization error \begin{align*}&\hspace {-1pc}L(\boldsymbol {\theta })=\frac {1}{2}\sum \limits _{i=1}^{2}\sum \limits _{j=1}^{2}v_{i}v_{j}P_{1}({t}_{i},{t}_{j})-\sum \limits _{i=1}^{2}\sum \limits _{j=1}^{2}v_{i}v_{j}P_{1}({t}_{i},{J}_{j}) \\&~~\qquad \qquad \qquad \qquad \quad \;\;\, +\,\frac {1}{2}\sum \limits _{i=1}^{2}\sum \limits _{j=1}^{2}v_{i}v_{j}P_{1}({J}_{i},{J}_{j}),\tag{26}\end{align*}
where:\begin{equation*} P_{1}(t,J)=\frac {1}{2\pi }\arcsin {\frac {Jt}{\sqrt {1+J^{2}}\sqrt {1+t^{2}}}}+\frac {1}{4},\tag{27}\end{equation*}
and \begin{equation*} P_{2}(t,J)=\frac {1}{2\pi }\frac {1}{\sqrt {1+J^{2}+t^{2}}}\frac {t}{1+J^{2}}.\tag{28}\end{equation*}
Then for given teacher parameters and initial values of student parameters, the learning process of the student parameters can be obtained by solving Eq. (25).
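A sketch of this procedure is given below: forward Euler integration of the averaged learning equation (25), with P_{1} and P_{2} implemented from Eqs. (27)–(28) (P_{2} taken as \partial P_{1}/\partial J). The teacher parameters and the initial point are those reported for Experiment 1 below; the step size and iteration count are illustrative, and the resulting trajectory is only meant for a qualitative comparison with Fig. 3.

```python
import numpy as np

def P1(t, J):   # Eq. (27)
    return np.arcsin(J * t / np.sqrt((1 + J**2) * (1 + t**2))) / (2 * np.pi) + 0.25

def P2(t, J):   # Eq. (28), i.e. the partial derivative of P1 with respect to J
    return t / (2 * np.pi * (1 + J**2) * np.sqrt(1 + J**2 + t**2))

def gen_error(J, t, v):   # Eq. (26), student output weights fixed to v
    L = 0.0
    for i in range(2):
        for j in range(2):
            L += 0.5 * v[i] * v[j] * (P1(t[i], t[j]) + P1(J[i], J[j])) - v[i] * v[j] * P1(t[i], J[j])
    return L

def ale_step(J, t, v, eta):
    # One forward-Euler step of Eq. (25); the time step is absorbed into eta
    drive = np.array([sum(v[j] * (P2(t[j], J[i]) - P2(J[j], J[i])) for j in range(2))
                      for i in range(2)])
    return J + eta * v * drive

# Teacher parameters and initial student weights as reported for Experiment 1
t = np.array([0.48, -0.85]); v = np.array([0.60, 0.20])
J = np.array([-0.90, 0.65])
eta = 0.05                                   # illustrative effective step size
for _ in range(20000):
    J = ale_step(J, t, v, eta)
print("final J:", J, "  generalization error:", gen_error(J, t, v))
```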
In this experiment, we choose the teacher parameters t_{1} = 0.48, t_{2}=-0.85, v_{1} =0.60 and v_{2}= 0.20. By letting the initial states of the two student parameters be identical, we obtain the best approximation J^{*}=0.1647. We then choose the initial student parameters as J_{1}^{(0)}=-0.90 and J_{2}^{(0)}=0.65; the final states are J_{1} = 0.1314 and J_{2} = 0.2684. The experiment results are shown in Fig. 3, where Fig. 3(a)-(d) show the surface of L(\hat {\boldsymbol {\theta }})-L(\boldsymbol {\theta }^{*}) near the overlap singularity, the trajectory of the generalization error, the trajectories of J_{1} and J_{2}, and the generalization error surface near the overlap singularity, respectively. '\circ' and '\times' represent the initial state and final state, respectively.
As shown in Fig. 3(a), L(\hat {\boldsymbol {\theta }})-L(\boldsymbol {\theta }^{*}) is much smaller than \dfrac {2.31}{\pi }(|v_{1}|+|v_{2}|)^{2}\|\Delta \boldsymbol {\theta }^{*}\|^{2}, which verifies the correctness of Theorem 1. From Fig. 3(b), it can be seen that the generalization error decreases quickly at the early stage of the learning process and then remains almost unchanged until the end of the training. Meanwhile, J_{1} and J_{2} become close to each other (shown in Fig. 3(c)). It is clear in Fig. 3(d) that the learning trajectory tends to the overlap singularity. Although the two hidden units do not overlap exactly, the learning dynamics are still influenced by the overlap singularity. After the training process, \dfrac {\partial L(\boldsymbol {\theta })}{\partial \boldsymbol {\theta }}=1.0e-05 \times [{ 0.2317, -0.7515}]^{T}, i.e. the gradient becomes very small and the generalization error surface is very flat, so the student parameters change only slightly even if the training process is made longer; in other words, the influence area of the overlap singularity is much larger.
B. Cuff-Less Blood Pressure Estimation
Having conducted an artificial experiment to verify the obtained results, we next carry out two experiments on real data to verify the validity of the theoretical analysis. In Experiment 2, MLPs are used to approximate the cuff-less blood pressure estimation database [33]. In this online waveform database, after collecting the photoplethysmograph (PPG) and electrocardiogram (ECG) signals, the arterial blood pressure (ABP) signal can be estimated by approximation algorithms. Thus in this experiment the input of the MLP is \hat {\boldsymbol {x}}=[x_{1},\,\,x_{2}]^{T} and the output is y, where x_{1} is the PPG signal, x_{2} is the ECG signal and y is the ABP signal. For machine learning, preprocessing is usually required to obtain better performance, and we choose Gaussian normalization in this paper [34]. \hat {\boldsymbol {x}}(k) is normalized as:\begin{equation*} \boldsymbol {x}(k) = \frac {\hat {\boldsymbol {x}}(k)-\boldsymbol {\mu }}{\boldsymbol {\delta }},\quad \text {for}~k=1,2,\cdots,M\tag{29}\end{equation*}
where \boldsymbol {\mu } is the sample mean of \hat {\boldsymbol {x}}:\begin{equation*} \boldsymbol {\mu }=\dfrac {1}{M}\sum \limits _{i=1}^{M}\hat {\boldsymbol {x}}(i),\tag{30}\end{equation*}
\boldsymbol {\delta } is the sample standard deviation:\begin{equation*} \boldsymbol {\delta }=\sqrt {\dfrac {1}{M}\sum \limits _{i=1}^{M}(\hat {\boldsymbol {x}}(i)-\boldsymbol {\mu })^{2}},\tag{31}\end{equation*}
and M is the number of samples in the data set.
Different from Experiment 1, since the input distribution is unknown, the ALE is inapplicable; instead, we use batch-mode learning in the training process. Given that the input dimension is larger than 1, in order to show directly how close two hidden units i and j are, in Experiments 2 and 3 we adopt the squared Euclidean distance h(i,j) defined in Eq. (9): the closer the two hidden units are to each other, the closer h(i,j) is to 0. The hidden unit number of the student model is chosen as k=8, namely the student MLP is given by:\begin{equation*} f(\boldsymbol {x},\boldsymbol {\theta })=\sum \limits _{i=1}^{8}{w_{i}\phi (\boldsymbol {x},\boldsymbol {J}_{i})}.\tag{32}\end{equation*}
We use N=200 samples to train the MLP, and the sum squared training error replaces the generalization error. The model is then trained for 10000 iterations with the learning rate \eta =0.03, and the initial and final states of the student parameters are as follows:\begin{align*} \boldsymbol {J}^{(0)}=&\left [{\boldsymbol {J}_{1}^{(0)},~\boldsymbol {J}_{2}^{(0)},~\boldsymbol {J}_{3}^{(0)},\cdots,\boldsymbol {J}_{8}^{(0)}}\right] \\=&\left [{ \begin{matrix} 0.7369 & -0.0981 & -0.0031 & -0.9533 \\ 0.8760 & -0.9857 & 0.9645 & 0.2263 \end{matrix}}\right. \\&\left.{\begin{matrix} 0.6443 & -0.1546 & -0.9210 & 0.9314\\ -0.1941 & 0.2266 & -0.6531 & -0.9248 \end{matrix}}\right], \tag{33}\\ \boldsymbol {w}^{(0)}=&\left [{w_{1}^{(0)},~w_{2}^{(0)},~w_{3}^{(0)},\cdots,w_{8}^{(0)}}\right] \\=&[0.3828,~~0.9461,~\,\,-0.6826,~~0.2903 \\&-\,0.7796,~~0.2582,~~0.7194,~\,\,-0.3073], \tag{34}\\ \boldsymbol {J}=&\left [{\boldsymbol {J}_{1},~\boldsymbol {J}_{2},~\boldsymbol {J}_{3},\cdots,\boldsymbol {J}_{8}}\right] \\=&\left [{ \begin{matrix} 1.2186 & 1.3180 & -1.2851 & -1.1058 \\ 0.9104 & -0.4648 & 0.4644 & 0.3508 \end{matrix}}\right. \\&\left.{\begin{matrix} 0.4478 & -0.3738 & -2.0233 & 2.8486\\ -0.7589 & 0.0577 & -1.5594 & -0.8602 \end{matrix}}\right], \tag{35}\\ \boldsymbol {w}=&\left [{w_{1},~w_{2},~w_{3},\cdots,w_{8}}\right] \\=&[0.7465,~~1.5188,~\,\,-0.8877,~\,\,-0.1907 \\&-\,0.5114,~~0.5144,~~0.9126,~\,\,-1.8684].\tag{36}\end{align*}
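A minimal sketch of the batch-mode training used here is given below: gradient descent on the squared training error of the 8-hidden-unit student (32), with the distance h(i,j) of Eq. (9) monitored during training. The data, activation and gradient scaling are placeholders, not the actual database interface or the exact settings above.

```python
import numpy as np

phi = lambda a: 1.0 / (1.0 + np.exp(-a))      # placeholder unipolar activation
dphi = lambda a: phi(a) * (1.0 - phi(a))

def h(J, i, j):
    # Squared Euclidean distance of Eq. (9) between hidden units i and j (0-based indices)
    u = J[i] - J[j]
    return 0.5 * float(u @ u)

# Placeholder data: X would hold the normalized (PPG, ECG) pairs and y the ABP target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)); y = rng.normal(size=200)

k, eta = 8, 0.03
J = rng.uniform(-1, 1, size=(k, 2)); w = rng.uniform(-1, 1, size=k)
for it in range(10000):
    A = X @ J.T                               # pre-activations, shape (N, k)
    e = phi(A) @ w - y                        # residuals of the student model (32)
    w -= eta * (phi(A).T @ e) / len(y)        # gradient step on w
    J -= eta * ((dphi(A) * w * e[:, None]).T @ X) / len(y)   # gradient step on J
    if it % 2000 == 0:
        print(it, 0.5 * np.sum(e ** 2), h(J, 2, 3))   # training error and h(3,4)
```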
The simulation results are shown in Fig. 4, where Fig. 4(a)-(c) show the trajectories of the training error, h(3,4) and w_{i}, i=1,\cdots,8, respectively. '\circ' and '\times' represent the initial state and final state, respectively. h(3,4) is the squared Euclidean distance between hidden units 3 and 4, i.e. h(3,4)=\dfrac {1}{2}(\boldsymbol {J}_{3}-\boldsymbol {J}_{4})^{T}(\boldsymbol {J}_{3}-\boldsymbol {J}_{4}).
From Fig. 4, we can see that the learning dynamics are similar to those in Experiment 1. At the early stage of the training process, the training error decreases quickly and then remains almost unchanged until the end (see Fig. 4(a)). Correspondingly, h(3,4) tends to zero rapidly (see Fig. 4(b)) and finally retains a small value (0.0225). From Eq. (35), the units after training are \boldsymbol {J}_{3} = [-1.2851,\,\, 0.4644]^{T} and \boldsymbol {J}_{4}=[-1.1058,\,\,0.3508]^{T}; although \boldsymbol {J}_{3} and \boldsymbol {J}_{4} still differ, the learning process is clearly affected by the overlap singularity. This is in accordance with the obtained result that the overlap singularity has a larger influence area.
C. Combined Cycle Power Plant (CCPP) Dataset
The Combined Cycle Power Plant (CCPP) dataset is also a regression benchmark, taken from the UCI Machine Learning Repository [35]–[37]. Hourly average ambient variables Temperature (T), Ambient Pressure (AP), Relative Humidity (RH) and Exhaust Vacuum (V) are used to predict the net hourly electrical energy output (EP) of the plant. Thus for the MLP the input is \boldsymbol {x} = [x_{1},\,\,x_{2},\,\,x_{3},\,\,x_{4}]^{T}, where x_{1} is T, x_{2} is AP, x_{3} is RH and x_{4} is V, and the output is EP. The preprocessing is also done as in Experiment 2.
The hidden unit number of the student model is chosen as k=10, namely the student MLP is given by:\begin{equation*} f(\boldsymbol {x},\boldsymbol {\theta })=\sum \limits _{i=1}^{10}{w_{i}\phi (\boldsymbol {x},\boldsymbol {J}_{i})}.\tag{37}\end{equation*}
Batch-mode learning is used for the training process, and we use N=200 samples to train the MLP. The model is trained for 15000 iterations with the learning rate \eta =0.001.
Fig. 5 presents the simulation results, where Fig. 5(a)-(c) show the trajectories of the training error, h(1,3) and w_{i}, i=1,\cdots,10, respectively. '\circ' and '\times' represent the initial state and final state, respectively. h(1,3) is the squared Euclidean distance between hidden units 1 and 3, i.e. h(1,3)=\dfrac {1}{2}(\boldsymbol {J}_{1}-\boldsymbol {J}_{3})^{T}(\boldsymbol {J}_{1}-\boldsymbol {J}_{3}).
The initial and final states of hidden units 1 and 3 are as follows:\begin{align*} [\boldsymbol {J}_{1}^{(0)},~\boldsymbol {J}_{3}^{(0)}]=&\left [{ \begin{matrix} 1.2442 &\quad 0.7505 \\ -0.7139 &\quad 0.1948 \\ 1.9443 &\quad 1.0089 \\ -0.1407 &\quad 0.2051 \end{matrix}}\right], \tag{38}\\{}[{w}_{1}^{(0)},~w_{3}^{(0)}]=&[-1.2276,~~1.7313],\tag{39}\end{align*}
and \begin{align*} [\boldsymbol {J}_{1},~\boldsymbol {J}_{3}]=&\left [{ \begin{matrix} 0.5509 &\quad 0.4581 \\ -0.2229 &\quad 0.0554 \\ 0.9107 &\quad 0.9899 \\ 0.3372 &\quad 0.2898 \end{matrix}}\right], \tag{40}\\{}[w_{1},~w_{3}]=&[-1.5202,~~0.6636],\tag{41}\end{align*}
respectively.
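For reference, the sketch below evaluates h(1,3) of Eq. (9) directly from the listed initial and final weights (38) and (40).

```python
import numpy as np

def h(Ja, Jb):
    # Eq. (9): squared Euclidean distance between two hidden-unit weight vectors
    u = Ja - Jb
    return 0.5 * float(u @ u)

# Columns of Eq. (38) (initial) and Eq. (40) (final): [J_1, J_3]
J1_0 = np.array([1.2442, -0.7139, 1.9443, -0.1407])
J3_0 = np.array([0.7505,  0.1948, 1.0089,  0.2051])
J1_f = np.array([0.5509, -0.2229, 0.9107,  0.3372])
J3_f = np.array([0.4581,  0.0554, 0.9899,  0.2898])

print("h(1,3) before training:", h(J1_0, J3_0))
print("h(1,3) after training: ", h(J1_f, J3_f))   # the drop shows the two units drawing together
```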
As shown in Fig. 5, the experiment results are similar to those obtained in Experiment 2. From the initial and final values of \boldsymbol {J}_{1} and \boldsymbol {J}_{3} (shown in Eq. (38) and Eq. (40)) and the trajectory of h(1,3) (shown in Fig. 5(b)), although there is some difference between \boldsymbol {J}_{1} and \boldsymbol {J}_{3}, the learning process is clearly affected by the overlap singularity.
The results of Experiments 1–3 verify the correctness of Theorem 1. Thus the overlap singularity indeed has a much larger influence area for MLPs, and more attention should be paid to ways of avoiding or reducing its influence.
SECTION V.
Conclusions and Discussions
For the widely used MLPs, there exist overlap singularities in the parameter space, and these singularities seriously affect the learning dynamics. According to previous theoretical analysis, under batch-mode learning the model parameters always arrive at and become trapped in the overlap singularity when they are affected by it. However, by analyzing the generalization error surface near the overlap singularity, for an arbitrary point \hat {\boldsymbol {\theta }} near the overlap singularity and a point \boldsymbol {\theta }^{*} on the overlap singularity, we prove that the difference of the generalization error between \hat {\boldsymbol {\theta }} and \boldsymbol {\theta }^{*} is of second order in \|\hat {\boldsymbol {\theta }}-\boldsymbol {\theta }^{*}\|. Thus the generalization error surface is much flatter near the overlap singularity, so that, when the learning process approaches the overlap singularity, the learning dynamics are still affected by it even though the two hidden units differ from each other. Because of this flatness, the model parameters change only slightly even after much longer training. An artificial experiment and two real-dataset experiments in the simulation part verify the validity of the obtained results. Due to its much larger influence area, more attention should be paid to the overlap singularity and to how its influence can be avoided or reduced in the future.
APPENDIX
For simplicity, we introduce the following notations:\begin{align*} P_{3}(t,J)=&\left \langle{ \phi (x,t)\frac {\partial ^{2}\phi (x,J)}{\partial J^{2}}}\right \rangle,\tag{A-1}\\ P_{4}(t,J)=&\left \langle{ \frac {\partial \phi (x,t)}{\partial t}\frac {\partial \phi (x,J)}{\partial J}}\right \rangle.\tag{A-2}\end{align*}
From Eq. (14), we can obtain:\begin{align*}&\hspace {-2pc}\frac {\partial ^{2} L(\boldsymbol {\theta }^{*})}{\partial {J^{*}}^{2}} \\=&(v_{1}+v_{2})^{2}\left \langle{ \frac {\partial \phi (\boldsymbol {x},J^{*})}{\partial {J^{*}}}\frac {\partial \phi (\boldsymbol {x},J^{*})}{\partial {J^{*}}}}\right \rangle \\&+\,(v_{1}+v_{2})^{2}\left \langle{ \phi (\boldsymbol {x},J^{*})\frac {\partial ^{2}\phi (\boldsymbol {x},J^{*})}{\partial {J^{*}}^{2}}}\right \rangle \\&-\,(v_{1}+v_{2})\left \langle{ \left ({v_{1}\phi (\boldsymbol {x},t_{1})+v_{2}\phi (\boldsymbol {x},t_{2})}\right)\frac {\partial ^{2}\phi (\boldsymbol {x},J^{*})}{\partial {J^{*}}^{2}}}\right \rangle \\=&(v_{1}+v_{2})^{2}P_{4}(J^{*},J^{*})+(v_{1}+v_{2})^{2}P_{3}(J^{*},J^{*}) \\&-\,(v_{1}+v_{2})\left ({v_{1}P_{3}(t_{1},J^{*})+v_{2}P_{3}(t_{2},J^{*})}\right) \\\leq&(v_{1}+v_{2})^{2}|P_{4}(J^{*},J^{*})+P_{3}(J^{*},J^{*})| \\&+\,|v_{1}+v_{2}|\left |{v_{1}P_{3}(t_{1},J^{*})+v_{2}P_{3}(t_{2},J^{*})}\right |.\tag{A-3}\end{align*}
By using the results of Eqs. (60)–(62) in [31], the explicit expressions of P_{3}(t,J) and P_{4}(t,J) are given as follows. For t\neq J
, \begin{align*} P_{3}(t,J)=&-\frac {1}{2\pi }\frac {Jt}{1+J^{2}}\frac {1}{\sqrt {1+J^{2}+t^{2}}} \\&\times \frac {2(1+J^{2}+t^{2})+1+J^{2}}{(1+J^{2})(1+J^{2}+t^{2})} \\=&-\frac {Jt}{2\pi }\frac {3(1+J^{2})+2t^{2}}{(1+J^{2})^{2}(1+J^{2}+t^{2})^{\frac {3}{2}}}, \tag{A-4}\\ P_{3}(J,J)=&-\frac {1}{2\pi }\frac {1}{1+J^{2}}\frac {1}{\sqrt {1+2J^{2}}} \\&\times \left ({1-\frac {2(2+3J^{2})J^{2}}{(1+J^{2})(1+2J^{2})} -\frac {1+J^{2}}{1+2J^{2}}}\right) \\=&-\frac {J^{2}}{2\pi }\frac {3+5J^{2}}{(1+J^{2})^{2}(1+2J^{2})^{\frac {3}{2}}},\tag{A-5}\\ P_{4}(J,J)=&\frac {1}{2\pi }\frac {1}{(1+2J^{2})^{\frac {3}{2}}}.\tag{A-6}\end{align*}
Then, we have:\begin{align*}&\hspace {-1.2pc}P_{4}(J^{*},J^{*})+P_{3}(J^{*},J^{*}) \\=&\frac {1}{2\pi }\left ({\frac {1}{(1+2{J^{*}}^{2})^{\frac {3}{2}}}-\frac {J^{*}{}^{2}(3+5{J^{*}}^{2})}{(1+{J^{*}}^{2})^{2}(1+2{J^{*}}^{2}) ^{\frac {3}{2}}}}\right) \\=&\frac {1}{2\pi }\frac {(1+{J^{*}}^{2})^{2}-{J^{*}}^{2}(3+5{J^{*}}^{2})}{(1+{J^{*}}^{2})^{2}(1+2{J^{*}}^{2})^{\frac {3}{2}}} \\=&\frac {1}{2\pi }\frac {\frac {17}{16}-\left({2{J^{*}}^{2}+\frac {1}{4}}\right)^{2}}{(1+{J^{*}}^{2})^{2}(1+2{J^{*}}^{2})^{\frac {3}{2}}}.\tag{A-7}\end{align*}
For 0\leq {J^{*}}^{2} < \frac {\sqrt {17}-1}{8}, we have P_{4}(J^{*},J^{*})+P_{3}(J^{*},J^{*})>0. As P_{4}(J^{*},J^{*})>0 and P_{3}(J^{*},J^{*}) < 0, we can obtain:\begin{align*}&\hspace {-1.2pc}|P_{4}(J^{*},J^{*})+P_{3}(J^{*},J^{*})| < P_{4}(J^{*},J^{*}) \\&\qquad \qquad \qquad \qquad \quad =\frac {1}{2\pi }\frac {1}{(1+2{J^{*}}^{2})^{\frac {3}{2}}} < \frac {1}{2\pi }.\tag{A-8}\end{align*}
For {J^{*}}^{2}\geq \dfrac {\sqrt {17}-1}{8}, we have P_{4}(J^{*},J^{*})+P_{3}(J^{*},J^{*})\leq 0. As P_{4}(J^{*},J^{*})>0 and P_{3}(J^{*},J^{*}) < 0, we can obtain:\begin{align*}&\hspace {-1.2pc}|P_{4}(J^{*},J^{*})+P_{3}(J^{*},J^{*})| < |P_{3}(J^{*},J^{*})| \\=&\frac {1}{2\pi }\frac {{J^{*}}^{2}(3+5{J^{*}}^{2})}{(1+{J^{*}}^{2})^{2}(1+2{J^{*}}^{2})^{\frac {3}{2}}} < \frac {1}{2\pi }\frac {(1+{J^{*}}^{2})\cdot 3(1+2{J^{*}}^{2})}{(1+{J^{*}}^{2})^{2}(1+2{J^{*}}^{2})^{\frac {3}{2}}} \\=&\frac {1}{2\pi }\frac {3}{(1+{J^{*}}^{2})(1+2{J^{*}}^{2})^{\frac {1}{2}}} < \frac {1.62}{2\pi }.\tag{A-9}\end{align*}
Next we focus on P_{3}(t_{i},J^{*}); we have:\begin{align*} |P_{3}(t_{i},J^{*})|=&\frac {1}{2\pi }\frac {|J^{*}|\cdot |t_{i}|\cdot \left ({3(1+{J^{*}}^{2})+2t_{i}^{2}}\right)}{(1+{J^{*}}^{2})^{2}(1+{J^{*}}^{2}+t_{i}^{2})^{\frac {3}{2}}} \\\leq&\frac {1}{2\pi }\frac {|J^{*}|\cdot |t_{i}|\cdot 3(1+{J^{*}}^{2}+t_{i}^{2})}{(1+{J^{*}}^{2})^{2}(1+{J^{*}}^{2}+t_{i}^{2})^{\frac {3}{2}}} \\=&\frac {3}{2\pi }\frac {|J^{*}||t_{i}|}{(1+{J^{*}}^{2})^{2}(1+{J^{*}}^{2}+t_{i}^{2})^{\frac {1}{2}}} \\ < &\frac {3}{2\pi }.\tag{A-10}\end{align*}
Then we can get:\begin{align*}&\hspace {-2pc}|v_{1}P_{3}(t_{1},J^{*})+v_{2}P_{3}(t_{2},J^{*})| \\\leq&|v_{1}||P_{3}(t_{1},J^{*})|+|v_{2}||P_{3}(t_{2},J^{*})| \\ < &\frac {3}{2\pi }(|v_{1}|+|v_{2}|).\tag{A-11}\end{align*}
Overall, we have:
For 0\leq {J^{*}}^{2} < \dfrac {\sqrt {17}-1}{8},\begin{align*} \frac {\partial ^{2} L(\boldsymbol {\theta }^{*})}{\partial {J^{*}}^{2}} < &\frac {(v_{1}+v_{2})^{2}}{2\pi }+\frac {3}{2\pi }|v_{1}+v_{2}|(|v_{1}|+|v_{2}|) \\ < &\frac {|v_{1}|+|v_{2}|}{2\pi }(|v_{1}|+|v_{2}|+3(|v_{1}|+|v_{2}|)) \\=&\frac {2}{\pi }(|v_{1}|+|v_{2}|)^{2}.\tag{A-12}\end{align*}
For {J^{*}}^{2}\geq \dfrac {\sqrt {17}-1}{8},\begin{align*} \frac {\partial ^{2} L(\boldsymbol {\theta }^{*})}{\partial {J^{*}}^{2}} < &\frac {1.62}{2\pi }(v_{1}+v_{2})^{2}+\frac {3}{2\pi }|v_{1}+v_{2}|(|v_{1}|+|v_{2}|) \\ < &\frac {4.62}{2\pi }(|v_{1}|+|v_{2}|)^{2} \\=&\frac {2.31}{\pi }(|v_{1}|+|v_{2}|)^{2}.\tag{A-13}\end{align*}
Thus, from Eq. (A-12) and Eq. (A-13), \dfrac {\partial ^{2} L(\boldsymbol {\theta }^{*})}{\partial {J^{*}}^{2}} < \dfrac {2.31}{\pi }(|v_{1}|+|v_{2}|)^{2}, which proves Theorem 1.
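The bound can be spot-checked numerically from the closed forms (A-3)–(A-6): evaluate \frac {\partial ^{2} L(\boldsymbol {\theta }^{*})}{\partial {J^{*}}^{2}} over a grid of J^{*} values and compare it with \frac {2.31}{\pi }(|v_{1}|+|v_{2}|)^{2}. In the sketch below the teacher values of Experiment 1 and the grid range are illustrative choices.

```python
import numpy as np

def P3(t, J):
    # Eq. (A-4); at t = J it reduces to Eq. (A-5)
    return -(J * t / (2 * np.pi)) * (3 * (1 + J**2) + 2 * t**2) \
           / ((1 + J**2) ** 2 * (1 + J**2 + t**2) ** 1.5)

def P4_diag(J):
    # Eq. (A-6)
    return 1.0 / (2 * np.pi * (1 + 2 * J**2) ** 1.5)

def d2L(J_star, t, v):
    # Eq. (A-3): second derivative of the generalization error along the overlap line
    s = v[0] + v[1]
    return s**2 * (P4_diag(J_star) + P3(J_star, J_star)) \
           - s * (v[0] * P3(t[0], J_star) + v[1] * P3(t[1], J_star))

t = np.array([0.48, -0.85]); v = np.array([0.60, 0.20])      # Experiment 1 teacher values
coeff = 2.31 / np.pi * (abs(v[0]) + abs(v[1])) ** 2
grid = np.linspace(-3.0, 3.0, 121)                           # illustrative range of J*
vals = np.array([d2L(J, t, v) for J in grid])
# The maximum over the grid should stay below the Theorem 1 coefficient
print("max d2L over grid:", vals.max(), "   2.31/pi * (|v1|+|v2|)^2 =", coeff)
```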