Introduction
Orthogonality constrained optimization problems play crucial roles in signal processing and machine learning applications [1], [2], [3]. Such applications include sparse principal component analysis [4], [5], [6], 1-bit compressed sensing [7], [8], the joint diagonalization problem for independent component analysis [9], [10], [11], the orthogonal Procrustes problem [12], [13], [14], [15], the nearest low-rank correlation matrix problem [16], [17], [18], the nonlinear eigenvalue problem [19], [20], [21], and the enhancement of the generalization performance of deep neural networks [22], [23], [24]. The orthogonality constraint set, defined by $\mathrm {St}(p,N):= \{\boldsymbol {U} \in \mathbb {R}^{N\times p} \mid \boldsymbol {U}^{ \mathsf {T}}\boldsymbol {U} = \boldsymbol {I}_{p}\}$ with $p \leq N$, is called the Stiefel manifold.
Problem 1.1:
For a given differentiable function $f:\mathbb {R}^{N\times p} \to \mathbb {R}$, \begin{equation*} \mathrm {find}~\boldsymbol {U}^{\star } \in \mathop {\mathrm {argmin}}\limits _{\boldsymbol {U}\in \mathrm {St} (p,N)} ~f(\boldsymbol {U}).\end{equation*}
A realistic goal for Problem 1.1 is to find a stationary point $\boldsymbol {U}^{\star } \in \mathrm {St} (p,N)$ of $f$ over $\mathrm {St} (p,N)$ because Problem 1.1 is nonconvex in general.
A family of Cayley-type transforms [22], [26], [28], [29], [30], [31], [32], [33] resolves the nonlinearity of $\mathrm {St} (p,N)$ by parametrizing it, around a center point $\boldsymbol {S}$ in the orthogonal group ${\mathrm{ O}}(N)$, with the vector space \begin{align*} Q_{N,p}(\boldsymbol {S}) &:= Q_{N,p}:=\left \{{ \left.{\begin{bmatrix} \boldsymbol {A} & -\boldsymbol {B}^{ \mathsf {T}} \\ \boldsymbol {B} & \boldsymbol {0} \end{bmatrix} \in \mathbb {R}^{N\times N} }\right \vert \; \substack { -\boldsymbol {A}^{ \mathsf {T}} = \boldsymbol {A} \in \mathbb {R}^{p\times p}, \\ \boldsymbol {B} \in \mathbb {R}^{(N-p)\times p}}}\right \} \\ & \equiv \mathbb {R}^{Np-\frac {1}{2}p(p+1)} \tag{1}\end{align*} of structured skew-symmetric matrices.
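Since $Q_{N,p}$ in (1) is a vector space of dimension $Np-\frac{1}{2}p(p+1)$, a point of $Q_{N,p}$ can be handled as an unconstrained parameter. The following minimal NumPy sketch (the helper name `assemble_Q` is ours, not from [33]) assembles such a point from a skew-symmetric $\boldsymbol{A}$ and an arbitrary $\boldsymbol{B}$, which is the structure exploited throughout this paper.

```python
import numpy as np

def assemble_Q(A, B):
    """Build V = [[A, -B^T], [B, 0]] in Q_{N,p} from a skew-symmetric A (p x p)
    and an arbitrary B ((N-p) x p); a direct transcription of the block structure in (1)."""
    assert np.allclose(A, -A.T), "A must be skew-symmetric"
    p, m = A.shape[0], B.shape[0]           # m = N - p
    V = np.zeros((p + m, p + m))
    V[:p, :p] = A
    V[:p, p:] = -B.T
    V[p:, :p] = B
    return V

# Free parameters: p(p-1)/2 entries of A plus (N-p)p entries of B,
# i.e., Np - p(p+1)/2 in total, matching the identification in (1).
rng = np.random.default_rng(0)
p, N = 3, 6
A = rng.standard_normal((p, p))
A = 0.5 * (A - A.T)                         # skew-symmetric part
B = rng.standard_normal((N - p, p))
V = assemble_Q(A, B)
```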
As a result, Problem 1.1 can be reformulated as
Problem 1.2:
For a given differentiable function $f$, a given center point $\boldsymbol {S} \in {\mathrm{ O}}(N)$, and a given $\epsilon > 0$, \begin{align*} \textrm {find} ~& \boldsymbol {V}^{\star } \in Q_{N,p}(\boldsymbol {S}) \\ & \textrm {such that} ~f\circ \Phi _{\boldsymbol {S}}^{-1}(\boldsymbol {V}^{\star }) < \min f(\mathrm {St}(p,N)) + \epsilon.\end{align*}
The naive Cayley parametrization (NCP) strategy [28], [33] updates (i) estimates $\boldsymbol {V}_{n} \in Q_{N,p}(\boldsymbol {S})$ of a solution to Problem 1.2 by applying a Euclidean optimization algorithm to $f\circ \Phi _{\boldsymbol {S}}^{-1}$, and (ii) estimates $\boldsymbol {U}_{n}:= \Phi _{\boldsymbol {S}}^{-1}(\boldsymbol {V}_{n}) \in \mathrm {St} (p,N)$ of a solution to Problem 1.1, where the center point $\boldsymbol {S} \in {\mathrm{ O}}(N)$ is fixed throughout the iterations.
However, in a case where a minimizer $\boldsymbol {U}^{\star }$ of Problem 1.1 lies near the singular-point set $E_{N,p}(\boldsymbol {S})$ of $\Phi _{\boldsymbol {S}}$ (see (3)), the NCP strategy may suffer from slow convergence because $\lVert \Phi _{\boldsymbol {S}}(\boldsymbol {U}^{\star })\rVert _{2}$ becomes extremely large (see Remark 2.2 (b)).
To suppress the singular-point issue in the NCP strategy, we recently proposed an Adaptive Localized Cayley Parametrization (ALCP) strategy [42], which adaptively adjusts the location of singular points, for the following problem:
Problem 1.3:
For a given differentiable function $f$ and a given $\epsilon > 0$, \begin{align*} \textrm {find} ~& (\boldsymbol {V}^{\star },\boldsymbol {S}^{\star }) \in Q_{N,p} \times {\mathrm{ O}}(N) \\ & \textrm {such that} ~f\circ \Phi _{\boldsymbol {S}^{\star }}^{-1}(\boldsymbol {V}^{\star }) < \min f(\mathrm {St}(p,N)) + \epsilon.\end{align*}
In the ALCP strategy, until the singular-point issue is detected by a computable restarting condition, we use the same center point $\boldsymbol {S}_{[l]} \in {\mathrm{ O}}(N)$; once the issue is detected, the center point is updated adaptively (see Algorithms 1 and 2 below).
Convergence rates of Euclidean optimization algorithms, in terms of the gradient sequence, have been of great interest because they quantify how fast the gradient decays to zero. Indeed, many Euclidean optimization algorithms, e.g., the Nesterov-type accelerated gradient method [39], [40], [41] and the adaptive momentum optimization algorithms [43], [44], [45], have been designed carefully with guaranteed convergence rates. We call such a class of Euclidean optimization algorithms Nonconvex Optimization over the Euclidean space with convergence Rates (NOE-r) (see Definition 2.4 for the formal definition of NOE-r).
So far, no convergence rate analysis for the ALCP strategy has been established. Here, we encounter the following fundamental question:
Can we derive any convergence rate of the ALCP strategy, in terms of gradient sequence, by incorporating NOE-r?
To establish a convergence rate for the ALCP strategy, we introduce a modification of its restarting condition [33] that imposes a polynomial lower bound on the number of iterations for which the same center point is used. Thanks to the NOE-r and the proposed modification, we successfully derive a convergence rate of the ALCP strategy in Theorem 3.3. Numerical experiments in Section IV demonstrate the effectiveness of the ALCP strategy for Problem 1.1 compared with the standard strategy, known as the retraction-based strategy [25], [46], [47] (see below).
Related Works and Primitive Goal of This Paper: The so-called retraction-based strategy [25], [46], [47] is known as the standard generic strategy for optimization over Riemannian manifolds, which include $\mathrm {St} (p,N)$.
A partial preliminary short version of this paper was presented in [31].
Notation:
Preliminaries
A. Adaptive Localized Cayley Parametrization for Optimization Over $ \mathrm{St}(p,N)$
The generalized left-localized Cayley transform [33], with a center point $\boldsymbol {S}:= [\boldsymbol {S}_{\mathrm {le}}~\boldsymbol {S}_{\mathrm {ri}}] \in {\mathrm{ O}}(N)$ ($\boldsymbol {S}_{\mathrm {le}} \in \mathbb {R}^{N\times p}$, $\boldsymbol {S}_{\mathrm {ri}} \in \mathbb {R}^{N\times (N-p)}$), is defined by \begin{align*} \Phi _{\boldsymbol {S}}&: \mathrm {St} (p,N) \setminus E_{N,p}(\boldsymbol {S})\to Q_{N,p}(\boldsymbol {S}) \\ &: \boldsymbol {U}\mapsto \begin{bmatrix} \boldsymbol {A}_{\boldsymbol {S}}(\boldsymbol {U}) & -\boldsymbol {B}^{ \mathsf {T}}_{\boldsymbol {S}}(\boldsymbol {U}) \\ \boldsymbol {B}_{\boldsymbol {S}}(\boldsymbol {U}) & \boldsymbol {0} \end{bmatrix} \tag{2}\end{align*}
with
\begin{align*} \boldsymbol {A}_{\boldsymbol {S}}(\boldsymbol {U}) &:= 2(\boldsymbol {I}_{p}+\boldsymbol {S}_{\mathrm {le}}^{ \mathsf {T}}\boldsymbol {U})^{-\mathrm {T}} \mathrm {Skew}(\boldsymbol {U}^{ \mathsf {T}}\boldsymbol {S}_{\mathrm {le}})(\boldsymbol {I}_{p}+\boldsymbol {S}_{\mathrm {le}}^{ \mathsf {T}}\boldsymbol {U})^{-1} \in Q_{p,p}, \\ \boldsymbol {B}_{\boldsymbol {S}}(\boldsymbol {U}) &:= - \boldsymbol {S}_{\mathrm {ri}}^{ \mathsf {T}}\boldsymbol {U}(\boldsymbol {I}_{p}+\boldsymbol {S}_{\mathrm {le}}^{ \mathsf {T}}\boldsymbol {U})^{-1} \in \mathbb {R}^{(N-p) \times p},\end{align*}
where the singular-point set of $\Phi _{\boldsymbol {S}}$ is given by
\begin{equation*} E_{N,p}(\boldsymbol {S}):= \{\boldsymbol {U} \in \mathrm {St} (p,N) \mid \det (\boldsymbol {I}_{p} + \boldsymbol {S}_{\mathrm {le}}^{ \mathsf {T}}\boldsymbol {U}) = 0\}. \tag{3}\end{equation*}
The inverse map of $\Phi _{\boldsymbol {S}}$ is given by
\begin{align*} \Phi _{\boldsymbol {S}}^{-1}&: Q_{N,p}(\boldsymbol {S})\to \mathrm {St} (p,N)\setminus E_{N,p}(\boldsymbol {S}) \\ &: \boldsymbol {V} \mapsto \boldsymbol {S}(\boldsymbol {I}-\boldsymbol {V})(\boldsymbol {I}+\boldsymbol {V})^{-1}\boldsymbol {I}_{N\times p}, \tag{4}\end{align*}
where $\boldsymbol {I}_{N\times p}$ denotes the matrix of the first $p$ columns of $\boldsymbol {I}_{N}$.
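To make (2)–(4) concrete, the following minimal NumPy sketch implements $\Phi_{\boldsymbol{S}}$ and $\Phi_{\boldsymbol{S}}^{-1}$ by direct transcription of the displayed formulas; it assumes $\mathrm{Skew}(\boldsymbol{X}):= (\boldsymbol{X}-\boldsymbol{X}^{\mathsf{T}})/2$, and the function names are ours rather than from [33] or [42].

```python
import numpy as np

def cayley_param(U, S, p):
    """Phi_S(U) in (2), assuming Skew(X) := (X - X^T)/2 and S = [S_le  S_ri]."""
    N = S.shape[0]
    S_le, S_ri = S[:, :p], S[:, p:]
    M_inv = np.linalg.inv(np.eye(p) + S_le.T @ U)        # (I_p + S_le^T U)^{-1}
    skew = 0.5 * (U.T @ S_le - S_le.T @ U)               # Skew(U^T S_le)
    A = 2.0 * M_inv.T @ skew @ M_inv
    B = -S_ri.T @ U @ M_inv
    V = np.zeros((N, N))
    V[:p, :p], V[:p, p:], V[p:, :p] = A, -B.T, B
    return V

def cayley_param_inv(V, S, p):
    """Phi_S^{-1}(V) = S (I - V)(I + V)^{-1} I_{N x p}, as in (4)."""
    N = S.shape[0]
    I = np.eye(N)
    X = np.linalg.solve((I + V).T, (I - V).T).T          # (I - V)(I + V)^{-1}
    return (S @ X)[:, :p]                                # right-multiplication by I_{N x p}

# Consistency check of (2) and (4) on a random point away from E_{N,p}(S):
rng = np.random.default_rng(0)
N, p = 8, 3
S, _ = np.linalg.qr(rng.standard_normal((N, N)))         # a center point in O(N)
U, _ = np.linalg.qr(rng.standard_normal((N, p)))         # a point on St(p, N)
residual = np.linalg.norm(cayley_param_inv(cayley_param(U, S, p), S, p) - U)
# residual is at rounding-error level whenever det(I_p + S_le^T U) is not close to 0
```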
Fact 2.1 (Optimality Condition [33]):
Let $\boldsymbol {S} \in {\mathrm{ O}}(N)$ and $\boldsymbol {U} \in \mathrm {St} (p,N) \setminus E_{N,p}(\boldsymbol {S})$. Then, the following hold:
(a) $\boldsymbol {U}$ is a local minimizer of Problem 1.1 if and only if $\Phi _{\boldsymbol {S}}(\boldsymbol {U})$ is a local minimizer of $f\circ \Phi _{\boldsymbol {S}}^{-1}$ over $Q_{N,p}(\boldsymbol {S})$.
(b) $\boldsymbol {U}$ is a stationary point of Problem 1.1 if and only if $\Phi _{\boldsymbol {S}}(\boldsymbol {U})$ is a stationary point of $f_{\boldsymbol {S}}:=f\circ \Phi _{\boldsymbol {S}}^{-1}$ over $Q_{N,p}(\boldsymbol {S})$, i.e., $\nabla f_{\boldsymbol {S}}(\Phi _{\boldsymbol {S}}(\boldsymbol {U})) = \boldsymbol {0}$ (see Appendix A for $\nabla f_{\boldsymbol {S}}$).
By using the tentative update $\widetilde {\boldsymbol {V}}_{n+1}:= \mathcal {A}^{\langle l \rangle }(\boldsymbol {V}_{n})$ and $\boldsymbol {U}_{n+1}:= \Phi _{\boldsymbol {S}_{[l]}}^{-1}(\widetilde {\boldsymbol {V}}_{n+1}) \in \mathrm {St} (p,N)$, Algorithm 1 generates the next estimate by \begin{align*} \boldsymbol {V}_{n+1}:= \begin{cases} \displaystyle \Phi _{\boldsymbol {S}_{[l+1]}}(\boldsymbol {U}_{n+1}) & \mathrm {(if~the~restarting~condition~holds)} \\ \displaystyle \widetilde {\boldsymbol {V}}_{n+1} & \mathrm {(otherwise)}, \end{cases}\end{align*} where the next center point $\boldsymbol {S}_{[l+1]}$ is chosen by Algorithm 2 (see Remark 2.3).
Algorithm 1 Adaptive Localized Cayley Parametrization Strategy
while
else
end if
end while
Algorithm 2 Choice of Center Point
Compute the singular value decomposition of
Remark 2.2 (On the ALCP Strategy (Algorithm 1)):
(a) (Properties of ${\mathcal {N}}_{l}$) In Algorithm 1, the index set ${\mathcal {N}}_{l} \subset \mathbb {N}_{0}$ satisfies
\begin{align*} \begin{cases} \displaystyle n\in {\mathcal {N}}_{l} \Rightarrow \boldsymbol {V}_{n} \in Q_{N,p}(\boldsymbol {S}_{[l]}), \\ \displaystyle (l_{1}\neq l_{2})\quad \mathcal {N}_{l_{1}} \cap \mathcal {N}_{l_{2}} = \emptyset, \\ \displaystyle \mathcal {N}_{l+1} \neq \emptyset \Rightarrow \left |{{\mathcal {N}}_{l}}\right | = \max ({\mathcal {N}}_{l}) - \min ({\mathcal {N}}_{l}) + 1, \\ \displaystyle \mathcal {N}_{l+1} \neq \emptyset \Rightarrow \max ({\mathcal {N}}_{l})+1 = \min (\mathcal {N}_{l+1}), \end{cases} \tag{5}\end{align*}
where $\mathcal {N}_{l+1} \neq \emptyset $ implies that ${\mathcal {N}}_{l}$ is determined as a finite interval, i.e., ${\mathcal {N}}_{l} = [\min ({\mathcal {N}}_{l}),\max ({\mathcal {N}}_{l})] \cap \mathbb {N}_{0}$. In general,
\begin{align*} {\mathcal {N}}_{l} = \begin{cases} \displaystyle [\min ({\mathcal {N}}_{l}), \max ({\mathcal {N}}_{l})] \cap \mathbb {N}_{0} & (\mathrm {if}~\max ({\mathcal {N}}_{l}) < +\infty) \\ \displaystyle [\min ({\mathcal {N}}_{l}), \infty) \cap \mathbb {N}_{0} & (\mathrm {otherwise}). \end{cases}\end{align*}
(b) (Singular-point issue possibly impacts NCP strategy) Any singular point corresponds to a point at infinity in $Q_{N,p}(\boldsymbol {S})$ through $\Phi _{\boldsymbol {S}}^{-1}$ [33, Theorem 2.3 (d)]. Clearly, if a minimizer $\boldsymbol {U}^{\star } \in \mathrm {St} (p,N)$ in Problem 1.1 lies near the singular-point set $E_{N,p}(\boldsymbol {S})$ [28], [33], then $\left \lVert \Phi _{\boldsymbol {S}}(\boldsymbol {U}^{\star }) \right \rVert _{2}$ becomes extremely large, which implies (i) many iterations are required for $\boldsymbol {V}_{n}$ to approach $\Phi _{\boldsymbol {S}}(\boldsymbol {U}^{\star })$, and (ii) a direct application of any Euclidean optimization algorithm to $f\circ \Phi _{\boldsymbol {S}}^{-1}$ may induce slow convergence.
(c) (Suppression of singular-point issue in NCP strategy by ALCP strategy) To suppress the undesired influence caused by $E_{N,p}(\boldsymbol {S}_{[l]})$, the ALCP strategy (Algorithm 1) updates $\boldsymbol {S}_{[l]}$ adaptively so that the parametric expression $\Phi _{\boldsymbol {S}_{[l]}}(\boldsymbol {U}^{\star }) \in Q_{N,p}(\boldsymbol {S}_{[l]})$ of $\boldsymbol {U}^{\star }$ is not too distant from zero by using Algorithm 2 (see Remark 2.2 (d) and Remark 2.3).
(d) (Detection of singular-point issue) We can detect the singular-point issue, in line 8 of Algorithm 1, by checking whether $\left \lVert \widetilde {\boldsymbol {V}}_{n+1} \right \rVert _{2}$ exceeds a predetermined threshold $T > 0$ [42], where $\widetilde {\boldsymbol {V}}_{n+1}=\mathcal {A}^{\langle l \rangle }(\boldsymbol {V}_{n})$ is a tentative update used to judge whether the parametric expression $\widetilde {\boldsymbol {V}}_{n+1}$ of $\boldsymbol {U}_{n+1}$ is too influenced by $E_{N,p}(\boldsymbol {S}_{[l]})$ to keep $\boldsymbol {S}_{[l]}$ or not. We can also use
\begin{equation*} \left \lVert{ \widetilde {\boldsymbol {V}}_{n+1} }\right \rVert _{\mathrm {block}}:= \left \lVert{ [\![\widetilde {\boldsymbol {V}}_{n+1} ]\!]_{11} }\right \rVert _{2} + \left \lVert{ [\![\widetilde {\boldsymbol {V}}_{n+1} ]\!]_{21} }\right \rVert _{2} > T \tag{6}\end{equation*}
in place of $\left \lVert \widetilde {\boldsymbol {V}}_{n+1} \right \rVert _{2} > T$ because $\left \lVert \cdot \right \rVert _{\mathrm {block}}$ is a more computationally efficient (equivalent) norm for $Q_{N,p}(\boldsymbol {S})$ [42].
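A minimal sketch of the detection test based on (6) is given below; the block-partition indices follow the structure of $Q_{N,p}$ in (1), $\lVert\cdot\rVert_{2}$ is taken to be the spectral norm, and the function names are ours.

```python
import numpy as np

def block_norm(V_tilde, p):
    """|| . ||_block in (6): spectral norms of the (1,1) and (2,1) blocks of a point of Q_{N,p}."""
    A = V_tilde[:p, :p]          # [[V]]_{11}: skew-symmetric p x p block
    B = V_tilde[p:, :p]          # [[V]]_{21}: (N-p) x p block
    return np.linalg.norm(A, 2) + np.linalg.norm(B, 2)

def singular_point_issue_detected(V_tilde, p, T):
    """Detection test of Remark 2.2 (d): the tentative update V_tilde is judged to be
    too influenced by E_{N,p}(S_[l]) when its block norm exceeds the threshold T."""
    return block_norm(V_tilde, p) > T
```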
Remark 2.3 (Updating Center Points by Algorithm 2):
To suppress the singular-point issue, a choice of the next center point $\boldsymbol {S}_{[l+1]} \in {\mathrm{ O}}(N)$ is designed, by Algorithm 2, so that the new parametric expression $\Phi _{\boldsymbol {S}_{[l+1]}}(\boldsymbol {U}_{n+1}) \in Q_{N,p}(\boldsymbol {S}_{[l+1]})$ of $\boldsymbol {U}_{n+1}$ is not distant from zero.
B. Nonconvex Optimization Over the Euclidean Space With Convergence Rates
Many Euclidean optimization algorithms have been designed to achieve better convergence. Especially in the machine learning area, convergence rates in terms of the gradient of the cost function are of great interest, e.g., [38], [39], [40], [43], [44], [45], and [41]. Here, we focus on the class of Nonconvex Optimization over the Euclidean space with convergence Rates (NOE-r) defined below (see also Example 2.5).
Definition 2.4 (NOE-r):
Assume that the cost function $J:\mathcal {X}\to \mathbb {R}$, defined on a Euclidean space $\mathcal {X}$, satisfies \begin{align*} \hspace {-1em} \begin{cases} \mathrm {(i)}& J~\mathrm {is~bounded~below.} \\ \mathrm {(ii)}& J~\mathrm {is~differentiable.} \\ \mathrm {(iii)} & \nabla J~\mathrm {is~Lipschitz~continuous~with~its~Lipschitz} \\ & \mathrm {constant}~L_{\nabla J}> 0,~\mathrm {i.e.,~for~all~} \boldsymbol {x}_{1},\boldsymbol {x}_{2} \in \mathcal {X}, \\ & \left \lVert{ \nabla J(\boldsymbol {x}_{1})-\nabla J(\boldsymbol {x}_{2}) }\right \rVert \leq L_{\nabla J} \left \lVert{ \boldsymbol {x}_{1} - \boldsymbol {x}_{2} }\right \rVert. \end{cases} \tag{7}\end{align*}
(a) (Update) Let $\mathcal {N}:= \mathbb {N}_{0}$. Choose an initial guess $\boldsymbol {x}_{\min (\mathcal {N})} (=\boldsymbol {x}_{0}) \in \mathcal {X}$, of a stationary point of $J$, arbitrarily. Estimates $(\boldsymbol {x}_{n})_{n\in \mathcal {N}} \subset \mathcal {X}$ of a stationary point of $J$ are generated by
\begin{equation*} (n\in \mathcal {N}) \quad \mathcal {A}: \left ({\boldsymbol {x}_{n},\mathfrak {R}_{[\min (\mathcal {N}),n]}}\right) \mapsto \boldsymbol {x}_{n+1} \in \mathcal {X}\end{equation*}
with a certain available record $\mathfrak {R}_{[\min (\mathcal {N}),n]}$ (see Example 2.5). For simplicity, we also use the notation $\mathcal {A}(\boldsymbol {x}_{n})$ for a single update from $\boldsymbol {x}_{n}$.
(b) (Convergence rate in terms of gradient sequence) The algorithm $\mathcal {A}$ has a convergence rate $\mathfrak {o}(cn^{-\alpha })$ in terms of gradient sequence, which means in this paper (see also [38, p. 28]) that, for any cost function $J$ satisfying (7), the algorithm $\mathcal {A}$ satisfies
\begin{align*} & (\exists \alpha > 0, \exists c > 0, \forall n \in \mathcal {N}) \\ & \quad \min _{k \in \mathcal {N}, k \leq n} \left \lVert{ \nabla J(\boldsymbol {x}_{k}) }\right \rVert \leq \frac {c}{(n-\min (\mathcal {N})+1)^{\alpha }}. \tag{8}\end{align*}
Example 2.5 (Selected Algorithms in NOE-r):
Many Euclidean optimization algorithms belong to NOE-r of $\mathfrak {o}(cn^{-1/2})$, including the following two examples.
(a) (Gradient descent method) Under the condition (7) for $J$, a gradient descent method updates $\boldsymbol {x}_{n}$ with $\mathfrak {R}_{[\min (\mathcal {N}), n]}:= (\gamma _{n}, \nabla J(\boldsymbol {x}_{n}))$ by \begin{equation*} \mathcal {A}_{\mathrm {GDM}}:\boldsymbol {x}_{n} \mapsto \boldsymbol {x}_{n+1}:=\boldsymbol {x}_{n} - \gamma _{n} \nabla J(\boldsymbol {x}_{n}),\end{equation*} where a stepsize $\gamma _{n}>0$ satisfies, e.g., the Armijo condition (see, e.g., [34]). Then, $\mathcal {A}_{\mathrm {GDM}}$ has a convergence rate $\mathfrak {o}(L_{\nabla J}^{1/2}\Delta _{J}^{1/2}n^{-1/2})$ (see, e.g., [38], [41]), where $\Delta _{J}:= J(\boldsymbol {x}_{0}) - \inf _{\boldsymbol {x}\in \mathcal {X}}J(\boldsymbol {x})$.
satisfying (7), where the NAG achieves the optimal convergence rate in terms of function sequence. The NAG has been extended to nonconvex$J$ satisfying (7) with convergence rate$J$ but in terms of gradient sequence (see (8)) [39], [40], [41]. The following briefly summarizes a special instance of extended NAGs, say rNAG, with a restarting scheme [40, Theorem 4.1]. The rNAG updates$\mathfrak {o}(cn^{-1/2})$ with\begin{align*}\mathcal {A}_{\mathrm {rNAG}}:\boldsymbol {x}_{n} \mapsto \boldsymbol {x}_{n+1}:= \begin{cases} \displaystyle \boldsymbol {y}_{n} - \gamma _{n}\nabla J(\boldsymbol {y}_{n}) & (\mathrm {if}~(10)\mathrm {~holds}) \\ \displaystyle \boldsymbol {x}_{n} & (\mathrm {otherwise}) \end{cases} \tag{9}\end{align*} View Source\begin{align*}\mathcal {A}_{\mathrm {rNAG}}:\boldsymbol {x}_{n} \mapsto \boldsymbol {x}_{n+1}:= \begin{cases} \displaystyle \boldsymbol {y}_{n} - \gamma _{n}\nabla J(\boldsymbol {y}_{n}) & (\mathrm {if}~(10)\mathrm {~holds}) \\ \displaystyle \boldsymbol {x}_{n} & (\mathrm {otherwise}) \end{cases} \tag{9}\end{align*}
based on the test whether the value$\mathfrak {R}_{[\min (\mathcal {N}),n]}:= (\boldsymbol {y}_{n},\gamma _{n},\nabla J(\boldsymbol {y}_{n}))$ sufficiently decreases, i.e.,$J(\boldsymbol {y}_{n} - \gamma _{n}\nabla J(\boldsymbol {y}_{n}))$ or not, where\begin{equation*} J(\boldsymbol {y}_{n} - \gamma _{n}\nabla J(\boldsymbol {y}_{n})) \leq J(\boldsymbol {x}_{n}) - c_{R} \gamma _{n} \left \lVert{ \nabla J(\boldsymbol {y}_{n}) }\right \rVert ^{2}, \tag{10}\end{equation*} View Source\begin{equation*} J(\boldsymbol {y}_{n} - \gamma _{n}\nabla J(\boldsymbol {y}_{n})) \leq J(\boldsymbol {x}_{n}) - c_{R} \gamma _{n} \left \lVert{ \nabla J(\boldsymbol {y}_{n}) }\right \rVert ^{2}, \tag{10}\end{equation*}
is designed with\begin{align*} \boldsymbol {y}_{n}:= \begin{cases} \displaystyle \boldsymbol {x}_{n} & (\mathrm {if}~n = \min (\mathcal {N})) \\ \displaystyle \boldsymbol {x}_{n} + \beta _{n}(\boldsymbol {x}_{n} - \boldsymbol {x}_{n-1}) & (\mathrm {if}~n > \min (\mathcal {N})), \end{cases} \tag{11}\end{align*} View Source\begin{align*} \boldsymbol {y}_{n}:= \begin{cases} \displaystyle \boldsymbol {x}_{n} & (\mathrm {if}~n = \min (\mathcal {N})) \\ \displaystyle \boldsymbol {x}_{n} + \beta _{n}(\boldsymbol {x}_{n} - \boldsymbol {x}_{n-1}) & (\mathrm {if}~n > \min (\mathcal {N})), \end{cases} \tag{11}\end{align*}
,$\beta _{n}:= \frac {n-\min (\mathcal {N})}{n+3-\min (\mathcal {N})}$ is a predetermined small constant, and$c_{R} > 0$ is a stepsize designed to ensure a convergence rate9$\gamma _{n} > 0$ . More precisely, each stepsize$\mathfrak {o}(cn^{-1/2})$ is chosen to satisfy, with a predetermined$\gamma _{n}$ ,$\rho \in (0,1]$ By noting the latter condition in (12) is achieved automatically by\begin{align*} {\begin{array}{cccc}& \frac {\rho }{L_{\nabla J}} \leq \gamma _{n} \leq \gamma _{n-1}~(\mathrm {for}~n > \min (\mathcal {N}))~\mathrm {and} \\ & J(\boldsymbol {y}_{n} - \gamma _{n}\nabla J(\boldsymbol {y}_{n})) \leq J(\boldsymbol {y}_{n}) - \frac {\gamma _{n}}{2} \left \lVert{ \nabla J(\boldsymbol {y}_{n}) }\right \rVert ^{2}. \end{array}}\biggl \} \tag{12}\end{align*} View Source\begin{align*} {\begin{array}{cccc}& \frac {\rho }{L_{\nabla J}} \leq \gamma _{n} \leq \gamma _{n-1}~(\mathrm {for}~n > \min (\mathcal {N}))~\mathrm {and} \\ & J(\boldsymbol {y}_{n} - \gamma _{n}\nabla J(\boldsymbol {y}_{n})) \leq J(\boldsymbol {y}_{n}) - \frac {\gamma _{n}}{2} \left \lVert{ \nabla J(\boldsymbol {y}_{n}) }\right \rVert ^{2}. \end{array}}\biggl \} \tag{12}\end{align*}
, such a$\gamma _{n} \leq L_{\nabla J}^{-1}$ satisfying (12) can be given as the output of Algorithm 3 (backtracking algorithm, see, e.g., [34] for its key idea), by setting in Algorithm 3$\gamma _{n}$ \begin{align*} \gamma _{\mathrm {initial}}:= \begin{cases} \displaystyle \gamma ' & (\mathrm {if}~n=\min (\mathcal {N})) \\ \displaystyle \gamma _{n-1} & (\mathrm {if}~n > \min (\mathcal {N})), \end{cases}\end{align*} View Source\begin{align*} \gamma _{\mathrm {initial}}:= \begin{cases} \displaystyle \gamma ' & (\mathrm {if}~n=\min (\mathcal {N})) \\ \displaystyle \gamma _{n-1} & (\mathrm {if}~n > \min (\mathcal {N})), \end{cases}\end{align*}
, and$\boldsymbol {x}:= \boldsymbol {x}_{n}$ , where$\boldsymbol {y}:=\boldsymbol {y}_{n}$ and$\rho \in (0,1)$ are predetermined arbitrarily. Remark that if the output$\gamma ' > L_{\nabla J}^{-1}$ of Algorithm 3 and$\gamma _{n}$ do not satisfy (10), then (i)$(\boldsymbol {x}_{n},\boldsymbol {y}_{n})$ by (9); (ii)$\boldsymbol {x}_{n+1} = \boldsymbol {x}_{n}$ by (11), therefore the rNAG restarts from$\boldsymbol {y}_{n+1} = \boldsymbol {x}_{n+1} + \beta _{n+1}(\boldsymbol {x}_{n+1}-\boldsymbol {x}_{n}) = \boldsymbol {x}_{n+1}$ .$\boldsymbol {y}_{n+1} = \boldsymbol {x}_{n+1} = \boldsymbol {x}_{n}$
Algorithm 3 Backtracking Algorithm
while
end while
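The following is a minimal NumPy sketch of the rNAG update (9)–(12) together with a backtracking stepsize rule in the spirit of Algorithm 3; it is our illustrative transcription under the stated assumptions (in particular, that the backtracking loop shrinks $\gamma$ by the factor $\rho$ until the descent condition in (12) holds), not the authors' reference implementation, and the default constants (`rho=0.5`, `c_R=1e-4`, `gamma_prime=1.0`) are illustrative choices rather than values from [40] or [42].

```python
import numpy as np

def backtrack(J, g_y, y, gamma_init, rho):
    """Shrink gamma until the sufficient-decrease condition in (12) holds
    (our reading of Algorithm 3): J(y - gamma*g) <= J(y) - (gamma/2)*||g||^2."""
    gamma = gamma_init
    while J(y - gamma * g_y) > J(y) - 0.5 * gamma * np.dot(g_y, g_y):
        gamma *= rho
    return gamma

def rnag(J, gradJ, x0, n_iter, gamma_prime=1.0, rho=0.5, c_R=1e-4):
    """Restarted Nesterov-type accelerated gradient method, cf. (9)-(12)."""
    x_prev, x = x0.copy(), x0.copy()
    gamma = gamma_prime                                 # gamma_initial = gamma' at n = 0
    for n in range(n_iter):
        beta = n / (n + 3.0)                            # beta_n in (11) with min(N) = 0
        y = x if n == 0 else x + beta * (x - x_prev)
        g = gradJ(y)
        gamma = backtrack(J, g, y, gamma, rho)          # keeps gamma_n <= gamma_{n-1}
        x_trial = y - gamma * g
        if J(x_trial) <= J(x) - c_R * gamma * np.dot(g, g):   # test (10)
            x_prev, x = x, x_trial
        else:                                           # restart: x_{n+1} = x_n, y_{n+1} = x_n
            x_prev = x
    return x

# Usage on a toy quadratic J(x) = 0.5 ||x||^2 (gradient x):
x_out = rnag(lambda x: 0.5 * np.dot(x, x), lambda x: x, x0=np.ones(5), n_iter=100)
```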
Convergence Rate Analysis
Recall that (i) $\boldsymbol {S}_{[l]}$ denotes the $l$th center point used in Algorithm 1 and ${\mathcal {N}}_{l}$ the set of iteration indices during which $\boldsymbol {S}_{[l]}$ is used (see Remark 2.2 (a)), and (ii) for every $n \in \mathbb {N}_{0}$, $\ell (n)$ denotes the unique index $l$ satisfying $n \in {\mathcal {N}}_{l}$.
In this section, we present a convergence rate analysis of Algorithm 1 under the following condition (Condition 3.1). More precisely, for Algorithm 1, we first present a center point-wise convergence analysis in Lemma 3.2, then an overall convergence analysis in Theorem 3.3.
Condition 3.1 (For Convergence Analysis of Algorithm 1):
In Algorithm 1, one of the following two cases holds:
(a) For some $l_{\max } \in \mathbb {N}_{0}$, $\mathcal {N}_{l_{\max }} = \bigl [\min (\mathcal {N}_{l_{\max }}), \infty \bigr) \cap \mathbb {N}_{0}$ and ${\mathcal {N}}_{l} = \emptyset ~(\forall l > l_{\max })$. In this case, we use the index set $\Lambda _{\mathrm{ c}}:= \{0,1,\ldots,l_{\max }\}$ of chosen center points during the process of Algorithm 1.
(b) For all $l \in \mathbb {N}_{0}$, ${\mathcal {N}}_{l} = [\min ({\mathcal {N}}_{l}), \max ({\mathcal {N}}_{l})] \cap \mathbb {N}_{0}$ and $\max ({\mathcal {N}}_{l}) + 1 = \min (\mathcal {N}_{l+1})$. In this case, we use the index set $\Lambda _{\mathrm{ c}}:= \mathbb {N}_{0}$ of chosen center points during the process of Algorithm 1.

Lemma 3.2:
Assume that (i) $\nabla f$ is Lipschitz continuous over $\mathrm {St} (p,N)$ (in the sense of Fact 1.1 (b)) and (ii) Condition 3.1 holds for Algorithm 1. Then, the following hold:
(a) For all $l\in \Lambda _{\mathrm{ c}}$, the condition (7) is achieved when $J$ and $\mathcal {X}$ are given respectively by $f_{\boldsymbol {S}_{[l]}}:=f\circ \Phi _{\boldsymbol {S}_{[l]}}^{-1}$ and $Q_{N,p}(\boldsymbol {S}_{[l]})$.
(b) (Center point-wise convergence analysis) Suppose that, for some $c > 0$ and all $l \in \Lambda _{\mathrm{ c}}$, $\mathcal {A}^{\langle l \rangle }$ belongs to NOE-r of $\mathfrak {o}(cn^{-1/2})$ in the sense of Definition 2.4 (b). Let $(\boldsymbol {V}_{n})_{n \in \bigcup _{l\in \Lambda _{\mathrm{ c}}} {\mathcal {N}}_{l}}$ be generated by Algorithm 1 incorporating $\mathcal {A}^{\langle l \rangle }~(l\in \Lambda _{\mathrm{ c}})$. Then, we have
\begin{align*} (l\in \Lambda _{\mathrm{ c}}, n \in {\mathcal {N}}_{l}) & \min _{k \in {\mathcal {N}}_{l}, k \leq n} \left \lVert{ \nabla f_{\boldsymbol {S}_{[l]}}(\boldsymbol {V}_{k}) }\right \rVert _{F} \\ & \quad \leq \frac {c}{(n-\min ({\mathcal {N}}_{l})+1)^{1/2}} \tag{13}\end{align*}
and
\begin{equation*} (n\in \mathbb {N}_{0}) ~\min _{k=0,1\ldots,n} \left \lVert{ \nabla f_{\boldsymbol {S}_{[\ell (k)] }}(\boldsymbol {V}_{k}) }\right \rVert _{F} \leq \sqrt {\frac {\ell (n)+1}{n+1}}c. \tag{14}\end{equation*}
Proof:
(a) The image $f(\mathrm {St}(p,N)) \subset \mathbb {R}$ of $\mathrm {St}(p,N)$ under $f$ is bounded below by the compactness of $\mathrm {St}(p,N)$ and the continuity of $f$. From $\Phi _{\boldsymbol {S}_{[l]}}^{-1}(Q_{N,p}(\boldsymbol {S}_{[l]})) \subset \mathrm {St} (p,N)$, the image $f_{\boldsymbol {S}_{[l]}}(Q_{N,p}(\boldsymbol {S}_{[l]})) \subset f(\mathrm {St}(p,N))$ of $Q_{N,p}(\boldsymbol {S}_{[l]})$ under $f_{\boldsymbol {S}_{[l]}}$ is also bounded below. Regarding the condition (7), see Fact 1.1 for the differentiability of $f_{\boldsymbol {S}_{[l]}}$ and the Lipschitz continuity of $\nabla f_{\boldsymbol {S}_{[l]}}$.
(b) The bound (13) is clear from (a) and the assumption on $\mathcal {A}^{\langle l \rangle }$ (see also (8)).
(Proof of (14)) Clearly, for any $n\in \mathbb {N}_{0}$, we have
\begin{align*} &\hspace {-.5pc} \sum _{l=0}^{\ell (n)-1} \left |{{\mathcal {N}}_{l}}\right |\min _{k\in {\mathcal {N}}_{l}} \left \lVert{ \nabla f_{\boldsymbol {S}_{[l]}}(\boldsymbol {V}_{k}) }\right \rVert _{F}^{2} \\ & \quad + (n-\min (\mathcal {N}_{\ell (n)}) + 1)\min _{\min (\mathcal {N}_{\ell (n)})\leq k \leq n} \left \lVert{ \nabla f_{\boldsymbol {S}_{[\ell (n)] }}(\boldsymbol {V}_{k}) }\right \rVert _{F}^{2} \\ & \geq \left ({n-\min (\mathcal {N}_{\ell (n)}) + 1 + \sum _{l=0}^{\ell (n)-1} \left |{{\mathcal {N}}_{l}}\right |}\right) \\ & \hspace {9em} \times \min _{k=0,1,\ldots,n} \left \lVert{ \nabla f_{\boldsymbol {S}_{[\ell (k)] }}(\boldsymbol {V}_{k}) }\right \rVert _{F}^{2} \\ & = (n+1)\min _{k=0,1,\ldots,n} \left \lVert{ \nabla f_{\boldsymbol {S}_{[\ell (k)] }}(\boldsymbol {V}_{k}) }\right \rVert _{F}^{2}, \tag{15}\end{align*}
where the last equality is verified by (5) and
\begin{align*} &\hspace {-1pc} \sum _{l=0}^{\ell (n)-1} \left |{{\mathcal {N}}_{l}}\right | = \sum _{l=0}^{\ell (n)-1}(\max ({\mathcal {N}}_{l})- \min ({\mathcal {N}}_{l}) + 1) \\ & = \sum _{l=0}^{\ell (n)-1}(\min (\mathcal {N}_{l+1})- \min ({\mathcal {N}}_{l})) \\ & = \min (\mathcal {N}_{\ell (n)})- \min (\mathcal {N}_{0}) = \min (\mathcal {N}_{\ell (n)}) \quad (\because \min (\mathcal {N}_{0}) = 0).\end{align*}
Moreover, by (13), we have
\begin{equation*} (0 \leq l \leq \ell (n)-1) \quad \left |{{\mathcal {N}}_{l}}\right |\min _{k\in {\mathcal {N}}_{l}} \left \lVert{ \nabla f_{\boldsymbol {S}_{[l]}}(\boldsymbol {V}_{k}) }\right \rVert _{F}^{2} \leq c^{2},\end{equation*}
and
\begin{equation*} (n+1-\min (\mathcal {N}_{\ell (n)}))\min _{\min (\mathcal {N}_{\ell (n)})\leq k \leq n} \left \lVert{ \nabla f_{\boldsymbol {S}_{[\ell (n)] }}(\boldsymbol {V}_{k}) }\right \rVert _{F}^{2} \leq c^{2}.\end{equation*}
Combining these bounds with (15), we deduce
\begin{equation*} \min _{k=0,1\ldots,n} \left \lVert{ \nabla f_{\boldsymbol {S}_{[\ell (k)] }}(\boldsymbol {V}_{k}) }\right \rVert _{F}^{2} \leq \frac {\ell (n)+1}{n+1}c^{2},\end{equation*}
which verifies (14).
From (14) in Lemma 3.2 (b), to evaluate the speed of convergence of $\min _{k=0,1,\ldots,n} \left \lVert \nabla f_{\boldsymbol {S}_{[\ell (k)] }}(\boldsymbol {V}_{k}) \right \rVert _{F}$ to zero, it suffices to find $\widehat {c} > 0$ and $\widehat {\alpha } > 0$ achieving \begin{equation*} (n\in \mathbb {N}_{0}) \quad \frac {\ell (n) + 1}{n+1} \leq \frac {\widehat {c}}{(n+1)^{\widehat {\alpha }}}. \tag{16}\end{equation*}
To guarantee such a bound, we propose to modify the restarting condition in line 8 of Algorithm 1 (see Remark 2.2 (d)) to
\begin{align*} \left.{\begin{array}{ccccccccc} & \left \lVert{ \widetilde {\boldsymbol {V}}_{n+1} }\right \rVert _{\mathrm {block}} > T ~\mathrm {and} \\ & \quad n - \min ({\mathcal {N}}_{l}) + 1 \geq \eta (l):= \left \lfloor{ \frac {l}{\tau } }\right \rfloor ^{\theta }, \end{array}}\right \} \tag{17}\end{align*}
with predetermined $\tau, \theta \in \mathbb {N}$, i.e., the center point $\boldsymbol {S}_{[l]}$ is replaced only after it has been used for at least $\eta (l)$ iterations. The condition (17) ensures the polynomial lower bound
\begin{equation*} \hspace {-1.5em} \mathcal {N}_{l+1} \neq \emptyset \Rightarrow \left |{{\mathcal {N}}_{l}}\right | \overset {(5)}{=} \max ({\mathcal {N}}_{l}) - \min ({\mathcal {N}}_{l}) + 1\geq \eta (l), \tag{18}\end{equation*}
on the usage count of each center point.
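A sketch of the modified restarting test (17) follows: the center point $\boldsymbol{S}_{[l]}$ may be replaced only when the block-norm threshold is exceeded and, simultaneously, the current center point has already been used for at least $\eta(l) = \lfloor l/\tau\rfloor^{\theta}$ iterations. Variable names are ours.

```python
import math

def eta(l, tau, theta):
    """Polynomial lower bound eta(l) = floor(l / tau)^theta on the usage count, cf. (17)."""
    return math.floor(l / tau) ** theta

def restart_now(block_norm_value, T, n, n_min_l, l, tau, theta):
    """Modified restarting condition (17): both parts must hold simultaneously."""
    threshold_exceeded = block_norm_value > T
    used_long_enough = (n - n_min_l + 1) >= eta(l, tau, theta)
    return threshold_exceeded and used_long_enough
```

Note that $\eta(l) = 0$ for $l < \tau$, so early restarts are not delayed; the lower bound becomes active only after several center points have been consumed.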
Finally, we present the following overall convergence analysis of Algorithm 1.
Theorem 3.3 (Overall Convergence Analysis):
Assume that (i) the assumptions of Lemma 3.2 hold for Algorithm 1 and (ii) the restarting condition in line 8 of Algorithm 1 is given by (17) with predetermined $\tau, \theta \in \mathbb {N}$. Then, we have \begin{equation*} (n\in \mathbb {N}_{0}) ~\min _{k=0,1,\ldots,n} \left \lVert{ \nabla f_{\boldsymbol {S}_{[\ell (k)] }}(\boldsymbol {V}_{k}) }\right \rVert _{F} \leq \frac {\sqrt {(\sqrt {3}+3)\tau }c}{(n+1)^{\frac {\theta }{2(\theta +1)}}} \tag{19}\end{equation*}
and
\begin{align*} &\hspace {-.5pc} \liminf _{n\to \infty } \left \lVert{ \nabla f_{\boldsymbol {S}_{[\ell (n)] }}(\boldsymbol {V}_{n}) }\right \rVert _{F} \\ &:= \lim _{n\to \infty }\inf \underbrace {\left \{{ \left \lVert{ \nabla f_{\boldsymbol {S}_{[\ell (k)] }}(\boldsymbol {V}_{k}) }\right \rVert _{F}\mid k \geq n}\right \}}_{=:\mathfrak {G}_{n}} = 0. \tag{20}\end{align*}
Proof:
See Appendix B.
Remark 3.4 (Interpretation of Theorem 3.3):
(a) As stated just after Problem 1.1, a realistic goal of Problem 1.1 is to find its stationary point $\boldsymbol {U}^{\star } \in \mathrm {St} (p,N)$. More precisely, according to Fact 2.1 (b), it is desired to approximate $(\boldsymbol {V}^{\star }, \boldsymbol {S}^{\star }) \in Q_{N,p}\times {\mathrm{ O}}(N)$ satisfying $\nabla f_{\boldsymbol {S}^{\star }}(\boldsymbol {V}^{\star }) = \boldsymbol {0}$ with a certain sequence $(\boldsymbol {V}_{n},\boldsymbol {S}_{[\ell (n)] })_{n=0}^{\infty } \subset Q_{N,p}\times {\mathrm{ O}}(N)$, e.g., by Algorithm 1. Clearly, the RHS in (19) converges to zero as $n\to \infty $. This upper bound of $\min _{k=0,1,\ldots,n} \left \lVert \nabla f_{\boldsymbol {S}_{[\ell (k)] }}(\boldsymbol {V}_{k}) \right \rVert _{F}$ tells us how fast the gradient approaches zero.
(b) The evaluations shown in (19) and (20) can also apply to the NCP strategy (incorporating an algorithm of NOE-r as $\mathcal {A}^{\langle 0 \rangle }$) as a special case of the ALCP strategy (Algorithm 1) by setting a sufficiently large $T > 0$ in (17).
Numerical Experiments
In the experiments, we demonstrate the numerical performance of Algorithm 1 incorporating the restarting condition (17) with the rNAG in Example 2.5 (b) employed as $\mathcal {A}^{\langle l \rangle }$ (referred to as rNAG+ALCP), in comparison with the NCP strategy (rNAG+NCP) and the retraction-based strategy (rNAG+Retr). Each algorithm was terminated when
\begin{align*} \begin{cases} \displaystyle n=1000, \\ \displaystyle \quad \mathrm {or} \\ \displaystyle \frac { \left \lVert{ \boldsymbol {G}_{n} }\right \rVert _{F}}{ \left \lVert{ \boldsymbol {G}_{0} }\right \rVert _{F}} < 10^{-6}, \\ \displaystyle \quad \mathrm {or} \\ \displaystyle \left |{f(\boldsymbol {U}_{n})-f(\boldsymbol {U}_{n-1})}\right | + \left |{f(\boldsymbol {U}_{n-1})-f(\boldsymbol {U}_{n-2})}\right | < 10^{-6}, \end{cases} \tag{21}\end{align*}
is achieved, where
\begin{align*} \boldsymbol {G}_{n}:= \begin{cases} \displaystyle \nabla (f\circ \Phi _{\boldsymbol {S}_{[\ell (n)] }}^{-1})(\boldsymbol {V}_{n}) & \textrm {(rNAG+ALCP, rNAG+NCP)} \\ \displaystyle \mathop {\mathrm {grad}}f(\boldsymbol {U}_{n}) & \textrm {(rNAG+Retr)}. \end{cases}\end{align*}
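The termination test (21) can be coded directly; the sketch below is our transcription, with `grad_norm0` denoting $\lVert\boldsymbol{G}_{0}\rVert_{F}$ and `f_hist` the history of cost values.

```python
def should_stop(n, grad_norm, grad_norm0, f_hist, max_iter=1000, tol=1e-6):
    """Stopping rule (21): iteration budget, relative gradient norm, or
    stagnation of the last three cost values f(U_n), f(U_{n-1}), f(U_{n-2})."""
    if n == max_iter:
        return True
    if grad_norm / grad_norm0 < tol:
        return True
    if len(f_hist) >= 3 and abs(f_hist[-1] - f_hist[-2]) + abs(f_hist[-2] - f_hist[-3]) < tol:
        return True
    return False
```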
All the experiments were performed in MATLAB on a MacBook Pro (13-inch, M1, 2020) with 16 GB of RAM. To evaluate the numerical performance of rNAG+ALCP, we considered the following standard test problems: (i) the linear eigenvalue problem, e.g., [25]; (ii) the nonlinear eigenvalue problem, e.g., [21]. For each problem, we generated an initial guess $\boldsymbol {U}_{0} \in \mathrm {St} (p,N)$ randomly.
Linear eigenvalue problem: For a given symmetric matrix $\boldsymbol {A} \in \mathbb {R}^{N\times N}$, we considered \begin{equation*} \mathop {\mathrm {minimize}}_{\boldsymbol {U}\in \mathrm {St} (p,N)} f_{1}(\boldsymbol {U}):=- \mathrm {Tr}(\boldsymbol {U}^{ \mathsf {T}}\boldsymbol {A}\boldsymbol {U}). \tag{22}\end{equation*}
Any global minimizer $\boldsymbol {U}^{\star }$ of the problem (22) is an orthonormal eigenbasis associated with the $p$ largest eigenvalues of $\boldsymbol {A}$ [25]. In our experiment, we used $\boldsymbol {A}:=\widetilde {\boldsymbol {A}}^{ \mathsf {T}}\widetilde {\boldsymbol {A}} \in \mathbb {R}^{N\times N}$ with randomly chosen $\widetilde {\boldsymbol {A}} \in \mathbb {R}^{N\times N}$ of which each entry is sampled from the standard normal distribution.
with some diagonal matrix\begin{equation*} \mathrm {find}~\boldsymbol {U}^{\star } \in \mathrm {St} (p,N)~\mathrm {s.t.}~H(\boldsymbol {U}^{\star })\boldsymbol {U}^{\star } = \boldsymbol {U}^{\star }\boldsymbol {D} \tag{23}\end{equation*} View Source\begin{equation*} \mathrm {find}~\boldsymbol {U}^{\star } \in \mathrm {St} (p,N)~\mathrm {s.t.}~H(\boldsymbol {U}^{\star })\boldsymbol {U}^{\star } = \boldsymbol {U}^{\star }\boldsymbol {D} \tag{23}\end{equation*}
, where$\boldsymbol {D} \in \mathbb {R}^{N\times N}$ is given with a symmetric Laplacian matrix$H(\boldsymbol {U}):= \boldsymbol {L}+ \zeta \mathrm {diag}(\boldsymbol {L}^{-1}\psi (\boldsymbol {U})) \in \mathbb {R}^{N\times N}$ ,$\boldsymbol {L} \in \mathbb {R}^{N\times N}$ ,$\psi (\boldsymbol {U}):=\mathrm {Diag}(\boldsymbol {U}\boldsymbol {U}^{ \mathsf {T}}) \in \mathbb {R}^{N}$ , and$\zeta > 0$ . The solution of (23) can be characterized as a global minimizer of (see, e.g., [20])$\mathrm {Diag}:\mathbb {R}^{N\times N}\to \mathbb {R}^{N}:\boldsymbol {X}\mapsto [[\boldsymbol {X}]_{11}~[\boldsymbol {X}]_{22}~\cdots ~[\boldsymbol {X}]_{NN}]$ In our experiment, we used\begin{equation*} \mathop {\mathrm {minimize}}_{\boldsymbol {U}\in \mathrm {St} (p,N)} f_{2}(\boldsymbol {U}):=\frac {1}{2} \mathrm {Tr}(\boldsymbol {U}^{ \mathsf {T}}\boldsymbol {L}\boldsymbol {U}) + \frac {\zeta }{4}\psi (\boldsymbol {U})^{ \mathsf {T}}\boldsymbol {L}^{-1}\psi (\boldsymbol {U}). \tag{24}\end{equation*} View Source\begin{equation*} \mathop {\mathrm {minimize}}_{\boldsymbol {U}\in \mathrm {St} (p,N)} f_{2}(\boldsymbol {U}):=\frac {1}{2} \mathrm {Tr}(\boldsymbol {U}^{ \mathsf {T}}\boldsymbol {L}\boldsymbol {U}) + \frac {\zeta }{4}\psi (\boldsymbol {U})^{ \mathsf {T}}\boldsymbol {L}^{-1}\psi (\boldsymbol {U}). \tag{24}\end{equation*}
and$\zeta = 3$ as one-dimensional discrete Laplacian with 2 on the diagonal and −1 on the sub- and super- diagonals by following [21].$\boldsymbol {L}$
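To make the two test problems concrete, the Euclidean costs and gradients can be sketched as follows; the gradient formula $\nabla f_{2}(\boldsymbol{U}) = H(\boldsymbol{U})\boldsymbol{U}$ is our own short derivation from (24) under the definitions above, and the test-data construction mirrors the experimental setup.

```python
import numpy as np

def f1_and_grad(U, A):
    """Cost (22) and its Euclidean gradient: f1(U) = -Tr(U^T A U), grad f1(U) = -2 A U."""
    return -np.trace(U.T @ A @ U), -2.0 * A @ U

def f2_and_grad(U, L, zeta):
    """Cost (24) and its Euclidean gradient grad f2(U) = H(U) U (our derivation)."""
    psi = np.sum(U * U, axis=1)                  # psi(U) = Diag(U U^T)
    Linv_psi = np.linalg.solve(L, psi)           # L^{-1} psi(U)
    f = 0.5 * np.trace(U.T @ L @ U) + 0.25 * zeta * psi @ Linv_psi
    grad = L @ U + zeta * Linv_psi[:, None] * U  # (L + zeta diag(L^{-1} psi(U))) U
    return f, grad

# Test data following the experimental setup: A = A_tilde^T A_tilde with standard
# normal A_tilde, and L the 1-D discrete Laplacian (2 on the diagonal, -1 off-diagonal).
rng = np.random.default_rng(0)
N, p, zeta = 100, 5, 3.0
A_tilde = rng.standard_normal((N, N))
A = A_tilde.T @ A_tilde
L = 2.0 * np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)
U0, _ = np.linalg.qr(rng.standard_normal((N, p)))
```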
Tables 1 and 2 summarize respectively the average results for problems (22) and (24) over 10 trials with
Figure 1. Convergence histories of each algorithm applied to the problem (22).
Figure 2. Convergence histories of each algorithm applied to the problem (24).
Remark 4.1 (Main Findings From Numerical Experiments):
(a) (Comparison of rNAG+ALCP and rNAG+NCP) As expected from the design concept of the ALCP strategy (see the paragraphs just before and after Problem 1.3, and Remark 2.2 (c)), Figures 1 and 2 show that rNAG+ALCP dramatically improves the convergence speed over rNAG+NCP.
(b) (Comparison of rNAG+Retr and rNAG+ALCP) In terms of the number of iterations in Tables 1 and 2, the performance of rNAG+ALCP is comparable to that of rNAG+Retr. From Figures 1 and 2, we found that the proposed rNAG+ALCP, for all $\tau,\theta \in \{1, 10\}$, outperforms rNAG+Retr in terms of CPU time.
(c) (Robustness against the choices of $\tau,\theta $ used in (17)) From Figures 1 and 2, the numerical performance of rNAG+ALCP is robust against the chosen $\tau, \theta $. Moreover, from Tables 1 and 2, for rNAG+ALCP with $(\tau,\theta) \neq (1,10)$, we found that all results are the same except for "time". This is mainly because, for rNAG+ALCP with $(\tau,\theta) \neq (1,10)$, (i) $\eta (l) \leq 1$ holds during the process of the algorithm due to the small number of restarts (see (17) and "restart"); (ii) the latter condition $n - \min ({\mathcal {N}}_{l}) + 1 \geq \eta (l)$ in (17) always holds true; (iii) every center point was updated at the same iteration.
Conclusion
We presented a convergence rate analysis of the adaptive localized Cayley parametrization (ALCP) strategy for optimization over the Stiefel manifold with a new restarting condition. In the ALCP strategy, an optimization problem over the Stiefel manifold is translated into optimization problems over the Euclidean space with the tunable generalized left-localized Cayley transform. The proposed analysis provides a convergence rate, in terms of gradient sequence, for the ALCP strategy. The numerical experiments demonstrate the effectiveness of the ALCP strategy compared with the retraction-based strategy and the naive Cayley parametrization strategy.
Appendix A
Gradient of Function After the Cayley Parametrization
We briefly review the gradient of $f_{\boldsymbol {S}}:= f\circ \Phi _{\boldsymbol {S}}^{-1}$ over $Q_{N,p}(\boldsymbol {S})$, which is used in Fact 2.1 and Lemma 3.2.
Fact 1.1 (Gradient after the Cayley Parametrization):
For a differentiable function $f$ and a center point $\boldsymbol {S} \in {\mathrm{ O}}(N)$, the following hold for $f_{\boldsymbol {S}}:= f\circ \Phi _{\boldsymbol {S}}^{-1}$:
(a) ([33, Proposition 2.9, Remark 2.11]) $f_{\boldsymbol {S}}$ is differentiable, and the computational complexity of $f_{\boldsymbol {S}}$ with $\boldsymbol {S}$ designed by Algorithm 2 is $\mathfrak {o}(Np^{2}+p^{3})$ flops.
(b) ([33, Proposition A.11]) Assume the Lipschitz continuity of $\nabla f$ with its Lipschitz constant $L_{\nabla f} > 0$, i.e.,
\begin{align*} & (\forall \boldsymbol {U}_{1},\boldsymbol {U}_{2} \in \mathrm {St} (p,N)) \\ & \quad \left \lVert{ \nabla f(\boldsymbol {U}_{1})-\nabla f(\boldsymbol {U}_{2}) }\right \rVert _{F} \leq L_{\nabla f} \left \lVert{ \boldsymbol {U}_{1} - \boldsymbol {U}_{2} }\right \rVert _{F}.\end{align*}
Then, the gradient $\nabla f_{\boldsymbol {S}}$ is Lipschitz continuous over $Q_{N,p}(\boldsymbol {S})$ with its Lipschitz constant $4(\mu + L_{\nabla f})$, i.e.,
\begin{align*} & (\forall \boldsymbol {V}_{1},\boldsymbol {V}_{2} \in Q_{N,p}(\boldsymbol {S})) \\ & \quad \left \lVert{ \nabla f_{\boldsymbol {S}}(\boldsymbol {V}_{1}) - \nabla f_{\boldsymbol {S}}(\boldsymbol {V}_{2}) }\right \rVert _{F} \leq 4(\mu + L_{\nabla f}) \left \lVert{ \boldsymbol {V}_{1}-\boldsymbol {V}_{2} }\right \rVert _{F},\end{align*}
where $\mu \geq \max _{\boldsymbol {U}\in \mathrm {St} (p,N)} \left \lVert{ \nabla f(\boldsymbol {U}) }\right \rVert _{2}$.
Appendix B
Proof of Theorem 3.3
The following claim is required for our analysis.
Claim 2.1:
Let $\tau, \theta \in \mathbb {N}$ be as in (17), and let $\Upsilon, n \in \mathbb {N}_{0}$. Then, the following hold:
(a) \begin{equation*} (\Upsilon \geq 1) \quad \sum _{l=0}^{\Upsilon -1}\left \lfloor{ \frac {l}{\tau } }\right \rfloor ^{\theta } \geq \tau \frac {\left({ \left \lfloor{ \frac {\Upsilon }{\tau } }\right \rfloor -1}\right)^{\theta +1}}{\theta +1}. \tag{25}\end{equation*}
(b) \begin{align*} &\hspace {-1pc} \mathrm {If}~n + 1\geq \sum _{l=0}^{\max \{\Upsilon - 1,0\}}\left \lfloor{ \frac {l}{\tau }}\right \rfloor ^{\theta }, ~\mathrm {then} \\ & \left.{\begin{array}{ccccccc}& \Upsilon + 1 \leq \tau (\sqrt {3}+3) \sqrt [\theta +1]{n+1} \\ & \left ({ \Leftrightarrow \frac {\Upsilon + 1}{\tau \times \sqrt [\theta +1]{n+1}} \leq \sqrt {3}+3 }\right) \end{array}}\right \} \tag{26}\end{align*}
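Both inequalities in Claim 2.1 are elementary and easy to sanity-check numerically; the following sketch (purely illustrative, not part of the proof) verifies (25) in the regime $\Upsilon \geq \tau$, which is where it is invoked in Step 2 of the proof of (26) below, and verifies (26) for the smallest admissible $n$.

```python
import math

def S(m, tau, theta):
    """sum_{l=0}^{m-1} floor(l/tau)^theta (an empty sum equals 0)."""
    return sum((l // tau) ** theta for l in range(m))

for tau in (1, 2, 3):
    for theta in (1, 2, 5, 10):
        for Up in range(0, 60):
            if Up >= tau:
                # (25), checked where Step 2 of the proof of (26) uses it
                assert S(Up, tau, theta) >= tau * (Up // tau - 1) ** (theta + 1) / (theta + 1)
            # (26), checked at the smallest admissible n (larger n only loosens the bound)
            n = max(S(max(Up - 1, 0) + 1, tau, theta) - 1, 0)
            assert Up + 1 <= tau * (math.sqrt(3) + 3) * (n + 1) ** (1.0 / (theta + 1))
print("Claim 2.1 verified on the sampled grid.")
```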
Proof:
(Proof of (a)) (Step 1) We first show
\begin{equation*} \sum _{l=0}^{\Upsilon -1}\left \lfloor{ \frac {l}{\tau } }\right \rfloor ^{\theta } \geq \tau \sum _{m=0}^{ \left \lfloor{ \frac {\Upsilon }{\tau } }\right \rfloor -1}m^{\theta }. \tag{27}\end{equation*}
Since (i) the RHS of (27) is invariant for $\Upsilon \in [\kappa \tau, (\kappa +1)\tau -1]$ with every $\kappa \in \mathbb {N}_{0}$ and (ii) the LHS of (27) is monotonically increasing for $\Upsilon $, it is sufficient to verify the inequality (27) specially for the case $\Upsilon = \kappa \tau $. In this case, the LHS of (27) is evaluated as
\begin{align*} &\hspace {-1pc} \sum _{l=0}^{\kappa \tau -1}\left \lfloor{ \frac {l}{\tau } }\right \rfloor ^{\theta } = \sum _{m=0}^{\kappa -1} \sum _{i=0}^{\tau -1} \left \lfloor{ \frac {m\tau + i}{\tau } }\right \rfloor ^{\theta } =\sum _{m=0}^{\kappa -1} \sum _{i=0}^{\tau -1} m^{\theta } \\ & = \tau \sum _{m=0}^{\kappa -1} m^{\theta } = \tau \sum _{m=0}^{\frac {\Upsilon }{\tau }-1} m^{\theta } = \tau \sum _{m=0}^{ \left \lfloor{ \frac {\Upsilon }{\tau } }\right \rfloor -1} m^{\theta },\end{align*}
which equals the RHS of (27).
(Step 2) By combining (27) and the clear relation
\begin{equation*} \tau \sum _{m=0}^{ \left \lfloor{ \frac {\Upsilon }{\tau } }\right \rfloor -1}m^{\theta } \geq \tau \int ^{ \left \lfloor{ \frac {\Upsilon }{\tau } }\right \rfloor -1}_{0} x^{\theta } dx = \tau \frac {\left({ \left \lfloor{ \frac {\Upsilon }{\tau } }\right \rfloor -1}\right)^{\theta +1}}{\theta +1},\end{equation*}
we deduce (25).
(Proof of (b)) (Step 1) For the special case of $\Upsilon \in [0, \tau -1]$, (26) is clear by $\Upsilon + 1 \leq \tau \leq \tau (3+\sqrt {3})\leq \tau (3+\sqrt {3})\sqrt [\theta +1]{n+1}$.
(Step 2) For the other cases $\Upsilon \in [\tau, \infty)$, combining (25) and the assumption of (26), we have
\begin{equation*} n+1 \geq \sum _{l=0}^{\Upsilon -1}\left \lfloor{ \frac {l}{\tau } }\right \rfloor ^{\theta } \geq \tau \frac {\left({ \left \lfloor{ \frac {\Upsilon }{\tau } }\right \rfloor -1}\right)^{\theta +1}}{\theta +1},\end{equation*}
which implies
\begin{equation*} \sqrt [\theta +1]{\tau ^{-1}(\theta +1)(n+1)} \geq \left \lfloor{ \frac {\Upsilon }{\tau } }\right \rfloor - 1 \geq \frac {\Upsilon }{\tau } - 2,\end{equation*}
equivalently
\begin{equation*} \Upsilon + 1 \leq \sqrt [\theta +1]{\tau ^{\theta }(\theta +1)(n+1)} + 2\tau + 1.\end{equation*}
(Step 3) By evaluating the RHS of (26), we derive the desired upper bound:
\begin{align*} &\hspace {-1pc} \frac {\sqrt [\theta +1]{\tau ^{\theta } (\theta +1)(n+1)}+2\tau + 1}{\tau \times \sqrt [\theta +1]{n+1}} \\ & = \tau ^{-1/(\theta +1)}\sqrt [\theta +1]{\theta +1} + \frac {2\tau + 1}{\tau \times \sqrt [\theta +1]{n+1}} \\ & \leq \sqrt [\theta +1]{\theta +1} + \frac {2\tau + 1}{\tau }~(\because \tau \geq 1, n \geq 0) \\ & \leq \left ({(\theta +1)+1}\right)^{1/(\theta +1)} + 3 \leq \sqrt {3} + 3,\end{align*}
where in the last inequality the monotone decreasing property of $\left ({(\theta +1)+1}\right)^{1/(\theta +1)}~(\theta \in \mathbb {N})$ is used.
Now, we are ready to prove Theorem 3.3.
Proof of Theorem 3.3:
For any $n \in \mathbb {N}_{0}$, we first verify
\begin{equation*} n + 1 \geq \sum _{l=0}^{\max \{\ell (n)-1, 0\}} \left \lfloor{ \frac {l}{\tau } }\right \rfloor ^{\theta }. \tag{28}\end{equation*}
Indeed, in the case of $\ell (n) = 0$, (28) holds because its RHS equals $\left \lfloor 0/\tau \right \rfloor ^{\theta } = 0$; otherwise, by (5), (17), and (18), we have
\begin{equation*} n+1 \geq \sum _{l=0}^{\ell (n)-1} \left |{{\mathcal {N}}_{l}}\right | \overset {(18)}{\geq } \sum _{l=0}^{\ell (n)-1}\eta (l) \overset {(17)}{=} \sum _{l=0}^{\ell (n)-1} \left \lfloor{ \frac {l}{\tau } }\right \rfloor ^{\theta }.\end{equation*}
By applying Claim 2.1 (b) with $\Upsilon:= \ell (n)$ to (28), we have $\ell (n) + 1 \leq \tau (\sqrt {3}+3)\sqrt [\theta +1]{n+1}$, i.e., (16) holds with $\widehat {c}:= (\sqrt {3}+3)\tau $ and $\widehat {\alpha }:= \frac {\theta }{\theta +1}$. Combining this bound with (14) in Lemma 3.2 (b) yields (19).
Next, we show (20). From (19), we have
\begin{equation*} 0\leq \inf \mathfrak {G}_{0} \leq \lim _{n\to \infty } \frac {\sqrt {(\sqrt {3}+3)\tau }c}{(n+1)^{\frac {\theta }{2(\theta +1)}}} = 0,\end{equation*}
i.e., $\inf \mathfrak {G}_{0} = 0$. Consider the case where
\begin{equation*} (\forall n \in \mathbb {N}_{0})~\mathfrak {g}_{n}:=\min _{0\leq k \leq n} \left \lVert{ \nabla f_{\boldsymbol {S}_{[\ell (k)] }}(\boldsymbol {V}_{k}) }\right \rVert _{F} > 0.\end{equation*}
Then, for every $n \in \mathbb {N}_{0}$,
\begin{equation*} \inf \mathfrak {G}_{0} = \min \{\mathfrak {g}_{n}, \inf \mathfrak {G}_{n}\}= 0,\end{equation*}
which, together with $\mathfrak {g}_{n} > 0$, implies $\inf \mathfrak {G}_{n} = 0$ for every $n \in \mathbb {N}_{0}$, and hence (20) holds in this case.