
Adaptive Localized Cayley Parametrization for Optimization Over Stiefel Manifold and Its Convergence Rate Analysis




Abstract:

The Adaptive Localized Cayley Parametrization (ALCP) strategy for orthogonality constrained optimization has been proposed as a scheme to utilize Euclidean optimization algorithms on an adaptive parametrization of the Stiefel manifold via a tunable generalized left-localized Cayley parametrization. Thanks to the adaptive parametrization, the ALCP strategy enjoys stable convergence of the estimate sequence of a minimizer by avoiding a certain singular-point issue. In this paper, we present a convergence rate analysis for the ALCP strategy using Euclidean optimization algorithms such as Nesterov-type accelerated gradient methods. Numerical experiments demonstrate the effectiveness of the ALCP strategy.
Graphical abstract: Numerical comparison between the retraction-based (Retr) strategy, the naive Cayley parametrization (NCP) strategy, and the proposed adaptive localized Cayley parametrization (ALCP) strategy.
Published in: IEEE Access ( Volume: 12)
Page(s): 31312 - 31323
Date of Publication: 21 February 2024
Electronic ISSN: 2169-3536

License: CC BY — this article is distributed under the Creative Commons Attribution 4.0 License (https://creativecommons.org/licenses/by/4.0/).
SECTION I.

Introduction

Orthogonality constrained optimization problems have played crucial roles in signal processing and machine learning applications [1], [2], [3]. Such applications include sparse principal component analysis [4], [5], [6], 1-bit compressed sensing [7], [8], the joint diagonalization problem for independent component analysis [9], [10], [11], the orthogonal Procrustes problem [12], [13], [14], [15], the nearest low-rank correlation matrix problem [16], [17], [18], the nonlinear eigenvalue problem [19], [20], [21], and the enhancement of the generalization performance in deep neural networks [22], [23], [24]. The orthogonality constraint set defined by \mathrm {St}(p,N):= \{\boldsymbol {U} \in \mathbb {R}^{N\times p} \mid \boldsymbol {U}^{ \mathsf {T}}\boldsymbol {U} = \boldsymbol {I}_{p}\} (p\leq N) is called the Stiefel manifold. In this paper, we consider the following optimization problem over \mathrm {St}(p,N) :

Problem 1.1:

For (N,p) \in \mathbb {N}^{2} satisfying p \leq N and a given differentiable function f:\mathbb {R}^{N\times p }\to \mathbb {R} , \begin{equation*} \mathrm {find}~\boldsymbol {U}^{\star } \in \mathop {\mathrm {argmin}}\limits _{\boldsymbol {U}\in \mathrm {St} (p,N)} ~f(\boldsymbol {U}).\end{equation*}

A realistic goal for Problem 1.1 is to find a stationary point \boldsymbol {U}^{\star } \in \mathrm {St} (p,N) , which satisfies a necessary condition for a local minimizer of f over \mathrm {St}(p,N) [25], [26], [27], because Problem 1.1 is a nonconvex optimization problem in general. Problem 1.1 is challenging mainly because of the severe nonlinearity of \mathrm {St}(p,N) .

A family of Cayley-type transforms [22], [26], [28], [29], [30], [31], [32], [33] resolves the nonlinearity of \mathrm {St}(p,N) in Problem 1.1 by translating \mathrm {St}(p,N) into a Euclidean space. With a generalized left-localized Cayley transform \Phi _{\boldsymbol {S}}: \mathrm {St}(p,N)\setminus E_{N,p}(\boldsymbol {S}) \to Q_{N,p}(\boldsymbol {S}) ~(\boldsymbol {S}\in {\mathrm{ O}}(N):= \mathrm {St}(N,N)) [33] (see (2)), the subset \mathrm {St}(p,N) \setminus E_{N,p}(\boldsymbol {S}) can be parameterized in terms of the (Np-\frac {1}{2}p(p+1)) -dimensional subspace \begin{align*} Q_{N,p}(\boldsymbol {S}) &:= Q_{N,p}:=\left \{{ \left.{\begin{bmatrix} \boldsymbol {A} & -\boldsymbol {B}^{ \mathsf {T}} \\ \boldsymbol {B} & \boldsymbol {0} \end{bmatrix} \in \mathbb {R}^{N\times N} }\right \vert \; \substack { -\boldsymbol {A}^{ \mathsf {T}} = \boldsymbol {A} \in \mathbb {R}^{p\times p}, \\ \boldsymbol {B} \in \mathbb {R}^{(N-p)\times p}}}\right \} \\ & \equiv \mathbb {R}^{Np-\frac {1}{2}p(p+1)} \tag{1}\end{align*} of the set of all skew-symmetric matrices Q_{N,N}:= \{\boldsymbol {V} \in \mathbb {R}^{N\times N} \mid \boldsymbol {V}^{ \mathsf {T}} = -\boldsymbol {V}\} , where the singular-point set E_{N,p}(\boldsymbol {S}) \subset \mathrm {St} (p,N) is the subset of \mathrm {St}(p,N) on which \Phi _{\boldsymbol {S}} cannot be defined (see (3)). The subset \mathrm {St}(p,N)\setminus E_{N,p}(\boldsymbol {S}) is open and dense for p < N [33, Theorem 2.3], and therefore we can parameterize almost all points in \mathrm {St}(p,N) with \Phi _{\boldsymbol {S}} and its inverse \Phi _{\boldsymbol {S}}^{-1}:Q_{N,p}(\boldsymbol {S}) \to \mathrm {St} (p,N) \setminus E_{N,p}(\boldsymbol {S}) (see (4)).

As a result, Problem 1.1 can be reformulated as

Problem 1.2:

For a given differentiable function f:\mathbb {R}^{N\times p}\to \mathbb {R} , \boldsymbol {S} \in {\mathrm{ O}}(N) , and \epsilon > 0 , \begin{align*} \textrm {find} ~& \boldsymbol {V}^{\star } \in Q_{N,p}(\boldsymbol {S}) \\ & \textrm {such that} ~f\circ \Phi _{\boldsymbol {S}}^{-1}(\boldsymbol {V}^{\star }) < \min f(\mathrm {St}(p,N)) + \epsilon.\end{align*}

The naive Cayley parametrization (NCP) strategy [28], [33] updates (i) estimates (\boldsymbol {V}_{n})_{n=0}^{\infty } \subset Q_{N,p}(\boldsymbol {S}) of a solution \boldsymbol {V}^{\star } to Problem 1.2 by using any optimization algorithm designed for a Euclidean space (called a Euclidean optimization algorithm in this paper for short), and (ii) estimates \boldsymbol {U}_{n}:=\Phi _{\boldsymbol {S}}^{-1}(\boldsymbol {V}_{n}) \in \mathrm {St} (p,N) of a minimizer \boldsymbol {U}^{\star } of Problem 1.1. Such Euclidean optimization algorithms include, e.g., the gradient descent method, the Newton method [34], the quasi-Newton method [35], the conjugate gradient method [36], and the Nesterov-type accelerated gradient method [37], [38], [39], [40], [41].

However, in a case where \boldsymbol {U}^{\star } is close to the singular-point set E_{N,p}(\boldsymbol {S}) , the NCP strategy may suffer from slow convergence of the sequence (\boldsymbol {V}_{n})_{n=0}^{\infty } [28], [33]. We call this performance degradation the singular-point issue.

To suppress the singular-point issue in the NCP strategy, we recently proposed an Adaptive Localized Cayley Parametrization (ALCP) strategy [42], which adaptively adjusts the location of singular points, for

Problem 1.3:

For a given differentiable function f:\mathbb {R}^{N\times p}\to \mathbb {R} and \epsilon > 0 , \begin{align*} \textrm {find} ~& (\boldsymbol {V}^{\star },\boldsymbol {S}^{\star }) \in Q_{N,p} \times {\mathrm{ O}}(N) \\ & \textrm {such that} ~f\circ \Phi _{\boldsymbol {S}^{\star }}^{-1}(\boldsymbol {V}^{\star }) < \min f(\mathrm {St}(p,N)) + \epsilon.\end{align*}

In the ALCP strategy, until the singular-point issue is detected by a computable restarting condition, we keep the same \boldsymbol {S} for updating \boldsymbol {V}_{n} \in Q_{N,p}(\boldsymbol {S}) by using Euclidean optimization algorithms for Problem 1.2 with \boldsymbol {S} , in analogy with the NCP strategy (see also the later part [“Related works and primitive goal of this paper”] in this section). After the detection of the singular-point issue at the tentative estimate \widetilde {\boldsymbol {V}}_{n+1} \in Q_{N,p}(\boldsymbol {S}) , we update \boldsymbol {S} \in {\mathrm{ O}}(N) to \boldsymbol {S}' \in {\mathrm{ O}}(N) so that \boldsymbol {U}_{n+1}:= \Phi _{\boldsymbol {S}}^{-1}(\widetilde {\boldsymbol {V}}_{n+1}) \in \mathrm {St} (p,N) is distant from the new singular-point set E_{N,p}(\boldsymbol {S}') . The parametric expression of \boldsymbol {U}_{n+1} is updated from \widetilde {\boldsymbol {V}}_{n+1} \in Q_{N,p}(\boldsymbol {S}) to \boldsymbol {V}_{n+1}:= \Phi _{\boldsymbol {S}'}(\boldsymbol {U}_{n+1}) \in Q_{N,p}(\boldsymbol {S}') , and the minimization of f\circ \Phi _{\boldsymbol {S}'}^{-1} over Q_{N,p}(\boldsymbol {S}') is restarted with the initial guess \boldsymbol {V}_{n+1}\in Q_{N,p}(\boldsymbol {S}') . This restarting technique computationally realizes an adaptively parameterized space and improves the numerical performance of the NCP strategy by avoiding the impact of the singular-point issue.

Convergence rates of Euclidean optimization algorithms, in terms of the gradient sequence, have been of great interest because they quantify how fast the gradient decays to zero. Indeed, many Euclidean optimization algorithms, e.g., the Nesterov-type accelerated gradient method [39], [40], [41] and the adaptive momentum optimization algorithms [43], [44], [45], have been designed carefully with guaranteed convergence rates. We call the class of such Euclidean optimization algorithms Nonconvex Optimization over the Euclidean space with convergence Rates (NOE-r) (see Definition 2.4 for the formal definition of NOE-r).

So far, no convergence rate analysis for the ALCP strategy has been established. Here, we encounter the following fundamental question:

Can we derive any convergence rate of the ALCP strategy, in terms of gradient sequence, by incorporating NOE-r?

To establish a convergence rate for the ALCP strategy, we introduce a modification of the restarting condition of the ALCP strategy [33] that imposes a polynomial lower bound on the usage count of each center point. Thanks to the NOE-r and the proposed modification, we successfully derive a convergence rate of the ALCP strategy in Theorem 3.3. Numerical experiments in Section IV demonstrate the effectiveness of the ALCP strategy compared with the standard strategy for Problem 1.1, known as the retraction-based strategy [25], [46], [47] (see below).

Related Works and Primitive Goal Of This Paper: The so-called retraction-based strategy [25], [46], [47] is known as the standard generic strategy for optimization over Riemannian manifolds, which include \mathrm {St}(p,N) as a special instance. In order to extend ideas established for optimization over the Euclidean space to optimization over \mathrm {St}(p,N) , the retraction-based strategy approximates \mathrm {St}(p,N) by the tangent space to \mathrm {St}(p,N) locally around the current iterate \boldsymbol {U}_{n} \in \mathrm {St} (p,N) with a retraction mapping, where the tangent space varies according to the tangent point \boldsymbol {U}_{n} . However, the retraction-based strategy encounters certain difficulties in utilizing past estimates of a solution for updating the current estimate, mainly because the past estimates are expressed on tangent spaces different from the current one. Such difficulties prevent designers of algorithms from extending powerful optimization algorithms over the Euclidean space to new algorithms over \mathrm {St}(p,N) , because such advanced algorithms, e.g., [39], [40], [41], [43], [44], and [45], effectively exploit the past estimates of a solution for updating the current estimate. The major goal of this paper is to resolve such difficulties in the retraction-based strategy by utilizing a particularly beautiful property of \mathrm {St}(p,N) , i.e., \Phi _{\boldsymbol {S}}: \mathrm {St}(p,N) \setminus E_{N,p}(\boldsymbol {S}) \to Q_{N,p}(\boldsymbol {S}) and \Phi _{\boldsymbol {S}}^{-1}:Q_{N,p}(\boldsymbol {S}) \to \mathrm {St} (p,N) \setminus E_{N,p}(\boldsymbol {S}) .

A partial preliminary short version of this paper was presented in [31].

Notation: \mathbb {N} , \mathbb {N}_{0} , and \mathbb {R} denote respectively the set of all positive integers, the set of all nonnegative integers, and the set of all real numbers. For general n\in \mathbb {N} , \boldsymbol {I}_{n} stands for the identity matrix in \mathbb {R}^{n\times n} , but the identity matrix in \mathbb {R}^{N\times N} is denoted simply by \boldsymbol {I} . For p \leq N , \boldsymbol {I}_{N\times p} \in \mathbb {R}^{N\times p} denotes the matrix of the first p columns of \boldsymbol {I} . For p < N , the matrices \boldsymbol {U}_{\mathrm {up}} \in \mathbb {R}^{p \times p} and \boldsymbol {U}_{\mathrm {lo}} \in \mathbb {R}^{(N-p) \times p} denote respectively the upper and the lower block matrices of \boldsymbol {U} \in \mathbb {R}^{N\times p} , i.e., \boldsymbol {U}=[\boldsymbol {U}_{\mathrm {up}}^{ \mathsf {T}}~\boldsymbol {U}_{\mathrm {lo}}^{ \mathsf {T}}]^{ \mathsf {T}} . The matrices \boldsymbol {S}_{\mathrm {le}} \in \mathbb {R}^{N\times p} and \boldsymbol {S}_{\mathrm {ri}} \in \mathbb {R}^{N\times (N-p)} denote respectively the left and right block matrices of \boldsymbol {S} \in \mathbb {R}^{N\times N} , i.e., \boldsymbol {S}=[\boldsymbol {S}_{\mathrm {le}}~\boldsymbol {S}_{\mathrm {ri}}] . For a square matrix \begin{aligned} \boldsymbol {X}:= \begin{bmatrix} \boldsymbol {X}_{11} \in \mathbb {R}^{p\times p} & \boldsymbol {X}_{12} \in \mathbb {R}^{p\times (N-p)} \\ \boldsymbol {X}_{21} \in \mathbb {R}^{(N-p)\times p} & \boldsymbol {X}_{22} \in \mathbb {R}^{(N-p)\times (N-p)} \end{bmatrix}\in \mathbb {R}^{N\times N} \end{aligned} , we use the notation [\![\boldsymbol {X} ]\!]_{ij}:= \boldsymbol {X}_{ij} for i,j \in \{1,2\} . For \boldsymbol {X} \in \mathbb {R}^{m\times n} , [\boldsymbol {X}]_{ij} denotes the (i,j) entry of \boldsymbol {X} and \boldsymbol {X}^{ \mathsf {T}} denotes the transpose of \boldsymbol {X} . For \boldsymbol {X}\in \mathbb {R}^{n\times n} , \mathrm {Skew}(\boldsymbol {X}) = (\boldsymbol {X}-\boldsymbol {X}^{ \mathsf {T}})/2 is the skew-symmetric component of \boldsymbol {X} . For square matrices \boldsymbol {X}_{i}\in \mathbb {R}^{n_{i}\times n_{i}}~(1\leq i \leq k) , \mathrm {diag}(\boldsymbol {X}_{1},\boldsymbol {X}_{2},\ldots,\boldsymbol {X}_{k})\in \mathbb {R}^{\left({\sum _{i=1}^{k}n_{i}}\right)\times \left({\sum _{i=1}^{k}n_{i}}\right)} denotes the block diagonal matrix with diagonal blocks \boldsymbol {X}_{1},\boldsymbol {X}_{2},\ldots,\boldsymbol {X}_{k} . For a given matrix \boldsymbol {X} \in \mathbb {R}^{m\times n} , \left \lVert{ \boldsymbol {X} }\right \rVert _{2} and \left \lVert{ \boldsymbol {X} }\right \rVert _{F} denote respectively the spectral norm and the Frobenius norm. For a subset \mathcal {N} of \mathbb {N}_{0} , \left |{\mathcal {N}}\right | denotes the cardinality of \mathcal {N} . To distinguish from the symbol for the orthogonal group {\mathrm{ O}}(N) , the symbol \mathfrak {o}(\cdot) is used in place of the standard big O notation. \lfloor \cdot \rfloor : \mathbb {R}\to \mathbb {Z}: t \mapsto \max \{n\in \mathbb {Z} \mid n\leq t\} stands for the floor function, where \mathbb {Z} is the set of all integers.
For a differentiable function J:\mathcal {X} \to \mathbb {R} defined over the Euclidean space \mathcal {X} with an inner product {\langle \cdot,\cdot \rangle } , \nabla J(\boldsymbol {x}) is the gradient of J at \boldsymbol {x} \in \mathcal {X} satisfying \lim _{t\to 0,t\neq 0, t\in \mathbb {R}} \frac {J(\boldsymbol {x}+t\boldsymbol {v}) - J(\boldsymbol {x})}{t} = {\langle \nabla J(\boldsymbol {x}),\boldsymbol {v} \rangle } for all \boldsymbol {v} \in \mathcal {X} .

SECTION II.

Preliminaries

A. Adaptive Localized Cayley Parametrization for Optimization Over \mathrm{St}(p,N)

The generalized left-localized Cayley transform \Phi _{\boldsymbol {S}} [33] is an extension of the classical Cayley transform [48] for parametrization of \mathrm {St}(p,N) . For p,N\in \mathbb {N} satisfying p\leq N , \Phi _{\boldsymbol {S}} with \boldsymbol {S}\in {\mathrm{ O}}(N) is defined as \begin{align*} \Phi _{\boldsymbol {S}}&: \mathrm {St}(p,N) \setminus E_{N,p}(\boldsymbol {S})\to Q_{N,p}(\boldsymbol {S}) \\ &: \boldsymbol {U}\mapsto \begin{bmatrix} \boldsymbol {A}_{\boldsymbol {S}}(\boldsymbol {U}) & -\boldsymbol {B}^{ \mathsf {T}}_{\boldsymbol {S}}(\boldsymbol {U}) \\ \boldsymbol {B}_{\boldsymbol {S}}(\boldsymbol {U}) & \boldsymbol {0} \end{bmatrix} \tag{2}\end{align*} with \begin{align*} \boldsymbol {A}_{\boldsymbol {S}}(\boldsymbol {U}) &:= 2(\boldsymbol {I}_{p}+\boldsymbol {S}_{\mathrm {le}}^{ \mathsf {T}}\boldsymbol {U})^{-\mathrm {T}} \mathrm {Skew}(\boldsymbol {U}^{ \mathsf {T}}\boldsymbol {S}_{\mathrm {le}})(\boldsymbol {I}_{p}+\boldsymbol {S}_{\mathrm {le}}^{ \mathsf {T}}\boldsymbol {U})^{-1} \in Q_{p,p}, \\ \boldsymbol {B}_{\boldsymbol {S}}(\boldsymbol {U}) &:= - \boldsymbol {S}_{\mathrm {ri}}^{ \mathsf {T}}\boldsymbol {U}(\boldsymbol {I}_{p}+\boldsymbol {S}_{\mathrm {le}}^{ \mathsf {T}}\boldsymbol {U})^{-1} \in \mathbb {R}^{(N-p) \times p},\end{align*} where (i) \begin{equation*} E_{N,p}(\boldsymbol {S}):= \{\boldsymbol {U} \in \mathrm {St} (p,N) \mid \det (\boldsymbol {I}_{p} + \boldsymbol {S}_{\mathrm {le}}^{ \mathsf {T}}\boldsymbol {U}) = 0\} \tag{3}\end{equation*} is called the singular-point set of \Phi _{\boldsymbol {S}} , and (ii) Q_{N,p}(\boldsymbol {S}) is clearly a vector space [33] (see (1)). For \boldsymbol {S} \in {\mathrm{ O}}(N) , the inverse mapping of \Phi _{\boldsymbol {S}} is given by \begin{align*} \Phi _{\boldsymbol {S}}^{-1}&: Q_{N,p}(\boldsymbol {S})\to \mathrm {St} (p,N)\setminus E_{N,p}(\boldsymbol {S}) \\ &: \boldsymbol {V} \mapsto \boldsymbol {S}(\boldsymbol {I}-\boldsymbol {V})(\boldsymbol {I}+\boldsymbol {V})^{-1}\boldsymbol {I}_{N\times p}. \tag{4}\end{align*} We call the tunable parameter \boldsymbol {S} a center point of \Phi _{\boldsymbol {S}} .
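To make (2)-(4) concrete, the following is a minimal NumPy sketch of \Phi _{\boldsymbol {S}} and \Phi _{\boldsymbol {S}}^{-1} (the paper's experiments use MATLAB; this Python rendering and the function names phi and phi_inv are ours, for illustration only):

```python
import numpy as np

def phi(S, U):
    """Sketch of the generalized left-localized Cayley transform (2).

    Maps U in St(p, N) (off the singular-point set E_{N,p}(S)) to
    V = [[A, -B^T], [B, 0]] in Q_{N,p}(S).
    """
    N, p = U.shape
    Sle, Sri = S[:, :p], S[:, p:]               # left/right blocks of S
    M = np.eye(p) + Sle.T @ U                   # I_p + S_le^T U, invertible off E_{N,p}(S)
    K = (U.T @ Sle - Sle.T @ U) / 2             # Skew(U^T S_le)
    T1 = np.linalg.solve(M.T, K)                # (I_p + S_le^T U)^{-T} Skew(U^T S_le)
    A = 2 * np.linalg.solve(M.T, T1.T).T        # ... (I_p + S_le^T U)^{-1}
    B = -np.linalg.solve(M.T, (Sri.T @ U).T).T  # -S_ri^T U (I_p + S_le^T U)^{-1}
    V = np.zeros((N, N))
    V[:p, :p], V[p:, :p], V[:p, p:] = A, B, -B.T
    return V

def phi_inv(S, V, p):
    """Sketch of the inverse transform (4): V -> S (I - V)(I + V)^{-1} I_{N x p}."""
    N = V.shape[0]
    W = np.linalg.solve(np.eye(N) + V, np.eye(N)[:, :p])  # (I + V)^{-1} I_{N x p}
    return S @ (W - V @ W)                                # S (I - V) W
```

As a quick consistency check, for \boldsymbol {U} \notin E_{N,p}(\boldsymbol {S}) one should observe phi_inv(S, phi(S, U), p) ≈ U up to rounding error.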

Fact 2.1 (Optimality Condition [33]):

Let f:\mathbb {R}^{N\times p} \to \mathbb {R} be differentiable. Let \boldsymbol {U} \in \mathrm {St} (p,N) and \boldsymbol {S} \in {\mathrm{ O}}(N) satisfy \boldsymbol {U} \in \mathrm {St} (p,N) \setminus E_{N,p}(\boldsymbol {S}) . Then, the following holds:

  1. \boldsymbol {U} is a local minimizer of Problem 1.1 if and only if \Phi _{\boldsymbol {S}}(\boldsymbol {U}) is a local minimizer of f\circ \Phi _{\boldsymbol {S}}^{-1} over Q_{N,p}(\boldsymbol {S}) .

  2. \boldsymbol {U} is a stationary point of Problem 1.1 if and only if \Phi _{\boldsymbol {S}}(\boldsymbol {U}) is a stationary point of f_{\boldsymbol {S}}:=f\circ \Phi _{\boldsymbol {S}}^{-1} over Q_{N,p}(\boldsymbol {S}) , i.e., \nabla f_{\boldsymbol {S}}(\Phi _{\boldsymbol {S}}(\boldsymbol {U})) = \boldsymbol {0} (see Appendix A for \nabla f_{\boldsymbol {S}} ).

By using \Phi _{\boldsymbol {S}} and \Phi _{\boldsymbol {S}}^{-1} , we illustrate the adaptive localized Cayley parametrization (ALCP) strategy [42] in Algorithm 1 for Problem 1.1 via Problem 1.3, where center points \boldsymbol {S}_{[l]} are adaptively updated to suppress the singular-point issue (see Remark 2.2 (b) and (c)). Basically, while keeping the l th center point \boldsymbol {S}_{[l]} , Algorithm 1 updates \boldsymbol {V}_{n} \in Q_{N,p}(\boldsymbol {S}_{[l]}) to a tentative estimate \widetilde {\boldsymbol {V}}_{n+1} \in Q_{N,p}(\boldsymbol {S}_{[l]}) by using a single update of a Euclidean optimization algorithm \mathcal {A}^{\langle l \rangle } for the minimization of f\circ \Phi _{\boldsymbol {S}_{[l]}}^{-1} (see Section II-B for \mathcal {A}^{\langle l \rangle } ). In a case where the singular-point issue is detected by checking a certain restarting condition (see Remark 2.2 (d)), we update the center point from \boldsymbol {S}_{[l]} to \boldsymbol {S}_{[l+1]} \in {\mathrm{ O}}(N) (see line 10) so that \boldsymbol {U}_{n+1} is distant from the singular-point set E_{N,p}(\boldsymbol {S}_{[l+1]}) (see Algorithm 2 and Remark 2.3). Then, the parametric expression of \boldsymbol {U}_{n+1} is also changed from \widetilde {\boldsymbol {V}}_{n+1} \in Q_{N,p}(\boldsymbol {S}_{[l]}) to \Phi _{\boldsymbol {S}_{[l+1]}}(\boldsymbol {U}_{n+1}) \in Q_{N,p}(\boldsymbol {S}_{[l+1]}) . Finally, the next iterate is given as \begin{align*} \boldsymbol {V}_{n+1}:= \begin{cases} \displaystyle \Phi _{\boldsymbol {S}_{[l+1]}}(\boldsymbol {U}_{n+1}) & \mathrm {(if~restarting~condition~holds)} \\ \displaystyle \widetilde {\boldsymbol {V}}_{n+1} & \mathrm {(otherwise)}. \end{cases}\end{align*} Just after updating the center point from \boldsymbol {S}_{[l]} to \boldsymbol {S}_{[l+1]} \in {\mathrm{ O}}(N) , we restart the minimization of f\circ \Phi _{\boldsymbol {S}_{[l+1]}}^{-1} by applying a Euclidean optimization algorithm \mathcal {A}^{\langle l+1 \rangle } with the initial guess \boldsymbol {V}_{n+1} \in Q_{N,p}(\boldsymbol {S}_{[l+1]}) .

Algorithm 1 Adaptive Localized Cayley Parametrization Strategy

Input: \boldsymbol {U}_{0} \in \mathrm {St} (p,N)

1: {\mathcal {N}}_{l} \leftarrow \emptyset ~(l\in \mathbb {N}_{0}) \triangleright Initialization of index sets
2: l \leftarrow 0 , n \leftarrow 0 , \mathcal {N}_{0} \leftarrow \{0\}
3: \boldsymbol {S}_{[l]} \leftarrow \mathrm {Algorithm~2}(\boldsymbol {U}_{n})
4: \boldsymbol {V}_{n} \leftarrow \Phi _{\boldsymbol {S}_{[l]}}(\boldsymbol {U}_{n})
5: while \left \lVert{ \nabla (f\circ \Phi _{\boldsymbol {S}_{[l]}}^{-1}) (\boldsymbol {V}_{n}) }\right \rVert _{F} > 0 do
6:   \widetilde{\boldsymbol {V}}_{n+1} \leftarrow \mathcal {A}^{\langle l \rangle }(\boldsymbol {V}_{n})
7:   \boldsymbol {U}_{n+1} \leftarrow \Phi _{\boldsymbol {S}_{[l]}}^{-1}(\widetilde{\boldsymbol {V}}_{n+1})
8:   if restarting condition (e.g., (6) and (17)) holds then
9:     \mathcal {N}_{l+1} \leftarrow \{n+1\}
10:    \boldsymbol {S}_{[l+1]} \leftarrow \mathrm {Algorithm~2}(\boldsymbol {U}_{n+1})
11:    \triangleright Update the center point
12:    \boldsymbol {V}_{n+1} \leftarrow \Phi _{\boldsymbol {S}_{[l+1]}}(\boldsymbol {U}_{n+1})
13:    \triangleright New parametric expression of \boldsymbol {U}_{n+1} using \boldsymbol {S}_{[l+1]}
14:    l\leftarrow l+1
15:  else
16:    {\mathcal {N}}_{l} \leftarrow {\mathcal {N}}_{l}\cup \{n+1\}
17:    \boldsymbol {V}_{n+1} \leftarrow \widetilde{\boldsymbol {V}}_{n+1}
18:  end if
19:  n \leftarrow n+1
20: end while
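The control flow of Algorithm 1 can also be summarized in a schematic Python sketch (our illustration, not the authors' code): phi, phi_inv, and center_point are the routines sketched in this section, make_step() produces one fresh Euclidean NOE-r updater \mathcal {A}^{\langle l \rangle } , and restart_test evaluates the restarting condition (a sketch of restart_test appears in Section III):

```python
def alcp(U0, make_step, restart_test, max_iter=1000):
    """Schematic main loop of Algorithm 1 (illustrative sketch only)."""
    p = U0.shape[1]
    U = U0
    S = center_point(U0)                       # line 3: Algorithm 2
    V = phi(S, U0)                             # line 4
    step, l, n_min = make_step(), 0, 0         # fresh A^<0>; n_min tracks min(N_l)
    for n in range(max_iter):                  # line 5 (gradient test omitted here)
        V_tilde = step(V)                      # line 6: tentative estimate
        U = phi_inv(S, V_tilde, p)             # line 7
        if restart_test(V_tilde, n, n_min, l, p):   # line 8
            S = center_point(U)                # line 10: update the center point
            V = phi(S, U)                      # line 12: re-express U_{n+1}
            step, l, n_min = make_step(), l + 1, n + 1  # restart A^<l+1>
        else:
            V = V_tilde                        # line 17
    return U
```

Here make_step() returns a stateful single-update routine so that the internal record of \mathcal {A}^{\langle l \rangle } is reset at each restart, matching the description above.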

Algorithm 2 Choice of Center Point

Input: \boldsymbol {U} = \begin{bmatrix} \boldsymbol {U}_{\mathrm {up}}^{ \mathsf {T}} & \boldsymbol {U}_{\mathrm {lo}}^{ \mathsf {T}} \end{bmatrix}^{ \mathsf {T}} \in \mathrm {St} (p,N)
Compute the singular value decomposition \boldsymbol {U}_{\mathrm {up}} = \boldsymbol {Q}_{1}\boldsymbol {\Sigma }\boldsymbol {Q}_{2}^{ \mathsf {T}} .
\boldsymbol {S} \leftarrow \mathrm {diag} (\boldsymbol {Q}_{1}\boldsymbol {Q}_{2}^{ \mathsf {T}},\boldsymbol {I}_{N-p}) \in {\mathrm{ O}}(N) .
Output: \boldsymbol {S}
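In NumPy, Algorithm 2 reduces to a few lines (again a sketch of ours; note that \boldsymbol {Q}_{1}\boldsymbol {Q}_{2}^{ \mathsf {T}} is the orthogonal factor of the polar decomposition of \boldsymbol {U}_{\mathrm {up}} ):

```python
def center_point(U):
    """Sketch of Algorithm 2: a center point S with ||Phi_S(U)||_2 <= 1
    and ||Phi_S(U)||_block <= 1 [33, Theorem 2.7]."""
    N, p = U.shape
    Q1, _, Q2t = np.linalg.svd(U[:p, :])  # U_up = Q1 @ diag(sigma) @ Q2t
    S = np.eye(N)
    S[:p, :p] = Q1 @ Q2t                  # S = diag(Q1 Q2^T, I_{N-p})
    return S
```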

Remark 2.2 (On the ALCP Strategy (Algorithm 1)):

  1. (Properties of {\mathcal {N}}_{l} ) In Algorithm 1, the index set {\mathcal {N}}_{l} \subset \mathbb {N}_{0} satisfies \begin{align*} \begin{cases} \displaystyle n\in {\mathcal {N}}_{l} \Rightarrow \boldsymbol {V}_{n} \in Q_{N,p}(\boldsymbol {S}_{[l]}), \\ \displaystyle (l_{1}\neq l_{2})\quad \mathcal {N}_{l_{1}} \cap \mathcal {N}_{l_{2}} = \emptyset, \\ \displaystyle \mathcal {N}_{l+1} \neq \emptyset \Rightarrow \left |{{\mathcal {N}}_{l}}\right | = \max ({\mathcal {N}}_{l}) - \min ({\mathcal {N}}_{l}) + 1, \\ \displaystyle \mathcal {N}_{l+1} \neq \emptyset \Rightarrow \max ({\mathcal {N}}_{l})+1 = \min (\mathcal {N}_{l+1}), \end{cases} \tag{5}\end{align*} where \mathcal {N}_{l+1} \neq \emptyset implies that {\mathcal {N}}_{l} is determined as the finite interval {\mathcal {N}}_{l} = [\min ({\mathcal {N}}_{l}),\max ({\mathcal {N}}_{l})] \cap \mathbb {N}_{0} , i.e., \begin{align*} {\mathcal {N}}_{l} = \begin{cases} \displaystyle [\min ({\mathcal {N}}_{l}), \max ({\mathcal {N}}_{l})] \cap \mathbb {N}_{0} & (\mathrm {if}~\max ({\mathcal {N}}_{l}) < +\infty) \\ \displaystyle [\min ({\mathcal {N}}_{l}), \infty) \cap \mathbb {N}_{0} & (\mathrm {otherwise}). \end{cases}\end{align*}

  2. (Singular-point issue possibly impacts the NCP strategy) Any singular point corresponds to a point at infinity in Q_{N,p}(\boldsymbol {S}) through \Phi _{\boldsymbol {S}}^{-1} [33, Theorem 2.3 (d)]. Clearly, if a minimizer \boldsymbol {U}^{\star } \in \mathrm {St} (p,N) in Problem 1.1 lies near the singular-point set E_{N,p}(\boldsymbol {S}) [28], [33], then \left \lVert{ \Phi _{\boldsymbol {S}}(\boldsymbol {U}^{\star }) }\right \rVert _{2} becomes extremely large, which implies (i) many iterations are required for \boldsymbol {V}_{n} to approach \Phi _{\boldsymbol {S}}(\boldsymbol {U}^{\star }) , and (ii) a direct application of any Euclidean optimization algorithm to f\circ \Phi _{\boldsymbol {S}}^{-1} may induce slow convergence.

  3. (Suppression of singular-point issue in NCP strategy by ALCP strategy) To suppress the undesired influence caused by E_{N,p}(\boldsymbol {S}_{[l]}) , the ALCP strategy (Algorithm 1) updates \boldsymbol {S}_{[l]} adaptively so that the parametric expression \Phi _{\boldsymbol {S}_{[l]}}(\boldsymbol {U}^{\star }) \in Q_{N,p}(\boldsymbol {S}_{[l]}) of \boldsymbol {U}^{\star } is not too distant from zero by using Algorithm 2 (see Remark 2.2 (d) and Remark 2.3).

  4. (Detection of singular-point issue) We can detect the singular-point issue, in line 8 of Algorithm 1, by checking whether \left \lVert{ \widetilde {\boldsymbol {V}}_{n+1} }\right \rVert _{2} exceeds a predetermined threshold T > 0 [42], where \widetilde {\boldsymbol {V}}_{n+1}=\mathcal {A}^{\langle l \rangle }(\boldsymbol {V}_{n}) is a tentative update used to judge whether the parametric expression \widetilde {\boldsymbol {V}}_{n+1} of \boldsymbol {U}_{n+1} is too strongly influenced by E_{N,p}(\boldsymbol {S}_{[l]}) to keep \boldsymbol {S}_{[l]} . We can also use \begin{equation*} \left \lVert{ \widetilde {\boldsymbol {V}}_{n+1} }\right \rVert _{\mathrm {block}}:= \left \lVert{ [\![\widetilde {\boldsymbol {V}}_{n+1} ]\!]_{11} }\right \rVert _{2} + \left \lVert{ [\![\widetilde {\boldsymbol {V}}_{n+1} ]\!]_{21} }\right \rVert _{2} > T \tag{6}\end{equation*} in place of \left \lVert{ \widetilde {\boldsymbol {V}}_{n+1} }\right \rVert _{2} because \left \lVert{ \cdot }\right \rVert _{\mathrm {block}} is a computationally more efficient (equivalent) norm for Q_{N,p}(\boldsymbol {S}) [42].

Remark 2.3 (Updating Center Points by Algorithm 2):

To suppress the singular-point issue, the choice of \boldsymbol {S}_{[l+1]} \in {\mathrm{ O}}(N) plays a crucial role in Algorithm 1. Algorithm 2 serves as a proper design of \boldsymbol {S}_{[l+1]} for the latest estimate \boldsymbol {U}_{n+1} \in \mathrm {St} (p,N) because \left \lVert{ \Phi _{\boldsymbol {S}_{[l+1]}}(\boldsymbol {U}_{n+1}) }\right \rVert _{2} \leq 1 and \left \lVert{ \Phi _{\boldsymbol {S}_{[l+1]}}(\boldsymbol {U}_{n+1}) }\right \rVert _{\mathrm {block}} \leq 1 are guaranteed [33, Theorem 2.7], implying that \Phi _{\boldsymbol {S}_{[l+1]}}(\boldsymbol {U}_{n+1}) is not too distant from zero (see Remark 2.2 (c) for its effect on the singular-point issue). Moreover, the computational complexities of \Phi _{\boldsymbol {S}_{[l+1]}} and \Phi _{\boldsymbol {S}_{[l+1]}}^{-1} with \boldsymbol {S}_{[l+1]} designed by Algorithm 2 are lower (\mathfrak {o}(Np^{2} + p^{3}) flops) than those (\mathfrak {o}(N^{2}p + p^{3}) flops) with general center points [33, Sec. 2].

B. Nonconvex Optimization Over the Euclidean Space With Convergence Rates

Many Euclidean optimization algorithms have been designed to achieve better convergence. Especially in the machine learning area, convergence rates in terms of the gradient of the cost function are of great interest, e.g., [38], [39], [40], [43], [44], [45], and [41]. Here, we focus on the class of Nonconvex Optimization over the Euclidean space with convergence Rates (NOE-r) defined below (see also Example 2.5).

Definition 2.4 (NOE-r):

Assume that the cost function J:\mathcal {X}\to \mathbb {R} defined over the Euclidean space \mathcal {X} satisfies the following condition: \begin{align*} \begin{cases} \mathrm {(i)} & J~\text {is bounded below.} \\ \mathrm {(ii)} & J~\text {is differentiable.} \\ \mathrm {(iii)} & \nabla J~\text {is Lipschitz continuous with Lipschitz} \\ & \text {constant}~L_{\nabla J}> 0,~\text {i.e., for all}~\boldsymbol {x}_{1},\boldsymbol {x}_{2} \in \mathcal {X}, \\ & \left \lVert{ \nabla J(\boldsymbol {x}_{1})-\nabla J(\boldsymbol {x}_{2}) }\right \rVert \leq L_{\nabla J} \left \lVert{ \boldsymbol {x}_{1} - \boldsymbol {x}_{2} }\right \rVert. \end{cases} \tag{7}\end{align*} We say that an algorithm \mathcal {A} belongs to NOE-r of \mathfrak {o}(cn^{-\alpha }) with \alpha >0 for the minimization of J if the following hold:

  1. (Update) Let \mathcal {N}:= \mathbb {N}_{0} . Choose arbitrarily an initial guess \boldsymbol {x}_{\min (\mathcal {N})} (=\boldsymbol {x}_{0}) \in \mathcal {X} of a stationary point of J . Estimates (\boldsymbol {x}_{n})_{n\in \mathcal {N}} \subset \mathcal {X} of a stationary point of J are generated by \begin{equation*} (n\in \mathcal {N}) \quad \mathcal {A}: \left ({\boldsymbol {x}_{n},\mathfrak {R}_{[\min (\mathcal {N}),n]}}\right) \mapsto \boldsymbol {x}_{n+1} \in \mathcal {X}\end{equation*} with a certain available record \mathfrak {R}_{[\min (\mathcal {N}),n]} (see Example 2.5). For simplicity, we also use the notation \mathcal {A}(\boldsymbol {x}_{n}) for a single update from \boldsymbol {x}_{n} .

  2. (Convergence rate in terms of gradient sequence) The algorithm \mathcal {A} has a convergence rate \mathfrak {o}(cn^{-\alpha }) in terms of gradient sequence, which means in this paper (see also [38, p.28]), for any cost function J satisfying (7), the algorithm \mathcal {A} satisfies \begin{align*} & (\exists \alpha > 0, \exists c > 0, \forall n \in \mathcal {N}) \\ & \quad \min _{k \in \mathcal {N}, k \leq n} \left \lVert{ \nabla J(\boldsymbol {x}_{k}) }\right \rVert \leq \frac {c}{(n-\min (\mathcal {N})+1)^{\alpha }}. \tag{8}\end{align*}

Example 2.5 (Selected Algorithms in NOE-r):

Many Euclidean optimization algorithms belong to NOE-r of \mathfrak {o}(cn^{-1/2}) for some c > 0 , e.g., the gradient descent method [38], Nesterov-type accelerated gradient methods [39], [40], [41], and adaptive momentum optimization algorithms [43], [44], [45]. Here, we introduce some methods below.

  1. (Gradient descent method) Under the condition (7) for J , a gradient descent method updates \begin{equation*} \mathcal {A}_{\mathrm {GDM}}:\boldsymbol {x}_{n} \mapsto \boldsymbol {x}_{n+1}:=\boldsymbol {x}_{n} - \gamma _{n} \nabla J(\boldsymbol {x}_{n})\end{equation*} with \mathfrak {R}_{[\min (\mathcal {N}), n]}:= (\gamma _{n}, \nabla J(\boldsymbol {x}_{n})) , where a stepsize \gamma _{n}>0 satisfies, e.g., the Armijo condition (see, e.g., [34]). Then, \mathcal {A}_{\mathrm {GDM}} has a convergence rate \mathfrak {o}(L_{\nabla J}^{1/2}\Delta _{J}^{1/2}n^{-1/2}) (see, e.g., [38], [41]), where \Delta _{J}:= J(\boldsymbol {x}_{0}) - \inf _{\boldsymbol {x}\in \mathcal {X}}J(\boldsymbol {x}) (a short derivation of this constant, for a fixed stepsize, is given at the end of this example).

  2. (Nesterov-type accelerated gradient method (NAG)) The NAG [37], [38] was originally proposed for a convex function J satisfying (7), where the NAG achieves the optimal convergence rate in terms of function sequence. The NAG has been extended to nonconvex J satisfying (7) with convergence rate \mathfrak {o}(cn^{-1/2}) , but in terms of gradient sequence (see (8)) [39], [40], [41]. The following briefly summarizes a special instance of extended NAGs, say rNAG, with a restarting scheme [40, Theorem 4.1]. The rNAG updates \begin{align*}\mathcal {A}_{\mathrm {rNAG}}:\boldsymbol {x}_{n} \mapsto \boldsymbol {x}_{n+1}:= \begin{cases} \displaystyle \boldsymbol {y}_{n} - \gamma _{n}\nabla J(\boldsymbol {y}_{n}) & (\mathrm {if}~(10)\mathrm {~holds}) \\ \displaystyle \boldsymbol {x}_{n} & (\mathrm {otherwise}) \end{cases} \tag{9}\end{align*} with \mathfrak {R}_{[\min (\mathcal {N}),n]}:= (\boldsymbol {y}_{n},\gamma _{n},\nabla J(\boldsymbol {y}_{n})) based on the test whether the value J(\boldsymbol {y}_{n} - \gamma _{n}\nabla J(\boldsymbol {y}_{n})) sufficiently decreases, i.e., \begin{equation*} J(\boldsymbol {y}_{n} - \gamma _{n}\nabla J(\boldsymbol {y}_{n})) \leq J(\boldsymbol {x}_{n}) - c_{R} \gamma _{n} \left \lVert{ \nabla J(\boldsymbol {y}_{n}) }\right \rVert ^{2}, \tag{10}\end{equation*} or not, where \begin{align*} \boldsymbol {y}_{n}:= \begin{cases} \displaystyle \boldsymbol {x}_{n} & (\mathrm {if}~n = \min (\mathcal {N})) \\ \displaystyle \boldsymbol {x}_{n} + \beta _{n}(\boldsymbol {x}_{n} - \boldsymbol {x}_{n-1}) & (\mathrm {if}~n > \min (\mathcal {N})) \end{cases} \tag{11}\end{align*} is designed with \beta _{n}:= \frac {n-\min (\mathcal {N})}{n+3-\min (\mathcal {N})} , c_{R} > 0 is a predetermined small constant, and \gamma _{n} > 0 is a stepsize designed to ensure a convergence rate \mathfrak {o}(cn^{-1/2}) . More precisely, each stepsize \gamma _{n} is chosen to satisfy, with a predetermined \rho \in (0,1] , \begin{align*} \left.{\begin{array}{l} \frac {\rho }{L_{\nabla J}} \leq \gamma _{n} \leq \gamma _{n-1}~(\mathrm {for}~n > \min (\mathcal {N}))~\mathrm {and} \\ J(\boldsymbol {y}_{n} - \gamma _{n}\nabla J(\boldsymbol {y}_{n})) \leq J(\boldsymbol {y}_{n}) - \frac {\gamma _{n}}{2} \left \lVert{ \nabla J(\boldsymbol {y}_{n}) }\right \rVert ^{2}. \end{array}}\right \} \tag{12}\end{align*} Since the latter condition in (12) holds automatically whenever \gamma _{n} \leq L_{\nabla J}^{-1} , such a \gamma _{n} satisfying (12) can be given as the output of Algorithm 3 (backtracking algorithm, see, e.g., [34] for its key idea; a NumPy sketch of one rNAG step together with Algorithm 3 follows the listing of Algorithm 3 below) by setting in Algorithm 3 \begin{align*} \gamma _{\mathrm {initial}}:= \begin{cases} \displaystyle \gamma ' & (\mathrm {if}~n=\min (\mathcal {N})) \\ \displaystyle \gamma _{n-1} & (\mathrm {if}~n > \min (\mathcal {N})), \end{cases}\end{align*} \boldsymbol {x}:= \boldsymbol {x}_{n} , and \boldsymbol {y}:=\boldsymbol {y}_{n} , where \rho \in (0,1) and \gamma ' > L_{\nabla J}^{-1} are predetermined arbitrarily.
Note that if the output \gamma _{n} of Algorithm 3 and (\boldsymbol {x}_{n},\boldsymbol {y}_{n}) do not satisfy (10), then (i) \boldsymbol {x}_{n+1} = \boldsymbol {x}_{n} by (9) and (ii) \boldsymbol {y}_{n+1} = \boldsymbol {x}_{n+1} + \beta _{n+1}(\boldsymbol {x}_{n+1}-\boldsymbol {x}_{n}) = \boldsymbol {x}_{n+1} by (11); therefore, the rNAG restarts from \boldsymbol {y}_{n+1} = \boldsymbol {x}_{n+1} = \boldsymbol {x}_{n} .
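To see where the constant in the rate of \mathcal {A}_{\mathrm {GDM}} comes from, here is the standard one-line argument under the simplifying choice \gamma _{n}:= L_{\nabla J}^{-1} (our simplification of the Armijo case, which differs only in constants): the descent lemma gives J(\boldsymbol {x}_{k+1}) \leq J(\boldsymbol {x}_{k}) - \frac {1}{2L_{\nabla J}} \left \lVert{ \nabla J(\boldsymbol {x}_{k}) }\right \rVert ^{2} , and summing over k=0,\ldots,n yields \begin{equation*} \frac {n+1}{2L_{\nabla J}} \min _{k\leq n} \left \lVert{ \nabla J(\boldsymbol {x}_{k}) }\right \rVert ^{2} \leq \sum _{k=0}^{n} \frac { \left \lVert{ \nabla J(\boldsymbol {x}_{k}) }\right \rVert ^{2}}{2L_{\nabla J}} \leq J(\boldsymbol {x}_{0}) - \inf _{\boldsymbol {x}\in \mathcal {X}}J(\boldsymbol {x}) = \Delta _{J},\end{equation*} i.e., \min _{k\leq n} \left \lVert{ \nabla J(\boldsymbol {x}_{k}) }\right \rVert \leq \sqrt {2L_{\nabla J}\Delta _{J}}/\sqrt {n+1} , so (8) holds with \alpha = 1/2 and c = \sqrt {2L_{\nabla J}\Delta _{J}} .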

Algorithm 3 Backtracking Algorithm

Input: \gamma _{\mathrm {initial}} > 0, ~\boldsymbol {x},\boldsymbol {y} \in \mathcal {X}
\gamma \leftarrow \gamma _{\mathrm {initial}}
while \gamma > L_{\nabla J}^{-1} and J(\boldsymbol {y} - \gamma \nabla J(\boldsymbol {y})) > J(\boldsymbol {y}) - \frac {\gamma }{2} \left \lVert{ \nabla J(\boldsymbol {y}) }\right \rVert ^{2} and J(\boldsymbol {y} - \gamma \nabla J(\boldsymbol {y})) > J(\boldsymbol {x}) - c_{R}\gamma \left \lVert{ \nabla J(\boldsymbol {y}) }\right \rVert ^{2} do
  \gamma \leftarrow \rho \gamma
end while
Output: \gamma
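The promised sketch of one rNAG step (9)-(11) with the backtracking stepsize rule follows; it is our NumPy illustration (J and gradJ are the cost and gradient oracles, inv_L stands for L_{\nabla J}^{-1} , and the defaults for c_R and rho mirror the values used in Section IV):

```python
def backtrack(J, gradJ, x, y, gamma_init, inv_L, c_R=2**-13, rho=0.5):
    """Sketch of Algorithm 3: shrink gamma until one of the exit tests holds."""
    g = gradJ(y)
    gg = float(np.vdot(g, g))                      # ||grad J(y)||^2
    gamma = gamma_init
    while (gamma > inv_L
           and J(y - gamma * g) > J(y) - 0.5 * gamma * gg
           and J(y - gamma * g) > J(x) - c_R * gamma * gg):
        gamma *= rho
    return gamma

def rnag_step(J, gradJ, x, x_prev, n, n_min, gamma_prev, inv_L, c_R=2**-13):
    """Sketch of one rNAG update (9)-(11)."""
    beta = (n - n_min) / (n + 3 - n_min)           # momentum weight in (11)
    y = x if n == n_min else x + beta * (x - x_prev)
    gamma = backtrack(J, gradJ, x, y, gamma_prev, inv_L)
    g = gradJ(y)
    if J(y - gamma * g) <= J(x) - c_R * gamma * float(np.vdot(g, g)):  # test (10)
        return y - gamma * g, gamma               # accelerated step accepted
    return x, gamma                               # (9): restart from x_{n+1} = x_n
```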

SECTION III.

Convergence Rate Analysis

Recall that (i) \boldsymbol {S}_{[l]} \in {\mathrm{ O}}(N) is employed as the l th center point of \Phi _{\boldsymbol {S}_{[l]}}^{-1} in Algorithm 1, (ii) {\mathcal {N}}_{l} \subset \mathbb {N}_{0} is the index interval where \boldsymbol {V}_{n} \in Q_{N,p}(\boldsymbol {S}_{[l]})~(n\in {\mathcal {N}}_{l}) are produced (see also (5)), (iii) \mathcal {A}^{\langle l \rangle } is a Euclidean optimization algorithm, which updates \boldsymbol {V}_{n} \in Q_{N,p}(\boldsymbol {S}_{[l]})~(n\in {\mathcal {N}}_{l}) for the minimization of f_{\boldsymbol {S}_{[l]}}:=f\circ \Phi _{\boldsymbol {S}_{[l]}}^{-1} .

In this section, we present a convergence rate analysis of Algorithm 1 under the following condition (Condition 3.1). More precisely, for Algorithm 1, we first present a center point-wise convergence analysis in Lemma 3.2, then an overall convergence analysis in Theorem 3.3.

Condition 3.1 (For Convergence Analysis of Algorithm 1):

In Algorithm 1, \left \lVert{ \nabla f_{\boldsymbol {S}_{[l]}}(\boldsymbol {V}_{n}) }\right \rVert _{F} never reaches zero for any n \in \mathbb {N}_{0} , implying that exactly one of the two cases below holds true.

  1. For some l_{\max } \in \mathbb {N}_{0} , \mathcal {N}_{l_{\max }} = \bigl [\min (\mathcal {N}_{l_{\max }}), \infty \bigr) \cap \mathbb {N}_{0} and {\mathcal {N}}_{l} = \emptyset ~(\forall l > l_{\max }) . In this case, we use the index set \Lambda _{\mathrm{ c}}:= \{0,1,\ldots,l_{\max }\} of chosen center points during the process of Algorithm 1.

  2. For all l \in \mathbb {N}_{0} , {\mathcal {N}}_{l} = [\min ({\mathcal {N}}_{l}), \max ({\mathcal {N}}_{l})] \cap \mathbb {N}_{0} and \max ({\mathcal {N}}_{l}) + 1 = \min (\mathcal {N}_{l+1}) . In this case, we use the index set \Lambda _{\mathrm{ c}}:= \mathbb {N}_{0} of chosen center points during the process of Algorithm 1.

Note that, in both cases, a wide-sense monotone increasing (i.e., nondecreasing) function \begin{equation*} \ell :\mathbb {N}_{0} \to \mathbb {N}_{0}:n\mapsto l,~\mathrm {s.t.}~n\in {\mathcal {N}}_{l}\end{equation*} is well-defined, and thus \boldsymbol {V}_{n} \in Q_{N,p}(\boldsymbol {S}_{[\ell (n)] }) and \boldsymbol {U}_{n}:= \Phi _{\boldsymbol {S}_{[\ell (n)] }}^{-1}(\boldsymbol {V}_{n}) \in \mathrm {St} (p,N) are well-defined for all n\in \mathbb {N}_{0} .

Lemma 3.2:

Assume that (i) f:\mathbb {R}^{N\times p} \to \mathbb {R} is differentiable, and (ii) \nabla f is Lipschitz continuous with a Lipschitz constant L_{\nabla f} > 0 (see (7)). Consider the situation where Condition 3.1 holds true. Then, the following hold:

  1. For all l\in \Lambda _{\mathrm{ c}} , the condition (7) is achieved when J and \mathcal {X} are given respectively by f_{\boldsymbol {S}_{[l]}}:=f\circ \Phi _{\boldsymbol {S}_{[l]}}^{-1} and Q_{N,p}(\boldsymbol {S}_{[l]}) .

  2. (Center point-wise convergence analysis) Suppose that, for some c > 0 and all l \in \Lambda _{\mathrm{ c}} , \mathcal {A}^{\langle l \rangle } belongs to NOE-r of \mathfrak {o}(cn^{-1/2}) in the sense of Definition 2.4 (b). Let (\boldsymbol {V}_{n})_{n \in \bigcup _{l\in \Lambda _{\mathrm{ c}}} {\mathcal {N}}_{l}} be generated by Algorithm 1 incorporating \mathcal {A}^{\langle l \rangle }~(l\in \Lambda _{\mathrm{ c}}) . Then, we have \begin{align*} (l\in \Lambda _{\mathrm{ c}}, n \in {\mathcal {N}}_{l}) & \min _{k \in {\mathcal {N}}_{l}, k \leq n} \left \lVert{ \nabla f_{\boldsymbol {S}_{[l]}}(\boldsymbol {V}_{k}) }\right \rVert _{F} \\ & \quad \leq \frac {c}{(n-\min ({\mathcal {N}}_{l})+1)^{1/2}} \tag{13}\end{align*} and \begin{equation*} (n\in \mathbb {N}_{0}) ~\min _{k=0,1\ldots,n} \left \lVert{ \nabla f_{\boldsymbol {S}_{[\ell (k)] }}(\boldsymbol {V}_{k}) }\right \rVert _{F} \leq \sqrt {\frac {\ell (n)+1}{n+1}}c. \tag{14}\end{equation*}

Proof:

  1. The image f(\mathrm {St}(p,N)) \subset \mathbb {R} of \mathrm {St}(p,N) under f is bounded below by the compactness of \mathrm {St}(p,N) and the continuity of f . From \Phi _{\boldsymbol {S}}^{-1}(Q_{N,p}(\boldsymbol {S})) \subset \mathrm {St} (p,N) , the image f_{\boldsymbol {S}_{[l]}}(Q_{N,p}(\boldsymbol {S}_{[l]})) \subset f(\mathrm {St}(p,N)) of Q_{N,p}(\boldsymbol {S}_{[l]}) under f_{\boldsymbol {S}_{[l]}} is also bounded below. Regarding the condition (7), see Fact 1.1 for the differentiability of f_{\boldsymbol {S}} and the Lipschitz continuity of \nabla f_{\boldsymbol {S}_{[l]}} .

  2. (13) is clear from (a) and the assumption on \mathcal {A}^{\langle l \rangle } (see also (8)).

    (Proof of (14)) Clearly, for any n\in \mathbb {N}_{0} , we have \begin{align*} &\hspace {-.5pc} \sum _{l=0}^{\ell (n)-1} \left |{{\mathcal {N}}_{l}}\right |\min _{k\in {\mathcal {N}}_{l}} \left \lVert{ \nabla f_{\boldsymbol {S}_{[l]}}(\boldsymbol {V}_{k}) }\right \rVert _{F}^{2} \\ & \quad + (n-\min (\mathcal {N}_{\ell (n)}) + 1)\min _{\min (\mathcal {N}_{\ell (n)})\leq k \leq n} \left \lVert{ \nabla f_{\boldsymbol {S}_{[\ell (n)] }}(\boldsymbol {V}_{k}) }\right \rVert _{F}^{2} \\ & \geq \left ({n-\min (\mathcal {N}_{\ell (n)}) + 1 + \sum _{l=0}^{\ell (n)-1} \left |{{\mathcal {N}}_{l}}\right |}\right) \\ & \hspace {9em} \times \min _{k=0,1,\ldots,n} \left \lVert{ \nabla f_{\boldsymbol {S}_{[\ell (k)] }}(\boldsymbol {V}_{k}) }\right \rVert _{F}^{2} \\ & = (n+1)\min _{k=0,1,\ldots,n} \left \lVert{ \nabla f_{\boldsymbol {S}_{[\ell (k)] }}(\boldsymbol {V}_{k}) }\right \rVert _{F}^{2}, \tag{15}\end{align*} where the last equality is verified by (5) and \begin{align*} &\hspace {-1pc} \sum _{l=0}^{\ell (n)-1} \left |{{\mathcal {N}}_{l}}\right | = \sum _{l=0}^{\ell (n)-1}(\max ({\mathcal {N}}_{l})- \min ({\mathcal {N}}_{l}) + 1) \\ & = \sum _{l=0}^{\ell (n)-1}(\min (\mathcal {N}_{l+1})- \min ({\mathcal {N}}_{l})) \\ & = \min (\mathcal {N}_{\ell (n)})- \min (\mathcal {N}_{0}) = \min (\mathcal {N}_{\ell (n)}) \quad (\because \min (\mathcal {N}_{0}) = 0).\end{align*}

Moreover, by (13), we have \begin{equation*} (0 \leq l \leq \ell (n)-1) \quad \left |{{\mathcal {N}}_{l}}\right |\min _{k\in {\mathcal {N}}_{l}} \left \lVert{ \nabla f_{\boldsymbol {S}_{[l]}}(\boldsymbol {V}_{k}) }\right \rVert _{F}^{2} \leq c^{2},\end{equation*} and \begin{equation*} (n+1-\min (\mathcal {N}_{\ell (n)}))\min _{\min (\mathcal {N}_{\ell (n)})\leq k \leq n} \left \lVert{ \nabla f_{\boldsymbol {S}_{[\ell (n)] }}(\boldsymbol {V}_{k}) }\right \rVert _{F}^{2} \leq c^{2}.\end{equation*} Finally, by substituting these relations into the LHS of (15), we obtain \begin{equation*} \min _{k=0,1\ldots,n} \left \lVert{ \nabla f_{\boldsymbol {S}_{[\ell (k)] }}(\boldsymbol {V}_{k}) }\right \rVert _{F}^{2} \leq \frac {\ell (n)+1}{n+1}c^{2}.\end{equation*}

From (14) in Lemma 3.2 (b), to evaluate the speed of convergence of \min _{k=0,1\ldots,n} \left \lVert{ \nabla f_{\boldsymbol {S}_{[\ell (k)] }}(\boldsymbol {V}_{k}) }\right \rVert _{F} to zero, it is sufficient to show that the RHS of (14) has the following upper bound with some \widehat {\alpha } > 0 and some \widehat {c} > 0 : \begin{equation*} (n\in \mathbb {N}_{0}) \quad \frac {\ell (n) + 1}{n+1} \leq \frac {\widehat {c}}{(n+1)^{\widehat {\alpha }}}. \tag{16}\end{equation*} To guarantee the existence of such \widehat {\alpha } > 0 and \widehat {c} > 0 , we modify the restarting condition (6) by newly introducing a polynomial lower bound \eta (l) on the usage count of \boldsymbol {S}_{[l]} as \begin{align*} \left.{\begin{array}{l} \left \lVert{ \widetilde {\boldsymbol {V}}_{n+1} }\right \rVert _{\mathrm {block}} > T ~\mathrm {and} \\ n - \min ({\mathcal {N}}_{l}) + 1 \geq \eta (l):= \left \lfloor{ \frac {l}{\tau } }\right \rfloor ^{\theta }, \end{array}}\right \} \tag{17}\end{align*} where (i) T > 0 is a predetermined threshold for the detection of the singular-point issue (see Remark 2.2 (d)), (ii) \theta \in \mathbb {N} is a tunable polynomial order of \eta (\cdot) , and (iii) \tau \in \mathbb {N} is a tunable invariant period of \eta (\cdot) . Thanks to the latter condition in (17), we ensure \begin{equation*} \mathcal {N}_{l+1} \neq \emptyset \Rightarrow \left |{{\mathcal {N}}_{l}}\right | \overset {(5)}{=} \max ({\mathcal {N}}_{l}) - \min ({\mathcal {N}}_{l}) + 1\geq \eta (l). \tag{18}\end{equation*}
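As an illustration, the modified condition (17) amounts to a two-part boolean test; a minimal NumPy sketch of ours (the threshold default T = 1.5 is the value used in Section IV, and n_min stands for \min ({\mathcal {N}}_{l}) ):

```python
def block_norm(V, p):
    """||V||_block of (6): spectral norms of the (1,1) and (2,1) blocks."""
    return np.linalg.norm(V[:p, :p], 2) + np.linalg.norm(V[p:, :p], 2)

def restart_test(V_tilde, n, n_min, l, p, T=1.5, tau=1, theta=1):
    """Sketch of the modified restarting condition (17): threshold test AND
    the polynomial lower bound eta(l) = floor(l / tau) ** theta on the
    usage count of the current center point S_[l]."""
    eta = (l // tau) ** theta
    return block_norm(V_tilde, p) > T and (n - n_min + 1) >= eta
```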

Finally, we present the following overall convergence analysis of Algorithm 1.

Theorem 3.3 (Overall Convergence Analysis):

Assume that (i) f:\mathbb {R}^{N\times p} \to \mathbb {R} is differentiable, and (ii) \nabla f is Lipschitz continuous with a Lipschitz constant L_{\nabla f} > 0 . Consider the situation where Condition 3.1 holds true. Suppose that, for some c > 0 and all l \in \Lambda _{\mathrm{ c}} , \mathcal {A}^{\langle l \rangle } belongs to NOE-r of \mathfrak {o}(cn^{-1/2}) in the sense of Definition 2.4 (b). Let (\boldsymbol {V}_{n})_{n \in \bigcup _{l\in \Lambda _{\mathrm{ c}}} {\mathcal {N}}_{l}} be generated by Algorithm 1 incorporating \mathcal {A}^{\langle l \rangle }~(l\in \Lambda _{\mathrm{ c}}) and (17) with some T > 0 and (\theta,\tau) \in \mathbb {N}^{2} as the restarting condition in line 8 of Algorithm 1. Then, we have \begin{equation*} (n\in \mathbb {N}_{0}) ~\min _{k=0,1,\ldots,n} \left \lVert{ \nabla f_{\boldsymbol {S}_{[\ell (k)] }}(\boldsymbol {V}_{k}) }\right \rVert _{F} \leq \frac {\sqrt {(\sqrt {3}+3)\tau }c}{(n+1)^{\frac {\theta }{2(\theta +1)}}} \tag{19}\end{equation*} and \begin{align*} &\hspace {-.5pc} \liminf _{n\to \infty } \left \lVert{ \nabla f_{\boldsymbol {S}_{[\ell (n)] }}(\boldsymbol {V}_{n}) }\right \rVert _{F} \\ &:= \lim _{n\to \infty }\inf \underbrace {\left \{{ \left \lVert{ \nabla f_{\boldsymbol {S}_{[\ell (k)] }}(\boldsymbol {V}_{k}) }\right \rVert _{F}\mid k \geq n}\right \}}_{=:\mathfrak {G}_{n}} = 0. \tag{20}\end{align*}

Proof:

See Appendix B.
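To make the exponent in (19) concrete: \begin{equation*} \frac {\theta }{2(\theta +1)} = \frac {1}{4}~(\theta =1), \qquad \frac {\theta }{2(\theta +1)} = \frac {5}{11} \approx 0.455~(\theta =10), \qquad \frac {\theta }{2(\theta +1)} \nearrow \frac {1}{2}~(\theta \to \infty),\end{equation*} i.e., a larger \theta brings the guaranteed rate arbitrarily close to the \mathfrak {o}(cn^{-1/2}) rate of the incorporated Euclidean algorithm, at the price of keeping each center point for at least \eta (l) iterations via (17).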

Remark 3.4 (Interpretation of Theorem 3.3):

  1. As stated just after Problem 1.1, a realistic goal of Problem 1.1 is to find its stationary point \boldsymbol {U}^{\star } \in \mathrm {St} (p,N) . More precisely, according to Fact 2.1 (b), it is desired to approximate (\boldsymbol {V}^{\star }, \boldsymbol {S}^{\star }) \in Q_{N,p}\times {\mathrm{ O}}(N) satisfying \nabla f_{\boldsymbol {S}^{\star }}(\boldsymbol {V}^{\star }) = \boldsymbol {0} with a certain sequence (\boldsymbol {V}_{n},\boldsymbol {S}_{[\ell (n)] })_{n=0}^{\infty } \subset Q_{N,p}\times {\mathrm{ O}}(N) , e.g., by Algorithm 1. Clearly, the RHS in (19) converges to zero as n\to \infty . This upper bound of \min _{k=0,1,\ldots,n} \left \lVert{ \nabla f_{\boldsymbol {S}_{[\ell (k)] }}(\boldsymbol {V}_{k}) }\right \rVert _{F} tells us how fast the gradient approaches zero.

  2. The evaluations shown in (19) and (20) can also be applied to the NCP strategy (incorporating an algorithm of NOE-r as \mathcal {A}^{\langle 0 \rangle } ) as a special case of the ALCP strategy (Algorithm 1) by setting a sufficiently large T > 0 in (17).

SECTION IV.

Numerical Experiments

In the experiments, we demonstrate the numerical performance of Algorithm 1 incorporating the restarting condition (17) with T:= 1.5 and \tau,\theta \in \{1, 10\} . We employed \mathcal {A}_{\mathrm {rNAG}} in (9) as \mathcal {A}^{\langle l \rangle }~(l\in \Lambda _{\mathrm{ c}}) in Algorithm 1, say rNAG+ALCP, where we used \rho := 0.5 , c_{R}:= 2^{-13} , and \gamma ':=1 in rNAG and Algorithm 3. We compared rNAG+ALCP with (i) the NCP strategy (say rNAG+NCP) incorporating \mathcal {A}_{\mathrm {rNAG}} (see also Remark 3.4 (b)), and (ii) the corresponding retraction-based strategy (say rNAG+Retr) incorporating \mathcal {A}_{\mathrm {rNAG}} [40, Algorithm 4.1] (for retraction-based strategies, see [25] and the last part of Section I). All algorithms were terminated at the n th iterate achieving \begin{align*} \begin{cases} \displaystyle n=1000, \\ \displaystyle \quad \mathrm {or} \\ \displaystyle \frac { \left \lVert{ \boldsymbol {G}_{n} }\right \rVert _{F}}{ \left \lVert{ \boldsymbol {G}_{0} }\right \rVert _{F}} < 10^{-6}, \\ \displaystyle \quad \mathrm {or} \\ \displaystyle \left |{f(\boldsymbol {U}_{n})-f(\boldsymbol {U}_{n-1})}\right | + \left |{f(\boldsymbol {U}_{n-1})-f(\boldsymbol {U}_{n-2})}\right | < 10^{-6}, \end{cases} \tag{21}\end{align*} where \begin{align*} \boldsymbol {G}_{n}:= \begin{cases} \displaystyle \nabla (f\circ \Phi _{\boldsymbol {S}_{[\ell (n)] }}^{-1})(\boldsymbol {V}_{n}) & \textrm {(rNAG+ALCP, rNAG+NCP)} \\ \displaystyle \mathop {\mathrm {grad}}f(\boldsymbol {U}_{n}) & \textrm {(rNAG+Retr)} \end{cases}\end{align*} and \mathrm {grad} f(\boldsymbol {U}) is the Riemannian gradient of f at \boldsymbol {U} \in \mathrm {St} (p,N) .
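The termination test (21) translates directly into code; a small sketch of ours (fvals holds the history of cost values f(\boldsymbol {U}_{k}) ):

```python
def should_stop(n, G_n, G_0, fvals, max_iter=1000, tol=1e-6):
    """Sketch of the termination test (21): iteration cap, relative
    gradient norm, or stagnation of the last two objective decreases."""
    if n >= max_iter:
        return True
    if np.linalg.norm(G_n) / np.linalg.norm(G_0) < tol:  # Frobenius norms
        return True
    if n >= 2 and (abs(fvals[n] - fvals[n - 1])
                   + abs(fvals[n - 1] - fvals[n - 2])) < tol:
        return True
    return False
```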

All the experiments were performed in MATLAB on a MacBook Pro (13-inch, M1, 2020) with 16 GB of RAM. To evaluate the numerical performance of rNAG+ALCP, we considered the following standard test problems: (i) the linear eigenvalue problem, e.g., [25]; (ii) the nonlinear eigenvalue problem, e.g., [21]. For each problem, we generated an initial guess \boldsymbol {U}_{0}\in \mathrm {St} (p,N) randomly by the MATLAB code ‘orth(rand(N,p))’ for 10 trials.

  1. Linear eigenvalue problem: For a given symmetric matrix \boldsymbol {A} \in \mathbb {R}^{N\times N} , we considered \begin{equation*} \mathop {\mathrm {minimize}}_{\boldsymbol {U}\in \mathrm {St} (p,N)} f_{1}(\boldsymbol {U}):=- \mathrm {Tr}(\boldsymbol {U}^{ \mathsf {T}}\boldsymbol {A}\boldsymbol {U}). \tag{22}\end{equation*} Any global minimizer \boldsymbol {U}^{\star } of the problem (22) is an orthonormal eigenbasis associated with the p largest eigenvalues of \boldsymbol {A} [25]. In our experiment, we used \boldsymbol {A}:=\widetilde {\boldsymbol {A}}^{ \mathsf {T}}\widetilde {\boldsymbol {A}} \in \mathbb {R}^{N\times N} with a randomly chosen \widetilde {\boldsymbol {A}} \in \mathbb {R}^{N\times N} whose entries are sampled from the standard normal distribution.

  2. Nonlinear eigenvalue problem: By following [19], [20], and [21], we considered \begin{equation*} \mathrm {find}~\boldsymbol {U}^{\star } \in \mathrm {St} (p,N)~\mathrm {s.t.}~H(\boldsymbol {U}^{\star })\boldsymbol {U}^{\star } = \boldsymbol {U}^{\star }\boldsymbol {D} \tag{23}\end{equation*} with some diagonal matrix \boldsymbol {D} \in \mathbb {R}^{p\times p} , where H(\boldsymbol {U}):= \boldsymbol {L}+ \zeta \mathrm {diag}(\boldsymbol {L}^{-1}\psi (\boldsymbol {U})) \in \mathbb {R}^{N\times N} is given with a symmetric Laplacian matrix \boldsymbol {L} \in \mathbb {R}^{N\times N} , \psi (\boldsymbol {U}):=\mathrm {Diag}(\boldsymbol {U}\boldsymbol {U}^{ \mathsf {T}}) \in \mathbb {R}^{N} , \zeta > 0 , and \mathrm {Diag}:\mathbb {R}^{N\times N}\to \mathbb {R}^{N}:\boldsymbol {X}\mapsto [[\boldsymbol {X}]_{11}~[\boldsymbol {X}]_{22}~\cdots ~[\boldsymbol {X}]_{NN}] . The solution of (23) can be characterized as a global minimizer of (see, e.g., [20]) \begin{equation*} \mathop {\mathrm {minimize}}_{\boldsymbol {U}\in \mathrm {St} (p,N)} f_{2}(\boldsymbol {U}):=\frac {1}{2} \mathrm {Tr}(\boldsymbol {U}^{ \mathsf {T}}\boldsymbol {L}\boldsymbol {U}) + \frac {\zeta }{4}\psi (\boldsymbol {U})^{ \mathsf {T}}\boldsymbol {L}^{-1}\psi (\boldsymbol {U}). \tag{24}\end{equation*} In our experiment, we used \zeta = 3 and \boldsymbol {L} as the one-dimensional discrete Laplacian with 2 on the diagonal and −1 on the sub- and superdiagonals by following [21] (a sketch of both test problems follows this list).
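For reference, a NumPy sketch of both test problems and their Euclidean gradients (our construction; the closed forms \nabla f_{1}(\boldsymbol {U}) = -2\boldsymbol {A}\boldsymbol {U} and \nabla f_{2}(\boldsymbol {U}) = H(\boldsymbol {U})\boldsymbol {U} are standard computations, stated here as our derivation):

```python
def make_linear_eig(N, rng=np.random.default_rng(0)):
    """Sketch of problem (22): f1(U) = -Tr(U^T A U), A = Atilde^T Atilde."""
    At = rng.standard_normal((N, N))
    A = At.T @ At                                    # symmetric positive semidefinite
    f = lambda U: -np.trace(U.T @ A @ U)
    grad = lambda U: -2.0 * (A @ U)                  # grad f1(U) = -2 A U (A symmetric)
    return f, grad

def make_nonlinear_eig(N, zeta=3.0):
    """Sketch of problem (24) with the 1-D discrete Laplacian L."""
    L = 2 * np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)
    Linv = np.linalg.inv(L)
    psi = lambda U: np.sum(U * U, axis=1)            # psi(U) = Diag(U U^T)
    f = lambda U: (0.5 * np.trace(U.T @ L @ U)
                   + 0.25 * zeta * psi(U) @ Linv @ psi(U))
    H = lambda U: L + zeta * np.diag(Linv @ psi(U))  # H(U) of (23)
    grad = lambda U: H(U) @ U                        # grad f2(U) = H(U) U
    return f, grad
```

An initial guess analogous to the MATLAB ‘orth(rand(N,p))’ can be taken as np.linalg.qr(rng.random((N, p)))[0] .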

Tables 1 and 2 summarize respectively the average results for problems (22) and (24) over 10 trials with N\in \{1000,2000\} and p\in \{1, 10\} , where for each output \boldsymbol {U}^{\diamond } \in \mathrm {St} (p,N) of these algorithms, ‘fval-optimal’ means the value f(\boldsymbol {U}^{\diamond }) - f(\boldsymbol {U}^{\star }) , ‘fval’ means the value f(\boldsymbol {U}^{\diamond }) , ‘feasi’ means the feasibility \left \lVert{ \boldsymbol {I}_{p}-\boldsymbol {U}^{\diamond \mathsf {T} }\boldsymbol {U}^{\diamond } }\right \rVert _{F} , ‘nrmg’ means the norm \left \lVert{ \mathrm {grad} f(\boldsymbol {U}^{\diamond }) }\right \rVert _{F} , ‘itr’ means the number of iterations, ‘time’ means the CPU time (s), and ‘restart’ means the number of center point updates for rNAG+ALCP. Figures 1 and 2 demonstrate the convergence histories for problems (22) and (24), respectively, where the plots show CPU time on the horizontal axis versus the value of the cost function on the vertical axis. The following remark summarizes the main points found in Tables 1 and 2 and Figures 1 and 2.

TABLE 1. Performance of Each Algorithm Applied to the Problem (22)
TABLE 2. Performance of Each Algorithm Applied to the Problem (24)
FIGURE 1. Convergence histories of each algorithm applied to the problem (22) regarding the value f(\boldsymbol{U})-f(\boldsymbol{U}^{\star}) at CPU time for each problem size, where \boldsymbol{U}^{\star}\in\mathrm{St} (p,N) was obtained by the eigenvalue decomposition of \boldsymbol{A} . Markers are put at every 15 iterations.

FIGURE 2. Convergence histories of each algorithm applied to the problem (24) regarding the value f(\boldsymbol{U}) at CPU time for each problem size. Markers are put at every 15 iterations.

Remark 4.1 (Main Findings From Numerical Experiments):

  1. (Comparison of rNAG+ALCP and rNAG+NCP) As expected from the design concept of the ALCP strategy (see the paragraphs just before and after Problem 1.3, and Remark 2.2 (c)), Figures 1 and 2 show that rNAG+ALCP dramatically improves the convergence speed of rNAG+NCP.

  2. (Comparison of rNAG+Retr and rNAG+ALCP) In terms of the number of iterations in Tables 1 and 2, the performance of rNAG+ALCP is comparable to that of rNAG+Retr. From Figures 1 and 2, we found that the proposed rNAG+ALCP outperforms rNAG+Retr in terms of CPU time for all \tau,\theta \in \{1, 10\} .

  3. (Robustness against choices of \tau,\theta used in (17)) From Figures 1 and 2, the numerical performance of rNAG+ALCP is robust against the choice of \tau and \theta . Moreover, from Tables 1 and 2, for rNAG+ALCP with (\tau,\theta) \neq (1,10) , we found that all results are the same except for “time”. This is mainly because, for rNAG+ALCP with (\tau,\theta) \neq (1,10) , (i) \eta (l) \leq 1 holds during the process of the algorithm due to the small number of restarts (see (17) and “restart”); (ii) the latter condition n - \min ({\mathcal {N}}_{l}) + 1 \geq \eta (l) in (17) thus always holds true; and (iii) every center point was updated at the same iteration.

SECTION V.

Conclusion

We presented a convergence rate analysis, with a new restarting condition, of the adaptive localized Cayley parametrization (ALCP) strategy for optimization over the Stiefel manifold. In the ALCP strategy, an optimization problem over the Stiefel manifold is translated into optimization problems over a Euclidean space via the tunable generalized left-localized Cayley transform. The proposed analysis provides a convergence rate, in terms of the gradient sequence, for the ALCP strategy. Numerical experiments demonstrate the effectiveness of the ALCP strategy compared with the retraction-based strategy and the naive Cayley parametrization strategy.

Appendix A

Gradient of Function After the Cayley Parametrization

We briefly review the gradient of f\circ \Phi _{\boldsymbol {S}}^{-1} in the case where Q_{N,p}(\boldsymbol {S}) is equipped with the standard inner product {\langle \boldsymbol {V}_{1},\boldsymbol {V}_{2} \rangle } = \mathrm {Tr}(\boldsymbol {V}_{1}^{ \mathsf {T}}\boldsymbol {V}_{2})~(\boldsymbol {V}_{1},\boldsymbol {V}_{2}\in Q_{N,p}(\boldsymbol {S})) .

Fact 1.1 (Gradient after the Cayley Parametrization):

For a differentiable function f:\mathbb {R}^{N\times p}\to \mathbb {R} and \boldsymbol {S} \in {\mathrm{ O}}(N) , the following hold with f_{\boldsymbol {S}}:=f\circ \Phi _{\boldsymbol {S}}^{-1}:Q_{N,p}(\boldsymbol {S}) \to \mathbb {R} :

  1. ([33, Proposition 2.9, Remark 2.11]) f_{\boldsymbol {S}} is differentiable, and the computational complexity of f_{\boldsymbol {S}} with \boldsymbol {S} designed by Algorithm 2 is O(Np^{2}+p^{3}) flops.

  2. ([33, Proposition A.11]) Assume that \nabla f is Lipschitz continuous with a Lipschitz constant L_{\nabla f} > 0 , i.e., \begin{align*} & (\forall \boldsymbol {U}_{1},\boldsymbol {U}_{2} \in \mathrm {St} (p,N)) \\ & \quad \left \lVert{ \nabla f(\boldsymbol {U}_{1})-\nabla f(\boldsymbol {U}_{2}) }\right \rVert _{F} \leq L_{\nabla f} \left \lVert{ \boldsymbol {U}_{1} - \boldsymbol {U}_{2} }\right \rVert _{F}.\end{align*} Then, the gradient \nabla f_{\boldsymbol {S}} is Lipschitz continuous over Q_{N,p}(\boldsymbol {S}) with the Lipschitz constant 4(\mu + L_{\nabla f}) , i.e., \begin{align*} & (\forall \boldsymbol {V}_{1},\boldsymbol {V}_{2} \in Q_{N,p}(\boldsymbol {S})) \\ & \quad \left \lVert{ \nabla f_{\boldsymbol {S}}(\boldsymbol {V}_{1}) - \nabla f_{\boldsymbol {S}}(\boldsymbol {V}_{2}) }\right \rVert _{F} \leq 4(\mu + L_{\nabla f}) \left \lVert{ \boldsymbol {V}_{1}-\boldsymbol {V}_{2} }\right \rVert _{F},\end{align*} where \mu \geq \max _{\boldsymbol {U}\in \mathrm {St} (p,N)} \left \lVert{ \nabla f(\boldsymbol {U}) }\right \rVert _{2} .
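As a quick numerical illustration of the Lipschitz assumption in Fact 1.1 (b), the following Python sketch checks the inequality for the quadratic cost f(\boldsymbol {U}) = -\mathrm {Tr}(\boldsymbol {U}^{ \mathsf {T}}\boldsymbol {A}\boldsymbol {U})/2 , for which \nabla f(\boldsymbol {U}) = -\boldsymbol {A}\boldsymbol {U} and L_{\nabla f} = \left \lVert{ \boldsymbol {A} }\right \rVert _{2} ; the cost function and the sampling of random points on \mathrm {St} (p,N) are our illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
N, p = 200, 5

# Symmetric A defining the illustrative cost f(U) = -Tr(U^T A U) / 2,
# whose Euclidean gradient U -> -A U is Lipschitz with constant ||A||_2.
B = rng.standard_normal((N, N))
A = 0.5 * (B + B.T)
L = np.linalg.norm(A, ord=2)

def random_stiefel(N, p):
    # The Q-factor of a Gaussian matrix lies on St(p, N).
    Q, _ = np.linalg.qr(rng.standard_normal((N, p)))
    return Q

for _ in range(5):
    U1, U2 = random_stiefel(N, p), random_stiefel(N, p)
    lhs = np.linalg.norm(A @ (U2 - U1), ord='fro')  # ||grad f(U1) - grad f(U2)||_F
    rhs = L * np.linalg.norm(U1 - U2, ord='fro')
    assert lhs <= rhs + 1e-10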

Appendix B

Proof of Theorem 3.3

The following claim is required for our analysis.

Claim 2.1:

Let \theta,\tau \in \mathbb {N} , and n, \Upsilon \in \mathbb {N}_{0} . Then,

  1. \begin{equation*} (\Upsilon \geq \tau) \quad \sum _{l=0}^{\Upsilon -1}\left \lfloor{ \frac {l}{\tau } }\right \rfloor ^{\theta } \geq \tau \frac {\left({ \left \lfloor{ \frac {\Upsilon }{\tau } }\right \rfloor -1}\right)^{\theta +1}}{\theta +1}. \tag{25}\end{equation*}

  2. \begin{align*} &\hspace {-1pc} \mathrm {If}~n + 1\geq \sum _{l=0}^{\max \{\Upsilon - 1,0\}}\left \lfloor{ \frac {l}{\tau }}\right \rfloor ^{\theta }, ~\mathrm {then} \\ & \Upsilon + 1 \leq \tau (\sqrt {3}+3) \sqrt [\theta +1]{n+1} \\ & \left ({ \Leftrightarrow \frac {\Upsilon + 1}{\tau \times \sqrt [\theta +1]{n+1}} \leq \sqrt {3}+3 }\right). \tag{26}\end{align*}

Proof:

  1. (Step 1) We first show \begin{equation*} \sum _{l=0}^{\Upsilon -1}\left \lfloor{ \frac {l}{\tau } }\right \rfloor ^{\theta } \geq \tau \sum _{m=0}^{ \left \lfloor{ \frac {\Upsilon }{\tau } }\right \rfloor -1}m^{\theta }. \tag{27}\end{equation*} Since (i) the RHS of (27) is invariant for \Upsilon \in [\kappa \tau, (\kappa +1)\tau -1] with every \kappa \in \mathbb {N}_{0} and (ii) the LHS of (27) is monotonically increasing in \Upsilon , it suffices to verify (27) for the case \Upsilon = \kappa \tau . In this case, the LHS of (27) is evaluated as \begin{align*} &\hspace {-1pc} \sum _{l=0}^{\kappa \tau -1}\left \lfloor{ \frac {l}{\tau } }\right \rfloor ^{\theta } = \sum _{m=0}^{\kappa -1} \sum _{i=0}^{\tau -1} \left \lfloor{ \frac {m\tau + i}{\tau } }\right \rfloor ^{\theta } =\sum _{m=0}^{\kappa -1} \sum _{i=0}^{\tau -1} m^{\theta } \\ & = \tau \sum _{m=0}^{\kappa -1} m^{\theta } = \tau \sum _{m=0}^{\frac {\Upsilon }{\tau }-1} m^{\theta } = \tau \sum _{m=0}^{ \left \lfloor{ \frac {\Upsilon }{\tau } }\right \rfloor -1} m^{\theta },\end{align*} which equals the RHS of (27).

    (Step 2) By combining (27) with the elementary bound \begin{equation*} \tau \sum _{m=0}^{ \left \lfloor{ \frac {\Upsilon }{\tau } }\right \rfloor -1}m^{\theta } \geq \tau \int ^{ \left \lfloor{ \frac {\Upsilon }{\tau } }\right \rfloor -1}_{0} x^{\theta } dx = \tau \frac {\left({ \left \lfloor{ \frac {\Upsilon }{\tau } }\right \rfloor -1}\right)^{\theta +1}}{\theta +1},\end{equation*} where \left \lfloor{ \Upsilon /\tau }\right \rfloor \geq 1 holds by \Upsilon \geq \tau , we deduce (25).

  2. (Step 1) For the special case of \Upsilon \in [0, \tau -1] , (26) holds clearly because \Upsilon + 1 \leq \tau \leq \tau (\sqrt {3}+3)\leq \tau (\sqrt {3}+3)\sqrt [\theta +1]{n+1} .

    (Step 2) For the remaining cases \Upsilon \in [\tau, \infty) , combining (25) with the assumption of (26), we have \begin{equation*} n+1 \geq \sum _{l=0}^{\Upsilon -1}\left \lfloor{ \frac {l}{\tau } }\right \rfloor ^{\theta } \geq \tau \frac {\left({ \left \lfloor{ \frac {\Upsilon }{\tau } }\right \rfloor -1}\right)^{\theta +1}}{\theta +1},\end{equation*} which implies \begin{equation*} \sqrt [\theta +1]{\tau ^{-1}(\theta +1)(n+1)} \geq \left \lfloor{ \frac {\Upsilon }{\tau } }\right \rfloor - 1 \geq \frac {\Upsilon }{\tau } - 2,\end{equation*} or equivalently \begin{equation*} \Upsilon + 1 \leq \sqrt [\theta +1]{\tau ^{\theta }(\theta +1)(n+1)} + 2\tau + 1.\end{equation*}

    (Step 3) By dividing the upper bound obtained in Step 2 by \tau \times \sqrt [\theta +1]{n+1} , we derive the desired bound in (26):\begin{align*} &\hspace {-1pc} \frac {\sqrt [\theta +1]{\tau ^{\theta } (\theta +1)(n+1)}+2\tau + 1}{\tau \times \sqrt [\theta +1]{n+1}} \\ & = \tau ^{-1/(\theta +1)}\sqrt [\theta +1]{\theta +1} + \frac {2\tau + 1}{\tau \times \sqrt [\theta +1]{n+1}} \\ & \leq \sqrt [\theta +1]{\theta +1} + \frac {2\tau + 1}{\tau }~(\because \tau \geq 1, n \geq 0) \\ & \leq \left ({(\theta +1)+1}\right)^{1/(\theta +1)} + 3 \leq \sqrt {3} + 3,\end{align*} where the last inequality uses the monotonically decreasing property of \left ({(\theta +1)+1}\right)^{1/(\theta +1)}~(\theta \in \mathbb {N}) , whose maximum \sqrt {3} is attained at \theta = 1 .
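Claim 2.1 also lends itself to a direct numerical sanity check. The following Python sketch (illustrative only) verifies (25) over a small parameter grid and tests (26) at the smallest n satisfying its hypothesis.

def lhs25(Upsilon, tau, theta):
    # LHS of (25): sum of floor(l / tau)^theta over l = 0, ..., Upsilon - 1.
    return sum((l // tau) ** theta for l in range(Upsilon))

# Check (25) for Upsilon >= tau on a small grid of parameters.
for tau in range(1, 6):
    for theta in range(1, 6):
        for Upsilon in range(tau, 200):
            rhs25 = tau * (Upsilon // tau - 1) ** (theta + 1) / (theta + 1)
            assert lhs25(Upsilon, tau, theta) >= rhs25

# Check (26) at the tightest admissible n, i.e., with n + 1 equal to the
# sum in its hypothesis (clipped below by 1).
for tau in range(1, 4):
    for theta in range(1, 4):
        for Upsilon in range(0, 100):
            n = max(lhs25(max(Upsilon, 1), tau, theta) - 1, 0)
            bound = tau * (3 ** 0.5 + 3) * (n + 1) ** (1.0 / (theta + 1))
            assert Upsilon + 1 <= bound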

Now, we are ready to prove Theorem 3.3.

Proof of Theorem 3.3:

For any n\in \mathbb {N}_{0} , we have \begin{equation*} n + 1 \geq \sum _{l=0}^{\max \{\ell (n)-1, 0\}} \left \lfloor{ \frac {l}{\tau } }\right \rfloor ^{\theta }. \tag{28}\end{equation*} This is because (i) for the special case \ell (n) = 0 , we have n+1 \geq 0 = \sum _{l=0}^{0} \left \lfloor{ \frac {l}{\tau } }\right \rfloor ^{\theta } , and (ii) for the other cases \ell (n) \geq 1 , we have \begin{equation*} n+1 \geq \sum _{l=0}^{\ell (n)-1} \left |{{\mathcal {N}}_{l}}\right | \overset {(18)}{\geq } \sum _{l=0}^{\ell (n)-1}\eta (l) \overset {(17)}{=} \sum _{l=0}^{\ell (n)-1} \left \lfloor{ \frac {l}{\tau } }\right \rfloor ^{\theta }.\end{equation*} By combining (28) and Claim 2.1 (b) with \Upsilon := \ell (n) , we have \ell (n) + 1 \leq \tau (\sqrt {3}+3) \sqrt [\theta +1]{n+1} . Finally, by substituting the last inequality into (14), we have (19), which also implies \begin{equation*} 0\leq \inf \mathfrak {G}_{0} \leq \lim _{n\to \infty } \frac {\sqrt {(\sqrt {3}+3)\tau }c}{(n+1)^{\frac {\theta }{2(\theta +1)}}} = 0.\end{equation*} Moreover, by noting that \left \lVert{ \nabla f_{\boldsymbol {S}_{[\ell (n)] }}(\boldsymbol {V}_{n}) }\right \rVert _{F} > 0~(\forall n \in \mathbb {N}_{0}) holds under Condition 3.1, we have \begin{equation*} (\forall n \in \mathbb {N}_{0})~\mathfrak {g}_{n}:=\min _{0\leq k \leq n} \left \lVert{ \nabla f_{\boldsymbol {S}_{[\ell (k)] }}(\boldsymbol {V}_{k}) }\right \rVert _{F} > 0.\end{equation*} Therefore, we get \inf \mathfrak {G}_{n} = 0~(\forall n\in \mathbb {N}_{0}) by \begin{equation*} \inf \mathfrak {G}_{0} = \min \{\mathfrak {g}_{n}, \inf \mathfrak {G}_{n}\}= 0,\end{equation*} and thus 0 = \lim _{n\to \infty }\inf \mathfrak {G}_{n} =:\liminf _{n\to \infty } \left \lVert{ \nabla f_{\boldsymbol {S}_{[\ell (n)] }}(\boldsymbol {V}_{n}) }\right \rVert _{F} .
