Adaptive Localized Cayley Parametrization for Optimization Over Stiefel Manifold and Its Convergence Rate Analysis

SECTION I.

Introduction

Orthogonality constrained optimization problems have been playing crucial roles in signal processing and machine learning applications [1], [2], [3]. Such applications include sparse principal component analysis [4], [5], [6], 1-bit compressed sensing [7], [8], the joint diagonalization problem for independent component analysis [9], [10], [11], the orthogonal Procrustes problem [12], [13], [14], [15], the nearest low-rank correlation matrix problem [16], [17], [18], nonlinear eigenvalue problem [19], [20], [21], and the enhancement of the generalization performance in deep neural networks [22], [23], [24]. The orthogonality constraint set defined by $\mathrm {St}(p,N):= \{\boldsymbol {U} \in \mathbb {R}^{N\times p} \mid \boldsymbol {U}^{ \mathsf {T}}\boldsymbol {U} = \boldsymbol {I}_{p}\} (p\leq N)$ is called the Stiefel manifold. In this paper, we consider the following optimization problem over $\mathrm {St}(p,N)$ :

Problem 1.1:

For $(N,p) \in \mathbb {N}^{2}$ satisfying $p \leq N$ and a given differentiable function $f:\mathbb {R}^{N\times p }\to \mathbb {R}$ , $\begin{equation*} \mathrm {find}~\boldsymbol {U}^{\star } \in \mathop {\mathrm {argmin}}\limits _{\boldsymbol {U}\in \mathrm {St} (p,N)} ~f(\boldsymbol {U}).\end{equation*}$ View Source

A realistic goal for Problem 1.1 is to find a stationary point $\boldsymbol {U}^{\star } \in \mathrm {St} (p,N)$ , which satisfies a necessary condition for a local minimizer of $f$ over $\mathrm {St}(p,N)$ [25], [26], [27], because Problem 1.1 is a nonconvex optimization problem in general. Problem 1.1 is challenging mainly because of the severe nonlinearity of $\mathrm {St}(p,N)$ .

A family of Cayley-type transforms [22], [26], [28], [29], [30], [31], [32], [33] resolves the nonlinearity of $\mathrm {St}(p,N)$ in Problem 1.1 by translating $\mathrm {St}(p,N)$ into a Euclidean space. With a generalized left-localized Cayley transform $\Phi _{\boldsymbol {S}}: \mathrm {St}(p,N)\setminus E_{N,p}(\boldsymbol {S}) \to Q_{N,p}(\boldsymbol {S}) ~(\boldsymbol {S}\in {\mathrm{ O}}(N):= \mathrm {St}(N,N))$ [33] (see (2)), a subset $\mathrm {St}(p,N) \setminus E_{N,p}(\boldsymbol {S})$ can be parameterized in terms of the $Np-\frac {1}{2}p(p+1)$ -dimensional subspace $\begin{align*} Q_{N,p}(\boldsymbol {S}) &:= Q_{N,p}:=\left \{{ \left.{\begin{bmatrix} \boldsymbol {A} & -\boldsymbol {B}^{ \mathsf {T}} \\ \boldsymbol {B} & \boldsymbol {0} \end{bmatrix} \in \mathbb {R}^{N\times N} }\right \vert \; \substack { -\boldsymbol {A}^{ \mathsf {T}} = \boldsymbol {A} \in \mathbb {R}^{p\times p}, \\ \boldsymbol {B} \in \mathbb {R}^{(N-p)\times p}}}\right \} \\ & \equiv \mathbb {R}^{Np-\frac {1}{2}p(p+1)} \tag{1}\end{align*}$ View Sourceof the set of all skew-symmetric matrices $Q_{N,N}:= \{\boldsymbol {V} \in \mathbb {R}^{N\times N} \mid \boldsymbol {V}^{ \mathsf {T}} = -\boldsymbol {V}\}$ , where the singular-point set $E_{N,p}(\boldsymbol {S}) \subset \mathrm {St} (p,N)$ is a subset of $\mathrm {St}(p,N)$ where $\Phi _{\boldsymbol {S}}$ can not be defined (see (3)). The subset $\mathrm {St}(p,N)\setminus E_{N,p}(\boldsymbol {S})$ is open and dense1 for $p < N$ [33, Theorem 2.3], and therefore we can parameterize almost all points in $\mathrm {St}(p,N)$ with $\Phi _{\boldsymbol {S}}$ and its inverse $\Phi _{\boldsymbol {S}}^{-1}:Q_{N,p}(\boldsymbol {S}) \to \mathrm {St} (p,N) \setminus E_{N,p}(\boldsymbol {S})$ (see (4)).

As a result, Problem 1.1 can be reformulated as

Problem 1.2:

For a given differentiable function $f:\mathbb {R}^{N\times p}\to \mathbb {R}$ , $\boldsymbol {S} \in {\mathrm{ O}}(N)$ , and $\epsilon > 0$ , 2 $\begin{align*} \textrm {find} ~& \boldsymbol {V}^{\star } \in Q_{N,p}(\boldsymbol {S}) \\ & \textrm {such that} ~f\circ \Phi _{\boldsymbol {S}}^{-1}(\boldsymbol {V}^{\star }) < \min f(\mathrm {St}(p,N)) + \epsilon.\end{align*}$ View Source

The naive Cayley parametrization (NCP) strategy [28], [33] updates (i) estimates $(\boldsymbol {V}_{n})_{n=0}^{\infty } \subset Q_{N,p}(\boldsymbol {S})$ of a solution $\boldsymbol {V}^{\star }$ to Problem 1.2 by using any optimization algorithm designed for the Euclidean space, for short Euclidean optimization algorithm in this paper, and (ii) estimates $\boldsymbol {U}_{n}:=\Phi _{\boldsymbol {S}}^{-1}(\boldsymbol {V}_{n}) \in \mathrm {St} (p,N)$ of a minimizer $\boldsymbol {U}^{\star }$ of Problem 1.1. Such Euclidean optimization algorithms include, e.g., the gradient descent method, the Newton method [34], the quasi-Newton method [35], the conjugate gradient method [36], and the Nesterov-type accelerated gradient method [37], [38], [39], [40], [41].

However, in a case where $\boldsymbol {U}^{\star }$ is close to the singular-point set $E_{N,p}(\boldsymbol {S})$ , the NCP strategy may suffer from slow convergence of updating $(\boldsymbol {V}_{n})_{n=0}^{\infty }$ [28], [33]. We call this performance degradation a singular-point issue.

To suppress the singular-point issue in the NCP strategy, we recently proposed an Adaptive Localized Cayley Parametrization (ALCP) strategy [42], which adaptively adjusts the location of singular points, for

Problem 1.3:

For a given differentiable function $f:\mathbb {R}^{N\times p}\to \mathbb {R}$ and $\epsilon > 0$ , $\begin{align*} \textrm {find} ~& (\boldsymbol {V}^{\star },\boldsymbol {S}^{\star }) \in Q_{N,p} \times {\mathrm{ O}}(N) \\ & \textrm {such that} ~f\circ \Phi _{\boldsymbol {S}^{\star }}^{-1}(\boldsymbol {V}^{\star }) < \min f(\mathrm {St}(p,N)) + \epsilon.\end{align*}$ View Source

In the ALCP strategy, until the singular-point issue is detected by a computable restarting condition, we use the same $\boldsymbol {S}$ for updating $\boldsymbol {V}_{n} \in Q_{N,p}(\boldsymbol {S})$ by using Euclidean optimization algorithms for Problem 1.2 with $\boldsymbol {S}$ in analogy with the NCP strategy (see also the later part [“Related works and primitive goal of this paper”] in this section). After the detection of the singular-point issue at the tentative estimate $\widetilde {\boldsymbol {V}}_{n+1} \in Q_{N,p}(\boldsymbol {S})$ , we update $\boldsymbol {S} \in {\mathrm{ O}}(N)$ to $\boldsymbol {S}' \in {\mathrm{ O}}(N)$ so that $\boldsymbol {U}_{n+1}:= \Phi _{\boldsymbol {S}}^{-1}(\widetilde {\boldsymbol {V}}_{n+1}) \in \mathrm {St} (p,N)$ is distant from the new singular-point set $E_{N,p}(\boldsymbol {S}')$ . The parametric expression of $\boldsymbol {U}_{n+1}$ is updated from $\widetilde {\boldsymbol {V}}_{n+1} \in Q_{N,p}(\boldsymbol {S})$ to $\boldsymbol {V}_{n+1}:= \Phi _{\boldsymbol {S}'}(\boldsymbol {U}_{n+1}) \in Q_{N,p}(\boldsymbol {S}')$ , and the minimization $f\circ \Phi _{\boldsymbol {S}'}^{-1}$ over $Q_{N,p}(\boldsymbol {S}')$ is restarted with the initial guess $\boldsymbol {V}_{n+1}\in Q_{N,p}(\boldsymbol {S}')$ . This restarting technique in the ALCP strategy computationally realizes an adaptively parameterized space, and improves numerical performance of the NCP strategy without suffering from impacts of the singular-point issue.

Convergence rates, in terms of gradient sequence, of Euclidean optimization algorithms have been of great interest in order to estimate the decay speed of the gradient to zero. Indeed, many Euclidean optimization algorithms, e.g., the Nesterov-type accelerated gradient method [39], [40], [41] and the adaptive momentum optimization algorithms [43], [44], [45], have been designed carefully with convergence rates. We call a class of such Euclidean optimization algorithms Nonconvex Optimization over the Euclidean space with convergence Rates (NOE-r) (see Definition 2.4 for the formal definition of NOE-r).

So far, any convergence rate analysis for the ALCP strategy has not yet been established.3 Here, we encounter the following fundamental question:

Can we derive any convergence rate of the ALCP strategy, in terms of gradient sequence, by incorporating NOE-r?

To establish a convergence rate for the ALCP strategy, we introduce a modification of the restarting condition [33], of the ALCP strategy, with a polynomial lower bound of the usage count of the same center point. Thanks to the NOE-r and the proposed modification, we successfully derive a convergence rate of the ALCP strategy in Theorem 3.3. Numerical experiments in Section IV demonstrate the effectiveness of the ALCP strategy compared with the standard strategy (see below), for Problem 1.1, known as the retraction-based strategy [25], [46], [47].

Related Works and Primitive Goal Of This Paper: The so-called retraction-based strategy [25], [46], [47] is known as the standard generic strategy for optimization over Riemannian manifolds, which include $\mathrm {St}(p,N)$ as a special instance. In order to extend ideas established for optimization over the Euclidean space to that over $\mathrm {St}(p,N)$ , the retraction-based strategy approximates $\mathrm {St}(p,N)$ by the tangent space to $\mathrm {St}(p,N)$ locally around the current iterate $\boldsymbol {U}_{n} \in \mathrm {St} (p,N)$ with a retraction mapping, where the tangent space varies according to the tangent point $\boldsymbol {U}_{n}$ . However, the retraction-based strategy encounters certain difficulties in utilizing past estimates of a solution for updating the current estimate of a solution. This is mainly because the past estimates are expressed on different tangent spaces from the current tangent space. Such difficulties prevent designers of algorithms from extending powerful optimization algorithms over the Euclidean space to new algorithms over $\mathrm {St}(p,N)$ . This is because such advanced algorithms, e.g., [39], [40], [41], [43], [44], and [45], effectively exploit the past estimates of a solution for updating the current estimate of a solution. The major goal of this paper is to resolve such difficulties in the retraction-based strategy by utilizing an especially beautiful nature of $\mathrm {St}(p,N)$ , i.e., $\Phi _{\boldsymbol {S}}: \mathrm {St}(p,N) \setminus E_{N,p}(\boldsymbol {S}) \to Q_{N,p}(\boldsymbol {S})$ and $\Phi _{\boldsymbol {S}}^{-1}:Q_{N,p}(\boldsymbol {S}) \to \mathrm {St} (p,N) \setminus E_{N,p}(\boldsymbol {S})$ .

A partial preliminary short version of this paper was presented in [31].

Notation: $\mathbb {N}$ , $\mathbb {N}_{0}$ , and $\mathbb {R}$ denote respectively the set of all positive integers, the set of all nonnegative integers, and the set of all real numbers. For general $n\in \mathbb {N}$ , $\boldsymbol {I}_{n}$ stands for the identity matrix in $\mathbb {R}^{n\times n}$ , but the identity matrix in $\mathbb {R}^{N\times N}$ is denoted simply by $\boldsymbol {I}$ . For $p \leq N$ , $\boldsymbol {I}_{N\times p} \in \mathbb {R}^{N\times p}$ denotes the matrix of the first $p$ columns of $\boldsymbol {I}$ . For $p < N$ , the matrices $\boldsymbol {U}_{\mathrm {up}} \in \mathbb {R}^{p \times p}$ and $\boldsymbol {U}_{\mathrm {lo}} \in \mathbb {R}^{(N-p) \times p}$ denote respectively the upper and the lower block matrices of $\boldsymbol {U} \in \mathbb {R}^{N\times p}$ , i.e., $\boldsymbol {U}=[\boldsymbol {U}_{\mathrm {up}}^{ \mathsf {T}}~\boldsymbol {U}_{\mathrm {lo}}^{ \mathsf {T}}]^{ \mathsf {T}}$ . The matrices $\boldsymbol {S}_{\mathrm {le}} \in \mathbb {R}^{N\times p}$ and $\boldsymbol {S}_{\mathrm {ri}} \in \mathbb {R}^{N\times (N-p)}$ denote respectively the left and right block matrices of $\boldsymbol {S} \in \mathbb {R}^{N\times N}$ , i.e., $\boldsymbol {S}=[\boldsymbol {S}_{\mathrm {le}}~\boldsymbol {S}_{\mathrm {ri}}]$ . For a square matrix $\begin{aligned} \boldsymbol {X}:= \begin{bmatrix} \boldsymbol {X}_{11} \in \mathbb {R}^{p\times p} & \boldsymbol {X}_{12} \in \mathbb {R}^{p\times (N-p)} \\ \boldsymbol {X}_{21} \in \mathbb {R}^{(N-p)\times p} & \boldsymbol {X}_{22} \in \mathbb {R}^{(N-p)\times (N-p)} \end{bmatrix}\in \mathbb {R}^{N\times N} \end{aligned}$ , we use the notation $[\![\boldsymbol {X} ]\!]_{ij}:= \boldsymbol {X}_{ij}$ for $i,j \in \{1,2\}$ . For $\boldsymbol {X} \in \mathbb {R}^{m\times n}$ , $[\boldsymbol {X}]_{ij}$ denotes the $(i,j)$ entry of $\boldsymbol {X}$ and $\boldsymbol {X}^{ \mathsf {T}}$ denotes the transpose of $\boldsymbol {X}$ . For $\boldsymbol {X}\in \mathbb {R}^{n\times n}$ , $\mathop {\mathrm {S_{kew}}}(\boldsymbol {X}) = (\boldsymbol {X}-\boldsymbol {X}^{ \mathsf {T}})/2$ is the skew-symmetric component of $\boldsymbol {X}$ . For square matrices $\boldsymbol {X}_{i}\in \mathbb {R}^{n_{i}\times n_{i}}~(1\leq i \leq k)$ , $\mathrm {diag}(\boldsymbol {X}_{1},\boldsymbol {X}_{2},\ldots,\boldsymbol {X}_{k})\in \mathbb {R}^{\left({\sum _{i=1}^{k}n_{i}}\right)\times \left({\sum _{i=1}^{k}n_{i}}\right)}$ denotes the block diagonal matrix with diagonal blocks $\boldsymbol {X}_{1},\boldsymbol {X}_{2},\ldots,\boldsymbol {X}_{k}$ . For a given matrix $\boldsymbol {X} \in \mathbb {R}^{m\times n}$ , $\left \lVert{ \boldsymbol {X} }\right \rVert _{2}$ and $\left \lVert{ \boldsymbol {X} }\right \rVert _{F}$ denote respectively the spectral norm and the Frobenius norm. For a subset $\mathcal {N}$ of $\mathbb {N}_{0}$ , $\left |{\mathcal {N}}\right |$ denotes the cardinality of $\mathcal {N}$ . To distinguish from the symbol for the orthogonal group ${\mathrm{ O}}(N)$ , the symbol $\mathfrak {o}(\cdot)$ is used in place of the standard big O notation. $\lfloor \cdot \rfloor : \mathbb {R}\to \mathbb {Z}: t \mapsto \max \{n\in \mathbb {Z} \mid n\leq t\}$ stands for the floor function, where $\mathbb {Z}$ is the set of all integers. For a differentiable function $J:\mathcal {X} \to \mathbb {R}$ defined over the Euclidean space $\mathcal {X}$ with an inner product ${\langle \cdot,\cdot \rangle }$ , $\nabla J(\boldsymbol {x})$ is the gradient of $J$ at $\boldsymbol {x} \in \mathcal {X}$ satisfying $\lim _{t\to 0,t\neq 0, t\in \mathbb {R}} \frac {J(\boldsymbol {x}+t\boldsymbol {v}) - J(\boldsymbol {x})}{t} = {\langle \nabla J(\boldsymbol {x}),\boldsymbol {v} \rangle }$ for all $\boldsymbol {v} \in \mathcal {X}$ .

SECTION II.

Preliminaries

A. Adaptive Localized Cayley Parametrization for Optimization Over $\mathrm{St}(p,N)$

The generalized left-localized Cayley transform $\Phi _{\boldsymbol {S}}$ [33] is an extension of the classical Cayley transform [48] for parametrization of $\mathrm {St}(p,N)$ . For $p,N\in \mathbb {N}$ satisfying $p\leq N$ , $\Phi _{\boldsymbol {S}}$ with $\boldsymbol {S}\in {\mathrm{ O}}(N)$ is defined as $\begin{align*} \Phi _{\boldsymbol {S}}&: \mathrm {St}(p,N) \setminus E_{N,p}(\boldsymbol {S})\to Q_{N,p}(\boldsymbol {S}) \\ &: \boldsymbol {U}\mapsto \begin{bmatrix} \boldsymbol {A}_{\boldsymbol {S}}(\boldsymbol {U}) & -\boldsymbol {B}^{ \mathsf {T}}_{\boldsymbol {S}}(\boldsymbol {U}) \\ \boldsymbol {B}_{\boldsymbol {S}}(\boldsymbol {U}) & \boldsymbol {0} \end{bmatrix} \tag{2}\end{align*}$ View Source with $\begin{align*} \boldsymbol {A}_{\boldsymbol {S}}(\boldsymbol {U}) &:= 2(\boldsymbol {I}_{p}+\boldsymbol {S}_{\mathrm {le}}^{ \mathsf {T}}\boldsymbol {U})^{-\mathrm {T}} \mathop {\mathrm {S_{kew}}}(\boldsymbol {U}^{ \mathsf {T}}\boldsymbol {S}_{\mathrm {le}})(\boldsymbol {I}_{p}+\boldsymbol {S}_{\mathrm {le}}^{ \mathsf {T}}\boldsymbol {U})^{-1} \in Q_{p,p}, \\ \boldsymbol {B}_{\boldsymbol {S}}(\boldsymbol {U}) &:= - \boldsymbol {S}_{\mathrm {ri}}^{ \mathsf {T}}\boldsymbol {U}(\boldsymbol {I}_{p}+\boldsymbol {S}_{\mathrm {le}}^{ \mathsf {T}}\boldsymbol {U})^{-1} \in \mathbb {R}^{(N-p) \times p},\end{align*}$ View Source where (i) $\begin{equation*} E_{N,p}(\boldsymbol {S}):= \{\boldsymbol {U} \in \mathrm {St} (p,N) \mid \det (\boldsymbol {I}_{p} + \boldsymbol {S}_{\mathrm {le}}^{ \mathsf {T}}\boldsymbol {U}) = 0\} \tag{3}\end{equation*}$ View Source is called the singular-point set of $\Phi _{\boldsymbol {S}}$ , and (ii) $Q_{N,p}(\boldsymbol {S})$ is clearly a vector space [33] (see (1)). For $\boldsymbol {S} \in {\mathrm{ O}}(N)$ , the inversion mapping of $\Phi _{\boldsymbol {S}}$ is given by $\begin{align*} \Phi _{\boldsymbol {S}}^{-1}&: Q_{N,p}(\boldsymbol {S})\to \mathrm {St} (p,N)\setminus E_{N,p}(\boldsymbol {S}) \\ &: \boldsymbol {V} \mapsto \boldsymbol {S}(\boldsymbol {I}-\boldsymbol {V})(\boldsymbol {I}+\boldsymbol {V})^{-1}\boldsymbol {I}_{N\times p}. \tag{4}\end{align*}$ View Source We call a tunable parameter $\boldsymbol {S}$ a center point of $\Phi _{\boldsymbol {S}}$ .

Fact 2.1 (Optimality Condition[33]):

Let $f:\mathbb {R}^{N\times p} \to \mathbb {R}$ be differentiable. Let $\boldsymbol {U} \in \mathrm {St} (p,N)$ and $\boldsymbol {S} \in {\mathrm{ O}}(N)$ satisfy $\boldsymbol {U} \in \mathrm {St} (p,N) \setminus E_{N,p}(\boldsymbol {S})$ . Then, the following holds:

$\boldsymbol {U}$ is a local minimizer of Problem 1.1 if and only if $\Phi _{\boldsymbol {S}}(\boldsymbol {U})$ is a local minimizer of $f\circ \Phi _{\boldsymbol {S}}^{-1}$ over $Q_{N,p}(\boldsymbol {S})$ .
$\boldsymbol {U}$ is a stationary point of Problem 1.1 4 if and only if $\Phi _{\boldsymbol {S}}(\boldsymbol {U})$ is a stationary point of $f_{\boldsymbol {S}}:=f\circ \Phi _{\boldsymbol {S}}^{-1}$ over $Q_{N,p}(\boldsymbol {S})$ , i.e., $\nabla f_{\boldsymbol {S}}(\Phi _{\boldsymbol {S}}(\boldsymbol {U})) = \boldsymbol {0}$ (see Appendix A for $\nabla f_{\boldsymbol {S}}$ ).

By using $\Phi _{\boldsymbol {S}}$ and $\Phi _{\boldsymbol {S}}^{-1}$ , we illustrate the adaptive localized Cayley parametrization (ALCP) strategy [42] in Algorithm 1 for Problem 1.1 via Problem 1.3, where center points $\boldsymbol {S}_{[l]}$ are adaptively updated to suppress the singular-point issue (see Remark 2.2 (b) and (c)). Basically, while keeping the $l$ th center point $\boldsymbol {S}_{[l]}$ , Algorithm 1 updates from $\boldsymbol {V}_{n} \in Q_{N,p}(\boldsymbol {S}_{[l]})$ to a tentative estimate $\widetilde {\boldsymbol {V}}_{n+1} \in Q_{N,p}(\boldsymbol {S}_{[l]})$ by using a single update of a Euclidean optimization algorithm $\mathcal {A}^{\langle l \rangle }$ for the minimization of $f\circ \Phi _{\boldsymbol {S}_{[l]}}^{-1}$ (see Section II-B for $\mathcal {A}^{\langle l \rangle }$ ). In a case where the singular-point issue is detected by checking a certain restarting condition (see Remark 2.2 (d)), we update the center point from $\boldsymbol {S}_{[l]}$ to $\boldsymbol {S}_{[l+1]} \in {\mathrm{ O}}(N)$ (see line 1) so that $\boldsymbol {U}_{n+1}$ is distant from the singular-point set $E_{N,p}(\boldsymbol {S}_{[l+1]})$ (see Algorithm 2 and Remark 2.3). Then, the parametric expression of $\boldsymbol {U}_{n+1}$ is also changed from $\widetilde {\boldsymbol {V}}_{n+1} \in Q_{N,p}(\boldsymbol {S}_{[l]})$ to $\Phi _{\boldsymbol {S}_{[l+1]}}(\boldsymbol {U}_{n+1}) \in Q_{N,p}(\boldsymbol {S}_{[l+1]})$ . At last, the next iterate is given as $\begin{align*} \boldsymbol {V}_{n+1}:= \begin{cases} \displaystyle \Phi _{\boldsymbol {S}_{[l+1]}}(\boldsymbol {U}_{n+1}) & \mathrm {(if~restarting~condition~holds)} \\ \displaystyle \widetilde {\boldsymbol {V}}_{n+1} & \mathrm {(otherwise)}. \end{cases}\end{align*}$ View Source Just after updating a center point from $\boldsymbol {S}_{[l]}$ to $\boldsymbol {S}_{[l+1]} \in {\mathrm{ O}}(N)$ , we restart the minimization of $f\circ \Phi _{\boldsymbol {S}_{[l+1]}}^{-1}$ by applying a Euclidean optimization algorithm $\mathcal {A}^{\langle l+1 \rangle }$ with the initial guess $\boldsymbol {V}_{n+1} \in Q_{N,p}(\boldsymbol {S}_{[l+1]})$ .

Algorithm 1 Adaptive Localized Cayley Parametrization Strategy

Input:

$\boldsymbol {U}_{0} \in \mathrm {St} (p,N)$

1:

${\mathcal {N}}_{l} \leftarrow \emptyset ~(l\in \mathbb {N}_{0})~\triangleright$ Initialization of index sets

2:

$l \leftarrow 0$ , $n \leftarrow 0$ , $\mathcal {N}_{0} \leftarrow \{0\}$

3:

$\boldsymbol {S}_{[l]} \leftarrow \mathrm { Algorithm~ 2}(\boldsymbol {U}_{n})$

4:

$\boldsymbol {V}_{n} \leftarrow \Phi _{\boldsymbol {S}_{[l]}}(\boldsymbol {U}_{n})$

5:

while $\left \lVert{ \nabla (f\circ \Phi _{\boldsymbol {S}_{[l]}}^{-1}) (\boldsymbol {V}_{n}) }\right \rVert _{F} > 0$ 5 do

6:

$\widetilde{\boldsymbol {V}}_{n+1} \leftarrow \mathcal {A}^{\langle l \rangle }(\boldsymbol {V}_{n})$

7:

$\boldsymbol {U}_{n+1} \leftarrow \Phi _{\boldsymbol {S}_{[l]}}^{-1}(\widetilde{\boldsymbol {V}}_{n+1})$

8:

if restarting condition (e.g., (6) and (17)) holds then

9:

$\mathcal {N}_{l+1} \leftarrow \{n+1\}$

10:

$\boldsymbol {S}_{[l+1]} \leftarrow \mathrm { Algorithm~ 2}(\boldsymbol {U}_{n+1})$

11:

$\triangleright$ Update the center point

12:

$\boldsymbol {V}_{n+1} \leftarrow \Phi _{\boldsymbol {S}_{[l+1]}}(\boldsymbol {U}_{n+1})$

13:

$\triangleright$ New parametric expression of $\boldsymbol {U}_{n+1}$ using $\boldsymbol {S}_{[l+1]}$

14:

$l\leftarrow l+1$

15:

else

16:

${\mathcal {N}}_{l} \leftarrow {\mathcal {N}}_{l}\cup \{n+1\}$

17:

$\boldsymbol {V}_{n+1} \leftarrow \widetilde{\boldsymbol {V}}_{n+1}$

18:

end if

19:

$n \leftarrow n+1$

20:

end while

Algorithm 2 Choice of Center Point

Input:

$\boldsymbol {U} = \begin{bmatrix} \boldsymbol {U}_{\mathrm {up}}^{ \mathsf {T}} & \boldsymbol {U}_{\mathrm {lo}}^{ \mathsf {T}} \end{bmatrix}^{ \mathsf {T}} \in \mathrm {St} (p,N)$

Compute the singular value decomposition of $\boldsymbol {U}_{\mathrm {up}} = \boldsymbol {Q}_{1}\boldsymbol {\Sigma }\boldsymbol {Q}_{2}^{ \mathsf {T}}$ .

$\boldsymbol {S} \leftarrow \mathrm {diag} (\boldsymbol {Q}_{1}\boldsymbol {Q}_{2}^{ \mathsf {T}},\boldsymbol {I}_{N-p}) \in {\mathrm{ O}}(N)$ .

Output:

$\boldsymbol {S}$

Remark 2.2: [On ALCP strategy (Algorithm 1)]:

(Properties of ${\mathcal {N}}_{l}$ ) In Algorithm 1, the index set ${\mathcal {N}}_{l} \subset \mathbb {N}_{0}$ satisfies $\begin{align*} \begin{cases} \displaystyle n\in {\mathcal {N}}_{l} \Rightarrow \boldsymbol {V}_{n} \in Q_{N,p}(\boldsymbol {S}_{[l]}), \\ \displaystyle (l_{1}\neq l_{2})\quad \mathcal {N}_{l_{1}} \cap \mathcal {N}_{l_{2}} = \emptyset, \\ \displaystyle \mathcal {N}_{l+1} \neq \emptyset \Rightarrow \left |{{\mathcal {N}}_{l}}\right | = \max ({\mathcal {N}}_{l}) - \min ({\mathcal {N}}_{l}) + 1, \\ \displaystyle \mathcal {N}_{l+1} \neq \emptyset \Rightarrow \max ({\mathcal {N}}_{l})+1 = \min (\mathcal {N}_{l+1}), \end{cases} \tag{5}\end{align*}$ View Source where $\mathcal {N}_{l+1} \neq \emptyset$ implies that ${\mathcal {N}}_{l}$ is determined as a finite interval ${\mathcal {N}}_{l} = [\min ({\mathcal {N}}_{l}),\max (\mathcal {N}_{l+1})] \cap \mathbb {N}_{0}$ , i.e., $\begin{align*} {\mathcal {N}}_{l} = \begin{cases} \displaystyle [\min ({\mathcal {N}}_{l}), \max ({\mathcal {N}}_{l})] \cap \mathbb {N}_{0} & (\mathrm {if}~\max ({\mathcal {N}}_{l}) < +\infty) \\ \displaystyle [\min ({\mathcal {N}}_{l}), \infty) \cap \mathbb {N}_{0} & (\mathrm {otherwise}). \end{cases}\end{align*}$ View Source
(Singular-point issue possibly impacts NCP strategy6) Any singular point corresponds to a point at infinity in $Q_{N,p}(\boldsymbol {S})$ through $\Phi _{\boldsymbol {S}}^{-1}$ [33, Theorem 2.3 (d)]. Clearly, if a minimizer $\boldsymbol {U}^{\star } \in \mathrm {St} (p,N)$ in Problem 1.1 lies near the singular-point set $E_{N,p}(\boldsymbol {S})$ [28], [33], then $\left \lVert{ \Phi _{\boldsymbol {S}}(\boldsymbol {U}^{\star }) }\right \rVert _{2}$ becomes extremely large, which implies (i) many iterations are required for $\boldsymbol {V}_{n}$ to approach $\Phi _{\boldsymbol {S}}(\boldsymbol {U}^{\star })$ , and (ii) a direct application of any Euclidean optimization algorithm to $f\circ \Phi _{\boldsymbol {S}}^{-1}$ may induce slow convergence.
(Suppression of singular-point issue in NCP strategy by ALCP strategy) To suppress the undesired influence caused by $E_{N,p}(\boldsymbol {S}_{[l]})$ , the ALCP strategy (Algorithm 1) updates $\boldsymbol {S}_{[l]}$ adaptively so that the parametric expression $\Phi _{\boldsymbol {S}_{[l]}}(\boldsymbol {U}^{\star }) \in Q_{N,p}(\boldsymbol {S}_{[l]})$ of $\boldsymbol {U}^{\star }$ is not too distant from zero by using Algorithm 2 (see Remark 2.2 (d) and Remark 2.3).
(Detection of singular-point issue) We can detect the singular-point issue, in line 8 of Algorithm 1, by checking whether $\left \lVert{ \widetilde {\boldsymbol {V}}_{n+1} }\right \rVert _{2}$ exceeds a predetermined threshold $T > 0$ [42], where $\widetilde {\boldsymbol {V}}_{n+1}=\mathcal {A}^{\langle l \rangle }(\boldsymbol {V}_{n})$ is a tentative update to judge whether the parametric expression $\widetilde {\boldsymbol {V}}_{n+1}$ of $\boldsymbol {U}_{n+1}$ is too influenced, by $E_{N,p}(\boldsymbol {S}_{[l]})$ , to keep $\boldsymbol {S}_{[l]}$ or not. We can also use $\begin{equation*} \left \lVert{ \widetilde {\boldsymbol {V}}_{n+1} }\right \rVert _{\mathrm {block}}:= \left \lVert{ [\![\widetilde {\boldsymbol {V}}_{n+1} ]\!]_{11} }\right \rVert _{2} + \left \lVert{ [\![\widetilde {\boldsymbol {V}}_{n+1} ]\!]_{21} }\right \rVert _{2} > T \tag{6}\end{equation*}$ View Source in place of $\left \lVert{ \widetilde {\boldsymbol {V}}_{n+1} }\right \rVert _{2}$ because $\left \lVert{ \cdot }\right \rVert _{\mathrm {block}}$ is a more computationally efficient7 (equivalent) norm for $Q_{N,p}(\boldsymbol {S})$ [42].

Remark 2.3:

[Updating Center Points by Algorithm 2] To suppress the singular-point issue, a choice of $\boldsymbol {S}_{[l+1]} \in {\mathrm{ O}}(N)$ plays a crucial role in Algorithm 1. Algorithm 2 serves as a proper design of $\boldsymbol {S}_{[l+1]}$ for the latest estimate $\boldsymbol {U}_{n+1} \in \mathrm {St} (p,N)$ because $\left \lVert{ \Phi _{\boldsymbol {S}_{[l+1]}}(\boldsymbol {U}_{n+1}) }\right \rVert _{2} \leq 1$ and $\left \lVert{ \Phi _{\boldsymbol {S}_{[l+1]}}(\boldsymbol {U}_{n+1}) }\right \rVert _{\mathrm {block}} \leq 1$ are guaranteed [33, Theorem 2.7], implying $\Phi _{\boldsymbol {S}_{[l+1]}}(\boldsymbol {U}_{n+1})$ is not too distant from zero (see Remark 2.2 (c) for its effect on the singular-point issue). Moreover, the computational complexities for $\Phi _{\boldsymbol {S}_{[l+1]}}$ and $\Phi _{\boldsymbol {S}_{[l+1]}}^{-1}$ with $\boldsymbol {S}_{[l+1]}$ designed by Algorithm 2 are more efficient ( $\mathfrak {o}(Np^{2} + p^{3})$ flops) than those complexities ( $\mathfrak {o}(N^{2}p + p^{3})$ flops) with general center points [33, Sec. 2].

B. Nonconvex Optimization Over the Euclidean Space With Convergence Rates

Many Euclidean optimization algorithms have been designed to achieve better convergence. Especially in the machine learning area, convergence rates in terms of the gradient of the cost function are of great interest, e.g., [38], [39], [40], [43], [44], [45], and [41]. Here, we focus on the class of Nonconvex Optimization over the Euclidean space with convergence Rates (NOE-r) defined below (see also Example 2.5).

Definition 2.4 (NOE-r):

Assume that the cost function $J:\mathcal {X}\to \mathbb {R}$ defined over the Euclidean space $\mathcal {X}$ satisfies the following condition: $\begin{align*} \hspace {-1em} \begin{cases} \mathrm {(i)}& J~\mathrm {is~bounded~below.} \\ \mathrm {(ii)}& J~\mathrm {is~differentiable.} \\ \mathrm {(iii)} & \nabla J~\mathrm {is~Lipschitz~continuous~with~its~Lipschitz} \\ & \mathrm {constant}~L_{\nabla J}> 0,~\mathrm {i.e.,~for~all~} \boldsymbol {x}_{1},\boldsymbol {x}_{2} \in \mathcal {X}, \\ & \left \lVert{ \nabla J(\boldsymbol {x}_{1})-\nabla J(\boldsymbol {x}_{2}) }\right \rVert \leq L_{\nabla J} \left \lVert{ \boldsymbol {x}_{1} - \boldsymbol {x}_{2} }\right \rVert. \end{cases} \tag{7}\end{align*}$ View Source We say that an algorithm $\mathcal {A}$ belongs to NOE-r of $\mathfrak {o}(cn^{-\alpha })$ with $\alpha >0$ for the minimization of $J$ if the following hold:

(Update) Let $\mathcal {N}:= \mathbb {N}_{0}$ . Choose an initial guess $\boldsymbol {x}_{\min (\mathcal {N})} (=\boldsymbol {x}_{0}) \in \mathcal {X}$ , of a stationary point of $J$ , arbitrarily. Estimates $(\boldsymbol {x}_{n})_{n\in \mathcal {N}} \subset \mathcal {X}$ of a stationary point of $J$ are generated by $\begin{equation*} (n\in \mathcal {N}) \quad \mathcal {A}: \left ({\boldsymbol {x}_{n},\mathfrak {R}_{[\min (\mathcal {N}),n]}}\right) \mapsto \boldsymbol {x}_{n+1} \in \mathcal {X}\end{equation*}$ View Sourcewith a certain available record8 $\mathfrak {R}_{[\min (\mathcal {N}),n]}$ (see Example 2.5). For simplicity, we also use the notation $\mathcal {A}(\boldsymbol {x}_{n})$ for a single update from $\boldsymbol {x}_{n}$ .
(Convergence rate in terms of gradient sequence) The algorithm $\mathcal {A}$ has a convergence rate $\mathfrak {o}(cn^{-\alpha })$ in terms of gradient sequence, which means in this paper (see also [38, p.28]), for any cost function $J$ satisfying (7), the algorithm $\mathcal {A}$ satisfies $\begin{align*} & (\exists \alpha > 0, \exists c > 0, \forall n \in \mathcal {N}) \\ & \quad \min _{k \in \mathcal {N}, k \leq n} \left \lVert{ \nabla J(\boldsymbol {x}_{k}) }\right \rVert \leq \frac {c}{(n-\min (\mathcal {N})+1)^{\alpha }}. \tag{8}\end{align*}$ View Source

Example 2.5 (Selected Algorithms in NOE-r):

Many Euclidean optimization algorithms belong to NOE-r of $\mathfrak {o}(cn^{-1/2})$ for some $c > 0$ , e.g., the gradient descent method [38], Nesterov-type accelerated gradient methods [39], [40], [41], and adaptive momentum optimization algorithms [43], [44], [45]. Here, we introduce some methods below.

(Gradient descent method) Under the condition (7) for $J$ , a gradient descent method updates $\begin{equation*} \mathcal {A}_{\mathrm {GDM}}:\boldsymbol {x}_{n} \mapsto \boldsymbol {x}_{n+1}:=\boldsymbol {x}_{n} - \gamma _{n} \nabla f(\boldsymbol {x}_{n})\end{equation*}$ View Source with $\mathfrak {R}_{[\min (\mathcal {N}), n]}:= (\gamma _{n}, \nabla f(\boldsymbol {x}_{n}))$ , where a stepsize $\gamma _{n}>0$ satisfies, e.g., the Armijo condition (see, e.g., [34]). Then, $\mathcal {A}_{\mathrm {GDM}}$ has a convergence rate $\mathfrak {o}(L_{\nabla J}^{1/2}\Delta _{J}^{1/2}n^{-1/2})$ (see, e.g., [38], [41]), where $\Delta _{J}:= J(\boldsymbol {x}_{0}) - \inf _{\boldsymbol {x}\in \mathcal {X}}J(\boldsymbol {x})$ .
(Nesterov-type accelerated gradient method (NAG)) The NAG [37], [38] was originally proposed for a convex function $J$ satisfying (7), where the NAG achieves the optimal convergence rate in terms of function sequence. The NAG has been extended to nonconvex $J$ satisfying (7) with convergence rate $\mathfrak {o}(cn^{-1/2})$ but in terms of gradient sequence (see (8)) [39], [40], [41]. The following briefly summarizes a special instance of extended NAGs, say rNAG, with a restarting scheme [40, Theorem 4.1]. The rNAG updates $\begin{align*}\mathcal {A}_{\mathrm {rNAG}}:\boldsymbol {x}_{n} \mapsto \boldsymbol {x}_{n+1}:= \begin{cases} \displaystyle \boldsymbol {y}_{n} - \gamma _{n}\nabla J(\boldsymbol {y}_{n}) & (\mathrm {if}~(10)\mathrm {~holds}) \\ \displaystyle \boldsymbol {x}_{n} & (\mathrm {otherwise}) \end{cases} \tag{9}\end{align*}$ View Source with $\mathfrak {R}_{[\min (\mathcal {N}),n]}:= (\boldsymbol {y}_{n},\gamma _{n},\nabla J(\boldsymbol {y}_{n}))$ based on the test whether the value $J(\boldsymbol {y}_{n} - \gamma _{n}\nabla J(\boldsymbol {y}_{n}))$ sufficiently decreases, i.e., $\begin{equation*} J(\boldsymbol {y}_{n} - \gamma _{n}\nabla J(\boldsymbol {y}_{n})) \leq J(\boldsymbol {x}_{n}) - c_{R} \gamma _{n} \left \lVert{ \nabla J(\boldsymbol {y}_{n}) }\right \rVert ^{2}, \tag{10}\end{equation*}$ View Source or not, where $\begin{align*} \boldsymbol {y}_{n}:= \begin{cases} \displaystyle \boldsymbol {x}_{n} & (\mathrm {if}~n = \min (\mathcal {N})) \\ \displaystyle \boldsymbol {x}_{n} + \beta _{n}(\boldsymbol {x}_{n} - \boldsymbol {x}_{n-1}) & (\mathrm {if}~n > \min (\mathcal {N})), \end{cases} \tag{11}\end{align*}$ View Source is designed with $\beta _{n}:= \frac {n-\min (\mathcal {N})}{n+3-\min (\mathcal {N})}$ , $c_{R} > 0$ is a predetermined small constant, and $\gamma _{n} > 0$ is a stepsize designed to ensure a convergence rate9 $\mathfrak {o}(cn^{-1/2})$ . More precisely, each stepsize $\gamma _{n}$ is chosen to satisfy, with a predetermined $\rho \in (0,1]$ , $\begin{align*} {\begin{array}{cccc}& \frac {\rho }{L_{\nabla J}} \leq \gamma _{n} \leq \gamma _{n-1}~(\mathrm {for}~n > \min (\mathcal {N}))~\mathrm {and} \\ & J(\boldsymbol {y}_{n} - \gamma _{n}\nabla J(\boldsymbol {y}_{n})) \leq J(\boldsymbol {y}_{n}) - \frac {\gamma _{n}}{2} \left \lVert{ \nabla J(\boldsymbol {y}_{n}) }\right \rVert ^{2}. \end{array}}\biggl \} \tag{12}\end{align*}$ View Source By noting the latter condition in (12) is achieved automatically by $\gamma _{n} \leq L_{\nabla J}^{-1}$ , such a $\gamma _{n}$ satisfying (12) can be given as the output of Algorithm 3 (backtracking algorithm, see, e.g., [34] for its key idea), by setting in Algorithm 3 $\begin{align*} \gamma _{\mathrm {initial}}:= \begin{cases} \displaystyle \gamma ' & (\mathrm {if}~n=\min (\mathcal {N})) \\ \displaystyle \gamma _{n-1} & (\mathrm {if}~n > \min (\mathcal {N})), \end{cases}\end{align*}$ View Source $\boldsymbol {x}:= \boldsymbol {x}_{n}$ , and $\boldsymbol {y}:=\boldsymbol {y}_{n}$ , where $\rho \in (0,1)$ and $\gamma ' > L_{\nabla J}^{-1}$ are predetermined arbitrarily. Remark that if the output $\gamma _{n}$ of Algorithm 3 and $(\boldsymbol {x}_{n},\boldsymbol {y}_{n})$ do not satisfy (10), then (i) $\boldsymbol {x}_{n+1} = \boldsymbol {x}_{n}$ by (9); (ii) $\boldsymbol {y}_{n+1} = \boldsymbol {x}_{n+1} + \beta _{n+1}(\boldsymbol {x}_{n+1}-\boldsymbol {x}_{n}) = \boldsymbol {x}_{n+1}$ by (11), therefore the rNAG restarts from $\boldsymbol {y}_{n+1} = \boldsymbol {x}_{n+1} = \boldsymbol {x}_{n}$ .

Algorithm 3 Backtracking Algorithm

Input:

$\gamma _{\mathrm {initial}} > 0, ~\boldsymbol {x},\boldsymbol {y} \in \mathcal {X}$

$\gamma \leftarrow \gamma _{\mathrm {initial}}$

while $\gamma > L_{\nabla J}^{-1}$ and

$J(\boldsymbol {y} - \gamma \nabla J(\boldsymbol {y})) > J(\boldsymbol {y}) - \frac {\gamma }{2} \left \lVert{ \nabla J(\boldsymbol {y}) }\right \rVert ^{2}$ and

$J(\boldsymbol {y} - \gamma \nabla J(\boldsymbol {y})) > J(\boldsymbol {x}) - c_{R}\gamma \left \lVert{ \nabla J(\boldsymbol {y}) }\right \rVert ^{2}$ do

$\gamma \leftarrow \rho \gamma$

end while

Output:

$\gamma$

SECTION III.

Convergence Rate Analysis

Recall that (i) $\boldsymbol {S}_{[l]} \in {\mathrm{ O}}(N)$ is employed as the $l$ th center point of $\Phi _{\boldsymbol {S}_{[l]}}^{-1}$ in Algorithm 1, (ii) ${\mathcal {N}}_{l} \subset \mathbb {N}_{0}$ is the index interval where $\boldsymbol {V}_{n} \in Q_{N,p}(\boldsymbol {S}_{[l]})~(n\in {\mathcal {N}}_{l})$ are produced (see also (5)), (iii) $\mathcal {A}^{\langle l \rangle }$ is a Euclidean optimization algorithm, which updates $\boldsymbol {V}_{n} \in Q_{N,p}(\boldsymbol {S}_{[l]})~(n\in {\mathcal {N}}_{l})$ for the minimization of $f_{\boldsymbol {S}_{[l]}}:=f\circ \Phi _{\boldsymbol {S}_{[l]}}^{-1}$ .

In this section, we present a convergence rate analysis of Algorithm 1 under the following condition (Condition 3.1). More precisely, for Algorithm 1, we first present a center point-wise convergence analysis in Lemma 3.2, then an overall convergence analysis in Theorem 3.3.

Condition 3.1 (For Convergence Analysis ofAlgorithm 1):

In Algorithm 1, $\left \lVert{ \nabla f_{\boldsymbol {S}_{[l]}}(\boldsymbol {V}_{n}) }\right \rVert _{F}$ does not achieve zero for any $n \in \mathbb {N}_{0}$ , implying that either (i) or (ii) below holds true.

For some $l_{\max } \in \mathbb {N}_{0}$ , $\mathcal {N}_{l_{\max }} = \bigl [\min (\mathcal {N}_{l_{\max }}), \infty \bigr) \cap \mathbb {N}_{0}$ and ${\mathcal {N}}_{l} = \emptyset ~(\forall l > l_{\max })$ . In this case, we use the index set $\Lambda _{\mathrm{ c}}:= \{0,1,\ldots,l_{\max }\}$ of chosen center points during the process of Algorithm 1.
For all $l \in \mathbb {N}_{0}$ , ${\mathcal {N}}_{l} = [\min ({\mathcal {N}}_{l}), \max ({\mathcal {N}}_{l})] \cap \mathbb {N}_{0}$ and $\max ({\mathcal {N}}_{l}) + 1 = \min (\mathcal {N}_{l+1})$ . In this case, we use the index set $\Lambda _{\mathrm{ c}}:= \mathbb {N}_{0}$ of chosen center points during the process of Algorithm 1.

Note that, in both cases, a wide-sense monotone increasing function

$\begin{equation*} \ell :\mathbb {N}_{0} \to \mathbb {N}_{0}:n\mapsto l,~\mathrm {s.t.}~n\in {\mathcal {N}}_{l}\end{equation*}$

View Source

is well-defined, and thus

$\boldsymbol {V}_{n} \in Q_{N,p}(\boldsymbol {S}_{[\ell (n)] })$

and

$\boldsymbol {U}_{n}:= \Phi _{\boldsymbol {S}_{[\ell (n)] }}^{-1}(\boldsymbol {V}_{n}) \in \mathrm {St} (p,N)$

are well-defined for all

$n\in \mathbb {N}_{0}$

.

Lemma 3.2:

Assume that (i) $f:\mathbb {R}^{N\times p} \to \mathbb {R}$ is differentiable, and (ii) $\nabla f$ is Lipschitz continuous with a Lipschitz constant $L_{\nabla f} > 0$ (see (7)). Consider the situation where Condition 3.1 holds true. Then, the following hold:

For all $l\in \Lambda _{\mathrm{ c}}$ , the condition (7) is achieved when $J$ and $\mathcal {X}$ are given respectively by $f_{\boldsymbol {S}_{[l]}}:=f\circ \Phi _{\boldsymbol {S}_{[l]}}^{-1}$ and $Q_{N,p}(\boldsymbol {S}_{[l]})$ .
(Center point-wise convergence analysis) Suppose that, for some $c > 0$ and all $l \in \Lambda _{\mathrm{ c}}$ , $\mathcal {A}^{\langle l \rangle }$ belongs to NOE-r of $\mathfrak {o}(cn^{-1/2})$ in the sense of Definition 2.4 (b). Let $(\boldsymbol {V}_{n})_{n \in \bigcup _{l\in \Lambda _{\mathrm{ c}}} {\mathcal {N}}_{l}}$ be generated by Algorithm 1 incorporating $\mathcal {A}^{\langle l \rangle }~(l\in \Lambda _{\mathrm{ c}})$ . Then, we have $\begin{align*} (l\in \Lambda _{\mathrm{ c}}, n \in {\mathcal {N}}_{l}) & \min _{k \in {\mathcal {N}}_{l}, k \leq n} \left \lVert{ \nabla f_{\boldsymbol {S}_{[l]}}(\boldsymbol {V}_{k}) }\right \rVert _{F} \\ & \quad \leq \frac {c}{(n-\min ({\mathcal {N}}_{l})+1)^{1/2}} \tag{13}\end{align*}$ View Source and $\begin{equation*} (n\in \mathbb {N}_{0}) ~\min _{k=0,1\ldots,n} \left \lVert{ \nabla f_{\boldsymbol {S}_{[\ell (k)] }}(\boldsymbol {V}_{k}) }\right \rVert _{F} \leq \sqrt {\frac {\ell (n)+1}{n+1}}c. \tag{14}\end{equation*}$ View Source

Proof:

The image of $f$ of $\mathrm {St}(p,N)$ , say $f(\mathrm {St}(p,N)) \subset \mathbb {R}$ , is bounded below by the compactness of $\mathrm {St}(p,N)$ and the continuity of $f$ . From $\Phi _{\boldsymbol {S}}^{-1}(Q_{N,p}(\boldsymbol {S})) \subset \mathrm {St} (p,N)$ , $f_{\boldsymbol {S}_{[l]}}(Q_{N,p}(\boldsymbol {S}_{[l]})) \subset f(\mathrm {St}(p,N))$ is also bounded below, where $f_{\boldsymbol {S}_{[l]}}(Q_{N,p}(\boldsymbol {S}_{[l]}))$ is the image of $f_{\boldsymbol {S}_{[l]}}$ of $Q_{N,p}(\boldsymbol {S}_{[l]})$ . Regarding the condition (7), see Fact 1.1 for the differentiability of $f_{\boldsymbol {S}}$ and the Lipschitz continuity of $\nabla f_{\boldsymbol {S}_{[l]}}$ .
(13) is clear from (a) and the assumption on $\mathcal {A}^{\langle l \rangle }$ (see also (8)).
(Proof of (14)) Clearly, for any $n\in \mathbb {N}_{0}$ , we have $\begin{align*} &\hspace {-.5pc} \sum _{l=0}^{\ell (n)-1} \left |{{\mathcal {N}}_{l}}\right |\min _{k\in {\mathcal {N}}_{l}} \left \lVert{ \nabla f_{\boldsymbol {S}_{[l]}}(\boldsymbol {V}_{k}) }\right \rVert _{F}^{2} \\ & \quad + (n-\min (\mathcal {N}_{\ell (n)}) + 1)\min _{\min (\mathcal {N}_{\ell (n)})\leq k \leq n} \left \lVert{ \nabla f_{\boldsymbol {S}_{[\ell (n)] }}(\boldsymbol {V}_{k}) }\right \rVert _{F}^{2} \\ & \geq \left ({n-\min (\mathcal {N}_{\ell (n)}) + 1 + \sum _{l=0}^{\ell (n)-1} \left |{{\mathcal {N}}_{l}}\right |}\right) \\ & \hspace {9em} \times \min _{k=0,1,\ldots,n} \left \lVert{ \nabla f_{\boldsymbol {S}_{[\ell (k)] }}(\boldsymbol {V}_{k}) }\right \rVert _{F}^{2} \\ & = (n+1)\min _{k=0,1,\ldots,n} \left \lVert{ \nabla f_{\boldsymbol {S}_{[\ell (k)] }}(\boldsymbol {V}_{k}) }\right \rVert _{F}^{2}, \tag{15}\end{align*}$ View Sourcewhere the last equality is verified by (5) and $\begin{align*} &\hspace {-1pc} \sum _{l=0}^{\ell (n)-1} \left |{{\mathcal {N}}_{l}}\right | = \sum _{l=0}^{\ell (n)-1}(\max ({\mathcal {N}}_{l})- \min ({\mathcal {N}}_{l}) + 1) \\ & = \sum _{l=0}^{\ell (n)-1}(\min (\mathcal {N}_{l+1})- \min ({\mathcal {N}}_{l})) \\ & = \min (\mathcal {N}_{\ell (n)})- \min (\mathcal {N}_{0}) = \min (\mathcal {N}_{\ell (n)}) \quad (\because \min (\mathcal {N}_{0}) = 0).\end{align*}$ View Source

Moreover, by (13), we have $\begin{equation*} (0 \leq l \leq \ell (n)-1) \quad \left |{{\mathcal {N}}_{l}}\right |\min _{k\in {\mathcal {N}}_{l}} \left \lVert{ \nabla f_{\boldsymbol {S}_{[l]}}(\boldsymbol {V}_{k}) }\right \rVert _{F}^{2} \leq c^{2},\end{equation*}$ View Source and $\begin{equation*} (n+1-\min (\mathcal {N}_{\ell (n)}))\min _{\min (\mathcal {N}_{\ell (n)})\leq k \leq n} \left \lVert{ \nabla f_{\boldsymbol {S}_{[\ell (k)] }}(\boldsymbol {V}_{n}) }\right \rVert _{F}^{2} \leq c^{2}.\end{equation*}$ View Source Finally, by substituting these relations to the LHS of (15), we obtain $\begin{equation*} \min _{k=0,1\ldots,n} \left \lVert{ \nabla f_{\boldsymbol {S}_{[\ell (k)] }}(\boldsymbol {V}_{k}) }\right \rVert _{F}^{2} \leq \frac {\ell (n)+1}{n+1}c^{2}.\end{equation*}$ View Source

From (14) in Lemma 3.2 (b), to evaluate the speed of convergence of $\min _{k=0,1\ldots,n} \left \lVert{ \nabla f_{\boldsymbol {S}_{[\ell (k)] }}(\boldsymbol {V}_{k}) }\right \rVert _{F}$ to zero, it is sufficient to show that the RHS of (14) has the following upper bound with some $\widehat {\alpha } > 0$ and some $\widehat {c} > 0$ : $\begin{equation*} (n\in \mathbb {N}_{0}) \quad \frac {\ell (n) + 1}{n+1} \leq \frac {\widehat {c}}{(n+1)^{\widehat {\alpha }}}. \tag{16}\end{equation*}$ View Source To guarantee the existence of such $\widehat {\alpha } > 0$ and $\widehat {c} > 0$ , we modify10 the restarting condition (6) by newly introducing a polynomial lower bound $\eta (l)$ of the usage count of $\boldsymbol {S}_{[l]}$ as $\begin{align*} \left.{\begin{array}{ccccccccc} & \left \lVert{ \widetilde {\boldsymbol {V}}_{n+1} }\right \rVert _{\mathrm {block}} > T ~\mathrm {and} \\ & \quad n - \min ({\mathcal {N}}_{l}) + 1 \geq \eta (l):= \left \lfloor{ \frac {l}{\tau } }\right \rfloor ^{\theta }, \end{array}}\right \} \tag{17}\end{align*}$ View Source where (i) $T > 0$ is a predetermined threshold for the detection of the singular-point issue (see Remark 2.2 (d)), (ii) $\theta \in \mathbb {N}$ is a tunable polynomial order of $\eta (\cdot)$ , and (iii) $\tau \in \mathbb {N}$ is a tunable invariant period of $\eta (\cdot)$ . Thanks to the latter condition in (17), we ensure $\begin{equation*} \hspace {-1.5em} \mathcal {N}_{l+1} \neq \emptyset \Rightarrow \left |{{\mathcal {N}}_{l}}\right | \overset {(5)}{=} \max ({\mathcal {N}}_{l}) - \min ({\mathcal {N}}_{l}) + 1\geq \eta (l), \tag{18}\end{equation*}$ View Source

Finally, we present the following overall convergence analysis of Algorithm 1.

Theorem 3.3 (Overall Convergence Analysis):

Assume that (i) $f:\mathbb {R}^{N\times p} \to \mathbb {R}$ is differentiable, and (ii) $\nabla f$ is Lipschitz continuous with a Lipschitz constant $L_{\nabla f} > 0$ . Consider the situation where Condition 3.1 holds true. Suppose that, for some $c > 0$ and all $l \in \Lambda _{\mathrm{ c}}$ , $\mathcal {A}^{\langle l \rangle }$ belongs to NOE-r of $\mathfrak {o}(cn^{-1/2})$ in the sense of Definition 2.4 (b). Let $(\boldsymbol {V}_{n})_{n \in \bigcup _{l\in \Lambda _{\mathrm{ c}}} {\mathcal {N}}_{l}}$ be generated by Algorithm 1 incorporating $\mathcal {A}^{\langle l \rangle }~(l\in \Lambda _{\mathrm{ c}})$ and (17) with some $T > 0$ , and $(\theta,\tau) \in \mathbb {N}^{2}$ as the restarting condition in line 8 of Algorithm 1. Then, we have $\begin{equation*} (n\in \mathbb {N}_{0}) ~\min _{k=0,1,\ldots,n} \left \lVert{ \nabla f_{\boldsymbol {S}_{[\ell (k)] }}(\boldsymbol {V}_{k}) }\right \rVert _{F} \leq \frac {\sqrt {(\sqrt {3}+3)\tau }c}{(n+1)^{\frac {\theta }{2(\theta +1)}}} \tag{19}\end{equation*}$ View Source and11 $\begin{align*} &\hspace {-.5pc} \liminf _{n\to \infty } \left \lVert{ \nabla f_{\boldsymbol {S}_{[\ell (n)] }}(\boldsymbol {V}_{n}) }\right \rVert _{F} \\ &:= \lim _{n\to \infty }\inf \underbrace {\left \{{ \left \lVert{ \nabla f_{\boldsymbol {S}_{[\ell (k)] }}(\boldsymbol {V}_{k}) }\right \rVert _{F}\mid k \geq n}\right \}}_{=:\mathfrak {G}_{n}} = 0. \tag{20}\end{align*}$ View Source

Proof:

See Appendix B.

Remark 3.4 (Interpretation of Theorem 3.3):

As stated just after Problem 1.1, a realistic goal of Problem 1.1 is to find its stationary point $\boldsymbol {U}^{\star } \in \mathrm {St} (p,N)$ . More precisely, according to Fact 2.1 (b), it is desired to approximate $(\boldsymbol {V}^{\star }, \boldsymbol {S}^{\star }) \in Q_{N,p}\times {\mathrm{ O}}(N)$ satisfying $\nabla f_{\boldsymbol {S}^{\star }}(\boldsymbol {V}^{\star }) = \boldsymbol {0}$ with a certain sequence $(\boldsymbol {V}_{n},\boldsymbol {S}_{[\ell (n)] })_{n=0}^{\infty } \subset Q_{N,p}\times {\mathrm{ O}}(N)$ , e.g., by Algorithm 1. Clearly, the RHS in (19) converges to zero as $n\to \infty$ . This upper bound of $\min _{k=0,1,\ldots,n} \left \lVert{ \nabla f_{\boldsymbol {S}_{[\ell (k)] }}(\boldsymbol {V}_{k}) }\right \rVert _{F}$ tells us how fast the gradient approaches zero.
The evaluations shown in (19) and (20) can also apply to the NCP strategy (incorporating an algorithm of NOE-r as $\mathcal {A}^{\langle 0 \rangle }$ ) as a special case of the ALCP strategy (Algorithm 1) by setting a sufficiently large $T > 0$ in (17).

SECTION IV.

Numerical Experiments

In the experiments, we demonstrate the numerical performance of Algorithm 1 incorporating the restarting condition (17) with $T:= 1.5$ and $\tau,\theta \in \{1, 10\}$ . We employed $\mathcal {A}_{\mathrm {rNAG}}$ in (9) as $\mathcal {A}^{\langle l \rangle }~(l\in \Lambda _{\mathrm{ c}})$ in Algorithm 1, say rNAG+ACLP, where we used $\rho := 0.5$ , $c_{R}:= 2^{-13}$ , and $\gamma ':=1$ in rNAG and Algorithm 3. We compared rNAG+ALCP with (i) the NCP strategy (say rNAG+NCP) incorporating $\mathcal {A}_{\mathrm {rNAG}}$ (see also Remark 3.4 (b)), and (ii) the corresponding retraction-based strategy (say rNAG+Retr) incorporating $\mathcal {A}_{\mathrm {rNAG}}$ [40, Algorithm 4.1] (for retraction-based strategies, see [25] and the last part of Section I). Both algorithms were terminated at the $n$ th iterate, achieving $\begin{align*} \begin{cases} \displaystyle n=1000, \\ \displaystyle \quad \mathrm {or} \\ \displaystyle \frac { \left \lVert{ \boldsymbol {G}_{n} }\right \rVert _{F}}{ \left \lVert{ \boldsymbol {G}_{0} }\right \rVert _{F}} < 10^{-6}, \\ \displaystyle \quad \mathrm {or} \\ \displaystyle \left |{f(\boldsymbol {U}_{n})-f(\boldsymbol {U}_{n-1})}\right | + \left |{f(\boldsymbol {U}_{n-1})-f(\boldsymbol {U}_{n-2})}\right | < 10^{-6}, \end{cases} \tag{21}\end{align*}$ View Source where $\begin{align*} \boldsymbol {G}_{n}:= \begin{cases} \displaystyle \nabla (f\circ \Phi _{\boldsymbol {S}_{[\ell (n)] }}^{-1})(\boldsymbol {V}_{n}) & \textrm {(rNAG+ALCP, rNAG+NCP)} \\ \displaystyle \mathop {\mathrm {grad}}f(\boldsymbol {U}_{n}) & \textrm {(rNAG+Retr)}. \end{cases}\end{align*}$ View Source $\mathrm {grad} f(\boldsymbol {U})$ is the Riemannian gradient of $f$ at $\boldsymbol {U} \in \mathrm {St} (p,N)$ .12

All the experiments were performed in MATLAB on MacBook Pro (13)-inch, M1, 2020) with 16 GB of RAM. To evaluate the numerical performance of rNAG+ALCP, we considered the following standard test problems (i) linear eigenvalue problem, e.g., [25]; (ii) nonlinear eigenvalue problem, e.g., [21]. For each problem, we generated an initial guess $\boldsymbol {U}_{0}\in \mathrm {St} (p,N)$ randomly by MATLAB code ‘orth(rand(N,p))’ for 10 trials.

Linear eigenvalue problem: For a given symmetric matrix $\boldsymbol {A} \in \mathbb {R}^{N\times N}$ , we considered $\begin{equation*} \mathop {\mathrm {minimize}}_{\boldsymbol {U}\in \mathrm {St} (p,N)} f_{1}(\boldsymbol {U}):=- \mathrm {Tr}(\boldsymbol {U}^{ \mathsf {T}}\boldsymbol {A}\boldsymbol {U}). \tag{22}\end{equation*}$ View Source Any global minimizer $\boldsymbol {U}^{\star }$ of the problem (22) is an orthonormal eigenbasis associated with the $p$ largest eigenvalues of $\boldsymbol {A}$ [25]. In our experiment, we used $\boldsymbol {A}:=\widetilde {\boldsymbol {A}}^{ \mathsf {T}}\widetilde {\boldsymbol {A}} \in \mathbb {R}^{N\times N}$ with randomly chosen $\widetilde {\boldsymbol {A}} \in \mathbb {R}^{N\times N}$ of which each entry is sampled by the standard normal distribution.
Nonlinear eigenvalue problem: By following [19], [20] and [21], we considered $\begin{equation*} \mathrm {find}~\boldsymbol {U}^{\star } \in \mathrm {St} (p,N)~\mathrm {s.t.}~H(\boldsymbol {U}^{\star })\boldsymbol {U}^{\star } = \boldsymbol {U}^{\star }\boldsymbol {D} \tag{23}\end{equation*}$ View Source with some diagonal matrix $\boldsymbol {D} \in \mathbb {R}^{N\times N}$ , where $H(\boldsymbol {U}):= \boldsymbol {L}+ \zeta \mathrm {diag}(\boldsymbol {L}^{-1}\psi (\boldsymbol {U})) \in \mathbb {R}^{N\times N}$ is given with a symmetric Laplacian matrix $\boldsymbol {L} \in \mathbb {R}^{N\times N}$ , $\psi (\boldsymbol {U}):=\mathrm {Diag}(\boldsymbol {U}\boldsymbol {U}^{ \mathsf {T}}) \in \mathbb {R}^{N}$ , $\zeta > 0$ , and $\mathrm {Diag}:\mathbb {R}^{N\times N}\to \mathbb {R}^{N}:\boldsymbol {X}\mapsto [[\boldsymbol {X}]_{11}~[\boldsymbol {X}]_{22}~\cdots ~[\boldsymbol {X}]_{NN}]$ . The solution of (23) can be characterized as a global minimizer of (see, e.g., [20]) $\begin{equation*} \mathop {\mathrm {minimize}}_{\boldsymbol {U}\in \mathrm {St} (p,N)} f_{2}(\boldsymbol {U}):=\frac {1}{2} \mathrm {Tr}(\boldsymbol {U}^{ \mathsf {T}}\boldsymbol {L}\boldsymbol {U}) + \frac {\zeta }{4}\psi (\boldsymbol {U})^{ \mathsf {T}}\boldsymbol {L}^{-1}\psi (\boldsymbol {U}). \tag{24}\end{equation*}$ View Source In our experiment, we used $\zeta = 3$ and $\boldsymbol {L}$ as one-dimensional discrete Laplacian with 2 on the diagonal and −1 on the sub- and super- diagonals by following [21].

Tables 1 and 2 summarize respectively the average results for problems (22) and (24) over 10 trials with $N\in \{1000,2000\}$ and $p\in \{1, 10\}$ , where for each output $\boldsymbol {U}^{\diamond } \in \mathrm {St} (p,N)$ of these algorithms, ‘fval-optimal’ means the value $f(\boldsymbol {U}^{\diamond }) - f(\boldsymbol {U}^{\star })$ , ‘fval’ means the value $f(\boldsymbol {U}^{\diamond })$ , ‘feasi’ means the feasibility $\left \lVert{ \boldsymbol {I}_{p}-\boldsymbol {U}^{\diamond \mathsf {T} }\boldsymbol {U}^{\diamond } }\right \rVert _{F}$ , ‘nrmg’ means the norm $\left \lVert{ \mathrm {grad} f(\boldsymbol {U}^{\diamond }) }\right \rVert _{F}$ , ‘itr’ means the number of iterations, ‘time’ means the CPU time (s), and ‘restart’ means the number of updating center points for rNAG+ALCP. Figures 1 and 2 demonstrate the convergence histories for problems (22) and (24) respectively, where the plots show CPU time on the horizontal axis versus the value of the cost function on the vertical axis. The following remark summarizes the main points found in Tables 1–2 and Figures 1–2.

TABLE 1 Performance of Each Algorithm Applied to the Problem (22)

TABLE II Performance of Each Algorithm Applied to the Problem (24)

$FIGURE 1. - Convergence histories of each algorithm applied to the problem (22) regarding the value $f(\boldsymbol{U})-f(\boldsymbol{U}^{\star})$ at CPU time for each problem size, where $\boldsymbol{U}^{\star}\in\mathrm{St} (p,N)$ was obtained by the eigenvalue decomposition of $\boldsymbol{A}$ . Markers are put at every 15 iterations.$

FIGURE 1.

Convergence histories of each algorithm applied to the problem (22) regarding the value $f(\boldsymbol{U})-f(\boldsymbol{U}^{\star})$ at CPU time for each problem size, where $\boldsymbol{U}^{\star}\in\mathrm{St} (p,N)$ was obtained by the eigenvalue decomposition of $\boldsymbol{A}$ . Markers are put at every 15 iterations.

Show All

$FIGURE 2. - Convergence histories of each algorithm applied to the problem (24) regarding the value $f(\boldsymbol{U})$ at CPU time for each problem size. Markers are put at every 15 iterations.$

FIGURE 2.

Convergence histories of each algorithm applied to the problem (24) regarding the value $f(\boldsymbol{U})$ at CPU time for each problem size. Markers are put at every 15 iterations.

Show All

Remark 4.1 (Main Findings From Numerical Experiments):

(Comparison of rNAG+ALCP and rNAG+NCP) As expected in the design concept of the ALCP strategy (see the paragraphs just before and after Problem 1.3, and Remark 2.2 (c)), Figures 1 and 2 show that rNAG+ALCP improves dramatically the convergence speed of rNAG+NCP.
(Comparison of rNAG+Retr and rNAG+ALCP) From a view of the number of iterations in Tables 1 and 2, the performance of rNAG+ALCP is comparable to that of rNAG+Retr. From Figures 1 and 2, we found that the proposed rNAG+ALCP for all $\tau,\theta \in \{1, 10\}$ outperforms rNAG+Retr in a view of CPU time.
(Robustness against choices of $\tau,\theta$ used in (17)) From Figures 1 and 2, the numerical performance of rNAG+ALCP is robust against chosen $\tau, \theta$ . Moreover, from Tables 1 and 2, for rNAG+ALCP with $(\tau,\theta) \neq (1,10)$ , we found that all results are the same except for “time”. This is mainly because, for rNAG+ALCP with $(\tau,\theta) \neq (1,10)$ , (i) $\eta (l) \leq 1$ holds during the process of algorithm due to the small number of restarting (see (17) and “restart”); (ii) the latter condition $n - \min ({\mathcal {N}}_{l}) + 1 \geq \eta (l)$ in (17) always holds true; (iii) every center point was updated at the same iteration.

SECTION V.

Conclusion

We presented a convergence rate analysis of the adaptive localized Cayley parametrization (ALCP) strategy for optimization over the Stiefel manifold with a new restarting condition. In the ALCP strategy, an optimization problem over the Stiefel manifold is translated into optimization problems over the Euclidean space with the tunable generalized left-localized Cayley transform. The proposed analysis provides a convergence rate, in terms of gradient sequence, for the ALCP strategy. The numerical experiments demonstrate the effectiveness of the ALCP strategy compared with the retraction-based strategy and the naive Cayley parametrization strategy.

Appendix A
Gradient of Function After the Cayley Parametrization

We briefly review the gradient of $f\circ \Phi _{\boldsymbol {S}}^{-1}$ in a case where the standard inner product ${\langle \boldsymbol {V}_{1},\boldsymbol {V}_{2} \rangle } = \mathrm {Tr}(\boldsymbol {V}_{1}^{ \mathsf {T}}\boldsymbol {V}_{2})~(\boldsymbol {V}_{1},\boldsymbol {V}_{2}\in Q_{N,p}(\boldsymbol {S}))$ is equipped in $Q_{N,p}(\boldsymbol {S})$ .

Fact 1.1 (Gradient after the Cayley Parametrization):

For a differentiable function $f:\mathbb {R}^{N\times p}\to \mathbb {R}$ and $\boldsymbol {S} \in {\mathrm{ O}}(N)$ , the following hold with $f_{\boldsymbol {S}}:=f\circ \Phi _{\boldsymbol {S}}^{-1}:Q_{N,p}(\boldsymbol {S}) \to \mathbb {R}$ :

([33, Proposition 2.9, Remark 2.11]) $f_{\boldsymbol {S}}$ is differentiable, and the computational complexity of $f_{\boldsymbol {S}}$ with $\boldsymbol {S}$ designed by Algorithm 2 is $\mathfrak {o}(Np^{2}+p^{3})$ flops.
([33, Proposition A.11]) Assume that the Lipschitz continuity of $\nabla f$ with its Lipschitz constant $L_{\nabla f} > 0$ as $\begin{align*} & (\forall \boldsymbol {U}_{1},\boldsymbol {U}_{2} \in \mathrm {St} (p,N)) \\ & \quad \left \lVert{ \nabla f(\boldsymbol {U}_{1})-\nabla f(\boldsymbol {U}_{2}) }\right \rVert _{F} \leq L_{\nabla f} \left \lVert{ \boldsymbol {U}_{1} - \boldsymbol {U}_{2} }\right \rVert _{F}.\end{align*}$ View Source Then, the gradient $\nabla f_{\boldsymbol {S}}$ is Lipschitz continuous with its Lipschitz constant $4(\mu + L_{\nabla f})$ over $Q_{N,p}(\boldsymbol {S})$ as $\begin{align*} & (\forall \boldsymbol {V}_{1},\boldsymbol {V}_{2} \in Q_{N,p}(\boldsymbol {S})) \\ & \quad \left \lVert{ \nabla f_{\boldsymbol {S}}(\boldsymbol {V}_{1}) - \nabla f_{\boldsymbol {S}}(\boldsymbol {V}_{2}) }\right \rVert _{F} \leq 4(\mu + L_{\nabla f}) \left \lVert{ \boldsymbol {V}_{1}-\boldsymbol {V}_{2} }\right \rVert _{F},\end{align*}$ View Source where $\mu \geq \max _{\boldsymbol {U}\in \mathrm {St} (p,N)} \left \lVert{ \nabla f(\boldsymbol {U}) }\right \rVert _{2}$ .

Appendix B
Proof of Theorem 3.3

The following claim is required for our analysis.

Claim 2.1:

Let $\theta,\tau \in \mathbb {N}$ , and $n, \Upsilon \in \mathbb {N}_{0}$ . Then,

$\begin{equation*} (\Upsilon \geq 1) \quad \sum _{l=0}^{\Upsilon -1}\left \lfloor{ \frac {l}{\tau } }\right \rfloor ^{\theta } \geq \tau \frac {\left({ \left \lfloor{ \frac {\Upsilon }{\tau } }\right \rfloor -1}\right)^{\theta +1}}{\theta +1}. \tag{25}\end{equation*}$ View Source
$\begin{align*} &\hspace {-1pc} \mathrm {If}~n + 1\geq \sum _{l=0}^{\max \{\Upsilon - 1,0\}}\left \lfloor{ \frac {l}{\tau }}\right \rfloor ^{\theta }, ~\mathrm {then} \\ & \left.{\begin{array}{ccccccc}& \Upsilon + 1 \leq \tau (\sqrt {3}+3) \sqrt [\theta +1]{n+1} \\ & \left ({ \Leftrightarrow \frac {\Upsilon + 1}{\tau \times \sqrt [\theta +1]{n+1}} \leq \sqrt {3}+3 }\right) \end{array}}\right \} \tag{26}\end{align*}$ View Source

Proof:

(Step 1) We first show $\begin{equation*} \sum _{l=0}^{\Upsilon -1}\left \lfloor{ \frac {l}{\tau } }\right \rfloor ^{\theta } \geq \tau \sum _{m=0}^{ \left \lfloor{ \frac {\Upsilon }{\tau } }\right \rfloor -1}m^{\theta }. \tag{27}\end{equation*}$ View Source Since (i) the RHS of (27) is invariant for $\Upsilon \in [\kappa \tau, (\kappa +1)\tau -1]$ with every $\kappa \in \mathbb {N}_{0}$ and (ii) the LHS of (27) is monotonically increasing for $\Upsilon$ , it is sufficient to verify the inequality (27) specially for the case $\Upsilon = \kappa \tau$ . In this case, the LHS of (27) is evaluated as $\begin{align*} &\hspace {-1pc} \sum _{l=0}^{\kappa \tau -1}\left \lfloor{ \frac {l}{\tau } }\right \rfloor ^{\theta } = \sum _{m=0}^{\kappa -1} \sum _{i=0}^{\tau -1} \left \lfloor{ \frac {m\tau + i}{\tau } }\right \rfloor ^{\theta } =\sum _{m=0}^{\kappa -1} \sum _{i=0}^{\tau -1} m^{\theta } \\ & = \tau \sum _{m=0}^{\kappa -1} m^{\theta } = \tau \sum _{m=0}^{\frac {\Upsilon }{\tau }-1} m^{\theta } = \tau \sum _{m=0}^{ \left \lfloor{ \frac {\Upsilon }{\tau } }\right \rfloor -1} m^{\theta },\end{align*}$ View Source which equals the RHS of (27).
(Step 2) By combining (27) and the clear relation $\begin{equation*} \tau \sum _{m=0}^{ \left \lfloor{ \frac {\Upsilon }{\tau } }\right \rfloor -1}m^{\theta } \geq \tau \int ^{ \left \lfloor{ \frac {\Upsilon }{\tau } }\right \rfloor -1}_{0} x^{\theta } dx = \tau \frac {\left({ \left \lfloor{ \frac {\Upsilon }{\tau } }\right \rfloor -1}\right)^{\theta +1}}{\theta +1},\end{equation*}$ View Source we deduce (25).
(Step 1) For the special case of $\Upsilon \in [0, \tau -1]$ , (26) is clear by $\Upsilon + 1 \leq \tau \leq \tau (3+\sqrt {3})\leq \tau (3+\sqrt {3})\sqrt [\theta +1]{n+1}$ .
(Step 2) For the other cases $\Upsilon \in [\tau, \infty)$ , combining (25) and the assumption of (26), we have $\begin{equation*} n+1 \geq \sum _{l=0}^{\Upsilon -1}\left \lfloor{ \frac {l}{\tau } }\right \rfloor ^{\theta } \geq \tau \frac {\left({ \left \lfloor{ \frac {\Upsilon }{\tau } }\right \rfloor -1}\right)^{\theta +1}}{\theta +1},\end{equation*}$ View Source which implies $\begin{equation*} \sqrt [\theta +1]{\tau ^{-1}(\theta +1)(n+1)} \geq \left \lfloor{ \frac {\Upsilon }{\tau } }\right \rfloor - 1 \geq \frac {\Upsilon }{\tau } - 2,\end{equation*}$ View Source equivalently $\begin{equation*} \Upsilon + 1 \leq \sqrt [\theta +1]{\tau ^{\theta }(\theta +1)(n+1)} + 2\tau + 1.\end{equation*}$ View Source
(Step 3) By evaluating the RHS of (26), we derive the desired upper bound: $\begin{align*} &\hspace {-1pc} \frac {\sqrt [\theta +1]{\tau ^{\theta } (\theta +1)(n+1)}+2\tau + 1}{\tau \times \sqrt [\theta +1]{n+1}} \\ & = \tau ^{-1/(\theta +1)}\sqrt [\theta +1]{\theta +1} + \frac {2\tau + 1}{\tau \times \sqrt [\theta +1]{n+1}} \\ & \leq \sqrt [\theta +1]{\theta +1} + \frac {2\tau + 1}{\tau }~(\because \tau \geq 1, n \geq 0) \\ & \leq \left ({(\theta +1)+1}\right)^{1/(\theta +1)} + 3 \leq \sqrt {3} + 3\end{align*}$ View Source where in the last inequality, the monotone decreasing property of $\left ({(\theta +1)+1}\right)^{1/(\theta +1)}~(\theta \in \mathbb {N})$ is used.

Now, we are ready to prove Theorem 3.3.

Proof ofTheorem 3.3:

For any $n\in \mathbb {N}_{0}$ , we have $\begin{equation*} n + 1 \geq \sum _{l=0}^{\max \{\ell (n)-1, 0\}} \left \lfloor{ \frac {l}{\tau } }\right \rfloor ^{\theta }. \tag{28}\end{equation*}$ View Source This is because (i) for the special case $\ell (n) = 0$ , we have $n+1 \geq 0 = \sum _{l=0}^{0} \left \lfloor{ \frac {l}{\tau } }\right \rfloor ^{\theta }$ , and (ii) for the other cases $\ell (n) \geq 1$ , we have $\begin{equation*} n+1 \geq \sum _{l=0}^{\ell (n)-1} \left |{{\mathcal {N}}_{l}}\right | \overset {(18)}{\geq } \sum _{l=0}^{\ell (n)-1}\eta (l) \overset {(17)}{=} \sum _{l=0}^{\ell (n)-1} \left \lfloor{ \frac {l}{\tau } }\right \rfloor ^{\theta }.\end{equation*}$ View Source By combining (28) and Claim 2.1 (b) with $\Upsilon := \ell (n)$ , we have $\ell (n) + 1 \leq \tau (\sqrt {3}+3) \sqrt [\theta +1]{n+1}$ . Finally, by substituting the last inequality into (14), we have (19), which also implies $\begin{equation*} 0\leq \inf \mathfrak {G}_{0} \leq \lim _{n\to \infty } \frac {\sqrt {(\sqrt {3}+3)\tau }c}{(n+1)^{\frac {\theta }{2(\theta +1)}}} = 0.\end{equation*}$ View Source Moreover, by noting that $\left \lVert{ \nabla f_{\boldsymbol {S}_{[\ell (n)] }}(\boldsymbol {V}_{n}) }\right \rVert _{F} > 0~(\forall n \in \mathbb {N}_{0})$ holds under Condition 3.1, we have $\begin{equation*} (\forall n \in \mathbb {N}_{0})~\mathfrak {g}_{n}:=\min _{0\leq k \leq n} \left \lVert{ \nabla f_{\boldsymbol {S}_{[\ell (k)] }}(\boldsymbol {V}_{k}) }\right \rVert _{F} > 0.\end{equation*}$ View Source Therefore, we get $\inf \mathfrak {G}_{n} = 0~(\forall n\in \mathbb {N}_{0})$ by $\begin{equation*} \inf \mathfrak {G}_{0} = \min \{\mathfrak {g}_{n}, \inf \mathfrak {G}_{n}\}= 0,\end{equation*}$ View Source and thus $0 = \lim _{n\to \infty }\inf \mathfrak {G}_{n} =:\liminf _{n\to \infty } \left \lVert{ \nabla f_{\boldsymbol {S}_{[\ell (n)] }}(\boldsymbol {V}_{n}) }\right \rVert _{F}$ .

MIT Libraries

MIT Libraries

Adaptive Localized Cayley Parametrization for Optimization Over Stiefel Manifold and Its Convergence Rate Analysis

Alerts

Abstract:

Metadata

Abstract:

Funding Agency:

Introduction

Problem 1.1:

Problem 1.2:

Problem 1.3:

Preliminaries

A. Adaptive Localized Cayley Parametrization for Optimization Over \mathrm{St}(p,N) \mathrm{St}(p,N)

Fact 2.1 (Optimality Condition[33]):

Algorithm 1 Adaptive Localized Cayley Parametrization Strategy

Algorithm 2 Choice of Center Point

Remark 2.2: [On ALCP strategy (Algorithm 1)]:

Remark 2.3:

B. Nonconvex Optimization Over the Euclidean Space With Convergence Rates

Definition 2.4 (NOE-r):

Example 2.5 (Selected Algorithms in NOE-r):

Algorithm 3 Backtracking Algorithm

Convergence Rate Analysis

Condition 3.1 (For Convergence Analysis ofAlgorithm 1):

Lemma 3.2:

Proof:

Theorem 3.3 (Overall Convergence Analysis):

Proof:

Remark 3.4 (Interpretation of Theorem 3.3):

Numerical Experiments

Remark 4.1 (Main Findings From Numerical Experiments):

Conclusion

Appendix AGradient of Function After the Cayley Parametrization

Gradient of Function After the Cayley Parametrization

Fact 1.1 (Gradient after the Cayley Parametrization):

Appendix BProof of Theorem 3.3

Proof of Theorem 3.3

Claim 2.1:

Proof:

Proof ofTheorem 3.3:

References

IEEE Account

Purchase Details

Profile Information

Need Help?

A. Adaptive Localized Cayley Parametrization for Optimization Over $\mathrm{St}(p,N)$

Appendix A
Gradient of Function After the Cayley Parametrization

Appendix B
Proof of Theorem 3.3