Introduction
Hessian estimation is a fundamental primitive with important implications in several machine learning and engineering areas. For example, the Hessian matrix is a key ingredient in the design of optimization algorithms that achieve superlinear or quadratic convergence. These algorithms use the Hessian to either reshape the gradient or build a local function approximation, which allows them to reach the optimal solution of the optimization problem in a few iterations. However, computing the exact Hessian is onerous, and especially for high-dimensional problems it is often necessary to fall back on cheaper Hessian estimates. Moreover, in many relevant cases such as black-box optimization and simulation-based optimization, it is not even possible to explicitly compute the Hessian matrix, as the objective function is accessible only through point evaluations. In the literature, this scenario is referred to as zeroth-order optimization, and it is assumed that function queries are expensive and possibly constrained in number.
The ideal Hessian estimator should possess all the following desirable qualities: (i) To be compatible with black-box functions, the estimator should not require exact gradients for its computation, but only a limited number of function values. (ii) To be computationally efficient, the estimator should incorporate past estimates and not forget previously collected information. (iii) To be employed by distributed optimization algorithms, the estimator should be parallelizable.
The Hessian estimator analyzed in this work satisfies all the above conditions, and in particular is well suited for federated learning, a kind of distributed machine learning where clients are connected to a central server in a star topology. Federated learning has gained significant attention from both academia and industry, as it makes it possible to collaboratively train a shared model over the data islands owned by the various participants, while preserving individual privacy. To satisfy the latter property, transmitting raw data samples is forbidden, and the communications typically consist of model updates. Data exchange between clients is costly and possibly limited or unreliable, for example due to bandwidth restrictions or connection instabilities. This calls for communication-efficient algorithms that converge in a few iterations, and the Hessian estimator analyzed in this work is a useful building block to design them.
A. Preliminaries and Related Work
Below we provide an overview of the state of the art in some research topics related to our work.
Zeroth-order optimization: Zeroth-order optimization allows tackling learning problems where the derivatives of the objective function are not available or are difficult to compute. Problems that fall into this category include policy learning for reinforcement learning, adversarial training and generating explanations for black-box models [1]. Zeroth-order methods can also replace backpropagation for neural networks that are not entirely differentiable, and constitute an alternative to the recent Forward-Forward algorithm [2]. Zeroth-order optimization only requires evaluating the objective at certain input points, and estimates function derivatives by means of finite-differences. However, even function queries are assumed to be expensive and possibly budgeted in number, which calls for efficient algorithms that can get the most out of them. Alternatives to zeroth-order optimization include derivative-free optimization, Bayesian optimization and genetic algorithms.
Zeroth-order federated learning: Since in the second half of this work we present a novel zeroth-order federated learning algorithm, we briefly introduce the main federated zeroth-order algorithms available in the literature: (i) FedZO [3] replaces the exact gradient in FedAvg with a stochastic gradient estimator based on forward finite-differences. (ii) ZONE-S [4] activates only one client per iteration, and minimizes an augmented Lagrangian function at the server. (iii) ZooPFL [5] considers the case where all clients own the same black-box model that cannot be changed, and uses trainable local encoders and linear projections to respectively remap the input and output of the frozen model. (iv) FZooS [6] tackles the problem of query inefficiency, but relies on the strong assumption that every local function is sampled from a Gaussian process. (v) BAFFLE [7] uses shared random seeds to sample common query points at all clients, and estimates the global gradient using Stein's identity. (vi) FedZeN [8] also estimates the Hessian of the global objective in an incremental fashion, and applies a quasi-Newton method at the server to address convex optimization problems. It is worth noting that FedZeN is the only algorithm that takes advantage of second-order derivative information, while the others rely only on gradient estimates.
Zeroth-order Hessian estimation: We quickly review the main techniques to approximate a
Hessian-based federated algorithms: The most effective way to reach a function minimum in few iterations is to leverage the curvature of the objective function, which is described by the Hessian matrix. Below we list the main second-order federated algorithms that take advantage of the Hessian. (i) GIANT [17] addresses convex optimization problems by averaging the local Newton directions of the clients. This is equivalent to approximating the Hessian of the global objective with the harmonic mean of the local Hessian matrices. (ii) FedNL and its variants [18] are designed for convex optimization and make clients transmit to the server compressed innovations of their local Hessian matrices and the corresponding compression errors. (iii) In SHED [18] clients compute the eigendecomposition of their local Hessian only when required, and share some of the most relevant eigenvalue-eigenvector pairs with the server to incrementally update the Hessian approximation. (iv) FLECS [20] extends FedNL to non-convex problems by changing the Hessian update rule and the computation of the approximate Newton direction.
We remark that all the above algorithms make use of exact derivatives and require clients to compute their local Hessian matrix. In particular, none of them allows approximate Hessian matrices, which makes it impossible to implement them directly as zeroth-order algorithms. The only Hessian-aware federated algorithm that is also zeroth-order is FedZeN [8], but it is limited to convex problems.
B. Contribution and Organization
The contribution of this paper is twofold:
1) We provide a number of novel results regarding the properties of the incremental Hessian estimator proposed in [16]. Our analysis is profoundly different and more in-depth with respect to the original paper. First, we focus on a practical zeroth-order implementation of the estimator suitable for black-box optimization. We take into account the approximation error on the directional curvature due to finite-differences, which is neglected in [16]. Second, to fully exploit the advantage of incremental estimation, we consider the case of a time-varying Hessian to be tracked. Finally, we derive non-asymptotic bounds on the convergence of the estimation error and on the minimum number of function queries needed to achieve a given accuracy with the desired probability. The above analysis is non-trivial and involves mathematical tools that are not present in [16]. Moreover, we empirically show that O(d) function queries are sufficient to obtain a useful estimate of a d \times d Hessian, whereas standard entry-wise estimators necessarily require O(d^{2}) queries. Our bounds make the incremental Hessian estimator actually reliable, allowing it to be used whenever a cheap but sufficiently accurate approximation of a Hessian is needed.
2) We present a novel optimization algorithm as a practical application of the incremental Hessian estimator discussed at item 1). The algorithm, named FedZCR, is the first zeroth-order algorithm for non-convex federated optimization to estimate and exploit the Hessian of the objective. The Hessian estimator is built in the federated setting following the distributed approach of [8]. In particular, our algorithm can be seen as an extension of [8] to the non-convex setting, and further improves over [8] by replacing a simplifying assumption with rigorous conditions. The Hessian estimate is combined with cubic regularization at the server to escape saddle points and converge with arbitrarily high probability and accuracy. More specifically, the algorithm is guaranteed to converge to a second-order (\tau, \sqrt{\tau })-optimal point in O(\tau ^{-3/2}) iterations with high probability, which is currently the best iteration complexity for non-convex problems. We also propose the adaptive version FedZACR, and the use of subsampling to decrease the total computational cost and accommodate client heterogeneity. Our numerical results show that FedZACR outperforms the existing zeroth-order federated methods, and is comparable to second-order methods for convex optimization that use the exact derivatives.
The organization of this work reflects the duality of our contributions: in the first half of the paper (Section II) we derive theoretical bounds for the Hessian estimator, while in the second half (Section III) we present a novel algorithm for non-convex zeroth-order federated learning. Finally, Section IV is dedicated to numerical simulations.
Incremental Hessian Estimation
The first part of this work is dedicated to the analysis of a Hessian estimator that can be used to relieve the computational burden of computing exact Hessian matrices. We consider an estimator that can be employed also when the target function is only accessible by means of point evaluations, and that can be easily adapted to perform distributed estimation. Our goal is to derive two kinds of bounds: (i) upper bounds on the approximation error with respect to the computational cost, measured by the number of function queries, and (ii) lower bounds on the number of function evaluations needed to achieve a given accuracy. These bounds are especially relevant in zeroth-order optimization, where the common assumption is that function evaluations are expensive, possibly budgeted in number, and come with a significant computational cost. Choosing an estimator that is parsimonious with function queries and does not represent a bottleneck is therefore crucial to the efficiency of the application that uses it. With the above premise, one cannot resort to standard entry-wise estimators based on finite-differences along the canonical basis, which involve O(d^{2}) function queries.
Notation: We denote with
A. Bounds for the Hessian Estimator
We start by introducing the Hessian estimator proposed in [16], based on the update
\begin{equation*}
H^{j} = H^{j-1} + \left(u_{j}^{T} (\nabla ^{2} f(x) - H^{j-1}) u_{j} \right) (u_{j} u_{j}^{T}), \tag{1}
\end{equation*}
\begin{equation*}
\mathbb {E} \left[ e \left(H^{j} \right)^{2} \right] \leq \left(1 - \frac{2}{d(d+2)} \right) e \left(H^{j-1} \right)^{2} \tag{2}
\end{equation*}
\begin{align*}
\hat{H}^{j} = &\hat{H}^{j-1} + \biggl (\frac{f(x + \mu u_{j})-2f(x)+f(x - \mu u_{j})}{\mu ^{2}} \\
& - u_{j}^{T} \hat{H}^{j-1} u_{j} \biggr)(u_{j} u_{j}^{T}). \tag{3}
\end{align*}
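For concreteness, the following Python sketch applies update (3) to a generic callable objective; the quadratic test function, the smoothing radius and the number of updates are illustrative choices and not values prescribed by the analysis.

```python
import numpy as np

def zo_hessian_update(H_prev, f, x, u, mu):
    # Zeroth-order estimate of the directional curvature u^T (nabla^2 f(x)) u
    curv = (f(x + mu * u) - 2.0 * f(x) + f(x - mu * u)) / mu**2
    # Rank-one correction of the previous estimate along u, as in update (3)
    return H_prev + (curv - u @ H_prev @ u) * np.outer(u, u)

# Illustrative usage on a quadratic, whose Hessian is the constant matrix A
d = 5
rng = np.random.default_rng(0)
A = rng.standard_normal((d, d)); A = A + A.T
f = lambda z: 0.5 * z @ A @ z
x, H = np.zeros(d), np.zeros((d, d))
for _ in range(200):
    u = rng.standard_normal(d); u /= np.linalg.norm(u)  # uniform direction on the unit sphere
    H = zo_hessian_update(H, f, x, u, mu=1e-4)
print(np.linalg.norm(A - H, "fro"))  # the estimation error shrinks as more directions are used
```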
To analyze the estimator we first need to characterize the objective function with the following assumption.
Assumption 1 (Smoothness):
Let the global cost
\begin{align*}
\left\Vert f(x) - f(y) \right\Vert & \leq L_{0} \left\Vert x-y \right\Vert, \\
\left\Vert \nabla f(x) - \nabla f(y) \right\Vert & \leq L_{1} \left\Vert x-y \right\Vert, \\
\left\Vert \nabla ^{2} f(x) - \nabla ^{2} f(y) \right\Vert & \leq L_{2} \left\Vert x-y \right\Vert, \\
\left\Vert \left(D^{3} f \right)(x) - \left(D^{3} f \right)(y) \right\Vert & \leq L_{3} \left\Vert x-y \right\Vert.
\end{align*}
We recall that if
We begin by bounding the expected improvement in accuracy provided by the exact update (1). Since the Hessian estimator is incremental, to follow its evolution we need a convergence rate that can be applied recursively. The convergence rate provided in [16] considers the squared norm of the approximation error, while here we look for a bound on the non-squared norm, that is needed for subsequent analysis. This bound can be obtained by choosing
Lemma 1 (Convergence rate of the update (1)):
Let
\begin{equation*}
\mathbb {E} \left[ \left\Vert S - (u^{T} S u)u u^{T} \right\Vert _{F} \right] \leq \eta \left\Vert S\right\Vert _{F}, \; \eta = \sqrt{ 1 - \frac{3}{d(d+2)} }.
\end{equation*}
Remark 1 (Alternative convergence rate):
The proof of Lemma 1 is quite involved, but it is possible to obtain a convergence rate slightly worse than the one above in a much simpler way. Indeed, applying moment monotonicity to (2) we get
\begin{align*}
\mathbb {E} \left[ \left\Vert S - (u^{T} S u)u u^{T} \right\Vert _{F} \right] & \leq \left(\mathbb {E} \left[ \left\Vert S - (u^{T} S u)u u^{T} \right\Vert ^{2}_{F} \right] \right)^{1/2} \\
& \leq \left(\left(1 - \frac{2}{d(d+2)} \right) \left\Vert S\right\Vert ^{2}_{F} \right)^{1/2} \\
& = \left(1 - \frac{2}{d(d+2)} \right)^{1/2} \left\Vert S\right\Vert _{F}.
\end{align*}
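As a quick numerical sanity check of this contraction, the sketch below runs the exact update (1) on a randomly generated symmetric matrix and compares the averaged Frobenius error with the geometric envelope implied by (2); all dimensions and trial counts are arbitrary.

```python
import numpy as np

d, r_max, trials = 8, 300, 200
rho = 1.0 - 2.0 / (d * (d + 2))                     # squared-norm contraction factor of (2)
rng = np.random.default_rng(1)
A = rng.standard_normal((d, d)); A = A + A.T        # target Hessian of a quadratic objective
err = np.zeros(r_max + 1)
for _ in range(trials):
    H = np.zeros((d, d))
    err[0] += np.linalg.norm(A - H, "fro")
    for r in range(1, r_max + 1):
        u = rng.standard_normal(d); u /= np.linalg.norm(u)
        H += (u @ A @ u - u @ H @ u) * np.outer(u, u)   # exact update (1)
        err[r] += np.linalg.norm(A - H, "fro")
err /= trials
envelope = rho ** (np.arange(r_max + 1) / 2.0) * err[0]
print(err[::50])        # empirical mean error ...
print(envelope[::50])   # ... against the theoretical envelope rho^(r/2) * e(H^0)
```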
We now address the error due to the zeroth-order approximation, which is neglected in [16] and not quantified in [8], and which affects the convergence rate as shown below.
Lemma 2 (Zeroth-order error of the update (3)):
Consider the updates (1) and (3) with
\begin{equation*}
\mathbb {E} \left[ \left\Vert H^{j} - \hat{H}^{j}\right\Vert \right] \leq \frac{\mu ^{2} L_{3}}{\text{12}\,d}.
\end{equation*}
\begin{equation*}
\mathbb {E} \left[ \left\Vert \nabla ^{2} f(x) - \hat{H}^{j}\right\Vert _{F} \right] \leq \eta \left\Vert \nabla ^{2} f(x) - \hat{H}^{j-1}\right\Vert _{F} + \frac{\mu ^{2} L_{3}}{\text{12}\,d}.
\end{equation*}
We denote with
\begin{equation*}
\mathbb {P}(z \geq \epsilon) \leq \frac{\mathbb {E}[z]}{\epsilon } \quad \forall \epsilon >0. \tag{4}
\end{equation*}
In our case
Lemma 3 (Convergence of the zeroth-order Hessian estimator (3)):
The expected convergence rate of the estimation error of
\begin{equation*}
\mathbb {E} \left[ \left. e\left(\hat{H}^{r} \right) \right| \hat{H}^{0} \right] \leq \eta ^{r} e\left(\hat{H}^{0} \right) + \frac{\mu ^{2} L_{3}}{\text{12}\,d} \sum _{i=0}^{r-1} \eta ^{i}.
\end{equation*}
Remark 2 (Design parameters \mu and r ):
The bound of Lemma 3 tells us that the approximation error of the Hessian estimator can be made arbitrarily small by tuning the design parameters \mu and r.
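In particular, letting r go to infinity in Lemma 3 and using the expansion 1 - \eta \approx 3/(2d(d+2)) (see Appendix A), the residual error floor induced by the finite-difference step \mu is approximately
\begin{equation*}
\lim _{r \to \infty } \frac{\mu ^{2} L_{3}}{12 d} \sum _{i=0}^{r-1} \eta ^{i} = \frac{\mu ^{2} L_{3}}{12 d (1-\eta)} \approx \frac{\mu ^{2} L_{3} (d+2)}{18},
\end{equation*}
which clarifies why \mu must be kept small when many search directions are used.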
We are now in a position to apply Markov's inequality, which provides sufficient conditions for the norm of the estimation error to be arbitrarily small with high probability. This requires choosing the maximum tolerable error
Our analysis addresses the general case in which the decision vector
Theorem 1 (Sufficient number of random directions):
Consider a sequence of Hessian matrices of a function
• (Memory-less case) Let
\begin{gather*}
\mu \leq \sqrt{ \frac{\text{12}\,d (1- \eta)}{L_{3}} \min \left\lbrace \epsilon \delta, \; \sqrt{d} L_{1} \right\rbrace },\\
r(k) \geq \log _{\eta } \left(\frac{ \epsilon \delta (1-\eta) - \frac{\mu ^{2} L_{3}}{\text{12}\,d}}{\sqrt{d} L_{1} (1-\eta) - \frac{\mu ^{2} L_{3}}{\text{12}\,d}} \right),
\end{gather*}
\begin{equation*}
\mathbb {P} \left(\left. \left\Vert \nabla ^{2} f(x_{k}) - \hat{H}_{k}^{r(k)} \right\Vert _{F} \geq \epsilon \right| \hat{H}_{k}^{0} = 0 \right) \leq \delta.
\end{equation*}
• (Warm-start case) Arbitrarily choose an initial point
\begin{align*}
\mu \leq& \sqrt{ \frac{\text{12}\,d \epsilon \delta (1- \eta)}{L_{3}} }, \; r(k) \geq \log _{\eta } \left(\frac{ \epsilon \delta - \frac{1}{1- \eta } \frac{\mu ^{2} L_{3}}{\text{12}\,d} }{c_{k}} \right),\\
c_{k} =& \frac{\mu ^{2} L_{3}}{\text{12}\,d} \left[ \sum _{j=1}^{k-1} \left(\eta ^{ \sum _{z=j+1}^{k-1} r(z) } \frac{1- \eta ^{r(j)}}{1-\eta } \right) - \frac{1}{1- \eta } \right] \\
& + \sqrt{d} L_{2} \sum _{j=2}^{k} \left(\eta ^{\sum _{z=j}^{k-1} r(z)} \left\Vert x_{j} - x_{j-1}\right\Vert \right) \\
& + \eta ^{\sum _{j=1}^{k-1} r(j)} \left\Vert \nabla ^{2} f(x_{1}) - \hat{H}_{1}^{0} \right\Vert _{F},
\end{align*}
\begin{equation*}
\mathbb {P} \left(\left. \left\Vert \nabla ^{2} f(x_{k}) - \hat{H}_{k}^{r(k)} \right\Vert _{F} \geq \epsilon \right| \hat{H}_{1}^{0} \right) \leq \delta.
\end{equation*}
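As a rough illustration of how the memory-less conditions above can be evaluated in practice, the following hypothetical helper computes an admissible pair (\mu, r(k)); the safety factor and the example constants are illustrative and are not part of the theorem.

```python
import numpy as np

def memoryless_parameters(d, eps, delta, L1, L3, safety=0.5):
    # Evaluate the memory-less conditions of Theorem 1: a finite-difference step mu
    # and a number of directions r such that the Frobenius error exceeds eps with
    # probability at most delta. 'safety' shrinks mu below its admissible maximum.
    eta = np.sqrt(1.0 - 3.0 / (d * (d + 2)))
    mu = safety * np.sqrt(12.0 * d * (1.0 - eta) / L3 * min(eps * delta, np.sqrt(d) * L1))
    bias = mu**2 * L3 / (12.0 * d)                  # residual finite-difference bias
    ratio = (eps * delta * (1.0 - eta) - bias) / (np.sqrt(d) * L1 * (1.0 - eta) - bias)
    r = int(np.ceil(np.log(ratio) / np.log(eta)))
    return mu, r

print(memoryless_parameters(d=20, eps=0.1, delta=0.05, L1=10.0, L3=1.0))
```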
Remark 3:
The update
The asymptotic scaling
Fig. 1: Visual representation of the bound provided by Lemma 3.
Fig. 2: Plot of the upper bound.
Fig. 3: Training loss of zeroth-order logistic regression. The competitors of FedZACR are other federated zeroth-order algorithms.
Fig. 4: Training loss of standard logistic regression: unfair comparison between the zeroth-order FedZACR and other federated algorithms that make use of the exact derivatives. While FedZACR is clearly disadvantaged, its performance is comparable to second-order state-of-the-art algorithms.
Fig. 5: Function value suboptimality of zeroth-order linear regression using a non-convex loss. We compare FedZACR to other federated zeroth-order algorithms.
Fig. 6: Training accuracy achieved while fine-tuning a black-box neural network on a classification task. FedZACR is compared with other federated zeroth-order algorithms for non-convex optimization.
Remark 4:
The incremental randomized procedure allows an arbitrarily small number of function queries, provides an improved estimate at each update and can be stopped at any moment. Moreover, it makes it possible to track a time-varying Hessian using the previous estimate as starting point. In comparison, deterministic entry-wise estimators always start from scratch and necessarily require O(d^{2}) function queries.
B. Gradient Estimation and Improved Hessian Estimation via Orthonormal Search Directions
The incremental Hessian estimator requires uniformly sampling a set of vectors from the unit sphere. This is commonly done by normalizing random vectors
Picking the search directions from the union of multiple orthonormal bases can also improve gradient estimation. Indeed, it opens up the possibility of using the gradient estimator
\begin{equation*}
\hat{g} = \sum _{j=1}^{d} \frac{f(x + \mu u_{j}) - f(x - \mu u_{j})}{2 \mu } u_{j}, \tag{5}
\end{equation*}
\begin{equation*}
\left\Vert \nabla f(x) - \hat{g}\right\Vert \leq \frac{d L_{2} \mu ^{2}}{6}, \tag{6}
\end{equation*}
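A minimal implementation of the estimator (5) is sketched below; the test function and the orthonormal basis obtained from a QR factorization are illustrative choices.

```python
import numpy as np

def zo_gradient(f, x, mu, U):
    # Central-difference gradient estimator (5) along the columns of the
    # orthonormal basis U (one column per coordinate of x)
    g = np.zeros_like(x)
    for j in range(U.shape[1]):
        u = U[:, j]
        g += (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu) * u
    return g

d = 6
rng = np.random.default_rng(2)
U, _ = np.linalg.qr(rng.standard_normal((d, d)))    # orthonormal search directions
f = lambda z: np.sum(np.sin(z)) + 0.5 * z @ z       # smooth test function
x = rng.standard_normal(d)
print(np.linalg.norm(zo_gradient(f, x, 1e-5, U) - (np.cos(x) + x)))  # close to zero
```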
Application to Federated Learning
In this second part of the paper we propose a concrete application of the bounds developed so far. We first show how the incremental Hessian estimator can be efficiently built in a distributed way, and then we describe a novel zeroth-order algorithm for non-convex federated optimization of black-box functions. While in what follows we focus on zeroth-order federated learning, we remark that the bounds in Section II can be used whenever a Hessian estimate with bounded approximation error is needed, i.e. also in centralized or fully-distributed optimization settings.
A. Problem Formulation
We address the horizontal federated learning setting, where a set of clients are connected to a central server in a star topology. The objective is to collaboratively train a shared model to achieve higher accuracy and generalization capabilities, while preserving the individual privacy. Clients own disjoint sets of private data samples belonging to the same feature space, and the distribution of the local data can vary across participants. We consider the empirical risk minimization problem
\begin{equation*}
f(x^\star) = \min _{x \in \mathbb {R}^{d}} \left\lbrace f(x) := \frac{1}{n} \sum _{i=1}^{n} f_{i}(x) \right\rbrace. \tag{7}
\end{equation*}
\begin{equation*}
f_{i}(x) = \frac{1}{m_{i}} \sum _{j=1}^{m_{i}} f_{ij}(x). \tag{8}
\end{equation*}
To devise an algorithm that can work with limited communication bandwidth, we require that all the devices that take part in the training are equipped with a pseudo-random number generator (PRNG). This requirement is reasonable and barely restrictive, and can also be found in [7], [8] and [20].
Assumption 2 (PRNGs):
All the clients and the central server have access to a pseudo-random number generator, whose output sequence is repeatable and is uniquely determined by the internal seed.
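A minimal illustration of Assumption 2, assuming a generic seeded generator: two nodes initialized with the same seed produce exactly the same pseudo-random sequence, and hence the same search directions, without exchanging them.

```python
import numpy as np

seed = 1234
server_rng = np.random.default_rng(seed)
client_rng = np.random.default_rng(seed)

u_server = server_rng.standard_normal(5)
u_client = client_rng.standard_normal(5)
assert np.allclose(u_server, u_client)  # identical outputs from the shared seed
```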
B. Federated Gradient and Hessian Estimation
The zeroth-order gradient and Hessian estimators presented in Section II can be built in a distributed way by following the procedure described in [8]. Consider the federated scenario introduced in the previous section, namely a set of clients connected to a central server according to a star topology. The key idea is to leverage Assumption 2 to generate a common set of search directions at all nodes, removing the communication cost associated with broadcasting the directions. Below we summarize the distributed estimation procedure, as it will also be employed by the federated algorithm proposed in the next subsection.
During the initialization phase, the central server sends to all the clients the same seed, possibly using a private transmission channel, and the clients use it to synchronize their pseudo-random number generators. Suppose that we want to estimate the global gradient and Hessian at the current decision vector x_k.
\begin{equation*}
c_{ij} = \frac{ f_{i}(x_{k} + \mu u_{j}) - f_{i}(x_{k} - \mu u_{j})}{2 \mu } \approx \nabla f_{i}(x_{k})^{T} u_{j}, \tag{9}
\end{equation*}
\begin{equation*}
b_{ij} = \frac{f_{i}(x_{k} + \mu u_{j})-2f_{i}(x_{k})+f_{i}(x_{k} - \mu u_{j})}{\mu ^{2}}. \tag{10}
\end{equation*}
\begin{gather*}
g_{k} = \sum _{j=1}^{d} \left(\frac{1}{n} \sum _{i=1}^{n} c_{ij} \right) u_{j}, \tag{11}\\
H_{k}^{j} = H_{k}^{j-1} + \left(\frac{1}{n} \sum _{i=1}^{n} b_{ij} - u_{j}^{T} H_{k}^{j-1} u_{j} \right) u_{j} u_{j}^{T}, \quad j \in [r].
\end{gather*}
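The round described above can be prototyped in a single process as follows; the sketch uses r = d directions (one orthonormal basis), and in the real protocol each client would transmit only the scalar coefficients c_ij and b_ij rather than any raw data.

```python
import numpy as np

def federated_zo_estimates(local_fs, x, mu, seed):
    # local_fs: list of client objectives f_i; the shared seed reproduces the same
    # orthonormal directions at every node (Assumption 2)
    d = x.size
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((d, d)))   # common orthonormal directions
    g, H = np.zeros(d), np.zeros((d, d))
    for j in range(d):
        u = U[:, j]
        c = np.mean([(f(x + mu*u) - f(x - mu*u)) / (2*mu) for f in local_fs])           # (9)
        b = np.mean([(f(x + mu*u) - 2*f(x) + f(x - mu*u)) / mu**2 for f in local_fs])   # (10)
        g += c * u                                     # gradient estimate (11)
        H += (b - u @ H @ u) * np.outer(u, u)          # incremental Hessian update
    return g, H
```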
Remark 5:
This distributed estimation of the Hessian is advantageous also in the case of non-zeroth-order federated optimization, to build a Hessian estimate at the server in a communication-efficient way. Indeed, if clients are able to compute the exact Hessian of their local function, they can simply send to the server the exact directional curvatures, i.e.
As noted in [8], the above procedure has two properties that are relevant for federated learning. First, since it estimates the derivatives of the global objective, it allows for clients with heterogeneous data distributions. Second, if in the initialization phase the seed is shared through a private channel, then the procedure conceals the estimated derivatives from external eavesdroppers, who can only obtain the coefficients
In Big Data applications it may not be feasible to compute the derivative estimators using all the data samples. To address these scenarios, in Appendix C we generalize the bounds on the estimation error to include the cases where subsampling is used.
Remark 6:
The aforementioned procedure was first introduced in [8], which proposes FedZeN, a federated zeroth-order algorithm for convex optimization. In the theoretical analysis of the latter, the approximation error of the Hessian estimator is bounded using a simplifying assumption that involves unknown quantities. Using the bounds developed in Section II, it is possible to remove that assumption and obtain a lower bound on the number of search directions needed by FedZeN to achieve local quadratic convergence.
C. Cubic-Regularized Federated Learning
We now show how our analysis regarding the incremental Hessian estimator can be applied to design provably convergent algorithms for non-convex federated optimization. Building over the previous sections, we design a novel federated algorithm that can be applied to non-convex functions whose exact derivatives are not available. The algorithm arises from the union of the following three elements. First, the procedure described in Section III-B is used to estimate the derivatives of the global objective at the central server. Second, the bounds derived in Section II are used to tune the number of search directions and make the Hessian estimator sufficiently accurate. Finally, the estimated derivatives are used to solve a cubic-regularized subproblem, allowing to optimize non-convex objectives with solid convergence guarantees.
1) Centralized Cubic-Regularized Newton
Non-convex optimization problems are typically characterized by multiple local minima and saddle points, which require additional care. For this reason, in our algorithm we employ the cubic-regularized Newton method introduced in [21], which for general non-convex optimization problems ensures fast convergence to approximate second-order stationary points, i.e. approximate local minima, defined below.
Definition 1 ((\epsilon _{g}, \epsilon _{H}) -optimality):
Given
\begin{equation*}
\left\Vert \nabla f(x) \right\Vert \leq \epsilon _{g}, \quad \lambda _{\min } (\nabla ^{2} f(x)) \geq - \epsilon _{H},
\end{equation*}
More specifically, under the assumption of
The key feature of this method is the capability of escaping saddle points that are non-degenerate, i.e. for which the Hessian is full-rank. As argued in [22], in general most critical points are saddles, and statistically the number of saddles grows exponentially with the size of the problem. Therefore, cubic-regularized Newton provides possibly better solutions compared to other optimization algorithms such as stochastic gradient descent, which may get stuck in saddle points and make the loss function plateau [22].
Remark 7:
As an alternative to the cubic-regularized Newton method, one can use the approximate saddle-free Newton method [22], which takes the absolute value of the eigenvalues of the Hessian to avoid getting trapped in saddle points.
At each iteration the method updates the decision vector using
\begin{equation*}
s_{k} \!=\! \text{argmin}_{s \in \mathbb {R}^{d}} f(x_{k}) \!+\! \nabla f(x_{k})^{T} s \!+\! \frac{1}{2}s^{T} \nabla ^{2} f(x_{k}) s \!+\! \frac{M}{6} \left\Vert s\right\Vert ^{3}\!.
\end{equation*}
To use our Hessian estimator, we resort to a version of the cubic-regularized Newton method that allows for approximate function derivatives, minimizing the cubic model
\begin{equation*}
m_{k} (s) = f(x_{k}) + g_{k}^{T} s + \frac{1}{2}s^{T} H_{k}^{r(k)} s + \frac{M_{k}}{6} \left\Vert s\right\Vert ^{3}. \tag{12}
\end{equation*}
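The cubic model (12) can be minimized with any off-the-shelf solver; the sketch below uses a quasi-Newton routine from SciPy purely for illustration, and exploits the fact that the gradient of the cubic term is (M/2)\Vert s\Vert s.

```python
import numpy as np
from scipy.optimize import minimize

def cubic_step(g, H, M):
    # Approximately minimize m(s) = g^T s + 0.5 s^T H s + (M/6) ||s||^3
    # (the constant f(x_k) in (12) does not affect the minimizer)
    model = lambda s: g @ s + 0.5 * s @ H @ s + M / 6.0 * np.linalg.norm(s)**3
    model_grad = lambda s: g + H @ s + 0.5 * M * np.linalg.norm(s) * s
    return minimize(model, np.zeros_like(g), jac=model_grad, method="L-BFGS-B").x

# With an indefinite Hessian estimate the cubic penalty keeps the step bounded
g = np.array([1.0, -2.0])
H = np.array([[1.0, 0.0], [0.0, -3.0]])
print(cubic_step(g, H, M=10.0))
```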
2) The Novel FedZCR and FedZACR
Combining the federated derivative estimation described in Section III-B with cubic regularization, we obtain a novel algorithm for non-convex federated optimization, that we call FedZCR (Federated Zeroth-order Cubic Regularization). The first part of Algorithm 1 implements the federated estimation procedure, which builds the gradient and Hessian estimators at the central server. In the second part of the pseudocode the server updates the model parameters by solving the subproblem (12).
Remark 8:
Cubic-regularized Newton methods are advantageous also for convex optimization problems, since (i) compared to standard Newton methods they provide global convergence guarantees, and (ii) they do not require the Hessian to be positive-definite. This allows us to directly use the incremental Hessian estimator (1), which is not guaranteed to be positive-definite. In comparison, FedZeN [8] needs to either add a regularization term or clip the eigenvalues of the estimated Hessian.
We also introduce the variant FedZACR, where A stands for adaptive, that updates the coefficient
3) Convergence Analysis of FedZCR
Below we exploit the bounds regarding the Hessian estimator to derive the values of
In order to preserve the appealing features of the original cubic-regularized Newton method while using inexact gradient and Hessian, the estimators in (12) must satisfy certain accuracy requirements. Many choices are available in the literature for these requirements, which correspond to different convergence results. In particular, it is shown in [25] that the convergence properties of the exact cubic-regularized Newton method can be retained provided that at all iterations
\begin{gather*}
\left\Vert H_{k}^{r(k)} - \nabla ^{2} f (x_{k})\right\Vert \leq \alpha \left\Vert s_{k-1}\right\Vert, \tag{13}\\
\left\Vert g_{k} - \nabla f (x_{k})\right\Vert \leq \beta \left\Vert s_{k-1}\right\Vert ^{2}, \quad M_{k} \geq \frac{2}{3} L_{2} + 8 \alpha + 8 \beta,
\end{gather*}
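The requirements in (13) can be monitored numerically, for example with the simple check sketched below (all arguments are scalars measured or estimated at iteration k).

```python
def cubic_inexactness_ok(H_err, g_err, s_prev_norm, M, L2, alpha, beta):
    # Accuracy requirements (13) for the inexact cubic-regularized step
    return (H_err <= alpha * s_prev_norm
            and g_err <= beta * s_prev_norm**2
            and M >= 2.0 / 3.0 * L2 + 8.0 * alpha + 8.0 * beta)
```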
The following Theorem 2 provides sufficient conditions under which FedZCR encounters an approximate second-order optimal point.
Theorem 2:
Let the objective function satisfy the smoothness properties in Assumption 1. Fix two positive constants
\begin{gather*}
\epsilon _{k} \leq \alpha \left\Vert s_{k-1}\right\Vert, \qquad M_{k} \geq \frac{2}{3} L_{2} + 8 \alpha + 8 \beta, \\
\mu _{k} \leq \min \left\lbrace \sqrt{ \frac{6 \beta }{d L_{2}} } \left\Vert s_{k-1}\right\Vert, \, \sqrt{ \frac{\text{12}\,d \epsilon _{k} \delta _{k} (1- \eta)}{L_{3}} } \right\rbrace,
\end{gather*}
\begin{equation*}
\left\Vert \nabla f(\tilde{x}) \right\Vert \leq \frac{C_{1}}{(\tau -1)^{2/3} }, \quad \lambda _{\min } (\nabla ^{2} f(\tilde{x})) \geq - \frac{C_{2}}{(\tau -1)^{1/3} }
\end{equation*}
Proof:
The proof is a straightforward application of (6) and Theorem 1 to the requirements in (13). Imposing the condition
\begin{equation*}
\left\Vert \nabla f(x) - \hat{g}\right\Vert \leq \frac{d L_{2} \mu ^{2}}{6} \leq \beta \left\Vert s_{k-1}\right\Vert ^{2}
\end{equation*}
Numerical Results
We now present a set of numerical simulations that showcases the performance of the algorithm proposed in Section III-C2. We focus on the adaptive version FedZACR, that can be directly applied to any problem as it does not require tuning the parameter
A. Convex Cost Functions
Our first experiments concern logistic regression, one of the most common benchmarks for convex optimization. We consider the same convex optimization problem as [8], namely binary classification via logistic regression using two classes of the Covertype dataset [27]. The feature size is
1) Comparison With Zeroth-Order Methods
We compare FedZACR with the following zeroth-order federated methods:
• FedZO [3], the zeroth-order counterpart of FedAvg, setting learning rate \eta = 0.1, H = 5 local epochs and b_2 = d perturbation directions.
• ZONE-S [4], where at each iteration only one client is updated and the server minimizes an augmented Lagrangian function. In our implementation we use Nesterov accelerated gradient to solve the Lagrangian subproblem.
• A federated version of ZO-JADE [10], which is a fully-distributed algorithm for convex optimization that allows for general network topologies. ZO-JADE estimates only the diagonal of the Hessian matrix to use it as a preconditioner.
• FedZeN [8], from which our algorithm inherits the distributed estimation procedure. FedZeN is specifically designed for strongly-convex optimization and enforces the estimated Hessian to be positive-definite by applying either regularization or eigenvalue clipping. Since we replicate the test performed in [8], we tune the parameters of FedZeN as described there.
Fig. 3 shows that FedZACR outperforms all the other algorithms, including FedZeN [8], which is based on the same distributed derivative estimation procedure. Moreover, FedZeN has an unfair advantage in knowing that the problem is convex, which allows it to “cheat” and tweak the Hessian estimate to be positive-definite. In comparison, FedZACR is not aware of convexity and makes up for Hessian uncertainty by using the cubic penalty to limit the stepsize. The training loss is plotted against the number of function queries, which are assumed to be the most expensive computations and represent the main evaluation metric in the zeroth-order optimization field.
2) Comparison With Non-Zeroth-Order Methods
We now consider the scenario where clients are able to directly compute the exact derivatives of their local functions. We compare the zeroth-order FedZACR (with the same parameters as in the previous test) with the following competitors that make use of the exact derivatives:
• GIANT [17], which computes an approximate Newton direction by replacing the arithmetic mean of the local Hessian matrices with the harmonic mean. To prevent algorithmic divergence we use a fixed stepsize equal to 0.2, which is the largest stepsize the algorithm can handle in our experiment.
• FedNL and FedNL-CR [18], where clients send to the server compressed Hessian innovations and the corresponding compression errors. The vanilla FedNL applies a Newton-type step, while FedNL-CR assumes knowledge of the Lipschitz constant of the Hessian to apply a cubic-regularized Newton method. In our implementation we use stepsize \alpha = 1 and the rank-3 compression operator based on singular value decomposition.
• FLECS [20], which starts from FedNL and proposes alternative ways to update the Hessian estimate at the server and compute the quasi-Newton step. We consider FLECS with “direct update” and “truncated inverse Hessian approximation”, which are respectively Algorithms 2 and 3 in [20].
• FedAvg [28], which is the equivalent of stochastic gradient descent for federated learning. While FedAvg is a first-order method and cannot compete with the above second-order algorithms, we still include it as it represents the baseline for federated learning. We set learning rate \eta = 0.1, E = 10 local epochs and batch size B = 50.
Fig. 4 shows that while being a zeroth-order algorithm, FedZACR can compete with state-of-the-art second-order methods that have access to the exact derivatives. The two variants of FedNL achieve the best performance, but together with GIANT they can be applied only when the objective function is strongly-convex. We remark that all the second-order competitors require the computation of the local Hessian matrices, and do not support black-box objectives or approximate derivatives.
B. Non-Convex Cost Functions
The following experiments highlight the distinguishing features of FedZACR: our algorithm allows tackling non-convex federated learning problems, does not require the objective function to be differentiable, and exploits the second-order derivative to achieve faster convergence.
1) Non-Convex Robust Linear Regression
In this test the Bike Sharing dataset [29] is evenly partitioned over
\begin{equation*}
f_{i}(r_{i}) = \frac{|\alpha -2|}{\alpha } \left(\left(\frac{(r_{i}/c)^{2}}{|\alpha -2|} + 1 \right)^{\alpha /2} -1 \right), \tag{14}
\end{equation*}
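A direct implementation of the loss (14) is given below; the default values of \alpha and c are only illustrative and are not the ones used in our experiments.

```python
import numpy as np

def robust_loss(r, alpha=-2.0, c=1.0):
    # General robust loss (14) applied to the residual r; it approaches the squared
    # loss as alpha approaches 2, while negative alpha makes it non-convex and
    # insensitive to outliers
    a = abs(alpha - 2.0)
    return a / alpha * (((r / c) ** 2 / a + 1.0) ** (alpha / 2.0) - 1.0)

residuals = np.linspace(-5.0, 5.0, 11)
print(robust_loss(residuals))  # bounded, saturating penalty for large residuals
```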
2) Black-Box Neural Network Finetuning
The test consists in fine-tuning a black-box convolutional neural network on a classification task. The network is an off-the-shelf model composed of a convolutional layer, two fully-connected layers and ReLU activation functions. The first two layers are frozen and treated as a black-box opaque to the optimizer. The last layer is trained from scratch, and the total amount of learnable parameters is
Since the objective function is non-convex and non-differentiable, we can compare FedZACR only with the zeroth-order methods FedZO and ZONE-S. The hyperparameters are the same as in the previous test (Section IV-A), with the only difference that for FedZACR we set a constant
Fig. 6 shows that FedZACR trains the network faster and more efficiently compared to its contenders, requiring fewer iterations and function queries to achieve the same training accuracy.
Conclusion
We have addressed two different but related topics. The common thread is a randomized estimator of the Hessian matrix suitable for zeroth-order optimization, where functions are black boxes accessible only through input-output pairs and function derivatives are not available. Moreover, the estimator is incremental, meaning that it is built by iteratively updating the latest estimate as new information arrives.
In the first part of the paper we have analyzed the convergence properties of the estimator, bounding the estimation error after a finite number of iterations and deriving the minimum number of search directions needed to achieve a given accuracy with high probability.
In the second part we have proposed a novel algorithm for federated optimization of non-convex black-box functions. The algorithm estimates the Hessian matrix of the global objective function and runs an approximate cubic-regularized Newton method. Applying our results regarding the Hessian estimator we have proved convergence to a second-order optimal point with high probability and optimal iteration complexity. We have validated the theory with numerical results, showing that our algorithm outperforms several state-of-the-art methods in multiple types of federated optimization problems.
NOTE
Open Access funding provided by ‘Università degli Studi di Padova’ within the CRUI CARE Agreement.
Appendix A: Proofs of Section II
Proof of Lemma 1
Proof:
Since
\begin{align*}
& \left\Vert S - (u^{T} S u)u u^{T} \right\Vert _{F} \\
& \quad = \left\Vert V \Sigma V^{T} - (\alpha ^{T} V^{T} V \Sigma V^{T} V \alpha) V \alpha \alpha ^{T} V^{T} \right\Vert _{F} \\
& \quad = \left\Vert \Sigma - (\alpha ^{T} \Sigma \alpha) \alpha \alpha ^{T} \right\Vert _{F},
\end{align*}
\begin{align*}
& \left\Vert \Sigma - (\alpha ^{T} \Sigma \alpha) \alpha \alpha ^{T} \right\Vert _{F}^{2} \\
& \quad = \text{Tr}[ (\Sigma - (\alpha ^{T} \Sigma \alpha)\alpha \alpha ^{T})^{T} (\Sigma - (\alpha ^{T} \Sigma \alpha)\alpha \alpha ^{T}) ] \\
& \quad = \text{Tr} (\Sigma ^{T} \Sigma) + (\alpha ^{T} \Sigma \alpha)^{2} \text{Tr}(\alpha \alpha ^{T} \alpha \alpha ^{T}) \\
& \qquad - 2(\alpha ^{T} \Sigma \alpha) \text{Tr} (\Sigma \alpha \alpha ^{T}) \\
& \quad = \left\Vert \Sigma \right\Vert _{F}^{2} + (\alpha ^{T} \Sigma \alpha)^{2} - 2(\alpha ^{T} \Sigma \alpha)^{2} \\
& \quad = \left\Vert \Sigma \right\Vert _{F}^{2} - \left(\sum _{i=1}^{d} \alpha _{i}^{2} \sigma _{i} \right)^{2} \\
& \quad \leq \left\Vert \Sigma \right\Vert _{F}^{2} - \sum _{i=1}^{d} \left(\alpha _{i}^{2} \sigma _{i} \right)^{2},
\end{align*}
\begin{equation*}
\left\Vert \Sigma - (\alpha ^{T} \Sigma \alpha) \alpha \alpha ^{T} \right\Vert _{F} \leq \sqrt{ \left\Vert \Sigma \right\Vert _{F}^{2} - \sum _{i=1}^{d} \left(\alpha _{i}^{2} \sigma _{i} \right)^{2} }.
\end{equation*}
\begin{align*}
\mathbb {E} \left[ \left\Vert \Sigma - (\alpha ^{T} \Sigma \alpha) \alpha \alpha ^{T} \right\Vert _{F} \right] & \leq \sqrt{ \left\Vert \Sigma \right\Vert _{F}^{2} - \sum _{i=1}^{d} \sigma _{i}^{2} \mathbb {E} \left[ \alpha _{i}^{4} \right] }
\end{align*}
Being
\begin{equation*}
\alpha _{i}^{2} \sim Beta \left(\frac{1}{2}, \frac{d-1}{2} \right) \quad i \in [d].
\end{equation*}
\begin{equation*}
\mathbb {E} [ \alpha _{i}^{2} ] = \frac{1}{d}, \quad \text{var}[ \alpha _{i}^{2} ] = \frac{ 2(d-1) }{ d^{2} (d+2) }.
\end{equation*}
\begin{equation*}
\mathbb {E} [ \alpha _{i}^{4} ] = \mathbb {E} [ \alpha _{i}^{2} ]^{2} + \text{var}[ \alpha _{i}^{2} ] = \frac{3}{ d(d+2) }.
\end{equation*}
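Spelling out the arithmetic of the last identity:
\begin{equation*}
\mathbb {E} [ \alpha _{i}^{4} ] = \frac{1}{d^{2}} + \frac{2(d-1)}{d^{2}(d+2)} = \frac{(d+2) + 2(d-1)}{d^{2}(d+2)} = \frac{3d}{d^{2}(d+2)} = \frac{3}{d(d+2)}.
\end{equation*}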
\begin{align*}
\mathbb {E} \left[ \left\Vert \Sigma - (\alpha ^{T} \Sigma \alpha) \alpha \alpha ^{T} \right\Vert _{F} \right] & \leq \sqrt{ \left\Vert \Sigma \right\Vert _{F}^{2} - \left(\sum _{i=1}^{d} \sigma _{i}^{2} \right) \mathbb {E} \left[ \alpha _{1}^{4} \right] } \\
&= \sqrt{ \left\Vert \Sigma \right\Vert _{F}^{2} - \left\Vert \Sigma \right\Vert _{F}^{2} \frac{3}{ d(d+2) } }
\end{align*}
Proof of Lemma 2
Proof:
For simplicity of notation we drop the subscript of the search direction
\begin{align*}
\mathbb {E} \Bigl [ \Big \Vert H^{j} & - \hat{H}^{j} \Big \Vert \Bigr ] = \mathbb {E} \biggl [ \bigg \Vert \biggl (u^{T} \nabla ^{2} f(x) u \\
& \quad - \frac{f(x + \mu u)-2f(x)+f(x - \mu u)}{\mu ^{2}} \biggl) u u^{T} \bigg \Vert \biggr ] \\
&= \mathbb {E} \biggl [ \bigg \Vert - \frac{\mu }{2} u u^{T} \int _{0}^{1} (1-t)^{2} \langle (D^{3} f) (x + t \mu u) \\
& \quad - (D^{3} f) (x - t \mu u), u, u, u \rangle \, dt \bigg \Vert \biggr ] \\
& \leq \mathbb {E} \left[ \left\Vert - \frac{\mu }{2} u u^{T} \int _{0}^{1} (1-t)^{2} L_{3} \left\Vert u\right\Vert ^{3} \left\Vert 2 t \mu u\right\Vert dt \right\Vert \right] \\
& = \mathbb {E} \left[ \left\Vert - \frac{\mu ^{2}}{12} L_{3}\,u u^{T} \right\Vert \right] \\
& = \frac{\mu ^{2} L_{3}}{\text{12}\,d} \mathbb {E} \left[ \left\Vert uu^{T} \right\Vert \right].
\end{align*}
To obtain the convergence rate of (3) towards the exact Hessian we use the triangular inequality and Lemma 1 choosing
\begin{align*}
\mathbb {E} & \left[ \left\Vert \nabla ^{2} f(x) - \hat{H}^{j}\right\Vert _{F} \right] \leq \\
& \leq \mathbb {E} \left[ \left\Vert \nabla ^{2} f(x) - H^{j}\right\Vert _{F} + \left\Vert H^{j} - \hat{H}^{j}\right\Vert _{F} \right] \\
& \leq \eta \left\Vert \nabla ^{2} f(x) - H^{j-1}\right\Vert _{F} + \frac{\mu ^{2} L_{3}}{\text{12}\,d}.
\end{align*}
Proof of Lemma 3
Proof:
We consider the whole trajectory of the estimator, applying Lemma 2 multiple times together with the tower property of conditional expectation.
\begin{align*}
& \mathbb {E} \left[ \left. \left\Vert \nabla ^{2} f(x) - \hat{H}^{r} \right\Vert _{F} \right| \hat{H}^{0} \right] \\
&= \mathbb {E}_{\hat{H}^{r-1}} \left[ \mathbb {E}_{\hat{H}^{r}} \left[ \left. \left\Vert \nabla ^{2} f(x) - \hat{H}^{r} \right\Vert _{F} \right| \hat{H}^{0}, \hat{H}^{r-1} \right] \right] \\
& \leq \mathbb {E}_{\hat{H}^{r-1}} \left[ \left. \eta \left\Vert \nabla ^{2} f(x) - \hat{H}^{r-1} \right\Vert _{F} + \frac{\mu ^{2} L_{3}}{\text{12}\,d} \right| \hat{H}^{0} \right] \\
&= \frac{\mu ^{2} L_{3}}{\text{12}\,d} \!+\! \mathbb {E}_{\hat{H}^{r-2}} \left[ \mathbb {E}_{\hat{H}^{r-1}} \left[ \left. \eta \left\Vert \nabla ^{2} f(x) \!-\! \hat{H}^{r-1} \right\Vert _{F} \right| \hat{H}^{0}, \hat{H}^{r-2} \right] \right] \\
& \leq \frac{\mu ^{2} L_{3}}{\text{12}\,d} \sum _{i=0}^{1} \eta ^{i} + \mathbb {E}_{\hat{H}^{r-2}} \left[ \left. \eta ^{2} \left\Vert \nabla ^{2} f(x) - \hat{H}^{r-2} \right\Vert _{F} \right| \hat{H}^{0} \right] \\
& \leq \frac{\mu ^{2} L_{3}}{\text{12}\,d} \sum _{i=0}^{r-2} \eta ^{i} + \mathbb {E}_{\hat{H}^{1}} \left[ \left. \eta ^{r-1} \left\Vert \nabla ^{2} f(x) - \hat{H}^{1} \right\Vert _{F} \right| \hat{H}^{0} \right] \\
& \leq \frac{\mu ^{2} L_{3}}{\text{12}\,d} \sum _{i=0}^{r-1} \eta ^{i} + \eta ^{r} \left\Vert \nabla ^{2} f(x) - \hat{H}^{0} \right\Vert _{F}
\end{align*}
Proof of Theorem 1
Proof:
(Memory-less case): If
\begin{equation*}
\left\Vert \nabla ^{2} f(x_{k}) - \hat{H}_{k}^{0} \right\Vert _{F} \leq \sqrt{d} \left\Vert \nabla ^{2} f(x_{k}) \right\Vert \leq \sqrt{d} L_{1}.
\end{equation*}
\begin{align*}
\mathbb {P} & \left(\left. \left\Vert \nabla ^{2} f(x_{k}) - \hat{H}_{k}^{r(k)} \right\Vert _{F} \geq \epsilon \right| \hat{H}_{k}^{0} = 0 \right) \\
& \leq \frac{1}{\epsilon } \mathbb {E} \left[ \left. \left\Vert \nabla ^{2} f(x_{k}) - \hat{H}_{k}^{r(k)} \right\Vert _{F} \right| \hat{H}_{k}^{0} = 0 \right] \\
& \leq \frac{1}{\epsilon } \left(\eta ^{r(k)} \sqrt{d} L_{1} + \frac{\mu ^{2} L_{3}}{\text{12}\,d} \sum _{i=0}^{{r(k)}-1} \eta ^{i} \right) \leq \delta
\end{align*}
\begin{gather*}
\eta ^{r(k)} \sqrt{d} L_{1} + \frac{\mu ^{2} L_{3}}{\text{12}\,d} \frac{1 - \eta ^{r(k)}}{1 - \eta } \leq \epsilon \delta, \\
\eta ^{r(k)} \sqrt{d} L_{1} (1-\eta) + \frac{\mu ^{2} L_{3}}{\text{12}\,d} (1 - \eta ^{r(k)}) \leq \epsilon \delta (1-\eta),\\
\eta ^{r(k)} \left(\sqrt{d} L_{1} (1-\eta) - \frac{\mu ^{2} L_{3}}{\text{12}\,d} \right) \leq \epsilon \delta (1-\eta) - \frac{\mu ^{2} L_{3}}{\text{12}\,d}.
\end{gather*}
\begin{gather*}
\mu \leq \sqrt{ \frac{\text{12}\,d (1- \eta)}{L_{3}} \min \left\lbrace \epsilon \delta, \; \sqrt{d} L_{1} \right\rbrace },\\
r(k) \geq \log _{\eta } \left(\frac{ \epsilon \delta (1-\eta) - \frac{\mu ^{2} L_{3}}{\text{12}\,d}}{\sqrt{d} L_{1} (1-\eta) - \frac{\mu ^{2} L_{3}}{\text{12}\,d}} \right).
\end{gather*}
\begin{align*}
\eta &= \sqrt{1 - \frac{3}{d(d+2)}}, & \\
\sqrt{1 - x} &= 1 - \frac{x}{2} + o(x^{2}) & \text{for } |x| < 1, \\
\log (1-x) &= -x - \frac{x^{2}}{2} + o(x^{3}) & \text{for } |x| < 1.
\end{align*}
\begin{align*}
\mu &\leq \sqrt{ \frac{\text{12}\,d (1- \eta)}{L_{3}} \min \left\lbrace \epsilon \delta, \; \sqrt{d} L_{1} \right\rbrace } \\
& \propto \sqrt{ d \left(1 - \sqrt{1 - \frac{3}{d(d+2)}} \right) \epsilon \delta } \\
&= \sqrt{ d \left(1 - \left(1 - \frac{3}{2d(d+2)} + o \left(\frac{1}{d^{4}} \right) \right) \right) \epsilon \delta } \\
& \propto \sqrt{d \frac{3}{\text{2}\,d^{2}} \epsilon \delta } = O \left(\frac{\epsilon \delta }{ \sqrt{d} } \right).
\end{align*}
\begin{align*}
& r(k) \geq \log _{\eta } \left(\frac{ \epsilon \delta (1-\eta) - \frac{\mu ^{2} L_{3}}{\text{12}\,d}}{\sqrt{d} L_{1} (1-\eta) - \frac{\mu ^{2} L_{3}}{\text{12}\,d}} \right) \propto \log _{\eta } \left(\frac{ \epsilon \delta }{\sqrt{d} L_{1}} \right) \\
& \propto \log \left(\frac{\epsilon \delta }{\sqrt{d}} \right) / \log \left(1 - \frac{3}{2d(d+2)} \right) \\
& = \log \left(\frac{\epsilon \delta }{\sqrt{d}} \right) {/} \left(- \frac{1}{d^{2}} + o \left(\frac{1}{d^{4}} \right) \right) = O \left(d^{2} \log \left(\frac{\sqrt{d}}{ \epsilon \delta } \right) \right).
\end{align*}
(Warm-start case): In this case we use the assumption
\begin{align*}
\left\Vert \nabla ^{2} f(x) - \nabla ^{2} f(y) \right\Vert _{F} & \leq \sqrt{d} \left\Vert \nabla ^{2} f(x) - \nabla ^{2} f(y) \right\Vert \\
& \leq \sqrt{d} L_{2} \left\Vert x-y \right\Vert \quad \forall x,y \in \mathbb {R}^{d}.
\end{align*}
\begin{align*}
& \mathbb {E} \left[ \left. \left\Vert \nabla ^{2} f(x_{k}) - \hat{H}_{k}^{r(k)} \right\Vert _{F} \right| \hat{H}_{1}^{0} \right] = \mathbb {E}_{\hat{H}_{k}^{0}} \left[ \mathbb {E}_{\hat{H}_{k}^{r(k)}} \left[ \left. \left\Vert \nabla ^{2} f(x_{k}) - \hat{H}_{k}^{r(k)} \right\Vert _{F} \right| \hat{H}_{1}^{0}, \hat{H}_{k}^{0} \right] \right] \\
& \leq \mathbb {E}_{\hat{H}_{k}^{0}} \left[ \left. \eta ^{r(k)} \left\Vert \nabla ^{2} f(x_{k}) - \hat{H}_{k}^{0} \right\Vert _{F} + \frac{\mu ^{2} L_{3}}{\text{12}\,d} \sum _{i=0}^{{r(k)}-1} \eta ^{i} \right| \hat{H}_{1}^{0} \right] \\
&= \frac{\mu ^{2} L_{3}}{\text{12}\,d} \sum _{i=0}^{{r(k)}-1} \eta ^{i} + \eta ^{r(k)} \mathbb {E}_{\hat{H}_{k-1}^{r(k-1)}} \left[ \left. \left\Vert \nabla ^{2} f(x_{k}) - \hat{H}_{k-1}^{r(k-1)} \right\Vert _{F} \right| \hat{H}_{1}^{0} \right] \\
& \leq \frac{\mu ^{2} L_{3}}{\text{12}\,d} \sum _{i=0}^{{r(k)}-1} \eta ^{i} + \eta ^{r(k)} \mathbb {E}_{\hat{H}_{k-1}^{r(k-1)}} \left[ \left. \left\Vert \nabla ^{2} f(x_{k}) - \nabla ^{2} f(x_{k-1}) \right\Vert _{F} + \left\Vert \nabla ^{2} f(x_{k-1}) - \hat{H}_{k-1}^{r(k-1)} \right\Vert _{F} \right| \hat{H}_{1}^{0} \right] \\
& \leq \frac{\mu ^{2} L_{3}}{\text{12}\,d} \sum _{i=0}^{{r(k)}-1} \eta ^{i} + \eta ^{r(k)} \sqrt{d} L_{2} \left\Vert x_{k} - x_{k-1}\right\Vert + \eta ^{r(k)} \mathbb {E}_{\hat{H}_{k-1}^{0}} \left[ \mathbb {E}_{\hat{H}_{k-1}^{r(k-1)}} \left[ \left. \left\Vert \nabla ^{2} f(x_{k-1}) - \hat{H}_{k-1}^{r(k-1)} \right\Vert _{F} \right| \hat{H}_{1}^{0}, \hat{H}_{k-1}^{0} \right] \right] \\
& \leq \frac{\mu ^{2} L_{3}}{\text{12}\,d} \left(\sum _{i=0}^{{r(k)}-1} \eta ^{i} + \eta ^{r(k)} \sum _{i=0}^{{r(k-1)}-1} \eta ^{i} \right) + \eta ^{r(k)} \sqrt{d} L_{2} \left\Vert x_{k} - x_{k-1}\right\Vert + \eta ^{r(k) + r(k-1)} \mathbb {E}_{\hat{H}_{k-1}^{0}} \left[ \left. \left\Vert \nabla ^{2} f(x_{k-1}) - \hat{H}_{k-1}^{0} \right\Vert _{F} \right| \hat{H}_{1}^{0} \right] \\
& \leq \frac{\mu ^{2} L_{3}}{\text{12}\,d} \sum _{j=k-1}^{k} \left(\eta ^{ \sum _{z=j+1}^{k} r(z) } \sum _{i=0}^{{r(j)}-1} \eta ^{i} \right) + \sqrt{d} L_{2} \sum _{j=k-1}^{k} \left(\eta ^{\sum _{z=j}^{k} r(z)} \left\Vert x_{j} - x_{j-1}\right\Vert \right) \\
& \quad + \eta ^{\sum _{j=k-1}^{k} r(j)} \mathbb {E}_{\hat{H}_{k-2}^{r(k-2)}} \left[ \left. \left\Vert \nabla ^{2} f(x_{k-2}) - \hat{H}_{k-2}^{r(k-2)} \right\Vert _{F} \right| \hat{H}_{1}^{0} \right] \\
& \leq \frac{\mu ^{2} L_{3}}{\text{12}\,d} \sum _{j=1}^{k} \left(\eta ^{ \sum _{z=j+1}^{k} r(z) } \frac{1- \eta ^{r(j)}}{1-\eta } \right) + \sqrt{d} L_{2} \sum _{j=2}^{k} \left(\eta ^{\sum _{z=j}^{k} r(z)} \left\Vert x_{j} - x_{j-1}\right\Vert \right) + \eta ^{\sum _{j=1}^{k} r(j)} \left\Vert \nabla ^{2} f(x_{1}) - \hat{H}_{1}^{0} \right\Vert _{F} \\
& \leq \eta ^{r(k)} \Bigg \lbrace \frac{\mu ^{2} L_{3}}{\text{12}\,d} \left[ \sum _{j=1}^{k-1} \left(\eta ^{ \sum _{z=j+1}^{k-1} r(z) } \frac{1- \eta ^{r(j)}}{1-\eta } \right) - \frac{1}{1- \eta } \right] + \sqrt{d} L_{2} \sum _{j=2}^{k} \left(\eta ^{\sum _{z=j}^{k-1} r(z)} \left\Vert x_{j} - x_{j-1}\right\Vert \right) \\
& \quad + \eta ^{\sum _{j=1}^{k-1} r(j)} \left\Vert \nabla ^{2} f(x_{1}) - \hat{H}_{1}^{0} \right\Vert _{F} \Bigg \rbrace + \frac{1}{1- \eta } \frac{\mu ^{2} L_{3}}{\text{12}\,d}. \tag{15}
\end{align*}
\begin{align*}
\mathbb {P} & \left(\left. \left\Vert \nabla ^{2} f(x_{k}) - \hat{H}_{k}^{r(k)} \right\Vert _{F} \geq \epsilon \right| \hat{H}_{1}^{0} \right) \\
& \leq \frac{1}{\epsilon } \mathbb {E} \left[ \left. \left\Vert \nabla ^{2} f(x_{k}) - \hat{H}_{k}^{r(k)} \right\Vert _{F} \right| \hat{H}_{1}^{0} \right] \\
& \leq \frac{1}{\epsilon } \left(\eta ^{r(k)} c_{k} + \frac{1}{1- \eta } \frac{\mu ^{2} L_{3}}{\text{12}\,d} \right) \leq \delta.
\end{align*}
Appendix B: Methods to Sample Orthonormal Vectors From \mathcal {U} (\mathbb {S})
A first procedure to generate orthonormal vectors
A second way to obtain orthogonal vectors
\begin{equation*}
\Lambda = \begin{bmatrix}\frac{R_{11}}{|R_{11}|} & & \\
& \ddots & \\
& & \frac{R_{dd}}{|R_{dd}|} \end{bmatrix},
\end{equation*}
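A possible implementation of this second procedure is sketched below, assuming the sign correction is applied as Q \leftarrow Q \Lambda; the dimension is arbitrary.

```python
import numpy as np

def haar_orthonormal_basis(d, rng):
    # QR factorization of a Gaussian matrix, with the columns of Q rescaled by
    # Lambda = diag(R_ii / |R_ii|) so that the resulting orthogonal matrix is
    # not biased by the sign convention of the QR routine
    Q, R = np.linalg.qr(rng.standard_normal((d, d)))
    return Q * np.sign(np.diag(R))

rng = np.random.default_rng(0)
U = haar_orthonormal_basis(5, rng)
print(np.allclose(U.T @ U, np.eye(5)))  # columns are orthonormal
```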
Appendix C: Improved Efficiency Using Subsampling
To keep a reasonable computational complexity in case of huge datasets, it is possible to incorporate subsampling in the procedure described in Section III-B. At each iteration the function value is evaluated using only a random subset of the data samples. We denote the function computed on the data subset
Assumption 3 (Smoothness):
For each data subset
We now provide lower bounds on the minimum number of data samples required by the zeroth-order derivative estimators to attain a target accuracy with high probability. In the following the subscripts do not refer to the iteration number, but rather to the data samples. For example,
Proposition 1:
Suppose that at each iteration the Hessian estimator of the function
\begin{equation*}
\hat{H}_{S}^{r} = \frac{1}{|S|} \sum _{(i,j) \in S} \frac{m}{n m_{i}} \hat{H}_{ij}^{r}
\end{equation*}
\begin{align*}
\left\Vert \nabla ^{2} f(x) - \hat{H}_{S}^{r} \right\Vert & \leq \left\Vert \nabla ^{2} f(x) -\nabla ^{2} f_{S}(x) \right\Vert \\
& \quad + \left\Vert \nabla ^{2} f_{S}(x) - \hat{H}_{S}^{r} \right\Vert \\
& \leq 4 L_{1} \sqrt{ \frac{ \log \left(\text{2}\,d / \delta _{S} \right) }{|S|} } + \epsilon _{\mu, r},
\end{align*}
Proof:
The scaling coefficient in the definition of
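As a rough illustration, the subsampling term of the bound in Proposition 1 can be inverted to size the subsample; the helper below and its example constants are hypothetical.

```python
import numpy as np

def min_subsample_size(L1, d, delta_S, eps_sub):
    # Smallest |S| such that 4 * L1 * sqrt(log(2d/delta_S) / |S|) <= eps_sub
    return int(np.ceil(16.0 * L1**2 * np.log(2.0 * d / delta_S) / eps_sub**2))

print(min_subsample_size(L1=1.0, d=100, delta_S=0.05, eps_sub=0.5))
```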
Proposition 2:
The approximation error of the global gradient estimator at the server
\begin{equation*}
\hat{g}_{S} = \frac{1}{|S|} \sum _{(i,j) \in S} \frac{m}{n m_{i}} \hat{g}_{ij}
\end{equation*}
\begin{align*}
\left\Vert \nabla f(x) - \hat{g}_{S} \right\Vert & \leq \left\Vert \nabla f(x) -\nabla f_{S}(x) \right\Vert + \left\Vert \nabla f_{S}(x) - \hat{g}_{S} \right\Vert \\
& \leq 4 \sqrt{2} L_{0} \sqrt{ \frac{ \log \left(\text{2}\,d / \delta _{S} \right) + 1/4}{|S|} } + \frac{d L_{2} \mu ^{2}}{6}.
\end{align*}
Proof:
The proof is similar to the previous one, but in this case we use Lemma 6 in [26] and the bound (6).
Remark 9:
The approximation error due to subsampling is governed by the cardinality of