Received October 28, 2020; accepted January 27, 2021; date of publication February 9, 2021; date of current version March 29, 2021.

Digital Object Identifier 10.1109/TQE.2021.3057908

# **Distributed Quantum Computing and Network Control for Accelerated VQE**

# STEPHEN DIADAMO<sup>1,2</sup>, MARCO GHIBAUDI<sup>1</sup>, AND JAMES CRUISE<sup>1</sup>

<sup>1</sup> Riverlane, St Andrews House, Cambridge CB2 3BZ, U.K.

<sup>2</sup> Department of Electrical Engineering, Technische Universität München, 80333 Münich, Germany

Corresponding author: Stephen DiAdamo (e-mail: stephen.diadamo@gmail.com).

This work was supported by the German Research Foundation (DFG) under Grant 1129/2-1.

**ABSTRACT** Interconnecting small quantum computers will be essential in the future for creating largescale, robust quantum computers. Methods for distributing monolithic quantum algorithms efficiently are, thus, needed. In this article, we consider an approach for distributing the accelerated variational quantum eigensolver algorithm over arbitrary sized—in terms of number of qubits—distributed quantum computers. We consider approaches for distributing qubit assignments of the Ansatz states required to estimate the expectation value of Hamiltonian operators in quantum chemistry in a parallelized computation and provide a systematic approach to generate distributed quantum circuits for distributed quantum computing. Moreover, we propose an architecture for a distributed quantum control system in the context of centralized and decentralized network control.

**INDEX TERMS** Distributed quantum computing, network control, quantum multicomputing, variational quantum eigensolver.

#### I. INTRODUCTION

To execute large-scale quantum algorithms on a quantum computer will require many physical qubits. One path to creating quantum computers with many qubits is to construct a network of smaller-scale quantum computers and perform distributed computing amongst them. This scheme is known as distributed quantum computing or quantum multicomputing. Envisioned in IBM's roadmap for scaling quantum devices, they discuss quantum interconnects, networking dilution refrigerators each holding a million qubits creating a massively parallel quantum computer capable of changing the world [1]. In any system that is converted from monolithic to distributed, a layer of communication complexity is added in order to perform distributed operations across devices. Because, here, the goal is to run quantum algorithms on this distributed system, the aim is to allow the control system to handle the communications needed. This depends on the architecture of the distributed system.

The number of qubits in quantum computers is steadily increasing, expecting to grow with a Moore's-like exponential growth every 18 months [2], [3], but until there are enough qubits, variational quantum algorithms have been proposed [4]. Variational quantum algorithms are hybrid classical-quantum algorithms that use shallow-depth circuits and classical optimization to perform a specific task. One such example is the variational quantum eigensolver (VQE). VQE can be used to estimate ground state energies of molecular chemical Hamiltonians and, therefore, has a strong use case for developing small-scale and possibly noisy quantum devices. Promising results have already appeared such as the experiment recently conducted using the Google quantum computer [5].

A modified version of VQE called accelerated VQE or  $\alpha$ -VQE is yet another adaptation that allows for using nearterm quantum hardware to its maximum ability. The  $\alpha$  in  $\alpha$ -VQE represents the tradeoff parameter for run time, which could be long for some variational algorithms, for circuit depth [6], reducing run time at the cost of having more stable qubits.  $\alpha$ -VQE uses a more efficient method for estimating the expectation value with respect to an Ansatz of a Hamiltonian than standard VQE, which is a quantum algorithm called accelerated quantum phase estimation or  $\alpha$ -QPE.  $\alpha$ -VQE replaces the expectation value estimation stage of standard VQE with  $\alpha$ -QPE.

In this article, we take these two concepts and combine them to construct a method for running  $\alpha$ -VQE on a distributed system of quantum computers. We begin with a technical overview of the state of the art in this direction. To estimate expectation values, an Ansatz state has to initialized. In Section III-A1, we consider various approaches for distributing the Ansatz states over an arbitrary distributed quantum computer. This includes methods for distributing the circuits needed to perform the initialization. Moreover, algorithms for distributing the circuits to perform  $\alpha$ -VQE are provided. In Section IV, we describe two different architectures for performing distributed quantum computing and propose network control systems based on Deltaflow.OS, which we explain more deeply what Deltaflow.OS is in Section IV.

# A. SUMMARY OF CONTRIBUTIONS

We design a method for performing accelerated VQE on a distributed system of quantum computers. We consider a system that, given the input of the number of qubits in a distributed collection of QPUs and circuitry needed to run VQE, we can map the system to a distributed system such that the Hamiltonian expectation estimation phase is parallelized and distributed. We provide various approaches of mapping a monolithic system to an arbitrary distributed quantum system. Moreover, we analyze a control system architecture that can be used to execute the quantum gate instructions, and combine the results such that  $\alpha$ -VQE can be executed. This process is not strictly contained to  $\alpha$ -VQE and many of the ideas can be adapted for VQE in its standard form (or when  $\alpha = 0$ ). The strategy for distributing qubits and scheduling can also be adapted for other quantum algorithms.

# **B. RELATED WORK**

The basis of this article is in distributing quantum algorithms specifically the accelerated variational quantum eigensolver [6] and controlling networked quantum hardware. Using the method of decomposing quantum algorithms that we use here has been explored in [7] and [8], where a full example of decomposing Shor's algorithm was proposed in [9]. Control systems for quantum computing have been proposed in [10] and [11], but these do not discuss the control between networked quantum computers.

#### **II. TECHNICAL PREREQUISITES**

In this article, we make use of theory from distributed quantum computing, software-based control systems, distributed operating systems, and dataflow programming schemes. In this section, we give a brief overview of each of these topics as a primer.

# A. DISTRIBUTED QUANTUM COMPUTING

Distributed quantum computing is the act of processing quantum information on two or more distinct quantum computers to solve a single problem and combining the results to produce one output [12], [13]. According to the Dowling– Neven law [2], [3], the number of usable qubits in a single, monolithic quantum computer, is growing steadily in a Moore's like fashion. To generate larger quantum systems, an orthogonal approach relies on connecting multiple quantum computers and use classical communication and entanglement to perform distributed quantum computing. Classical communication and entanglement allow application of multiqubit gates across physically separated quantum computers referred to as the LOCC-ENTANGLE model in [14]. There are various ways to perform multiqubit gates across quantum computers. In [15], teleportation is at the base of the overall process. In particular, two forms of teleportation qubit teleportation and gate teleportation—between quantum devices are analyzed. It is shown that teleporting qubits performs better than teleporting gates. Teleportation requires one entangled pair and two bits of classical communication. If the qubits are to be teleported back to their original location, this operation would need to be performed twice. The approach we use in this article, instead, uses the results from [7]. As it will be described in more depth in Section III-A, Yimsiriwattana *et al.* do not use teleportation at all but instead rely on one entangled pair and two bits of classical communication to perform a distributed control gate.

An analysis of how to perform quantum algorithms over a networked distributed quantum computer has been presented in [16]. A network model is proposed such that a distributed quantum system can simulate circuits for monolithic quantum computers with a communication overhead of  $O(\log^2 N)$ , where N is the number of qubits in the full system. In [15], it is discussed how a linear network topology will perform adequately for the foreseeable future, but I/O bandwidth will be a more challenging problem to overcome.

Another form of distributed quantum computing is cloudbased quantum computing, with companies such as Amazon, Microsoft, IBM, and others each releasing their own versions of a cloud quantum computing service [17]. In this type of distributed quantum computing algorithm input from a client is sent to a server, the server executes the algorithm instructions and then sends the results back to the client. In this case, protocols such as universal blind quantum computation [18] can be performed. In Section IV, we consider among our models a cloud based model with cooperating vendors.

Overall, the development of distributing quantum computers will be a promising path to increasing the size of quantum computers. Many network technologies for high speed, low-latency communication have been developed in other contexts and as we will see, can potentially be applied to distributed quantum computing.

# B. SOFTWARE CONTROL SYSTEMS OF QUANTUM HARDWARE

To perform the gate operations on the qubits in a quantum computer requires a system that can translate gate instructions to physical interactions with the qubits. Currently, quantum algorithms are generally written in terms of circuits of quantum gates. The circuits are designed with the assumption of noiseless qubits. The software takes these circuits as input, optimizes them, and converts them to a data format such that the control system controlling the qubits can execute the instructions.

A first step into defining how such a system functions is to draft a model of the full stack of the quantum computer. Such an architecture has been proposed in [11], defining the software and hardware stack for a quantum computer. Here, a protocol is defined to convert the classical and quantum instructions to binary strings such that they can be executed at the machine level. This stack incorporates error correction into the model and injects the additional instructions between gate operations when needed. To execute the instructions, hardware is in place that quickly reads the binary strings and then runs the optical control on the qubits. Such hardware is proposed in [19]. Here, cryogenic field-programmable gate arrays (FPGAs) are incorporated into the control system architecture to manipulate semiconductor-based qubits. In [20], it is explained how moving the parts of the classical control of quantum system to lower latency hardware like FPGAs can greatly benefit near-term quantum computing.

An important part of this system stack is to be able to control the amount of messages being passed to the hardware. The number of messages can grow very quickly when error correction is considered, and the bandwidth of the system can be used up completely with just instructions for error correction. In [10], quantum error-correction substrate, an architecture that delegates the task of quantum error correction to the hardware, for overcoming this is proposed.

Software and hardware will have to work closely together in an highly optimized way in order to reduce the amount of instructions while performing them with as low a latency as possible. As quantum hardware technology improves and as more research towards quantum software deepens, this area of quantum computing will become central to executing quantum algorithms on large-scale quantum hardware.

#### C. DISTRIBUTED OPERATING SYSTEMS

In general, an operating system (OS) is a system that manages the resources on a computer, such as the random access memory or the CPU, such that multiple programs can run simultaneously without undesirably interfering with each other. An OS also provides an interface between the user and the hardware. We will refer to the class of OSs that run on a single computer as a centralized OS. A distributed OS is an OS that runs on a cluster or group of computers, which are physically separated and connected via a network [21]. To the user of a distributed OS, it should appear as if their programs are running on a centralized OS. More specifically, a distributed OS should behave as an ordinary centralized OS with the caveat that the programs could be running at any physical location, which is not known to the user.

A distributed OS can be deployed in multiple ways [22, Ch. 8]. One way is to deploy the OS such that there is a distinction between the types of nodes in the distributed system, "nodes" meaning the computers in the cluster. The distinction is generally that there is one computer, which controls the rest of the system and the controlled nodes follow all the commands of this "controller" unit. An alternative configuration is that the network connecting the computers in the cluster are connected via an Internet and a set of Internet protocols are used to request resources from the nodes in the cluster and perform interprocess communication. One can think of this as a client-server relationship [23]. In this case, we call the operating system a network OS. The main



FIGURE 1. Example of a dataflow program.

difference between a distributed OS and a network OS is in a network OS, the user is aware that multiple systems are being use, albeit programs appear to be running on a single system.

One can deploy their systems as a distributed OS or a network OS or as a hybrid of the two. When deploying a distributed operating system one needs to find a good balance of some key properties to ensure that the operating system is robust, efficient, and can be scaled up. For example, one can potentially make the system very robust by adding abundant internode communication, but this could make the system inefficient or can overload the processors with messages to process.

In this article, we are focused on a specific system with a specific use case. These are systems that have classical control but have hardware that establish quantum entanglement and classical communication. We explore how one can design a distributed OS, where the distributed part we focus on is a cluster of quantum computers. We take into account the two models, the client-server model over a entanglement network and the single controller model. We use a specific control system, namely Deltaflow.OS, which we explain in Section IV, to explore these two models in depth.

### D. DATAFLOW PROGRAMMING

Dataflow programming is a method of programming that uses a network flow, or a directed graph, approach for developing algorithms [24]. The nodes in the network hold the logic of the program—or are constant valued—and the flow in the network represents the inputs to the next nodes in the flow, which is then processed and output to the next node in the network until the program is complete. Generally, in dataflow programming the nodes run in parallel and asynchronously. In real implementations, the nodes run idly, waking when they receive input to process. When the program starts, some nodes are selected to initialize without input, triggering the start of the program. Commonly used hardware programming languages like Verilog and VHDL use dataflow programming as a paradigm.

Fig. 1 is an example of a simple dataflow program. The constants three and five are inputs to the + node, which takes two inputs and outputs their sum. The output of the addition is passed to the  $\times$  node, which takes two inputs and output the product. In this case, 2 and 3 + 5 are inputs to  $\times$  and the complete output is 16.

In this article, we use Deltaflow to add control to the network hardware. Deltaflow is built on the dataflow programming paradigm. We will discuss in Section IV, how Deltaflow can be used to define the logic in the control blocks for an overall network control.

#### **III. DISTRIBUTING ACCELERATED VQE**

A problem to overcome when dealing with near-term quantum computing devices is that the ability to run deep circuits is greatly reduced due to low coherence time of qubit systems without error correction. A classical-quantum hybrid class of algorithms called "variational quantum algorithms" allow to run reduced depth circuits performing some of the algorithm on near-term quantum hardware and some on classical hardware. In particular, the VQE algorithm is a variational hybrid-quantum algorithm that can be used to find the minimum eigenvalue of a chemical Hamiltonian. It uses a quantum portion of the hardware to estimate the eigenvalues for a particular Ansatz of Pauli operations combining to form the Hamiltonian. VQE uses the quantum system to determine an expectation value and these expectation values are then combined to find an expectation value of the full Hamiltonian [4].

classical optimization techniques, Using various Ansätze-plural of Ansatz-are prepared with the goal of finding an estimate to the eigenstate with the lowest eigenvalue. The drawback of VQE is that the number of times the Ansatz state and expectation value needs to be prepared is proportional to  $1/\epsilon^2$ , where  $\epsilon$  is the desired precision, which could lead to long run-times [4]. Another way to estimate eigenvalues of unitary operations is using the quantum phase estimation (QPE) algorithm explained more in depth in Section III-B1. The advantage to using QPE is that the number of times the experiment is conducted to find the estimate is proportional to a constant. The downside is of course that the circuit depth grows proportionally to  $1/\epsilon$ .

As quantum hardware technologies improve, it will allow for longer coherence times of qubits and in turn allows for deeper quantum circuits. To make use of this ability, and to "squeeze" as much power out of the quantum hardware that is available, Wang *et al.* proposed the accelerated VQE ( $\alpha$ -VQE) algorithm [6]. We again attempt to squeeze more power out of our quantum hardware by considering how one could implement  $\alpha$ -VQE for a distributed quantum computer.

When using VQE for quantum chemistry applications, it is common to prepare parameterized circuits that generate entangled Ansatz states. A commonly used Ansatz is the unitary coupled cluster Ansatz [25], which grows in number of qubits required to prepare the Ansatz as Hamiltonian complexity increases. A critical part of using a distributed quantum computer for quantum chemistry is, therefore, preparing Ansatz states over an array of quantum computers. When distributing any quantum circuit across devices, the main complication that arises is when a controlled two-qubit gate needs to be applied across two QPUs. There are two approaches we consider here. We assume that we would like



**FIGURE 2.** Circuit diagram for a nonlocal CNOT gate between  $|\psi_1\rangle$  and  $|\psi_2\rangle$ , where (a) is the cat-entangler sequence and (b) the cat-disentangler sequence.

to use only entanglement and classical communication to achieve this—alternative to this, we could consider physically moving qubits between QPUs but this is a much noisier task and we ignore this option. We consider two approaches: Teleporting one of the two qubits to the other QPU so that they are on the same QPU and then perform the two-qubit gate on one QPU locally, the second approach is to use the mechanism introduced in [7], where Yimsiriwattana *et al.* introduce "cat-entangle" and "cat-disentangle" protocols seen in Fig. 2.

# A. SCHEDULING HAMILTONIANS

An electronic molecular Hamiltonian *H* can be written as a sum of a polynomial number (with respect to the system size) of Pauli matrices in the form of (1), where each  $P_i \in$  $\{I, \sigma_x, \sigma_y, \sigma_z\}^{\otimes n}$  is a tensor product of qubit *n* Pauli operators (or the identity), called a Pauli string, and each  $a_i \in \mathbb{R}$ 

$$H = \sum_{i} a_i P_i. \tag{1}$$

In order to more effectively use a networked quantum computer, we wish to use a parallelized and distributed approach to expectation value estimation. We motivate the approach as follows. Given the linear nature of estimating  $\langle \psi | H | \psi \rangle$ , we can break up the summation into its pieces. We need to prepare an *n* qubit Ansatz for each piece of the sum in order to estimate each  $\langle \psi | P_i | \psi \rangle$  independently to later rejoin the expectation values to estimate  $\langle \psi | H | \psi \rangle$ . Given the distributed QPU architecture, we need to allocate the qubits in such a way that Ansatz states can be prepared for each  $P_i$  in the sum. Later, the coefficients  $a_i$  can be merged to produce a single value for  $\langle \psi | H | \psi \rangle$ .

For Hamiltonians that require a large number of qubits, in this section, we consider methods that distribute the expectation calculation of the Pauli strings between a given distributed quantum computer. Here, we model a collection of quantum processors { $QPU_1,...,QPU_m$ } as a collection of  $q_i \in \mathbb{N}$  qubits (respectively), all of which are located in the same device. Given a set of QPUs and a Hamiltonian in the form of a summation of Pauli strings, a distributed layout of the qubits with the required allocation of communication qubits is produced.

We enforce the following restrictions. Because the goal is to run  $\alpha$ -VQE, we know ahead of time that one additional qubit (additional to the qubits in the Ansatz) is reserved for each Ansatz to perform  $\alpha$ -QPE. Ontop of this, we need to reserve qubits for entanglement between QPUs, which is necessary when an Ansatz is split between QPUs. The worst case for this occurs when there is a three-qubit control gate (equivalent to a Toffoli gate) where the chain qubits are allocated on different QPU while performing  $\alpha$ -QPE. In this case, since we are using the method of cat-entangling and disentangling, we need to reserve two qubits from each QPU for entanglement. We formalize this as a problem as follows.

Problem 1 (Ansatz Distribution Problem): Given a Hamiltonian  $H = \sum_{i=1}^{n} a_i P_i$  where each Pauli string  $P_i \in$  $\{I, \sigma_x, \sigma_y, \sigma_z\}^{\otimes n_i}$  and a collection of *m* QPUs described by the number of qubits on the system  $[q_1, q_2, ..., q_m]$ , output a series of rounds that can be used to estimate, for a given Ansatz  $|\psi\rangle$ , the expectation  $\langle \psi | H | \psi \rangle$ . In order to prepare an Ansatz, when  $P_i$  is split between two QPUs, two qubits from each QPU have to be allocated in order to perform nonlocal operations for preparing the Ansatz  $|\psi\rangle$  across two or more QPUs. Moreover, one qubit needs to be reserved for  $\alpha$ -QPE. The solution to this problem outputs a schedule of distributions in which one can run over the distributed system to obtain an estimate to  $\langle \psi | H | \psi \rangle$ .

For the task of distributing the qubits, we take various approaches to this problem. In its essence, this problem is a resource allocation problem. We can, therefore, gain insight from common solutions to such problems. Common approaches for resource allocation problems are greedy algorithms and constraint programming. We propose an algorithm of each approach in this section.

#### 1) GREEDY ANSATZ DISTRIBUTION

In the greedy algorithm approach, we greedily fill the QPUs with as many Ansatz states that can possibly fit and for the remaining needed qubits, we split them across the QPUs reserving the needed qubits as needed. When the QPUs cannot fit any more Ansätze, the execution of those estimations are moved to the next round. In detail, we propose Algorithm 1. We refer to an algorithm called doesNotFit, which simply runs a similar logic as the main algorithm but just ensures a distribution exists for one particular Ansatz. We refer the reader to Appendix A, Algorithm 8 for the detailed algorithm. This algorithm executes with run time on the order of  $O(p \cdot m^2)$ , where p is the number of Pauli strings to distribute and *m* is the number of QPUs.

#### 2) CONSTRAINT PROGRAMMING APPROACH

As another approach to solving Problem 1, we use constraint programming. The tradeoff with constraint programming is that setting up a collection of constraints is generally straight forward but solving constraint problems on a finite domain is generally NP-complete, trading simplicity for time. Adding more constraints is straightforward relative to implementing

#### Algorithm 1: Greedy Ansatz Distribution. Input:

- List of QPU sizes  $Q = [q_1, q_2, ..., q_m]$ . • *n* the qubits for Ansatz
- p the number of Pauli strings to distribute

Parameters for recursion defaulted to  $schedule = \{\}$  and round = 1**Output:** An Ansatz distribution schedule used to compute  $\langle \psi | H | \psi \rangle$  for an Ansatz  $|\psi\rangle$  of size *n* qubits.

**GreedyDistribution**(Q, n, schedule, round):

1: **if** p = 0 or n = 0:

```
2: | return schedule
 3: Q' \leftarrow copy(Q) = \{q'_1, ..., q'_m\}
                                                     \triangleright Copy Q for modification
 4: schedule[r] \leftarrow []
                                         ▷ Initialize the schedule for this round
 5:
    couldNotFit \leftarrow 0
    for i\in 1,...,p do
 6:
 7:
       sort(Q')
       if doesNotFit(n, Q'):
 8:
 9:
         if round = 1 \land i = 1: exit
10:
          couldNotFit \leftarrow coundNotFit + 1
11:
          continue
       distribution \leftarrow [0 \text{ for } \_ \in \{1, ..., m\}]
12.
                                                           \triangleright A vector of m zeros
13:
       for j \in \{1, ..., |Q'|\} do
14:
          curAllocation \leftarrow [0 \text{ for } \_ \in \{1, .., m\}]
          possibleQPUs \leftarrow Q'|_{\{1,...,j\}}
15:
                                                        \triangleright Restrict to first j QPUs
16:
          if j = 1:
                                                                ▷ No split needed
17:
            k \leftarrow QPUNumber(possibleQPUs[1])
                                                                ▷ The QPU index
18:
            curAllocation[k] \leftarrow possibleQPUs[1] - 1
19:
          else
20:
            k \leftarrow QPUNumber(possibleQPUs[1])
                                                               ▷ The QPU index
            curAllocation[k] \gets possibleQPUs[1] - 3
21:
            for q_s' \in possibleQPUs|_{\{2,...,j\}} do
22:
23:
             | curAllocation[s] \leftarrow q'_s - 2 \triangleright \text{Reserve 2 qubits from QPUs}
24:
            end for
25:
          if sum(curAllocation) \ge n:
                                                      ▷ An allocation is possible
26:
            remaining \leftarrow n
27:
            iteration \leftarrow 1
            for q'_s \in possibleQPUs do
28:
29:
               t \leftarrow \min\{remaining, curAllocation[s]\}
30:
               distribution[s] \leftarrow t
31:
               remaining \leftarrow remaining - t
32:
               if iteration = 1 : \triangleright Remove respective qubits from first QPU
33:
                 if j = 1:
34:
                 | q'_s \leftarrow q'_s - t - 1
35:
                 else
                 | q'_s \leftarrow q'_s - t - 3
36:
37:
               else
38:
                q'_s \leftarrow q'_s - t - 2
39
               if remaining = 0: break
40:
              iteration \leftarrow iteration + 1
41:
            end for
42:
            break
43:
       end for
       for q'_s \in Q' do
44:
45:
       | if q'_s = 0: delete q'_s
46:
       end for
47:
       schedule[r].add((i, distribution))
48: end for
49: return
      GreedyDistribution(Q, n, couldNotFit, schedule, round + 1)
```

the same constraints in greedily. We construct the multiobjective constraint program in detail in constraint program 2. Using this constraint program repeatedly, we can produce a schedule by running the constraint program on the maximum number of Ansätze that fit in the system and using a solution from the output, once per round, until all Ansätze are covered.

Algorithm 2 Constraint Program 2

#### Input:

- $Q = [q_1, ..., q_n], \forall i, q_i \in \mathbb{N}$  a list of the number of qubits for each QPU in the system
- $A \in \mathbb{N}$  the number of qubits in the Ansatz
- $m \in \mathbb{N}$  the number of Ansätze to fit

#### Variables:

- $x_{ij} \in \{0, ..., A\}$ : The number of qubits from Ansatz  $0 \le i \le m$  placed on QPU j
- $y_{ij} \in \{0, 1\}$ : The QPE qubit for Ansatz *i* on QPU *j*
- $z_{ijk} \in \{0, 2\}$ : The number of qubits used to split Ansatz *i* between QPUs *j* and *k*

#### **Objective Functions**:

maximize  $\sum_{ij} x_{ij}$ , minimize  $\sum_{ijk} z_{ijk}$ 

# **Constraints**:

1) There's only one QPE qubit per Ansatz:

$$\sum_{j=1}^{n} y_{ij} = 1, \quad \forall i \in \{1, ..., m\}$$

2) If the Ansatz is split, then both QPUs use qubits:

 $z_{ijk} = z_{ikj}, \quad \forall i \in \{1, \dots, m\}, \, j, k \in \{1, \dots, n\}, \, j \neq k, \, j < k$ 

3) Ansatz is completely covered with one QPE qubit:

$$\sum_{j=1}^{n} x_{ij} + y_{ij} = A + 1, \quad \forall i \in \{1, \dots, m\}$$

4) Qubits allocated do not exceed the number of qubits on the QPU. Note we can recycle the splitting qubits for multiple splits of the same Ansatz.

$$\sum_{i=1}^{m} x_{ij} + y_{ij} + \max_{\substack{k \in \{1, \dots, n\} \\ k \neq j}} z_{ijk} \le q_j, \quad \forall j \in \{1, \dots, n\}$$

5) The Ansatz fits on one QPU or it is split:

$$\max_{i \in \{1,...,A\}} x_{ij} = A \land \sum_{j=1}^{n} z_{ijk} = 0 \lor$$
$$|\{x_{ij} : j \in \{1,...,n\}, x_{ij} \neq 0\}| - 1$$
$$= |\{z_{ijk} : j, k \in \{1,...,n\}, j \neq k, z_{ijk} = 2\}|/2,$$
$$\forall i \in \{1,...,m\}$$

6) The QPE qubit exists on a QPU with Ansatz qubits:

$$\exists j \in \{1, ..., n\} x_{ij} \neq 0 \land y_{ij} \neq 0, \forall i \in \{1, ..., m\}$$

#### **B.** DISTRIBUTING α-VQE

As discussed in earlier sections, The VQE is a variational algorithm that uses a combination of quantum and classical components and can be used to estimate ground state energies in electric molecular Hamiltonians. To perform chemical







**FIGURE 4.** Circuit diagram for RFPE.  $Z(M\theta) := \text{diag}(1, e^{-iM\theta})$ .

calculations, VQE is used with a statistical sampling subroutine to estimate expectation values with a given Ansatz with a classical optimizer to pick the parameters to minimize the expectation value. In [6], a generalization of VQE is proposed, called  $\alpha$ -VQE. The generalization replaces the statistical sampling step with a subroutine called  $\alpha$ -QPE, which for the selection of  $\alpha \in [0, 1]$  can behave as VQE does, but also can become more efficient by choosing  $\alpha > 0$ , which requires the ability to run deeper circuits on quantum hardware.

In this section, we take the proposed  $\alpha$ -VQE in [6] and map it to a distributed system. The main theme in this section is applying nonlocal control gates over separated QPUs. We follow the approach of [7] and [8] using entanglement and classical communication to perform control gates across distributed systems, relying on the preallocated qubits from the previous section to hold the entanglement across devices.

#### 1) DISTRIBUTING $\alpha$ -QPE

The QPE algorithm is an essential ingredient to many popular quantum algorithms—one such being Shor's algorithm. First discussed by Kitaev in [26], QPE is used to estimate the phase of a quantum state  $|\psi\rangle$  that appears after applying a specific unitary operation U to it, where  $|\psi\rangle$  is an eigenstate of U. Specifically, QPE aims to estimate the phase  $\phi$  in  $U|\psi\rangle = e^{2i\pi\phi}|\psi\rangle$  with high probability. In Fig. 3, we depict a circuit representation of QPE applied to a qubit  $|\psi\rangle$ , where n qubits are used to estimate  $\phi$ .

Here, we adapt a modified version of QPE developed in [6] called  $\alpha$ -QPE for a distributed system.  $\alpha$ -QPE is a modified version of rejection filtering phase estimation (RFPE) whose circuit diagram is given in Fig. 4.  $\alpha$ -QPE uses a free parameter  $\alpha$  that is chosen depending on the available circuit depth on the specific hardware running the algorithm. With this  $\alpha$ , M, and  $\theta$  are selected as  $M = 1/\sigma^{\alpha}$  and  $\theta = \mu - \sigma$ . Here,  $\sigma$  and  $\mu$  are parameters for a normal  $\mathcal{N}(\mu, \sigma^2)$  prior distribution in the first round of  $\alpha$ -QPE for sampling values of  $\phi$ , the "eigenphase" in  $U|\phi\rangle = e^{\pm i\phi}|\phi\rangle$ . Here, U is modified to be a rotation operator that rotates an Ansatz  $|\psi\rangle$  by an angle  $\phi$  in the plane spanned by  $\{|\psi\rangle, P|\psi\rangle\}$ , where P is a Pauli string.



FIGURE 5. Example of running the DistributedRemapper algorithm. (a) Three-qubit variational form. (b) Three-qubit variational form distributed across two devices.

More precisely, with the goal of estimating  $|\langle \psi | P | \psi \rangle|$ , given an Ansatz preparation circuit  $R := R(\lambda)$  for some parameter vector  $\lambda \in \mathbb{R}^n$  and a reflection operator  $\Pi := \mathbb{I} - 2|0\rangle\langle 0|$ ,  $U := R\Pi R^{\dagger} P R\Pi R^{\dagger} P^{\dagger}$  and the circuit depicted in Fig. 4 is executed to obtain a value *E*. When *E* is obtained, rejection sampling is performed to produce a posterior distribution, which can be shown to again be normal, in which to again sample values of  $\psi$ . This process is repeated until sufficient accuracy is reached. Once an estimate for  $\phi$  is obtained, one can recover  $|\langle \psi | P | \psi \rangle|$  using the relation  $|\langle \psi | P | \psi \rangle| = \cos(\phi/2)$ . In [6], mechanisms to recover the sign of  $\langle \psi | P | \psi \rangle$ are provided.

In this section, we tackle three key steps in to adapt  $\alpha$ -QPE for a distributed system. The first is mapping the state preparation circuit  $R(\lambda)$  across multiple QPUs, the second is then to map *U* to a distributed system, and the third, performing the controlled operation in Fig. 4. We solve these in order. The solution to the first task takes Ansatz preparation circuit  $R(\lambda)$  and develops a mechanism such that it can be applied when some qubits are physically separated. Here, we consider  $R(\lambda)$  a variational form, a parameterized circuit used to prepare an Ansatz. We give an algorithm to achieve this in Algorithm 3.

The high level idea of Algorithm 3 is, given the circuit representation of  $R(\lambda)$  as a series of layers, where each layer is a collection of gates in a layer of a circuit, and a mapping of qubits, to search for any control gates where the control and target are physically separated between two QPUs. When found, insert, between the current layer and next layer in the circuit, the necessary steps to perform the control gate in a nonlocal way using the cat-entangling method. We also ensure that entanglement is established between the two QPUs ahead of time by prepending an entanglement generation step. As an optimization, the cat-disentangler step can be shifted to a later layer if the nonlocal control gate has the

VOLUME 2, 2021

same control qubit and no operations on that control qubit in between controlled gates. Note that we can generate a distributed  $R(\lambda)^{\dagger}$  in the same way. From the previous section, the proposed solutions to Problem 1 ensure that there are two qubits reserved on each QPU for the entanglement qubits needed for nonlocal operations. Producing the layering of a circuit can be done in a straight forward way and we assume that this structure is the input to the algorithm. We depict an example of running the algorithm in Fig. 5.

The next step is to map  $U := R\Pi R^{\dagger} P R\Pi P^{\dagger} R^{\dagger}$  to a distributed system. One observation that can be made immediately is, since P is a Pauli string,  $P^{\dagger} = P$ , so there are no additional steps needed to map  $P^{\dagger}$ . *P* is a separable operation (i.e., there are no two-qubit gates) and, therefore, we can apply each piece of P in a single layer with no added inter-QPU communication. For mapping  $R(\lambda)^{\dagger}$  to a distributed system, as discussed, given an  $R(\lambda)$  as a circuit that is not distributed, we can obtain  $R(\lambda)^{\dagger}$ . To obtain the mapping, we can run Algorithm 3 with  $R(\lambda)^{\dagger}$  as the input with the same Ansatz distribution. Next, we consider the *n* qubit reflection operator  $\Pi$ , which can be decomposed (locally) as a series of single-qubit gates and CNOT operations. We can, therefore, again use Algorithm 3 to map a provided reflection  $\Pi$  to a distributed architecture given the Ansatz distribution as input.

For the control part of  $\alpha$ -QPE, we consider the controlled version of U, c - U, because of the structure of U, one can see that the only operation that is in need of control is in the reflection  $\Pi$  since if  $\Pi$  is not applied, c - U is reduced to the identity. Here, it will be the case that we need to execute control–control gates (CC-gates). If the Ansatz is split between QPUs, then two qubits need to be reserved on each QPU to accommodate for CC-gates. This is guaranteed by the scheduling algorithm in the previous section and there will always be two free qubits reserved such that we can

# Algorithm 3: Local to Distributed Circuit.

#### Input:

- A circuit representation of unitary U where U is a list of list of gates. Each list represents a layer in the circuit. Gates have the form Gate(ID) or CTRL(G, ID<sub>1</sub>, ID<sub>2</sub>) where G is the gate to applied using control qubit with ID<sub>1</sub> and target qubit ID<sub>2</sub>. The ID in the form (i, j) where i is the QPU and j the qubit on that QPU.
- A qubit layout map *qubitMap* on a collection ID tuples of the form (*ID*<sub>1</sub>, *ID*<sub>2</sub>).

**Output:** An equivalent circuit that accommodates for non-local controlled gates.

**DistributedRemapper**(U, qubit Map):

1:  $remappedCircuit \leftarrow [[ ]]$ 2: for  $layer_l \in U$  do 3:  $modLayers \leftarrow [[] \times 8]$ 4: for  $gate \in layer_l$  do if gate is CTRL: 5: 6:  $((i, j), (s, t)) \leftarrow gate.qubits$ 7: if i = s: modLayers[0].add(gate)8: if Entanglement not established between QPUs i and s: 9:  $| remapping_{l-1}.add(Ent((i, e_1), (s, e_2)))$ 10: 11: modLayers[0].add $(CNOT((i, j), (i, e_1)))$ modLayers[1].add $(c_i \leftarrow measure((i, e_1)))$ 12: modLayers[2].add $(c_s \leftarrow claComm(i, s, c_i))$ 13: modLayers[2].add $(claCtrlX(c_i, (i, e_1)))$ 14: modLayers[3].add $(claCtrlX(c_s, (s, e_2)))$ 15: modLayers[4].add $(ctrl - G((s, e_2), (s, t)))$ 16: 17: modLayers[4].add $(ctrl - G((s, e_2), (s, t)))$  $n \leftarrow 0$ 18: 19: for ctrl- $G' \in GetSeriesCGates(U, layer_l, s, (i, j))$  do 20:  $n \leftarrow n+1$  $((\_,\_),(\_,t')) \leftarrow ctrl-G'$ .qubits 21: modLayers.add([], 4+n)22: modLayers[4+n].add $(ctrl-G'((s,e_2),(s,t')))$ 23: 24: remove ctrl-G'25: end for modLayers[5+n].add $(H(s, e_2))$ 26: 27. modLayers[6+n].add $(c_s \leftarrow measure(s, e_2))$ modLayers[6+n].add $(claCtrlX(c_s, (s, e_2)))$ 28: 29: modLayers[6+n].add $(c_s \leftarrow claComm(s, i, c_s))$ 30: modLayers[7+n].add $(claCtrlZ(c_i, (i, e_1)))$ 31: else modLayers[0].add(gate)32: 33: end for 34: remappedCircuit.addAll(modLayers) 35: end for 36: return remappedCircuit

apply Algorithm 3 again after adding a control connection to each gate of the circuit representing  $\Pi$  the distributed form, excluding the previously added nonlocal steps, to produce a circuit that achieves the controlled version of  $\Pi$ .

The remaining steps of  $\alpha$ -QPE are the two complications that arise, which are discussed in [6, Sec. II-B]. At each iteration of  $\alpha$ -QPE the Ansatz  $|\psi\rangle = 1/\sqrt{2}(|\phi\rangle + |-\phi\rangle)$ needs to be collapsed into either  $|\phi\rangle$  or  $|-\phi\rangle$ . Wang *et al.* [6] proposed a statistical sampling method, which one can apply a constant number of iterations in order to, with high confidence, both estimate the sign of  $\langle \psi | P | \psi \rangle$  and ensure that  $|\langle \psi | P | \psi \rangle| > \delta$ . When this bound holds, then with high confidence,  $|\psi\rangle$  can be efficiently collapsed to either one of  $|\phi\rangle$  or  $|-\phi\rangle$ . Once this is performed, we apply the  $\alpha$ -QPE

#### 3100921

| Algorithm 4: | GetSeriesControlGates. |
|--------------|------------------------|
| Input:       |                        |

- U the circuit as described in Algorithm 3
- $layerl_l$  the current layer to decompose in U
- s the QPU for the target qubit
- (i, j) the control qubit

**Output:** The series of gates directly following from layer l that are control gates with with control qubit (i, j).

 $GetSeriesCGates(U, layer_l, s, (i, j)):$ 1:  $layers \leftarrow \{layer_{l+1}, ..., layer_n\} \subseteq U$ 2:  $gates \leftarrow [$ ] 3: for  $layer_s \in layers$  do for  $gate \in layer_s$  do 4: 5. if gate is CTRL and  $gate = (\_, (i, j), (s, \_))$ : 6: gates.add(gate)7: else 8: return gates 9٠ end for 10. end for 11: return gates

procedure as normal. If high confidence cannot be achieved, then instead of using the  $\alpha$ -QPE circuitry, statistical sampling continues. Statistical sampling in this setting implies repeatedly preparing  $|\psi\rangle$ , applying the single layer Pauli string *P*, in order to estimate  $\langle \psi | P | \psi \rangle$ . When the bound does not hold, statistical sampling is performed until  $\langle \psi | P | \psi \rangle$  is estimated with sufficient precision in the normal VQE sense. We follow the method of Wang *et al.*, but use the modified  $R(\lambda)$  circuit needed to prepare the Ansatz over a distributed quantum computer. We write this whole procedure in Algorithm 5.

Definition 2 (Schedule): A schedule S is a collection of r lists where each element of a list contains the distribution of qubits on the m QPUs. Each distribution is a list of qubit allocations on each QPU  $q_i \in \{0, ..., Q_j\}$ , where  $Q_j$  is the number of qubits on QPU j. If the Ansatz is not allocated in a round  $r' \in \{1, ..., r\}$ , it does not appear in the distribution list. The structure of a schedule is as follows:

$$S = \{ \\ 1 : [[q_1, ..., q_m]_1, ..., [q_1, ..., q_m]_{n_1}] \\ 2 : [[q_1, ..., q_m]_{n_1+1}, ..., [q_1, ..., q_m]_{n_2}], \\ ..., \\ r : [[q_1, ..., q_m]_{n_{r-1}+1}, ..., [q_1, ..., q_m]_{n_r}] \\ \}.$$

The subscripts on the qubit count lists represent the index of the Pauli being estimated.

#### 2) DISTRIBUTED $\alpha$ -VQE

To conclude the mapping of a localized, monolithic version of  $\alpha$ -VQE to the distributed version, we need to replace the  $\alpha$ -QPE subroutine with the distributed  $\alpha$ -QPE version from the previous section. For completeness, we write distributed  $\alpha$ -VQE as an algorithm in Algorithm 6. Algorithm 5: Distributed  $\alpha$ -QPE.

#### Input:

- S: A schedule (defined in Definition 2) providing the qubit mapping of an Ansatz of size n for p Pauli strings on m QPUs.
- $A = [a_1, ..., a_n]$ : The vector constants for each Pauli string.
- $U = [U_1, ..., U_n]$ : The circuits for  $\alpha$ -QPE

•  $P = [P_1, ..., P_n]$ : The Pauli operators associated with each  $U_i$ **Output:** The value of the expectation value estimations of the pPaulis

**Distributed**  $\alpha$ **-QPE**(S, A, U, P):

- 1: estimates  $\leftarrow$  []
- 2: for  $r \in S$  do
- $roundOfEstimates \leftarrow RunAOPERound(S(r), U)$ 3:
- estimates.addAll(roundOfEstimates) 4:
- 5: end for

6: return  $A \cdot estimates$ ▷ Return the scalar product **RunAQPERound**(S(r), U):

1:  $estimates \leftarrow Array(|S(r)|) \Rightarrow$  Initalize |S(r)| length array 2: in parallel for  $p \in S(r)$  do

3: success  $\leftarrow$  Bound  $|\langle \psi(\lambda)|P_p|\psi(\lambda)\rangle|$  away from 0 and 1 using [6, Appendix C, Stage I]

4: if success:

- 5: Perform [6, Appendix C, Stage II] to collapse  $|\psi\rangle$  to  $|\pm\phi\rangle$
- 6:  $estimates[p] \leftarrow Perform \alpha$ -QPE using distributed circuits with  $c - U_p$  or  $c - U_p^{\dagger}$  depending on collapsed  $|\psi\rangle$

7: else

8:  $estimates[p] \leftarrow \text{Estimate } \langle \psi(\lambda) | P_p | \psi(\lambda) \rangle$  with statistical sampling using constant distribution circuits using the Ansatz distribution p 9. await all

10: return estimates

#### C. ANALYSIS

In this section, we analyze the properties of the distributed quantum circuits in relation to the Ansatz size. First, we compare the duration of computation using three methods of performing the estimates of the expectation values: estimating in parallel, on one single QPU the size of the Ansatz, and using parallel and distributed computing. When running in parallel, one Pauli string is estimated per OPU. The limitation is that the Ansatz can be only as big as the smallest QPU, minus the qubit for  $\alpha$ -QPE. In the single QPU case, we assume the full Ansatz can fit on the QPU, and therefore no gates are distributed. Finally, in the distributed and parallel case, Pauli strings are estimated similarly to the parallel case, but multiple Ansätze can be placed on a single QPU as well as split between multiple QPUs with distributed control gates.

To get an estimate for the number of gates used, we analyze the pieces of the U operator defined in the previous section. The reflection operator  $\Pi$  has the equivalent cost, up to 2nsingle-qubit gates to an (n + 1)-qubit Toffoli gate [6, Sec. II-B]. Without ancilla qubits, currently the circuit depth to implement such a gate grows linearly O(n) [27] with improved linear scaling with one ancilla qubit [28]. When  $\left\lceil \frac{n-2}{2} \right\rceil$  ancilla qubits are available, the depth can scale as  $O(\log n)$  [29] to implement with 6n - 6 CNOT gates. The additional ancilla qubits to decrease the circuit depth could be considered in the Ansatz distribution phase from Section III-A, and we leave it to future work to analyze this change. Here, we assume no additional ancilla qubits. For the Ansatz preparation  $R(\lambda)$ , in

# Algorithm 6: Distributed $\alpha$ -VQE.

#### Input:

- A list of QPU sizes  $Q = [q_1, q_2, ..., q_m]$  H the Hamiltonian  $H = A \cdot P = \sum_{i=1}^n a_i P_i$
- $R(\lambda)$  The Ansatz preparation circuit

**Output**: An estimate for  $\langle \psi(\lambda) | H | \psi(\lambda) \rangle$ ,  $| \psi(\lambda) \rangle$  the state prepared by circuit  $R(\lambda)$ .

**Distributed**  $\alpha$ **-VQE** $(Q, H, R(\lambda))$ 

- 1:  $q \leftarrow$  Number of qubits needed for  $R(\lambda)$
- 2:  $p \leftarrow$  Number of Paulis for H
- 3:  $S, map \leftarrow$  Ansatz schedule from an algorithm proposed in Section III-A
- 4:  $dR(\lambda) \leftarrow$ **DistributedRemapper** $(R(\lambda), map)$
- 5:  $dR(\lambda)^{\dagger} \leftarrow \text{DistributedRemapper}(R(\lambda)^{\dagger}, map)$
- 6:  $d\Pi \leftarrow \mathbf{DistributedRemapper}(\Pi, map)$
- 7:  $c \Pi \leftarrow \text{Add control connections to } d\Pi$  from pre-allocated  $\alpha$ -QPE qubit
- 8:  $c d\Pi \leftarrow \textbf{DistributedRemapper}(c \Pi, map)$
- 9: for  $P_i \in P$  do
- $dP_i \leftarrow \mathbf{DistributedRemapper}(P_i, map)$  $10 \cdot$
- $c dU_i \leftarrow$  Combine distributed circuits 11:  $dR(c-d\Pi)dR^{\dagger}dP_{i}dR(c-d\Pi)dP_{i}dR^{\dagger}$
- 12: end for
- 13:  $c dU \leftarrow [c dU_1, ..., c dU_n]$
- 14:  $dP \leftarrow [dP_1, ..., dP_n]$

15:  $\langle \psi(\lambda) | H | \psi(\lambda) \rangle \leftarrow \text{Distributed } \alpha \text{-QPE}(S, A, c - dU, dP)$ 16: return  $\langle \psi(\lambda) | H | \psi(\lambda) \rangle$ 



FIGURE 6. This plot is of a weighted time using the greedy distribution of the Ansatz for growing Ansatz sizes with five QPUs each with ten qubits. The green line shows the timing for running one Ansatz per QPU. It cuts off at nine qubits. The orange line is if all 50 qubits were on 1 QPU. The blue line is if we use a distributed Ansatz over the five QPUs.

most of the applications to date, the circuit depth is  $\Omega(n)$  [30], meaning it has a tight upper and lower bound proportional to the number of qubits, which could be the most significant overhead in this process.

We demonstrate the time tradeoff. In Fig. 6, we assume we have a QPU cluster with five QPUs each with ten qubits. We determine a rough upper bound on the number of gates needed to perform distributed computing and summarize the time weight and gate quantity scaling in Table 1. In Fig. 7, we show the maximum number of qubits that an Ansatz can be composed of using four different sized QPUs and with respect to the adding additional QPUs of the that size to the distributed system.

**TABLE 1.** Time scaling of gates. *n* represents the number of qubits in the Ansatz and *m* the number of QPUs. The execution time weights are derived from [31] for superconducting qubits. The quantity scalings are based on a Bravyi–Kitaev mapping [32]

| Operation               | Time weight | Quantity scaling      |
|-------------------------|-------------|-----------------------|
| CNOT                    | 5           | $O(n^4 \cdot \log n)$ |
| Single qubit gate       | 1           | $O(n^4 \cdot \log n)$ |
| Measurements            | 2           | $O(n^4 \cdot \log n)$ |
| Entanglement generation | 8           | $O(n^4 \cdot \log n)$ |
| Classical communication | 2           | $O(n^4 \cdot \log n)$ |
| Output merging          | 3           | O(m)                  |



**FIGURE 7.** Maximum Ansatz size that would fit on a distributed system of QPUs. The maximum Ansatz size is given by  $\sum_{i=1}^{n} q_i - 2n - 1$  with *n* QPUs with  $q_i > 2$  qubits on QPU *i*.

#### D. APPLICATIONS FOR QUANTUM CHEMISTRY

In this section, we take an example of an electronic molecular Hamiltonian for the chemical  $H_2$ . To estimate the Hamiltonian for this molecule with two electrons and two active orbitals, we require four qubits when using a Bravyi–Kitaev transformation. We can quickly obtain the Hamiltonian using the Pennylane Python library [33]. The Hamiltonian in this case, under the Bravyi–Kitaev transformation is of the form

$$H = \sum_{i=1}^{15} a_i P_i$$
 (2)

where we are concerned in the number of elements in the sum and less so about the constant factors and therefore to perform  $\alpha$ -VQE, we will need to estimate 15 Pauli strings. In this example, we will consider a distributed quantum system of three QPUs each containing nine qubits. If we use these parameters as input to the algorithms in Section III-A1, the output configuration would be the one depicted in Fig. 8. In one round, four Ansäzte can fit across this distributed system, and so at least four rounds need to be executed. We can use the same allocation for the first three rounds and in the last round eliminate the distributed Ansatz in order to reduce the need for cross communication between QPUs.

For the Ansatz preparation, we use the circuit  $R(\lambda)$  depicted in Fig. 9(a). From the four Ansätze, three of them will be able to run the  $\alpha$ -QPE step without distribution of the Ansatz. The fourth Ansatz is, on the other hand, distributed and will need to use the circuit in Fig. 9(b) for preparation. For simplicity, we include arbitrary qubit rotations, which are



**FIGURE 8.** Distributed Ansatz of size four on three QPUs with nine qubits. The green outlined qubits are reserved for running  $\alpha$ -QPE. The red outline qubits are for the Ansätze. The orange outlined are qubits reserved for entanglement between QPUs for nonlocal gates. One qubit is left idle.

represented by the  $R(\lambda_1, \lambda_2, \lambda_3)$  gates, where  $\lambda_i \in [-\frac{\pi}{2}, \frac{\pi}{2}]$ for  $i \in \{1, 2, 3\}$ . Next we need to perform the reflection operation  $\Pi$  described in Section III-B1, whose circuit is shown in Fig. 10(a). An equivalent circuit is also shown, which decomposes the four-qubit Toffoli gate into a series-controlled and single-qubit gate. We again need a distributed version of the reflection operation to support the Ansatz, which is distributed. We show this circuit in Fig. 10(b). Here, we introduce gates for the cat-entangler and cat-disentagler sequences. Here, four qubits are allocated for performing the nonlocal gates. Now, for running  $\alpha$ -QPE, we need a circuit for  $c - \Pi$ , which is the control part of c - U. Here is where it is critical to have two entanglement qubits for each splitting of the Ansatz on each QPU since, as seen in Fig. 11, there are CC-gates that occur across QPUs. With this collection of gates, we can run  $\alpha$ -QPE and, therefore, using the algorithm in Section III-B run  $\alpha$ -VQE.

#### IV. NETWORKED CONTROL SYSTEMS FOR DISTRIBUTED QC

Because it will be difficult in the near future to construct large, monolithic quantum computers, it will be a viable option to instead connect smaller quantum computers together using a network in a distributed manner. One can therefore consider networked control systems (NCSs) to manage the distribution of resources for running quantum algorithms. Such a system could allow for more flexibility regarding hardware configurations and the ability to dynamically add more devices while minimizing integration overheads. A NCS is a network of devices connected together using the network in order to perform a specific mutual task orchestrated by a control system [34], [35]. Among the other thing, NCSs are used to perform distributed or parallel computing, controlling a fleet of robots or drones, or smart grid systems deployed in modern cities [36].

Networked control systems can have various architectures for the control system part. These systems can either have a centralized controller where communications amongst the nodes are restricted to local area network (LAN) or a decentralized controller system that is connected via an Internet



**FIGURE 9.** Distributed circuit mapping for  $R(\lambda)$ . (a) Circuit for  $R(\lambda)$ ,  $\lambda_i \in [-\pi, \pi]^3$ ,  $i \in \{1, ..., 4\}$ . (b) Circuit for a distributed  $R(\lambda)$ . The red dashed lines represent the individual QPUs.



**FIGURE 10.** Distributed circuit mapping for reflection II. (a) Circuit representation of reflection II. (b) Circuit representation of distributed II. The square gates in the four-qubit gates represent the cat-entangler/disentager sequence.



**FIGURE 11.** Distributed circuit mapping for  $c - \Pi$ . (a) To run  $\alpha$ -QPE, one needs to perform a controlled *U* operation *M* times, where  $U = R \Pi R^{\dagger} P R \Pi R^{\dagger} P$ . The control portion to consider is  $c - \Pi$ . We depict the  $c - \Pi$  part, where the other parts of *U* are applied before and after what is depicted, which do not need to be controlled. (b) Distributed  $c - \Pi$ . The square gates in the four-qubit gates represent the cat-entangler/disentager sequence.

or wide area network (WAN). These two scenarios resemble how distributed quantum computers could potentially be networked. In the first case, one can consider a single owner of multiple quantum devices where all of the quantum devices are located in the same room or building, specifically, the network owner would know the network topology and information about hardware in the network. In the second setting, multiple quantum computers located possibly far apart potentially connected by multihop connections, where the owner of the hardware between the hops is possibly different. Here, more advanced protocols that consider security and robustness will be needed potentially leading to a fully fledged quantum Internet.

In order to use a network of distributed quantum computers efficiently, it is important that one develops robust communication protocols such that communication and control between the quantum devices in the network is efficient and reliable. In this section, we consider quantum systems with classical control and a separated quantum processing units. We consider a QPU to be a combination of a three-layered system depicted in Fig. 12. The QPU in this case is a layered system with inputs and outputs to a communication network through a classical computer, or a CPU. The CPU interfaces with the network as well as controls the FPGA based on the control instructions from the network, which in turn controls



FIGURE 12. Internal layering of a QPU. We assume there is a layered architecture. The CPU instructs the FPGA which in turn controls and measures the qubits. The CPU also interfaces with the network.

the qubits to perform quantum operations on the qubit layer. Qubit measurements and other classical messages are transmitted back to the network via a reversed path.

We consider the two different network configurations described and get into more detail about how these systems could be implemented in practice. We list the communication requirements needed to perform distributed quantum computations. We explore some available protocols to achieve these requirements under two scenarios. In the first one, there is a centralized controller of the system and communication to devices is classical information and quantum entanglement



FIGURE 13. Networked control system with a centralized controller. The triangular arrowheads represent classical connections and the diamond shaped arrow heads represent quantum connections used for establishing EPR pairs. Here, we assume the network between the QPUs is completely connected in terms of quantum and classical connections, that is, each QPU has the same connections with any other QPU. In the completely connected network, this network is single use for transmitting with low latency. Moreover, each QPU is connected to a common bus line that handles all the latency-tolerant message exchanges among the nodes.

can be sent directly to other quantum processors without routing. The second case is when control over the network is not centralized, but has a single user. We, then, propose a control system using Deltaflow.OS to control system to orchestrate distributed quantum computing.

# A. CONTROL SYSTEM ARCHITECTURES FOR DISTRIBUTED QUANTUM COMPUTING

In this section, we discuss two possible network architectures for distributed quantum computing control systems. The major difference between the two systems is centralization of the control. In the first system, we consider a distributed architecture with a centralized control. In the second, the control is split such that each QPU in the system has its own control. In this section, we describe these two systems in depth. In later sections, we go into detail regarding the communication requirements needed to run the systems and potential protocols to achieve them.

# 1) CENTRALIZED-CONTROLLED DISTRIBUTED QUANTUM SYSTEMS

The first distributed quantum computing scenario we consider is depicted in Fig. 13. This scenario is one where there is a single controller and the quantum hardware behaves only according to the instructions that are fed from this controller. The QPU systems are connected to the controller via classical network and further they are connected to each other both classically and quantumly-so that they can generate entanglement amongst themselves. The main idea here is that the CPUs in the network have a static IP and can be accessed by the centralized control. The finer synchronization between the QPU nodes is delegated to the CPU controlling the FPGA layer of the QPU from the centralized control ahead of execution time. The CPUs control the FPGAs and the FPGAs communicate over fixed low-latency links. This latency can be accounted for control instruction scheduling. At a small and medium size, this network scheme will be best suited, but when many nodes are added to the network, a system with a distributed control is better suited, which we discuss in the following section.



FIGURE 14. Internetworked distributed quantum computer with decentralized controllers. Here, there are independent controllers that control their respective quantum processing systems. Entanglement is generated with an entanglement network requiring possibly multihop entanglement routing. The controller is placed between the QPU and the quantum network since in this scenario, a quantum processing layer will be needed.

# 2) DECENTRALIZED-CONTROLLED DISTRIBUTED QUANTUM SYSTEMS

With a decentralized control system, the nodes in the network are no longer in a "master-slave" relationship because the hardware is no longer controlled by a single entity. A depiction of this scenario is seen in Fig. 14. Resources in this setting need to be requested from various parties and there is no guarantee that the requested resource will be available at the time of request. Access to the controllers is hidden behind a firewall and their IP, MAC, and inner network configuration is potentially not exposed. We assume that the QPUs are offered by various vendors that have agreed to offer a base set of services. They provide access to quantum hardware for a maximum amount of time per instruction set execution, they offer classical communication input and output to a prespecified IP address, where the communication stream is established prior to execution, and they allow for remote entanglement to be established between quantum devices on specified quantum hardware. In this case, the control information between QPUs is needed and we will need a lowlatency protocol that works in the network layer so that the control messages can be routed.

#### B. DISTRIBUTED QUANTUM ALGORITHM SCHEDULING

In order for networked quantum hardware to execute instructions in a synchronous fashion, a method of dictating to the devices when the instructions should be executed is needed. In this section, we propose a temporal operation schedule, i.e., a schedule of the operations with precise timestamps for execution. These schedules can then be sent to each QPU in the network with a time to begin execution. Because quantum gates generally have an upper bound for how long they take to execute, we can use this information when generating the schedule. Here, we assume that all gates have a known execution time as well and that latency times for classical communication and entanglement generation are known. We formalize the problem as follows.

*Problem 3 (Distributed Quantum Algorithm Scheduling):* Given a distributed circuit as a series of gate layers, where each layer contains a collection of gates to be applied on the qubits in the system, and the gate times (i.e., the amount of time it takes to perform the gate) of each gate for each QPU in the system, produce a temporal gate execution schedule such that the following constraints are obeyed.

- 1) Sending and receiving classical communication or entanglement between two parties occurs at the same time for the sending and the receiving parties.
- 2) One qubit operation occurs per time instance per qubit for the duration of the gate time (i.e., no overlapping gates).
- 3) At the start of a controlled operation, both qubits need to be available to perform the control gate (i.e., one qubit cannot have a gate operation ongoing).

At the start of the problem, it is assumed that all nodes in the distributed system have synchronized clocks. We assume routes for any multihop communication or entanglement generation is already established and is already calculated into the communication time bounds. Moreover, it is assumed that swap gates are not considered in the scheduling and are assumed included in the worst case gate times provided.

Comparing and creating a hybrid temporal planning approach with constraint programming for quantum circuit scheduling has been investigated in [37] for the max-cut problem. The difference here is the level of compilation is deeper as they include swap operations since they limit to nearest-neighbor interactions between the qubits. A temporal planning and constraint programming approach is, therefore, sensible. Here, we do not enforce nearest-neighbor interaction, and assume this process is included in the worse case timings for two-qubit gates for swapping qubits to their nearest neighbor if needed and remapping the index of the qubit so it does not have to be swapped back. We assume at the end of each layer of gates, each qubit will be free to be operated on and no swapping is needed and, therefore, do not use any constraint programming.

The output of the schedule for each QPU will have the form of Table 2. Table 3 is an intermediate schedule, which is used before splitting the schedules for each QPU. For a list of all possible commands and their descriptions, see Appendix B, Table 4. We approach this problem as follows.

**TABLE 2.**  $T_i$  is the time to execute the particular gate. The output ofAlgorithm 7 will generate a collection of schedules in this form for eachQPU

| Command               | Time  |
|-----------------------|-------|
| CTRL[G, qID, qID]     | $T_1$ |
| $SEND\_ENT[QPU, qID]$ | $T_2$ |
| $REC\_CLA[QPU, cID]$  | $T_3$ |
| :                     | •     |
| SINGLE[G, qID]        | $T_m$ |

**TABLE 3.**  $T_i$  is the time to execute the particular gate. The first step of Algorithm 7 generates a table of gates with QPU information before filtering the gates for each QPU for execution

| Command           | QPUs       | Time  |
|-------------------|------------|-------|
| CTRL[G, qID, qID] | QPU1       | $T_1$ |
| GEN_ENT[qID, qID] | QPU1, QPU2 | $T_2$ |
| CLASSICAL[cID]    | QPU3, QPU2 | $T_3$ |
| :                 | •          | ÷     |
| SINGLE[G, qID]    | QPU5       | $T_n$ |

We start with high-level instructions, which are entanglement generation, single qubit gates, classical communication, and control gates. We generate an instruction list using this gate set. We then take the high-level circuits and break them down into finer control instructions. Once the full schedule is created, we can split the instructions such that the instruction schedule is for a single QPU. The instruction sets can, then, be sent to their respective QPUs and the algorithm can start. In order to ensure gates are performed in the correct order, we layer the circuits as done in the previous section and schedule the circuits layer by layer, iteratively constructing a full schedule. In complete form, we propose Algorithm 7.

#### C. PROTOCOLS

In order to run a distributed quantum algorithm with a distribute quantum computer using the architectures proposed in the previous section, certain communication requirements are needed to ensure execution is possible. Protocols for controlling networked systems exist in practice in the centralized and decentralized case, and we explore some examples of them in this section.

The first requirement considered is the classical communication between the controllers and the QPUs. Here, what is needed is a method for sending the computation instructions to the QPUs, which can be done at slower latency, as well as a method for sending low-latency control bits between the QPUs. We explore methods for achieving this in the two cases. Clock synchronization is a commonly used scheme in distributed control systems. We consider an example of architectures using clock synchronization on a large scale. Finally, in the multivendor case, we discuss certification steps needed to ensure multiple vendors are able to execute distributed quantum algorithms in a cooperative way.

# Algorithm 7: Distributed Scheduler.

Input:

- QPUs the collection of QPUs in the system
- $C = \{l_1, ..., l_n\}$  the circuit to schedule as a series of layers where each  $l_i = \{g_1, ..., g_m\}.$
- gateTime a mapping of gate names to time the gate takes to execute for each QPU.

**Output:** A schedule of gate operations for each QPU to run.

1:  $layerEnd \leftarrow 0$ 

- 2:  $gateSchedule \leftarrow []$
- 3: for  $l_i \in C$  do  $\triangleright$  Make a first pass schedule based on the layers of the circuit
- 4: for  $g_i \in l_i$  do
- $gateSchedule.add((g_i, QPUs(g_i), layerEnd)))$ 5:

```
end for
6:
```

```
7:
     layerEnd \leftarrow \max_{i}(gateTime(g_{i})) + layerEnd
8: end for
```

9:  $QPUSches \leftarrow \{\}$ 

```
10: for QPU \in QPUs do \triangleright Split schedules so each QPU has own
      QPUSche \leftarrow []
11:
      for step \in gateSchedule do
12:
         if QPU \in \mathbf{QPUs}(step) \land |\mathbf{QPUs}(step)| = 1:
13:
           QPUSche. {\bf add} (step)
14:
         else if gate(step) = GEN\_ENT :
15:
```

16: if QPU = QPUs(step)[0]: ▷ Sender QPU

 $QPUSche.add(SEND\_ENT[QPUs(step)[1], qID],$ 17: **time**(*step*))

#### ▷ Receiver QPU else $QPUSche.add(REC\_ENT[QPUs(step)[0], qID]),$

```
18:
19:
                                         time(step))
20:
        else
              ▷ The other non-local gate is classical transmission
         if QPU = QPUs(step)[0]:
21:
                                                ▷ Sender QPU
22:
           QPUSche.add(SEND\_CLA[QPUs(step)[1], cID],
                                          time(step))
                                              ▷ Receiver QPU
23:
          else
         | QPUSche.add(REC\_CLA[QPUs(step)[0], cID],
24:
                                        time(step))
25:
     end for
     QPUSches[QPU] \leftarrow QPUSche
26:
27: end for
```

28: return QPUSches

Selecting the specific hardware that can execute these protocols is left to future work as tasks such as entanglement generation is still it a primitive state and may not exist to the extent we need for years to come. Also as qubit technologies improve, the need for as-low-as-possible latency could be loosened, and other protocols could be used in replacement. Here, we explore examples that could potentially achieve what is needed to perform distributed quantum computing.

#### CLASSICAL COMMUNICATION

In order to run distributed quantum algorithms, there are specific nonlocal tasks that need to be carried out by the distributed system such as receiving control commands, sending measurement results to the controller, and sending qubit measurements between the QPUs at low latency to perform nonlocal control gates. In this section, we explore communication protocols, which can be used by the control system to accomplish running distributed algorithms. Here, we explore

some examples of existing protocols that exist at an industry level.

For the centralized control case, we neglect routing of information and assume each node is connected both classically and quantumly to another. As discussed, We propose that there is a classical network connecting the QPUs to a centralized controller forming a "master-slave" relationship with the additional network of dedicated connections between the QPUs for the sole purpose of low-latency communication. This communication does not go through the CPU of the QPU, but directly between the FPGAs to perform the nonlocal gates.

When using a centralized control system, to perform slower communications between the QPUs and the controller one could consider two options. The first is to simply connect the CPU portions of the network to the controller using a LAN, and communicate using TCP/IP to from the controller to the QPUs. In this case, one would need to closely monitor that communication traffic does not overwhelm the network. Another approach that has this feature built in is to use a protocol often used in industrial control systems. The Modbus communication protocol [38]. Modbus is an open protocol used in a centralized controller master-slave setting as is this centralized controller setting. It is a messaging structure that allows for heterogeneous devices to communicate with a centralized controller and to receive control messages from the controller. The Modbus protocol can be used over a local network using TCP/IP making it easier to install into existing commonly used Ethernet networks. With Modbus, the controller can send the control instructions to the CPU portion of each QPU, which can be sent to the FPGA to perform the portion of the quantum algorithm. With Modbus, the controller can also receive the qubit measurement results from the QPUs once the algorithm is complete.

For low latency communication of short messages (<1 B)between FPGAs there are existing methods that can be used to communicate at the ultralow latency range (<1 ms). In the high performance computing domain, FPGA networks for ultralow latency, high bandwidth communication are realized. Here, we need to consider that the FPGAs may be meters apart. Connecting the FPGAs with, for example, 10 Gigabit Ethernet for sending short messages and using custom communication protocol and small form-factor pluggable (i.e., SFP+) transceivers, latency of 300 ns is possible for each link [39]. With fiber, the latency can be even further reduced.

Another approach that can be integrated again comes from the industrial control domain. Industrial control, especially in the power sector faces issues where some devices need constant monitoring and reacting to the changes needs to be done at very fast speeds. A method used is the Mirrored Bits [40] protocol. Mirrored Bits is a communication protocol for ultralow latency communication adding additional a latency of approximately 200  $\mu$ s for message processing in addition to the latency from transmitting signals over the communication link. Mirrored Bits could be used in this setting to transmit qubit measurement data. The devices that perform the Mirrored Bits protocol, which are manufactured by Schweitzer Engineering Laboratories are programmable and can trigger different routines on the FPGA depending on the input bit. These devices are commonly used to frequently monitor sensor data to trigger emergency shut offs as fast as possible, for example.

In the decentralized case, a dedicated WAN could be established between the vendors such that a link-layer (of the OSI model) protocol is used for the classical control information between the QPUs. Low latency communication can be achieved using the link-layer protocol called the Generic Object Oriented Substation Event (GOOSE). In particular, IEC 61850 is a GOOSE Ethernet protocol meeting time sensitive communications and high-speed performance requirements of automation applications. At the link-layer over an Ethernet network experimental results show GOOSE can be used to transmit in the 0.5-ms scale [41]. When routing is involved, naturally, the latency will grow. If TCP/IP protocols are used over the Internet are considered, then it is unlikely one could create any latency guarantees. If instead there are dedicated WANs with routing, one could consider the network-layer version of GOOSE called Routable GOOSE [42]. In [41], R-GOOSE is analyzed over a WAN using a particular data distribution service and was shown to transmit with average latency of around 8 ms.

Overall, there can be much to learn from looking into the power automation industry, as many low-latency and fastreaction systems have been developed, which have carryover into distributed quantum computing. These protocols have been tested for robustness and security and could potentially fit well for doing distributed quantum computing over a WAN. Moreover, networking FPGAs are as follows.

#### 2) CLOCK SYNCHRONIZATION

For centralized control, the assumptions of clock synchronization and full connectivity at a small scale are not overly restrictive. High-precision clock synchronization can be achieved even in very large configurations (i.e., that comprise a large number of devices) using methods such as in the White Rabbit project [43]. White Rabbit is used at CERN to synchronize over 1000 nodes with subnanosecond accuracy. This is achieved using Ethernet with lengths of up to 10 km, with experiments demonstrating an average of 160 ps skew between similar clocks—regarding clock environmental variables such as temperature—after several hours [44]. This protocol can be integrated in the centralized controller case so that all of the hardware used has synchronized clocks. Once the number of nodes in the network becomes large, routing and efficient network topologies becomes critical.

In the decentralized case, the controllers will need to perform a coarse-grain time synchronization via classical network synchronization protocols and a fine-grain synchronization, a precise notation of time can be shared. GPSs directly connected to FPGAs can be a solution where shielding does not stop the incoming signals. This process is

#### 3100921

### Protocol 8: Entanglement Validation.

#### Vendor 1

- 1: Generate N entangled pairs with Vendor 2
- 2: Measure all of the owned halves
- 3: Send *t* < *N* randomly selected measurements to Vendor 2 without stating which qubits were measured
- 4: Receive *t* bits from Vendor 2
- 5: If *t* bits not received, abort
- 6: Send positions of measurements to Vendor 2
- 7: Receive positions of measurements from Vendor 2 and compare measurements
- 8: Send acknowledgement if comparison passes, else send negative acknowledgement

#### Vendor 2

- 1: Generate N entangled pairs with Vendor 1
- 2: Measure all of the owned halves
- 3: Send t < N randomly selected measurements to Vendor 1 without stating which qubits were measured
- 4: Receive *t* bits from Vendor 1
- 5: If *t* bits not received, abort
- 6: Send positions of measurements to Vendor 1
- 7: Receive positions of measurements from Vendor 1 and compare measurements
- 8: Send acknowledgement if comparison passes, else send negative acknowledgement

common in distributed physical experiments such as in the Super-Kamiokande detector [45] and the CERN-OPERA experiment [46]. Each node will need to implement extra steps to guarantee that the timing information is constantly accurate.

#### 3) ENTANGLEMENT GENERATION

To perform the nonlocal control gates needed in the distributed circuits, the ability to share high quality entanglement between quantum processors is critical. Entangle generation has been achieved in various qubit technologies such as in optical photons, NV-centers, and superconducting qubits [47], but entanglement generation in quantum networks is an ongoing research topic. We consider deterministic entanglement generation schemes such that there is a guaranteed entangled pair available shared between the quantum processors when it is needed. Experimental results demonstrating deterministic delivery of entanglement using NV-centers in diamond as qubits have been shown in [48], generating heralded entanglement at a rate of 39 Hz, three orders of magnitude better than previous known results and guaranteeing an entangled pair every 100 ms with fidelity greater than 0.5 without pre- or postselection. Methods for improving the results further are also proposed. This gives evidence that using entanglement to perform distributed quantum computing can become more feasible using various qubit technologies. As technology regarding entanglement

#### Protocol 9: Contract Creation Protocol.

User

- 1: Assume it is known how many qubits exist on each available QPU for each QPU provider
- 2: Generate non-local circuits Section III-B
- 3: Request gate and classical communication latency times of gates from each QPU provider
- 4: Generate a gate execution schedule using Algorithm 7
- 5: Send executions schedules for the respective QPUs to the respective vendor along with current system time
- 6: Await confirmation messages from all vendors
- 7: If any vendor responds negatively, broadcast abort
- 8: Gather all latest start times and broadcast start time as the minimum over all latest start times
- 9: Await acknowledgements from all vendors, broadcast abort if any do not arrive, otherwise broadcast start signal

#### Vendor

- 1: Await request from user for gate times and respond accordingly
- 2: Await gate execution schedule
- 3: Validate that instructions can execute within allotted time frame for the user, respond to user if negative
- 4: If there are instructions with classical communication to an IP address, perform a handshake with other IP, respond to user if negative
- 5: If there are instructions with entanglement, perform entanglement validation procedure in Protocol 8, respond to user if negative
- 6: With other IPs, perform clock synchronization, respond to user if negative
- 7: When all checks pass, respond positively to user with latest possible start time of execution adjusted for user system time difference
- 8: Await start time confirmation, send acknowledgement to user, and await for final acknowledgement from user

generation and qubit stability improves, the deterministic entanglement generation rate can be improved.

# 4) CUSTOMER-VENDOR CERTIFICATION

In the case of a decentralized controller, additional protocols are required to ensure that the all parties are able to execute the distributed quantum algorithms and an execution schedule can be made such that nonlocal operations are performed synchronously. In this setting, the user has no control over the quantum hardware and, therefore, a protocol for ensuring the user's instructions can be executed is needed. We write in Protocol 9 a protocol for creating contracts between vendors and the user to ensure the desired instructions are carried out as specified.



FIGURE 15. DeltaGraph laid out on top of distributed quantum hardware. The DeltaGraph controls the distributed computations performed by QPUs 1 and 2. The user enters the parameters to the centralized controller which initializes execution.

# D. DELTAFLOW AS A NETWORKED CONTROL SYSTEM

To orchestrate a centralized control system, we propose a scheme that uses Deltaflow.OS as the control. Deltaflow is based on the dataflow programming paradigm described in Section II-D. The Deltaflow language is a Python-based language that allows users to specify graphs and edges representing the dataflow between them. It is a hosted domainspecific language: the nodes are filled with code corresponding to the hardware that is represented. A Deltaflow.OS is a tool for running Deltaflow programs on a specific piece of hardware. When given a Deltaflow program it performs compilation steps to transform it into chunks that run on the native hardware, runs those chunks, and exists in parallel to provide system services, networking and communication abstractions, and time sharing facilities. A Deltaflow.OS provides the same functionality as an OS and compiler combination like Linux+GCC [49].

# 1) CENTRALIZED CONTROL

As discussed, Deltaflow uses nodes in a graph to control the flow of information within a hardware network. The nodes contain the logic of what instructions to perform when an input message is received. In this section, we describe the highest level nodes that are used and their instruction logic to conduct a distributed  $\alpha$ -VQE when there is a single centralized control. Within the high-level nodes can be more nodes but these nodes are hardware specific, and we leave this open for future work. In Fig. 15, we depict a Deltaflow graph laid on top of the hardware. There are four unique nodes in the DeltaGraph: Controller, Classical Communication, Quantum Gates, and Time Reference. We describe the controller responsibilities and the controlled node tasks below. *Controller Node:* The controller node is the main interface between the user and the distributed quantum hardware. This interaction closely resembles that of a distributed operating system as discussed in Section II-C. The controller receives the algorithm parameters from the user, namely, the user sends the Ansatz preparation circuit designed for a single QPU and a Hamiltonian. Once received, the controller node handles the following.

- Takes user specified Ansatz preparation circuit and decomposed Hamiltonian, and confirms execution parameters.
- 2) When confirmed, the Hamiltonian is distributed across the QPUs and a gate execution schedule is made.
- 3) The schedule is split according to the locality of the execution, i.e., the entire instruction set is not sent to each node, but rather just the parts that are executed by that particular node.
- 4) Once all instructions are sent, the controller listens for incoming messages from any node in the DeltaGraph signaling an error. If such an error is received, then the execution process is aborted and the user is informed.
- 5) Once scheduled run time has elapsed, the controller listens for measurement results from the QGNs and sends them to the user.

Classical Communication Node (CCN): The CCN handles listening and sending classical data, that is information encoded into bits, to other QPUs in the network and collecting and forwarding classical control information from and to the quantum gates node (QGN) to perform any nonlocal control operations. The CCN receives the precise times for when to listen and when to transmit in the compiled instruction set from the controller. When time, the CCN communicates with the QGN to collect measurement results of the entangled qubits needed to perform the nonlocal gates and transmits the information to the paired OPU. The paired OPU will have its own CCN, which will be listening for classical input and will know what to do with the information based on the predetermined instructions. Each CCN communicates with the controller to receive instructions at the beginning of execution. CCNs can also inform the controller of any failures in communication so that the controller can abort the run-time process.

*Quantum Gates Node:* The QGN controls the quantum hardware and controls the gate operations performed on and measurements of the qubits. It communicates to the CCN and the controller. The communication link with the CCN is used for sending the measurement results of the entangled qubits in the cat-entangler and cat-disentagler steps to the respectively paired QPU. With the controller, communication to receive instructions is needed as well as to transmit the qubit measurements at the end of the algorithm. Moreover, the QGN conducts entanglement half of an entanglement generation scheme, controlling the particular hardware as per the instructions and entanglement generation protocol selected.



FIGURE 16. DeltaGraph for a decentralized control distributed quantum computer.

*Time Reference Node (TRN):* The TRN is a node used to maintain synchronous time within the DeltaGraph. The duty of the TNR is simply to periodically broadcast the system time so that all nodes in the network can update their own clocks to fix any clock drift that occurs during run time. Latencies from the TRN to the nodes can be accounted for, improving the overall resolution and precision of the timing reference.

#### 2) DECENTRALIZED CONTROL

In the decentralized controller setting, the DeltaGraph is split between the QPUs in the network. Each QPU has their own DeltaGraph and the intercommunication is handled by the nodes in the DeltaGraph. In this case, there needs to be an agreed upon contract between each participating party, which we have provided in Protocol 9. Here, for simplicity, we assume that there are no malicious parties and no eavesdroppers. Unlike the centralized case in the previous section, each remote QPU has its own controller and time reference node. The DeltaGraph in this case is depicted in Fig. 16. The main differences are that the independent time reference nodes allows different clock hardware for each vendor. Different mechanisms could be in place to adjust the time at each vendor, which is controlled by the respective time reference node. Time reference nodes are connected such that at each vendor they can perform different preselected clock synchronization protocols. The controller nodes' main tasks are interfacing with the user and orchestrating the instructions provided by the user for the specific hardware of the vendor. The controller for each vendor interfaces with the user similarly how it is described in the alternative method of deploying a distributed operating system in Section II-C, where the user is aware of the network topology and hardware capabilities of each node.

*Controller Node:* Each controller node has the responsibility of interfacing with the user and distributing the execution instructions to the other nodes. Much of the responsibilities of the controller in this case match that of the centralized controller case, but here the controller is only aware of the instructions that occur on the hardware in one QPU stack rather than the entire instruction set. In the centralized case, it was the duty of the controller to map the Ansatz circuit and schedule the Ansäzte to the distributed quantum hardware. In this case, that duty is moved to the user. The user queries QPU vendors, gathering the necessary hardware parameters in order to create a schedule that is executed on each vendor's hardware. The point of this is that the user will have more control over how many qubits they wish to run on each vendor's hardware. Variables such as cost of execution, time duration of execution, and vendor availability can be integrated into the user's decision when distributing their circuits across multiple vendors. The controller here can choose to reject or accept instruction sets depending on the vendor's hardware and availability. We summarize the responsibilities of the controller as follows.

- 1) Responds to user queries regarding hardware capabilities of the QPU as described in Protocol 9, orchestrating the handshake steps for classical communication and entanglement distribution.
- 2) Once start time and instructions are gathered, the instructions are distributed to the respective nodes in the DeltaGraph.
- 3) Time reference is sent to the nodes in the DeltaGraph via the controller in this case to simplify physical connections to any external clocks.
- 4) Transmitting measurement results to back the user.

*Classical Communications Node:* At a high level, the CCN performs much the same as in the centralized controller case. There may be added complexity in the low-level details depending on the selected protocols used for communication.

*Quantum Gates Node:* The QGN performs the much of same tasks as in the centralized case. A difference to consider is that each vendor could run their own centralized control distributed quantum system. Here, the quantum gate controller may need to be modified to perform much of the tasks of the centralized control as in the previous section.

*Time Reference Node:* The TRN is distinct for each QPU, different than in the centralized case where there is only one TRN. Here, the TRNs communicate with each other between DeltaGraphs in order to maintain clock synchronization in their respective graphs.

# **V. CONCLUSION AND OUTLOOK**

In summary, we have explored how the generalized VQE algorithm  $\alpha$ -VQE can be distributed across arbitrary sized quantum computers connected with entanglement and classical communication networks. We proposed various approaches for splitting the Ansatz across the distributed system and provided algorithms for splitting the circuitry needed to split perform  $\alpha$ -QPE, the central algorithm around  $\alpha$ -VQE. We show in our analysis that with this approach, larger Ansatz stats used on distributed systems at the cost of run

time. Next, we explore how one could network together a distributed quantum computer using two different architectures and we collect the necessary protocols needed to achieve this. We finish with a network control proposal using the Deltaflow.OS, the software-based control system for controlling quantum systems at the various classical and quantum hardware levels.

What remains open is to implement this system at a small scale, using both simulation and physical systems. Initial effort has been made in this direction, but more work is needed to have a complete proof of concept. Already, an implementation using quantum network simulator QuNetSim [50] and Deltaflow together has shown distributed quantum computing can be simulated. More, Deltaflow has been shown to work with distributed circuit boards, which leads to the first steps of simulating distributed quantum algorithms. The methods for distributing the Ansatz states could consider more parameters for further optimization, such as the coherence times of the qubits and location of the control gates in the circuits. We aim to explore this in more detail in future work.

As discussed, distributed quantum computing is a promising path to developing large-scale quantum computers. Much effort has gone into this in the classical computing domain, and the overlap between the fields is high. We can use this knowledge to design robust and secure distributed quantum computers, and as quantum technologies improve, this will surely become a reality.

#### APPENDIX A ALGORITHMS

| Algorithm 8: Does Not Fit.                                                                 |
|--------------------------------------------------------------------------------------------|
| Input:                                                                                     |
| • $Q = [q_1,, q_n]$ the collection of QPUs in the distributed                              |
| system, non-increasingly sorted by number of available qubits                              |
| • A the size of the Ansatz                                                                 |
| <b>Output:</b> If the Ansatz can fit in the distributed QPU $Q$                            |
| <b>DoesNotFit</b> $(Q, A)$                                                                 |
| <b>if</b> $Q$ is empty :                                                                   |
| return true                                                                                |
| for $q_i \in Q$ do                                                                         |
| $curAllocation \leftarrow [0,, 0] $ $\triangleright$ Allocate <i>n</i> element array of 0s |
| $possibleQPUs \leftarrow \{q_1,, q_i\} \subseteq Q$                                        |
| <b>if</b> $i = 1$ :                                                                        |
| $k \leftarrow \mathbf{QPUNumber}(possibleQPUs[1])  \triangleright \text{ The QPU index}$   |
| $  curAllocation[k] \leftarrow q_1 - 1$                                                    |
| else                                                                                       |
| $ \begin{bmatrix} \\ \\ \\ \\ \\ \\ \\ \\ \\ \\ \\ \\ \\ \\ \\ \\ \\ \\ $                  |
| index                                                                                      |
| $curAllocation[k] \leftarrow q_1 - 3$                                                      |
| for $q_j \in \{q_2, \dots, q_i\}$ do                                                       |
| $k \leftarrow QPUNumber(possibleQPUs[j]) \triangleright The QPU index$                     |
| $    curAllocation[k] \leftarrow q_j - 2$                                                  |
| end for                                                                                    |
| <b>If sum</b> (cur Allocation) $\geq A$ :                                                  |
| <b>return</b> <i>false</i> ▷ The Ansatz does nt                                            |
| end for                                                                                    |
| return true                                                                                |

#### APPENDIX B CONTROL SYSTEM COMMANDS

#### TABLE 4. Commands for QPU schedule

| Command                                      | Description                               |
|----------------------------------------------|-------------------------------------------|
|                                              | Two qubit gate $G$ on qubits with         |
| TWO OUBIT[C all all ]                        | memory IDs $qID_1, qID_2$ , where         |
| $[I W O \_ Q U B I I [G, q I D_1, q I D_2]]$ | $qID_1$ is the control if needed. $qID_1$ |
|                                              | is potentially a classical register.      |
| SINGLE[G, qID]                               | A single qubit gate $G$ applied to        |
|                                              | qubit with memory ID qID                  |
| $GEN\_ENT[qID_1,qID_2]$                      | Generate entanglement and                 |
|                                              | store it in the qIDs $qID_1, qID_2$       |
| REC_ENT[QPU, qID]                            | Receive entanglement half and             |
|                                              | store it in at memory qID                 |
| SEND_ENT[QPU, qID]                           | Send entanglement half and keep           |
|                                              | other half in qubit memory qID            |
| CLASSICAL[cID]                               | Transmission of classical bit in          |
|                                              | classical memory with ID cID              |
| SEND_CLA[QPU, cID]                           | Send classical bit in classical           |
|                                              | memory with ID cID                        |
| $REC\_CLA[QPU, cID]$                         | Receive classical bit and store in        |
|                                              | classical memory with ID cID              |

#### REFERENCES

- J. Gambetta, "IBM's roadmap for scaling quantum technology," IBM Research Blog, vol. 2, Nov. 2020. [Online]. Available: www.ibm.com/blogs/ research/2020/09/ibm-quantum-roadmap/
- [2] J. P. Dowling, Schrödinger's Killer App: Race to Build the World's First Quantum Computer. Boca Raton, FL, USA: CRC Press, 2013.
- [3] K. Hartnett, "A new law to describe quantum computing's rise?," Jun. 18, 2019. [Online]. Available: *Quanta Mag.* https://go.nature.com/32Yf0qh
- [4] A. Peruzzo et al., "A variational eigenvalue solver on a photonic quantum processor," Nat. Commun., vol. 5, no. 1, pp. 1–7, 2014, doi: 10.1038/ncomms5213.
- [5] F. Arute *et al.*, "Hartree-fock on a superconducting qubit quantum computer," *Science*, vol. 369, no. 6507, pp. 1084–1089, 2020, doi: 10.1126/science.abb9811.
- [6] D. Wang, O. Higgott, and S. Brierley, "Accelerated variational quantum eigensolver," *Phys. Rev. Lett.*, vol. 122, no. 14, 2019, Art. no. 140504, doi: 10.1103/PhysRevLett.122.140504.
- [7] A. Yimsiriwattana and J. Lomonaco, "Generalized Ghz states and distributed quantum computing," *Amer. Math. Soc.*, vol. 381, pp. 131–147, Mar. 2004, doi: 10.1090/comm/381/07096.
- [8] J. Eisert, K. Jacobs, P. Papadopoulos, and M. B. Plenio, "Optimal local implementation of nonlocal quantum gates," *Phys. Rev.*, vol. 62, no. 5, 2000, Art. no. 052317, doi: 10.1103/PhysRevA.62.052317.
- [9] A. Yimsiriwattana and S. J. Lomonaco Jr., "Distributed quantum computing: A distributed shor algorithm," in *Quantum Inf. Comput. II, Int. Soc. Opt. Photon.*, 2004, vol. 5436, pp. 360–372, doi: 10.1117/12.546504.
- [10] S. S. Tannu, Z. A. Myers, P. J. Nair, D. M. Carmean, and M. K. Qureshi, "Taming the instruction bandwidth of quantum computers via hardwaremanaged error correction," in *Proc. 50th Annu. IEEE/ACM Int. Symp. Microarchitecture*, 2017, pp. 679–691, doi: 10.1145/3123939.3123940.
- [11] X. Fu et al., "A heterogeneous quantum computer architecture," in Proc. ACM Int. Conf. Comput. Frontiers, 2016, pp. 323–330, doi: 10.1145/2903150.2906827.
- [12] J. Cirac, A. Ekert, S. Huelga, and C. Macchiavello, "Distributed quantum computation over noisy channels," *Phys. Rev.*, vol. 59, no. 6, pp. 4249– 4254, 1999, doi: 10.1103/PhysRevA.59.4249.
- [13] R. Van Meter, K. Nemoto, and W. Munro, "Communication links for distributed quantum computation," *IEEE Trans. Comput.*, vol. 56, no. 12, pp. 1643–1653, Dec. 2007, doi: 10.1109/TC.2007.70775.
- [14] V. S. Denchev and G. Pandurangan, "Distributed quantum computing: A new frontier in distributed systems or science fiction?," *SIGACT News*, vol. 39, pp. 77–95, Sep. 2008, doi: 10.1145/1412700.1412718.

- [15] R. V. Meter, W. Munro, K. Nemoto, and K. M. Itoh, "Arithmetic on a distributed-memory quantum multicomputer," ACM J. Emerg. Technol. Comput. Syst., vol. 3, no. 4, pp. 1–23, 2008, doi: 10.1145/1324177.1324179.
- [16] R. Beals *et al.*, "Efficient distributed quantum computing," *Proc. Roy. Soc. A, Math., Phys. Eng. Sci.*, vol. 469, no. 2153, 2013, Art. no. 20120686, doi: 10.1098/rspa.2012.0686.
- [17] D. Cuomo, M. Caleffi, and A. S. Cacciapuoti, "Towards a distributed quantum computing ecosystem," *IET Quantum Commun.*, vol. 1, no. 1, pp. 3–8, doi: 10.1049/iet-qtc.2020.0002.
- [18] A. Broadbent, J. Fitzsimons, and E. Kashefi, "Universal blind quantum computation," in *Proc. 50th Annu. IEEE Symp. Found. Comput. Sci.*, 2009, pp. 517–526, doi: 10.1109/FOCS.2009.36.
- [19] Y. Shi et al., "Optimized compilation of aggregated instructions for realistic quantum computers," in Proc. 24th Int. Conf. Architectural Support Program. Lang. Operating Syst., 2019, pp. 1031–1044, doi: 10.1145/3297858.3304018.
- [20] J. R. Cruise, N. I. Gillespie, and B. Reid, "Practical quantum computing: The value of local computation," 2020, arXiv:2009.08513.
- [21] A. S. Tanenbaum, *Distributed Operating Systems*. Noida, India: Pearson Education India, 1995.
- [22] A. S. Tanenbaum and H. Bos, *Modern Operating Systems*, 2nd ed. London, U.K.: Pearson, 2001.
- [23] A. S. Tanenbaum and M. Van Steen, *Distributed Systems: Principles and Paradigms*, 2nd ed. Englewood Cliffs, NJ, USA: Prentice-Hall, 2007.
- [24] W. M. Johnston, J. P. Hanna, and R. J. Millar, "Advances in dataflow programming languages," ACM Comput. Surv., vol. 36, no. 1, pp. 1–34, 2004, doi: 10.1145/1013208.1013209.
- [25] J. Romero, R. Babbush, J. R. McClean, C. Hempel, P. J. Love, and A. Aspuru-Guzik, "Strategies for quantum computing molecular energies using the unitary coupled cluster ansatz," *Quantum Sci. Technol.*, vol. 4, no. 1, 2018, Art. no. 014008, doi: 10.1088/2058-9565/aad3e4.
- [26] A. Y. Kitaev, "Quantum computations: Algorithms and error correction," *Russian Math. Surv.*, vol. 52, no. 6, pp.1191–1249, 1997.
- [27] M. Saeedi and M. Pedram, "Linear-depth quantum circuits for n-qubit Toffoli gates with no ancilla," *Phys. Rev.*, vol. 87, no. 6, pp. 1191–1249, 2013, doi: 10.1070/RM1997v052n06ABEH002155.
- [28] Y. He, M.-X. Luo, E. Zhang, H.-K. Wang, and X.-F. Wang, "Decompositions of n-qubit Toffoli gates with linear circuit complexity," *Int. J. Theor. Phys.*, vol. 56, no. 7, pp. 2350–2361, 2017, doi: 10.1007/s10773-017-3389-4.
- [29] D. Maslov, "Advantages of using relative-phase Toffoli gates with an application to multiple control Toffoli optimization," *Phys. Rev.*, vol. 93, no. 2, 2016, Art. no. 022311, doi: 10.1103/PhysRevA.93.022311.
- [30] R. Babbush et al., "Low-depth quantum simulation of materials," Phys. Rev. X, vol. 8, no. 1, 2018, Art. no. 011044, doi: 10.1103/PhysRevX.8.011044.
- [31] K. Michielsen, M. Nocon, D. Willsch, F. Jin, T. Lippert, and H. De Raedt, "Benchmarking gate-based quantum computers," *Comput. Phys. Commun.*, vol. 220, pp. 44–55, 2017doi: 10.1016/j.cpc.2017.06.011.
- [32] J. T. Seeley, M. J. Richard, and P. J. Love, "The Bravyi–Kitaev transformation for quantum computation of electronic structure," *J. Chem. Phys.*, vol. 137, no. 22, 2012, Art. no. 224109, doi: 10.1063/1.4768229.
- [33] V. Bergholm *et al.*, "Pennylane: Automatic differentiation of hybrid quantum-classical computations," 2018, arXiv:1811.04968.
- [34] J. P. Hespanha, P. Naghshtabrizi, and Y. Xu, "A survey of recent results in networked control systems," *Proc. IEEE Proc. IRE*, vol. 95, no. 1, pp. 138–162, 2007, doi: 10.1109/JPROC.2006.887288.
- [35] T. C. Yang, "Networked control system: A brief survey," *IEE Proc.-Control Theory Appl.*, vol. 153, no. 4, pp. 403–412, 2006, doi: 10.1109/ICMA.2015.7237484.
- [36] M. S. Mahmoud and Y. Xia, Networked Control Systems: Cloud Control and Secure Control. Oxford, U.K.: Butterworth-Heinemann, 2019.
- [37] K. E. C. Booth, M. Do, J. C. Beck, E. G. Rieffel, D. Venturelli, and J. Frank, "Comparing and integrating constraint programming and temporal planning for quantum circuit compilation," in *Proc. 28th Int. Conf. Automated Planning Scheduling*, 2018, pp. 366–374.
- [38] A. Swales et al., "Open MODBUS/TCP specification," Schneider Electric, vol. 29, pp. 15–101, 1999.

- [39] WhichNas, "10Gbase-T Vs. SFP+ Which is the best 10G network solution for small businesses," WhichNAS, Jun. 2020. [Online]. Available: whichnas.com/10gbase-t-vs-sfp-which-is-the-best-10gnetwork-solution-for-small-businesses/
- [40] K. Behrendt and K. Fodero, "Implementing mirrored bits technology over various communications media," SEL Appl. Guide, vol. 12, 2001.
- [41] T. A. Youssef, M. M. Esfahani, and O. Mohammed, "Data-centric communication framework for multicast IEC 61850 routable goose messages over the wan in modern power systems," *Appl. Sci.*, vol. 10, no. 3, 2020, Art. no. 848, doi: 10.3390/app10030848.
- [42] A. Apostolov, "R-goose: What it is and its application in distribution automation," *CIRED-Open Access Proc. J.*, vol. 2017, no. 1, pp. 1438– 1441, 2017, doi: 10.1049/oap-cired.2017.0494.
- [43] M. Lipiński, T. Włostowski, J. Serrano, and P. Alvarez, "White Rabbit: A PTP application for robust sub-nanosecond synchronization," in *Proc. IEEE Int. Symp. Precision Clock Synchronization Meas.*, Control, Commun., 2011, pp. 25–30, doi: 10.1109/ISPCS.2011.6070148.
- [44] T. Włostowski, Precise time and frequency transfer in a white rabbit network. Ph.D. thesis, Instytut Radioelektroniki, New York, NY, USA, 2011.
- [45] S. Fukuda et al., "The super-kamiokande detector," Nuclear Instrum. Methods Phys. Res. Sect. A, Accel., Spectrometers, Detectors Assoc. Equip., vol. 501, no. 2/3, pp. 418–462, 2003, doi: 10.1016/S0168-9002(03)00425-X.
- [46] C. R. Contaldi, "The opera neutrino velocity result and the synchronisation of clocks," 2011, arXiv:1109.6160.
- [47] B. C. Sanders, "Review of entangled coherent states," J. Phys. A, Math. Theor., vol. 45, no. 24, 2012, Art. no. 244002, doi: 10.1088/1751-8113/45/24/244002.
- [48] P. C. Humphreys *et al.*, "Deterministic delivery of remote entanglement on a quantum network," *Nature*, vol. 558, no. 7709, pp. 268–273, 2018, doi: 10.1038/s41586-018-0200-5.
- [49] Riverlane, "Deltaflow," Sep. 2020.
- [50] S. DiAdamo, J. Nötzel, B. Zanger, and M. M. Beşe, "Qunetsim: A software framework for quantum networks," 2020, arXiv:2003.06397v4.



**Stephen DiAdamo** received the B.Sc. (hons.) degree in computer science and software engineering from the University of Toronto, Toronto, ON, USA, in 2014, and the M.Sc. degree in mathematics in 2018 from the Technical University of Munich, Munich, Germany, where he is currently working toward the Ph.D. degree in electrical engineering.

His research interests include quantum networks, quantum information theory, and distributed quantum computing.

Mr. DiAdamo was the recipient of the awarded two microgrants from the Unitary Fund program in order to assist in the development of software libraries related to this work.



**Marco Ghibaudi** received the Ph.D. degree in innovative technologies from CERN/Scuola Superiore SantAnna, Pisa, Italy, in 2014.

He is currently an FPGA expert with CERN/Scuola Superiore Sant'Anna Pisa. As part of his Ph.D., he contributed to the design of the CERN ATLAS experiment back-end electronics and to the development of prototypes used in Automative, IoT, and Fibre-optic Communications. In the past five years, he has worked in multinational companies specializing

in systems on chip and consumer electronics.



James Cruise received the Ph.D. degree in applied probability from the University of Cambridge, Cambridge, U.K., in 2008.

He is currently a mathematician who works on the application of statistical and probabilistic techniques to real-world systems. He has held research positions with Bristol University and Heriot Watt University. His research interests include Internet traffic modeling, social networks, and most recently power systems.