Received March 9, 2020, accepted March 24, 2020, date of publication April 7, 2020, date of current version May 20, 2020.

Digital Object Identifier 10.1109/ACCESS.2020.2986335

# **Early-Stage Planning of Switched-Capacitor Converters in a Heterogeneous Chip**

LEILEI WANG<sup>1,2,3</sup>, LU WANG<sup>1,2,3</sup>, CHENG ZHUO<sup>4</sup>, (Senior Member, IEEE), AND PINGQIANG ZHOU<sup>1</sup>, (Member, IEEE)

<sup>1</sup>School of Information Science and Technology, ShanghaiTech University, Shanghai 201210, China

<sup>2</sup>Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences, Shanghai 200050, China

<sup>3</sup>School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100049, China

<sup>4</sup>College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou 310058, China

Corresponding author: Pingqiang Zhou (zhoupq@shanghaitech.edu.cn)

This work was supported in part by the National Natural Science Foundation of China under Grant 61401276.

The associate editor coordinating the review of this manuscript and approving it for publication was Poki Chen.

**ABSTRACT** The switched-capacitor converter (SCC) has been widely used for voltage regulation in multicore chips, where energy efficiency is the major concern. However, because the overhead of integrating SCCs in a chip is non-negligible, SCCs should not be overused. Hence, in this paper we propose an early-stage SCC planning framework that obtains the SCC supply scheme together with the optimized Metal-Insulator-Metal (MIM) capacitance allocation and conversion ratio selection for each SCC when the given number of SCCs is less than the number of cores. In addition, our method can explore the best number of SCCs to use for a given chip. The experiments show the results of our SCC planning methods.

**INDEX TERMS** Energy efficiency, switched-capacitor converter, multi-core, MIM capacitance.

# I. INTRODUCTION

By fully exploiting the unique advantages of different types of cores (CPU, GPU, accelerators, etc.), a state-of-the-art heterogeneous multi-core chip can achieve both high performance and high energy efficiency [1]–[6]. For example, the Apple A12 processor [5] has two high-performance CPU cores, four energy-efficient CPU cores, four GPU cores and eight neural engine cores; the Kirin 980 processor [6] has two high-performance CPU cores, two medium-performance CPU cores, four efficiency CPU cores, one GPU core and two neural processor cores.

The heterogeneous chip is typically divided into several power domains [7], and each domain can be powered individually by integrated on-chip voltage regulators such as switched-capacitor converters (SCCs) [8], [9], inductive switching regulators [7], [10] and LDOs [11], [12] (see Fig. 1 for an example). Table 1 compares the different types of voltage regulators. The main problem with the inductive switching regulator is that the inductor cannot easily be integrated on the chip and is usually manufactured on the package. The SCC has advantages such as a wide output voltage range, high energy conversion efficiency and high power density. Therefore, it has been widely studied and applied in recent processors [13]–[15].

FIGURE 1. A heterogeneous chip powered by distributed SCCs with different ratios. Note that the SCCs share the same MIM capacitance resource over the chip area.

TABLE 1. The comparison of different voltage regulators. "Step up/down" means output voltage is higher or lower than the input source voltage (i.e., Vdd).

| | LDO | Inductive Switching Regulator | SCC |
|---------------------|------------------------------|-------------------------------|-----------|
| Electronic Element | Error Amp.<br>Differentiator | Inductor | Capacitor |
| Step Up/Down? | No/Yes | No/Yes | Yes/Yes |
| On-chip Integration | Easy | Hard | Easy |
| Design Complexity | Low | Low | High |
| Energy Efficiency | Low | High | High |

Fig. 2 shows a 3:1 step-down SCC in which a high input voltage is converted to a lower output voltage by a series of flying capacitors $C_{sw}$ and switches $\phi_1/\phi_2$. During phase $\phi_1$ the flying capacitors $C_{sw}$ are charged by the input power supply, and they deliver power to the output during phase $\phi_2$, where $\phi_1$ and $\phi_2$ are non-overlapping clock signals with a frequency of $f_{sw}$. The charging and discharging behavior of the flying capacitors results in a supply voltage ripple $\Delta V$ at the output. Although only a 3:1 SCC is shown in this figure, other conversion ratios can be achieved with different topologies [16].

FIGURE 2. A 3:1 SCC and its output voltage ripple $\Delta V$ caused by the two-phase operation.

Energy conversion efficiency is critical for the SCC design [17]–[20]. Many prior works have investigated the optimization of the SCC design to achieve better energy efficiency, and have proposed techniques such as tuning the size of the flying capacitors, the operating frequency and the switch width [17], [21]–[26].
However, few works optimize the energy efficiency of SCCs holistically in a multi-core chip. In [9], the authors improve the overall energy efficiency of a many-core system by dynamically adapting the switching frequency $f_{sw}$ of each SCC to its specific output load. The authors in [26] aim not only to achieve the highest energy efficiency but also to suppress supply noise by optimizing the allocation of limited die area between flying capacitance and decoupling capacitance. In [19], the authors propose a system-level efficiency model which characterizes the number, size and distribution of the SCCs, and they solve the optimization problem by mathematical optimization methods.

Although the SCC is a promising technique for implementing fine-granularity power management in a multicore chip, we should not overuse SCCs in a given chip, since integrating them requires control circuits, routing resources, clock-signal power, chip area and so on [19]. On the other hand, in a heterogeneous multi-core chip, the cores with close voltage demands (i.e., their voltage demands differ only slightly) are usually placed together in the physical layout [7]. This implies that, in theory, the SCCs in a domain with a higher voltage supply can also be shared by the cores in an adjacent domain with a slightly lower voltage demand. In this way, we can potentially reduce the number of SCCs used in a given chip. Therefore, in our work, we explore the best planning strategy for the SCCs in a heterogeneous chip.

(a) Case 1: $\eta=80.4\%$; $SCC_1$ supplies the power to core 1, and $SCC_2$ supplies that to core 2, core 3 and core 4.<sup>1</sup>

(b) Case 2: $\eta=82.3\%$; $SCC_1$ supplies the power to core 1 and core 2, and $SCC_2$ supplies that to core 3 and core 4.

FIGURE 3. Two SCC supply schemes lead to different efficiency, which motivates us to propose a smart supply scheme.
More specifically, our work tries to optimize the energy efficiency of an SCC-powered heterogeneous chip from the following two important aspects:

- The SCC supply scheme. Let us examine the two cases shown in Fig. 3. In both cases, two SCCs are used to power a four-core chip (the detailed load information can be seen in Table 4 in Section VI), while Case 2 achieves higher efficiency with a different supply scheme (the mapping between the SCCs, i.e., the supply side, and the cores, i.e., the demand side). When several cores are supplied by one SCC, the cores with a lower demand voltage are over-supplied, which leads to extra power consumption, and different supply schemes incur different amounts of extra power consumption. This motivates us to develop a smart SCC supply scheme for better energy efficiency. In addition, we also explore the number of SCCs used in a chip.
- The capacitance allocation and ratio selection of SCCs. The energy efficiency of the SCC is related to its switching capacitance and conversion ratio (see Section III). Recently, Metal-Insulator-Metal (MIM) capacitors have been utilized as flying capacitors for SCCs [8], [18], [27] (see the MIM capacitance in Fig. 1), and more than one conversion ratio is available for an SCC to supply its loads, provided that the ideal output voltage of the conversion ratio is greater than the demanded voltage. Table 2 shows two cases with different capacitance allocations and conversion ratios for the SCCs in Fig. 3, and their energy efficiency differs. Since a different capacitance and ratio in each SCC lead to a different power loss under the given SCC loss mechanism (see Section III), the capacitance allocation and ratio selection affect the energy efficiency.<sup>2</sup> Hence, given a supply scheme, we also need to optimize the capacitance allocation and conversion ratio of the SCCs to improve the efficiency.

<sup>1</sup>In Case 1, $SCC_2$ supplies Core 2, Core 3 and Core 4 simultaneously, so the output voltage of $SCC_2$ should be no less than the maximal voltage demand among the three cores (i.e., 0.90V is output to all three cores). Hence, in Case 1, Core 2 and Core 3 have an over-supplied voltage which leads to extra power consumption, and the system has an extra power loss of (0.90-0.68)\*0.10+(0.90-0.82)\*0.30=0.046W (see Section IV-B for details). In contrast, in Case 2 the extra power loss is (0.68-0.55)\*0.08+(0.90-0.82)\*0.30=0.0344W.

TABLE 2. Different capacitance allocation and ratio selection would lead to different efficiency (Vdd=1.2V). The efficiency $\eta$ is given once per case.

| #SCC | Case 1: $C_{sw}$ (nF) | Case 1: Ratio | Case 1: $\eta$ | Case 2: $C_{sw}$ (nF) | Case 2: Ratio | Case 2: $\eta$ |
|---------|-----|-----|-------|-----|-----|-------|
| $SCC_1$ | 7 | 2:1 | 75.3% | 4 | 3:2 | 81.9% |
| $SCC_2$ | 6 | 4:3 | | 9 | 1:1 | |

Motivated by the aforementioned observations, in this work we propose an early-stage SCC planning framework to improve the energy efficiency of multi-core chips when the number of SCCs is less than the number of cores. Note that when the number of SCCs equals the number of cores, each SCC can supply the power for one core by converting the global voltage to the specific demand voltage of that core. In this way, the finest-grained power management is achieved. However, the overhead associated with integrated SCCs (such as the control circuit and routing resource) is non-negligible [19]. As a result, if there is a limited budget for integrating SCCs, we should integrate fewer SCCs (i.e., the number of SCCs is less than the number of cores).
The advantage of this scenario is that the overhead of integrating SCCs is reduced. The disadvantage is that, since some cores with different demand voltages are supplied by the same SCC, which must output the highest demand voltage among them, there is some extra power loss due to the over-supply of some cores. We also provide a method to determine how many SCCs should be used to obtain the highest efficiency for the system.

The rest of this paper is organized as follows. We formulate the problem in Section II. In Section III, we introduce the basic loss mechanism of SCCs. The proposed SCC planning framework, consisting of the SCC supply scheme and the SCC optimization, is introduced in Section IV. We then show the experimental results in Section VI. Finally, we conclude in Section VII.

<sup>2</sup>In Case 1, the MIM capacitance allocated to each SCC ($C_{sw}$) is proportional to the area of the cores that this SCC supplies, and the selected ratio is the one whose no-load output voltage is minimal but larger than the demand of the supplied cores. According to the SCC loss equation (see Section III-A), the loss is $P_{scc} = P_{scc1} + P_{scc2} = (e_{1,scc1} \cdot C_{sw,scc1} + \frac{e_{2,scc1}}{C_{sw,scc1}}) + (e_{1,scc2} \cdot C_{sw,scc2} + \frac{e_{2,scc2}}{C_{sw,scc2}})$, where $e_1$ and $e_2$ are the ratio-determined parameters. From this equation, we can see that the ratio-determined parameters significantly affect the loss model and that, once the values of $e_1$ and $e_2$ are given, the capacitance allocated to each SCC also affects the efficiency. Consequently, the MIM capacitance allocation and the selected ratios in Case 1 may not lead to the minimal loss. In contrast, the MIM capacitance allocation and ratio selection obtained with our proposed method (introduced in Section IV-C) achieve a higher efficiency.

# II. PROBLEM FORMULATION

In [7], Intel proposed a one-regulator-per-core scheme to achieve the finest power management for a given chip. However, considering the overhead associated with distributing a large number of regulators over the chip, in this work we focus on the scenario in which the number of available SCCs is limited (less than the number of cores). The problem can be formulated as follows.

Given

1) the layout of an M-core heterogeneous chip,
2) the minimal supply voltage and maximal demand current of each core,
3) the available number of SCCs $\hat{M}$ ($1 \le \hat{M} < M$),
4) the total MIM capacitance for the flying capacitors of the SCCs, and
5) the available ratios for one SCC,

our work attempts to maximize the overall energy efficiency of the power supply system by finding

1) the best supply scheme for the specified $\hat{M}$ SCCs, together with
2) the amount of flying capacitance and the conversion ratio of each of the $\hat{M}$ SCCs.

In our work, we also explore the number $\hat{M}$ ($1 \le \hat{M} < M$) to find the best number of SCCs for the given M-core chip.

Note that, since MIM capacitors are designed and used as the flying capacitors in SCCs, works such as [18], [28] have studied design techniques to optimize the performance and quality of MIM capacitors, for example by optimizing the ESR (Equivalent Series Resistance). Hence, we do not take the ESR issue into consideration in this paper. Instead, we use the optimal parameters, such as the frequency, reported in [18] to significantly reduce the effect of the ESR on the MIM capacitors.

# III. ANALYSIS OF THE INHERENT POWER LOSSES OF SCCS

In this section, we briefly introduce and analyze the loss mechanism of SCCs, and then discuss the ratios and flying capacitance used in SCCs.

## A. THE LOSS MECHANISM OF THE SCCS

The switched-capacitor converter incurs several inherent losses when it delivers energy from the input side to the output side.
These non-negligible power losses include the conduction loss dissipated in the switches while the flying capacitors are charged, the gate-drive loss caused by driving the switches, and the load power loss caused by the output voltage ripple. Each kind of loss is illustrated in Fig. 4, and all of them are explained in [17], [19] and [22]. Our problem formulation in Section IV-C is based on these loss mechanisms, which are formulated in detail below.

FIGURE 4. The power is supplied to the core through the SCC. Three loss components are shown in different colours.

(1) Conduction loss. For a specific SCC topology, when the switches are turned on, charge is transferred to the flying capacitors through the switches, and part of the power is dissipated in the switches [19] as

$$P_{cond} = \frac{M_{sw} I_{out}^2 R_{on}}{\sigma \gamma C_{sw} f_{sw}} \tag{1}$$

Here, $M_{sw}$ is a topology-dependent parameter, $I_{out}$ is the load current of an SCC, and $R_{on}$ is the equivalent resistance density of a switch when it is on. $\sigma$ is a fitting parameter, $\gamma$ is a topology-dependent parameter, and $f_{sw}$ is the switching frequency of the SCC.

(2) Gate-drive loss. When the switches are turned on or off, the gate capacitors of these transistors are charged or discharged. The energy lost in each cycle [17] is $N_{phase} \cdot N_{sw} \cdot \left(C_{gate} W_{sw}\right) \cdot V_{dd}^2$, so the gate-drive loss [19] is

$$P_{gate} = N_{phase} \cdot N_{sw} \cdot f_{sw} \cdot (C_{gate} W_{sw}) \cdot V_{dd}^2$$

Since $W_{sw}$ is the cumulative width of the switches that are turned on and off in this period, it is also proportional to the frequency $f_{sw}$ and can be written as [19] $W_{sw} = \sigma \gamma f_{sw} C_{sw} / N_{phase}$. As a result, the gate-drive loss is proportional to $f_{sw}^2$.
Hence we have the gate-drive loss

$$P_{gate} = N_{sw} f_{sw}^2 V_{dd}^2 C_{gate} \sigma \gamma C_{sw} \tag{2}$$

where $N_{sw}$ represents the number of switches in an SCC and $C_{gate}$ is the per-unit-width gate capacitance of the switches.

(3) Load power loss. Due to the ripple of the SCC output voltage, the load power loss [19] is $P_{load} = \frac{1}{2}I_{out}\Delta V$. Since the load current of an SCC can be expressed as $I_{out} = M_{topo} \cdot f_{sw} \cdot C_{sw} \cdot N_{phase} \cdot \Delta V$, we have

$$\Delta V = \frac{I_{out}}{M_{topo} f_{sw} N_{phase} C_{sw}} \tag{3}$$

where $M_{topo}$ is topology-dependent and $N_{phase}$ is the number of interleaving stages in an SCC. Finally, the load power loss can be written as

$$P_{load} = \frac{I_{out}^2}{2M_{topo}f_{sw}N_{phase}C_{sw}} \tag{4}$$

As a result, the inherent power loss of one SCC is

$$P_{scc} = P_{cond} + P_{gate} + P_{load} = e_1 \cdot C_{sw} + \frac{e_2}{C_{sw}} \tag{5}$$

where

$$e_1 = N_{sw} f_{sw}^2 C_{gate} \sigma \gamma V_{dd}^2 \tag{6}$$

and

$$e_2 = I_{out}^2 \left(\frac{1}{2M_{topo}N_{phase}f_{sw}} + \frac{M_{sw}R_{on}}{\sigma \gamma f_{sw}}\right) \tag{7}$$

Here the parameters are divided into three types: 1) non-topology-dependent parameters: $f_{sw}$, $R_{on}$, $C_{gate}$, $N_{phase}$, $V_{dd}$ and $\sigma$; 2) topology-dependent parameters: $N_{sw}$, $M_{sw}$, $M_{topo}$ and $\gamma$; 3) other parameters: $I_{out}$ and $C_{sw}$.

TABLE 3. Topology-dependent parameters for different conversion ratios [17], [19]. $V_{nl}$ is the no-load output voltage of SCCs ($V_{in} = 1.2V$).

| Conversion Ratio | $V_{nl}$ | $N_{sw}$ | $M_{topo}$ | $M_{sw}$ | $\gamma$ |
|------------------|----------|----------|------------|----------|----------|
| 2:1 | 0.6V | 4 | 2 | 2 | 2 |
| 3:2 | 0.8V | 7 | 9/8 | 2 | 1 |
| 4:3 | 0.9V | 10 | 8/9 | 7/3 | 2/3 |
| 1:1 | 1.2V | 2 | 1/2 | 1 | 1 |
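The loss model of Equations (1)-(7) can be sketched in a few lines of Python to check that the three loss components indeed collapse into the form $e_1 C_{sw} + e_2/C_{sw}$. This is a minimal sketch, not the authors' implementation: the topology values are copied from Table 3, but every entry of the hypothetical `Tech` class is an arbitrary placeholder chosen only so that the code runs (the paper's actual values are those of its Table 5), and the names `Topology`, `Tech`, `loss_terms`, `e1_e2`, `p_scc` and `unconstrained_optimum` are introduced here for illustration. The last helper uses the standard fact that $e_1 C + e_2/C$ is minimized at $C = \sqrt{e_2/e_1}$ when no other constraint is active.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Topology:
    """Topology-dependent parameters of one conversion ratio (Table 3)."""
    n_sw: int
    m_topo: float
    m_sw: float
    gamma: float


# Values copied from Table 3.
TABLE3 = {
    "2:1": Topology(4, 2.0, 2.0, 2.0),
    "3:2": Topology(7, 9 / 8, 2.0, 1.0),
    "4:3": Topology(10, 8 / 9, 7 / 3, 2 / 3),
    "1:1": Topology(2, 0.5, 1.0, 1.0),
}


@dataclass(frozen=True)
class Tech:
    """Non-topology-dependent parameters.

    ASSUMPTION: arbitrary placeholder values, not physically calibrated
    and not taken from the paper.
    """
    f_sw: float = 100e6
    r_on: float = 1e-4
    c_gate: float = 1e-9
    n_phase: int = 4
    v_dd: float = 1.2
    sigma: float = 1.0


def loss_terms(topo, tech, i_out, c_sw):
    """Return (P_cond, P_gate, P_load) from Equations (1), (2) and (4)."""
    p_cond = (topo.m_sw * i_out**2 * tech.r_on
              / (tech.sigma * topo.gamma * c_sw * tech.f_sw))
    p_gate = (topo.n_sw * tech.f_sw**2 * tech.v_dd**2
              * tech.c_gate * tech.sigma * topo.gamma * c_sw)
    p_load = i_out**2 / (2 * topo.m_topo * tech.f_sw * tech.n_phase * c_sw)
    return p_cond, p_gate, p_load


def e1_e2(topo, tech, i_out):
    """Coefficients of Equation (5), following Equations (6) and (7)."""
    e1 = (topo.n_sw * tech.f_sw**2 * tech.c_gate
          * tech.sigma * topo.gamma * tech.v_dd**2)
    e2 = i_out**2 * (1 / (2 * topo.m_topo * tech.n_phase * tech.f_sw)
                     + topo.m_sw * tech.r_on
                     / (tech.sigma * topo.gamma * tech.f_sw))
    return e1, e2


def p_scc(e1, e2, c_sw):
    """Equation (5): inherent loss of one SCC."""
    return e1 * c_sw + e2 / c_sw


def unconstrained_optimum(e1, e2):
    """Capacitance minimizing Equation (5) when no constraint is active."""
    return math.sqrt(e2 / e1)


topo, tech = TABLE3["2:1"], Tech()
e1, e2 = e1_e2(topo, tech, i_out=0.1)
c_sw = 5e-9
# The sum of the three components equals the compact form of Equation (5).
print(math.isclose(sum(loss_terms(topo, tech, 0.1, c_sw)), p_scc(e1, e2, c_sw)))
```

Separating `Topology` from `Tech` mirrors the parameter grouping in the text: selecting another conversion ratio only swaps the `Topology` entry, which changes $e_1$ and $e_2$ while the technology parameters stay fixed.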
The three parameter types above show that the power loss depends on the topology of the SCC, namely on the switches (represented by the parameters $N_{sw}$ and $W_{sw}$) and on the flying capacitors (represented by the parameter $C_{sw}$). As the non-topology-dependent parameters such as $f_{sw}$ (see Table 5 in Section VI) are fixed and the load information $I_{out}$ is also known, the loss mechanisms depend only on the flying capacitance $C_{sw}$ and the topology-dependent parameters (see Table 3) of the SCCs. As a result, two quantities (flying capacitance and conversion ratio) can be optimized to improve the conversion efficiency of the SCCs.

## B. THE MIM CAPACITANCE USED IN SCCS

In fully integrated switched-capacitor converters, the flying capacitors can be implemented by baseline MOS capacitors [29], deep trench capacitors [30], Metal-Oxide-Metal (MOM) capacitors [20] and Metal-Insulator-Metal (MIM) capacitors [8]. The MOS capacitor has large parasitics, which significantly reduce the efficiency of the SCC [29], and its leakage is also serious [31]. The deep trench capacitor is implemented by dry-etching macro-pore arrays in silicon and filling the pores with the dielectric and electrode [32]. This technique is not part of baseline CMOS, which leads to many additional masks and higher costs [20]. The MOM capacitor (typical density 1.5fF $\sim$ $2.8fF/\mu m^2$ @65nm [33]) is fabricated in the lower metal layers of the chip, resulting in heavy capacitive coupling to the substrate [34]. In contrast, the MIM capacitor (typical density 1.6fF $\sim 1.9fF/\mu m^2$ @65nm [33]) is fabricated between one upper metal layer and one additional metal layer above it, resulting in small capacitive coupling to the substrate [34].
Although combinations such as MIM and MOS capacitors [13], [35] have been used together as flying capacitors in SCCs, MIM capacitors are widely used in the SCCs of recent high-end processor chips [8], [18], [27]. As a global resource, the MIM capacitance in the top metal layers is shared by all the SCCs on the chip (see Fig. 1). Hence, in our SCC planning method, we should allocate the MIM capacitance carefully to maximize the total conversion efficiency of the SCCs.

**FIGURE 5.** Using ratio 2:1 to output the demand voltage in the overlap region would need much flying capacitance (Equation (3)), leading to high power loss (Equation (5)). Instead, using ratio 3:2 may lead to better SCC conversion efficiency.

## C. THE CONVERSION RATIOS OF THE SCC

As shown in Table 3, different conversion ratios produce different no-load voltages. Since there is voltage ripple at the output of an SCC (see Equation (3)), the voltage received by the core loads, $V_{nl} - \Delta V$, should be no less than the minimal supply voltage of the core, $V_{core}$. Generally speaking, we choose a ratio for one SCC according to the minimal supply voltage of the load, as shown in Fig. 5 (Vdd is 1.2V here). However, when the demand voltage is slightly less than 0.6V (for example, 0.58V) and we choose the ratio 2:1, the output voltage ripple $\Delta V$ of the SCC is allowed to be at most $V_{nl} - V_{core} = 0.02V$. Consequently, the flying capacitance $C_{sw}$ required by this SCC is very large (see Equation (3)), the power loss is large (see Equation (5)), and the conversion efficiency is low. For instance, if one SCC with the ratio 2:1 is employed to supply one core, using the parameters in Table 3 and Table 5 we have $e_1 = 7.08 \times 10^{5}$ and $e_2 = 9.13 \times 10^{-12}$ in Equation (5). When the demand voltage is 0.588V, the required capacitance is at least 2.1nF (see Equation (16)) and the loss is 5.9mW.
However, when the demand voltage is 0.598V, the required capacitance is at least 12.5nF and the loss increases to 9.6mW. Moreover, as the MIM capacitance is a global resource, the MIM capacitance available to the other SCCs decreases. In such a case, the ratio 3:2 may be the better choice for high efficiency. Hence we define an overlap region just below each no-load voltage, i.e., for demand voltages slightly less than 0.6V (and likewise slightly less than the other $V_{nl}$ values such as 0.8V and 0.9V). For a demand voltage in one of these regions, we should determine which ratio is better for the efficiency of all SCCs.

# IV. PLANNING METHODS OF SCCS

In this section, we show how to plan the SCCs to achieve better power efficiency at the early design stage of the chip, when the M-core system is supplied by a smaller number of SCCs (i.e., $\hat{M}$). We study two steps: 1) the SCC supply scheme (i.e., the mapping between SCCs and cores) when $\hat{M}$ SCCs supply energy to the M cores, and 2) the MIM capacitance allocation and conversion ratio selection of the $\hat{M}$ SCCs. In addition, we provide guidance on how many SCCs should be used to achieve the minimum overall loss.

## A. OVERVIEW OF THE SCC PLANNING

In order to plan $\hat{M}$ SCCs $(1 \le \hat{M} < M)$ to supply energy to the M cores, we merge the cores into $\hat{M}$ groups, with each group supplied by an individual SCC. Two kinds of loss arise, as detailed below.

(1) Assume that Group $\hat{m}$ has L cores with minimum supply voltages $V_{core}^1$, $V_{core}^2$, ..., $V_{core}^L$ respectively. As this group is supplied by one SCC, it has a minimum supply voltage $V_{group}^{\hat{m}} = max\{V_{core}^1, V_{core}^2, \ldots, V_{core}^L\}$.
Hence many cores in the group receive a supply voltage that is higher than necessary, which leads to the extra power loss

$$P_{extra}^{\hat{m}} = \sum_{l=1}^{L} I_{core}^{l} (V_{group}^{\hat{m}} - V_{core}^{l}) \tag{8}$$

where $I_{core}^{l}$ is the maximum current demand of core l. Hence we need to merge the M cores wisely into $\hat{M}$ groups, with each group supplied by an individual SCC, so that the total extra power loss over all groups, $P_{extra}^{total} = \sum_{\hat{m}=1}^{\hat{M}} P_{extra}^{\hat{m}}$, is minimized to obtain the highest energy efficiency.

(2) We also need to optimize the flying capacitance and the selected ratio of each SCC to minimize the total power loss of the individual SCCs, $P_{scc}^{total} = \sum_{\hat{m}=1}^{\hat{M}} P_{scc}^{\hat{m}}$, where $P_{scc}^{\hat{m}}$ is the power loss of the $\hat{m}$-th SCC (see Equation (5)).

Our SCC planning therefore tries to minimize these two kinds of losses.

## B. STEP 1: GROUPING THE CORES

It is reasonable that only cores/groups that are neighbors can be merged into one group and then supplied by one SCC. Hence it is not easy to minimize $P_{extra}^{total} = \sum_{\hat{m}=1}^{\hat{M}} P_{extra}^{\hat{m}}$ (see Equation (8)), because the physical adjacency of the cores in each group acts as a constraint. Therefore we introduce our greedy approaches (two strategies) to merge the cores into $\hat{M}$ groups while inducing a reasonably small $P_{extra}^{total}$. Our grouping method is similar to the hierarchical/agglomerative clustering methods in unsupervised learning [36], [37].
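As a quick check of Equation (8), the short sketch below recomputes the extra power loss of the two supply schemes of Fig. 3 from the core demands quoted in footnote 1. It is only an illustration under stated assumptions: the function name `extra_power_loss` is introduced here, and the current of core 4 does not appear in the footnote, so a placeholder value is used (it does not affect the result, because core 4 has the highest voltage demand and is never over-supplied).

```python
def extra_power_loss(groups):
    """Total extra loss, i.e., Equation (8) summed over all groups.

    groups: list of groups, each a list of (v_core, i_core) tuples in V and A.
    """
    total = 0.0
    for cores in groups:
        v_group = max(v for v, _ in cores)  # one SCC must meet the highest demand
        total += sum(i * (v_group - v) for v, i in cores)
    return total


# Core demands (voltage, current) as quoted in footnote 1.
core1, core2, core3 = (0.55, 0.08), (0.68, 0.10), (0.82, 0.30)
core4 = (0.90, 0.50)  # ASSUMPTION: the 0.50 A current is a placeholder

case1 = [[core1], [core2, core3, core4]]  # Fig. 3(a)
case2 = [[core1, core2], [core3, core4]]  # Fig. 3(b)

print(f"{extra_power_loss(case1):.4f}")  # prints 0.0460
print(f"{extra_power_loss(case2):.4f}")  # prints 0.0344
```

The two printed values agree with the 0.046W and 0.0344W stated in footnote 1.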
We treat each group as one core with a minimum supply voltage

$$V_{group}^{\hat{m}} = max\{V_{core}^1, V_{core}^2, \dots, V_{core}^L\} \tag{9}$$

where $core1, core2, \ldots, coreL \in Group\ \hat{m}$, and a maximum demand current

$$I_{group}^{\hat{m}} = \sum_{l \in Group \, \hat{m}} I_{core}^{l} \tag{10}$$

In order to represent the adjacency among the cores/groups on a chip and the extra power loss induced by merging two adjacent cores/groups, we use an adjacency graph in which connected nodes x and y represent adjacent cores/groups, and the weight $W_{x,y}$ between nodes x and y represents the extra power loss induced if these two cores/groups merge. According to Equation (8), $W_{x,y}$ can be written as

$$W_{x,y} = \begin{cases} (V_{group}^{x} - V_{group}^{y}) * I_{group}^{y}, & V_{group}^{x} > V_{group}^{y} \\ (V_{group}^{y} - V_{group}^{x}) * I_{group}^{x}, & V_{group}^{x} \leq V_{group}^{y} \end{cases} \tag{11}$$

**FIGURE 6.** The layout of a 4-core chip and its adjacency graph representing the neighbor relationship.

FIGURE 7. Each time, our approach selects the two adjacent groups connected by the smallest weight in the graph to merge (assuming that in each of the above graphs the labeled weight is the smallest weight).

Given the layout of all the cores in a multicore chip, we can obtain this adjacency graph by using Voronoi diagrams. In the resulting adjacency graph (M nodes), each node represents one core with a minimum supply voltage and a maximum demand current. We then follow the loop below to iteratively merge the nodes until the number of nodes in the graph is $\hat{M}$.

1. Find the smallest weight $W_{x,y}$ in the adjacency graph. Merge node x and node y into a new node $x\_y$, and calculate $(V_{group}^{x\_y}, I_{group}^{x\_y})$ (see Equations (9) and (10) respectively).
2. Update the weights connected to the original nodes *x* and *y* with Equation (11). A new adjacency graph is then generated.
3. Go to step 1, until the number of nodes in the adjacency graph is $\hat{M}$.

For example, to represent the neighbor relationship in the layout of the 4-core chip shown in Fig. 6(a), we can use the adjacency graph shown in Fig. 6(b). Each weight in this adjacency graph is the extra power loss induced if the two connected cores are merged into one. Our greedy approach always chooses the two groups connected by the smallest weight to merge. Hence each merge greedily induces the smallest extra power loss and reduces the number of SCCs by one. In this way our approach iteratively reduces the number of SCCs until the targeted number is reached. This flow is shown in Fig. 7.

After the grouping, we can calculate the total induced extra power loss with Equation (8),

$$P_{extra}^{total} = \sum_{\hat{m}=1}^{\hat{M}} P_{extra}^{\hat{m}} = \sum_{\hat{m}=1}^{\hat{M}} \sum_{l=1}^{L} I_{core}^{l} (V_{group}^{\hat{m}} - V_{core}^{l}) \tag{12}$$

## C. STEP 2: OPTIMIZING THE SCCS

When the cores are merged into $\hat{M}$ groups, we treat each group as one core with a minimum supply voltage $V_{group}^{\hat{m}}$ and a maximum demand current $I_{group}^{\hat{m}}$ (see Equations (9) and (10)). Given the load information of these groups, the total available MIM capacitance for the SCCs and the optional ratios for one SCC, the problem is to optimize the capacitance allocation and ratio selection for the $\hat{M}$ SCCs to obtain better energy efficiency. In this section, we present our method to optimize the capacitance allocation and ratio selection for each SCC so as to achieve the minimal power loss of all SCCs, $P_{scc}^{total}$.

We use a binary variable $x_{\hat{m},n}$ to indicate whether the *n*-th ratio is used in the $\hat{m}$-th SCC.
According to Equation (5), the total power loss of all SCCs is

$$P_{scc}^{total} = \sum_{\hat{m}=1}^{\hat{M}} \sum_{n=1}^{N} x_{\hat{m},n} \cdot \left(e_1^{\hat{m}, n} \cdot C_{sw}^{\hat{m}} + \frac{e_2^{\hat{m}, n}}{C_{sw}^{\hat{m}}}\right) \tag{13}$$

where $\hat{M}$ and N are the number of SCCs and the number of optional ratios, and $e_1^{\hat{m},n}$ and $e_2^{\hat{m},n}$ are the values of $e_1$ and $e_2$ in the power loss model of the $\hat{m}$-th SCC when its n-th ratio is used. We minimize the objective function of Equation (13) subject to the following constraints.

1) Each of the $\hat{M}$ SCCs chooses exactly one conversion ratio, that is,

$$\sum_{n=1}^{N} x_{\hat{m},n} = 1, \quad \forall \hat{m} = 1, 2, \dots, \hat{M} \tag{14}$$

2) The total MIM capacitance used as flying capacitance by all SCCs should be no more than the total MIM capacitance $C_{total}$, so we have

$$\sum_{\hat{m}=1}^{\hat{M}} C_{sw}^{\hat{m}} \le C_{total} \tag{15}$$

3) In addition, according to Equation (3) and Section III-C, the output voltage ripple of the $\hat{m}$-th SCC, $\Delta V^{\hat{m},n}$, should not exceed the maximal allowed ripple of this core group, $V_{nl}^{\hat{m},n} - V_{group}^{\hat{m}}$ [19]. This is because SCCs with different conversion ratios output different no-load voltages $V_{nl}$ (for example, an SCC with ratio 2:1 outputs the no-load voltage 1/2\*VDD), and the voltage received by the core is $V_{nl} - \Delta V$, where $\Delta V$ is the voltage ripple. If the received voltage $V_{nl} - \Delta V$ is less than the demand voltage of the core, $V_{core}$, the cells in this core would malfunction. As a result, the maximum allowed voltage ripple is $\Delta V_{max} = V_{nl} - V_{core} \geq \Delta V = \frac{I_{out}}{M_{topo} f_{sw} N_{phase} C_{sw}}$ (see Equation (3)).
This leads to a lower bound on each $C_{sw}^{\hat{m}}$, since

$$V_{nl}^{\hat{m},n} - V_{group}^{\hat{m}} \ge \frac{I_{group}^{\hat{m}}}{M_{topo}^{\hat{m},n} \cdot f_{sw} N_{phase} C_{sw}^{\hat{m}}} \tag{16}$$

where $\hat{m} = 1, 2, ..., \hat{M}$, which gives $\hat{M}$ constraints. Note that $V_{nl}^{\hat{m},n}$ here is the no-load output voltage of the $\hat{m}$-th SCC when it uses the n-th ratio.

FIGURE 8. When the demand voltage is in an overlap region there are two possible ratios; otherwise there is one possible ratio. Here we set the length of the overlap region to 0.05V.

This is a MINLP (Mixed-Integer Nonlinear Programming) problem, since we have continuous variables $C_{sw}^{\hat{m}}$, binary variables $x_{\hat{m},n}$ and non-linear terms in the objective function. Although solvers such as IBM Cplex [38] can be applied directly to this problem, it is still hard to obtain the optimal result within a reasonable amount of time, especially when $\hat{M}/N$ is large (e.g., 32/4). In practice, the number of possible ratios for each SCC is limited by its load demand (see Fig. 5), so we can enumerate the possible conversion ratios for each SCC [39] and thereby eliminate all the binary variables $x_{\hat{m},n}$. This makes the solving process more efficient. Hence we solve the problem by establishing and solving the following sub-problems:

1. Each SCC has only one or two possible ratios for achieving better power efficiency (Section III-C). Hence, for all SCCs, we enumerate all the possible ratio combinations, which results in a set of sub-problems. As an example, Fig. 8 shows how we choose the possible ratio(s) for a given load voltage. If we have $\hat{M}=3$ SCCs that supply the demand voltages 0.57V, 1.05V and 0.78V respectively, the possible ratios for each SCC are $r_{pos}^1=\{2:1,\ 3:2\}$, $r_{pos}^2=\{1:1\}$ and $r_{pos}^3=\{3:2,\ 4:3\}$.
We choose one ratio from each set $r_{pos}^{\hat{m}}$, so there are $K = 2 \times 1 \times 2 = 4$ ratio combinations in total. In each ratio combination all the $x_{\hat{m},n}$ are fixed, which gives a specific sub-problem with the objective $P_{scc}^{total}=\sum_{\hat{m}=1}^{\hat{M}}\left(e_1^{\hat{m}}\cdot C_{sw}^{\hat{m}}+\frac{e_2^{\hat{m}}}{C_{sw}^{\hat{m}}}\right)$, which is convex and easy to solve. As a result, we get $K$ simple sub-problems.
- 2. We solve each sub-problem with CVX [40] to obtain its optimal SCC power loss $P_{scc}^{total,k}$, $k = 1, 2, ..., K$. The solution of the original problem is the minimal power loss among the sub-problems, i.e., $P_{scc}^{total} = \min_{k=1,2,...,K} \{P_{scc}^{total,k}\}$. The capacitance allocation and ratio selection for each SCC are the optimization results of that sub-problem.

# D. THE WHOLE FLOW OF THE FRAMEWORK

After the two steps, the energy efficiency can be expressed as

$$\eta = P_{load}/(P_{load} + P_{extra}^{total} + P_{scc}^{total}) \tag{17}$$

where $P_{load}$ is the total load power of all the cores in a chip. To summarize the solution to the SCC planning problem, Algorithm 1 shows the framework of the early-stage SCC planning method.

**Algorithm 1** Early-stage planning of SCCs

- 1: **Input**: the layout of $M$ cores, the minimum supply voltage and maximum current of the $m$-th core $(V_{core}^m, I_{core}^m)$, the number of SCCs $\hat{M}$.
- 2: **Output**: the cores supplied by each of the $\hat{M}$ SCCs, the capacitance $C_{opt}^{\hat{m}}$ and the ratio $r^{\hat{m}}$ for each SCC.
- 3: // Step 1: Grouping the cores
- 4: Generate the adjacency graph $G$
- 5: while the number of nodes in $G > \hat{M}$ do
- 6: Find the smallest weight $W_{x,y}$ in $G$, and merge the nodes $x$ and $y$ into a new node $x\_y$
- 7: Calculate the minimum supply voltage and maximum demand current $(V_{group}^{x\_y}, I_{group}^{x\_y})$ with Equations (9) and (10).
- 8: Update the weights connected to the original nodes *x* and *y* with Equation (11)
- 9: end while
- 10: // Step 2: Optimizing the SCCs
- 11: Formulate the power loss optimization as a MINLP problem (Equations (13), (14), (15) and (16)).
- 12: According to the demand voltage of each core group, get the *K* ratio combinations (sub-problems).
- 13: **for** k = 1 to K **do**
- 14: Solve the $k$-th sub-problem to obtain its optimal total power loss $P_{scc}^{total,k}$, and the corresponding MIM capacitance $C_{sw}^{\hat{m}}$ and conversion ratio $r^{\hat{m}}$, $\hat{m} = 1, 2, ..., \hat{M}$.
- 15: **end for**
- 16: Select the $C_{sw}^{\hat{m}}$ and $r^{\hat{m}}$, $\hat{m} = 1, 2, ..., \hat{M}$, that correspond to the minimal total power loss $P_{scc}^{total} = \min_{k=1,2,...,K} \{P_{scc}^{total,k}\}$.

#### V. THE OPTIMAL NUMBER OF THE SCCS

The authors of [19] use a penalty term for the power loss of the control circuit and the power consumption of the clock signals. Here, we also use a penalty term for the power loss overhead of integrating the SCCs, since integrating SCCs in a multicore chip requires control circuits, routing resources (including supply routing), clock power, chip area and so on. It is reasonable to model the overhead of integrating one SCC as a constant loss $P_0$ (the penalty). As the number of SCCs increases, the integration overhead $\hat{M}*P_0$ increases, while the power loss $P_{extra}^{total} + P_{scc}^{total}$ generally decreases because of finer power management. The overall loss with $\hat{M}$ SCCs is

$$P_{extra}^{total}(\hat{M}) + P_{scc}^{total}(\hat{M}) + \hat{M} * P_0 \tag{18}$$

where $\hat{M}$ is the number of SCCs used in the chip.

FIGURE 9. The layout of three heterogeneous multi-core chips.

By varying the number of SCCs $\hat{M}$ from $M$ to 1, we can obtain the overall loss with the aforesaid techniques at every granularity of SCCs.
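The sweep over $\hat{M}$ can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function name `best_scc_count` is hypothetical, the values in `planned_loss_mw` are made-up placeholders for $P_{extra}^{total}(\hat{M}) + P_{scc}^{total}(\hat{M})$ (they are not results from this paper), and breaking ties toward fewer SCCs is our own choice. The penalty of 30 mW per SCC is the value used later in Section VI-B.

```python
from typing import Dict, Tuple


def best_scc_count(planned_loss_mw: Dict[int, float], p0_mw: float) -> Tuple[int, float]:
    """Pick the number of SCCs that minimizes the overall loss of Equation (18).

    planned_loss_mw maps each candidate number of SCCs to the planned loss
    P_extra + P_scc (in mW); p0_mw is the constant per-SCC penalty P_0.
    """
    overall = {m: loss + m * p0_mw for m, loss in planned_loss_mw.items()}
    # Assumption: on ties, prefer the smaller number of SCCs.
    m_opt = min(overall, key=lambda m: (overall[m], m))
    return m_opt, overall[m_opt]


# Hypothetical planned losses (mW) for a 4-core chip; illustrative only.
planned_loss_mw = {1: 400.0, 2: 330.0, 3: 310.0, 4: 300.0}
m_opt, overall_loss = best_scc_count(planned_loss_mw, p0_mw=30.0)
print(m_opt, overall_loss)
```

With these made-up numbers the overall losses are 430, 390, 400 and 420 mW for one to four SCCs, so the sweep selects two SCCs.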
The optimal number of SCCs $\hat{M}_{opt}$ is the one with the minimal overall loss, so one can sweep $\hat{M}$ to find the number of SCCs that achieves the minimal overall loss. Note that the optimal number of SCCs is closely related to the layout of the cores in the chip: in Equation (18), the layout affects the terms $P_{extra}^{total}(\hat{M})$ and $P_{scc}^{total}(\hat{M})$. Therefore, the optimal number of SCCs cannot be determined directly in one step. Instead, we vary the number of SCCs, obtain the loss for each value, and then pick the optimal number of SCCs.

#### VI. EXPERIMENTAL RESULTS

In this section, we present the results of our SCC planning work. Heterogeneous multicore benchmarks with 4, 8 and 16 cores are tested. The load information of each core in our benchmarks is obtained by a reasonable scaling of the values in [19] and is shown in Table 4. We assume, first, that the cores in [19] are of many types (such as Cortex-A72 @ 0.78 V / @ 0.82 V [41] and DSP @ 0.55 V [42]), and that the current information can be obtained with the system-level simulators gem5 [43] and McPAT [44], which simulate the hardware behavior and report the power information [45]. Second, the aspect ratios of these cores can be customized [46]. The layouts of the cores in the three chips are shown in Fig. 9, and simpler versions of such heterogeneous chips can be seen on today's market [47]. The layouts of these three multicore chips are fixed patterns. Algorithm 1 in Section IV-D can be applied to any benchmark as long as the layout of the cores, their minimum supply voltages and maximum currents, and the available number of SCCs are given. As introduced in Section II, given enough information we can easily formulate the problem and apply the optimization algorithm. The results vary with the layouts of the cores in different benchmarks.
The parameters of the SCCs used in the experiments are listed in Table 5, and CVX [40] is used to solve the convex problems.

TABLE 4. The load information of cores in heterogeneous benchmarks.

| #Benchmark | Quantity | | | | |
|---|---|---|---|---|---|
| 4-core chip | Core index | 1 | 2 | 3 | 4 |
| | $V_{core}$ (V) | 0.55 | 0.68 | 0.82 | 0.90 |
| | $I_{core}$ (A) | 0.08 | 0.10 | 0.30 | 0.35 |
| 8-core chip | Core index | 1 | 2 | 3 | 4 |
| | $V_{core}$ (V) | 0.55 | 0.65 | 0.65 | 0.72 |
| | $I_{core}$ (A) | 0.15 | 0.18 | 0.18 | 0.20 |
| | Core index | 5 | 6 | 7 | 8 |
| | $V_{core}$ (V) | 0.78 | 0.85 | 0.85 | 0.90 |
| | $I_{core}$ (A) | 0.25 | 0.35 | 0.35 | 0.40 |
| 16-core chip | Core index | 1 | 2 | 3 | 4 |
| | $V_{core}$ (V) | 0.55 | 0.68 | 0.82 | 0.90 |
| | $I_{core}$ (A) | 0.08 | 0.10 | 0.30 | 0.35 |
| | Core index | 5 | 6 | 7 | 8 |
| | $V_{core}$ (V) | 0.55 | 0.68 | 0.82 | 0.90 |
| | $I_{core}$ (A) | 0.08 | 0.10 | 0.30 | 0.35 |
| | Core index | 9 | 10 | 11 | 12 |
| | $V_{core}$ (V) | 0.55 | 0.65 | 0.65 | 0.72 |
| | $I_{core}$ (A) | 0.15 | 0.18 | 0.18 | 0.20 |
| | Core index | 13 | 14 | 15 | 16 |
| | $V_{core}$ (V) | 0.78 | 0.85 | 0.85 | 0.90 |
| | $I_{core}$ (A) | 0.25 | 0.35 | 0.35 | 0.40 |

TABLE 5. Non-topology-dependent parameters of SCCs.

| Experimental parameter | Value |
|---|---|
| $V_{dd}$ | 1.2 V |
| $N_{phase}$ | 8 |
| $f_{sw}$ | 200 MHz |
| $C_{gate}$ | 3 fF/μm |
| $R_{on}$ | $130\ \Omega \cdot \mu\text{m}$ |
| $\sigma$ | $512\mu/(\mu \text{F}\cdot\text{MHz})$ |

## A. RESULTS OF THE SCC PLANNING

#### 1) THE CORE GROUPING

With our grouping method, the cores in a chip are partitioned into several groups. Fig. 10 shows the extra power loss obtained with two grouping strategies.

Strategy 1: The proposed grouping strategy in Section IV-B.

Strategy 2: Differing from strategy 1, here we simply merge, at each step, the two adjacent cores/groups that have the most similar supply voltages. Hence we replace the weight in Equation (11) with $W_{x,y} = |V_{group}^x - V_{group}^y|$.

FIGURE 10. The extra power loss of all cores $P_{extra}^{total}$ varies along with the given number of used SCCs.

We can see that as the number of groups (i.e., $\hat{M}$) varies from $M$ to 1, the extra power loss increases. This is because with fewer SCCs the power management is coarser, and more power is wasted as there are larger mismatches between the core demand voltage and the SCC output voltage. We also see that strategy 1 is better than strategy 2, since it results in less extra power loss in *most cases*. (Note that when $\hat{M}=1$, both grouping strategies produce a single group that includes all cores, leading to the same extra power loss. Moreover, since there are some homogeneous cores in this chip, both grouping strategies merge these cores first, leading to no extra power loss at $\hat{M}=6$ or 7.)

# 2) THE SCC OPTIMIZATION

In this section, we show the power loss of all SCCs obtained when optimizing the MIM capacitance allocation and ratio selection. To show the effectiveness of our optimization, we also show the power loss obtained when the capacitance or the ratio is not optimized. Table 6 shows the power loss of the 4 SCCs in the 4-core chip without and with optimization.
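The enumerate-and-solve optimization of Section IV-C can be illustrated with a small, dependency-free Python sketch. It is not the authors' implementation: the paper solves each convex sub-problem with CVX, whereas this sketch swaps in a bisection on the Lagrange multiplier of the capacitance budget (Equation (15)), with `c_min` standing for the per-SCC lower bound implied by Equation (16). The function names and all `e1`, `e2`, `c_min` values are hypothetical; only the candidate ratio sets mirror the three-SCC example of Section IV-C.

```python
import itertools
import math


def solve_subproblem(e1, e2, c_min, c_total, iters=200):
    """Minimize sum(e1*C + e2/C) s.t. sum(C) <= c_total and C >= c_min.

    Returns (loss, capacitances), or None if the lower bounds alone exceed
    the budget. Assumes e1 > 0, e2 > 0 and c_min > 0 for every SCC.
    """
    if sum(c_min) > c_total:
        return None

    def alloc(lam):
        # KKT stationarity: C = sqrt(e2 / (e1 + lam)), clipped at the lower bound.
        return [max(lo, math.sqrt(b / (a + lam))) for a, b, lo in zip(e1, e2, c_min)]

    caps = alloc(0.0)
    if sum(caps) > c_total:
        # The budget is binding: bisect on the multiplier lam >= 0.
        lam_lo, lam_hi = 0.0, 1.0
        while sum(alloc(lam_hi)) > c_total:
            lam_hi *= 2.0
        for _ in range(iters):
            mid = 0.5 * (lam_lo + lam_hi)
            if sum(alloc(mid)) > c_total:
                lam_lo = mid
            else:
                lam_hi = mid
        caps = alloc(lam_hi)
    loss = sum(a * c + b / c for a, b, c in zip(e1, e2, caps))
    return loss, caps


def plan_sccs(candidates, c_total):
    """Enumerate every ratio combination and keep the one with minimal loss.

    candidates[m] is a list of (ratio, e1, e2, c_min) tuples for SCC m.
    Returns (loss, ratios, capacitances), or None if no combination is feasible.
    """
    best = None
    for combo in itertools.product(*candidates):
        ratios, e1, e2, c_min = zip(*combo)
        result = solve_subproblem(e1, e2, c_min, c_total)
        if result is None:
            continue
        loss, caps = result
        if best is None or loss < best[0]:
            best = (loss, ratios, caps)
    return best


# Hypothetical model coefficients; only the ratio sets follow the text's example.
candidates = [
    [("2:1", 1.0, 4.0, 0.5), ("3:2", 2.0, 9.0, 0.2)],
    [("1:1", 1.0, 1.0, 0.5)],
    [("3:2", 1.0, 16.0, 1.0), ("4:3", 4.0, 1.0, 0.5)],
]
num_combinations = math.prod(len(c) for c in candidates)  # K = 2 * 1 * 2
best_loss, best_ratios, best_caps = plan_sccs(candidates, c_total=13.0)
print(num_combinations, best_ratios, best_caps, best_loss)
```

Fixing the ratios first is what makes each inner problem convex, so every sub-problem can be solved exactly and the outer loop only has to compare $K$ numbers.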
In the w/o methods, we allocate the capacitance to each SCC according to

- \_1: the percentage of core area supplied by this SCC,
- \_2: the percentage of core current supplied by this SCC,

and then select the ratio according to the demand voltage without considering the overlap voltage regions (see Fig. 5). We mark these two methods as w/o optimizing SCCs \_1 and w/o optimizing SCCs \_2, respectively. It can be seen that our method allocates the capacitance wisely and also selects better ratios for some SCCs, which together significantly reduce the total power loss of the SCCs.

TABLE 6. Optimizing the four-SCC case of the four-core benchmark. Here \* stands for the selected ratio among the possible candidates with our optimization.

| #SC | w/o optimizing SCCs \_1: $C_{total}^{\hat{m}}$ (nF) | w/o optimizing SCCs \_1: $r^{\hat{m}}$ | w/o optimizing SCCs \_1: $P_{scc}^{\hat{m}}$ (mW) | w/o optimizing SCCs \_2: $C_{total}^{\hat{m}}$ (nF) | w/o optimizing SCCs \_2: $r^{\hat{m}}$ | w/o optimizing SCCs \_2: $P_{scc}^{\hat{m}}$ (mW) | With optimization: $C_{total}$ (nF) | With optimization: $C_{opt}^{\hat{m}}$ (nF) | With optimization: $r_{pos}^{\hat{m}}$ | With optimization: $P_{scc}^{\hat{m}}$ (mW) |
|---|---|---|---|---|---|---|---|---|---|---|
| $SC_1$ | 5 | 2:1 | 5.08 | 1.25 | 2:1 | 8.17 | 13 | 1.04 | 2:1\*, 3:2 | 9.47 |
| $SC_2$ | 2 | 3:2 | 15.32 | 1.57 | 3:2 | 18.95 | 13 | 1.85 | 3:2 | 16.41 |
| $SC_3$ | 3 | 4:3 | 145.62 | 4.70 | 4:3 | 94.61 | 13 | 4.67 | 1:1\*, 4:3 | 37.37 |
| $SC_4$ | 3 | 1:1 | 77.90 | 5.48 | 1:1 | 43.31 | 13 | 5.44 | 1:1 | 43.59 |
| $P_{scc}^{total}$ | | | 243.92 | | | 165.04 | | | | 106.84 |

# 3) RESULTS OF THE WHOLE FRAMEWORK

In our work, we propose the SCC planning framework to obtain better power efficiency when the given number of SC converters is less than the number of cores in the chip. In this part, we show the planning results and the obtained power efficiency when the given number of SC converters varies from $M$ to 1. To the best of our knowledge, no prior work explores supply schemes in which the number of SCCs is less than the number of cores. We therefore compare our framework against three alternative implementations, which capture the general ideas one would use to supply the cores with fewer SCCs:

- (0) Ours: we use the proposed grouping strategy and the SCC optimization technique, both described in Section IV.
- (1) Method 1: we do not use the proposed grouping strategy (but strategy 2 in Section VI-A.1 instead) and use the SCC optimization technique.
- (2) Method 2: we do not use the proposed grouping strategy (but strategy 2 in Section VI-A.1 instead) and do not use the SCC optimization technique (w/o optimizing SCCs \_1 in Section VI-A.2 instead).
- (3) Method 2\_2: we do not use the proposed grouping strategy (but strategy 2 in Section VI-A.1 instead) and do not use the SCC optimization technique (w/o optimizing SCCs \_2 in Section VI-A.2 instead).

Let us first take the 4-core benchmark as an example. If the given number of SC converters $\hat{M}$ is 3, the results of our SCC planning framework are shown in Table 7. We can see that with our method the supply scheme and the optimized capacitances and ratios are obtained for better efficiency of the chip.

TABLE 7. When the 4-core chip is supplied with three SCCs, our SCC planning methods could obtain smart supply schemes and optimized capacitance and ratios for each SCC.

| Method | #SCC | Supplied cores | Cap (nF) | Ratio | $\eta$ (%) |
|---|---|---|---|---|---|
| Ours | SCC1 | core1, core2 | 3.22 | 3:2 | 84.4 |
| | SCC2 | core3 | 4.51 | 1:1 | |
| | SCC3 | core4 | 5.27 | 1:1 | |
| Method 1 | SCC1 | core1 | 1.04 | 2:1 | 83.7 |
| | SCC2 | core2 | 1.85 | 3:2 | |
| | SCC3 | core3, core4 | 10.11 | 1:1 | |
| Method 2 | SCC1 | core1 | 5 | 2:1 | 79.0 |
| | SCC2 | core2 | 2 | 3:2 | |
| | SCC3 | core3, core4 | 6 | 1:1 | |
| Method 2\_2 | SCC1 | core1 | 1.25 | 2:1 | 83.6 |
| | SCC2 | core2 | 1.57 | 3:2 | |
| | SCC3 | core3, core4 | 10.18 | 1:1 | |

FIGURE 11. The energy efficiency of the chip (as the number of SCCs is at each level of granularity).

The energy efficiency of the chip with three planning methods is shown in Fig. 11. We can see that the energy efficiency with our method is slightly higher than that with Method 1, and much higher than that with Method 2. Besides, as the given number of SCCs in the 4-core benchmark varies from 1 to 4, the energy efficiency of the chip with our method improves, since more SCCs lead to finer power management. In contrast, as Method 2 does not use the SCC optimization technique, the more SCCs are given, the more unreasonable the capacitance allocation becomes and the worse the efficiency is.

FIGURE 12. 8-core benchmark: efficiency vs. converter number.

FIGURE 13. 16-core benchmark: efficiency vs. converter number.

The energy efficiency of the 8-core and 16-core benchmarks with our method is shown in Fig. 12 and Fig. 13, respectively. Note that the energy efficiency stays the same when the number of SCCs approaches $M$. This is because there are many identical (i.e., homogeneous) cores in the heterogeneous chip, and merging two identical cores into one group supplied by one SCC does not induce any additional $P_{extra}^{total}$ or $P_{scc}^{total}$.

# B. THE BEST NUMBER OF SCCS

Our method also provides a way to find the best number of SCCs, which leads to the minimal overall power loss. We show the results for the 8-core benchmark. If we set the overhead of using one SC converter to 30 mW (a value taken from [19]), then as the number of SCCs grows, the power loss $P_{extra}^{total}(\hat{M}) + P_{scc}^{total}(\hat{M})$, the constant loss $\hat{M}*P_0$ and the overall loss (i.e., the sum of the two) are as shown in Fig. 14.

FIGURE 14. The tradeoff between the power loss and the constant loss (i.e., the penalty of using SCCs) in the eight-core benchmark, which results in a minimal overall loss.

FIGURE 15. The tradeoff between the power loss and the constant loss (i.e., the penalty of using SCCs) in the four-core benchmark, which results in a minimal overall loss.

FIGURE 16. The tradeoff between the power loss and the constant loss (i.e., the penalty of using SCCs) in the sixteen-core benchmark, which results in a minimal overall loss.

We can see that the power loss $P_{extra}^{total}(\hat{M}) + P_{scc}^{total}(\hat{M})$ would decrease while the constant loss
$\hat{M}*P_0$ would increase, as the number of SCCs grows. As a result, the minimal overall loss is obtained when two SC converters are used. A similar trend is also observed in the 4-core and 16-core benchmarks, and the results are shown in Fig. 15 and Fig. 16, respectively. We can see that the optimal numbers of SC converters for the 4-core and 16-core benchmarks are 1 and 3, respectively.

# VII. CONCLUSION

The overhead of integrating SCCs in a chip is non-negligible, so SCCs cannot be overused. In this paper, for better energy efficiency, we propose an early-stage planning framework of SCCs to obtain the SCC supply scheme together with the optimized MIM capacitance allocation and converter ratio selection for each SCC when the given number of SCCs is less than the number of cores. Besides, our method can also explore the number of SCCs to find the best number for a given chip. The experiments show the results of our SCC planning methods.

#### **REFERENCES**

- [1] A. Branover, D. Foley, and M. Steinman, "AMD fusion APU: Llano," *IEEE Micro*, vol. 32, no. 2, pp. 28–37, Mar. 2012.
- [2] M. Yuffe, E. Knoll, M. Mehalel, J. Shor, and T. Kurts, "A fully integrated multi-CPU, GPU and memory controller 32 nm processor," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2011, pp. 264–266.
- [3] J. Burt, "Intel begins shipping Xeon chips with FPGA accelerators," eWeek, 2016.
- [4] T. Liang, L. Feng, S. Sinha, and W. Zhang, "PAAS: A system level simulator for heterogeneous computing architectures," in *Proc. 27th Int. Conf. Field Program. Log. Appl. (FPL)*, Sep. 2017, pp. 1–8.
- [5] (2018). Apple A12. [Online]. Available: https://en.wikipedia.org/wiki/Apple\_A12
- [6] (2018). Hisilicon Kirin 980. [Online]. Available: https://en.wikichip.org/wiki/hisilicon/kirin/980
- [7] B. Bowhill, B. Stackhouse, N. Nassif, Z. Yang, A. Raghavan, O. Mendoza, C. Morganti, C. Houghton, D. Krueger, and O.
Franza, "The Xeon processor E5-2600 v3: A 22 nm 18-core product family," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2015, pp. 1–3.
- [8] P. Meinerzhagen et al., "An energy-efficient graphics processor featuring fine-grain DVFS with integrated voltage regulators, execution-unit turbo, and retentive sleep in 14 nm tri-gate CMOS," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2018, pp. 38–40.
- [9] R. Jevtić, H.-P. Le, M. Blagojević, S. Bailey, K. Asanović, E. Alon, and B. Nikolić, "Per-core DVFS with switched-capacitor converters for energy efficiency in manycore processors," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 23, no. 4, pp. 723–730, Apr. 2015.
- [10] S. M. Tam, H. Muljono, M. Huang, S. Iyer, K. Royneogi, N. Satti, R. Qureshi, W. Chen, T. Wang, H. Hsieh, S. Vora, and E. Wang, "SkyLake-SP: A 14 nm 28-core Xeon processor," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2018, pp. 34–36.
- [11] E. J. Fluhr, J. Friedrich, D. Dreps, V. Zyuban, G. Still, C. Gonzalez, A. Hall, D. Hogenmiller, F. Malgioglio, and R. Nett, "POWER8: A 12-core server-class processor in 22 nm SOI with 7.6 Tb/s off-chip bandwidth," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2014, pp. 96–97.
- [12] S. Lai and P. Li, "A fully on-chip area-efficient CMOS low-dropout regulator with fast load regulation," *Anal. Integr. Circuits Signal Process.*, vol. 72, no. 2, pp. 433–450, Aug. 2012.
- [13] S. Bang, J.-S. Seo, L. Chang, D. Blaauw, and D. Sylvester, "A low ripple switched-capacitor voltage regulator using flying capacitance dithering," *IEEE J. Solid-State Circuits*, vol. 51, no. 4, pp. 919–929, Apr. 2016.
- [14] J. Jiang, Y. Lu, W.-H. Ki, U. Seng-Pan, and R. P. Martins, "A dual-symmetrical-output switched-capacitor converter with dynamic power cells and minimized cross regulation for application processors in 28 nm CMOS," in *IEEE Int. Solid-State Circuits Conf.
(ISSCC) Dig. Tech. Papers*, Feb. 2017, pp. 344–345.
- [15] T. M. Andersen, F. Krismer, J. W. Kolar, T. Toifl, C. Menolfi, L. Kull, T. Morf, M. Kossel, M. Brändli, and P. A. Francese, "A 10 W on-chip switched capacitor voltage regulator with feedforward regulation capability for granular microprocessor power delivery," *IEEE Trans. Power Electron.*, vol. 32, no. 1, pp. 378–393, Jan. 2017.
- [16] P. Zhou, D. Jiao, C. H. Kim, and S. S. Sapatnekar, "Exploration of on-chip switched-capacitor DC–DC converter for multicore processors using a distributed power delivery network," in *Proc. IEEE Custom Integr. Circuits Conf. (CICC)*, Sep. 2011, pp. 1–4.
- [17] Y. K. Ramadass, "Energy processing circuits for low-power applications," Ph.D. dissertation, Massachusetts Inst. Technol., Cambridge, MA, USA, 2009.
- [18] R. Jain, B. M. Geuskens, S. T. Kim, M. M. Khellah, J. Kulkarni, J. W. Tschanz, and V. De, "A 0.45–1 V fully-integrated distributed switched capacitor DC–DC converter with high density MIM capacitor in 22 nm tri-gate CMOS," *IEEE J. Solid-State Circuits*, vol. 49, no. 4, pp. 917–927, Apr. 2014.
- [19] P. Zhou, A. Paul, C. H. Kim, and S. S. Sapatnekar, "Distributed on-chip switched-capacitor DC–DC converters supporting DVFS in multicore systems," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 22, no. 9, pp. 1954–1967, Sep. 2014.
- [20] N. Butzen and M. Steyaert, "A 94.6%-efficiency fully integrated switched-capacitor DC-DC converter in baseline 40 nm CMOS using scalable parasitic charge redistribution," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Jan./Feb. 2016, pp. 220–221.
- [21] M. D. Seeman, "A design methodology for switched-capacitor DC–DC converters," Ph.D. dissertation, Dept. Elect. Eng. Comput. Sci., Univ. California, Berkeley, Berkeley, CA, USA, 2011.
- [22] H.-P. Le, S. R. Sanders, and E. Alon, "Design techniques for fully integrated switched-capacitor DC–DC converters," *IEEE J. Solid-State Circuits*, vol. 46, no. 9, pp.
2120–2131, Sep. 2011.
- [23] J. De Vos, D. Flandre, and D. Bol, "A sizing methodology for on-chip switched-capacitor DC/DC converters," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 61, no. 5, pp. 1597–1606, May 2014.
- [24] V. S. Sathe and J.-S. Seo, "Analysis and optimization of CMOS switched-capacitor converters," in *Proc. IEEE/ACM Int. Symp. Low Power Electron. Design (ISLPED)*, Jul. 2015, pp. 327–334.
- [25] T. M. Andersen, F. Krismer, J. W. Kolar, T. Toifl, C. Menolfi, L. Kull, T. Morf, M. Kossel, M. Brändli, and P. A. Francese, "Modeling and Pareto optimization of on-chip switched capacitor converters," *IEEE Trans. Power Electron.*, vol. 32, no. 1, pp. 363–377, Jan. 2017.
- [26] X. Mi, H. F. Moghadam, and J.-S. Seo, "Flying and decoupling capacitance optimization for area-constrained on-chip switched-capacitor voltage regulators," in *Proc. Design, Automat. Test Eur.*, Mar. 2017, pp. 1269–1272.
- [27] L. G. Salem and P. P. Mercier, "An 85%-efficiency fully integrated 15-ratio recursive switched-capacitor DC-DC converter with 0.1-to-2.2 V output voltage range," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2014, pp. 88–89.
- [28] K. Kesarwani, R. Sangwan, and J. T. Stauth, "Resonant switched-capacitor converters for chip-scale power delivery: Modeling and design," in *Proc. IEEE 14th Workshop Control Modeling Power Electron. (COMPEL)*, Jun. 2013, pp. 1–7.
- [29] J. Jiang, Y. Lu, C. Huang, W.-H. Ki, and P. K. Mok, "A 2-/3-phase fully integrated switched-capacitor DC–DC converter in bulk CMOS for energy-efficient digital circuits with 14% efficiency improvement," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2015, pp. 1–3.
- [30] L. Chang, R. K. Montoye, B. L. Ji, A. J. Weger, K. G. Stawiasz, and R. H. Dennard, "A fully-integrated switched-capacitor 2:1 voltage converter with regulation capability and 90% efficiency at 2.3 A/mm<sup>2</sup>," in *Proc. IEEE Symp. VLSI Circuits (VLSIC)*, Jun. 2010, pp.
55–56.
- [31] L. Wang and P. Zhou, "Leakage power reduction in multicore chips via online decap modulation," in *Proc. China Semiconductor Technol. Int. Conf. (CSTIC)*, Shanghai, China, Mar. 2016, pp. 1–3.
- [32] F. Roozeboom, H. J. Bergveld, K. Nowak, F. Le Cornec, L. Guiraud, C. Bunel, S. Iochem, J. Ferreira, S. Ledain, E. Pieraerts, and M. Pommier, "Ultrahigh-density trench capacitors in silicon and their application to integrated DC–DC conversion," *Procedia Chem.*, vol. 1, no. 1, pp. 1435–1438, Sep. 2009.
- [33] P.-Y. Chiu and M.-D. Ker, "Metal-layer capacitors in the 65 nm CMOS process and the application for low-leakage power-rail ESD clamp circuit," *Microelectron. Rel.*, vol. 54, no. 1, pp. 64–70, Jan. 2014.
- [34] M. Steyaert, T. Van Breussegem, H. Meyvaert, P. Callemeyn, and M. Wens, "DC–DC converters: From discrete towards fully integrated CMOS," in *Proc. Eur. Solid-State Device Res. Conf. (ESSDERC)*, 2011, pp. 59–66.
- [35] C. Huang and P. K. T. Mok, "A 100 MHz 82.4% efficiency package-bondwire based four-phase fully-integrated buck converter with flying capacitor for area reduction," *IEEE J. Solid-State Circuits*, vol. 48, no. 12, pp. 2977–2988, Dec. 2013.
- [36] D. Beeferman and A. Berger, "Agglomerative clustering of a search engine query log," in *Proc. 6th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining (KDD)*, 2000, pp. 407–416.
- [37] D. Müllner, "Modern hierarchical, agglomerative clustering algorithms," 2011, arXiv:1109.2378. [Online]. Available: http://arxiv.org/abs/1109.2378
- [38] IBM ILOG CPLEX Optimization Studio V.12. Accessed: Apr. 1, 2019. [Online]. Available: http://www-01.ibm.com/software/integration/optimization/cplex-optimization-studio/
- [39] L. Wang, L. Wang, D. Shang, C. Zhuo, and P. Zhou, "Optimization of switched-capacitor DC-DC converters in heterogeneous multicore systems," in *Proc. Design, Automat. Test Eur.*, 2019, pp. 1269–1272.
- [40] M. Grant. (2008). CVX: MATLAB Software for Disciplined Convex Programming.
[Online]. Available: http://cvxr.com/cvx/
- [41] WikiChip. TSMC Demonstrates A 7 nm Arm-Based Chiplet Design for HPC. Accessed: Apr. 1, 2019. [Online]. Available: https://fuse.wikichip.org/news/2446/tsmc-demonstrates-a-7nm-arm-based-chiplet-design-for-hpc/
- [42] D. Marković and R. W. Brodersen, DSP Architecture Design Essentials. Boston, MA, USA: Springer, 2012.
- [43] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood, "The gem5 simulator," *ACM SIGARCH Comput. Archit. News*, vol. 39, no. 2, pp. 1–7, May 2011.
- [44] S. Li, "McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures," in *Proc. 42nd Annu. IEEE/ACM Int. Symp. Microarchitecture*, Dec. 2009, pp. 469–480.
- [45] C. Zhuo, K. Unda, Y. Shi, and W.-K. Shih, "From layout to system: Early stage power delivery and architecture co-exploration," *IEEE Trans. Comput.-Aided Design Integr. Circuits Syst.*, vol. 38, no. 7, pp. 1291–1304, Jul. 2019.
- [46] Mantra VLSI. Aspect Ratio of Core/Block/Design. Accessed: Apr. 1, 2019. [Online]. Available: http://mantravlsi.blogspot.com/2016/06/aspect-raio-of-coreblockdesign.html
- [47] ARM Cortex Processors. Accessed: Apr. 1, 2019. [Online]. Available: http://arm.com/products/processors/index.php

**LEILEI WANG** received the B.E. degree from Shandong University, Jinan, China, in 2013. He is currently pursuing the Ph.D. degree with the Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences, Shanghai, China, and the ShanghaiTech University, Shanghai. His current research interests include the modeling and optimization of the power delivery systems.

**LU WANG** received the B.E. degree in microelectronics from Xidian University, Xi'an, China, in 2017. She is currently pursuing the master's degree with ShanghaiTech University.
Her recent research interest includes improving the energy efficiency of switched-capacitor converters in heterogeneous multicore chips.

**CHENG ZHUO** (Senior Member, IEEE) received the B.S. and M.S. degrees in electronic engineering from Zhejiang University, Hangzhou, China, in 2005 and 2007, respectively, and the Ph.D. degree in computer science engineering from the University of Michigan, Ann Arbor, MI, USA, in 2010. He is currently a Professor with the College of Information Science and Electronic Engineering, Zhejiang University. His current research interests include 3-D integration, hardware acceleration, and power signal integrity. Dr. Zhuo received the 2012 ACM SIGDA Technical Leadership Award, the 2017 JSPS Invitation Fellowship, the Second Place in 2018 DAC System Design Contest (GPU), and the Best Paper Nominations in DAC16 and CSTIC18. He has served on the technical program committees of many international conferences. He is an Associate Editor of the IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, *Integration* (Elsevier), and the IEEE VLSI CAS Newsletter.

**PINGQIANG ZHOU** (Member, IEEE) received the B.E. degree from the Nanjing University of Posts and Telecommunications, China, in 2005, the M.E. degree from Tsinghua University, Beijing, China, in 2007, and the Ph.D. degree from the University of Minnesota, in 2012. He worked as a Research Intern with the IBM T. J. Watson Research Center, in 2011. From 2012 to 2013, he was a Postdoctoral Researcher with the University of Minnesota. Since July 2013, he has been an Assistant Professor with the School of Information Science and Technology, ShanghaiTech University, Shanghai, China. He was a Visiting Scholar with the University of California at Berkeley, Berkeley, in 2015. His current research interests include the computer-aided design of VLSI circuits, computer architecture, multicore processors, and 3D integrated circuits. Dr.
Zhou received the Best Paper Nominations in ASP-DAC 2010 and CSTIC 2016. He has been serving on the Technical Program Committees of many international conferences, such as DAC, ICCAD, and ASP-DAC. He is an Associate Editor of the ACM SIGDA Newsletter.