Abstract: On-chip switched-capacitor (SC) DC-DC converters have recently been demonstrated in silicon for high-performance applications such as multicore processors. The efficiency of the power delivery system using SC converters is a major concern, but this has not been addressed at the system level in prior research. This work develops models for the efficiency of such a system as a function of the size and layout of the SC converters, and proposes an approach to optimize the size and layout of the SC converters to minimize power loss. The effectiveness of these techniques is demonstrated on both homogeneous and heterogeneous multicore chips.
I. INTRODUCTION

With on-chip processing moving towards a dominant multicore paradigm, the requirements of on-chip power grids are changing. Temporal and spatial variations in on-chip power demands are particularly acute in multicore processors, and trends show that these challenges will become even more difficult in the future.
Greater integration of on-chip power regulation, based on a single external supply, is imperative in order to ensure supply integrity and serve spatially diverse loads [1], [2]. This is easier said than done, and numerous challenges are faced in integrating on-chip supplies. Inductive power supplies can be impractical since on-die inductors have low quality factors and require large area overheads [2]. As a result, in the recent past, there has been a move towards building on-chip capacitance-based DC-DC converters, since capacitors can achieve higher quality factors with lower areas than inductors. Initial efforts [3], [4] have targeted ultra-low power (several mW) applications, but more recent work has resulted in the ability to drive higher power densities, similar to those encountered in multicore CPUs [5], [6]. For example, through the use of trench capacitors, the work in [6] builds converters that can achieve current densities of 2.3 A/mm² and 90% efficiency under the experimental conditions in the paper.

Fig. 1 shows a simplified power delivery system including the global Vdd supply, a switched-capacitor (SC) converter to convert the input Vdd to the required voltage supply level, a power grid to distribute the power to local core loads, and a core load. The output of the converters is Vcvt, but the exact voltage supply seen by the cores is downgraded to Vcore due to losses such as voltage droop (e.g., due to IR drop) in the power delivery network.

This work was supported in part by NSF CCF-0903427 and SRC 2009-TJ-1990.
To overcome these losses and ensure correct core operation, the specification on Vcvt, $V_{vdd,dom}$, must be set to

$$V_{vdd,dom} = V_{vdd,core} + V_{droop} + \Delta V \tag{1}$$

where $V_{vdd,core}$ is the minimum voltage specified at the core load, $V_{droop}$ is the peak voltage droop between Vcvt and Vcore, and $\Delta V$ is the peak-to-peak output voltage ripple of the converter. For a core that draws current $I_{core}$, the power supplied by the converters is:

$$P_{cvt} = I_{core} V_{vdd,dom} \tag{2}$$

However, the power drawn by the core loads is smaller:

$$P_{core} = I_{core} V_{vdd,core} \tag{3}$$

The remainder of the power, $I_{core}(V_{droop} + \Delta V)$, is wasted in various parts of the power delivery network.
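The accounting in Equations (1) to (3) can be illustrated with a minimal Python sketch. The function names and the numerical values (30 mV droop, 5 mV ripple) are assumptions chosen only for illustration; they are not taken from the paper.

```python
def domain_voltage(v_core_min, v_droop, ripple):
    """Equation (1): converter output specification covering droop and ripple."""
    return v_core_min + v_droop + ripple


def power_breakdown(i_core, v_core_min, v_droop, ripple):
    """Return (P_cvt, P_core, wasted power) following Equations (2) and (3)."""
    v_dom = domain_voltage(v_core_min, v_droop, ripple)
    p_cvt = i_core * v_dom          # power supplied by the converters
    p_core = i_core * v_core_min    # power actually drawn by the core
    return p_cvt, p_core, p_cvt - p_core


# Illustrative (assumed) operating point: 1 A core at 0.6 V minimum supply.
p_cvt, p_core, p_wasted = power_breakdown(
    i_core=1.0, v_core_min=0.6, v_droop=0.030, ripple=0.005
)
print(f"P_cvt={p_cvt:.3f} W, P_core={p_core:.3f} W, wasted={p_wasted:.3f} W")
```

The wasted term equals $I_{core}(V_{droop} + \Delta V)$, which is the quantity the rest of the paper tries to reduce.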
Prior work on optimizing on-chip capacitive DC-DC converters is very limited. The work in [2] has focused primarily on reducing wasted power within the internal design of the converter (i.e., entirely inside the "SC converter" box in Fig. 1) by controlling the voltage ripple ∆V, optimizing efficiency by choosing the optimal switch width and switching frequency. Under this paradigm, the burden of optimizing the other term, the voltage droop $V_{droop}$ (corresponding to the "Power grid" box in Fig. 1), is placed on conventional means for power grid optimization, e.g., grid topology selection and wire widening. The authors in [1] address the problem by suggesting the use of distributed SC converters, which can significantly reduce the voltage droop seen by the local core loads by providing more localized power distribution; however, they have not looked into the efficiency optimization problem.
In this work, we take a novel approach to the problem and consider a more holistic optimization of the DC-DC converter at the system level. We differ from prior efforts in considering not only the internals of the converter but also its context within the system to which it delivers power. In particular, we show that by optimizing the number and layout of the converters for the power domain, it is possible to control the losses due to wasted power in the power grid and enhance the efficiency of the converter. To the best of our knowledge, this is the first work to address efficiency optimization at the system level.
The rest of this paper is organized as follows. In Section II, we present some basic principles of SC converters. This is followed, in Section III, by a description of our proposed models for the various components of the power loss as a function of the size and layout of the SC converters in the power delivery system. Next, in Section IV, we formulate the efficiency optimization problem, and in Section V we cast it as a mixed-integer nonlinear program. Section VI describes our heuristic approaches for solving the problem. Finally, in Section VII, the efficiency of our approaches is demonstrated on both homogeneous and heterogeneous multicore chips.
II. SC DC-DC CONVERTERS
A block diagram of a general SC converter system is shown in Fig. 2(a). The system consists of $N_{phase}$ phase interleaving stages (a typical value of $N_{phase}$ is 32), which reduce the ripple voltage to $1/N_{phase}$ of that of an SC converter without any interleaving. At the core of the system is the switch matrix, one for each phase [7]. This matrix is a reconfigurable arrangement of switches and flying capacitors that is configured in different ways by the "Topology select" signal from the topology controller. Each such configuration provides a different voltage conversion ratio, allowing the converter to generate one of several output voltage levels [3]; for simplicity, these details are not shown here. The conversion ratio of the converter, ratiocvt, is defined as the ratio between the input voltage, which is the external supply voltage, Vdd, and the desired output voltage, $V_{vdd,dom}$, which is the specification for the ideal value of Vcvt. The control circuit takes these inputs:
• the clock signal clk from a phase-locked loop (PLL)
• the reference voltage for a particular topology, Vref
• the feedback voltage Vcvt from the converter output

It generates the nonoverlapping clock signals Φ1 and Φ2 for the switches in the switch matrix, and may also be used to gate some of the capacitors to control the amount of capacitance that takes part in the charge transfer process [4].
A switch matrix topology with a 2:1 conversion ratio is shown in Fig. 2(b). Fig. 2(c) (top) shows that during Φ1, the flying capacitor $C_{fly}$ is connected to the input global Vdd to be charged, and during Φ2, the charge stored in $C_{fly}$ is transferred to the load and its voltage drops by ∆V as it is discharged. This appears at the converter output, Vcvt in Fig. 2(a), as shown in Fig. 2(c) (bottom) during Φ2. Note that another switch matrix is connected to the output during Φ1 (and is charged during Φ2), which results in the voltage ripple observed in the Vcvt waveform.
Note that the signals Φi are generated by a relatively low-frequency clock (fsw ≈ 100 MHz), which is distinct from the multi-GHz clock used by the multicore processor.
III. POWER LOSS ANALYSIS
Efficiency is one of the key design metrics for on-chip DC-DC converters [2], [8]. We now analyze the inefficiency and power loss in an SC converter. Our analysis is based on [2], [7], [9], as well as on conversations with designers. Some items in this section are taken from the literature, while others are freshly derived.
For each converter, let fsw be the switching frequency of the converter, Csw = C f ly × N phase be the total amount of flying capacitance, and ∆V be the output ripple of the converter.
(1) Conduction loss: This corresponds to the power loss in the switches as the flying capacitors are charged. For each converter, the conduction loss is modeled as:
where Msw is a constant determined by the converter topology (Table I), Iout is the total current delivered by the converter, Ron is the switch resistance per unit width, and Wsw is the switch width.

Table I: topology-dependent parameters [7]. α is the ratio of the plate capacitance to its effective capacitance.
For a given topology, Wsw is proportional to fsw and Csw:
where σ is a fitting coefficient, and γ is topology-dependent (Table I) .
In an SC converter supporting DVFS, the switch size may be adjustable, where some of a set of parallel switches are turned on to achieve the desired switch size [9] .
(2) Gate-drive loss of the switches: The switches in a converter are implemented using transistors. These transistors must be very wide in order to minimize conduction losses, and therefore the power loss in driving their gate nodes can be modeled as:
where Nsw is the number of switches used in one particular topology and Cgate is the per-unit-width gate capacitance of the switches.

(3) Parasitic loss: This is the loss from the bottom-plate parasitic capacitance of the flying capacitors. The loss can be estimated as:
where Mp is a parameter that depends on the internal structure of a topology (Table I). This loss component depends on the particular type of the capacitance technology. Deep trench capacitors typically have superior efficiency compared to MIM and CMOS capacitors.

(4) The load power loss: The load power loss $I_{core}(V_{droop} + \Delta V)$, described in Section I, can be separated into two parts:

(4a) The part determined by the voltage ripple, ∆V, is

$$P_{L1} = I_{core} \Delta V \tag{8}$$

In each cycle, the charge a topology can deliver is given by $M_{topo} C_{sw} N_{phase} \Delta V$, where Mtopo is determined by the topology (Table I), because with the same amount of flying capacitance Csw, different topologies can deliver different amounts of power to the output. When switching at frequency fsw, the current a converter can provide is

$$I_{out} = M_{topo} C_{sw} N_{phase} f_{sw} \Delta V \tag{10}$$

From Equation (10), we can see that with the same output current Iout, the voltage ripple ∆V is inversely proportional to the size of the charge-transfer capacitance Csw.

(4b) The power loss associated with the voltage droop, $V_{droop}$, is

$$P_{L2} = I_{core} V_{droop} \tag{11}$$

Note that the voltage droop changes as we alter the number and locations of the converters on the chip, since the distance between the converters and the utilization points (cores) changes.

(5) Control circuit and clock network: The control unit generates the nonoverlapping clock signals for the switches used in the converter. This unit includes a voltage comparator, a DLL, and control logic. The power loss of the clock network arises from the wire capacitance, the clock buffers inserted for the wires, and the clock loads. The power losses from the control unit, $P_{ctrl}$, and the clock network, $P_{clock}$, both depend on the number of converters used, Ncvt. We use a penalty term for these two items in the objective formulation, as stated in Section V.
(6) Clock sources: The clock source is implemented as a simple PLL with relaxed frequency (≈ 100MHz) and jitter (less than tens of ps) requirements compared to the main PLL for the on-chip circuit.
Thus, the power consumption of the clock source is $P_{clksrc} = P_{PLL}$, where $P_{PLL}$ is the power consumption of one PLL [10].

(7) Topology controller: This generates the signals that provide DVFS directives to reconfigure the topology in each converter to set the conversion ratio that provides the desired voltage output level. The topology controller is a small combinational logic block and its power consumption is on the order of µW, which is ignored here.
IV. OPTIMIZATION FORMULATION
In the scenario studied here, it is safe to assume that the switching frequency fsw and interleaving stages N phase are fixed for the converters. Based on the analysis in Section III, the components of power loss can be divided into four categories.
The first component, which depends on the parameters of the converter, is the sum of the conduction loss, the gate-drive loss of the switches, the parasitic loss, and the ripple-related part of the load loss, PL1. It is determined by Csw and the global Vdd, as:

$$P_1 = P_{cond} + P_{sw} + P_{para} + P_{L1} \tag{12}$$
For each converter, we can change the total flying capacitance, Csw, to tune the voltage ripple ∆V , according to Equation (10) . A larger Csw results in smaller ∆V , and can therefore reduce the load power PL1 (Equation (8)) and switch conduction loss P cond (Equations (4) and (5)). On the other hand, the gate switching loss Psw (Equations (5) and (6)) and parasitic loss Ppara (Equation (7)) increase with Csw. An optimal value of Csw balances these conflicts.
The second and third components are, respectively, the droop-related part of the load loss, PL2, and the sum of the power loss in the control circuit and clock network:

$$P_2 = P_{L2} \tag{13}$$

$$P_3 = P_{ctrl} + P_{clock} \tag{14}$$

Both P2 and P3 are determined by the number and layout of the converters. Changing the granularity of the capacitance through more fine-grained distributed converters placed over the chip (as opposed to a single centralized converter) can help reduce the voltage droop seen by the core loads, and therefore reduce the loss PL2 [1]. However, using a larger number of converters implies a higher cost for the hardware implementation due to higher losses in the control circuit and clock network. Therefore, it is necessary to explore the number and layout of the DC-DC converters to determine an optimum.
The last component, corresponding to the loss of the clock sources, is fixed and given by

$$P_4 = P_{clksrc} \tag{15}$$

At the system level, the efficiency of the power delivery system, η, is defined as the ratio between the power delivered to the load and the total power extracted from the input Vdd supply, i.e.,

$$\eta = \frac{P_{core}}{P_{core} + P_1 + P_2 + P_3 + P_4} \tag{16}$$
where Pcore is defined in Equation (3). To increase the efficiency, we minimize the sum of P1 through P4, which constitute the power wasted during power delivery. Further, since P4 is a fixed quantity, to improve the overall efficiency of the power delivery system using SC converters, we should optimize the objective function:

$$\text{minimize } P_1 + P_2 + P_3 \tag{17}$$

The variables in the optimization problem are
• the number of converters used, Ncvt,
• the capacitance of each converter used, Csw, and
• the locations of the converters.

The optimization is subject to the following constraints:
1) The supply voltage at each core load must meet a lower bound:
2) Since the voltage ripple constraint must limit ∆V ≤ ∆Vmax, Equation (10) provides a bound on Csw:
3) To control the capacitance resource used, we require that:
where Cunit is the capacitance density, and Areamax is the maximum available area for the converters.
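The system-level efficiency used in this section, i.e., the ratio of the power delivered to the load to the total power drawn from the input supply, can be sketched in a few lines of Python. The loss values below are placeholders assumed for illustration, not results from the paper.

```python
def system_efficiency(p_core, losses):
    """Efficiency = delivered power / (delivered power + all wasted power).

    `losses` is an iterable holding the loss components P1..P4 in watts.
    """
    total_loss = sum(losses)
    return p_core / (p_core + total_loss)


# Assumed example: 9.6 W delivered to the cores, with placeholder losses.
p1, p2, p3, p4 = 0.80, 0.30, 0.10, 0.05
eta = system_efficiency(9.6, [p1, p2, p3, p4])
print(f"efficiency = {eta:.1%}")
```

Because P4 is fixed, lowering P1 + P2 + P3 is the only way to raise this ratio for a given core power.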
V. MINLP FORMULATION

Fig. 3(a) presents a schematic of the on-chip power delivery network for a multicore processor. The on-chip power delivery network consists of a global Vdd supply, on-chip DC-DC converters, the power grid, and core loads. The voltage supplied to the power grid is controlled by a set of on-chip SC converters, which can be placed at a list of predefined candidate locations on the chip. In the following sections, we show that the optimization problem in Section IV can be formulated as a mixed-integer nonlinear programming (MINLP) problem by introducing 0-1 integer variables zi, with zi = 1 denoting that a converter is placed at candidate location i. We first macromodel the power grid in Section V-A, and then present the complete MINLP formulation in Section V-B.
A. Macromodeling of the power grid
The power grid may have millions of nodes, but we are only interested in OBS, the selected n observation nodes of the core loads, and Src, the m predefined candidate connection nodes for the SC converters. Therefore, we build a macromodel whose ports are these n + m nodes, and abstract away all of the other nodes in the network using the macromodeling approach [11]. As a result, Fig. 3(a) is transformed to the model shown in Fig. 3(b).
The DC analysis of a V dd power grid is formulated as:
where G is the conductance matrix for the interconnected resistors, v is the vector of node voltages, and i is the vector of current loads. The equations for the power grid are given as
where U and V are the voltages of the internal nodes and ports, respectively, J1 and J2 are the current sources connected at the ports and internal nodes, respectively, and I is the vector of current flowing into the macromodel through the ports. The macromodel of the power grid including only the port nodes (the cores' accessing nodes OBS and the candidate nodes for the converters Src) is given by
where $A = G_{11} - G_{12} G_{22}^{-1} G_{21}$, and $S = J_1 - G_{12} G_{22}^{-1} J_2$. By partitioning the ports into sets Src and OBS, this can be rewritten as
where (ISrc, Vsrc) and (IOBS, VOBS) are the (current,voltage) values at the Src and OBS ports. Since IOBS = 0, we have: 
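The port reduction described in this subsection is a Schur complement of the conductance matrix. As a minimal sketch, the Python code below eliminates a single internal node from a tiny, hypothetical three-node grid using the expressions for A and S given above; the function name, the example conductances, and the restriction to one internal node (so that $G_{22}$ is a scalar) are all assumptions made to keep the example short.

```python
def reduce_to_ports(G, J, ports, internal):
    """Eliminate one internal node from the grid equations.

    Returns (A, S) with A = G11 - G12 * G22^-1 * G21 and
    S = J1 - G12 * G22^-1 * J2, where block 1 holds the port nodes
    and block 2 is the single internal node.
    """
    g22 = G[internal][internal]
    A = [[G[p][q] - G[p][internal] * G[internal][q] / g22 for q in ports]
         for p in ports]
    S = [J[p] - G[p][internal] * J[internal] / g22 for p in ports]
    return A, S


# Hypothetical chain: port 0 --(2 S)-- internal node 2 --(2 S)-- port 1
G = [[2.0, 0.0, -2.0],
     [0.0, 2.0, -2.0],
     [-2.0, -2.0, 4.0]]
J = [0.0, 0.0, 1.0]   # assumed current source at the internal node
A, S = reduce_to_ports(G, J, ports=[0, 1], internal=2)
print(A, S)
```

In this example the two 2 S conductances in series collapse to an equivalent 1 S conductance between the ports, which is what the reduced matrix A encodes. A full-size grid would use a sparse linear solver for $G_{22}^{-1}$ rather than scalar division.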
B. MINLP Formulation
Using the macromodel shown in Fig. 3(b) , the optimization problem described in Section IV is equivalent to finding the optimal zi assignments, and for each used converter i (with zi = 1), determining its size Ci and voltage ripple ∆Vi.
We rewrite P1 (Equation (12)), the power loss associated with the converter and the global V dd supply, as:
where
Using Equation (25), P2, the power loss in the grid, and P3 are:
Power supplied to the macromodel
where c is the penalty weight for the control circuit and clock network,
i Src , Ci, ∆Vi are the continuous variables and zis are the 0-1 integer variables in the optimization problem.
Then we can transform the optimization problem defined in Section IV into an MINLP formulation as
subject to ∀j ∈ OBS:
∀i ∈ Src:
and
Here, V j th is the minimum required voltage at the observation nodes of each core, and M is a large positive number.
Constraints (31) are transformed from Equation (18), to specify the minimum voltage for each core load. Constraints (32) are from Equation (26), and Constraints (34) from Equation (10) 
The objective function (30) contains nonlinear (in fact, nonconvex) terms, and constraints (34) are also nonlinear. Therefore, the above optimization problem is an MINLP.
VI. HEURISTIC APPROACHES
As stated in [12], "MINLP problems are difficult to solve precisely, because they combine all the difficulties of both of their subclasses: the combinatorial nature of mixed integer programs (MIP) and the difficulty in solving nonconvex (and even convex) nonlinear programs (NLP). Because subclasses MIP and NLP are among the class of theoretically difficult problems (NP-complete), so it is not surprising that solving MINLP [is] a challenging and daring venture."
Therefore, in our work we explore heuristic approaches to solve the optimization problem. For the objective function in Equation (30),
• P2 + P3 is determined by the number/layout of the converters
• P1 is determined by the converter design, i.e., the size of the converters Ci, and $V_{vdd,dom}$, the Vdd supply.

From Equation (1) we can see that $V_{vdd,dom}$ is determined by the voltage droop in the power grid and the ripple in the converters. Therefore, we may optimize the power loss in two steps. We first optimize P2 + P3, the power in the distribution network, by finding the optimal number and layout of the converters. We present two heuristic approaches in Section VI-B for this step. Next, we optimize P1 to determine the optimal size of each used converter Ci, which is presented in Section VI-C.
A. An approximation for the voltage ripple
We introduce the approximation that all converters have the same voltage ripple. In other words, ∆Vi = ∆V ∀ i such that zi = 1. The impact of this assumption is that by Equation (34), the current delivered by a converter i is proportional to its capacitance Ci, which is a reasonable assumption.
We justify this approximation as follows. In Equation (27), let P i 1 be the contribution of the i th converter to P1. If zi = 1,
vdd,dom Ci (39) According to Equation (34), P i 1 is equivalent to
If we minimize P i 1 locally by setting ∂P i 1 /∂Ci = 0, we get
Therefore, according to Equation (34) we can see that
Since e1, e2, and e3 are constants, and $V_{vdd,dom}$ is common to all the converters, the ∆Vi can be assumed to be the same among the used converters if they are locally optimized. Therefore, in the following discussion, we assume ∆Vi = ∆V for each used converter. If all Ci were free variables, allowed to take any value, this would not be an approximation. However, according to Equation (38), the Ci are constrained, and therefore this is an approximation.
B. Optimizing Converter Number/Layout
As stated earlier, the number and layout of the converters also affect the efficiency of the power delivery system. Distributing the converters with finer granularity and an optimized layout over the chip places the converters closer to the utilization points, which reduces the voltage droop seen by the local core loads and therefore the efficiency loss. However, there is an overhead associated with the power loss in the control units and clock network.
1) How significant is the converter area?: At this point, it is useful to consider some technology numbers to determine the area overheads of the SC converters. To compute this, we assume that the SC converters are fabricated using deep-trench capacitors. In [6], the reported capacitance density of deep-trench capacitors is 200 nF/mm². A typical core draws a current of about 1 A. According to Equation (9), if we use a 2:1 converter (with Mtopo = 2) to deliver this amount of current with ripple ∆V = 5 mV, $N_{phase}$ = 32 and fsw = 100 MHz, then the required amount of capacitance is 31.25 nF, which translates to 0.156 mm². Considering that the typical size of a core is several mm², we may ignore the area effect of the converters when optimizing the layout of the converters. Of course, we can extend the general methodology described in this section to deal with other kinds of capacitors, such as the MIM capacitor, by considering the area effect in exploring the granularity of the converters, but this is a topic for future work.
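The arithmetic in this paragraph can be reproduced with a short Python sketch. It assumes the relation between output current, capacitance, and ripple described in Section III, namely that the current equals the product of Mtopo, Csw, $N_{phase}$, fsw, and ∆V; the function name is hypothetical.

```python
def required_capacitance(i_out, m_topo, n_phase, f_sw, ripple):
    """Capacitance (in farads) needed to deliver i_out at a given ripple,
    assuming i_out = m_topo * C * n_phase * f_sw * ripple."""
    return i_out / (m_topo * n_phase * f_sw * ripple)


# Values quoted in the text: 1 A core, 2:1 topology, 32 phases, 100 MHz, 5 mV.
c_sw = required_capacitance(i_out=1.0, m_topo=2, n_phase=32,
                            f_sw=100e6, ripple=5e-3)
density = 200e-9                 # deep-trench capacitance density in F/mm^2
area_mm2 = c_sw / density

print(f"C = {c_sw * 1e9:.2f} nF, area = {area_mm2:.3f} mm^2")
```

The sketch yields 31.25 nF and about 0.156 mm², matching the numbers in the text, which is small compared with a core of several mm².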
2) MILP-based Approach:
In this section, we present an MILP-based approach obtained by reducing the MINLP problem in Section V through a natural approximation and relaxation process.
We proceed under the assumption that for each used converter, ∆Vi = ∆V , and define 
Essentially, since I i src = 0 when zi = 0, the substitution in the first term means that V i Src = V vdd,local . In the above expression,
Src is the total current delivered to the cores, and therefore, a constant. We can see that by relaxation we can transform the nonlinear cost function P2 to be linear.
In fact, in our experiments using all approaches, we find that $V^i_{Src}$ is nearly equal for every converter i, so that (44) is in practice an equality, confirming the validity of minimizing the relaxed P2.
) is a constant, it is unchanged under any optimization. Then the relaxed power loss (P2 + P3) can be minimized by solving the following MILP problem:
subject to the linear constraints in Equations (31), (33)

3) Greedy Approach: Considering that MILP can be expensive for a large number of integer variables zi, we propose a greedy approach to reduce the run-time complexity of solving the optimization problem with a large set of candidate locations for the converters. The idea is to explore different granularities of converters, from one converter for each core to a single lumped converter for all the cores.
For a chip with l cores, the inputs of the greedy approach include
1) A list of cores ℜ = {C0, . . . , Cl}. Core Ci has peak current Ii and minimum required voltage supply $V_{vdd,C_i}$.
2) An adjacency graph G0 representing the neighbor relationships among the l cores; if a layout is provided instead, this information can be generated using Voronoi diagrams.
3) A list of all candidate locations Ψ = {ψ1, . . . , ψm} for the converters on the chip (Fig. 5 shows the part of the candidate set that is used by the converters).

The edge weight wij of an edge between vertices i and j in the adjacency graph is calculated as the increase in the power loss from combining two converters Vi and Vj into a single converter, Vij. This quantity is the total change in the power loss P2 + P3, which includes:
1) the change in power loss from voltage droop, ∆PL2 [Equations (1), (2) and (11)], which depends on the core currents Ii,
2) the change in power loss from the control circuit, $\Delta P_{ctrl}$, and
3) the change in power loss from the clock network, $\Delta P_{clock}$,

i.e., $w_{ij} = \Delta P_{L2} + \Delta P_{ctrl} + \Delta P_{clock}$, where ∆PL2 is non-negative because voltage droop tends to increase with fewer converters, $\Delta P_{ctrl} = -P_{ctrlr}$ because the number of converters is reduced by one after combining two converters into one, and $\Delta P_{clock}$ is determined by the locations of the converters Vi, Vj and Vij. Note that wij can be negative in our approach.
Our approach to optimizing the converter design is iterative in nature, and the overall scheme is illustrated in the left half of Fig. 4. We begin with a design with one individual converter for each core. The top right box in Fig. 4 shows an example of the given adjacency graph G0 for the l cores. In G0, each node Vi represents the converter for core Ci. We then contract edges in the graph to reduce the number of converters by merging adjacent converters. Starting from the given adjacency graph G0 with l converters, at each iteration we greedily merge the neighboring converters Vi and Vj with minimum edge weight wij, so as to minimize the possible increase of power loss at the next level of converter granularity. When merging two neighboring converters Vi and Vj, the two nodes in the adjacency graph are merged into one new node, and the weights of the edges between this new node and its neighbors are updated as stated earlier.
We compute the optimal location, as described in the next paragraph, for the combined converter Vij, and then update the adjacency graph. With l cores, our approach will repeat the merging process l−1 times to evaluate all possible levels of converter granularity.
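The merging loop described above can be sketched in Python as a greedy edge-contraction procedure. This is only an illustration under stated assumptions: `cluster_loss` is a hypothetical callback standing in for the evaluation of P2 + P3 of one converter serving a group of cores (in the paper this involves relocating the merged converter), and the toy loss model in the example is invented for demonstration.

```python
def greedy_merge(adjacency, cluster_loss):
    """Greedily contract the minimum-weight edge until one converter remains.

    adjacency    : dict mapping each core id (int) to a set of neighbor ids
    cluster_loss : callable(frozenset of core ids) -> loss of one converter
                   serving exactly those cores (hypothetical stand-in)
    Returns (total_loss, clusters) for the best granularity encountered.
    """
    clusters = {c: frozenset([c]) for c in adjacency}
    neighbors = {c: set(adjacency[c]) for c in adjacency}
    total = sum(cluster_loss(m) for m in clusters.values())
    history = [(total, list(clusters.values()))]

    while len(clusters) > 1:
        best = None
        for a in clusters:
            for b in neighbors[a]:
                if a < b:  # visit each undirected edge once
                    w = (cluster_loss(clusters[a] | clusters[b])
                         - cluster_loss(clusters[a]) - cluster_loss(clusters[b]))
                    if best is None or w < best[0]:
                        best = (w, a, b)
        if best is None:   # graph is disconnected; nothing left to merge
            break
        w, a, b = best
        clusters[a] = clusters[a] | clusters.pop(b)
        for n in neighbors.pop(b):   # rewire b's edges to the merged node a
            neighbors[n].discard(b)
            if n != a:
                neighbors[n].add(a)
                neighbors[a].add(n)
        total += w
        history.append((total, list(clusters.values())))

    return min(history, key=lambda entry: entry[0])


# Toy example: four cores in a row. Assumed loss = fixed controller overhead
# plus a droop term that grows with the span and size of the cluster.
def line_loss(members):
    return 1.0 + 0.4 * (max(members) - min(members)) * len(members)

adjacency = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
best_total, best_clusters = greedy_merge(adjacency, line_loss)
print(best_total, [sorted(c) for c in best_clusters])
```

With this toy loss model, the best trade-off is two converters, each serving a pair of adjacent cores. The sketch recomputes every edge weight at each iteration for clarity; caching the weights and updating only the edges touching the merged node would be the natural refinement.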
We select the location of a converter Vi from the set of candidate locations Ψ to minimize the nominal output voltage of the converters, minus the voltage ripple part [Equation (1)], i.e.,
where $V_{droop,i}$ is the voltage drop at core Ci. When evaluating each candidate location, the voltage droop of each core can be obtained from the simulation of the power grid. However, because the power grid is typically costly to simulate, we speed up the evaluation process by assuming that the conduction resistance between a core Ci and its converter Vj is linearly proportional to their distance
where Runit is the unit-distance resistance of the power grid. However, the voltage droop for our final results is validated using an accurate circuit simulator.
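The location selection step can be sketched as follows. Since the displayed objective is not reproduced here, the Python sketch assumes that a candidate is scored by the largest value of minimum core voltage plus estimated droop among the served cores, with droop estimated as core current times Runit times distance. The use of Manhattan distance, the function name, and all numerical values are assumptions for illustration.

```python
def best_location(cores, candidates, r_unit):
    """Pick the candidate location that minimizes the required output voltage.

    cores      : list of (x, y, peak_current, v_min) tuples
    candidates : list of (x, y) candidate converter locations
    r_unit     : assumed grid resistance per unit distance (ohm per unit)
    """
    def required_voltage(loc):
        # Droop model: I * R_unit * distance (Manhattan distance assumed).
        return max(
            v_min + current * r_unit * (abs(x - loc[0]) + abs(y - loc[1]))
            for (x, y, current, v_min) in cores
        )
    return min(candidates, key=required_voltage)


# Two identical cores (1 A, 0.6 V) served by one converter.
cores = [(0.0, 0.0, 1.0, 0.6), (4.0, 0.0, 1.0, 0.6)]
candidates = [(0.0, 0.0), (2.0, 0.0), (4.0, 0.0)]
chosen = best_location(cores, candidates, r_unit=0.01)
print(chosen)
```

For two equal loads the midpoint candidate is selected, since it balances the estimated droop of both cores. As the text notes, such distance-based estimates are only a fast proxy and the final droop still needs to be checked with circuit simulation.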
C. Optimization of Converter Size
After determining the number and layout of converters using the heuristic approaches in Section VI-B, the second step is to determine Ci for each converter i by optimizing P1.
Ci, then from Equation (42) we can see that
so to minimize the power loss P1 in Equation (27) is equivalent to minimizing
Using Equation (43), Equation (49) can be further transformed to
where $I_{total}$ is a constant, and $V_{vdd,local}$ can be found after solving the optimization problem in Section VI-B. The constraints for the above problem are given by Equation (38) and
which is derived from Equations (36) and (48). Note that P1 is a convex function of C total . It is easily determined that the optimal solution to the unconstrained problem defined in Equation (50) is given by:
However, this value of C0 may fall outside the bounding constraints (38). If so, from the convexity of the objective function, we can conclude that the optimum must be at the extreme point of the allowable $C_{total}$ interval that is closer to C0. Thus, the optimal size of $C_{total}$ for the converters, Copt, is
Then we can calculate the voltage ripple ∆V according to Equation (48) using Copt, and the optimal size of each used converter Ci can be calculated by Equation (48) because I i Src is known after solving the optimization problem in Section VI-B.
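The argument that a convex one-dimensional objective is minimized by clamping its unconstrained optimum to the feasible interval can be sketched in Python. The loss model used below, a/C + b·C, is a hypothetical stand-in chosen because ripple-related losses fall with capacitance while gate-drive and parasitic losses grow with it; it is not the paper's Equation (50), and the coefficients are arbitrary.

```python
import math


def optimal_total_capacitance(a, b, c_min, c_max):
    """Minimize the assumed convex loss a / C + b * C over [c_min, c_max].

    The unconstrained minimizer is sqrt(a / b); by convexity, the
    constrained optimum is that value clamped to the interval.
    """
    c0 = math.sqrt(a / b)
    return min(max(c0, c_min), c_max)


print(optimal_total_capacitance(4.0, 1.0, 1.0, 5.0))   # interior optimum
print(optimal_total_capacitance(4.0, 1.0, 3.0, 5.0))   # clamped to lower bound
print(optimal_total_capacitance(4.0, 1.0, 0.5, 1.0))   # clamped to upper bound
```

Clamping works here only because the objective is convex in a single variable; with several coupled variables a proper constrained solver would be needed.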
VII. EXPERIMENTAL RESULTS
Our heuristic approaches described in Section VI are implemented in C++. The MILP problem is solved using CPLEX [13] .
A. Test Cases
Our approaches were exercised on two chips, one of which is a homogeneous multicore processor while the other is a heterogeneous multicore processor. The configuration of each chip is described below.

Homogeneous Chip: Our homogeneous test case consists of a chip with one power domain of 16 identical cores, as shown in Fig. 5 (left), which follows the tile-based design for multicore chips [14]. Each core consists of a CPU, L1 I/D cache and L2 cache with an area ratio of 2:1:2. The core is 3×3 mm² with a peak current of 1 A at 0.6 V. In our simulations, we model the current ratio among the CPU, L1 cache and L2 cache inside each core using guidelines consistent with [15].

Heterogeneous Chip: We also consider a heterogeneous test case consisting of a set of ARM Cortex cores [16]. Simpler versions of such heterogeneous cores are already on the market today [17]. This test case has one power domain of 32 cores as shown in Fig. 5 (right). Core types A through E are, respectively, the A9, A8, A5, M4, and M0 cores.

Table II shows our experimental parameters in the 32nm technology node based on the published literature and PTM [18]. We assume the available converter area to be up to 20% of the total core area.
Table II: Experimental parameters (individual parameters for Homo16 and Hete32, and common parameters).
B. Comparison of Heuristic Approaches
We have presented two heuristic approaches for the optimization of the number and layout of the converters in Section VI-B, followed by the optimization of converter size using a closed-form solution. The first heuristic approach, Heuristic-MILP (Section VI-B2), formulates the optimization as an MILP problem, and the second heuristic approach, Greedy (Section VI-B3), uses a greedy strategy to explore the number and layout of converters at different levels of granularity. We compare these two approaches with a manual design approach, which evenly distributes the converters over the chip at different levels of granularity with the total number of converters set to [...].

Table III shows the results of these approaches. Columns 2-3 show m, the number of candidate locations for the converters, and n, the number of observation nodes for the cores. Columns 4-9 show the results of manual design, columns 10-16 give the results of the greedy scheme discussed in Section VI-B3, and columns 17-23 show the results of the heuristic approach presented in Section VI-B2. For each approach, we list the total number of converters used, the total power loss (refer to Equation (17)) and its breakdown, P1, P2, and P3, in mW. We also show η, the system-level efficiency of the power delivery system, and CPU, the runtime of the two heuristic approaches in seconds (on a 64-bit 2.5GHz Intel Quad-core platform).
On average, compared to the manual design, the greedy approach can reduce P2 (the power loss due to voltage droop) by 33%, and the total power loss by 19%, with higher system-level efficiency. The heuristic approach based on MILP can reduce P2 by about 50% and the total power loss by 25%. The system-level efficiency is improved from 86.1% to 88.1% for the homogeneous chip, and from 86.1% to 90.1% for the heterogeneous chip. The runtime of the MILP problem is tractable; it takes only a few minutes for CPLEX to solve these two chips.
As stated before, the manual design has a limited search space with respect to the number of converters, as compared to the two heuristic approaches. For a comparison that is more favorable to the limited search space of the manual design, and to explore the quality of our approach under stringent constraints, we perform another set of experiments in which all three approaches are given the same upper bound on the available number of converters.
The results are presented in Table IV. Column 3 shows the upper bound on the number of converters. The table shows that, compared to the manual design, Greedy and Heuristic-MILP still reduce the total power loss by 13% and 18% on average, respectively. This is because, with the same number of converters, the heuristic approaches can search different combinations of the converters. Even for the homogeneous chip, there is still room for improvement because of the uneven distribution of current within each core and the asymmetry in the power pads shared by different power domains in a single chip.
Fig. 6(a) shows how the power losses P2 and P3 and the total power loss P1 + P2 + P3 change with the number of converters for the homogeneous chip when the Heuristic-MILP approach is applied. As the number of converters increases from 1 (all the cores connected to a single converter) to 30, the power loss P2 due to voltage droop decreases quickly, with a reduction of more than 20×. This implies that a distributed design of the converters can effectively reduce the IR drop seen by the cores and therefore improve the efficiency of the power delivery system. The reduction in total power loss starts to slow down as the number of converters increases further, and the overhead from the control circuit and clock network begins to dominate the overall power loss. Similar results can be observed for the heterogeneous chip, as shown in Fig. 7(a).
Fig. 6(a) shows high power loss (more than 10W) when only a few converters are used. This is because we generated the results with the same wiring resources for every number of converters. The loss can be reduced by using more interconnect resources, that is, by narrowing the pitch of the power grid, but that can cause very high congestion.
For the homogeneous chip, the lowest total power loss is achieved with 47 converters, as shown in Fig. 6(b), and the layout is shown in Fig. 5(left). Note that although there is no large difference in the total power loss between the cases using 47 and 56 converters, more routing resources are needed for the clock network when more converters are used, which is not captured by the power loss objective function. It is certainly possible to use an enhanced objective that captures this factor, or to determine a reasonable tradeoff by examining the curve. For the heterogeneous chip, the lowest total power loss is achieved with 13 converters, as shown in Fig. 7(b), and the layout is shown in Fig. 5(right).
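One way to "determine a reasonable tradeoff by examining the curve" is to prefer the smallest converter count whose total loss lies within a chosen tolerance of the minimum. The sketch below illustrates that rule only; it is not the selection procedure used in this work. The function name, the tolerance parameter, and every loss value in the example curve are hypothetical and are not read from Fig. 6 or Fig. 7.

```python
from typing import Dict


def pick_converter_count(loss_by_count: Dict[int, float],
                         tolerance: float = 0.01) -> int:
    """Return the smallest converter count whose total loss is within
    `tolerance` (relative) of the minimum loss on the curve."""
    if not loss_by_count:
        raise ValueError("loss curve is empty")
    best = min(loss_by_count.values())
    threshold = best * (1.0 + tolerance)
    # Fewer converters means less clock and control routing, so among
    # near-optimal points we take the smallest count.
    return min(n for n, loss in loss_by_count.items() if loss <= threshold)


# Hypothetical total-loss curve (converter count -> loss in mW).
curve = {1: 12000.0, 16: 2500.0, 30: 1900.0, 47: 1800.0, 56: 1805.0, 64: 1850.0}

choice_tight = pick_converter_count(curve, tolerance=0.01)
choice_loose = pick_converter_count(curve, tolerance=0.06)
print(choice_tight, choice_loose)
```

A tight tolerance keeps the choice at the minimum of the curve, while a looser one trades a small amount of loss for fewer converters and hence less routing overhead.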
In Section VI, we proposed heuristic approaches that break the MINLP problem (described in Section V) into two independent subproblems. We also have another formulation (details not shown due to space limitations) that solves the MINLP problem approximately in an iterative way. We start with the initial guess to the MINLP problem provided by Heuristic-MILP and the closed-form solution presented in Section VI-C, and we set the integer variables z_i to the values from this initial guess (i.e., we fix the number and location of the converters).
The iterative process, called Heuristic-iterative, consists of two steps: (1) For fixed z_i, the MINLP problem in Section V-B becomes an NLP, which is solved by CPLEX through sequential linear programming. (2) We update the number and location of the converters by solving a MILP problem in which some variables are fixed based on the NLP solution. The key difference between Heuristic-MILP and Heuristic-iterative is that Heuristic-iterative allows the converters to have different voltage ripples ΔV_i. Table V compares the results of Heuristic-MILP and Heuristic-iterative. We observe that Heuristic-iterative improves the initial guess provided by Heuristic-MILP by only a small amount. This implies that the assumption of identical voltage ripple made in Section VI is acceptable in terms of solution quality.
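The two-step alternation described above can be sketched as a generic loop. This is only an illustration of the control flow under stated assumptions, not the formulation used in this work, whose details are omitted. The callables `solve_continuous` and `update_placement` are hypothetical stand-ins for the NLP and MILP steps, the relative-improvement stopping rule is an assumption, and the toy loss model at the bottom is invented purely so that the loop can be run.

```python
from typing import Callable, Tuple

Placement = Tuple[int, ...]   # z_i in {0, 1}: converter present at site i
Ripple = Tuple[float, ...]    # per-site voltage ripple, Delta V_i


def heuristic_iterative(
    z0: Placement,
    solve_continuous: Callable[[Placement], Tuple[Ripple, float]],
    update_placement: Callable[[Placement, Ripple], Placement],
    max_iters: int = 20,
    rel_tol: float = 1e-3,
) -> Tuple[Placement, Ripple, float]:
    """Alternate between a continuous solve for fixed z and a placement
    update, stopping when the loss no longer improves by rel_tol."""
    z = z0
    ripple, loss = solve_continuous(z)                   # step (1) on the initial guess
    for _ in range(max_iters):
        z_new = update_placement(z, ripple)              # step (2): revise placement
        ripple_new, loss_new = solve_continuous(z_new)   # step (1) again
        if loss - loss_new <= rel_tol * loss:            # assumed stopping rule
            break
        z, ripple, loss = z_new, ripple_new, loss_new
    return z, ripple, loss


# Toy stand-ins (hypothetical): droop loss falls with converter count,
# overhead grows with it.
def toy_solve(z: Placement) -> Tuple[Ripple, float]:
    n = sum(z)
    ripple = tuple(0.01 if on else 0.0 for on in z)
    return ripple, 10.0 / max(n, 1) + 1.0 * n


def toy_update(z: Placement, ripple: Ripple) -> Placement:
    sites = list(z)
    for i, on in enumerate(sites):
        if not on:
            sites[i] = 1   # enable the first unused site
            break
    return tuple(sites)


z_final, ripple_final, loss_final = heuristic_iterative((1, 0, 0, 0), toy_solve, toy_update)
print(z_final, round(loss_final, 3))
```

The loop keeps the last accepted solution, so a placement update that makes the loss worse is discarded rather than returned.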
VIII. CONCLUSION
In this paper, we study the efficiency of the power delivery system using SC converters at the system level. This work develops models for the efficiency of such a system as a function of the size and layout of the SC converters, and formulates the problem as a mixed integer nonlinear program. We then propose heuristic approaches to optimize the size and layout of the SC converters to minimize power loss. The effectiveness of these techniques is demonstrated on both homogeneous and heterogeneous multicore chips. Our current work considers only deep trench capacitors. In the future, we plan to extend it to other types of capacitors, such as CMOS and MIM capacitors, by considering the area effect when exploring the granularity of the converters.
