Abstract-High power consumption not only leads to short battery life for hand-held devices but also causes on-chip thermal and reliability problems in general. As power consumption is proportional to the square of supply voltage, reducing supply voltage can significantly reduce power consumption. Multi-supply voltage (MSV) has previously been introduced to provide finer grain power and performance tradeoff. In this paper, we propose a methodology on top of a set of algorithms to exploit nontrivial voltage island boundaries for optimal power versus design-cost tradeoff under performance requirement. Our algorithms are efficient, robust, and error-bounded and can be flexibly tuned to optimize for various design objectives (e.g., minimal power within a given number of voltage islands, or minimal fragmentation in voltage islands within a given power bound) depending on the design requirement. Our experiment on real industry designs shows a tenfold improvement of our method over current logical-boundary-based industry approach.
Placement-Proximity-Based Voltage Island Grouping Under Performance Requirement
Huaizhi Wu, Martin D. F. Wong, Fellow, IEEE, I-Min Liu, and Yusu Wang
Index Terms-Low power, optimization, timing, voltage island.

I. INTRODUCTION
With the broadening market interest in sophisticated mobile applications, meeting aggressive power targets on top of performance requirements in high-speed portable designs is becoming a challenging task. As the design parameters for optimal power and optimal performance often contradict each other (for example, a lower supply voltage reduces power consumption but slows device speed), designers constantly struggle to balance power and performance throughout the chip-design cycle.
High power consumption not only leads to short battery life for hand-held devices but also causes on-chip thermal and reliability problems in general. At the 90-nm process node, the vast amount of functionality integrated within SoC designs, compounded with much larger leakage current, is already leading to designs with power dissipation in the hundreds of watts. As process technology is trending against power dissipation, this problem is expected to only get worse at future process nodes. Power consumption generally breaks down into two sources: dynamic and static power [1]. Static power in CMOS devices includes reverse-biased junction leakage, gate-induced drain leakage, direct-tunneling leakage, and subthreshold leakage [2]. With current technology, subthreshold leakage is much larger than the others, due to the reduced threshold voltage V_t. While static power comes from leakage current, dynamic power P_d is the result of a device's switching activities. It can be represented by

$$P_d = k \cdot c \cdot f \cdot V_{dd}^2 \tag{1}$$
where k is the switching rate, c is the load capacitance, f is the clock frequency, and V_dd is the supply voltage. Static and dynamic power are comparable in many of today's logic designs. Techniques to lower switching power combine reductions in switching activity, load capacitance, and supply voltage. For example, the clock frequency can be set to zero by gating the clock to an inactive logic block. Load capacitance can be reduced by minimizing total wire length and by downsizing the gates.
As dynamic power is proportional to the square of the supply voltage V_dd, reducing V_dd can significantly reduce dynamic power. Meanwhile, when a higher supply voltage is selectively available, the percentage of low-V_t devices needed in a design becomes smaller, which reduces leakage power as well. However, the delay increase due to reduced V_dd degrades the circuit performance. Multi-supply voltage (MSV) can be used to provide a finer grained power and performance tradeoff. In an MSV design, high V_dd is assigned to cells on critical paths while low V_dd is assigned to cells on noncritical paths, so that power can be saved without degrading the overall circuit performance [3].
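The quadratic dependence on V_dd can be made concrete with a small numeric sketch. The function name and the operating point below (switching rate, capacitance, frequency, and the two voltages) are hypothetical values chosen only for illustration; only the ratio between the two results matters.

```python
def dynamic_power(k: float, c: float, f: float, vdd: float) -> float:
    """Dynamic power P_d = k * c * f * vdd**2, using the symbols of the text."""
    return k * c * f * vdd ** 2


# Hypothetical operating point (not taken from the paper).
high = dynamic_power(k=0.2, c=1e-12, f=1e9, vdd=1.2)
low = dynamic_power(k=0.2, c=1e-12, f=1e9, vdd=1.0)

# Fraction of dynamic power saved by lowering V_dd from 1.2 V to 1.0 V.
saving = 1.0 - low / high
print(f"dynamic power saved: {saving:.1%}")
```

Under these assumed values, a voltage reduction of about 17% removes roughly 30% of the dynamic power, which is why voltage assignment is such an effective lever.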
Although, in theory, only timing-critical cells need high V_dd, assigning voltages cell by cell for maximum power reduction is not practical. When low-V_dd and high-V_dd cells interleave heavily, there is a significant overhead in the level-shifting devices that must be placed between low-V_dd and high-V_dd cells to eliminate the undesirable static current that would otherwise flow. Moreover, the resulting fragmented power network is expensive to implement: it requires tedious manual work and consumes precious routing resources, which is especially costly as per-metal-layer manufacturing cost soars with each process migration. Previous efforts toward reducing the level-shifting overhead include 1) clustered voltage scaling (CVS) [4] and 2) extended CVS (ECVS) [5]. In CVS, the cells driven by each power supply are grouped (clustered) together, and level shifting is needed only at sequential element outputs. In ECVS, level shifting is allowed anywhere within the combinational logic block using an asynchronous level shifter. The added flexibility in ECVS can provide greater power reduction than CVS. However, the delay penalty tends to be larger too.
0278-0070/$25.00 © 2007 IEEE
Grouping cells of different supply voltages into a small number of "voltage islands" [6]-[8], where each voltage island occupies a contiguous physical space and operates at a single supply voltage that meets the performance requirement, can also effectively reduce the amount of level shifting, as well as reduce the cost of the power network. The current state-of-the-art design of voltage islands is largely done manually and is primarily based on the design's logic hierarchy. That is, designers partition circuits into a few groups based on their performance requirements and the connectivity between modules. Each group is then assigned a supply voltage. Logic boundaries are largely used in this grouping process mainly because they are the boundaries that designers are most familiar with. However, these "natural" boundaries in a design are almost always nonoptimal boundaries for supply voltages. Fig. 1 illustrates why sticking to logical boundaries limits the solution space for producing an optimal MSV design. In the example, there are three modules, each of which contains only leaf cells, and both modules A and B contain some timing-critical cells that require high voltages [Fig. 1(a)]. Fig. 1(b) and (c) are designs based on logical boundaries. While Fig. 1(b) guarantees the performance at the cost of high power, Fig. 1(c) reduces the power consumption without meeting the timing requirement. Neither is an optimal MSV design. By using placement-proximity (instead of logical) information, the optimal MSV design meets the power and timing requirements at the same time while keeping the number of power domains small [Fig. 1(d)].
In this paper, we wish to reduce the cost of the power network in an MSV design by voltage-island grouping (we also call it voltage-island partitioning).¹ We propose a methodology on top of a set of algorithms to exploit the nonlogical boundaries in a design for optimal voltage-island grouping that captures the power versus design-cost tradeoff under a performance (timing) requirement. Depending on each designer's specific needs, the optimal tradeoff can be explored through either one of two dual optimization problems: maximally reduce power consumption within a given bound on the number of voltage islands, or create minimally fragmented voltage islands within a given bound on power consumption. Our approach can handle both problems, with the latter having an extra log(k) factor in running time (k is the number of voltage islands), coming from a binary search for k. We will focus on the latter problem in the following discussion. However, all our results can be easily adapted to the former one.
Our contributions can be summarized as follows. To the best of our knowledge, we are the first to consider the power versus design-cost tradeoff under a timing requirement for the voltage-island-grouping problem. In particular, we exploit nontrivial voltage-island boundaries to balance power consumption and power-network fragmentation. We formulate this problem as a partitioning problem (Section II). This voltage-partitioning problem (VPP) is hard (some closely related variants with much simpler weight functions are NP-hard), and we, thus, study approximation algorithms (i.e., algorithms whose solutions come with provable quality guarantees) and present one that runs in polynomial time. This algorithm, unfortunately, is not efficient enough for practical designs. We, therefore, design an efficient two-step heuristic algorithm, which combines dynamic programming with variable-sized p × q gridding (Section III). We show (Section IV) that our method is efficient and practical, and that it produces good-quality voltage islands for a wide selection of industry data. Compared to the current industry approach using logical boundaries within a design, our method generates about one tenth as many voltage islands for the same amount of power reduction. The running time is small even for very large industry designs.
¹ We assume that power consumption is monotonic in the number of voltage islands, meaning that we consider a device's power and delay to depend only on its own supply voltage and not on others. In theory, a device's power and delay are functions of its own supply voltage, as well as of the supply voltages of other devices due to coupling. In this paper, we ignore such coupling effects, as they are secondary and can be optimized by other techniques if they arise.
II. VOLTAGE-PARTITIONING PROBLEM
A. Problem Definition
Let A be an m × n array, and let A[l..r][b..u] denote the subarray of A with (l, b) [respectively (r, u)] as the bottom-left (respectively upper-right) corner. We may also refer to an array (or a subarray) as a rectangle or a region. Let µ(R) = max_{(i,j)∈R} A[i][j] be the maximum value of all elements in a rectangle R. The weight of a rectangle R is defined as

$$\omega(R) = \sum_{(i,j) \in R} \bigl( \mu(R) - A[i][j] \bigr).$$

See Fig. 2(a) for an illustration.
A partitioning of A is a set of disjoint rectangles (subarrays) Π = {R_1, ..., R_k} that cover A; k = |Π| is called the size of this partitioning. See Fig. 2(b) for an illustration. The weight of a partitioning Π is defined as

$$\omega(\Pi) = \sum_{R \in \Pi} \omega(R).$$
In the voltage-island-partitioning problem, we wish to subdivide the placement region into a small number of voltage islands (where every cell in the same voltage island will eventually receive the same voltage), while keeping the total power consumption low and timing requirement met. The latter means that the voltage assigned to each cell should be no lower than its required value. Therefore, the voltage value of each voltage island should be the maximum required voltage of all cells in this voltage island. Raising voltages of cells with lower requirement to this maximum value will result in an increase in the power consumption of this voltage island. To keep the overall power consumption low, it is desirable to have the total power penalty on all voltage islands below some threshold.
We can consider each standard-cell-placement region as a 2-D array A, induced by the underlying placement grid. As dynamic power is proportional to the square of the supply voltage [see (1)], we let A[i][j] be the square of the required voltage of the corresponding standard cell covering this grid. Then, it is easy to see that a partitioning of A corresponds to a voltage-island partitioning of the placement region, µ(R) is the voltage value of each voltage island R, and the weight of the partitioning represents the total power penalty. We can, thus, formally define the VPP as follows.
Definition 1 (VPP):
Given an m × n array A and an error threshold δ, among all partitionings whose weight is at most δ, find one with the smallest size. Let κ(A, δ) be the size of this optimal partitioning.
The dual version of this problem (DVPP) is defined as the problem of minimizing the weight of the partitioning with a bound on the size. We will focus on VPP in this paper. However, our algorithm can easily handle DVPP as well.
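The definitions above can be illustrated with a minimal sketch. It assumes that the weight of a rectangle is the total amount by which its elements fall below the rectangle's maximum, which is what the power-penalty discussion suggests; the function names, the inclusive `(row_lo, row_hi, col_lo, col_hi)` rectangle encoding, and the toy array are all hypothetical.

```python
from typing import List, Sequence, Tuple

# Inclusive rectangle: (row_lo, row_hi, col_lo, col_hi). Hypothetical encoding.
Rect = Tuple[int, int, int, int]


def cells(A: Sequence[Sequence[float]], R: Rect) -> List[float]:
    """All elements of A that lie inside rectangle R."""
    r0, r1, c0, c1 = R
    return [A[i][j] for i in range(r0, r1 + 1) for j in range(c0, c1 + 1)]


def mu(A: Sequence[Sequence[float]], R: Rect) -> float:
    """Voltage value of island R: the maximum required value inside R."""
    return max(cells(A, R))


def weight(A: Sequence[Sequence[float]], R: Rect) -> float:
    """Power penalty of island R: every cell is raised to mu(R)."""
    top = mu(A, R)
    return sum(top - x for x in cells(A, R))


def partition_weight(A: Sequence[Sequence[float]], parts: Sequence[Rect]) -> float:
    """Total power penalty of a partitioning (sum over its rectangles)."""
    return sum(weight(A, R) for R in parts)


# Toy array of squared required voltages (arbitrary units).
A = [[1, 1, 4],
     [1, 1, 4]]

one_island = [(0, 1, 0, 2)]                # the whole array
two_islands = [(0, 1, 0, 1), (0, 1, 2, 2)]  # low-voltage block + high-voltage column

print(partition_weight(A, one_island))   # four cells raised from 1 to 4
print(partition_weight(A, two_islands))  # no cell is raised
```

In this toy case, the VPP with threshold δ = 0 has optimal size 2, while any δ ≥ 12 admits the single-island solution.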
B. Algorithm With Guarantees
While no previous work has been done on the VPP, some variants of it are well studied. In particular, if we define the weight of a rectangle R as the sum of all elements in R and the weight of a partitioning Π as the maximum weight of all rectangles in Π, we have a variant of the VPP, previously referred to as the RTILE problem [9]. Intuitively, this max-sum weight function is much simpler than the one we consider in the VPP, and it has been shown that the RTILE problem is NP-hard [9]. Therefore, given this indication of the hardness of our problem,² we shift our focus to approximation algorithms, which provide a guarantee on the output size. Below, we first describe our two-approximation algorithm for the VPP, which finds a partitioning Π of a given array A so that ω(Π) ≤ δ and |Π| ≤ 2κ(A, δ). This approximation algorithm is based on a nicely structured class of partitionings, which we introduce next.
Slicing model: Given an input region (array) A, we can slice it with either a vertical or a horizontal cut. Each cut divides its parent region into two, and we then slice the two resulting child regions recursively. An example is shown in Fig. 3(a). A partitioning obtained this way is called a slicing partitioning of region A [10], [11]. In fact, it is the same as a special type of binary-space partitioning (BSP) induced by orthogonal cuts, called orthogonal BSP (see Appendix A for more details). It has been shown in [12] that, given a set of m disjoint axis-aligned rectangles that cover a rectangular region A, one can construct an orthogonal BSP so that in the induced slicing partitioning of A, each region lies completely inside one of the m original rectangles, and the size of this slicing partitioning is at most 2m. Let ρ(A, δ) denote the size of the optimal slicing partitioning of A with weight at most δ and κ(A, δ) denote that of the optimal arbitrary partitioning. We can infer the following lemma.
Lemma 1: κ(A, δ) ≤ ρ(A, δ) ≤ 2κ(A, δ).
The left inequality is obvious. For the right one, the optimal solution Π* of the VPP gives κ(A, δ) disjoint rectangles that cover region A. By [12], there exists a slicing partitioning Π of A with at most 2κ(A, δ) rectangles (regions). Since every rectangle in Π lies completely inside a rectangle in Π*, its maximum value is no larger than that of the enclosing rectangle, so the weight of the slicing partitioning Π cannot increase, i.e.,

$$\omega(\Pi) \le \omega(\Pi^*) \le \delta.$$

This proves the right inequality.
The above lemma states that an optimal slicing partitioning for A is a two-approximation for the VPP. Therefore, we now only need to focus on solving the VPP under this slicing model. One standard method to handle the slicing partitioning is dynamic programming [12]. We first describe DP-Alg, a dynamic-programming-based two-approximation algorithm for the VPP.
² The VPP under p × q grid partitioning is NP-complete, which can be shown by a straightforward extension of the NP-completeness proof of the RTILE problem under p × q grid partitioning. Unfortunately, its hardness under arbitrary partitioning does not seem to be trivial. We are able neither to prove its NP-completeness nor to disprove it, although we tend to believe that the former is true.
Dynamic programming approach: First, we show that the dual DVPP under the slicing model can be solved optimally by dynamic programming. It follows a similar framework as the one in [13] , which was developed for the RTILE problem and several other variants of it. 
Otherwise, we have the recursion as shown at the bottom of the page. The second minimization term enumerates all horizontal and vertical cuts, and the first term (1 ≤ t < s) enumerates all possible ways to assign the number of rectangles (t and s − t) for the two separated parts obtained by some horizontal or vertical cut.
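The structure of this recursion can be sketched as a memoized function. This is an illustrative sketch rather than the paper's DP-Alg: the table name `W`, the "at most s rectangles" convention, and the naive weight computation are assumptions, and the code favors clarity over the time and space bounds discussed in the text. The outer binary search over the number of rectangles mirrors how the VPP is obtained from its dual.

```python
from functools import lru_cache
from typing import Sequence


def min_slicing_weight(A: Sequence[Sequence[float]], k: int) -> float:
    """Minimum weight of a slicing partitioning of A with at most k rectangles."""
    m, n = len(A), len(A[0])

    def omega(r0: int, r1: int, c0: int, c1: int) -> float:
        # Naive weight of the inclusive rectangle [r0..r1] x [c0..c1].
        vals = [A[i][j] for i in range(r0, r1 + 1) for j in range(c0, c1 + 1)]
        top = max(vals)
        return sum(top - x for x in vals)

    @lru_cache(maxsize=None)
    def W(r0: int, r1: int, c0: int, c1: int, s: int) -> float:
        best = omega(r0, r1, c0, c1)       # base case: leave the region uncut
        if s == 1 or best == 0:
            return best
        for t in range(1, s):              # split the budget: t and s - t
            for r in range(r0, r1):        # horizontal cut below row r
                best = min(best, W(r0, r, c0, c1, t) + W(r + 1, r1, c0, c1, s - t))
            for c in range(c0, c1):        # vertical cut right of column c
                best = min(best, W(r0, r1, c0, c, t) + W(r0, r1, c + 1, c1, s - t))
        return best

    return W(0, m - 1, 0, n - 1, k)


def min_islands(A: Sequence[Sequence[float]], delta: float) -> int:
    """Smallest k whose optimal slicing weight is at most delta (delta >= 0)."""
    lo, hi = 1, len(A) * len(A[0])         # one cell per rectangle has weight 0
    while lo < hi:                         # valid: the optimum is non-increasing in k
        mid = (lo + hi) // 2
        if min_slicing_weight(A, mid) <= delta:
            hi = mid
        else:
            lo = mid + 1
    return lo


A = [[1, 4],
     [4, 1]]
print(min_slicing_weight(A, 1), min_slicing_weight(A, 3), min_islands(A, 3))
```

The number of table entries grows with the number of subrectangles times the rectangle budget, which hints at why memory becomes the limiting factor for large arrays.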
For the base cases, the weight ω(R) of any rectangle R can be computed in O(1) time after O(nm) time/space preprocessing of the input array A, by extending the prefix-sum algorithm from [9] to our weight function (see Appendix B for more details). It then follows from the above recursion that we can build the remaining dynamic programming 
, together with Lemmas 1 and 2, we conclude with the following result.
Theorem 1: Given an m × n array A and an error bound
III. FAST HEURISTIC ALGORITHM
In this section, we describe TS-Alg, an efficient two-step heuristic algorithm for the VPP.
A. Algorithm Overview
Ideally, we would like to have some guarantee on the size of the output partitioning, while keeping the complexity of the algorithm low. DP-Alg, as introduced in the last section, produces solutions with good guarantees. Unfortunately, it is too slow and memory inefficient to be practical. In fact, the large space requirement limits the size of the input array to merely around 100 × 100,⁴ while the size encountered in practice can easily go up to 50 000 × 50 000. We, therefore, want to first reduce the size of the input array before we feed it to DP-Alg. In particular, we can impose a second grid of lower resolution on the original array A and obtain a "compressed" array G, where each element G[u][v] corresponds to a rectangle (subarray) R_uv of A at grid (u, v). To guarantee meeting the timing requirement for the voltage islands, we let G[u][v] = µ(R_uv), the maximum value of all elements in R_uv. Note that G actually corresponds to a special type of partitioning of A: supposing G is of size p × q, we call the underlying partitioning Π_G a grid partitioning of size p × q for A [see Fig. 3(b)], and each rectangle R_uv in Π_G is referred to as a grid rectangle. To avoid excessive power penalty in such a partitioning, we would like to have a bound on ω(Π_G).
This motivates the following two-step algorithm for the VPP, referred to as TS-Alg.
Step 1) Size reduction: produce a p × q array G such that the weight ω(Π_G) of its grid partitioning is at most a threshold ε ≤ δ.
Step 2) Approximate voltage partitioning: apply DP-Alg on G to compute a partitioning Π with ω(Π) ≤ δ.
Note that both the quality and the quantity of the first step directly affect the performance of the second step. On the one hand, we hope that the quantity (i.e., the value of p and q) is small, so that DP-Alg is fast and practical. On the other hand, as DP-Alg will not further subdivide any grid rectangle in Π_G (i.e., a rectangle in the final output Π will be a combination of some grid rectangles in Π_G), the grid rectangles in Π_G should be "good." We will see in what follows that although TS-Alg does not guarantee that the output size approximates κ(A, δ) within some constant factor, there is a control on the quality in each of the two steps. The experimental results in the next section further demonstrate the performance of TS-Alg both in efficiency and in output quality.
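To illustrate the size-reduction step, the sketch below builds a compressed array from given grid lines. It assumes that each compressed element stores the maximum of its grid rectangle (so that timing is still met) and that the rectangle weight is the total shortfall from that maximum; the function name, the half-open boundary lists, and the toy array are hypothetical. How the grid lines are chosen is a separate question and is not addressed here.

```python
from typing import List, Sequence, Tuple


def compress(A: Sequence[Sequence[float]],
             row_bounds: Sequence[int],
             col_bounds: Sequence[int]) -> Tuple[List[List[float]], List[List[int]], float]:
    """Return (G, area, grid_weight) for the grid given by the boundary lists.

    Grid rectangle (u, v) covers rows row_bounds[u]..row_bounds[u+1]-1 and
    columns col_bounds[v]..col_bounds[v+1]-1 of A.
    """
    G: List[List[float]] = []
    area: List[List[int]] = []
    grid_weight = 0
    for u in range(len(row_bounds) - 1):
        G.append([])
        area.append([])
        for v in range(len(col_bounds) - 1):
            vals = [A[i][j]
                    for i in range(row_bounds[u], row_bounds[u + 1])
                    for j in range(col_bounds[v], col_bounds[v + 1])]
            top = max(vals)                          # voltage that still meets timing
            G[u].append(top)
            area[u].append(len(vals))
            grid_weight += sum(top - x for x in vals)  # penalty inside this rectangle
    return G, area, grid_weight


A = [[1, 1, 1, 4],
     [1, 1, 1, 4],
     [1, 1, 1, 4],
     [2, 2, 2, 4]]

# Same 2 x 2 size, different grid lines.
_, _, uniform_weight = compress(A, [0, 2, 4], [0, 2, 4])
G, area, variable_weight = compress(A, [0, 3, 4], [0, 3, 4])
print(uniform_weight, variable_weight, G)
```

In this toy case, a data-aware choice of grid lines removes the whole penalty that the even 2 × 2 grid incurs, which is the motivation for variable-sized grids.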
B. Size Reduction
One straightforward approach for Step 1) of TS-Alg is to simply subdivide A evenly into p × q grids (so that all pq grid rectangles in Π_G are congruent). This method is completely oblivious to the data distribution in A, thus leading to a suboptimal result. It is desirable to subdivide A more intelligently so as to preserve the data distribution in A, by using more flexible variable-sized grids. In particular, we ask the following question for Step 1).
⁴ We remark that there are some running-time/space improvements over the above DP algorithm that can be extended to our case as well. However, they increase the approximation factor greatly [13]. Our goal is to have an efficient algorithm that performs well in practice.
Definition 2 (Grid-Partitioning Problem, or GPP): Given an m × n array A and an error threshold ε, among all grid partitionings whose weight is at most ε, find one with the smallest size (i.e., p × q).
Variants of this problem have already been studied in computational geometry [14], [15], and we modify the algorithm from [15] to obtain the following result.⁵
Lemma 3: Let p* × q* be the size of the optimal grid partitioning with weight at most ε/2. One can compute in O(nm + (n + m + p̄q̄) · p̄ log(nm)) time and O(nm) space a grid partitioning of weight at most ε and of size p̄ × q̄, where p̄ × q̄ ≤ 17 p* × q*.
At a high level, the GPP can be reduced to a special type of set cover problem, which can be efficiently approximated using techniques from randomized algorithms (in particular, ε-nets [14] and an elegant analysis given by Clarkson in [16]; see Appendix C for more details). Roughly speaking, it uses the iterative doubling technique originally used for linear programming [16]. It assigns a load to every possible separator for grid partitioning (i.e., all vertical and horizontal grid lines). The load was referred to as weight in previous studies; we rename it here to avoid confusion with our weight function ω introduced earlier. Starting with unit loads, at each iteration, the algorithm chooses a subset of these separators according to their current loads, such that the resulting grid partitioning Π_G serves as an ε-net. If Π_G already satisfies the error requirement (i.e., ω(Π_G) ≤ ε), the algorithm returns Π_G. Otherwise, it chooses a grid rectangle at random with probability proportional to its weight and doubles the loads of all separators that intersect this grid rectangle. The intuition is that if a grid rectangle has high weight, then the grid lines cutting it are more likely to be chosen for the grid partitioning. Careful analysis shows that the expected number of iterations can be bounded. Further details of the algorithm are omitted here. Interested readers can refer to [15].
C. Putting Everything Together
The size-reduction step produces a "compressed" p × q array G, with each element G[u][v] = µ(R_uv) representing a grid rectangle R_uv of the original array A, for any 1 ≤ u ≤ p and 1 ≤ v ≤ q.
Let T be a subarray of G and µ̂(T) = max_{(u,v)∈T} G[u][v]. We define the compressed weight of T as

$$\hat{\omega}(T) = \sum_{(u,v) \in T} |R_{uv}| \, \bigl( \hat{\mu}(T) - G[u][v] \bigr),$$

where |R_uv| is the number of elements of A in R_uv.
Let Π = {T_1, ..., T_l} be a partitioning of G; it can also be considered as a partitioning over the original array A. Recall that Π_G is the grid partitioning of A induced by the compressed array G. It is easy to see that the compressed weight of Π, ω̂(Π) = Σ_{1≤h≤l} ω̂(T_h), is in fact the weight increase from ω(Π_G) to ω(Π), i.e., ω(Π) = ω(Π_G) + ω̂(Π). To satisfy the overall power-penalty bound δ, we now require that ω(Π_G) + ω̂(Π) ≤ δ, i.e., ω̂(Π) ≤ δ − ω(Π_G), where ω(Π_G) is output by Step 1).
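The additivity claim ω(Π) = ω(Π_G) + ω̂(Π) can be checked numerically on a toy instance. The sketch assumes that the compressed weight of T charges each grid rectangle its area times the gap between the maximum over T and the rectangle's own maximum; the helper names, half-open index ranges, and toy data are hypothetical.

```python
from typing import Sequence


def weight(A: Sequence[Sequence[float]], r0: int, r1: int, c0: int, c1: int) -> float:
    """Weight of rows r0..r1-1, columns c0..c1-1 of A (half-open ranges)."""
    vals = [A[i][j] for i in range(r0, r1) for j in range(c0, c1)]
    top = max(vals)
    return sum(top - x for x in vals)


def compressed_weight(G: Sequence[Sequence[float]], area: Sequence[Sequence[int]],
                      u0: int, u1: int, v0: int, v1: int) -> float:
    """Compressed weight of the subarray rows u0..u1-1, columns v0..v1-1 of G."""
    idx = [(u, v) for u in range(u0, u1) for v in range(v0, v1)]
    top = max(G[u][v] for u, v in idx)
    return sum(area[u][v] * (top - G[u][v]) for u, v in idx)


A = [[1, 1, 1, 4],
     [1, 1, 1, 4],
     [1, 1, 1, 4],
     [2, 2, 2, 4]]

# Even 2 x 2 grid: every grid rectangle is a 2 x 2 block of A.
bounds = [0, 2, 4]
G = [[max(A[i][j] for i in range(bounds[u], bounds[u + 1])
          for j in range(bounds[v], bounds[v + 1])) for v in range(2)] for u in range(2)]
area = [[4, 4], [4, 4]]
grid_weight = sum(weight(A, bounds[u], bounds[u + 1], bounds[v], bounds[v + 1])
                  for u in range(2) for v in range(2))

# Partitioning of G into its left and right columns, i.e., two islands in A.
extra = compressed_weight(G, area, 0, 2, 0, 1) + compressed_weight(G, area, 0, 2, 1, 2)
direct = weight(A, 0, 4, 0, 2) + weight(A, 0, 4, 2, 4)
print(grid_weight, extra, direct)
```

Because the identity holds, the second step only needs G and the rectangle areas, never the full array A.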
We can now carry over the algorithm from Section II-B, running DP-Alg on the compressed p × q array G. The only difference is in how the base cases are computed, as the weight function is slightly different here.⁶ However, as the weight ω(T) of any subarray T of G can also be computed in O(1) 
IV. EXPERIMENTAL RESULTS
A. Experiment Setup and Snapshots of Results
We perform our experiments with a set of industry designs on 64-bit Linux machines (CPU: 1.95 GHz, Memory: 11.7 GB). For each design, the experiment is carried out in the following steps.
1) We use Cadence's commercial tool SoC Encounter [17] to do timing-driven placement, timing optimization, and timing analysis. Then, we assign a voltage to each standard cell according to its worst slack.⁷ We use four different voltage levels, each corresponding to a different range of worst slacks. The smaller the slack, the higher the voltage.
2) We transform the standard-cell placement and the associated voltage requirements into an input array, as described in Section II-A.
3) We calculate the maximum power penalty (denoted as δ_max), which is the total power penalty when all cells are raised to the highest required voltage on the entire chip. We then choose some reasonable bounds on the total power penalty, each corresponding to a certain percentage of the maximum power penalty δ_max, and apply our TS-Alg to generate the minimum number of voltage islands within each power-penalty bound.
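The setup steps above can be sketched as follows. The slack ranges, the voltage levels, and the percentages are hypothetical placeholders, since the actual values used in the experiments are not listed; only the shape of the computation follows the text (voltage by worst slack, squared voltage per grid cell, and δ_max as the penalty of a single chip-wide island).

```python
from typing import List, Sequence

# Hypothetical (slack upper bound in ns, voltage in V) pairs, most critical first.
LEVELS = [(0.0, 1.2), (0.1, 1.1), (0.3, 1.0)]
LOWEST_VOLTAGE = 0.9  # hypothetical level for all remaining (non-critical) cells


def required_voltage(worst_slack: float) -> float:
    """The smaller the slack, the higher the assigned voltage."""
    for bound, voltage in LEVELS:
        if worst_slack <= bound:
            return voltage
    return LOWEST_VOLTAGE


def build_array(slacks: Sequence[Sequence[float]]) -> List[List[float]]:
    """Input array A: squared required voltage per placement-grid cell."""
    return [[required_voltage(s) ** 2 for s in row] for row in slacks]


def max_power_penalty(A: Sequence[Sequence[float]]) -> float:
    """delta_max: penalty when every cell is raised to the chip-wide maximum."""
    top = max(max(row) for row in A)
    return sum(top - x for row in A for x in row)


slacks = [[-0.05, 0.20],
          [0.50, 0.05]]
A = build_array(slacks)
delta_max = max_power_penalty(A)

# Power-penalty bounds as percentages of delta_max (percentages are assumptions).
bounds = [p * delta_max for p in (0.05, 0.10, 0.20)]
print(delta_max, bounds)
```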
Snapshots: First, we give some visual results of our TS-Alg to demonstrate its effectiveness. We will present quantitative results in the subsequent section. Fig. 4(a) shows the voltage islands generated from two industry designs. For each design (one column), the top picture shows the placement with timing-critical cells in dark colors (the darker a cell, the higher the voltage needed). The middle picture shows the p × q grid partitioning generated by the size-reduction step. Note that the voltage-distribution information is well preserved by the variable-sized grids. The last picture shows the generated voltage islands.
B. Comparison With Other Approaches
To demonstrate the efficiency of our TS-Alg, we compared it with two alternative approaches (one of them is commonly used in industry; the other is developed by us as a straightforward improvement over the industry approach).
Outline of alternative approaches: The first one is the logical-boundary-based approach mentioned earlier in Fig. 1(b) and (c), where each grouped module or cell in the logical hierarchical tree forms an individual voltage island [ Fig. 5(a) ]. Currently, this approach is commonly used in practice.
This approach is often very inefficient due to the high fan-out of modules in the logical hierarchy tree.
A natural way to improve the logical-boundary-based approach is to substitute the high-fan-out hierarchical tree with a nonlogical-boundary-based standard quad-tree. The leaves of the quad-tree are the grid cells in the input array, and each node in this tree corresponds to a possible region (a subarray). Given an upper bound δ for the power penalty, the goal is to find a set of appropriate nodes from the tree that form a partitioning of the input array. Such a partitioning is obtained by a greedy bottom-up merging approach. In particular, we mark a node white if it has not been merged, black otherwise. A node is called a candidate for merging if it is white while all its four children are black. Furthermore, given a node R in the quad-tree with its four children R_i, i = 1, ..., 4, define the cost of merging the R_i into R to be

$$\omega(R) - \sum_{i=1}^{4} \omega(R_i).$$

This corresponds to the power penalty resulting from combining the four subregions into R. Now, in order to compute a partitioning, we start with a tree where all nonleaf nodes are white. At any time, we choose the candidate with the smallest cost (by using a priority queue) and merge it. The process terminates when the next merge would make the total weight of the resulting partitioning exceed the given upper bound δ. We refer to this algorithm as QT-Alg. The overall time complexity is O(nm log(nm)) and the space complexity is O(nm). Fig. 6 illustrates QT-Alg.
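The greedy bottom-up merging can be sketched with a binary heap. The sketch assumes a square array whose side is a power of two, takes the merge cost to be the weight of the parent region minus the weights of its four children, and recomputes weights naively instead of in O(1) time; the node encoding `(level, i, j)` and all names are hypothetical.

```python
import heapq
from typing import List, Sequence, Tuple

Node = Tuple[int, int, int]  # (level, block row, block column); level 0 = one cell


def qt_alg(A: Sequence[Sequence[float]], delta: float) -> Tuple[List[Node], float]:
    """Greedy quad-tree merging; returns (islands, total power penalty)."""
    n = len(A)                      # assumed: A is n x n and n is a power of two
    top_level = n.bit_length() - 1

    def omega(level: int, i: int, j: int) -> float:
        side = 1 << level
        vals = [A[r][c] for r in range(i * side, (i + 1) * side)
                for c in range(j * side, (j + 1) * side)]
        top = max(vals)
        return sum(top - x for x in vals)

    def children(level: int, i: int, j: int) -> List[Node]:
        return [(level - 1, 2 * i + a, 2 * j + b) for a in (0, 1) for b in (0, 1)]

    def cost(node: Node) -> float:
        return omega(*node) - sum(omega(*ch) for ch in children(*node))

    black = {(0, i, j) for i in range(n) for j in range(n)}   # leaves count as merged
    heap = [(cost((1, i, j)), (1, i, j)) for i in range(n // 2) for j in range(n // 2)]
    heapq.heapify(heap)
    total = 0
    while heap:
        c, node = heapq.heappop(heap)          # cheapest candidate
        if total + c > delta:                  # next merge would break the bound
            break
        total += c
        black.add(node)
        level, i, j = node
        if level < top_level:
            parent = (level + 1, i // 2, j // 2)
            if all(ch in black for ch in children(*parent)):
                heapq.heappush(heap, (cost(parent), parent))  # parent is now a candidate
    # Islands are the merged nodes whose parent has not been merged.
    islands = [nd for nd in black
               if nd[0] == top_level or (nd[0] + 1, nd[1] // 2, nd[2] // 2) not in black]
    return islands, total


A = [[1, 1, 1, 4],
     [1, 1, 1, 4],
     [1, 1, 1, 4],
     [2, 2, 2, 4]]
islands, total = qt_alg(A, delta=2)
print(len(islands), total)
```

On this toy array, a zero penalty bound leaves 13 islands, because the fixed quad-tree boundaries do not align with the voltage boundaries, whereas four rectangles suffice when the boundaries may be placed freely.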
Comparison with different power-penalty bounds: In the left column of Fig. 7, we compare the output size of our TS-Alg with QT-Alg and the logical-tree-based algorithm on each industry design with different power-penalty bounds (the sizes of the designs are shown in Table I). Clearly, our TS-Alg outperforms the other two significantly and consistently on all designs in terms of the number of voltage islands obtained.⁸ Between the two alternatives, QT-Alg is generally significantly better (with one exception on design industryH, where the logical hierarchy happens to be more advantageous than the physical hierarchy for the voltage-island merging). In the right column, we also show the actual amount of power penalty caused by the voltage-island grouping of each algorithm. All the values are below the given power bound. The amount from TS-Alg is slightly closer to the bound, because it optimizes more aggressively toward the objective of the smallest number of voltage islands.
Comparison with selected power-penalty bound: As the output size of QT-Alg is much closer to that of our TS-Alg than the output size of the logical-tree-based algorithm is, we compare TS-Alg with QT-Alg more explicitly in this section by listing some data from Fig. 7 in Table I. Here, for each design, we only pick a particular power-penalty bound such that the number of voltage islands being generated is within a desired range.⁹ We let the range be around 20, which is roughly an upper bound in current practical designs.
Additionally, we also show the comparison of running time in the table. The time complexity of TS-Alg has two terms:
The first term comes from preprocessing the input array for later computation of the weight of any subarray. Since this preprocessing step is needed by both TS-Alg and QT-Alg, we omit it from the running time presented.¹⁰ Other than this preprocessing time, the running time of TS-Alg is only output-sensitive, depending on p, q, and k, while QT-Alg still has an O(nm log(nm)) running time.¹¹ This explains why TS-Alg has a larger running time on the first three small designs in Table I: p × q after Step 1) in these cases is larger than in the later cases. For all the practical data and all practical numbers of voltage islands that we tested, p and q are small regardless of the size of the input design (for example, p × q does not increase with the input size in Table I). This means that TS-Alg scales well with increasing size of the input design. Furthermore, as k decreases, the running-time advantage of TS-Alg over QT-Alg becomes even more significant, because the running time of TS-Alg decreases with k, while that of QT-Alg remains roughly the same (as it is not output-sensitive). This is demonstrated in Table II, where we choose k around ten, probably a more realistic number in current designs. Note again that TS-Alg also beats QT-Alg significantly and consistently in terms of the quality of result.
C. Comparison of DP-ALG and TS-ALG
The DP-Alg from Theorem 1 is impractical for large data sizes, due to its huge memory requirement. Therefore, we are unable to obtain its result on any of our tested designs. This is one of the main reasons why we developed TS-Alg. However, in order to study the gap between DP-Alg and TS-Alg, we make some special modifications to the input data so that the result from DP-Alg can be obtained. We then apply TS-Alg on the modified data too and compare the results with those from DP-Alg.
⁹ As stated earlier, our algorithm works for both of the dual optimization problems. Our discussion and implementation are focused on minimizing the number of voltage islands under a power-penalty bound; however, the algorithm can be adapted to directly solve the other problem with a running time that is faster by a log(k) factor.
¹⁰ Also, since the preprocessing time is O(nm), it will be smaller than the O(nm log(nm)) running time of QT-Alg; especially for large designs, this log factor can be quite large. Therefore, omitting it will not change our comparison, while helping to clarify things.
¹¹ We remark that the logical-tree-based algorithm also has an O(nm log(nm)) running time. The base of its logarithm is much larger than that of QT-Alg, due to the high fan-out of modules in the logical hierarchy tree. As a result, its running time is much smaller than that of QT-Alg. Quantitatively, the logical-tree-based algorithm runs in less than 1 s on all our tested designs.
In particular, we first test the memory of our machine and roughly find the upper limit of the input data size for DP-Alg to run:¹² m = n = 40, k = 15. We then partition the input array A by randomly generated 40 × 40 nonuniform grids (we will explain later why we use nonuniform grids) and obtain a corresponding 40 × 40 array S. Each element of S is the average or maximum¹³ of all the elements of A in the corresponding grid rectangle. Then, we run DP-Alg on S to obtain the optimal slicing partitioning. The power bound is selected such that the number of voltage islands in the resulting partitioning is roughly between 10 and 15.
In general, a natural next step would be to apply TS-Alg to the same array S on which we ran DP-Alg and compare the results. However, in this case, the size of S is too small for the comparison to be meaningful. As can be expected, the first (size-reduction) step will not reduce the array size much, and the result from TS-Alg will be quite close to that of DP-Alg.
Nonetheless, we still wish to obtain some indication of how well TS-Alg performs with respect to DP-Alg. We thus conduct the following experiment. Instead of using the smaller array S to run TS-Alg, we "project" S back to the original array to obtain a modified array M of the original size. In particular, the value of each element M[i][j] is the value of the element of S that corresponds to the grid rectangle containing M[i][j]. Because S and M have identical data distributions, it is easy to see that the optimal slicing partitioning of the former is also optimal for the latter. We now apply TS-Alg to this modified array M. Although M has a special data distribution that may bias the behavior of TS-Alg in its favor, we believe that, with the specific p × q gridding algorithm we currently use, the randomness and nonuniformity of the way we obtained S alleviate this problem. Hence, the results of TS-Alg on such a modified array still reflect its performance on arbitrary arrays.
12 The actual limit also depends on the design size (number of cells and nets), because the design itself takes up a certain amount of memory.
13 We decide the function on a case-by-case basis so as to maximize the similarity of data distribution between A and S.
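The aggregate-then-project construction of S and M can be illustrated with a minimal sketch. The function names, the representation of the nonuniform grid as lists of band start indices (`row_cuts`, `col_cuts`), and the tiny example array are assumptions made for illustration only; they are not taken from the authors' implementation.

```python
from bisect import bisect_right

def aggregate(A, row_cuts, col_cuts, reduce_fn=max):
    """Collapse A into a coarse array S with one entry per grid rectangle.

    row_cuts / col_cuts are sorted start indices of each grid band and
    begin with 0 (a hypothetical encoding of the nonuniform grid).
    reduce_fn is applied to all cells of a rectangle (e.g., max or mean).
    """
    n, m = len(A), len(A[0])
    row_ends = row_cuts[1:] + [n]
    col_ends = col_cuts[1:] + [m]
    S = []
    for r0, r1 in zip(row_cuts, row_ends):
        s_row = []
        for c0, c1 in zip(col_cuts, col_ends):
            cells = [A[i][j] for i in range(r0, r1) for j in range(c0, c1)]
            s_row.append(reduce_fn(cells))
        S.append(s_row)
    return S

def project(S, row_cuts, col_cuts, n, m):
    """Expand S back to an n x m array M: each cell takes the value of
    the grid rectangle that contains it."""
    M = []
    for i in range(n):
        gi = bisect_right(row_cuts, i) - 1  # index of the row band holding i
        M.append([S[gi][bisect_right(col_cuts, j) - 1] for j in range(m)])
    return M

# Toy 3 x 4 array; row bands {0}, {1, 2}; column bands {0, 1, 2}, {3}.
A = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 1, 2, 3]]
row_cuts, col_cuts = [0, 1], [0, 3]
S = aggregate(A, row_cuts, col_cuts)            # maximum per rectangle
M = project(S, row_cuts, col_cuts, len(A), len(A[0]))
```

Passing `reduce_fn=lambda cells: sum(cells) / len(cells)` would give the averaging variant instead of the maximum.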
The comparison of the results of TS-Alg with those of DP-Alg is shown in Table III. For each design, we control the power bound of the size-reduction step so as to generate different p × q sizes (three sets). The largest p × q size is determined by the memory limit. From the table, clearly, the larger the p × q size, the better the result; the gap between the best TS-Alg result and the DP-Alg result is quite tolerable. In fact, with machines that have larger memory, it should be possible to further increase the p × q size and reduce the gap. Thus, we conclude that the p × q gridding does not greatly deteriorate the optimal solution and that the TS-Alg is still able to find a close-to-optimal solution.
D. Study of the Effect of Level Shifters
A common issue in MSV designs is that level shifters need to be inserted at the boundaries between low Vdd and high Vdd, causing extra area/delay/power penalty. This is one of the main reasons that motivated us to group the cells into a small number of voltage islands (the other reason is to reduce the design cost of the power network): with such a grouping, level shifters only need to be inserted at the boundaries of the voltage islands, so the demand for them can be significantly reduced. To further study the effect of the level shifters in MSV designs, we count the average number of level shifters needed per timing path after the voltage-island partitioning and calculate their delay penalty accordingly.
More specifically, for each design, we select the 1000 worst timing paths (with different end points); we also generate a set of different voltage-island partitionings under different power-penalty bounds. Then, for each of the voltage-island partitionings, we count the number of level shifters needed on each selected timing path, which is the number of nets on the path that cross voltage-island boundaries (from low Vdd to high Vdd). For each selected timing path, we also compute the average delay of buffers/inverters on the path. Then, assuming that the delay of each level shifter is twice the average delay of buffers/inverters, we calculate the total delay of all level shifters on the path and its percentage of the path-required time. We then compute the average over all selected paths. The results are shown in Fig. 8. For each design, each point along the x-axis corresponds to a different voltage-island partitioning generated under the given power-penalty bound. For each voltage-island partitioning, the lower plot shows the number of voltage islands, the middle plot shows the average number of level shifters per path, and the upper plot shows the average percentage of the level-shifter delay in the path-required time.
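The per-path bookkeeping described above can be sketched as follows. The path representation (a list of the supply voltages seen by consecutive cells, plus a list of buffer/inverter delays and a required time in picoseconds) and all function names are assumptions for illustration; only the counting rule and the factor of two come from the text.

```python
def count_level_shifters(path_vdd):
    """Number of nets on the path whose driver is at a lower Vdd than
    its sink, i.e., low-to-high island-boundary crossings."""
    return sum(1 for drv, snk in zip(path_vdd, path_vdd[1:]) if drv < snk)

def shifter_delay_percent(path_vdd, buf_inv_delays, required_time, factor=2.0):
    """Level-shifter delay as a percentage of the path-required time,
    assuming each shifter costs `factor` times the average
    buffer/inverter delay on the same path."""
    avg_delay = sum(buf_inv_delays) / len(buf_inv_delays)
    total = count_level_shifters(path_vdd) * factor * avg_delay
    return 100.0 * total / required_time

def average_over_paths(paths):
    """paths: list of (path_vdd, buf_inv_delays, required_time) tuples.
    Returns (average shifter count, average delay percentage)."""
    counts = [count_level_shifters(p[0]) for p in paths]
    percents = [shifter_delay_percent(*p) for p in paths]
    return sum(counts) / len(paths), sum(percents) / len(paths)

# Hypothetical path: supply voltage of each cell in order, delays in ps.
example_path = ([1.0, 1.2, 1.2, 1.0, 1.2], [40, 60], 2000)
n_shifters = count_level_shifters(example_path[0])
percent = shifter_delay_percent(*example_path)
```

In this toy path, only the two low-to-high transitions are counted; the high-to-low transition does not require a level shifter under the counting rule stated above.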
From these figures, we can make the following observations. First, in general, the average number of level shifters (and the corresponding delay) increases with the number of voltage islands, as in industryA and industryC. Clearly, this is because a timing path is more likely to cross voltage-island boundaries when there are more islands. However, since we only count the 1000 worst paths, the average number of level shifters counted also depends on where the voltage-island boundaries lie with respect to these 1000 paths. This explains why the average number of level shifters does not track the number of voltage islands in some designs, for example, industryB and industryJ. On the other hand, this suggests a future improvement for the voltage-island partitioning to reduce the number of level shifters: adding the number of times the timing paths cross voltage-island boundaries as a second-order minimization objective. This can be easily integrated into the DP-Alg.
Second, for most designs, the average number of level shifters is quite small and their total delay takes only a small portion of the path-required time.¹⁴ This means that, as a future improvement, we can estimate such delay and reserve it in advance before doing voltage assignment and voltage-island partitioning. Besides, with the availability of more physical information about the level shifters to be used in each MSV design, their area and power penalty could also be estimated and reserved in advance for placement and for setting the power bound, respectively.
Third, some designs with a number of small voltage islands inside a small area, for example, industryE [see the upper right region of the bottom figure in Fig. 4(a) and (b)], tend to have a higher average number of level shifters. Intuitively, this is because the timing paths cross the voltage-island boundaries more frequently in that area. As a future improvement, we may put some additional constraints on the voltage-island partitioning to avoid such cases. The DP-Alg can accommodate such constraints fairly easily.
V. CONCLUSION
Reducing power consumption is essential in current chip designs. MSV has previously been used to reduce the overall power consumption without degrading the entire circuit performance, by applying high voltage to timing-critical cells and low voltage to noncritical cells. However, the resulting fragmented power network causes extra design cost, and grouping cells into a practical number of voltage islands is, therefore, desired. Logical boundaries are largely used today in such groupings, but they are almost always far from optimal for the voltage islands. In this paper, we initiated the study of the problem of balancing power consumption and power-network fragmentation. In particular, we investigated the problem of partitioning a placement region into a small number of voltage islands, while keeping the overall voltage supply low. We formulated this problem as a VPP and presented an efficient TS-Alg for solving it. We have implemented our TS-Alg and applied it to a wide selection of real industry designs, and it has been shown to be fast and scalable and to produce superior results compared with the logical-boundary-based approach commonly used today.
APPENDIX A ORTHOGONAL BSP
The definition of the BSP for a collection of disjoint geometric objects in the two-dimensional plane can be illustrated by Fig. 9(a) [19]. The BSP is obtained by recursively splitting the plane with a line: First, we split the entire plane with l_1; then, we split the half-plane above l_1 with l_2 and the half-plane below l_1 with l_3, and so on. The splitting lines not only partition the plane; they may also cut objects into fragments. The splitting continues until there is only one fragment left in the interior of each region. This process is naturally modeled as a binary tree (BSP tree), as shown in Fig. 9(b). Each leaf of this tree corresponds to a face (leaf region) of the final subdivision. Each internal node corresponds to a splitting line.
If the geometric objects are all rectangles whose boundaries are parallel to the x-axis or y-axis of the coordinate plane, we call them axis-aligned rectangles; if the splitting lines are also all parallel to the x-axis or y-axis, we call the corresponding BSP an orthogonal BSP. Fig. 9(c) shows an example. The corresponding BSP tree is shown in Fig. 9(d).
The size of a BSP is defined as the number of leaves in the BSP tree (which is also the number of leaf regions in the BSP). Reference [12] has proven the following upper bounds on the size of the orthogonal BSP: 3n − 1 in general and 2n − 1 if the rectangles cover the underlying space completely [an example is shown in Fig. 9(e)], where n is the number of rectangles. For the orthogonal BSP in Fig. 9(c), n = 4 and the size is five. For the orthogonal BSP in Fig. 9(e), n = 5 and the size is six.
APPENDIX B CONSTANT TIME WEIGHT COMPUTATION
For a given array A, we first describe how to do a O(nm) time/space preprocessing so that sum(R)
Define two m × n arrays S and T
Then, using the prefix-sum algorithm, both S and T can be computed in O(nm) time/space as follows: 
S[i][j] =
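As a generic illustration of how O(nm) preprocessing enables constant-time rectangle sums, the following is a minimal sketch of the standard two-dimensional prefix-sum technique. It is not a reconstruction of the arrays S and T defined in this appendix; the padded table `P`, the function names, and the inclusive index convention are assumptions made for illustration.

```python
def build_prefix_sums(A):
    """Return a padded (n+1) x (m+1) table P where P[i][j] is the sum of
    A[0..i-1][0..j-1]. Built in O(nm) time and space."""
    n, m = len(A), len(A[0])
    P = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(m):
            P[i + 1][j + 1] = A[i][j] + P[i][j + 1] + P[i + 1][j] - P[i][j]
    return P

def rect_sum(P, r0, c0, r1, c1):
    """Sum of A over rows r0..r1 and columns c0..c1 (inclusive), in O(1),
    by inclusion-exclusion on four prefix sums."""
    return P[r1 + 1][c1 + 1] - P[r0][c1 + 1] - P[r1 + 1][c0] + P[r0][c0]

A = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
P = build_prefix_sums(A)
lower_right = rect_sum(P, 1, 1, 2, 2)  # cells 5, 6, 8, 9
```

The extra row and column of zeros in `P` avoid boundary checks when the query rectangle touches the first row or column.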
APPENDIX C ε-NET AND ANALYSIS
A set system (X, R) is a set X along with a collection R of subsets of X. A hitting set of a set system (X, R) is a subset H ⊆ X such that H has a nonempty intersection with every set S in R. The hitting-set problem is to find a hitting set of the smallest size. It is equivalent to the set cover problem and they are both NP-complete.
A subset H ⊆ X is an ε-net of (X, R) if it has a nonempty intersection with every set S in R for which |S| ≥ ε · |X|. Note that an ε-net is a hitting set of the set system (X, R_ε), where R_ε = {S ∈ R : |S| ≥ ε · |X|}. For a given additive weight function¹⁵ w : X → ℝ⁺, a subset H ⊆ X is an ε-net of (X, R) with respect to weight w if it has a nonempty intersection with every set S in R for which w(S) ≥ ε · w(X). A net finder for (X, R) is an algorithm that, given ε and a weight function w, returns an ε-net of (X, R) with respect to weight w.
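The weighted ε-net definition can be checked directly on a small set system. The following is a minimal sketch; the function names and the toy set system are hypothetical, and the weight function is represented as a dictionary from elements to positive numbers.

```python
def weight(subset, w):
    """Additive weight: the sum of the element weights in the subset."""
    return sum(w[x] for x in subset)

def is_eps_net(H, X, R, w, eps):
    """True if H intersects every S in R with w(S) >= eps * w(X)."""
    H = set(H)
    threshold = eps * weight(X, w)
    return all(H & S for S in R if weight(S, w) >= threshold)

X = {1, 2, 3, 4, 5, 6}
R = [{1, 2, 3}, {3, 4}, {5}]

unit = {x: 1 for x in X}
# With unit weights and eps = 0.5, only {1, 2, 3} is "heavy" (3 >= 3).
unit_ok = is_eps_net({3}, X, R, unit, 0.5)
unit_bad = is_eps_net({4}, X, R, unit, 0.5)

# Raising the weight of element 5 makes {5} heavy as well.
skewed = dict(unit)
skewed[5] = 4
skewed_bad = is_eps_net({3}, X, R, skewed, 0.25)
skewed_ok = is_eps_net({3, 5}, X, R, skewed, 0.25)
```

With uniform weights, the weighted definition reduces to the cardinality-based one, since w(S) = |S| and w(X) = |X|.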
For a set system (X, R) with finite VC-dimension (see [14] for the definition), [14] gives an algorithm to find a hitting set of almost optimal size in polynomial time. Suppose the size of the optimal hitting set of (X, R) is c*. The algorithm performs a doubling search for the value of c*, starting from c = 1. For each guess c of c*, it tries to find a hitting set of size s(2c) by using a strategy based on the notion of "survival of the fittest" from evolutionary biology.¹⁶ Intuitively, one wants to simulate the growth of a population where some elements are advantaged because they hit more sets than others. This is done by putting weights on the elements of X, as follows. Initially, all weights are uniform (unit weights). Then, the algorithm iterates as follows: 1) invoke the net finder¹⁷ to select an ε-net H with respect to the current weights (where ε = 1/(2c)); 2) verify whether H is a hitting set; 3) if H is not a hitting set, then find a set S in R that is not hit by H, and double the weights of the elements in S.
Following the arguments of Clarkson [16], [14] proves that if there is a hitting set H of size c, then the number of iterations of the above algorithm will not exceed a certain bound, 4c log(n/c), where n = |X|. Briefly, the proof is based on the fact that w(H) would grow faster than w(X) through the iterations, whereas H ⊆ X implies w(H) ≤ w(X). Hence, the algorithm has to terminate before w(H) exceeds w(X).
Hence, if the number of iterations exceeds this bound, then we know that there is no hitting set of size c; in other words, the current guess c is too small, and we double its value. The doubling search continues until a hitting set is found for a certain value of c. Clearly, c ≤ 2c*, so the hitting set returned by the algorithm is of size at most s(4c*).
15 The term additive means that w(Y) = Σ_{y∈Y} w(y).
16 s is a nondecreasing function.
17 A polynomial-time net finder can be implemented by the algorithm of [20].
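The doubling search with iterative reweighting can be sketched as follows. This is an illustration under stated assumptions, not the algorithm of [14] as implemented by the authors: the net finder below is a simple greedy stand-in (not the algorithm of [20]), so the size guarantee s(2c) does not carry over; the logarithm in the iteration bound is taken to be base 2; all names are hypothetical; and every set in R is assumed to be a nonempty subset of a nonempty X.

```python
import math

def greedy_net_finder(X, R, w, eps):
    """Stand-in net finder (an assumption, not the method of [20]):
    greedily pick elements until every heavy set, i.e., every S with
    w(S) >= eps * w(X), is hit. The result is a valid eps-net, but no
    bound on its size is claimed."""
    total = sum(w[x] for x in X)
    heavy = [S for S in R if sum(w[x] for x in S) >= eps * total]
    H = set()
    while heavy:
        # Element hitting the most remaining heavy sets (ties: smallest).
        best = max(sorted(X), key=lambda x: sum(1 for S in heavy if x in S))
        H.add(best)
        heavy = [S for S in heavy if best not in S]
    return H

def hitting_set(X, R, net_finder=greedy_net_finder):
    """Doubling search on the guess c with weight doubling on missed sets."""
    n = len(X)
    c = 1
    while True:
        w = {x: 1 for x in X}  # restart from unit weights for each guess
        # Iteration bound 4c log(n/c); base 2 assumed, at least one round.
        max_iters = max(1, math.ceil(4 * c * math.log2(n / c)))
        for _ in range(max_iters):
            H = net_finder(X, R, w, 1.0 / (2 * c))
            missed = next((S for S in R if not (H & S)), None)
            if missed is None:
                return H  # H hits every set in R
            for x in missed:
                w[x] *= 2  # favor elements of the set that was missed
        c *= 2  # bound exceeded: the guess c was too small

X = set(range(1, 7))
R = [{1, 2}, {2, 3}, {4, 5}, {5, 6}]
H = hitting_set(X, R)
```

The search is guaranteed to stop in this sketch because, once c ≥ n/2, every nonempty set is heavy under unit weights, so the very first net already hits all of R.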
