The performance growth of conventional VLSI circuits is seriously hampered by variation effects and the fundamental limit of chip power density. Adaptive circuit design is recognized as a power-efficient approach to tackling the variation challenge. However, it tends to entail large area overhead if not carefully designed. This work studies how to reduce the overhead by forming adaptivity blocks that account for both timing and spatial proximity among logic cells. The proximity optimization consists of timing and location aware cell clustering followed by incremental placement that enforces the clusters. Experiments are performed on the ICCAD 2014 benchmark circuits, which include a case with nearly one million cells. Compared to alternative methods, our approach reduces area overhead with an average wirelength overhead of 0.6%, while retaining about the same timing yield and power.
INTRODUCTION
Variability, such as process variations and device aging, and power are notorious barriers to the progress of VLSI technology. Their compound effect is even more difficult to deal with. Variations demand extra power for timing margins and therefore exacerbate power dissipation. On the other hand, an increasingly tight power budget seriously hinders design techniques for variation tolerance. Adaptive circuit design is an approach to getting out of this difficult situation.
An adaptive circuit contains sensors that detect timing variations. Broadly speaking, there are two kinds of sensors: critical path replica [1, 2] and canary flip-flop [3, 4]. Sensor outputs control certain tuning knobs, such as body bias [1] and supply voltage change [5, 6], such that timing variations are compensated. Unlike conventional methods, which allocate extra power and timing margins according to the worst-case variations, an adaptive circuit spends additional power only when timing variation is actually observed. Adaptive design is conceptually more power-efficient than conventional design; however, it entails area overhead for sensors, tuning circuits and control wires. If not carefully designed, the overhead can be quite significant. For example, a naïve implementation of the voltage interpolation technique [5] can double the chip power grid. In [6], over 20% area overhead is observed for adaptive designs. Indeed, the overhead issue is a key reason that prevents wide application of adaptive design techniques.
The efficiency of an adaptive design highly depends on how its adaptivity blocks are formed. The circuit cells within one block share the same sensors, control and tuning knobs. Ideally, one prefers to put cells that need similar tuning actions into the same block. As such, a small number of sensors and knobs can cover a large circuit and the amortized overhead is relatively low. At the same time, cells in a block need to be spatially close to each other or form a contiguous region. Separating cells of the same block far apart would at least cause unnecessarily large control wire overhead. Overall, adaptivity blocks should be formed according to both timing proximity and spatial proximity among cells.
The adaptivity block generation problem is studied in [7] for adaptive body bias. That work estimates timing proximity among cells by Monte Carlo simulation of body bias assignment. The assignment assumes that the body voltage of each cell can be individually controlled and is implemented by quadratic programming. After the simulation, the probability distribution of body bias tuning for each cell is obtained. Then, cells with similar distributions and high correlations are clustered to form a block. It is observed that the spatial correlation among tuning actions of different cells is similar to the physical proximity among them. To ensure that cells in the same block are located in a contiguous region, an incremental placement change is performed using the Capo placer [8]. This is a pioneering work that demonstrates the importance of adaptivity clustering. However, it has a few drawbacks. First, Monte Carlo simulation of quadratic programming is immensely time consuming and very difficult, if not impossible, to scale to large cases. For example, a 32K-cell case, which is fairly small from the point of view of modern IC design, costs nearly four and a half hours of runtime in [7]. Second, it is not described how Capo [8] enforces spatial continuity among cells in a cluster. Third, although timing and spatial proximity are correlated, the difference between them cannot be neglected. In fact, the timing-only clustering of [7] results in 3% wirelength overhead, which is not trivial.
In this work, we propose a balanced approach: cell clustering with consideration of both timing and spatial proximity. To make the clustering more scalable, the consideration of timing is based on timing analysis results instead of Monte Carlo simulation of quadratic programming. The clustering is followed by an incremental placement that enforces the spatial continuity of cells in a cluster. The placement is formulated and solved as a min-cost network flow model that minimizes total cell movement. Experiments are performed on the ICCAD 2014 incremental timing-driven placement contest benchmark suites, which include a circuit with nearly one million cells. Compared to timing-only and location-only clustering, our approach achieves area overhead reduction. Its average wirelength overhead is 0.6%, which is 95% less than that from the timing-only clustering. At the same time, it retains about the same timing yield and power consumption.
The rest of this paper is organized as follows. An overview of our method and design flow background will be given in Section 2. The cell clustering will be described in Section 3. Section 4 will be focused on the incremental placement. Experimental results will be shown in Section 5. Finally, conclusions will be provided in Section 6.
OVERVIEW
The input to our method is a circuit for which cell placement (including detailed placement) and timing analysis have been performed. It is represented by a graph G = (V, E), where the node set V indicates cells and the edges E imply fanin/fanout relations among cells. Our method is a two-phase approach: phase I is cell clustering and phase II is incremental placement. The clustering partitions cells into blocks, where cells in a block have similar timing characteristics and are close to each other. The incremental placement further forces cells of each cluster to form a contiguous region. The clustering method uses a few weighting coefficients to balance timing and spatial proximity. If a large wirelength overhead is observed after incremental placement, the weight for spatial proximity is increased and the two phases are performed again. The result of our method is fed to the adaptive circuit optimization [9], which decides whether to assign adaptivity to each block and simultaneously performs gate sizing. An overview of the entire design flow is sketched in Figure 1, where our main contributions are highlighted with the orange dashed rectangle.
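The iteration between the two phases can be sketched as a short loop. This is only an illustrative sketch: the callables `cluster_fn` and `place_fn`, the multiplicative update `gamma_step`, the round limit `max_rounds`, and the overhead threshold `max_overhead` are hypothetical names and choices that do not appear in the text, which only states that the spatial weight is increased and both phases are repeated.

```python
def proximity_optimization(cluster_fn, place_fn, gamma, max_overhead,
                           gamma_step=2.0, max_rounds=5):
    """Repeat clustering (phase I) and incremental placement (phase II),
    raising the spatial weight gamma while the wirelength overhead is too large.

    cluster_fn(gamma) -> clusters           (hypothetical phase I routine)
    place_fn(clusters) -> wirelength overhead as a fraction (hypothetical phase II)
    """
    for _ in range(max_rounds):
        clusters = cluster_fn(gamma)      # phase I: timing and location aware clustering
        overhead = place_fn(clusters)     # phase II: incremental placement
        if overhead <= max_overhead:      # acceptable wirelength overhead, stop
            break
        gamma *= gamma_step               # assumed update rule: scale the spatial weight
    return clusters, gamma, overhead


# Toy stand-ins: the "clusters" are just gamma, and the overhead shrinks as gamma grows.
result = proximity_optimization(lambda g: g, lambda c: 1.0 / c,
                                gamma=1.0, max_overhead=0.3)
```

With these toy stand-ins the loop runs three rounds (overheads 1.0, 0.5, 0.25) and stops once the overhead drops below the threshold.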
The proposed methodology allows flexibility of changes. For example, one can start with merely global placement at the beginning and perform detailed placement with legalization after the clustering. Also, the clustering and incremental placement can be repeated after gate sizing to form another layer of iterations.
TIMING AND LOCATION AWARE CELL CLUSTERING
We start with an example in Figure 2 to illustrate that considering only timing in clustering, as in [7], is insufficient. The argument in [7] is that timing correlations highly depend on spatial correlations. As such, spatial proximity is largely addressed by considering only timing proximity. This statement is often true; however, it is not difficult to find counterexamples that arise easily in practice. In Figure 2, there are two timing critical paths: one from gate A1 to A2 and the other from B1 to B2. Along path A (B), forward body bias (FBB) of either gate A1 (B1) or A2 (B2) can fix the timing error. If FBB of NAND gates is slightly more efficient than FBB of NOR gates, the quadratic programming in [7] may mostly choose FBB for A1 and B2 at the same time. Then, the clustering of [7] would put A1 and B2 into the same cluster although they are spatially far apart. The subsequent incremental placement must move those cells a long distance to bring them together. Consequently, wirelength is significantly increased. Moreover, the large cell moves may invalidate the original timing analysis result. It is easy to see that a better solution is to cluster A1 (A2) with B1 (B2). In our clustering, both timing and spatial proximity are considered. Spatial proximity between two cells is estimated straightforwardly by the Manhattan distance between them.
Timing proximity is considerably more complex. Ideally, it should indicate the probability that two cells take the same tuning actions. Monte Carlo simulation of quadratic programming as in [7] serves this purpose well, but is too expensive to use. Therefore, we resort to a simple surrogate metric that includes two factors. The first factor is the timing slack at cells. If two cells have similar slack, they are more likely to take the same tuning actions; otherwise, they tend to take different tuning actions. However, when slacks are extremely large or extremely negative, the corresponding cells may still take the same actions. For example, two cells with hugely negative slacks would both take the maximum FBB level regardless of their slack difference. Hence, a cell $g_i$ is characterized by a capped slack defined as
$$\hat{s}_i = \max(\min(\mathrm{slack}_i, \theta_{max}), \theta_{min}) \qquad (1)$$
where $\mathrm{slack}_i$ is the original slack at cell $g_i$, and $\theta_{max}$ and $\theta_{min}$ are constant thresholds. To further account for timing variability, $\mathrm{slack}_i$ here is based on the nominal delay plus a scaled σ (standard deviation) of the delay. It is conceivable that a cell with large σ is more likely to be tuned when it has the same nominal delay as others. The second factor in the surrogate timing proximity is sensitivity, which is defined as the ratio of the slack increase obtained by tuning a cell to the tuning cost, i.e.,
$$\eta_i = \frac{\Delta \mathrm{slack}_i}{\mathrm{cost}_i} \qquad (2)$$
where $\Delta \mathrm{slack}_i$ is the slack increase from tuning cell $g_i$ and the tuning cost $\mathrm{cost}_i$ can be the power increase or the area overhead of adaptivity. When two cells have similar timing slack, the sensitivity may determine whether a tuning action is taken. Indeed, it makes sense for tuning policies to favor changes on cells with relatively large sensitivities. Overall, the distance between $g_i$ at $(x_i, y_i)$ and $g_j$ at $(x_j, y_j)$ in the clustering is defined as
$$D(g_i, g_j) = \alpha\,|\hat{s}_i - \hat{s}_j| + \beta\,|\eta_i - \eta_j| + \gamma\,(|x_i - x_j| + |y_i - y_j|) \qquad (3)$$
where α, β and γ are constant parameters. Usually, the value of β is much smaller than α. These parameters are not necessarily static. If a large wirelength overhead is observed after the subsequent incremental placement, the value of γ is increased and the clustering is performed again.
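The clustering distance can be sketched in a few lines. This is a minimal sketch under an assumption: the distance is taken to be a weighted sum of the absolute difference in capped slack (weight α), the absolute difference in sensitivity (weight β), and the Manhattan distance between cell locations (weight γ). The `Cell` record, the function names, and the numeric values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    x: float            # placement coordinates
    y: float
    slack: float        # slack from timing analysis (nominal delay plus scaled sigma)
    sensitivity: float  # slack increase per unit tuning cost


def capped_slack(slack, theta_min, theta_max):
    """Clamp the slack into [theta_min, theta_max]."""
    return max(min(slack, theta_max), theta_min)


def sensitivity(slack_increase, tuning_cost):
    """Ratio of slack increase to tuning cost (power increase or area overhead)."""
    return slack_increase / tuning_cost


def distance(a, b, alpha, beta, gamma, theta_min, theta_max):
    """Assumed form of the clustering distance: weighted timing terms plus
    a weighted Manhattan distance between the two cell locations."""
    d_slack = abs(capped_slack(a.slack, theta_min, theta_max)
                  - capped_slack(b.slack, theta_min, theta_max))
    d_sens = abs(a.sensitivity - b.sensitivity)
    d_space = abs(a.x - b.x) + abs(a.y - b.y)
    return alpha * d_slack + beta * d_sens + gamma * d_space


# Both slacks are far below theta_min, so the capped slacks coincide and
# only the sensitivity and spatial terms contribute.
g1 = Cell(x=0.0, y=0.0, slack=-500.0, sensitivity=2.0)
g2 = Cell(x=3.0, y=4.0, slack=-120.0, sensitivity=1.5)
d = distance(g1, g2, alpha=1.0, beta=0.5, gamma=0.25,
             theta_min=-100.0, theta_max=100.0)
```

Setting `gamma=0` in this sketch reduces the distance to its timing terms only.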
Note that we handle timing proximity in a much simpler way than [7] . One may argue that this simple approach has a drawback. That is, only one or a few cells along the same critical path need to be tuned while our approach sees the same slack/sensitivity among all cells along the path and may cluster them all together. Actually, this problem is addressed by simultaneously considering spatial proximity. It is rare that all cells along one path are crowded together. It is quite likely these cells are separated into different clusters due to spatial distance.
Based on the distance defined by Equation (3), we adopt Lloyd's K-means algorithm for the clustering. To make the description complete, we summarize the main steps of this algorithm. It starts with K arbitrary means or centers. Then, each cell is assigned to the cluster with the nearest center. After all cells are assigned, the centers are updated with the centroids of the newly formed clusters. This assignment and center update procedure is repeated until the within-cluster sum of distance (WCSD) converges to a local minimum. WCSD is defined by
$$\mathrm{WCSD} = \sum_{i=1}^{K} \sum_{\mathbf{x} \in C_i} \lVert \mathbf{x} - \mu_i \rVert \qquad (4)$$
where $C_i$ is a cluster, $\mathbf{x}$ is the vector characterizing a cell and $\mu_i$ is the mean or center of cluster $C_i$. Unlike the original Lloyd's algorithm, which is based on Euclidean distance, we use Manhattan distance to match the layout convention in VLSI circuits. The value of K is decided empirically [9]. Moreover, we allow K to be changed according to the clustering results. If two clusters are very near to each other, they are merged and K is therefore decreased.
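The assignment and update steps summarized above can be sketched as follows. This is an illustrative sketch, not the implementation used in the experiments: it assumes each cell is already described by a feature tuple whose components have been scaled by the clustering weights, so that the plain Manhattan distance between tuples plays the role of the clustering distance. The function names, the random initialization with a fixed seed, and the iteration limit are assumptions, and the merging of nearby clusters is omitted.

```python
import random


def manhattan(u, v):
    """Manhattan (L1) distance between two equal-length tuples."""
    return sum(abs(a - b) for a, b in zip(u, v))


def kmeans_manhattan(points, k, max_iter=100, seed=0):
    """Lloyd-style iteration with Manhattan distance and centroid updates."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # K arbitrary initial centers
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to the cluster with the nearest center.
        new_labels = [min(range(k), key=lambda c: manhattan(p, centers[c]))
                      for p in points]
        if new_labels == labels:             # assignments stable: converged
            break
        labels = new_labels
        # Update step: each center becomes the centroid of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:                      # keep the old center if a cluster is empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    wcsd = sum(manhattan(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, centers, wcsd


# Two well separated groups of three points each.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers, wcsd = kmeans_manhattan(pts, k=2)
```

The sketch follows the centroid update described in the text; a variant that updates centers with component-wise medians would also be possible under Manhattan distance, but that is not what the text describes.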
CLUSTER DRIVEN INCREMENTAL PLACEMENT
After the clustering, a small number of cells are often located away from the majority of the cells of their own clusters. For example, in Figure 3, where clusters are indicated by colors, two blue cells and one orange cell are away from their clusters. We call them alien cells. Due to alien cells, control wires for a cluster must span a relatively large region. Moreover, tuning overhead, such as extra power lines in voltage interpolation [5], is also increased by the spreading out of clusters. The purpose of incremental placement is to move alien cells back to their clusters such that each cluster forms a compact and contiguous region. An alien cell $g_i^k$ belonging to cluster $C^k$ can be moved to an empty space among the majority cells of $C^k$. Alternatively, it can be moved to the position of another alien cell $g_j^l$ that belongs to another cluster $C^l$ but sits within cluster $C^k$, as $g_j^l$ will be moved out of cluster $C^k$ sooner or later. For example, the blue cell in the top row of Figure 3 can be moved to the location of the orange cell in the middle row. Of course, these moves are allowed only if the size of the empty space or of $g_j^l$ is no less than that of $g_i^k$.
In order to retain the original design as much as possible, the total cell movement needs to be minimized at the same time.
In essence, the incremental placement is a min-cost assignment problem: assigning alien cells to empty or potentially empty space. In general, an assignment problem can be solved through a min-cost network flow model. However, there is a pitfall. If one attempts to move all alien cells simultaneously in a network flow model, it is difficult to ensure that an alien cell is moved to its own cluster and not to other clusters. In fact, this is a multi-commodity network flow problem, which is NP-complete. On the other hand, this issue is not difficult to circumvent, simply by processing only one cluster at a time. Specifically, alien cells belonging to one cluster are collected back using a min-cost network flow model.
Since the clusters are processed one at a time, we need to determine the order in which to process them. The order is based on cluster porosity, i.e., the percentage of space in a cluster that can be used by its alien cells. A cluster with low porosity is processed first. White space between two clusters can be claimed by either cluster. Processing low-porosity (high-density) clusters first gives them priority in taking white space between clusters. Evidently, if the overall placement density is not high, this order does not matter.
Now we describe the network flow model for moving alien cells belonging to cluster $C^k$ back to $C^k$. The network is a directed graph $G' = (V', E')$. The node set $V'$ is composed of the following types of nodes:
• Source node: Each source node corresponds to an alien cell $g_i^k$ that needs to be moved to cluster $C^k$.
• Sink node: Each sink node indicates (1) a contiguous empty space inside or adjacent to cluster $C^k$, or (2) an alien cell $g_j^l$ that sits inside $C^k$.
• Super source S: This is a virtual node and there is an edge from S to every source node.
• Super sink T: This is a virtual node and there is an edge from every sink node to T.
There are three types of edges in $E'$.
• From S to source nodes: Each such edge has capacity equal to the size of the corresponding alien cell, and a cost of 0.
• From source to sink node: There is an edge between every pair of source and sink nodes. Its capacity is infinity and its cost is equal to the distance of moving the corresponding alien cell to $C^k$.
• From sink node to T: Each such edge has capacity equal to the size of the corresponding empty space or of the alien cell that does not belong to $C^k$. The edge cost is 0.
Figure 4 shows the network flow model for moving the two blue alien cells in Figure 3. The flow constraint is equal to the total size of the alien cells to be moved. In practice, placement density is rarely near 100% and the percentage of alien cells is small. Hence, there is usually plenty of white space to accommodate the moves. After the model is formulated, the Edmonds-Karp algorithm [10] is performed to obtain the min-cost flow solution. The algorithm is guaranteed to find the optimal solution in polynomial time. In the solution, the flow on each edge from a source node to a sink node tells how to move an alien cell. The incremental placement algorithm flow is outlined in the pseudo code below.
In implementation, we need to identify alien cells, the clusters they belong to and the empty space within (or adjacent to) clusters. Since circuit designs mostly use standard cells and cells are placed in rows, the identification is done by scanning individual rows. By checking whether a consecutive set of cells belongs to the same cluster, one can detect potential alien cells. If a cell does not have a left or right neighbor from the same cluster, the rows right above and below it are also examined to see whether it has a neighbor above or below that belongs to the same cluster.
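The network model for a single cluster can be sketched as below. This is a hedged sketch rather than the implementation evaluated in the paper: it solves the min-cost flow with a generic successive-shortest-path routine instead of the algorithm of [10], uses Manhattan distance as the move cost, and replaces the infinite capacity with the total alien size. The class `MinCostFlow`, the function `collect_alien_cells`, and the tuple layout `(name, size, (x, y))` are hypothetical.

```python
from collections import deque


class MinCostFlow:
    """Successive shortest paths on a residual graph (queue-based Bellman-Ford)."""

    def __init__(self, n):
        self.n = n
        self.graph = [[] for _ in range(n)]  # edge: [to, capacity, cost, reverse index]

    def add_edge(self, u, v, cap, cost):
        self.graph[u].append([v, cap, cost, len(self.graph[v])])
        self.graph[v].append([u, 0, -cost, len(self.graph[u]) - 1])

    def solve(self, s, t):
        flow = cost = 0
        while True:
            dist = [float("inf")] * self.n
            prev = [None] * self.n           # (predecessor node, edge index)
            queued = [False] * self.n
            dist[s] = 0
            queue = deque([s])
            while queue:
                u = queue.popleft()
                queued[u] = False
                for i, (v, cap, c, _) in enumerate(self.graph[u]):
                    if cap > 0 and dist[u] + c < dist[v]:
                        dist[v], prev[v] = dist[u] + c, (u, i)
                        if not queued[v]:
                            queued[v] = True
                            queue.append(v)
            if prev[t] is None:              # no augmenting path left
                return flow, cost
            push, v = float("inf"), t
            while v != s:                    # bottleneck capacity along the path
                u, i = prev[v]
                push = min(push, self.graph[u][i][1])
                v = u
            v = t
            while v != s:                    # augment along the path
                u, i = prev[v]
                self.graph[u][i][1] -= push
                self.graph[v][self.graph[u][i][3]][1] += push
                v = u
            flow += push
            cost += push * dist[t]


def collect_alien_cells(aliens, slots):
    """aliens and slots are lists of (name, size, (x, y)); slots are empty spaces
    or foreign alien cells inside the cluster. Returns (flow, cost, moves)."""
    na, ns = len(aliens), len(slots)
    S, T = na + ns, na + ns + 1              # super source and super sink
    net = MinCostFlow(na + ns + 2)
    big = sum(size for _, size, _ in aliens) # stands in for infinite capacity
    for i, (_, size, _) in enumerate(aliens):
        net.add_edge(S, i, size, 0)
    for j, (_, size, _) in enumerate(slots):
        net.add_edge(na + j, T, size, 0)
    for i, (_, _, (ax, ay)) in enumerate(aliens):
        for j, (_, _, (sx, sy)) in enumerate(slots):
            net.add_edge(i, na + j, big, abs(ax - sx) + abs(ay - sy))
    flow, cost = net.solve(S, T)
    moves = {}
    for i, (name, _, _) in enumerate(aliens):
        for v, _, _, rev in net.graph[i]:
            sent = net.graph[v][rev][1]      # flow on edge i -> v
            if na <= v < na + ns and sent > 0:
                moves.setdefault(name, []).append((slots[v - na][0], sent))
    return flow, cost, moves


aliens = [("b1", 2, (0, 0)), ("b2", 1, (10, 0))]
slots = [("e1", 2, (1, 0)), ("e2", 2, (9, 0))]
flow, cost, moves = collect_alien_cells(aliens, slots)
# flow == 3, cost == 3, moves == {"b1": [("e1", 2)], "b2": [("e2", 1)]}
```

In this sketch the cost is charged per unit of flow, so a cell of size 2 moved by distance 1 contributes 2 to the total. A flow solution may in principle split one cell's size across several sinks; the sketch simply reports the flow per edge and leaves any such case to a later legalization step.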
EXPERIMENTS

Experiment Design and Setup
The entire flow of Figure 1 is evaluated in the experiments. The initial placement (including detailed placement) is done using the Capo placer [8]. All the other steps in Figure 1 are implemented in C++. The timing analysis after the initial placement follows the method of PCA-based statistical static timing analysis [11]. The wirelength is evaluated as the half-perimeter of net bounding boxes. The last step, gate sizing and adaptivity assignment, uses the method of [9]. The timing yield of adaptive designs is estimated by the technique of [12]. All experiments run on a 2.2 GHz AMD Opteron processor with 4 GB of memory under Linux.
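The half-perimeter wirelength measure mentioned above can be illustrated with a short sketch. The function names and the representation of a net as a list of pin coordinates are assumptions made for illustration; the pin positions are made-up values.

```python
def hpwl(nets):
    """Total half-perimeter wirelength.

    Each net is a list of (x, y) pin positions; its contribution is the width
    plus the height of the bounding box that encloses all of its pins.
    """
    total = 0.0
    for pins in nets:
        if len(pins) < 2:          # a single-pin net has no wirelength
            continue
        xs = [x for x, _ in pins]
        ys = [y for _, y in pins]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


def overhead_percent(before, after):
    """Relative wirelength increase in percent."""
    return 100.0 * (after - before) / before


nets_before = [[(0, 0), (3, 4)], [(1, 1), (2, 5), (4, 2)]]
nets_after = [[(0, 0), (3, 4)], [(1, 1), (2, 5), (4.7, 2)]]   # one pin moved right
wl_before = hpwl(nets_before)
wl_after = hpwl(nets_after)
pct = overhead_percent(wl_before, wl_after)
```

Here the first net contributes 3 + 4 and the second 3 + 4 before the move, so the total is 14; moving one pin by 0.7 widens the second bounding box and raises the total by the same amount.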
The experiments are performed on the ICCAD 2014 Incremental Timing-Driven Placement Contest benchmark suites [13]. Adaptive body bias is employed as the platform of adaptive circuit design. Please note that our method can be applied to other types of adaptive design, such as voltage interpolation [5]. We assume canary flip-flop [3] based delay variation sensors. The control signals incur only several dozen nets, whose wirelength is negligible in circuits with hundreds of thousands of nets. We performed routing for a few cases and found that the control wirelength accounts for less than 0.1% of the total wirelength. Because the wirelength increase is tiny, the timing disturbance is very small even though cell placement is perturbed during incremental placement. In the experiments, we focus only on the wirelength overhead arising from the clustering and incremental placement. The area overhead of adaptive circuits mostly consists of sensor area and the gate area increase due to the triple-well process for body bias. The number of clusters is empirically chosen in a range from 10 to 25. Please note that our clustering algorithm can autonomously adjust the number of clusters. Because of the nature of the K-means clustering algorithm, cluster sizes differ from each other. The difference in size can be quite dramatic in some extreme cases. Hence, there is no obvious pattern in how the final result changes with the number of clusters and the clustering solution. The timing is estimated according to the RC switch model and the Elmore model. The power model is the same as that in [7]. Gate length variations with a standard deviation of 5% of the nominal value are considered.
The following approaches are compared in the experiments.
• Over-design: This is the conventional non-adaptive circuit design. It does not have sensors, control or tuning circuits, and therefore cannot adapt to variations. It applies the same amount of power to all chips according to the worst-case variation.
• Location-aware clustering: This is an implementation of the flow in Figure 1 , but only spatial proximity is considered in the clustering step. As such, each cluster forms a contiguous region without the need of incremental placement.
• Timing-aware clustering: This is an implementation of the flow in Figure 1 , but only timing proximity is considered in the clustering step, i.e., γ = 0 for the clustering distance defined in Equation (3). This implementation tries to emulate the approach of [7] in a broad sense, but in a simpler manner.
• Ours: This is our complete flow in Figure 1 based on timing and location aware clustering.
Experimental Results
All methods are tested under several different timing constraints and the average results are shown in Tables 1 and 2. The results in Table 1 are from experiments with only forward body bias (FBB). One can see that all methods achieve about the same timing yield and that adaptive design can save about 26% power compared to over-design. Our method results in 26% less area overhead than the location-aware clustering method. Compared to the timing-aware clustering, our method not only reduces area overhead by 31% but also incurs much less wirelength overhead. The average wirelength overhead from our method is only 0.6%, which is about 95% less than that from the timing-aware clustering. Table 2 summarizes results from experiments with both FBB and RBB (reverse body bias). The observation is similar to that for Table 1, except that the area overhead reduction from our method is 78% on average compared to location-aware clustering. For circuit mgc_matrix_mult, the adaptive design leads to an area decrease. This is because the optimization in the last step of Figure 1 may downsize cells.
Runtime data of the flow for different methods with FBB are listed in Table 3. Even for the largest case, netcard, which has nearly one million cells, our complete flow takes about 3 hours. This is much faster than the approach of [7], which spends nearly 4.5 hours to process a small circuit with 32K cells. The over-design flow has only the initial placement and timing analysis parts of Figure 1. By comparing with the runtime of our complete flow, one can tell that the initial placement and timing analysis account for about 1/3 of the total runtime. The location-aware clustering method does not include incremental placement. A simple calculation shows that the incremental placement accounts for about 1/3 of the entire runtime, and the clustering plus adaptivity optimization also costs about 1/3 of the total runtime.
An experiment is performed to investigate the impact of the weight factors α, β and γ in Equation (3), which defines the distance for clustering. The experiment is conducted on circuit mgc_matrix_mult with FBB, and the result is shown in Table 4. In this table, the column AP is for adaptive power, which is the average power increase due to tuning from zero body bias to forward body bias. The minimum area overhead ∆Area is 1653, which is significantly lower than that in Table 1. This is because the timing constraint for Table 4 is relatively loose, while the area overhead in Table 1 is an average from multiple experiments including those with tight timing constraints. The wirelength overhead ∆Wire is mostly decided by the ratio between α and
In another experiment, the timing constraint is varied to observe the effect on power and area overhead for circuit mgc_matrix_mult with FBB. The timing constraint is the target critical path delay we set for the circuit to meet. The power here only includes the adaptive power, which is incurred due to body bias change. The tradeoff curves are depicted in Figure 5.
As expected, power and area increase as the timing constraint is tightened. The optimization and adaptivity tuning are carried out in a way to obtain similar timing yield. Table 5 provides data related to the design perturbation due to the clustering and incremental placement. The third column lists the percentage of cells being moved in the incremental placement, i.e., alien cells. One can see that this percentage is usually small. The rightmost column displays the average cell move distance in terms of the minimum NAND2 cell width. The move distance is typically a few dozen NAND2 cell widths, which is fairly small relative to the chip layout area. These data indicate that the perturbation from the clustering and incremental placement is limited.
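The two perturbation measures discussed above (percentage of moved cells and average move distance in NAND2 cell widths) could be computed along the following lines. This is a sketch under assumptions: the move distance is taken to be the Manhattan distance between the old and new locations, which the text does not specify, and the function name, dictionary layout, and numeric values are hypothetical.

```python
def perturbation_stats(before, after, nand2_width):
    """Return (percentage of moved cells, average move distance in NAND2 widths).

    before / after map each cell name to its (x, y) location; the distance of a
    move is assumed to be Manhattan, and the average is over moved cells only.
    """
    moved = [c for c in before if before[c] != after[c]]
    percent_moved = 100.0 * len(moved) / len(before)
    if not moved:
        return percent_moved, 0.0
    total = sum(abs(before[c][0] - after[c][0]) + abs(before[c][1] - after[c][1])
                for c in moved)
    return percent_moved, total / len(moved) / nand2_width


before = {"a": (0.0, 0.0), "b": (5.0, 5.0), "c": (9.0, 1.0), "d": (2.0, 2.0)}
after = dict(before, a=(6.0, 8.0))        # only cell "a" is moved
pct_moved, avg_move = perturbation_stats(before, after, nand2_width=0.5)
```

In this toy case one of four cells moves by a Manhattan distance of 14, which corresponds to 25% moved cells and an average move of 28 NAND2 widths.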
CONCLUSIONS
In this work, a new approach is proposed to reduce the overhead of adaptive circuit design. It is composed of cell clustering and incremental placement. The clustering considers both timing and spatial proximity, and is much faster than previous work. The incremental placement is realized
