Introduction
Growing circuit complexity and the demand for short time-to-market force designers to reuse existing designs (IP), while increased chip densities allow them to put more and more functionality on the same chip (SoC). Therefore, it is envisioned that platform-based design will be the optimal design approach [17]. Within this new framework, where Field Programmable Gate Arrays (FPGAs) and embedded systems define the hierarchical design philosophy, we can no longer rely on flat design methodologies, which are too time-consuming for today's large designs. Hierarchical design, where systems are built from pre-characterized library blocks/functions (such as multipliers, filters, or even processors), appears to be the proper approach. That is why the solution proposed in this paper targets moderate- and large-sized circuits, which are the basic components of the libraries mentioned above. For these circuits we have to perform gate-level physical design. Therefore, partitioning, as an early step during physical design, remains very important.
Moreover, interconnections/wiring contribute more than 70% of the circuit/block delay, and partitioning has a great impact on the interconnect distribution and thus on the circuit performance. It is therefore imperative to account for timing as early as possible in the design process, particularly during partitioning, which leads to early wire planning.
In this paper we present a statistical timing driven, hMetis-based partitioning method. To minimize timing, we exploit the hyperedge coarsening scheme of the hMetis [12] partitioner. This allows us to partition the circuit such that its most critical nets are not cut, and therefore timing minimization can be achieved. Our approach to timing-driven optimization differs from earlier ones: we steer the partitioning itself toward the best timing without performing any netlist alteration (e.g., buffer insertion or gate duplication), though our method can easily be modified to incorporate these techniques as well. The main contribution of our work is the use of a better timing criticality and of a new delay model within a net-based partitioning approach (using a fast partitioner), which produces circuits that are more tolerant to delay variations. By minimizing critical wire delays at the partitioning level, we provide a way of doing wire planning very early in the physical design process.
The remainder of the paper is organized as follows. Section 2 presents previous work on timing driven partitioning. Section 3 presents the criticality concept that we use as edge weight for the hMetis partitioner. In Section 4 we describe our proposed statistical timing driven partitioning methodology. The delay model that we use is presented in Section 5. Simulation results are presented in Section 6. We demonstrate the robustness of our partitioning algorithm in Section 7. We conclude, suggesting further research directions, in Section 8.
Previous Work
Timing driven partitioning approaches can be classified into two categories: (1) top-down partitioning approaches and (2) bottom-up clustering-based approaches. Approaches in the first category are usually based on the Fiduccia-Mattheyses (FM) [7] recursive min-cut partitioning method or on quadratic programming formulations [16], [20]. Timing optimization is obtained by minimizing the delay of the most critical path. Approaches in the second category are used mostly as a pre-processing step for min-cut algorithms [6], [14]. All previous approaches achieve delay minimization by netlist alteration such as logic replication, retiming, and buffer insertion in order to meet delay constraints while the cutsize is minimized. The focus is on delay improvement, and the cutsize is ignored. Gate replication in these methods can be massive. We can identify a few problems common to all previous timing driven partitioning approaches: (1) Unrealistic delay models are used. It is common to use the general-delay model, which considers delay 1 for all gates, delay 0 for interconnects inside a partition, and a constant delay for interconnects between partitions [6], [14], [16]. (2) Unrealistic simplifications are made. For instance, circuits are mapped to two-input gates only [6]. (3) Static timing analysis is used as the framework for timing analysis. However, it is known that there are uncertainties in both gate and wire delays, caused by fabrication variations and by changes in supply voltage and temperature, that are not captured by the delay modeling of classical static timing analysis. (4) The run time for moderate-sized circuits is too long, which makes these approaches impractical for large-sized circuits. One reason may be that previous approaches usually separate timing-driven partitioning into two steps: (i) clustering or partitioning and (ii) timing refinement based on netlist alteration [6], [16].
In this paper, we try to eliminate the above deficiencies. We approach timing driven partitioning from a different perspective: we use the statistical timing criticality concept to change the partitioning process itself, so that delay minimization is achieved while delay uncertainties are captured. We use a more realistic delay model, which incorporates a statistical net-length estimation, and we use the hMetis partitioning algorithm, which is very fast.
Statistical Timing Analysis
In this section we present the concept of criticality within the framework of statistical timing analysis versus static timing analysis. The idea of static timing analysis is to compute the slack for every gate based on the latest arrival time and the required arrival time values. Each gate has a constant delay value. However, in reality there are several uncertainties in both gate and wire delays, such as fabrication variations, changes in supply voltage and temperature [8] , [15] , [19] . These uncertainties are modeled in statistical timing analysis by considering gate and wire delays as stochastic variables (i.e. as probability distribution functions). That means that the delay variation is captured by the standard deviation. Even though different methods of statistical timing analysis have been proposed [11] , [13] , we adopt the approach proposed by Berkelaar [3] and later improved by Hashimoto and Onodera [8] who introduced the concept of criticality which fits well into the partitioning framework.
Generally, for an n-input gate (see Fig. 1.a), under the assumption of stochastic independence of the inputs, the maximum latest arrival time over all inputs can be modeled with a normal distribution whose probability density function is [8]:

$$f_{maxPIs}(x) = \sum_{i=1}^{n} \Big[ f_i(x) \prod_{j=1,\, j \neq i}^{n} F_j(x) \Big] \qquad (1)$$

where $f_i$ is the probability density function (pdf) of input i and $F_j$ is the cumulative distribution function (cdf) of input j.

Fig. 1. (a) Example of a general gate; (b) influence and criticality computation
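As a numerical illustration of the density of the maximum of independent arrival times (for each input, its pdf multiplied by the cdfs of all the other inputs, summed over the inputs), the following Python sketch evaluates that sum for normally distributed inputs. The function names and the midpoint-rule integration are illustrative assumptions and are not taken from [8].

```python
import math
from typing import List, Tuple


def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Density of a normal distribution with mean mu and std. deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def normal_cdf(x: float, mu: float, sigma: float) -> float:
    """Cumulative distribution of the same normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def max_arrival_pdf(x: float, inputs: List[Tuple[float, float]]) -> float:
    """pdf of the maximum of independent normal arrival times.

    inputs holds one (mean, sigma) pair per gate input (hypothetical layout).
    """
    total = 0.0
    for i, (mu_i, sigma_i) in enumerate(inputs):
        term = normal_pdf(x, mu_i, sigma_i)        # input i arrives at x
        for j, (mu_j, sigma_j) in enumerate(inputs):
            if j != i:
                term *= normal_cdf(x, mu_j, sigma_j)  # input j arrived by x
        total += term
    return total


def mass_and_mean(inputs: List[Tuple[float, float]],
                  lo: float, hi: float, steps: int = 20000) -> Tuple[float, float]:
    """Midpoint-rule estimate of the total probability and the mean on [lo, hi]."""
    h = (hi - lo) / steps
    mass = 0.0
    first_moment = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * h
        p = max_arrival_pdf(x, inputs) * h
        mass += p
        first_moment += x * p
    return mass, first_moment / mass
```

For two identical standard normal inputs, the estimated mean of the maximum should be close to $1/\sqrt{\pi}$, which is a convenient sanity check for the sum.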
Since the internal gate delay is also considered normally distributed, the gate output delay is calculated as the sum of two normal distributions: the maximum over all inputs and the internal gate delay. Wire delays are also considered stochastic variables. Hence, we can compute the probability density function of the overall circuit delay by computing the pdf of each primary output (PO). The statistical counterparts of slack in static timing analysis are the notions of influence and criticality [8]. Like the critical path concept, these notions characterize parts of the circuit from the point of view of timing. In what follows we briefly present them. The term between brackets in equation (1) represents the probability that the i-th input arrives at time x while all other inputs have already arrived.
The probability $P(T_i + t_i = x)$ expresses the magnitude of the influence that the i-th input has on $f_{maxPIs}$ at x. The influence $infl_i$ is defined as the influence proportion of the i-th input in the range $x > x_1$ (the exact expression is given in [8]). It involves two constants: $C_1$, a normalization coefficient chosen so that $\sum_i infl_i = 1$, and $C_2$, a constant that emphasizes the region of large arrival times. Criticality is meant to represent the timing criticality at each gate, i.e., the contribution to the circuit delay of all the paths that pass through that gate. It is computed from the influences (see Fig. 1.b), where $infl_{i(G)}(G_j)$ denotes how much the i(G)-th input affects the timing at gate $G_j$ for $x \ge x_1$. In other words, $infl_{i(G)}(G_j)$ represents how easily the timing criticality back-propagates from gate $G_j$ to gate G. All influences are computed by propagation from primary inputs (PI's) towards PO's. Criticalities are computed by back-propagation from PO's towards PI's. The gate with the largest criticality in a circuit is the most critical in terms of timing, since its contribution to the circuit output delays is the most significant among all gates in the circuit. Details can be found in [3], [8].
For example, the hypergraph shown in Fig. 2.a as a Directed Acyclic Graph (DAG) depicts criticality values for all hyperedges. The corresponding circuit netlist is shown in Fig. 2.b. Gate $G_2$ in the circuit schematic (i.e., vertex 8 in the corresponding DAG), together with its fanout net (i.e., hyperedge {8,9,10} in the DAG), is the most critical because its criticality, which equals 2, is the largest. In our partitioning methodology we want this hyperedge not to be cut because otherwise the circuit delay will increase.
In our partitioning methodology we use the criticality values as hyperedge weights. Thus, the hyperedge coarsening scheme of the hMetis partitioning algorithm clusters the most critical hyperedges early, which means that they would not be cut by the partitioning process. This has a great impact on the circuit timing, because the most critical nets in the circuit will not be cut during partitioning and subsequently, these critical nets will not become long/global interconnects.
The complexity of this statistical timing analysis, including the calculation of all criticalities, is linear with respect to the circuit size [8]. One can argue that the slack of each node is also an indication of gate criticality and thus that static timing analysis can be used in the same way. However, in our experiments, which included both static and statistical timing analyses, we found no one-to-one mapping between the gate criticality found by static timing analysis and the gate criticality found by statistical timing analysis. That is, a gate that is declared the most critical by statistical timing analysis is not necessarily declared the most critical by static timing analysis. We will show in Section 7 that statistical timing based partitioning is more robust than static slack-based partitioning.
Statistical Timing Driven Partitioning
In this section we present our statistical timing driven partitioning methodology. The partitioning is done by recursive bipartitioning. At each level we assign timing criticalities as weights to the corresponding hyperedges in the hypergraph. Then, the hMetis partitioning algorithm is run using the hyperedge coarsening scheme. During hypergraph coarsening, this scheme gives preference to the hyperedges that have large weights. By using timing criticality as the hyperedge weight we effectively discourage the partitioning algorithm from cutting edges with high delay criticalities.
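To make the use of criticalities as hyperedge weights concrete, the sketch below writes a hypergraph in the plain-text input format used by hMetis, where a format flag of 1 in the header announces weighted hyperedges and each hyperedge line starts with its integer weight. The function name, the `scale` factor used to turn real-valued criticalities into integers, and the floor of 1 are illustrative assumptions, not part of the methodology described here.

```python
from typing import List, Sequence


def to_hmetis_hgr(nets: Sequence[Sequence[int]],
                  criticalities: Sequence[float],
                  num_vertices: int,
                  scale: int = 1000) -> str:
    """Return the text of a weighted-hyperedge hMetis hypergraph file.

    nets          -- one sequence of 1-based vertex ids per hyperedge
    criticalities -- one real-valued criticality per hyperedge
    scale         -- assumed factor mapping criticalities to integer weights
    """
    if len(nets) != len(criticalities):
        raise ValueError("need exactly one criticality per net")
    # Header: <number of hyperedges> <number of vertices> <fmt>,
    # with fmt = 1 meaning that hyperedges carry weights.
    lines: List[str] = [f"{len(nets)} {num_vertices} 1"]
    for pins, crit in zip(nets, criticalities):
        weight = max(1, round(crit * scale))  # keep every weight a positive integer
        lines.append(" ".join(str(value) for value in (weight, *pins)))
    return "\n".join(lines) + "\n"
```

With a larger `scale`, small differences between criticalities survive the rounding, so the coarsening step can still rank nearly equal nets.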
Criticalities (i.e., hyperedge weights) are updated at each partitioning level. Initially we compute all criticalities in the circuit assuming zero delay for all wires. These criticalities are then used as the weights associated with the hyperedges. We call this process forward annotation of criticalities. After the first bipartitioning, we know which nets are cut and thus we are able to compute the delay of these wires by using the Elmore delay model. The wire delay calculation uses the statistical model for wire length proposed in [22]. These wire delays are then used to re-compute all criticalities in the circuit. We call this process back annotation of the wire delays. During the recursive bipartitioning we back annotate more and more wire delays. Hence, the criticalities reflect the timing criticality all over the circuit better and better. The recursive bipartitioning process stops when each block contains a number of vertices smaller than a threshold specified by the user.
The pseudo-code of our statistical timing driven hMetis-based partitioning algorithm is as follows: 
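The Python sketch below is one possible reading of the flow described in this section: forward annotation of criticalities, bipartitioning, and back annotation of wire delays for newly cut nets. It is not the authors' listing. The three callables `compute_criticalities`, `bipartition` (standing in for an hMetis run), and `wire_delay_for_level` are hypothetical placeholders supplied by the caller, and blocks are split until they hold at most `max_block_size` vertices.

```python
from typing import Callable, Dict, FrozenSet, Iterable, List, Tuple

Net = FrozenSet[int]
Block = FrozenSet[int]


def timing_driven_partition(
    vertices: Iterable[int],
    nets: List[Net],
    max_block_size: int,
    compute_criticalities: Callable[[List[Net], Dict[Net, float]], Dict[Net, float]],
    bipartition: Callable[[Block, List[Net], Dict[Net, float]], Tuple[Block, Block]],
    wire_delay_for_level: Callable[[int], float],
) -> Tuple[List[Block], Dict[Net, float]]:
    """Recursive bipartitioning with criticality updates at every level."""
    wire_delay: Dict[Net, float] = {}  # nets absent from this map have zero wire delay
    finished: List[Block] = []
    current: List[Block] = [frozenset(vertices)]
    level = 0
    while current:
        # Forward annotation: criticalities become the hyperedge weights.
        weights = compute_criticalities(nets, wire_delay)
        next_level: List[Block] = []
        for block in current:
            if len(block) <= max_block_size:
                finished.append(block)
                continue
            # Only nets with at least two pins inside this block can be cut here.
            local_nets = [net for net in nets if len(net & block) > 1]
            left, right = bipartition(block, local_nets, weights)
            if not left or not right:
                raise ValueError("bipartition must return two non-empty blocks")
            # Back annotation: a net receives a delay only the first time it is cut.
            for net in local_nets:
                if net & left and net & right and net not in wire_delay:
                    wire_delay[net] = wire_delay_for_level(level)
            next_level.extend((left, right))
        current = next_level
        level += 1
    return finished, wire_delay
```

Processing all blocks of one level before recomputing the criticalities keeps the update once per partitioning level, as described above; a depth-first variant would update them more often at a higher run-time cost.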
Delay Model
Our delay model has two components. The first component is the gate delay. For all gates we consider a typical intrinsic delay that is given for a typical input transition and a typical output net capacitance. This delay is the mean value of the pdf associated with the gate delay. For the pdf of every gate we consider a typical standard deviation of 15% [19]. The second component is the wire delay, which we model with the Elmore delay. The Elmore delay of an edge e (an edge corresponds to the wire connecting the net source to one of its fanout sinks) is computed from $R_e$, the lumped wire resistance, $C_e$, the lumped wire capacitance, and $C_t$, the total lumped capacitance of the source node of each net.
To compute $R_e$ and $C_e$ we need the length of each edge. For this, we use the statistical net-length estimation proposed in [22]. It gives the average length of a net connecting m cells enclosed in a rectangular area of width a and height b in terms of three fitting parameters, computed in [22] as α ≈ 1.1, β ≈ 2.0, and γ ≈ 0.5.
During recursive partitioning, when a net is cut, it is assigned a wire delay that is then used to re-compute all delays on the paths that include that net. The earlier a net is cut during recursive partitioning, the greater the back-annotated wire delay has to be. In our case, any net that is cut during the first bipartitioning step (see Fig. 3) is assumed to be bounded by a rectangular area equal to the chip area, and for simplicity we consider an aspect ratio equal to 1. At the second partitioning level, a and b take different values, which ensure a smaller delay than the one assigned at the previous partitioning level. The delay of each net is set only the first time the net is cut. In other words, if a net is cut again at a lower partitioning level, its delay is neither increased nor re-assigned (based on the net-length estimation corresponding to the bounding box at that level), because otherwise its delay would be overestimated. In our experiments we consider a 0.18 µm copper process technology (unit length resistance r = 0.115, unit length capacitance c = 0.00015).
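As a rough illustration of how a wire delay could be derived from an estimated edge length, the sketch below uses the commonly quoted lumped form of the Elmore delay, $R_e (C_e/2 + C_t)$. Whether this matches the exact expression used in this paper is an assumption, as are the function and parameter names. The units of r and c are not stated in the text, so the result is only meaningful relative to other values computed the same way.

```python
def elmore_edge_delay(length: float, r_unit: float, c_unit: float,
                      c_load: float) -> float:
    """Lumped Elmore delay of one edge (assumed form R_e * (C_e / 2 + C_t)).

    length -- estimated edge length, e.g. from a net-length model
    r_unit -- resistance per unit length
    c_unit -- capacitance per unit length
    c_load -- lumped capacitance standing in for C_t (hypothetical name)
    """
    r_e = r_unit * length  # lumped wire resistance R_e
    c_e = c_unit * length  # lumped wire capacitance C_e
    return r_e * (c_e / 2.0 + c_load)
```

Because the expression grows with the edge length, a net cut at the first level, whose bounding box is the whole chip, receives a larger delay than a net first cut at a deeper level.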
Simulation Results
In this section, we present simulation results. It is difficult to make a meaningful comparison of our statistical results with previous works based on static timing analysis (except for the experiments presented in Section 7) because: (i) Our approach is based on statistical timing analysis, whereas all previous approaches are based on static timing analysis; hence, we cannot compare statistical delay to static delay. (ii) We do not use netlist alteration to meet a timing requirement; instead, we minimize timing by changing the partitioning process itself. Our goal is to show the potential timing improvement that can be obtained using our methodology, which can be further enhanced by netlist alteration techniques. (iii) We use a different delay model from previous approaches, which are based either on the unit-delay model or on the general-delay model. However, we compare our method against the case in which all hyperedge weights are constant, which corresponds to using plain hMetis for circuit partitioning (we call it the pure partitioning method). In this way, we show the potential timing improvement that can be obtained using our algorithm. The experimental setup is shown in Fig. 4. All circuits were first optimized using script.rugged in SIS [18]. The results are presented in Table 1. The second column in Table 1 gives the number of PI's and PO's, followed by the number of gates in the third column. For each circuit, Cutsize is the total number of edges cut after the recursive bipartitioning, and Delay is the maximum mean delay (using the statistical delay model) among all PO's. The run time is given in seconds, rounded to the nearest integer.
We ran the partitioning algorithm 60 times and report the average in Table 1. The maximum number of gates allowed in each partition was set to 10% of the total number of gates (i.e., we did 10-way partitioning). As can be seen, the proposed partitioning methodology offers on average a 22% better delay. However, this comes at the expense of a 33% increase in the cutsize. On the one hand, we obtain a better delay with our partitioning algorithm because we use a better statistical timing criticality as the hyperedge weight. On the other hand, compared to pure hMetis partitioning, the cutsize increases because using criticality as the hyperedge weight effectively reduces the search space of the hMetis partitioner: the partitioner does not have the same freedom in exploring the search space as when all hyperedges have the same weight. A similar cutsize/delay tradeoff was observed for all circuits. This allows the user to "tune" the partitioning method to trade off smoothly between cutsize and delay.
The run time for our methodology is greater due to the criticality update operation. The run time for the pure hMetis algorithm includes the recording of the cut wires at each level of the partitioning as well as the delay computation.
Validation Scenarios
In this section, we describe two simple scenarios to further demonstrate the robustness of the statistical timing driven partitioning. It is known that due to the increase in chip clock frequency the amount of power consumption also increases, resulting in an increase in the chip temperature. However, the heat dissipation is usually unevenly distributed among the circuit gates, which leads to various temperatures across the whole area of the chip [21] . On one hand, higher temperatures slow down the transistors [5] . On the other hand, the interconnect resistance increases linearly with the temperature [1] . Thus, the delay of all gates and wires in areas with higher temperature will be larger than their estimated values during the design process. This motivated us to come up with two simple scenarios for testing the robustness of our proposed statistical timing driven partitioning. These scenarios are depicted in Fig.5 .
In both cases we first perform recursive bipartitioning using our statistical timing driven partitioning algorithm, the pure hMetis partitioning algorithm, or a slack-based partitioning algorithm. Second, we perform a static timing analysis to compute the maximum delay among all the PO's; we denote this delay as delay1. Third, in the first scenario (see Fig. 5.a), we apply a 15% delay increase to all gates and their fanout wires that are placed in one of the two partitions obtained after the first level of bipartitioning. We choose a typical value of 15% for the delay increase [19], though for large temperature variations the increase can be larger [1]. In this way we mimic the case where half of the chip has a higher temperature, which in turn causes a delay increase. In reality the temperature pattern will be more complex [4], but we restrict ourselves to this simplified version, which is similar to the example presented in [21]. In the second scenario (see Fig. 5.b), we apply the 15% delay increase to randomly chosen gates and their fanout wires. This case mimics the situation in which hot spots can be found everywhere on the chip [5].
Then, we perform a second static timing analysis to compute the maximum delay among all the PO's of the circuit using the new gate and wire delays; we denote this delay as delay2. Finally, we compute the overall circuit delay perturbation as 100*(delay2-delay1)/delay1. The simulation results, presented in Table 2, confirm that the perturbation of the circuit delay due to temperature variation is in most cases smaller when we use our partitioning algorithm: on average 17% and 13% smaller than with pure hMetis for the half-hotter and random-hot-spots scenarios, respectively, and on average 29% and 8% smaller than with the static slack-based partitioning for the same two scenarios. In this way circuits are more stable under the disturbing influence of factors such as temperature, power supply fluctuations, and process variations.
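The perturbation measurement can be sketched on a toy netlist as follows. This is only an illustration under simplifying assumptions: each gate has one delay value and one delay value for its fanout wire, static timing is a longest-path computation, and all names (`max_output_delay`, `heat`, `perturbation`) are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def max_output_delay(fanin: Dict[str, List[str]],
                     gate_delay: Dict[str, float],
                     wire_delay: Dict[str, float],
                     outputs: Iterable[str]) -> float:
    """Static timing analysis: latest arrival time among the primary outputs.

    wire_delay[g] is the delay of the fanout wire driven by gate g.
    """
    arrival: Dict[str, float] = {}

    def arrival_time(gate: str) -> float:
        if gate not in arrival:
            latest_input = max(
                (arrival_time(src) + wire_delay[src] for src in fanin.get(gate, [])),
                default=0.0,  # gates fed only by primary inputs start at time 0
            )
            arrival[gate] = latest_input + gate_delay[gate]
        return arrival[gate]

    return max(arrival_time(po) for po in outputs)


def heat(delays: Dict[str, float], hot: Set[str],
         factor: float = 1.15) -> Dict[str, float]:
    """Scale the delays of the gates in `hot` by `factor` (15% by default)."""
    return {g: d * factor if g in hot else d for g, d in delays.items()}


def perturbation(delay1: float, delay2: float) -> float:
    """Overall circuit delay perturbation in percent."""
    return 100.0 * (delay2 - delay1) / delay1
```

In such a toy case, heating a gate that lies on the longest path changes the output delay, while heating a gate with enough slack leaves the perturbation at zero, which is why the placement of critical nets relative to the hot region matters.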
Conclusion
In this paper we propose a timing driven partitioning algorithm. Because we change the partitioning process itself and use the hMetis algorithm, our algorithm is fast and thus applicable to large-sized circuits. Because we use a new delay model, which better reflects the timing criticality inside circuits, our algorithm is robust, and the resulting circuits are more reliable than circuits partitioned using pure hMetis or slack-based partitioning algorithms. The proposed algorithm does not cause any area increase, because we do not use netlist alteration, and it offers a smooth cutsize/delay tradeoff. The cutsize increase is the only disadvantage of our partitioning algorithm. We are currently working on multi-objective, multi-constraint hMetis-based partitioning methodologies.
