This paper presents an automated partitioning strategy that divides a design into a set of partitions based on design hierarchy information. While the primary objective is to use these partitions in an Incremental Design flow for compile time reduction, the performance of the design should not be degraded after partitioning. Experimental results using the incremental design feature of Altera's Quartus tool show that our algorithm generates partitioning solutions comparable with those of a set of manually partitioned industrial circuits and achieves more than 50% compile time reduction.
INTRODUCTION
With today's design sizes of more than a couple of million gates, reducing the complexity of the physical design process has become a necessity. A major problem with traditional physical design flows is the long compile time incurred even after minor changes to the design. The problem of reducing compilation time has already been addressed using traditional incremental compilation techniques, where any change in the design netlist is detected and passed to an updated physical design flow that is composed of incremental synthesis, incremental placement, and/or incremental routing [1]. However, since the initial compilation is done without any knowledge of design information, a change in the design may require many changes in the final placed and routed circuit, because the components affected by the design change may have been placed in different regions of the target chip.
Recent state-of-the-art CAD tools have added incremental design features to their CAD flows. A simple definition of incremental design is the ability to recompile a design using the information from a previous compilation. Conventionally, these tools by default do not differentiate between a full and an incremental compilation. As a result, the entire design is always processed from scratch when the compiler is invoked. Nevertheless, there are situations in which a more incremental compilation flow is desirable. For example, a designer may reach a later point in the design cycle where she is uninterested in improving timing further. In this case, it is desirable to save compile time by reusing previous results for portions of the design that are unmodified. In order to use such a feature the design must be properly partitioned. A good partitioning solution can be defined as one in which the performance of the partitioned circuit is not degraded and significant compile time savings are achieved after applying design changes. In this paper we present an algorithm that can be used to create partitions for this purpose. We will show that our algorithm creates partitions comparable with those of a set of manually partitioned industrial circuits, with compile time savings of up to 50%. Before continuing to the next section, we emphasize that the term "partitioning" used here is different from the traditional partitioning used in a custom physical design flow. While in the traditional flow any part of a design can be assigned to any partition, here we limit the partitioning process to design modules only. The rest of the paper is organized as follows: section 2 presents partitioning and incremental compilation definitions. Our modular partitioning algorithm is described in detail in section 3. Section 4 discusses experimental results, and finally, section 5 concludes the paper.
PARTITIONING AND INCREMENTAL COMPILATION
By definition, a partition is a user-designated portion of the design where optimizations between it and the rest of the design are disallowed. The smallest design element that can be assigned to a partition is a design entity/module. Different modules can be assigned to a single partition only when they all belong to a single hierarchy. Fig. 1 shows a few valid partitioning solutions for a simple design hierarchy. The partition boundaries are specified with the dotted lines around the modules.
Partitioning Requirements
If critical parts of the design are assigned to different partitions, the performance of the final partitioned design may degrade significantly. This can happen due to the absence of cross-boundary optimizations between partitions. Therefore, partitioning should not create partitions with too many critical paths crossing partition boundaries. As for area, the absence of cross-boundary optimizations may also increase the total area of the design after partitioning. A good partitioning should not cause a large area increase. In this work we indirectly control the area increase by reducing the number of partitions. Our experiments show that the area increase due to our partitioning is below 4%. As the number of partitions increases, the compile time savings increase as well, but the performance may degrade because optimization is prevented among too many partitions. Therefore, we should ensure that the number of partitions is reasonable.
MPART: MODULAR PARTITIONING
In this section we present our modular partitioning (mPart) strategy to create partitioning solutions based on the requirements explained in section 2.1. We use design hierarchy information to guide the partitioning process. mPart uses simulated annealing to improve the quality of the final partitioning solution based on a cost function. Our cost function contains information about the criticality, connectivity, and size of the partitions in a partitioning solution. After reading the design netlist, a module tree is formed based on design hierarchy information. The module tree is the design hierarchy tree without its leaf nodes (design elements). After a preprocessing step that produces the initial solution, the final partitioning solution is built using netlist information and the modules on the module tree. Finally, the partitioning solution is refined in a postprocessing stage.
Partitioning Preprocessing
Fig. 2. Pseudo-code for the annealing partitioner in mPart.
The preprocessing step starts with removing all the modules on the module tree that belong to the Library of Parameterized Modules, as Quartus does not allow such modules to be assigned to partitions. We could also remove very small modules from the module tree (this means merging a small module into its parent and modifying the module tree). This seems like a valid decision, since assigning small modules to partitions brings no benefit in terms of compile time savings. However, removing even such small modules may prevent us from finding better partitioning solutions. Therefore, we do not remove any module from the module tree based on size constraints in the preprocessing step. We later explain that this can be done in the postprocessing stage. Finally, each module on the module tree is assigned to its own partition to create an initial partitioning solution.
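The preprocessing steps described above can be sketched as follows. This is a minimal illustration, not the mPart implementation: the `Module` class, the `is_lpm` flag, and the partition naming scheme are hypothetical, and the sketch assumes that the children of a removed library module are re-attached to its parent.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Module:
    name: str
    is_lpm: bool = False  # hypothetical flag: module comes from the parameterized library
    children: List["Module"] = field(default_factory=list)


def strip_lpm(module: Module) -> Module:
    """Return a copy of the module tree with library modules below the root removed.

    Assumption: children of a removed module are re-attached to its parent.
    """
    kept: List[Module] = []
    for child in module.children:
        pruned = strip_lpm(child)
        if pruned.is_lpm:
            kept.extend(pruned.children)
        else:
            kept.append(pruned)
    return Module(module.name, module.is_lpm, kept)


def initial_partitions(root: Module) -> Dict[str, str]:
    """Assign every module on the tree to its own partition."""
    assignment: Dict[str, str] = {}
    stack = [root]
    while stack:
        module = stack.pop()
        assignment[module.name] = "P_" + module.name
        stack.extend(module.children)
    return assignment


# Hypothetical design hierarchy used only for illustration.
tree = Module("top", children=[
    Module("A", children=[Module("lpm_mult0", is_lpm=True), Module("A1")]),
    Module("B"),
])
initial = initial_partitions(strip_lpm(tree))
```

No module is dropped for being small at this stage, which matches the decision to defer size-based removal to postprocessing.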
Using Simulated Annealing for Partitioning
We use simulated annealing [2] to minimize a cost function associated with our partitioning strategy. Our simulated annealing partitioner starts with the partitioning solution formed by the set of all initial partitions after the preprocessing step. The solution is then iteratively improved by randomly changing the partitioning solution and evaluating the "goodness" of each change with a cost function. If the change results in a reduction in the partitioning cost, then the change is accepted. If the change would cause an increase in the partitioning cost, then it still has some chance of being accepted even though it makes the partitioning worse. The purpose of accepting some "bad" changes is to prevent the simulated annealing based partitioner from becoming trapped in a local minimum. Fig. 2 shows the pseudo-code of the annealing.
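The acceptance rule can be illustrated with the standard Metropolis criterion. The text does not state the exact acceptance probability used in mPart, so the exponential form below is an assumption taken from conventional simulated annealing; the function name `accept_move` is hypothetical.

```python
import math
import random


def accept_move(delta_cost: float, temperature: float, rng: random.Random) -> bool:
    """Decide whether to keep a proposed change to the partitioning solution.

    Cost reductions are always accepted. Cost increases are accepted with
    probability exp(-delta_cost / temperature) (assumed Metropolis rule).
    """
    if delta_cost <= 0:
        return True
    if temperature <= 0:
        return False
    return rng.random() < math.exp(-delta_cost / temperature)


rng = random.Random(0)
improving = accept_move(-0.2, 1.0, rng)   # cost goes down: always kept
frozen = accept_move(0.2, 0.0, rng)       # no temperature left: uphill move rejected
```

At high temperatures almost every move is kept, and as the temperature falls the partitioner behaves more and more like a greedy descent.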
Cost function
As mentioned in section 2.1, a good partitioning solution should have a minimum number of inter-partition connections. Also, the number of critical paths crossing partitions should be minimal. To achieve a partitioning solution that meets these requirements, we use a cost function that contains a criticality component and a connectivity component. The idea behind the criticality component is to minimize the number of critical paths crossing partitions. The connectivity component tries to minimize the total number of inter-partition connections and at the same time maximize the internal connectivity of the partitions.
We use the well-known Rent formulation [3] for the connectivity component of our cost function. The rule relates the number of design elements, $S$, in a partition to the number of external connections, $E$, on a partition and is given by $E = \bar{p} \cdot S^{r}$, where $\bar{p}$ denotes the average number of interconnections for a design element in the partition and $r$ is Rent's exponent, which lies in the range $[0, 1]$. A value close to 1 indicates that most of the connections in the partition are external and a value near 0 indicates that almost all connections are internal. For each partition $P$ we define connectivity as $\mathrm{conn}(P) = r(P)$, where $r(P)$ is the Rent value associated with partition $P$.
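Solving Rent's rule for the exponent gives r = ln(E / p) / ln(S), with p the average number of interconnections per element. The sketch below computes that value for one partition; the function name and the clamping of the result to [0, 1] are assumptions made for illustration, not details given in the text.

```python
import math


def rent_exponent(num_elements: int, num_external: float, avg_connections: float) -> float:
    """Solve E = p * S**r for r, given S, E and p for one partition."""
    if num_elements <= 1 or num_external <= 0 or avg_connections <= 0:
        raise ValueError("need S > 1, E > 0 and p > 0 to solve for r")
    r = math.log(num_external / avg_connections) / math.log(num_elements)
    # Assumption: clamp to the range the text gives for Rent's exponent.
    return min(1.0, max(0.0, r))


# Hypothetical partition: 1000 elements, 4 connections per element on average,
# and an external connection count constructed so that r is 0.6.
conn_p = rent_exponent(1000, 4 * 1000 ** 0.6, 4)
```

A partition that pulls most of its connections inside lowers E for the same S and therefore lowers its connectivity value.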
Our criticality component is based on timing information from the Quartus Timing Analyzer. We first give a brief overview of the timing analysis needed by our algorithm to obtain this timing information. The circuit netlist is represented as a graph where nodes represent input and output pins of circuit elements and I/O pads. Connections between these nodes are modelled with edges in the graph. These edges are annotated with delay and slack values derived from the Quartus Timing Analyzer. We define the criticality of each edge $e$ as $\mathrm{crit}(e) = 1 - \frac{\mathrm{slack}(e)}{\max_{\forall e_1} \mathrm{slack}(e_1)}$. The criticality provides an indication of the relative importance of each edge and is used in defining the timing cost, which is the main term in our criticality cost component. The timing cost for each edge $e$ is defined as $t\_cost(e) = \mathrm{delay}(e) \cdot \mathrm{crit}(e)$. For each partition $P$ in the partitioning solution we define an internal criticality $\mathrm{crit}_{int}(P)$ over the edges in $I(P)$, where $I(P)$ is the set of internal edges in partition $P$. Similarly, we define an external criticality $\mathrm{crit}_{ext}(P)$ for each partition. The criticality cost associated with each partition $P$ is then defined as $\mathrm{crit}(P) = \frac{\mathrm{crit}_{ext}(P)}{\mathrm{crit}_{int}(P)}$. The criticality and connectivity for a partitioning solution, $PS$, are then defined as the average criticality and connectivity of all the partitions in the partitioning solution, respectively: $\mathrm{crit}(PS) = \frac{1}{|PS|} \sum_{P \in PS} \mathrm{crit}(P)$ and $\mathrm{conn}(PS) = \frac{1}{|PS|} \sum_{P \in PS} \mathrm{conn}(P)$.
We now present our cost function based on the criticality and connectivity components. As in [4], to properly balance the trade-off between the two components in the cost function, we use an auto-normalized cost function defined as $\Delta C_1 = \alpha \cdot \frac{\Delta crit}{prev\_crit} + \beta \cdot \frac{\Delta conn}{prev\_conn}$. The auto-normalized cost function depends on the change in criticality and connectivity. Two normalization variables, $prev\_crit$ and $prev\_conn$, are used to normalize the weight of these two components. The effect of these variables is to make the function weight the two components only with the $\alpha$ and $\beta$ variables, independent of their actual values.
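The criticality terms and the auto-normalized cost can be sketched as below. Two points are assumptions rather than statements from the text: the internal and external criticality of a partition are taken to be the sum of the timing costs over its internal and external edges, and the auto-normalized cost is read as alpha * (change in criticality / previous criticality) + beta * (change in connectivity / previous connectivity). All function names are hypothetical.

```python
from typing import Iterable, Tuple

Edge = Tuple[float, float]  # (delay, slack) of one timing-graph edge


def edge_criticality(slack: float, max_slack: float) -> float:
    """crit(e) = 1 - slack(e) / max slack over all edges."""
    return 1.0 - slack / max_slack


def timing_cost(delay: float, slack: float, max_slack: float) -> float:
    """t_cost(e) = delay(e) * crit(e)."""
    return delay * edge_criticality(slack, max_slack)


def partition_criticality(internal: Iterable[Edge], external: Iterable[Edge],
                          max_slack: float) -> float:
    """crit(P) = crit_ext(P) / crit_int(P).

    Assumption: both terms are sums of edge timing costs.
    """
    crit_int = sum(timing_cost(d, s, max_slack) for d, s in internal)
    crit_ext = sum(timing_cost(d, s, max_slack) for d, s in external)
    if crit_int == 0:
        raise ValueError("partition has no internal timing cost")
    return crit_ext / crit_int


def delta_c1(alpha: float, beta: float, prev_crit: float, new_crit: float,
             prev_conn: float, new_conn: float) -> float:
    """Auto-normalized change in the criticality and connectivity components."""
    return (alpha * (new_crit - prev_crit) / prev_crit
            + beta * (new_conn - prev_conn) / prev_conn)


# Hypothetical edges: one zero-slack internal edge, one relaxed external edge.
crit_p = partition_criticality([(2.0, 0.0)], [(1.0, 5.0)], max_slack=10.0)
```

Dividing each change by the previous value makes the two terms dimensionless, so only the weights decide their relative influence.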
As can be seen, a partitioning solution containing a single partition (the top-level module and all its sub-modules) will result in a low cost, as all connections are put inside the partition. To resolve this problem we set a size constraint for the partitions. We control partition size by pushing partitions toward equal size. In this way, we not only prevent partitions from growing too large, but also ensure that recompiling any partition results in almost the same compile time.
We define a new cost component for our cost function to control partition size as $\Delta C_2 = \gamma \cdot \frac{\Delta part\_size\_stdev}{prev\_part\_size\_stdev}$, where $part\_size\_stdev$ is the standard deviation of the partition sizes. Finally, we present our cost function for the partitioning problem: $\Delta C = \Delta C_1 + \Delta C_2$. We have $0 \le \alpha, \beta, \gamma \le 1$ and $\alpha + \beta + \gamma = 1$.
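The size term can be sketched in a few lines. The choice of the population standard deviation and the small floor that avoids division by zero when all partitions already have equal size are assumptions; the text specifies neither. Function names are hypothetical.

```python
import math
import statistics
from typing import Sequence


def check_weights(alpha: float, beta: float, gamma: float) -> None:
    """Enforce 0 <= alpha, beta, gamma <= 1 and alpha + beta + gamma = 1."""
    if not all(0.0 <= w <= 1.0 for w in (alpha, beta, gamma)):
        raise ValueError("weights must lie in [0, 1]")
    if not math.isclose(alpha + beta + gamma, 1.0):
        raise ValueError("weights must sum to 1")


def delta_c2(gamma: float, prev_sizes: Sequence[float], new_sizes: Sequence[float],
             floor: float = 1e-9) -> float:
    """gamma * (change in size stdev) / (previous size stdev)."""
    prev_sd = max(statistics.pstdev(prev_sizes), floor)  # assumed guard, not from the text
    new_sd = statistics.pstdev(new_sizes)
    return gamma * (new_sd - prev_sd) / prev_sd


check_weights(0.55, 0.15, 0.3)
# Hypothetical move that turns partitions of size 100 and 300 into two of size 200.
size_term = delta_c2(0.3, [100, 300], [200, 200])
```

The total change for a move is then the sum of this term and the criticality and connectivity term, so a move that evens out partition sizes lowers the cost even when the other components are unchanged.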
Creating new partitioning solutions
At each iteration of the annealing process a change to the partitioning solution is made. We define two types of changes: merge and split. In a merge, a partition is removed from the partitioning solution and all the modules inside the removed partition are merged into the parent partition. The parent partition is the partition containing the immediate parent module of the removed module in the module tree. Fig. 3(a) shows a simple module tree where each module on the tree is assigned to its own partition. A merge operation for partition $P_4$ removes that partition and merges its module $A_{31}$ into the parent partition $P_3$, as shown in Fig. 3(b). If another merge operation occurs for $P_3$, the module tree and partitioning solution of Fig. 3(c) result. In a split, a module that is already assigned to a partition is removed from that partition and a new partition is created for the module. Fig. 4(a) shows the partitioning solution obtained in Fig. 3(c). A split operation for module $A_{31}$ results in the partitioning of Fig. 4(b).
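The two move types can be sketched on a small data model in which each partition is named after its topmost module. This model, the function names, and the example tree are hypothetical (the tree is only loosely modelled on the one described for Fig. 3). The sketch also assumes that when a module is split off, its descendants in the same partition move with it so that every partition stays within a single hierarchy.

```python
from typing import Dict, Optional

Parent = Dict[str, Optional[str]]   # module -> parent module (None for the root)
Assign = Dict[str, str]             # module -> partition, named by its topmost module


def merge(assign: Assign, parent: Parent, part: str) -> Assign:
    """Dissolve partition `part` into the partition holding its root's parent."""
    up = parent.get(part)
    if up is None:
        raise ValueError("the top-level partition has no parent partition")
    target = assign[up]
    return {m: (target if p == part else p) for m, p in assign.items()}


def split(assign: Assign, parent: Parent, module: str) -> Assign:
    """Move `module` (and its descendants in the same partition) to a new partition."""
    old = assign[module]
    if old == module:
        raise ValueError("module already heads its own partition")
    result = dict(assign)
    for m, p in assign.items():
        if p != old:
            continue
        node: Optional[str] = m
        # Walk up inside the old partition; m moves if the walk reaches `module`.
        while node is not None and assign[node] == old:
            if node == module:
                result[m] = module
                break
            node = parent[node]
    return result


parent = {"top": None, "A1": "top", "A2": "top", "A3": "top", "A31": "A3"}
start = {m: m for m in parent}            # one partition per module
step1 = merge(start, parent, "A31")       # A31 joins the partition of A3
step2 = merge(step1, parent, "A3")        # A3 and A31 join the top partition
step3 = split(step2, parent, "A31")       # A31 gets its own partition again
```

Because a merge always targets the parent partition and a split always starts a partition at a single module, both moves keep every partition a connected piece of the module tree.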
Annealing parameters
In this section we discuss different parameters that control the annealing process. The starting temperature for the annealing is obtained using a method similar to that described in [4] . A set of n moves is randomly generated. Each move is then evaluated and the change in cost is observed. The initial temperature is computed to be 20 times the standard deviation of the set of cost changes. At each temperature in the anneal, n moves are generated and evaluated. The value of n is equal to the number of partitions in the current solution.
As the partitioning process advances, this number decreases, so fewer moves are made as we approach the end of the annealing. At each temperature, changes to the partitioning solution are made using merge or split operations. Each move type is randomly selected and applied to a randomly selected partition in the current solution. Once the moves at a particular temperature have been generated and evaluated, the temperature is reduced for the next iteration of the anneal. The new temperature, $T_{new}$, is given by $T_{new} = \tau \cdot T_{old}$, where the value of $\tau$ depends on the fraction of attempted moves that were accepted ($R_{accept}$) at $T_{old}$ and is determined using an approach similar to [5]. Finally, the outer loop exit criterion stops the annealing process if the temperature is less than a small fraction of the average criticality cost per external connection. The moves in the annealing process always affect external connections in the partitioning solution. If the temperature drops below a fraction of the average criticality cost of an external connection, it is unlikely that any move that results in a cost increase will be accepted, and the annealing can be terminated. This fraction is set to 0.05 during the experiments.
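The schedule described above can be sketched as three small functions. The initial-temperature factor of 20 and the exit fraction of 0.05 come from the text. The acceptance-rate thresholds and cooling factors are placeholders borrowed from the adaptive schedule commonly quoted for the VPR placer; the text only says its schedule is "similar to [5]", so they should be read as an assumption. The use of the population standard deviation and all function names are also assumptions.

```python
import statistics
from typing import Sequence


def initial_temperature(cost_changes: Sequence[float]) -> float:
    """20 times the standard deviation of the cost changes of n random moves."""
    return 20.0 * statistics.pstdev(cost_changes)


def cooling_factor(r_accept: float) -> float:
    """Pick tau from the fraction of accepted moves (placeholder thresholds)."""
    if r_accept > 0.96:
        return 0.5
    if r_accept > 0.8:
        return 0.9
    if r_accept > 0.15:
        return 0.95
    return 0.8


def should_stop(temperature: float, total_crit_cost: float, num_external: int,
                fraction: float = 0.05) -> bool:
    """Exit when T falls below a fraction of the average criticality cost
    per external connection."""
    return temperature < fraction * total_crit_cost / num_external


t0 = initial_temperature([0.0, 2.0])
t1 = cooling_factor(0.5) * t0
```

Cooling quickly when nearly everything or nearly nothing is accepted spends most of the run in the temperature range where moves are actually being discriminated.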
Postprocessing
In the postprocessing stage we remove small partitions based on a proposed minimum partition size. It should be noted that keeping small partitions may help when a change occurs in the modules corresponding to those partitions. Also, if a design contains very small critical modules, then assigning these modules to partitions will preserve performance for incremental compilation.
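One way to sketch this stage is to repeatedly merge the smallest undersized partition into its parent partition. Treating "removal" as a merge into the parent partition, processing the smallest partition first, and never removing the top-level partition are assumptions of this sketch; the data model (partitions named after their topmost module) and the sizes are hypothetical.

```python
from typing import Dict, Optional


def partition_sizes(assign: Dict[str, str], size: Dict[str, int]) -> Dict[str, int]:
    """Total number of design elements assigned to each partition."""
    totals: Dict[str, int] = {}
    for module, part in assign.items():
        totals[part] = totals.get(part, 0) + size[module]
    return totals


def remove_small_partitions(assign: Dict[str, str], parent: Dict[str, Optional[str]],
                            size: Dict[str, int], min_size: int) -> Dict[str, str]:
    """Merge partitions below min_size into their parent partition."""
    assign = dict(assign)
    while True:
        totals = partition_sizes(assign, size)
        small = [p for p, s in totals.items()
                 if s < min_size and parent[p] is not None]
        if not small:
            return assign
        victim = min(small, key=lambda p: (totals[p], p))
        target = assign[parent[victim]]
        for module in list(assign):
            if assign[module] == victim:
                assign[module] = target


parent = {"top": None, "A": "top", "A1": "A", "B": "top"}
size = {"top": 500, "A": 300, "A1": 50, "B": 100}
start = {m: m for m in parent}
cleaned = remove_small_partitions(start, parent, size, min_size=250)
```

A variant that exempts small but timing-critical modules from merging would reflect the caveat in the paragraph above, at the cost of keeping more partitions.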
EXPERIMENTAL RESULTS
In this section we present experimental results for the modular partitioning approach. We first introduce the incremental design flow used in all of our experiments and show how it is used to investigate the effect of the partitioning process. We then evaluate our mPart algorithm in order to set its parameters. To measure the effectiveness of our approach, we present partitioning results for a set of manually partitioned industrial circuits and compare the results of the evaluated algorithm with the manual partitioning results and with a few heuristic partitioning strategies. All of the experiments have been done using Quartus II v5.0 software from Altera on a Pentium 4 -866MHz with 1GB of RAM.
Experimental Methodology
Our experiments are performed on a set of real industrial circuits with an average size of 20,000 LEs (ranging from 10,000 to over 48,000). Experiments start with a full compilation of the circuits without any partitioning information. We call this a setup compile for the flat circuits. For each design, we then make a specific modification and recompile the whole design from scratch. This is called an incremental compile for the flat netlist. Note that since no partitioning information is used, the incremental compile for the flat netlist is a full recompilation. We then apply partitioning and floorplanning settings and perform a full compilation using this information. This is also called a setup compile, but it is for the partitioned circuits. Similar to the experiments for the flat circuits, we then apply the same design modifications used for incremental compilation of the flat circuits and recompile the design with the incremental design feature turned on. In this case only the partitions corresponding to the modified portions of the designs are recompiled. We then compare the results for setup compile and incremental compile for the partitioned circuits with those of the flat circuits.
In order to have the same incremental change for all of the experiments we use a compiler setting change for all the circuits. We pick a module in a design and change a compiler optimization option in Quartus. 
Algorithm Evaluation
As mentioned in section 3.2 our simulated annealing based partitioning uses three components in the cost function: criticality, connectivity, and partition size. In this section we evaluate the parameters controlling the effect of each of these components.
Size and number of partitions
We first determine a value for $\gamma$, which indirectly controls the size and the number of partitions. The values of $\alpha$ and $\beta$ are determined using $\alpha = \beta = \frac{1-\gamma}{2}$. Table 1 shows the effect of sweeping the value of $\gamma$ on circuit speed ($f_{max}$), area, total compile time (the sum of synthesis, placement, routing, and timing analysis times), number of partitions, and partition size (all results are geometric mean values over all circuits passing the experiments). Columns 2-3 show the results for the flat compilation, in which no partitioning is performed. Each subsequent pair of columns shows the results when compilation is done using partitioning and floorplanning settings for a different value of $\gamma$. The "setup" columns show the results when the original circuits are used and the "incr." columns show the results after an incremental change has been made in the design. To compare the results before and after partitioning, we compare the setup and incremental results for the partitioned case with the setup and incremental results for flat compilation, respectively. The value of $\gamma$ is increased incrementally from 0.1 to 0.4. Note that a value of 0 for $\gamma$ creates a few big partitions for most of the designs, as there is no constraint on the partition size. Such a partitioning is not of any interest.
As $\gamma$ increases, the partitions get smaller and the number of partitions increases. As for compile time savings, there is no compile time benefit for the setup compile of the partitioned circuits, as a full compilation is performed. However, for the incremental compilation, we get compile time savings in the range of 37% to 49%. As shown, lower values of $\gamma$ result in smaller compilation time benefits, as most of the partitions are large and an incremental change usually results in recompilation of big partitions. The best compilation time benefit is obtained for the highest value of $\gamma$, but higher values of $\gamma$ result in a larger area increase and more performance degradation. As shown, increasing the value of $\gamma$ yields a higher number of partitions, and therefore fewer cross-boundary optimizations occur in the designs. The area increase varies from 2% for the lowest $\gamma$ value to around 4.5% for the highest. The effect of fewer cross-boundary optimizations can also be seen in circuit speed, which degrades more significantly, especially for the incremental compilation, as we increase $\gamma$. Based on these results, and to have a reasonable number of partitions, we choose a value of $\gamma = 0.3$.
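The aggregation used in these comparisons can be illustrated with a short sketch that takes the geometric mean of per-circuit compile times and expresses the incremental saving relative to flat compilation. The function names are hypothetical and the numbers in the example are made up for illustration; they are not measurements from the experiments.

```python
import math
from typing import Sequence


def geometric_mean(values: Sequence[float]) -> float:
    """Geometric mean of strictly positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def compile_time_saving(flat_incr: float, partitioned_incr: float) -> float:
    """Fraction of compile time saved relative to the flat incremental compile."""
    return 1.0 - partitioned_incr / flat_incr


# Made-up compile times (seconds) for two circuits, for illustration only.
flat = geometric_mean([100.0, 400.0])
partitioned = geometric_mean([50.0, 200.0])
saving = compile_time_saving(flat, partitioned)
```

The geometric mean keeps one very large circuit from dominating the summary, which matters when design sizes span a wide range.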
Criticality and Connectivity
We now evaluate mPart for the criticality and connectivity parameters ($\alpha$ and $\beta$). The value of $\gamma$ is set to 0.3 and we have $\beta = 1 - \alpha - \gamma$. Table 2 shows the effect of sweeping the value of $\alpha$, which is increased incrementally from 0 to 0.7 (in other words, the value of $\beta$ is decreased from 0.7 to 0). Note that only those circuits that passed all tests are included in the results. As shown, increasing the criticality factor ($\alpha$) gives better circuit speed for both setup and incremental compilation. For small values of $\alpha$ (the first two values) we see large performance degradations in both setup and incremental modes. Experiments show that the change in the connectivity component of the cost function (the Rent value of the partitions) is much lower than the change in the criticality component. Therefore, for small values of $\alpha$, a change in the partitioning solution results in only a small change in the connectivity component of the cost function and many moves are not accepted in the annealing. As a result, the annealing process stops very soon and the final partitioning solutions contain too many partitions, many of which are very small. As the small partitions are removed from the final partitioning solution, this leads to an essentially random selection of partitions and we cannot get good performance preservation. As the value of $\alpha$ is increased, the annealer produces more acceptable partitioning solutions and we can see the effect of $\alpha$ at higher criticality factor values. As for area, increasing $\alpha$ reduces the area overhead. As mentioned above, our experiments show that after a move in the partitioning solution, the change in criticality is higher than the change in connectivity. So, as $\alpha$ increases, more moves are accepted and the size of the partitions in the final partitioning solution increases. This results in fewer partitions, which in turn reduces the area increase.
As for the total compile time, the compile time benefit is the least for the case when α is 0.7 as we have the lowest number of partitions and highest partition size. Based on these experiments we choose a value of 0.55 for α (0.15 for β). We also tested several values for the minimum allowable partition size and set this value to 250.
The run time of the algorithm ranges from tens of seconds to a couple of minutes, depending on the initial size of the partitioning solution. If we invest more time in the annealing, the results tend to improve in terms of performance preservation. However, as more partitions get merged, the compile time savings may decrease.
Manual and Random Partitioning
In order to verify the effectiveness of the automatic partitioning tool, we first compare our results with a set of manually partitioned industrial circuits. We use the manually created partitions for these circuits and perform the incremental design flow by applying the same design changes used in the previous experiments. For manual partitioning, each circuit is repeatedly partitioned until a partitioning with no performance degradation and good compile time savings is achieved. To have a fair comparison, the size and number of partitions are chosen to be close to those created by mPart. Columns 4-5 and 6-7 in Table 3 show the results for manual partitioning and mPart, respectively. As shown, mPart produces good results in terms of performance and compile time savings. Proper partitioning has resulted in a slight improvement in circuit performance for mPart. mPart also achieves more than 51% compile time savings, which is comparable with the manual results. As for area, mPart has a slightly higher area increase due to a higher number of partitions.
We now compare mPart with the partitioning solutions created by random partitioning. The random partitioning scheme works in two different modes. In the first mode (Random1), modules are randomly assigned to different partitions without any consideration for the size of the partitions. The second mode (Random2) assigns modules randomly while trying to balance the size of the partitions. Columns 8-11 in Table 3 show the results of Random1 and Random2 compared with mPart. As expected, random partitioning produces partitioning solutions that degrade circuit performance and do not yield good compile time savings. The compile time savings are much worse for the case where no constraint is set on the partition size (Random1). It should be noted that a comparison with other state-of-the-art partitioning algorithms, such as hMetis [6], is not possible, as these algorithms do not keep the hierarchy intact and tend to use elements from different hierarchies in the final partitioning solution.
SUMMARY
We have presented a partitioning strategy to divide a design into a set of partitions based on design hierarchy modules. Experimental results using Quartus showed that our algorithm can generate partitionings comparable with a set of manually partitioned industrial circuits with compile time savings of more than 50% with no performance degradation.
