Abstract
Introduction
In the nanometer regime, structured ASIC design has emerged as a promising new design style to fill the gap between cell-based ASIC designs and field programmable gate arrays (FPGAs). Structured ASICs are composed of regular arrays of prefabricated standard building blocks with fixed mask structures. One paradigm involves via-configurability, where the building blocks and interconnect skeletons are prefabricated and both can then be custom configured by the appropriate via connections. This programmability involves only a small number of masks [1] [2] [3] [4] [5]. This strategy provides a low NRE (non-recurring engineering) cost while still yielding relatively high-performance solutions. Additionally, the regularity of the well-characterized logic blocks and fixed interconnect structures helps to combat problems associated with manufacturability, yield, noise, and crosstalk.
However, many of the other problems of nanometer design continue to plague structured ASICs, prominent among them interconnect performance. A key technique for overcoming this problem is buffer insertion along global wires. Buffer insertion not only improves timing performance but can also effectively reduce noise. The number of buffers needed in a design continues to rise with decreasing feature sizes, and it is projected that at the 32nm technology node, a very large proportion of cells will be buffers [6]. Although interconnects are generally buffered in the later phases of physical design, buffer resources must be planned earlier in the design process, so that enough resources are available for the later insertion. This problem is more acute for via-programmable structured ASICs, where, in addition to the basic standard building blocks, buffers must be prefabricated and distributed in the layout. Although it is possible to reconfigure the basic building blocks to work as buffers, this is not an economical approach since (a) the number of buffers can be very large, (b) configuring large, general-purpose standard building blocks as buffers is an inefficient use of resources, and (c) these blocks do not have the driving ability of dedicated buffers. Therefore, it is essential to distribute dedicated buffers in structured ASICs, and to plan for them well, prior to the fabrication of the chip.
The buffer insertion problem for structured ASIC design has not been fully addressed so far. The only work we know of that considers this issue is [9], which assumes a uniform distribution of dedicated buffers throughout the circuit with a logic cell/buffer ratio of 2:1. However, this does not recognize that the demand for buffers depends on the interconnect complexity of circuits, and a single ratio for all kinds of circuits may result in a large waste of buffer resources. This paper proposes a distributed buffer insertion methodology for the use of dedicated buffers in structured ASIC design. A statistical estimate of buffer distribution is derived, based on Rent's rule. For each range of (p, k) values, where p is the Rent's exponent and k the Rent's coefficient, our algorithm (described in section 3.3) finds a buffer distribution for circuits falling in that range. Thus for each range, we can prefabricate an off-the-shelf structured ASIC chip with buffers preplaced according to the estimated buffer distribution of that (p, k) range. In the implementation phase, a designer may choose the appropriate prefabricated chip for implementing a custom design according to the values of p and k for the design. Experimental results show that the buffer resource estimation is accurate and adequate for interconnect buffering purposes.
* This work was supported in part by the NSF under award CCR-0205227.
Buffer Insertion for Structured ASIC Design
In cell-based ASIC design, the distributed buffer insertion approach [8] has shown advantages over the buffer block approach [7]. For structured ASIC design, in the same spirit of interspersing buffers with logic units across the circuit, we adopt a buffer insertion model in which the prefabricated buffers are scattered throughout the structured ASIC, and the distribution of buffers should be adequate for buffering global wires. We also refer to this as a distributed buffer insertion model to capture the distributed nature of this scheme, although unlike [8], the buffers are actually prefabricated in structured ASICs. We define the structured ASIC to be a two-dimensional array composed of via-configurable standard building blocks, denoted as via-configurable cells (VCCs), which are connected by via-configurable interconnect wires. To facilitate the distributed buffering model, we divide the circuit into an array of tiles, where each tile is a square area containing m × m VCCs as well as a predetermined number of dedicated buffers for interconnect buffering. For a tile positioned at (i, j), we refer to this number as the buffer capacity, denoted as Bi,j. If Bi,j is the same for all (i, j), we refer to this as a uniform buffer distribution; otherwise it is said to be nonuniform. In a routing solution, the actual number of utilized buffers in tile (i, j) is referred to as the buffer usage, denoted as bi,j. If bi,j > Bi,j, the routing solution is invalid, and we refer to this situation as buffer overflow. Figure 1(a) shows a 6 × 6 tile graph for a structured ASIC design, with buffers distributed within tiles. The corresponding buffer capacity of each tile in the circuit with a nonuniform buffer distribution is shown in Figure 1(b).
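The capacity, usage and overflow definitions above can be illustrated with a minimal Python sketch. The function names and the 2 × 2 example grids are hypothetical and chosen only for illustration; they are not taken from the paper's implementation.

```python
from typing import List

Grid = List[List[int]]


def overflow_grid(capacity: Grid, usage: Grid) -> Grid:
    """Per-tile overflow max(b_ij - B_ij, 0) for matching tile grids."""
    return [
        [max(b - cap, 0) for cap, b in zip(cap_row, use_row)]
        for cap_row, use_row in zip(capacity, usage)
    ]


def routing_is_valid(capacity: Grid, usage: Grid) -> bool:
    """A routing solution is valid only if no tile has b_ij > B_ij."""
    return all(v == 0 for row in overflow_grid(capacity, usage) for v in row)


# Hypothetical nonuniform capacities B_ij and usages b_ij on a 2 x 2 tiling.
capacity = [[4, 6], [6, 4]]
usage = [[3, 7], [6, 2]]

overflow = overflow_grid(capacity, usage)   # tile (0, 1) overflows by one buffer
valid = routing_is_valid(capacity, usage)
```

In this example only the tile at row 0, column 1 exceeds its capacity, so the routing solution as a whole would be reported as invalid.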
In this paradigm, buffers are prefabricated into the layout but are connected to the global lines that require buffering only in the later phases of physical design. To enable this, we utilize a via-defined buffer insertion (VDBI) scheme, which allows a buffer in a tile to be inserted along any interconnect wire traversing the tile, by means of a via configuration. Figure 2(a) shows a representative buffer in a tile: its input and output are connected to horizontal and vertical wires that go across the tile. Figure 2(b) shows that any wire that crosses the tile can be via-configurably connected either to this buffer through "insertion vias," or through "jumping vias" to a metal strip that skips the buffer entirely. Our model uses a simple yet effective distance-based criterion for buffer insertion. As noted in [7], the delay of a wire is relatively insensitive to the precise location of a buffer, and a buffer can be inserted within a feasible region instead of at a specific location without greatly affecting the timing performance. Our distance-based model is similar to that used in [8, 10]: the maximum length of interconnect that can be driven by a gate (buffer) is L, and this is referred to as the critical length. For a two-pin net, this implies that the separation distance between two buffers is at most L.
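As a rough illustration of the distance-based criterion, the sketch below counts the fewest buffers a two-pin net needs so that no unbuffered stretch exceeds the critical length L. It assumes that the driving gate covers the first stretch and that buffers can be placed freely along the wire (tile capacities are ignored). The function name is hypothetical, and the default of 500 µm is the critical length reported later in the experimental setup.

```python
import math


def min_buffers_two_pin(length_um: float, critical_length_um: float = 500.0) -> int:
    """Fewest buffers so that every driven segment is at most L long.

    Assumption: the source gate drives the first segment, so a net of
    length at most L needs no buffer at all.
    """
    if length_um <= 0:
        return 0
    return max(math.ceil(length_um / critical_length_um) - 1, 0)


short_net = min_buffers_two_pin(400.0)    # fits within one critical length
long_net = min_buffers_two_pin(1800.0)    # split into four segments of 450 um
```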
estimate should be based on basic circuit properties instead of circuit implementation details, so that these base chips can be used to implement a variety of designs, and may work with various physical design tools. With these considerations, our buffer distribution estimation is based on Rent's rule, an empirical relationship that correlates the number of signal input and output (I/O) terminals, T, with the number of gates, N, in a random logic network [11]. It takes the form:
$$T = k N^{p} \tag{1}$$
where k and p are called the Rent's coefficient and the Rent's exponent, respectively. These parameters reflect the complexity of a circuit, and can be derived in the process of partitioning the circuit netlist. Our work assumes that these parameters have been computed for the circuit to be mapped to the structured ASIC. When N exceeds some critical size Ncrit, the relationship between T and N deviates from the power-law curve and enters region II of Rent's rule [11]: T stays constant or drops as N increases.
Estimating the Statistics of Buffer Distribution
The number of buffers required in a tile is highly correlated to the total length of external interconnect wires crossing this tile. If a larger number of wires traverse a given tile, it is likely that a larger number of buffers will have to be inserted in the tile. From the Rent's exponent p and Rent's coefficient k for a circuit, we can apply Rent's rule to statistically estimate the length of interconnect wires crossing a specific tile D, and further, to estimate the number of buffers required in the tile.
A schematic of a circuit layout is shown in Figure 3(a). The layout is a square consisting of n × n tiles¹. Each tile is a square consisting of m × m VCCs, and the geometrical size of a tile is t × t units. While considering the estimation for a tile D, we divide the circuit, for convenience, into 9 blocks, labeled A through I, as shown in the figure, with block D in the center. Block A consists of all tiles northwest of D; block B is composed of all tiles to the east of D, and so on. We will use (i, j) to refer to the coordinates of tile D in the n × n tiling. For estimation purposes, it is reasonable to assume that the routing will remain within the bounding box set by the pins of a net. Under this assumption, the interconnect wires that cross block D consist of the contributions of the wires connecting block pairs in set S defined as:
(A, G), (A, I), (B, E), (B, F), (B, G), (B, H), (B, I), (C, E), (C, F), (C, H), (E, I), (F, H), (F, I), (G, H), (H, I)}.
The total wire length passing through tile D, WD, is given by
$$W_D = \sum_{(x,y) \in S} W_{x,y} \tag{2}$$
where Wx,y is the total wirelength of all interconnects that cross tile D and connect VCCs in blocks x and y, with (x, y) ∈ S. Since the tile dimension t is generally much smaller than the critical length L, interconnects originating from tile D are unlikely to consume buffer resources at D, and we only consider the contribution of the wires passing through D. Given WD, if the maximum interconnect length driven by any buffer is L units, as in the insertion model described in section 2, then a unit-length interconnect segment in a wire crossing tile D has a probability of 1/L of requiring a buffer to be inserted in tile D. This implies that an external wire passing horizontally through a tile of dimension t will probabilistically insert t/L buffers from this tile. We can use this idea to estimate the buffer capacity, BD, required for tile D as the probabilistic usage of buffers in the tile. In other words,
$$B_D = \frac{W_D}{L} \tag{3}$$
¹ For a general circuit of rectangular form, the analysis is very similar.
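The probabilistic argument above (each unit of wire crossing tile D needs a buffer with probability 1/L) can be sketched in a few lines of Python. The per-pair wirelengths below are made-up numbers used only to show the arithmetic, and the function name is hypothetical.

```python
from typing import Dict, Tuple


def tile_buffer_capacity(
    pair_wirelengths: Dict[Tuple[str, str], float],
    critical_length: float,
) -> float:
    """Expected buffers in tile D: total crossing wirelength divided by L."""
    w_d = sum(pair_wirelengths.values())   # W_D, summed over the block pairs
    return w_d / critical_length


# Hypothetical crossing wirelengths (in um) contributed by three block pairs.
crossing = {("A", "I"): 1200.0, ("B", "F"): 800.0, ("C", "H"): 500.0}
b_d = tile_buffer_capacity(crossing, critical_length=500.0)
```

With 2500 µm of wire crossing the tile and L = 500 µm, the sketch yields an expected demand of five buffers for that tile.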
The Wx,y components of WD can be estimated by using Rent's rule. As an example, we now illustrate how the value of WA,I may be estimated; other Wx,y components may be estimated in a similar way. As in [12], we merge the two neighboring blocks H and D into a larger block HD as shown in Figure 3(a), and apply the I/O terminal conservation rules to the three blocks A, HD and I, which are shown as shaded regions in Figure 3(a). The number of I/Os connecting blocks A and I, denoted as $T_{A\,\mathrm{to}\,I}$, is then
$$T_{A\,\mathrm{to}\,I} = T_{AHD} + T_{HDI} - T_{HD} - T_{AHDI} \tag{4}$$
in which $T_{block}$, block ∈ {AHD, HDI, HD, AHDI}, is the number of I/Os of the corresponding combined block, which can be estimated using Rent's rule (1). However, in applying this formula, we may find that the estimate moves into region II of Rent's rule as described in section 3.1.
We use a simplified approach to handle this deviation: when the number of VCCs in the combined block exceeds Ncrit, we substitute Ncrit for this number. Experimental curves show that Ncrit is between 150 and 200 for various circuits, and we simplify by taking Ncrit = 175 for all experimental circuits. Experimental results also show that a small variation in the choice of Ncrit does not affect the estimation results much.
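A minimal Python sketch of this step follows. It assumes that the terminal conservation rule of [12] takes the form T(A to I) = T(AHD) + T(HDI) − T(HD) − T(AHDI), and it clamps every block size at Ncrit = 175 as described above. The function names, block sizes and Rent parameters are hypothetical values chosen for illustration.

```python
def rent_terminals(n_cells: float, k: float, p: float, n_crit: float = 175.0) -> float:
    """Rent's rule T = k * N**p, with N replaced by N_crit once it exceeds N_crit."""
    n_eff = min(n_cells, n_crit)
    return k * n_eff ** p


def terminals_a_to_i(n_a: float, n_hd: float, n_i: float,
                     k: float, p: float, n_crit: float = 175.0) -> float:
    """Terminals shared by blocks A and I, by conservation over A, HD and I.

    Assumed form: T_AHD + T_HDI - T_HD - T_AHDI.
    """
    t_ahd = rent_terminals(n_a + n_hd, k, p, n_crit)
    t_hdi = rent_terminals(n_hd + n_i, k, p, n_crit)
    t_hd = rent_terminals(n_hd, k, p, n_crit)
    t_ahdi = rent_terminals(n_a + n_hd + n_i, k, p, n_crit)
    return t_ahd + t_hdi - t_hd - t_ahdi


# Hypothetical block sizes (in VCCs) and Rent parameters.
t_small = terminals_a_to_i(n_a=16, n_hd=9, n_i=11, k=4.0, p=0.5)
# When every combined block exceeds N_crit, all four terms are equal and cancel.
t_saturated = terminals_a_to_i(n_a=400, n_hd=300, n_i=500, k=4.0, p=0.5)
```

One consequence of the clamping, visible in the second call, is that block pairs whose merged blocks all exceed Ncrit contribute no terminals under this simplified model.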
To calculate the number of interconnects between blocks A and I, we define a variable α that is the fraction of terminals that are sinks. Thus we can obtain the number of point-to-point interconnects between block A and block I, $I_{A\,\mathrm{to}\,I}$, as:
$$I_{A\,\mathrm{to}\,I} = \alpha \, T_{A\,\mathrm{to}\,I} \tag{5}$$
and α can be expressed in terms of the average fanout of the system, as
$$\alpha = \frac{\mathit{fanout}}{\mathit{fanout} + 1}.$$
Using equation (5) to obtain the number of interconnects between blocks A and I, we can further combine it with a simplified L-Z shaped routing model to estimate the wire length crossing tile D due to interconnects between A and I, WA,I. Figure 3(b) shows a set of possible L-shaped and Z-shaped connections between the blocks A and I. Probabilistically, we can assume that the average positions of the terminals of the interconnects are at the centers of blocks A and I. Thus the routing of interconnects will follow the bounding-box path, and falls in the dotted box in Figure 3(b). We denote the distance between the centers of A and H by L1; the distance between the centers of A and C by L2; and the distance between the center of A and the northern edge of D by L3. The parameters L1, L2 and L3 are pictorially illustrated in Figure 3, and these can be expressed in terms of i, j, n and t.
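The conversion from terminals to point-to-point interconnects can be sketched as follows, assuming that the interconnect count is simply the sink fraction α times the number of shared terminals. The function names and the example fanout and terminal count are hypothetical.

```python
def sink_fraction(avg_fanout: float) -> float:
    """Fraction of terminals that are sinks: fanout / (fanout + 1)."""
    return avg_fanout / (avg_fanout + 1.0)


def interconnect_count(terminals: float, avg_fanout: float) -> float:
    """Point-to-point interconnects, assumed to be alpha times the terminal count."""
    return sink_fraction(avg_fanout) * terminals


alpha = sink_fraction(3.0)                 # three sinks per source on average
n_wires = interconnect_count(8.0, 3.0)     # eight shared terminals (made-up value)
```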
In practice, it is observed that a router will route the bulk of the nets with simple L-shaped and Z-shaped patterns [13]. Hence, we can assume that the routing of an interconnect will utilize one of these two patterns, and the probabilities of using an L-shaped and a Z-shaped route are PL and PZ = 1 − PL, respectively. As in [13], we assume PL = 0.7 in the estimation. Under this routing model, we can estimate the wirelength crossing tile D due to L-shaped routes as:
The factor "1/2" in the above equation is due to the fact that there are two possible L-shaped routes, and only the upper-L route passes through tile D. Similarly, there are two kinds of Z-shaped routes: type I, with two horizontal segments, and type II, with two vertical segments. If every Z-shaped route has an equal probability of being taken, the type I Z-shaped routes will have a probability of PI = L1/(L1 + L2) of being taken, while the type II Z-shaped routes have a probability of PII = L2/(L1 + L2). Both types of routes can pass through tile D, and we can probabilistically estimate the wirelength of Z-shaped routes crossing tile D as:
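The route-pattern probabilities described above can be turned into expected route counts with a short Python sketch. It covers only the split of interconnects among the upper-L, type I Z and type II Z patterns, not the wirelength expressions themselves. The function name and the example values of L1, L2 and the interconnect count are hypothetical; PL = 0.7 follows the text.

```python
from typing import Dict


def route_pattern_split(num_interconnects: float, l1: float, l2: float,
                        p_l: float = 0.7) -> Dict[str, float]:
    """Expected number of routes per pattern for one block pair."""
    p_z = 1.0 - p_l
    p_type1 = l1 / (l1 + l2)   # Z-shaped route with two horizontal segments
    p_type2 = l2 / (l1 + l2)   # Z-shaped route with two vertical segments
    return {
        # Only one of the two possible L-shaped routes passes through tile D.
        "upper_L": 0.5 * p_l * num_interconnects,
        "Z_type_I": p_z * p_type1 * num_interconnects,
        "Z_type_II": p_z * p_type2 * num_interconnects,
    }


# Hypothetical example: 100 interconnects, L1 = 3 and L2 = 1 tile widths.
split = route_pattern_split(100.0, l1=3.0, l2=1.0)
```

Counting the lower-L routes as well (a second share equal to `upper_L`), the pattern counts add back up to the total number of interconnects.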
Finally, we can compute the total interconnect wire length from A to I that traverses D as WA,I = WL,(A,I) + WZ,(A,I). The wirelength contributions from the other block pairs in set S can be computed in the same way as WA,I, and their values can be substituted into equations (2) and (3) to yield the estimated buffer capacity BD for tile D at position (i, j).
Application to Structured ASICs
Classification of Structured ASICs
Structured ASICs consist of predetermined regular architectures, and the buffer resource planning must be addressed at the prefabrication phase. However, since specific circuit information is unavailable at this phase, we must preallocate buffers so that they satisfy the buffer insertion requirements of a range of circuits. As stated earlier, our approach determines the buffer distribution that is necessary for a circuit based on its basic characteristics, namely, the Rent's exponent p and the Rent's coefficient k. In practice, structured ASICs are prefabricated together with hard Intellectual Property (IP) blocks such as embedded processors and I/O controllers. The structured ASIC part can be used to implement the user's specific logic, and we can assume that there is only one Rent's exponent p and one Rent's coefficient k associated with this logic.
We now describe how the buffer distribution determined in section 3.2 is used to design off-the-shelf structured ASIC parts that can be used for a specific circuit. We divide the spectrum of Rent's exponent values, p, and Rent's coefficient values, k, into a set of ranges:
For circuits of tile array size n × n in a specific range pair [pi, pi+1] and [kj, kj+1], we can predetermine the maximum number of buffers required in each tile for these circuits with the algorithm shown in Figure 4. Using this estimation technique, a set of structured ASIC chips can be prefabricated. When a given circuit is to be mapped onto this fabric, its Rent's parameters are first computed. Based on their values, the appropriate prefabricated structured ASIC chip is chosen, and the circuit is mapped onto that chip. If a Rent's parameter lies exactly at the boundary between two ranges, we choose the structured ASIC representing the lower range to avoid wasting buffer resources.
To find a buffer distribution fitting the requirements of all circuits in the range [kj , kj+1] and [pi, pi+1] as in Figure 4 , we must find the maximum estimated buffers in each tile.
Theorem 1. The estimated number of buffers is a monotonically increasing function of the Rent's coefficient, k.
Theorem 1 can be easily proved from the linear dependence on k in Rent's rule. From Theorem 1, we can set k to the upper limit of the range, i.e., kj+1, to find the maximum number of buffers in each tile. However, it is easy to show experimentally that the dependence between the estimated number of buffers and p is not monotonic, and therefore we enumerate p with a step size of δ to find the maximum number of estimated buffers in each tile for that range. In our experiments, we find that δ = 0.01 is an appropriate choice of step size. With this estimated buffer capacity for circuits that lie within a range of Rent's parameter values, we can predetermine a single buffer distribution that satisfies the interconnect buffering requirement of all of these circuits, thus creating only one structured ASIC chip that can meet the requirements of all of them. The finer the granularity of these ranges, the more accurate the buffer distribution will be, but the trade-off is that a larger number of base chips have to be prefabricated. The size of the structured ASIC is determined at the prefabrication phase and may be larger than the circuit to be implemented. However, for the same (p, k) range, it can be proved that the estimated buffer usage for a larger prefabricated structured ASIC will be greater than the estimated buffer usage for a smaller circuit, if we place the smaller circuit at the center of the structured ASIC. Thus if we estimate and distribute buffers for a larger base structured ASIC, the result will always satisfy the requirements of smaller implemented circuits.
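The enumeration described above can be sketched in Python as follows. This is only an outline under stated assumptions, not the algorithm of Figure 4: the per-tile estimator is passed in as a callable, and `toy_estimate` is a hypothetical stand-in (deliberately non-monotonic in p) for the Rent's-rule-based estimate of section 3.2. All function and parameter names are illustrative.

```python
from typing import Callable, List

Grid = List[List[float]]


def max_capacity_over_range(estimate: Callable[[float, float, int], Grid],
                            p_lo: float, p_hi: float, k_hi: float,
                            n: int, delta: float = 0.01) -> Grid:
    """Per-tile maximum of the estimate over p in [p_lo, p_hi], with k fixed at k_hi."""
    steps = int(round((p_hi - p_lo) / delta))
    best = [[0.0] * n for _ in range(n)]
    for s in range(steps + 1):
        p = min(p_lo + s * delta, p_hi)     # guard against floating-point drift
        grid = estimate(p, k_hi, n)         # k at its upper limit (Theorem 1)
        for i in range(n):
            for j in range(n):
                best[i][j] = max(best[i][j], grid[i][j])
    return best


def toy_estimate(p: float, k: float, n: int) -> Grid:
    """Hypothetical estimator: linear in k, peaks at p = 0.6."""
    value = k * (1.0 - (p - 0.6) ** 2)
    return [[value] * n for _ in range(n)]


capacity = max_capacity_over_range(toy_estimate, p_lo=0.5, p_hi=0.7, k_hi=2.0, n=2)
```

Because the toy estimator peaks in the interior of the p range, evaluating only the two endpoints would underestimate the capacity, which is why the sweep over p is needed.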
Uniform Buffer Distribution
The buffer distribution estimation obtained above is a nonuniform one, i.e., the number of buffers differs from one tile location to another. An important feature of structured ASICs is the regularity in design, which helps to improve manufacturability and reduce design complexity. An alternative to the above nonuniform estimated buffer distribution is therefore a uniform buffer distribution. We set the number of buffers in each tile of the uniform distribution to the average number of buffers over all the tiles in the nonuniform distribution, which is obtained from the algorithm in Figure 4. This uniform buffer distribution allocates the same total number of buffers as the nonuniform distribution in the prefabrication phase, and more importantly, it maintains the regularity of the structured ASICs. Since its buffer level is based on the average buffer number from our estimation, it can still satisfy the requirement of interconnect buffering under our flexible buffer insertion model, and this is verified in the experimental section.
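Deriving the uniform level from a nonuniform estimate amounts to averaging over all tiles, as in the Python sketch below. The 3 × 3 grid is a made-up example, the function name is hypothetical, and rounding to the nearest integer follows the way the estimated buffer capacity is reported in the experiments. Note that Python's built-in `round` rounds exact halves to the nearest even integer.

```python
from typing import List


def uniform_capacity(nonuniform: List[List[float]]) -> int:
    """Uniform per-tile capacity: the rounded mean of the nonuniform capacities."""
    tiles = [b for row in nonuniform for b in row]
    return round(sum(tiles) / len(tiles))


# Hypothetical nonuniform estimate: higher demand in the center of the array.
nonuniform = [
    [2, 5, 2],
    [5, 8, 5],
    [2, 5, 2],
]
level = uniform_capacity(nonuniform)
uniform = [[level] * len(row) for row in nonuniform]
```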
Experimental Results and Conclusion
Our experiments are performed on 18 of the largest MCNC benchmarks, ranging from 1047 to 8383 logic blocks. These circuits have been technology-mapped to 4-LUTs and flip flops using Flowmap [14] , and then the 4-LUTs and flip flops are combined into basic logic blocks with VPACK tools [15] . The benchmarks are then placed and routed with Versatile Place and Route (VPR) tools [15] under a 0.09µm technology. In placement and routing, the VCC block used is the same as the VPGA based 4-LUT CLB designed in [1] , and the routing architecture is the switch block architecture also from [1] . The technology parameters of 0.09µm technology used in physical design are derived from the scaling of parameters in [1] as well as from [16] and [17] .
For the buffer-related parameters, we take each tile to include 8×8 = 64 VCCs, and compute the critical length L for buffer insertion using the estimation from [10], which results in L ≈ 500µm. We divide the Rent's exponent and coefficient spectrum into range sets Rp = {[0. For each benchmark circuit, we derive the Rent's exponent p and coefficient k in a recursive partitioning process using hMetis [18]. We then select the corresponding range in Rp and Rk for each circuit. Assuming the prefabricated base structured ASIC has the same tile array size as the actual circuit implementation, we can apply the estimation algorithm in Figure 4 to find the estimated number of buffers in each tile. We use this as the estimated nonuniform buffer capacity for the prefabricated structured ASIC characterized by the corresponding ranges in Rp and Rk, and denote this as the nonuniform (NU) distribution. Based on this, we can further calculate the average number of buffers per tile as the level of the estimated uniform buffer capacity, and denote it as the uniform (UNI) buffer distribution. We also experiment with a distribution in which the ratio between logic cells and buffers is 2:1 everywhere, as in [9], and denote this as the uniform 2:1 (U21) distribution. In our experimental setup, there are 32 buffers in each tile for the U21 distribution (since there are 64 VCCs in each tile).
To verify the choices that have been made, we apply a buffer insertion algorithm from [8] to actually insert buffers into the above placed and routed circuits, under the NU, UNI and U21 buffer capacity models. This produces the actual number of buffers used in each tile. Comparing this actual buffer number distribution with that from the estimation models, we list the experimental results in Table 1. The first four columns list the circuit names, the values of the Rent's exponent p and coefficient k, and the tile array size. The next four columns list the average number of buffers per tile for the estimated buffer capacity (EBC), and the actual buffer usages from buffer insertion under the three different buffer distribution models. The value of EBC has been rounded to the nearest integer, and it is exactly the uniform buffer capacity used in the uniform (UNI) distribution model. The uniform buffer capacity for the U21 model is 32 for all structured ASICs, and is not explicitly listed. As we can see, the average numbers of buffers inserted under the NU and UNI models are close to those from our statistical estimation. In fact, the actual number of buffers required is always a little smaller than the estimated value, because our buffer estimation method determines the maximum number of buffers for a set of circuits falling in a range of Rent's exponent and coefficient, and therefore this estimate can be larger than the actual usage for a specific circuit. On the other hand, this pessimism also increases the robustness of the estimation, so that the estimated buffer capacity will satisfy the buffering requirement of circuits even in the presence of fluctuations in the buffer usages of practical circuits. Under the U21 model, the average number of buffers used is much less than the capacity, 32, which suggests an inefficient use of buffer resources.
Table 1: Comparison of the buffer usage and timing performance results from buffer insertion under three buffer distribution models: nonuniform (NU) distribution, uniform (UNI) distribution and uniform 2:1 (U21) distribution. EBC is the estimated buffer capacity from our estimation algorithm.
The columns labeled "# buffer overflow per tile" report the average buffer usage overflow per tile for each of the three buffer distribution models. The U21 model, due to its extreme pessimism, results in no overflow at all, but at the price of a large waste of resources. For the NU model, most of the circuits are free from buffer overflow, and for the instances that do have overflows the level is very low, less than 0.08 per tile. The UNI buffer distribution model results in more overflows: although it is derived from the NU model, the uniformity of the distribution incurs overflow, which can be considered a trade-off for the regularity in buffer distribution. However, these overflows are small, and even the worst case has less than 0.73 overflow per tile. These results show that our buffer estimation method produces an adequate solution that can successfully satisfy the buffer insertion requirements of practical circuits. Moreover, even with the trade-off made for regularity, our uniform distribution estimation still gives a good estimate of buffer levels.
The next three columns list the average buffer usage rate of all circuits under the different buffer distribution models. For most of the circuits, the average buffer usage rate is between 70% and 90% for both the NU and UNI models, which means that the pessimism of our estimation algorithm is low, and that it produces reliable and economical a priori buffer distribution solutions. However, due to its extremely high buffer capacity, the U21 buffer distribution shows a very poor usage rate, and thus this solution is not economical. The last three columns of Table 1 show the critical path delays for the benchmark circuits under the three buffer distribution models. We remove the overflowed buffers from the buffer insertion solution, and then calculate the critical path delay. There is not much difference in the timing performance of the three distribution models. Although there are some buffer overflows for the UNI model, the rip-up of those buffers does not affect the timing performance much because (a) the number of overflowed buffers is small, and (b) they may not lie on the critical path. These facts legitimize the use of the uniform buffer capacity model based on our estimation in structured ASICs; we can thus obtain an adequate and economical buffer solution with great regularity that improves manufacturability and reduces cost, while not affecting the timing performance significantly. Figure 5 further shows the two-dimensional buffer distribution for a specific representative circuit, pdc. The three graphs show, respectively, the estimated buffer capacity distributions for the nonuniform and uniform models, the actual buffer usage distribution for the uniform model, and the relative errors. We can see from Figure 5(a) that the distribution curve for the nonuniform distribution has a "bell curve" form, but with a flat region in the center and a sharp dropoff near the boundary.
Thus it does not deviate much from an average uniform distribution as shown in the figure. This property makes it reasonable to use a uniform buffer distribution based on the average of the nonuniform distribution, and still get good buffer usage and small buffer overflow. Figure 5(b) shows the distribution of the actual buffer usage under the uniform buffer distribution model. For this and all other benchmarks, the general trend is that the usage of buffers fits the uniform buffer capacity well in most parts of the circuit, but fewer buffers are used at the periphery. This is because there are generally fewer interconnect wires at the periphery. In practice, we may preallocate some buffer areas on the periphery as decoupling capacitors to save resources. The relative error curve in Figure 5(c) shows that the estimated buffer capacity is generally a little higher than the actual buffer number, because our estimation is aimed at the maximum buffer capacity for a range of circuits, and naturally builds in some pessimism. For the most part, this error is less than 20%, which shows that our a priori buffer distribution estimation provides an economical solution.
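The overflow and usage-rate figures discussed in this section can be computed along the following lines. This Python sketch rests on two assumptions that the text does not spell out: overflow per tile is averaged over all tiles, and the usage rate counts only the buffers that fit within a tile's capacity (overflowed buffers being ripped up) relative to the total capacity. The function name and the small grids are hypothetical.

```python
from typing import List, Tuple

Grid = List[List[int]]


def overflow_and_usage_rate(capacity: Grid, usage: Grid) -> Tuple[float, float]:
    """Return (average overflow per tile, fraction of capacity actually used)."""
    caps = [cap for row in capacity for cap in row]
    uses = [b for row in usage for b in row]
    total_overflow = sum(max(b - cap, 0) for cap, b in zip(caps, uses))
    # Assumption: overflowed buffers are removed, so a tile uses at most its capacity.
    total_used = sum(min(b, cap) for cap, b in zip(caps, uses))
    return total_overflow / len(caps), total_used / sum(caps)


# Hypothetical uniform capacity of 4 buffers per tile on a 2 x 2 tiling.
capacity = [[4, 4], [4, 4]]
usage = [[3, 5], [4, 2]]
avg_overflow, usage_rate = overflow_and_usage_rate(capacity, usage)
```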
