Abstract--A heuristic algorithm is proposed to reduce the signal delay time for a given topology of a multiple-source and multiple-sink bus. The algorithm minimizes the delay by inserting buffers into the candidate locations and sizing the buffers. Experiments show up to 7.2%, 20.7%, and 29.6% improvement in delay for 2.0, 0.5, and 0.3 micron technologies, respectively.
I. INTRODUCTION
In the design of high-performance VLSI systems, buses inherently exist in the chip. A multiple-source bus is used to conserve routing area and number of terminals of functional blocks. However, the trade-off is a larger loading capacitance and delay time. Therefore, reducing the signal delay time in a multiple-source and multiple-sink bus is an important practical problem.
A conventional approach to reduce the delay between sources and sinks is sizing the source drivers. For a given number of sources, engineers manually tune each driver's size to reduce the delay times until the timing requirement is met. Many papers [1-6] concentrate on the analogous problem, transistor or gate sizing, in CMOS and digital circuits. They try to find the best sizes of a given number of gates/transistors and to reduce the propagation delay time with analytical or heuristic methods. Other researchers [7-10] insert buffers into the wires to reduce the delay in single-source Steiner tree distribution.
The multiple-source bus is complicated by its multi-source and multi-sink characteristics. The best solution for a particular source and its sinks may result in another source not fulfilling its timing requirements. This paper is the first effort to reduce multiple-source bus signal delay by buffer insertion.
Given a set of N candidate locations for buffer insertion and timing requirements, our goal is to find a set of locations for buffers and their sizes to minimize delay. An exhaustive search of all the combinations of location and sizes of the inserted buffers is not feasible for industry applications.
We adopt the h-optimal approach by Lin and Kernighan [11, 12] to search for the buffer assignment. We observe that sizing of buffers belongs to the class of geometric programming problems. A heuristic iterative improvement method is used to insert buffers into the candidate locations and to tune the buffer sizes. Our algorithm has a running time of $O(N^4 n \ln W_{max})$, where N is the number of candidate buffer locations, $W_{max}$ is the maximum buffer size, and n is the total number of sources/sinks. The results show average delay improvements of 7.2%, 20.7%, and 29.6% for 2.0, 0.5, and 0.3 micron technologies, respectively.
*C.C. Tsai is on leave (7/94-7/95) from the Dept. of Electronic Engineering, National Taipei Institute of Technology, Taipei, Taiwan, ROC. (E-mail: cct@en.tit.edu.tw)
The remainder of the paper is organized as follows. Section II describes the problem statement. Section III gives the wire and buffer delay models. The buffer insertion algorithm and its time complexity are introduced in Section IV. The last two sections present the experimental results and conclusions.
II. PROBLEM STATEMENT
We first define the following symbols used throughout this paper.
$p_i$: the ith terminal in a bus; a terminal may be a source, sink, or both.
n: the number of terminals in the bus.
A bus consists of n terminals and N bus-wire segments. A terminal may be a source or a sink in different timing periods. In each timing period, there can be only one working source but multiple sinks. In the same period, a subset of the wire segments is used. Fig. 1 shows an example of a six-terminal bus with 14 wire segments and four timing periods. In the figure, terminal p1 is the source in periods 1 and 3. Terminals p2 and p4 are the sources in periods 2 and 4, respectively. Note that the three sources, p1, p2, and p4, are also sinks in different timing periods. Each source has at least one sink. For instance, source p4 has four sinks, p1, p3, p5, and p6, in timing period 4. There are 3 source drivers (p1, p2, and p4) and 14 possible buffer locations, with one location on each wire segment.
Given the required time $t_{ij}$ and the actual arrival time $a_{ij}$ from source i to sink j, we define the slack $s_{ij}$:
$s_{ij} = t_{ij} - a_{ij} \quad \forall i, j$  (1)
We define the slack sBus of the bus as the minimal slack $s_{ij}$ over all possible pairs of source i and sink j:
$s_{Bus} = \min_{i,j} s_{ij}$  (2)
To optimize the system performance, our goal is to maximize the slack sBus by buffer insertion and sizing. Fig. 2 shows a multiple-source bus with buffer insertion. There are buffers or bi-directional buffers on some segments of the bus. The direction of the signal flow is determined by arbitrators. We assume the wire widths in a bus are invariant, while the wire widths of the control signals can be sized to match the timing requirements. Note that this is a distributed control system. For each source, the control signal is generated from the same block. The control signal triggers arbitrators, which in turn set the direction of the bi-directional buffers. For the distributed control bus system (Fig. 2) we adopt the following assumptions:
(1) The potential buffer insertion locations are given. (2) All the wire widths in a bus are homogeneous. (3) The control signals arrive earlier than the data and meet the minimum setup time.
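The slack definitions above can be illustrated with a small sketch. The terminal names follow Fig. 1, but the required and arrival times below are made-up numbers chosen only for illustration, and the function name `bus_slack` is hypothetical.

```python
def bus_slack(required, arrival):
    """Return (s_Bus, critical pair) for dicts keyed by (source, sink).

    The slack of a pair is required time minus arrival time; the bus
    slack is the minimum over all source-sink pairs.
    """
    slacks = {pair: required[pair] - arrival[pair] for pair in required}
    critical = min(slacks, key=slacks.get)
    return slacks[critical], critical


# Illustrative (not measured) times in arbitrary units.
required = {("p1", "p3"): 5.0, ("p2", "p1"): 4.0, ("p4", "p6"): 6.0}
arrival = {("p1", "p3"): 4.0, ("p2", "p1"): 4.5, ("p4", "p6"): 5.0}

s_bus, critical_pair = bus_slack(required, arrival)
print(s_bus, critical_pair)
```

A negative bus slack means that at least one source-sink pair misses its required time, so that pair is the one buffer insertion and sizing must improve first.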
In this paper, we concentrate on buffer insertion and sizing. The general problem of bus buffer insertion can be stated as below: 
A. Delay Between Active Components
We choose the Elmore delay model [13] to estimate the signal delay. Fig. 3 shows the delay model of a segment (f,g) of wire, where $l_{fg}$ is the length of wire (f,g), and $R_w$ and $C_w$ are the resistance and the capacitance per unit wire length, respectively. The resistance $r_{fg}$ and capacitance $c_{fg}$ of wire (f,g) are $r_{fg} = R_w l_{fg}$ and $c_{fg} = C_w l_{fg}$, respectively. For the delay from a buffer u to the next active component, we trace from the output of buffer u to construct a tree $T_u$ with u as the root and its descendant buffers/sinks as the leaves. Each branch of $T_u$ corresponds to a wire segment and each internal node corresponds to a junction. For each leaf v, the path (u,v) from root u to v is unique. Based on the distributed RC tree of the Elmore delay model, the wire delay time from u to v, $d_{uv}$, can be denoted by
where $c(T_g)$ is the lumped capacitance of the sub-tree rooted at node g. The capacitance $c(T_g)$ can be partitioned into two terms, $c_w(T_g)$, the capacitance contributed by wires, and $c_l(T_g)$, the capacitance contributed by the buffers and sinks, calculated by the following formula:
and $c_l(T_g) = c_b w_{bg}$ if g is a buffer or sink,
where D(Tg) is the set of children of node g.
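As a rough illustration of the tree-based delay computation, the sketch below uses the textbook Elmore form for distributed segments, in which each segment (f,g) on the path contributes r_fg (c_fg / 2 + c(T_g)). That form is an assumption here, since the paper's own equation is not reproduced in this text. The data layout (`children`, `load`) and the numbers are hypothetical.

```python
def subtree_cap(node, children, load):
    """c(T_g): load at g plus all wire and load capacitance below g."""
    total = load.get(node, 0.0)
    for child, _r, c in children.get(node, []):
        total += c + subtree_cap(child, children, load)
    return total


def elmore_delays(root, children, load):
    """Wire delay from root to every node of the tree.

    children maps a node to a list of (child, r_fg, c_fg) segments;
    load maps a buffer or sink to its input capacitance.
    """
    delays = {root: 0.0}
    stack = [root]
    while stack:
        f = stack.pop()
        for g, r, c in children.get(f, []):
            # Assumed textbook term: r_fg * (c_fg / 2 + c(T_g)).
            delays[g] = delays[f] + r * (c / 2.0 + subtree_cap(g, children, load))
            stack.append(g)
    return delays


# Root u drives junction j, which branches to leaves v and s.
children = {"u": [("j", 1.0, 2.0)], "j": [("v", 2.0, 4.0), ("s", 1.0, 2.0)]}
load = {"v": 1.0, "s": 3.0}
delays = elmore_delays("u", children, load)
print(delays["v"], delays["s"])
```

The segment (u,j) is shared by both paths, so its contribution appears in the delay to v and to s; this shared part is what the common-path resistance captures.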
We define $r_{uvs}$ as the resistance of the common portion of path (u,v) and path (u,s), where $S(T_u)$ is the set of leaves in $T_u$, and
B. Bus Delay Model
A CMOS inverter can be a buffer where the output signal is the inverse of the input signal. For simplicity, two cascaded inverters, named the rear and the front inverters, are considered to be a buffer and inserted into a bus wire to maintain the signal polarity. For a bi-directional bus, we place two buffers in opposite directions and connect them together. Data can only pass in one direction within any timing period. This is achieved by using a tri-state inverter for the front inverter, controlled by the control signal. Fig. 4 shows the symbol and the equivalent delay model for a buffer, where $D_b$, $R_b$, and $c_b$ represent the intrinsic delay, the output driving resistance, and the input loading capacitance of a unit size inverter (1X), respectively. For a sized inverter (wX), the gate width increases by a factor of w, the output driving resistance is $R_b/w$, and the input loading capacitance is $c_b w$; the intrinsic delay $D_b$ is assumed to be a constant independent of w.
where $K_{0uv} = D_b + D_{0uv}$ and $K_{1uv} = R_b c_w(T_u)$.
(10)
Given the source i and the sink j, let $(p_i, b_1, b_2, \ldots, b_s, p_j)$ be the sequence of buffers along the path $(p_i, p_j)$. We decompose the path $(p_i, p_j)$ into a set $B_{ij}$ of buffer (source or sink) pairs, i.e., $B_{ij} = \{(p_i, b_1), (b_1, b_2), \ldots, (b_s, p_j)\}$.
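The decomposition into consecutive pairs suggests summing one stage delay per pair. The sketch below is a simplification that assumes an unbranched path, so each stage is a driver of size w_u, one wire, and one load of size w_v. The unit-inverter values are the 0.5 micron figures quoted in Section V; the wire resistances and capacitances, the sizes, and the function names are hypothetical.

```python
R_B = 3170.0   # ohm, output resistance of a unit inverter (0.5 micron, Section V)
C_B = 1.725    # fF, input capacitance of a unit inverter
D_B = 230.0    # ps, intrinsic delay, assumed independent of size


def stage_delay(w_u, r_wire, c_wire, w_v):
    """Delay in ps of one (driver, load) pair on an unbranched wire."""
    c_load = C_B * w_v
    drive = (R_B / w_u) * (c_wire + c_load)    # sized driver charging wire and load
    wire = r_wire * (c_wire / 2.0 + c_load)    # distributed wire, Elmore term
    return D_B + (drive + wire) * 1e-3         # ohm * fF = 1e-3 ps


def path_delay(sizes, wires):
    """sizes = [w_source, w_b1, ..., w_sink]; wires = [(r, c), ...] between them."""
    return sum(stage_delay(sizes[k], r, c, sizes[k + 1])
               for k, (r, c) in enumerate(wires))


# Hypothetical long wire: 2000 ohm and 2000 fF in total, driven by a 10X source.
unbuffered = path_delay([10, 1], [(2000.0, 2000.0)])
# Same wire split in half by one 10X buffer.
buffered = path_delay([10, 10, 1], [(1000.0, 1000.0), (1000.0, 1000.0)])
print(unbuffered, buffered)
```

With these numbers the buffered path is faster because halving the wire cuts the quadratic wire term, which outweighs the extra intrinsic delay. On a short wire the same buffer would slow the path down, which is why placement and sizing have to be searched together.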
With buffer insertion, the propagation delay based on the equation (9) from the source i to the sink j can be expressed as:
IV. BUS BUFFER INSERTION ALGORITHM
A. Overview
In the problem of bus buffer insertion, we attempt to insert buffers to maximize the slack sBus. We can formulate the problem in a nonlinear programming expression:
This convexity allows us to associate a stationary point uniquely with a minimum. Because the unique properties of $z_i$ make it so suitable for optimization, we call $z_i$ a natural variable. Since $z_i$ is a monotonic function of $w_{bi}$, a stationary point with respect to $z_i$ is, of course, also a stationary point with respect to $w_{bi}$. This unique property of $z_i$ makes any locally optimal solution of (14) also globally optimal. The second, third, and fourth terms in equation (13) are proportional to the buffer size $w_{bs}$ at the root of each tree $T_u$ but inversely proportional to the buffer size $w_{bu}$ at the leaf, as shown in Fig. 5. In the figure, equation (14) is convex with respect to the logarithm of the buffer size. Fig. 6 shows the outline of the combination of a set of equations (14). The solution space is convex.
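The convexity argument can be sketched as follows. Equations (12)-(14) are not reproduced in this text, so the generic delay form below, with nonnegative coefficients $K_k$, and the choice $z_i = \ln w_{bi}$ are assumptions in the usual geometric programming style rather than the paper's exact formulation.

```latex
% Assumed change of variables: z_i = \ln w_{bi}, i.e. w_{bi} = e^{z_i}.
\[
  w_{bs} = e^{z_s}, \qquad
  \frac{1}{w_{bu}} = e^{-z_u}, \qquad
  \frac{w_{bs}}{w_{bu}} = e^{z_s - z_u}.
\]
% Each term is the exponential of an affine function of z, hence convex in z.
% A delay that is a constant plus a nonnegative combination of such terms,
\[
  d_{ij}(z) = K_0 + \sum_k K_k \, e^{a_k^{T} z}, \qquad K_k \ge 0,
\]
% is therefore convex in z, and each timing constraint
\[
  d_{ij}(z) + s \le t_{ij}
\]
% defines a convex set in (z, s). Maximizing s over the intersection of
% these sets is a convex program, so a local optimum is a global optimum.
```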
Theorem 1: The local optimal solution of objective (12) and constraints (14) is also globally optimal.
We propose a simple heuristic method to improve the bus buffer insertion iteratively. There are three levels of operations.
At the top level, we try from x = 1 to N, to insert x buffers. The best result is selected. In the second level, for a given x, we search for the best buffer placement of the x buffers (Subsection IV.B). For each buffer location assignment, a third level of sizing operation is called to find the best buffer sizes such that the bus slack sBus is maximized (Subsection IV.C). The first level of the algorithm for bus buffer insertion is stated as below:
B. Buffer Placement
We solve the bus-buffer placement problem according to the combinatorial optimization approach proposed by Lin and Kernighan [12, 15, 16]. Given an initial assignment of x buffers and a number 0 < h ≤ x, in each iteration, we select h buffers and try the placement of these h buffers at all possible buffer locations. The placement that maximizes the bus slack sBus is kept. The entire process repeats until no further improvement is observed.
Lin and Kernighan define an h-optimal solution as follows. h-optimal: A solution is h-optimal if and only if the perturbation of any h buffers on the buffer locations does not improve the current solution.
To reduce the complexity of the operation, we set h to be one. For each buffer placement, we optimize the buffer sizes. Therefore, the proposed buffer placement derives a 1-optimal solution.
The detailed procedure for the buffer placement is described below:
Buffer-Placement (N, x)
{
  Randomly place x buffers into N buffer locations;
  ...
  } until no improvement in sBus is observed;
Initially, the x buffers are randomly placed into N candidate locations. In each outer iteration (lines 3-14), we move each buffer to its best location until no further move can improve the slack. The inner iteration (lines 6-12) tries the N-x available buffer locations for a given buffer and returns the best move. Finally, the best result of the buffer placement is returned to the first level.
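The first two levels can be sketched as follows. This is an illustrative reading of the description above, not the authors' listing. The callback `slack_of` is a hypothetical stand-in for the third level, which sizes the buffers and returns the resulting bus slack, and the toy slack function at the end exists only to exercise the code.

```python
import random


def one_optimal_placement(n_locations, x, slack_of, seed=0):
    """Move one buffer at a time to its best free location (h = 1)."""
    rng = random.Random(seed)
    placed = set(rng.sample(range(n_locations), x))
    best = slack_of(frozenset(placed))
    improved = True
    while improved:                      # repeat until no move helps
        improved = False
        for b in list(placed):
            base = placed - {b}
            best_loc, best_slack = b, best
            for loc in range(n_locations):
                if loc in placed:        # try only the N - x free locations
                    continue
                s = slack_of(frozenset(base | {loc}))
                if s > best_slack:
                    best_loc, best_slack = loc, s
            if best_loc != b:
                placed = base | {best_loc}
                best = best_slack
                improved = True
    return frozenset(placed), best


def insert_buffers(n_locations, slack_of):
    """Top level: try x = 1..N buffers and keep the best placement."""
    results = [one_optimal_placement(n_locations, x, slack_of)
               for x in range(1, n_locations + 1)]
    return max(results, key=lambda result: result[1])


# Toy slack: best (zero) when buffers sit exactly at locations 1 and 4.
def toy_slack(placement):
    return -len(placement ^ {1, 4})


best_placement, best_slack = insert_buffers(6, toy_slack)
print(sorted(best_placement), best_slack)
```

Because every accepted move strictly increases the slack and the number of placements is finite, the repeat loop terminates. The result is only guaranteed to be 1-optimal, not globally optimal.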
C. Buffer Size Decision
Instead of using geometric programming, to simplify the implementation, we adopt a Gauss-Seidel iteration approach. Given a buffer placement, we adjust each buffer size $w_{bu}$ to its optimal value, assuming the rest of the buffer sizes are fixed. We use binary search based on a slope comparison for sizing the buffer size $w_{bu}$ to get a maximum slack. Successive overrelaxation [17] is used to accelerate the convergence of the Gauss-Seidel iterations. Let $w_{bu}^*$ be the optimal buffer size of $w_{bu}$; we have
where k is the index of the iteration and a is the step size (a ≥ 1; experimentally, a = 1.2).
The procedure of buffer size decision is described as follows:
For each buffer, we calculate the best buffer size with binary search (line 4) and then overshoot the buffer size (line 5) to accelerate the convergence. In practice, the front inverter should be sized before the rear inverter for a faster convergence rate. Finally, the best result is returned to the second level, the buffer placement procedure.
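A minimal sketch of the sizing loop follows, under several assumptions: sizes are integers in 1..W_max, the slack is unimodal in each single size, and the overshoot uses the standard overrelaxation update w_new = w + a (w* - w), since the paper's own update equation is not reproduced in this text. The function names and the toy slack are hypothetical.

```python
def best_size(slack_at, w_max):
    """Binary search on the slope sign of a unimodal slack_at(w), w in 1..w_max."""
    lo, hi = 1, w_max
    while lo < hi:
        mid = (lo + hi) // 2
        if slack_at(mid + 1) > slack_at(mid):   # slack still rising
            lo = mid + 1
        else:
            hi = mid
    return lo


def size_buffers(sizes, slack_of, w_max=250, a=1.2, max_sweeps=20):
    """Gauss-Seidel sweeps with overshoot; returns the best sizes seen."""
    sizes = list(sizes)
    best_sizes, best_slack = list(sizes), slack_of(sizes)
    for _ in range(max_sweeps):
        changed = False
        for u in range(len(sizes)):
            def slack_at(w, u=u):
                return slack_of(sizes[:u] + [w] + sizes[u + 1:])
            w_star = best_size(slack_at, w_max)
            # Assumed overrelaxation step, clamped to the legal size range.
            w_new = max(1, min(w_max, round(sizes[u] + a * (w_star - sizes[u]))))
            if w_new != sizes[u]:
                sizes[u] = w_new
                changed = True
                s = slack_of(sizes)
                if s > best_slack:
                    best_sizes, best_slack = list(sizes), s
        if not changed:
            break
    return best_sizes, best_slack


# Toy slack: required time minus a made-up delay of two chained buffers.
def toy_slack(w):
    delay = 2.0 * w[0] + 100.0 / w[0] + 50.0 * w[1] / w[0] + 400.0 / w[1]
    return 1000.0 - delay


final_sizes, final_slack = size_buffers([1, 1], toy_slack)
print(final_sizes, final_slack)
```

Keeping the best sizes seen so far guards against an overshoot step that temporarily lowers the slack. Each call to `best_size` needs about log2(W_max) slack evaluations, which matches the logarithmic factor in the stated complexity.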
D. Time Complexity
The bus buffer insertion algorithm consists of three hierarchical levels, the outer loop (Subsection IV. ... where c2 (experimentally, c2 < 5) is the number of repeat loops.
In summary, the time complexity of the bus buffer insertion algorithm based on the combination of the three hierarchical levels is $O(N^4 n \ln W_{max})$.
V. EXPERIMENTAL RESULTS
The bus-buffer insertion algorithm has been implemented in C and runs on a Pentium PC (60 MHz) under MS-DOS 6.2. We adopt the CMOS technologies based on 2.0, 0.5, and 0.3 micron design rules [18]. For 0.5 micron technology, the input capacitance and output resistance of a unit size buffer are Cb = 1.725 fF and Rb = 3170 Ohm, respectively. We suppose that the intrinsic delay is kept constant at 230 ps for any buffer size. For 0.3 micron technology, the input capacitance and output resistance of a unit size buffer are Cb = 0.621 fF and Rb = 3170 Ohm, respectively, and the intrinsic delay is kept constant at 150 ps for any buffer size. In addition, the wire resistance is 0.05 Ohm per square and the wire capacitance is 0.1 fF per micron. To represent buffer sizing without a size limitation, we set the maximum buffer size $W_{max}$ to 250X. For a wire segment with bi-directional transmission, a bi-directional buffer is placed (if any) at the middle of the segment to balance the delay times contributed by the segment. For a wire segment with uni-directional transmission, a buffer is placed at the end of the segment on the side of the sources. We also assume that the driving capability of every source can be adjusted but all the sinks have a fixed unit size loading.
Since there are no standard benchmarks available, test cases have been created and used to evaluate our algorithm. Table I summarizes the data of the test cases. The bus is assumed to reach the four edges of the chip core. The length of the critical path is measured along the path contributing the longest time delay. The number of locations is the sum of the number of segments and sources. In all cases, the bus topologies are different and the required arrival times of all sinks are set to 90% of the delay of the critical path without buffer insertion. Case 1 has the simplest source-sink pairs, Case 2 has the minimum die size, and Case 8 has a more complicated bus structure.
Tables II, III, and IV show the results of both source driver sizing and bus buffer insertion based on 2.0, 0.5, and 0.3 micron technologies, respectively, where "Delay" is the maximum time delay from sources to sinks, "Bsizes" is the sum of all the inserted buffer sizes, and "Cputime" is the running time measured on the 60 MHz Pentium. From the tables, the "Delay" of the buffer insertion approach is always less than that of source driver sizing, but it requires a larger total buffer size. From the experiments, the average improvement in delay (Delay saving) is 7.2% for 2.0 micron technology, and 20.7% and 29.6% for 0.5 and 0.3 micron technologies, respectively, at the expense of a larger total buffer area. We use the same topology but different bus lengths to compare the impact of core size on the bus buffer insertion.
VI. CONCLUSION
The bus buffer insertion algorithm has been proposed and implemented. The algorithm is designed to minimize the clock period by inserting buffers in the multi-source and multi-sink bus. As wire resistance increases with larger chip sizes and finer wire widths, bus buffer insertion yields greater improvements in performance. In our experiments, the improvements increase from 7.2% to 20.7% and 29.6% when we move from 2.0 micron technology to 0.5 and 0.3 micron technologies.
Besides delay reduction, buffer insertion can remedy noise and cross talk problems. Bus buffers may restore the signal before the noise corrupts the data. An inverting buffer can be used to change the phase of signals and thus reduce cross talk.
Future work includes area limitation and power consumption.
