This paper presents an optimal algorithm for solving the problem of simultaneous fanout optimization and routing tree construction for an ordered set of critical sinks. The algorithm, which is based on dynamic programming, generates a rectilinear Steiner tree routing solution containing appropriately sized and placed buffers. The resulting solution, which inherits the topology of LT-Trees and the detailed structure of P-Trees, maximizes the signal required time at the driver of the given set of sinks. Experimental results on benchmark circuits demonstrate the effectiveness of this simultaneous approach compared to the sequential methods.
1. INTRODUCTION
The current deep-submicron (DSM) process technologies have increased the contribution of the interconnect delay to the total path delay in digital circuits. At the same time, the existing design flows and tools have had limited, and only marginal, success in incorporating interconnect planning and optimization early in the design process. This situation has forced IC designers to re-evaluate the existing computer-aided design (CAD) methodologies and techniques.
To address the DSM design challenges, one can either increase the lookahead capability of high-level tools or develop new algorithms for solving larger portions of the overall design problem simultaneously. This latter unification-based approach is, in our view, more promising. Indeed, the nature of IC design problems and the current state of CAD solutions have reached a point where it is both necessary and possible to combine some steps of the synthesis and physical design processes. The unification-based algorithms are capable of capturing existing interactions among the 'merged' design steps and producing higher-quality implementations by systematically searching a much larger solution space (see [SLP98] and [LSP97]).
The algorithm proposed in this paper integrates two major design steps: fanout optimization and routing tree generation. Each of these two optimization steps has been very effective in reducing the circuit delay, in one case by boosting the transmitted signals via insertion of sized buffers and in the other case by generating suitable tree structures. The goal of this work is to optimally integrate these two steps and thereby provide a uniform framework for optimizing the nets of a placed circuit to achieve faster implementations.
* This work was funded in part by SRC under contract no. 98-DJ-606 and by an NSF PECASE award (contract no. MIP-9628999).
ICCAD98, San Jose, CA, USA. © 1998 ACM 1-58113-008-2/98/0011 $5.00
The proposed dynamic programming based algorithm generates and propagates a set of buffered routing tree structures in the form of two-dimensional (required time versus input load) solution curves. The resulting routing structure is guaranteed to inherit the topology of LT-Trees [To90] and the detailed physical implementation of P-Trees [LCLH96]. This algorithm takes a given order for the sinks and, starting from the higher indexed sinks, combines them into groups which are to be driven by buffers. For each group, proper routing structures and buffer locations are examined to generate a set of possible solutions for that subset of ordered sinks. Only the solutions which are not dominated by other solutions are kept. These two steps are repeated in a dynamic programming fashion until the whole set of sinks is combined together. Experimental results reported in this paper demonstrate the effectiveness of this method versus the conventional flows that sequentially perform routing tree generation and fanout optimization.
The remainder of the paper is organized as follows. In Section 2, background and motivation are given. Section 3 introduces the proposed algorithm. In Sections 4 and 5, our experimental results and concluding remarks are presented.
BACKGROUND AND MOTIVATION

Fanout Optimization
Fanout optimization, an operation performed in the logic domain, addresses the problem of distributing a signal to a set of sinks with known loads and required times so as to maximize the required time at the signal driver. Interconnect delay is not incorporated in this operation because the locations of the buffers are not known at this stage. The general fanout optimization problem is NP-hard [To90]; however, its restriction to some special families of topologies is known to have polynomial complexity.
Among the many existing works on the fanout optimization problem, we are interested in the algorithm proposed by [To90]. That work introduces a special class of tree topologies, called LT-Trees, for which the fanout problem is solved with polynomial complexity. The LT-Tree of type-I (in this paper referred to as LT-Tree) is a tree that permits at most one internal node among the immediate children of every internal node in the tree. Touati in [To90] proposed a dynamic programming based algorithm for the fanout optimization problem where the buffer structure is restricted to the LT-Tree topology and sinks with larger required times are placed farther from the root of the tree. His algorithm first sorts the sinks in non-decreasing required time order and then, starting from the least critical sink, it enumerates all rightmost groupings of the sinks to be driven by a buffer. Finally, for each grouping, it enumerates all possible ways of adding either zero or one buffer to drive the rightmost subset of the sinks. Touati gives sufficient conditions for his LT-Tree construction algorithm, LTTREE, to be optimal.
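The LT-Tree restriction can be illustrated with a small check. The following is a minimal sketch, assuming a buffer tree is given as a hypothetical dictionary that maps each internal node (the root or a buffer) to the list of nodes it drives; the names `is_lt_tree` and `children` are illustrative and are not taken from [To90].

```python
def is_lt_tree(children, root):
    """Return True if every internal node has at most one internal child.

    `children` maps each internal node (the root or a buffer) to the list
    of nodes it drives; nodes absent from the map are sinks (leaves).
    """
    stack = [root]
    while stack:
        node = stack.pop()
        internal_kids = [c for c in children.get(node, []) if c in children]
        if len(internal_kids) > 1:
            return False
        stack.extend(internal_kids)
    return True


# A chain of buffers, each driving some sinks plus at most one buffer.
chain = {"root": ["s1", "b1"], "b1": ["s2", "s3", "b2"], "b2": ["s4"]}
# A balanced tree: the root drives two buffers, which violates the rule.
balanced = {"root": ["b1", "b2"], "b1": ["s1", "s2"], "b2": ["s3", "s4"]}
```

In this sketch the chain-like structure passes the check, while the balanced tree does not, because its root drives two buffers directly.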
Routing Tree Generation
Performance-driven interconnect design, an operation performed in the physical domain, addresses the problem of connecting a source driver to a set of sinks with known loads, required times and positions so as to maximize the required time at the driver. The inherent complexity of this problem has forced researchers to either solve it heuristically or to impose constraints on the structure of the resulting interconnect. For an overview of the existing performance-driven interconnect design techniques, interested readers are referred to [C=96].
Lillis et al. in [LCLH96] proposed the Permutation-Constrained Routing Tree, or P-Tree, structure as a solution to the above-mentioned problem. Their approach consists of two major phases: finding a proper ordering for the sinks, and then generating the routing structure based on the calculated ordering. The second phase of the algorithm, called PTREE throughout this paper, is employed in the present paper. Given an ordering of the sink nodes, PTREE finds the optimal embedding of the net into the Hanan grid (the set of points formed by the intersection of horizontal and vertical lines through the terminals of a net [Ha66]) by a dynamic programming approach. In PTREE, the (intermediate) routing solutions are stored in the form of two-dimensional, non-dominated solution curves of total area versus required time for every Hanan point.
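The Hanan grid definition above can be made concrete with a short sketch. The function name `hanan_grid` and the representation of terminals as (x, y) tuples are assumptions made for illustration only.

```python
def hanan_grid(terminals):
    """Return the Hanan grid points of a net.

    The grid is formed by intersecting the horizontal and vertical lines
    that pass through the terminals, given as (x, y) tuples.
    """
    xs = sorted({x for x, _ in terminals})
    ys = sorted({y for _, y in terminals})
    return [(x, y) for x in xs for y in ys]


terminals = [(0, 0), (2, 3), (5, 1)]
grid = hanan_grid(terminals)
```

For n terminals with distinct coordinates this yields n^2 grid points, which is why the number of Hanan points matters for the cost of any algorithm that enumerates them.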
The worst case complexity of PTREE is rather high, O(n^5); however, the runtime for practical purposes remains within an acceptable range [LCLH96]. Furthermore, by applying some techniques such as controlling the maximum number of Hanan points, the complexity of PTREE is considerably reduced without losing much in terms of the quality.
Lemma 3: For a given order on the sinks and with the restriction that the Steiner points lie on the Hanan grid, PTREE computes the set of all rectilinear Steiner trees with non-dominated required time and total capacitance [LCLH96].
Other Works
Okamoto and Cong in [OC96a] proposed a combination of A-Tree routing generation [CLZ93] and van Ginneken's buffer insertion [Gi90] as a solution to the problem of buffered Steiner tree construction. They later extended their work in [OC96b] to include wire sizing as well. Their algorithm takes the placement information of the source and the sinks in addition to the signal required arrival times and then heuristically generates a buffered routing structure such that it maximizes the required time at the source of the net. This technique consists of two phases: bottom-up tree construction with non-inferior solution computation and top-down buffer insertion. The non-inferior solution which gives the maximum required time at the root is chosen, and then it is traced back through the computations performed during the first phase that led to this solution. During the backtrace, the buffer positions are determined.
During the bottom-up phase, the subtrees are combined using a weighted addition function with a user-specified parameter to heuristically decide which two subtrees are to be merged. Although this method employs the A-Tree construction algorithm, it cannot guarantee that the resulting structure remains an A-Tree. Furthermore, the fanout optimization algorithm, which is based on critical sink isolation, is ad hoc. The overall algorithm has no guarantee of optimality. In contrast, our proposed method produces a buffered rectilinear Steiner tree which is optimal subject to the given order of the sinks, the topology of LT-Trees and the detailed structure of P-Trees.
THE FANROUT ALGORITHM
FANROUT, a simultaneous fanout and routing tree optimization algorithm, is a dynamic programming based algorithm which constructs a buffered routing structure for a given net, based on the available placement, loading, and timing information. The goal is to maximize the required time at the driver of the net.
Problem Formulation
A given net N = (s, S) determines the set of sink nodes, S = {s_1, s_2, ..., s_n}, which are to be driven by the driver of the net, called s. In addition to the input net, the following information is required and used by FANROUT:
Two Dimensional Solution Curves
Although the objective is to find an implementation with the maximum required time at s, during every step of FANROUT, load versus required time curves are generated and the solutions are compared and evaluated with respect to these two parameters. Comparing two sub-solutions based on the required time only is invalid and may result in dropping the optimal solution. This is because the load imposed by a sub-solution on the next level of the LT-Tree may increase the overall delay by more than the sub-solution's advantage in required time. Therefore, both the required time and the input load are needed to evaluate the effect of a sub-solution on the overall structure.
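The role of the two-dimensional curves can be sketched in a few lines. The code below is an illustration only: it assumes each sub-solution is a (load, required_time) pair and uses a simple linear upstream delay (drive resistance times load), which is a stand-in and not the delay model used in the paper. The function names are hypothetical.

```python
def prune(solutions):
    """Keep only non-dominated (load, required_time) pairs.

    A solution is dominated if another one has a load that is no larger
    and a required time that is no smaller.
    """
    best = []
    # Sort by increasing load; break ties by decreasing required time.
    for load, req in sorted(solutions, key=lambda s: (s[0], -s[1])):
        # Keep a solution only if it beats every lighter one on required time.
        if not best or req > best[-1][1]:
            best.append((load, req))
    return best


def upstream_required_time(solution, drive_resistance):
    """Required time one level up, under the assumed linear delay model."""
    load, req = solution
    return req - drive_resistance * load


curve = prune([(1.0, 5.0), (2.0, 7.0), (3.0, 6.0), (2.0, 6.5)])
weak_driver_choice = max(curve, key=lambda s: upstream_required_time(s, 3.0))
strong_driver_choice = max(curve, key=lambda s: upstream_required_time(s, 0.5))
```

Under a weak upstream driver the lighter solution wins even though its own required time is smaller, whereas a strong driver prefers the heavier one. Keeping only the solution with the largest required time would have discarded the first of these.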
Detailed Approach
FANROUT incorporates LT-Tree and P-Tree construction techniques into a unified framework such that the resulting routing structure is both an LT-Tree, in terms of the overall topology, and a P-Tree, in terms of the detailed physical structure. FANROUT requires an ordering of the sinks and guarantees the optimality of the solution with respect to this ordering only. At every step, z is the index showing that the n-z+1 rightmost sinks (in the ordered list of sinks) are being combined into a group driven by a buffer; see Fig. 4. The LT-Tree topology allows the use of an already processed sub-group of the last n-h+1 sinks, where h is a number between z and n. This guarantees that in the final solution, each buffer directly drives at most one other buffer.
For every Hanan node v and every index z, Γ(z, v) is a two-dimensional solution curve including all the non-inferior buffered routing structures, each connecting sinks s_z through s_n with its root located at v.
In line 4, these solution curves are initialized to the set of all non-inferior buffered paths connecting v to s_n. The code in lines 5 through 16 calculates all the buffered routing structures for Γ(z, v) using the solutions available in Γ(h, v), as described next. Corresponding to group h, there exist n^2 Γ's, one for each Hanan node. In line 7, all the Hanan nodes are [...]. PTREE returns a collection of solution curves, each corresponding to a distinct Hanan node. The collection of curves is stored in D by PTREE. Then, in line 10, these solution curves are selected one by one using a variable, Δ. Recall that each Δ corresponds uniquely to a Hanan node, which is referred to as u in line 11. Once a Δ is in hand, its encapsulated routing structures are retrieved one by one by a variable, δ. For all these routing structures, all possible buffers are tried in lines 13 through 16, and for each choice the required time at the input of the buffer is calculated using the specified delay model. In line 15, for every match a solution, σ, is generated (while saving pointers to its sub-solutions, for later use in the top-down traceback phase) which corresponds to a routing structure (i.e., δ) and a buffer (i.e., b). This solution is added to Γ(z, u) because the root of σ is located at u. The solution curves Γ(z, v) are calculated in this way; however, these curves may contain inferior solutions, which are pruned in line 16.
Finally, FANROUT builds the Γ(1, v) solution curves (for every v), which contain buffered routing structures connected to all the sink nodes. Then, for every v and for every solution of Γ(1, v), the root of the buffered routing structure is connected to the driver and the required time at the input of the driver is calculated in line 18. The structure which results in the largest required time is chosen and is traced down through the stored pointers. The buffered routing structure is retrieved and returned in lines 19 and 20.
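The buffer-trial, pruning, and final selection steps described above can be sketched as follows. This is a simplified illustration, not the FANROUT pseudo-code: it ignores Hanan nodes and wiring, represents each routing structure by a (load, required_time) pair at its root, and assumes a linear gate delay (intrinsic delay plus drive resistance times load) in place of the delay model used in the paper. The `Buffer` type and all function names are hypothetical.

```python
from collections import namedtuple

# Hypothetical buffer description; the linear delay model is an assumption.
Buffer = namedtuple("Buffer", "name input_cap intrinsic_delay drive_res")


def add_buffers(routing_solutions, library):
    """Try every buffer on every unbuffered routing solution.

    Each routing solution is (load, required_time) seen at its root.
    Returns the non-inferior (load, required_time, buffer_name) triples
    seen at the buffer inputs.
    """
    candidates = []
    for load, req in routing_solutions:
        for buf in library:
            delay = buf.intrinsic_delay + buf.drive_res * load
            candidates.append((buf.input_cap, req - delay, buf.name))
    # Prune: sort by load, keep only strictly improving required times.
    candidates.sort(key=lambda c: (c[0], -c[1]))
    kept = []
    for cand in candidates:
        if not kept or cand[1] > kept[-1][1]:
            kept.append(cand)
    return kept


def best_at_driver(curve, driver_res):
    """Pick the solution maximizing required time at the driver input."""
    return max(curve, key=lambda c: c[1] - driver_res * c[0])


library = [Buffer("small", 1.0, 1.0, 2.0), Buffer("large", 3.0, 1.0, 0.5)]
curve = add_buffers([(2.0, 10.0), (4.0, 11.0)], library)
```

After pruning, the curve holds one entry per distinct buffer input capacitance, and the final choice depends on the strength of the net driver, which is why the whole curve is propagated rather than a single best solution.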
Quality and Complexity of FANROUT
The proposed algorithm is an optimal polynomial algorithm based on a set of assumptions. The following set of lemmas and theorems formally prove these claims.
Theorem 1: The solution space of FANROUT is the product of those of PTREE and LTTREE.
Proof: Any P-Tree structure with inserted buffers such that no buffer immediately drives more than one other buffer can be visited by FANROUT. Also, any LT-Tree such that the output nets of its buffers are implemented using PTREE can be visited by FANROUT.
Lemma 5: For any arbitrary routing with no buffer, π, which connects the source to the sinks, we have:
I. By decreasing the load of any sink, the capacitance observed at the root of π does not increase.
II. By increasing the required time of any sink, the required time at the root of π does not decrease.
Proof: For case I, decreasing the load of a sink decreases the amount of the charge needed to bring the voltage of π to a certain level. For case II, if that particular sink is on the critical path, the statement is trivially true. Otherwise, the required time of the driver is determined by the required time of the other sinks and remains unchanged.
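The monotone behavior stated in Lemma 5 can be observed on a small unbuffered routing. The sketch below assumes an Elmore wire delay with half of each wire's capacitance lumped at either end; the tuple-based tree encoding and the names `evaluate` and `net` are hypothetical and serve only this example.

```python
def evaluate(node):
    """Return (capacitance, required_time) seen at the root of `node`.

    A sink is ("sink", load, required_time). A branching point is
    ("branch", [(wire_res, wire_cap, child), ...]). Wire delay follows
    the Elmore model with half of the wire capacitance at each end.
    """
    if node[0] == "sink":
        return node[1], node[2]
    total_cap, req = 0.0, float("inf")
    for wire_res, wire_cap, child in node[1]:
        child_cap, child_req = evaluate(child)
        delay = wire_res * (wire_cap / 2.0 + child_cap)
        total_cap += wire_cap + child_cap
        # The root must meet the tightest requirement among its branches.
        req = min(req, child_req - delay)
    return total_cap, req


def net(load_a, req_b):
    """Two-sink routing; sink a is critical for the values used below."""
    return ("branch", [(1.0, 2.0, ("sink", load_a, 10.0)),
                       (2.0, 1.0, ("sink", 1.0, req_b))])


base = evaluate(net(3.0, 12.0))
lighter_sink = evaluate(net(2.0, 12.0))
later_required = evaluate(net(3.0, 15.0))
```

Reducing the load of sink a lowers the root capacitance and improves the root required time, while relaxing the required time of the non-critical sink b leaves the root values unchanged, in line with cases I and II.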
Lemma 6: PTREE is monotone with respect to the load and the required time of the sinks.
Proof: Suppose π is a routing structure generated by PTREE. Reducing the capacitance and/or increasing the required time of a sink while preserving π results in the decrease of the capacitance and increase of the required time at the root of π. Therefore, if PTREE is run after changing the load and the required time of the sinks in this way, the resulting structure is non-inferior with respect to π and PTREE would store it in the curve (cf. Lemma 1).
Lemma 7: The use of the pruning operation by FANROUT does not result in the loss of any non-inferior solution.
Proof: Assume that σ2 is inferior w.r.t. σ1. By induction, if σ2 is the whole net and its input is directly connected to the net driver, then the required time does not decrease and the load does not increase by replacing σ2 with σ1. If σ2 is a solution to a sub-problem, its input is driven by another internal node, call it g. Due to the monotone behavior of PTREE (cf. Lemma 6), at g the required time and the input load of the implementation including σ2 are guaranteed to be no better than those of the implementation containing σ1. A similar argument is then valid for g and the rest of the internal nodes down to the leaf nodes.
Theorem 2: FANROUT is an optimal algorithm w.r.t. required time, subject to a set of constraints.
Proof: An examination of the dynamic programming structure of FANROUT shows that if no pruning is performed, all the possible solutions would be considered. Therefore, to prove the optimality of the algorithm it is enough to prove that for an optimal solution, replacing a non-inferior solution with an inferior solution cannot improve the whole implementation. This, however, was proved in Lemma 7.
Lemma 8: The number of solutions in any solution curve is bounded by the number of the buffers in the library, |L|.
Proof: The load of any solution is equal to the input capacitance of the driving buffer. However, the number of distinct input capacitances of the buffers is bounded by the total number of the available buffers in the library, |L|. For each load value, the solution with the maximum required time is stored and the rest will be pruned out.
Theorem 3: FANROUT has O(n^3) memory complexity.
Proof: There are n^2 Hanan points and for each of them n solution curves are stored. Each solution curve stores no more than |L| solutions. Therefore, the claim is proved.
Theorem 4: FANROUT has O(n^9) runtime complexity.
Proof: PTREE has O(n^5) worst case runtime complexity (cf. Lemma 4). Lines 5 and 6 of the pseudo-code each introduce O(n) complexity, and line 7 introduces another O(n^2) complexity. Therefore, the overall worst case complexity is O(n^9).
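Written out, the runtime bound is the product of the nested loop counts named in the proof and the cost of one PTREE call. The grouping below is an illustrative restatement under the assumption that these factors simply multiply, as the proof indicates.

```latex
\[
\underbrace{O(n)}_{\text{line 5}}
\cdot \underbrace{O(n)}_{\text{line 6}}
\cdot \underbrace{O(n^{2})}_{\text{line 7}}
\cdot \underbrace{O(n^{5})}_{\text{PTREE}}
= O(n^{9}),
\qquad
\underbrace{n^{2}}_{\text{Hanan points}}
\cdot \underbrace{n}_{\text{curves per point}}
\cdot \underbrace{|L|}_{\text{solutions per curve}}
= O(n^{3}) \ \text{for a fixed library.}
\]
```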
Reducing the Complexity
Undoubtedly, the worst case complexity of FANROUT is too high for use in many practical cases. However, that complexity can be considerably reduced by applying some simple heuristics. In the following, a couple of heuristics are introduced which have proved to be highly effective with little compromise in terms of the quality of the final results.
I. Restrict the number of Hanan points: In the exact version of FANROUT, there are n^2 Hanan points, which is a major source of excessive runtime. We may, however, not allow more than g Hanan points and change the complexity of line 7 to O(g) and the complexity of PTREE to O(g n^3) (cf. the note given at the end of sub-section 2.2). Consequently, the worst case complexity of FANROUT is changed to O(g^2 n^5).
II. Bound the maximum number of fanouts driven by a buffer: We may impose a practical upper bound on the number of fanouts that a buffer drives. Using that value, say l, we do not allow FANROUT to connect a buffer to more than l fanouts. FANROUT can easily handle this case by changing n in line 6 of the pseudo-code to z+l-1. In this case the complexities introduced by lines 6 and 9 are changed to O(l) and O(n^2 l^3) (cf. the note given at the end of sub-section 2.2), respectively. Consequently, the worst case complexity of FANROUT is changed to O(l^4 n^5).
III. Fast method: By applying both of the above techniques, the complexity of FANROUT is changed to O(g^2 l^4 n), which results in a linear worst-case complexity when g and l are assumed to be independent of n.
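The three bounds can be checked by multiplying the per-line costs. The derivation below is a sketch that assumes PTREE on k sinks with p candidate Hanan points costs O(p k^3), which is consistent with the O(n^5) bound when p = n^2 and k = n; here g is the cap on Hanan points and l is the cap on fanouts per buffer.

```latex
\begin{align*}
\text{I (at most } g \text{ Hanan points):} \quad
  & O(n) \cdot O(n) \cdot O(g) \cdot O(g\,n^{3}) = O(g^{2} n^{5}) \\
\text{II (at most } l \text{ fanouts):} \quad
  & O(n) \cdot O(l) \cdot O(n^{2}) \cdot O(n^{2} l^{3}) = O(l^{4} n^{5}) \\
\text{III (both):} \quad
  & O(n) \cdot O(l) \cdot O(g) \cdot O(g\,l^{3}) = O(g^{2} l^{4} n)
\end{align*}
```

In each row the factors correspond, in order, to line 5, line 6, line 7, and the PTREE call of the pseudo-code.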
EXPERIMENTAL RESULTS
In order to verify the effectiveness of FANROUT, a set of experimental results is reported here. In the presented conventional flows (below), we do not impose any restrictions on the ordering of the sinks. In other words, each fanout optimization and routing tree generation method is free to choose its own appropriate ordering for the sinks (if any is needed).
In Table 1, the results are presented for a set of nets taken from a number of benchmarks where the sinks are placed randomly. For these examples, two conventional flows are compared against FANROUT, where FANROUT has been used for two different orderings:
I. Ordering with respect to the sink required times, REQ.
II. Ordering generated by solving the traveling salesman problem on the set of sinks, TSP.
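The two orderings can be sketched as follows. Both functions are illustrative assumptions: the REQ ordering is taken here to be non-decreasing in required time, and the TSP ordering is approximated by a greedy nearest-neighbor tour under rectilinear distance, which is a stand-in heuristic and not necessarily the TSP method used in the experiments. The sink records and function names are hypothetical.

```python
def req_order(sinks):
    """Order sinks by non-decreasing required time (assumed direction)."""
    return sorted(sinks, key=lambda s: s["req"])


def tsp_order(sinks):
    """Greedy nearest-neighbor tour under rectilinear distance.

    This is only a heuristic stand-in for a TSP solver; it starts from
    the first sink in the (non-empty) input list.
    """
    remaining = list(sinks)
    tour = [remaining.pop(0)]
    while remaining:
        x, y = tour[-1]["pos"]
        nxt = min(remaining,
                  key=lambda s: abs(s["pos"][0] - x) + abs(s["pos"][1] - y))
        remaining.remove(nxt)
        tour.append(nxt)
    return tour


sinks = [
    {"name": "a", "pos": (0, 0), "req": 9.0},
    {"name": "b", "pos": (10, 0), "req": 4.0},
    {"name": "c", "pos": (1, 1), "req": 7.0},
]
req_names = [s["name"] for s in req_order(sinks)]
tsp_names = [s["name"] for s in tsp_order(sinks)]
```

The example shows that the two orderings can differ: one follows timing criticality, while the other keeps physically close sinks adjacent in the order.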
The first conventional flow setup, conv-I, uses SIS [SSLM92] for fanout optimization, followed by using PTREE for routing tree generation. For each net, different fanout optimization methods available in SIS are used, and only the best result in terms of the required time is reported. The second conventional flow setup, conv-II, uses PTREE for routing tree generation followed by using the buffer insertion method introduced in [Gi90]. Note that in Table 1, "total-area", "req-time" and "w-length" stand for the sum of the area of buffers, the required time at the input of the driver, and the total wire length, respectively.
Our next set of experiments (cf. Table 2) compares the performance of the conventional design flows against our proposed simultaneous algorithm on a number of benchmarks using a CASCADE standard cell library (0.5u HP CMOS process). Gate and wire delays are calculated using a 4-parameter delay equation [LSP97] and the Elmore delay model (similar to that in [El48]), respectively. Also, the fast FANROUT (cf. sub-section 4.2.) has been run with TSP orderings for the experiments reported in Table 2. These experiments showed that the runtime of the fast FANROUT is on the order of a few minutes, comparable to the runtimes of the conventional flows. Note that the area and delay reported in this table are total chip area and delay after detail routing.
These experiments were run in the SIS environment on an Ultra-2 Sun Sparc workstation (sahand.usc.edu) with 256MB memory.
CONCLUSIONS
This paper presents a novel algorithm, FANROUT, which performs simultaneous routing and fanout optimization. It is a dynamic programming based algorithm which properly uses LT-Tree and P-Tree construction algorithms in order to generate buffered routing structures with maximum signal required time. It computes load versus required time solution curves for every point on the Hanan grid and propagates them while grouping more sinks according to the given order. Merge and prune operations are defined on the solution curves to propagate them through the steps of the algorithm and to drop the low quality solutions so as to maintain the polynomial complexity. FANROUT is an optimal algorithm for the required time maximization problem for a given order on the sinks. It also inherits all the restrictions that the LT-Tree and P-Tree construction algorithms have. FANROUT is a polynomial algorithm as well. This new unification of design steps yields high quality circuits in terms of post-layout chip area and delay.
ACKNOWLEDGEMENTS
We would like to thank Dr. John Lillis of the University of Illinois at Chicago for providing the implementation of the PTREE algorithm and for helpful discussions about the complexity of PTREE.
8. REFERENCES
