This work is a contribution to high-level synthesis for low-power systems. As device feature sizes decrease, interconnect power becomes a dominating factor. It is therefore important that accurate physical information is used during high-level synthesis [1]. We propose a new power optimisation algorithm for RT-level netlists. The optimisation simultaneously performs slicing-tree structure-based floorplanning and functional unit binding and allocation. Since floorplanning, binding and allocation can each use the information generated by the other steps, the algorithm can greatly reduce the interconnect power. Compared to interconnect-unaware power-optimised circuits, interconnect power is reduced by 41.2 % on average, while overall power is reduced by 24.1 % on average. The functional unit power remains nearly unchanged. These optimisations are not achieved at the expense of area.
INTRODUCTION
Recently, several research approaches have been reported that take physical information into account. Most of the proposed algorithms use floorplanning information in high-level synthesis to estimate area and performance more accurately [2, 3]. Similarly, many techniques have been proposed that take power consumption into account in high-level synthesis [4, 5, 6, 7]. Only a few of these contributions also consider interconnect power [8, 9, 10]. For high-level interconnect length estimation the well-known Rent's rule is often used [11]. It states the relationship between the pin count ($IO$) and the block count ($BK$) of a chip: $IO = AS \cdot BK^{r}$. $AS$ represents the average size of the blocks within the chip, while $r$ is an empirically determined quantity called the Rent exponent. This model requires knowledge of empirical parameters that are computed from actual design instances. This limits its applicability, and therefore we do not use Rent's rule.
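To make the relationship concrete, the following minimal sketch evaluates Rent's rule in the form given above. The function name and the parameter values (an average block size of 4 and a Rent exponent of 0.5) are hypothetical and chosen only for illustration; they are not taken from any design discussed in this paper.

```python
def rent_pin_count(avg_block_size: float, block_count: int, rent_exponent: float) -> float:
    """Estimate the pin count IO = AS * BK**r (Rent's rule)."""
    return avg_block_size * block_count ** rent_exponent


# Hypothetical parameters: AS = 4.0, r = 0.5.
for blocks in (16, 64, 256):
    print(blocks, rent_pin_count(4.0, blocks, 0.5))
```

The sketch also shows why the rule is hard to apply early in the flow: both `avg_block_size` and `rent_exponent` have to be fitted from existing designs before any estimate can be made.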
This work evaluates an approach of simultaneous binding, allocation and floorplanning optimisation. Binding is the task of assigning compatible operations or variables to resources during high-level synthesis. Allocation is the choice of the number of resources. In the following, binding denotes the combination of binding and allocation. A low-power binding is an assignment in which the power dissipation of the resources is minimal. Binding has a great influence on power dissipation, since different bindings lead to different data streams at the inputs of the resources. Binding and allocation affect the area of the design, the netlist topology (being the basis of a floorplan) and the wire activity. In order to find a power-optimal solution, binding and floorplanning must be considered simultaneously.
A precondition for combining binding and floorplanning is a high estimation accuracy for the power consumption of RT-resources and interconnect. To determine the power consumption of resources, power models describing the power consumption and area of the individual resources at RT level [12] are needed. Interconnect power primarily depends on the length of the individual wires, the number of vias and the switching activity. We estimate the wire length by generating a slicing-tree floorplan. Since a floorplan only affects wires connecting different RT-resources, only the global interconnect is considered. Wires within a resource are encapsulated by the power models.
We use a low-power high-level optimisation tool called ORINOCO [13, 14] to obtain the RTL circuits and the power consumption of the datapath. ORINOCO is interconnect unaware. It is extended by our new interconnect power estimation methods detailed in section 3.
The paper is organized as follows: In section 2 we present a motivating example. In section 3 we discuss our RTL interconnect power estimation. The proposed optimisation methodology is described in section 4. An experimental evaluation is presented in section 5 and conclusions are drawn in section 6.
2. MOTIVATION
RTL INTERCONNECT POWER ESTIMATION
Given are a scheduled CDFG and a set of allocated modules with a specification of area and geometric information. The geometric information specifies the minimum and maximum aspect ratio of a module, representing its flexibility during the floorplanning stage (cf. 4.1). To capture the physical meaning of data transfers, the CDFG is transformed into an RT-netlist. The netlist is generated by an architecture extraction. Each functional unit is modeled as a 2-input 1-output combinational circuit and each register is modeled as a 1-input 1-output circuit. Every multiplexer is modeled as an n-input 1-output circuit with n greater than two (including the control input). The netlist represents a multiplexer-based point-to-point interconnection.
The dynamic power dissipation of a VLSI interconnect with a capacitive load can be written in terms of $C_i$ and $D_i$, the wire capacitance and the switching activity of wire $i$. The switching activity extraction and the wire capacitance estimation used in our approach are discussed next.
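As a hedged illustration of this relation, the sketch below uses the common textbook form P = 1/2 · Vdd² · f · Σ (capacitance × switching activity), summed over all wires. The supply voltage `vdd`, the clock frequency `clock_hz`, the factor 1/2 and all numeric values are assumptions made for illustration; they are not taken from the text above.

```python
def interconnect_power(capacitances, activities, vdd, clock_hz):
    """Dynamic interconnect power, assuming P = 0.5 * Vdd^2 * f * sum(C_i * D_i).

    capacitances: wire capacitances in farad, one entry per wire
    activities:   switching activity per wire (average toggles per clock cycle)
    """
    if len(capacitances) != len(activities):
        raise ValueError("need one activity value per wire")
    switched_capacitance = sum(c * d for c, d in zip(capacitances, activities))
    return 0.5 * vdd ** 2 * clock_hz * switched_capacitance


# Hypothetical example: two wires, 2.0 V supply, 100 MHz clock.
print(interconnect_power([1e-12, 2e-12], [0.5, 0.25], vdd=2.0, clock_hz=1e8))
```

In this form the only quantities influenced by binding and floorplanning are the per-wire capacitance and activity, which is why the following subsections concentrate on estimating those two.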
Switching activity extraction
The paradigm of ORINOCO is to estimate the activity that is necessary to perform the functionality written in the source description. This means that for every point in time where an operation is executed, the current values of its inputs are determined. These values and those gathered in the same way for the previous operation are then used to compute the activity. In real systems additional activity, called spurious transitions, occurs at the inputs of FUs. Not all occurrences of these transitions can be handled accurately at this high level of abstraction.
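The idea of deriving activity from consecutive input values can be sketched as a Hamming-distance count. This is only an illustrative model under the assumption that activity is measured as the fraction of toggling bits; the function names are hypothetical and do not reflect ORINOCO's actual interface.

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def switching_activity(values, width):
    """Average fraction of bits that toggle between consecutive values."""
    if len(values) < 2:
        return 0.0
    mask = (1 << width) - 1  # restrict the values to the port width
    toggles = sum(hamming(p & mask, q & mask) for p, q in zip(values, values[1:]))
    return toggles / (width * (len(values) - 1))


# Values seen at one 4-bit input port over three consecutive operations.
print(switching_activity([0b0000, 0b1111, 0b1110], width=4))
```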
Glitches
Random logic introduces spurious transitions. These transitions cannot be effectively predicted. For this reason we assume that no chaining is used. This is a sensible assumption for low-power design, as otherwise the glitches introduced by the first unit would boost the power of the second [17]. These glitches also contribute significantly to the power consumption of the output network. So far this effect is neglected.
Register timing
Within the described paradigm it is implicitly assumed that both new values of an FU are applied at the same time. This is often not the case, as registers in the fan-in of the FU are written in different cycles and values become visible immediately after writing. This effect is handled accurately. Due to variations in timing, this phenomenon would occur even if both registers were written in the same cycle. This situation, however, cannot be handled correctly.
Shared registers
Shared registers output a merged data stream of all values mapped to them. Additional switching is produced. This effect is also handled accurately.
Input multiplexing

Wire capacitance estimation
We derive the wire capacitance by using a capacitance model. This model is based on the wire length, the number of pins and the number of branch points. We use a linear regression technique to model the dependencies. Pins are the connecting points to RT-resources, e.g. a wire at the input of a multiplier is connected to about 6 gates, that is, 6 pins. The number of pins depends on the RT-resource type and can be extracted from the corresponding RT-model. The number of branch points and the wire length are extracted from a floorplan (cf. 4.1).
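A linear model of this kind can be sketched as follows. The sketch assumes a model of the form capacitance = w_len · length + w_pin · pins + w_branch · branches + bias and fits it by ordinary least squares. The sample data, the coefficient values and the function names are hypothetical; the actual regression setup and coefficients used in this work are not reproduced here.

```python
def fit_linear(features, targets):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    rows = [list(f) + [1.0] for f in features]  # append a constant term for the bias
    n = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
         + [sum(r[i] * y for r, y in zip(rows, targets))] for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("features are linearly dependent")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]


def wire_capacitance(weights, length, pins, branches):
    """Evaluate the fitted linear model for one wire."""
    w_len, w_pin, w_branch, bias = weights
    return w_len * length + w_pin * pins + w_branch * branches + bias


# Hypothetical training wires: (length, pins, branch points).
samples = [(100, 2, 0), (250, 3, 1), (400, 6, 2), (150, 4, 1), (600, 2, 0), (300, 5, 3)]
true_weights = [0.2, 1.5, 0.8, 0.5]  # made-up coefficients used to generate targets
targets = [wire_capacitance(true_weights, *s) for s in samples]

fitted = fit_linear(samples, targets)
print([round(w, 6) for w in fitted])
```

In practice the targets would come from extracted layout capacitances rather than from synthetic coefficients, and the fit would not be exact.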
In Fig. 2 the capacitance extracted from Cadence Silicon Ensemble® is plotted against the capacitance from our model. The unfilled dots are from a model based only on the wire length. It can be observed that, besides the wire length, the number of pins and branch points is a second major contributor to the overall wire capacitance. This impact on the overall capacitance is due to the additional vias needed for further branches and pins. For the 0.25 µm technology used, the enhanced model has an average standard deviation of 31.9 % and the length-based model has an average standard deviation of 36.8 %.
INTERCONNECT DRIVEN HIGH-LEVEL SYNTHESIS
In this section we propose our high-level synthesis flow, which performs simultaneously slicing-tree structure-based floorplanning and functional unit binding.
Simulated annealing (SA) based floorplanner
For interconnect length estimation an extension of the well-known SA-based floorplanner by Wong and Liu [15] is used. Simulated annealing is an iterative technique for solving high-dimensional optimisation problems. Such techniques switch from one solution (here: a floorplan) to another solution in a well-defined way by using 'moves'. This algorithm considers slicing floorplans. Fig. 3 illustrates how two of the floorplan moves, including $F_1$, affect the binary tree (left side) and shows the impact on the corresponding floorplan (right side). Each floorplan considered during the SA process is evaluated based on area $A$ and interconnect power $P$, using a cost function of the form $P + \lambda A$, where $\lambda$ controls the relative importance of $A$ and $P$. In [15] the interconnect length is estimated by calculating the Manhattan distance for two-pin connections and the minimum spanning tree (MST) for connections with more than two pins. This technique does not match real wiring because no branch points are considered. Instead, we use Steiner trees for routing data transfer wires. To treat the clock distribution network accurately, an H-tree (balanced tree) is generated.
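The difference between the two length estimates can be illustrated with a small sketch. It computes the MST length under the Manhattan metric and compares it with the half-perimeter of the bounding box, which equals the rectilinear Steiner minimal tree length for nets with at most three pins. The pin coordinates are hypothetical, and the sketch does not reproduce the Steiner tree construction used in our floorplanner.

```python
def manhattan(p, q):
    """Manhattan distance between two (x, y) points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def mst_wire_length(pins):
    """Length of a minimum spanning tree over the pins (Prim's algorithm).

    For a two-pin net this is simply the Manhattan distance.
    """
    if len(pins) < 2:
        return 0
    # Cheapest known connection from the growing tree to each remaining pin.
    dist = {i: manhattan(pins[0], pins[i]) for i in range(1, len(pins))}
    total = 0
    while dist:
        nearest = min(dist, key=dist.get)
        total += dist.pop(nearest)
        for i in dist:
            dist[i] = min(dist[i], manhattan(pins[nearest], pins[i]))
    return total


def half_perimeter(pins):
    """Half-perimeter of the bounding box; exact Steiner length for <= 3 pins."""
    xs = [p[0] for p in pins]
    ys = [p[1] for p in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


net = [(0, 0), (4, 0), (2, 3)]  # hypothetical three-pin net
print(mst_wire_length(net), half_perimeter(net))
```

For this net the spanning tree is longer than the tree with a branch point at (2, 0), which is the effect the text describes: ignoring branch points overestimates the wire length of multi-pin nets.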
Extended approach
For our approach we modified the cost function and the SA process. The new cost function is of the form $P_{FU} + P_{wire} + \lambda A$.
$P_{FU}$ is the power consumption of the functional units, multiplexers and registers, and $P_{wire}$ is the power consumption of the interconnect. $\lambda A$ is the area's contribution to the cost function. The annealing process is extended by three new binding moves $B_1$ to $B_3$.
Split $B_2$
Split is the reverse of share. A single resource is split into two resources. As in move $B_1$, multiplexers can vanish or appear. Splitting can be done without regard to the lifetimes of operations. Apart from potentially reducing the switched capacitance, this move opens up further opportunities for applying share moves.
In Fig. 5 the adder 5 is split into the adders 5 and 8. It is assumed that two new multiplexers, 9 and 10, had to be instantiated at the inputs of the new adder 8.
Swap B3
Swap interchanges the inputs of commutative operations. As in move $B_1$, multiplexers can vanish or appear. This move significantly affects the switching activity in the data path. The influence on the netlist is nearly negligible.
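Why swapping operands changes the switching activity can be seen in a small sketch. It counts the bit toggles at the two input ports of a shared commutative unit for a sequence of operations, before and after swapping the operands of one operation. The operand values and function names are hypothetical and serve only as an illustration.

```python
def toggles(a: int, b: int) -> int:
    """Number of bits that change when a port switches from value a to b."""
    return bin(a ^ b).count("1")


def port_toggles(operations):
    """Total toggles at both input ports for a sequence of (left, right) operands."""
    left = sum(toggles(p[0], q[0]) for p, q in zip(operations, operations[1:]))
    right = sum(toggles(p[1], q[1]) for p, q in zip(operations, operations[1:]))
    return left + right


def swap_operands(operations, index):
    """Return a copy in which the operands of one commutative operation are swapped."""
    swapped = list(operations)
    left, right = swapped[index]
    swapped[index] = (right, left)
    return swapped


# Three additions executed on one shared adder (hypothetical 8-bit operands).
ops = [(0x0F, 0xF0), (0xF0, 0x0F), (0x0F, 0xF0)]
print(port_toggles(ops), port_toggles(swap_operands(ops, 1)))
```

The result of each addition is unchanged by the swap, but the data streams seen at the two ports differ, which is what the cost function rewards.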
Balance point
New components are inserted at their balance point. The balance point is the point where the new resource would produce the lowest interconnect power. In Fig. 6 (a) this point is inside the left half of leaf 4. Therefore leaf 4 is replaced by a new vertical node with the new leaf 5' placed on the left side and leaf 4 on the right side.
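One simple way to approximate such a balance point is sketched below. It assumes that the interconnect power of the new resource is proportional to the activity-weighted Manhattan distance to the pins it connects to, in which case the optimum is the weighted median of the pin coordinates in each dimension. This model, the function names and the example data are assumptions for illustration; shared wire segments and the legality of the position inside the floorplan are ignored.

```python
def weighted_median(coordinates):
    """Coordinate x minimising sum(w_i * |x - x_i|) for (x_i, w_i) pairs, w_i > 0."""
    if not coordinates:
        raise ValueError("at least one connection is required")
    items = sorted(coordinates)
    half = sum(w for _, w in items) / 2.0
    running = 0.0
    for x, w in items:
        running += w
        if running >= half:
            return x


def weighted_wire_cost(point, connections):
    """Activity-weighted Manhattan wire length from point to all connected pins."""
    return sum(w * (abs(point[0] - p[0]) + abs(point[1] - p[1])) for p, w in connections)


def balance_point(connections):
    """connections: list of ((x, y), weight); weight models activity times width."""
    bx = weighted_median([(p[0], w) for p, w in connections])
    by = weighted_median([(p[1], w) for p, w in connections])
    return bx, by


# Hypothetical new resource connected to three pins; the second is the most active.
conns = [((0, 0), 1.0), ((10, 0), 3.0), ((4, 6), 1.0)]
bp = balance_point(conns)
print(bp, weighted_wire_cost(bp, conns))
```

Because the Manhattan metric separates into x and y, the two medians can be computed independently, which keeps the evaluation cheap enough to run after every binding move.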
Our floorplanner supports soft macros, which means that leaves are flexible in their aspect ratio. Therefore inserting or deleting a leaf does not destroy the floorplan. The unused area in Fig. 6(b) only arises because we limited this ratio. This avoids unrealistic floorplans.
Optimisation algorithm
The algorithm itself consists of two nested simulated annealing (SA) processes (Fig. 7). The inner loop uses floorplan moves ($F_1$ to $F_5$) to optimise the current floorplan for interconnect power. In general an annealing move is chosen randomly. If a move leads to a decreased power consumption, the move is accepted. If the power is not reduced, the move may still be accepted on a probabilistic basis: if a random number drawn from the interval (0, 1) is smaller than $e^{-\Delta Cost / T}$, where $\Delta Cost$ is the power difference and $T$ is the current temperature, the worse solution is accepted. This enables the SA to escape from local minima.
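The acceptance rule described above can be sketched in a few lines. The function name and the injectable random source `rng` are assumptions made so that the example is testable; the rule itself is the standard Metropolis criterion.

```python
import math
import random


def accept_move(delta_cost: float, temperature: float, rng=random.random) -> bool:
    """Metropolis acceptance: always accept improvements, sometimes accept worse moves.

    delta_cost is the new cost minus the old cost; temperature must be positive.
    """
    if delta_cost < 0:
        return True
    return rng() < math.exp(-delta_cost / temperature)


# At a high temperature a worse move is accepted often, at a low one almost never.
print(math.exp(-1.0 / 10.0), math.exp(-1.0 / 0.1))
```

As the temperature falls, the acceptance probability for a given cost increase shrinks, so the search gradually turns from exploration into a greedy descent.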
Annealing process acceleration
The algorithm keeps on searching until some stopping criteria are met. The stopping criteria are: (1) the last k iterations did not identify a better solution and (2) some parameter has reached a threshold limit. Unfortunately, for most practical applications the resulting runtime is excessive. To cope with this we integrated some effective speedups (Fig. 8).
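A minimal sketch of such a stopping test is given below. It assumes that the thresholded parameter is the annealing temperature and that either criterion alone ends the search; both points are assumptions, since the text does not fix them. The class and parameter names are hypothetical.

```python
class StopMonitor:
    """Tracks the best cost seen so far and decides when to stop annealing."""

    def __init__(self, k: int, temperature_limit: float):
        self.k = k                              # allowed iterations without improvement
        self.temperature_limit = temperature_limit
        self.best_cost = float("inf")
        self.stale = 0                          # iterations since the last improvement

    def should_stop(self, cost: float, temperature: float) -> bool:
        if cost < self.best_cost:
            self.best_cost = cost
            self.stale = 0
        else:
            self.stale += 1
        return self.stale >= self.k or temperature <= self.temperature_limit


monitor = StopMonitor(k=3, temperature_limit=0.1)
print([monitor.should_stop(c, 1.0) for c in (10, 9, 9.5, 9.2, 9.1)])
```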
Constructive heuristic
Our iterative algorithm starts with a constructive heuristic to generate a pre-optimised solution (Fig. 8 (1)). The heuristic optimises the binding and the floorplan separately. In the first step the architecture binding is optimised while neglecting the interconnect power; the inner loop (update floorplan) is excluded. Then the power-optimal floorplan for this architecture is determined by executing only the inner loop.
Floorplan driven binding
The probability of choosing a binding move decreases or increases depending on the following factors (Fig. 8 (2)):
1. Through a binding move multiplexers can vanish or appear. An increasing number of multiplexers decreases the probability of choosing this move.
2. Sharing resources that are physically close to each other increases the probability.
3. Sharing resources that exchange data increases the probability.
4. Each operation of a resource is assumed to be mapped exclusively to one resource. Then the balance point is determined for this operation. In this context the balance point is the position where its exclusive resource would produce the shortest wire length. A high deviation between the balance point of an operation and its original resource increases the probability of splitting off the operation.
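The four factors above can be turned into a biased move selection as sketched below. The scaling constants (halving per additional multiplexer, doubling for locality and for data exchange, linear growth with the balance point deviation) and all names are hypothetical; only the direction of each influence is taken from the list.

```python
import random


def move_weight(base, extra_muxes, physically_close, exchange_data, balance_deviation=0.0):
    """Selection weight of one candidate binding move (larger means more likely)."""
    weight = base
    weight /= 1.0 + max(extra_muxes, 0)   # factor 1: new multiplexers lower the weight
    if physically_close:
        weight *= 2.0                     # factor 2: physical locality raises it
    if exchange_data:
        weight *= 2.0                     # factor 3: data exchange raises it
    weight *= 1.0 + balance_deviation     # factor 4: far-off balance point favours a split
    return weight


def choose_move(candidates, rng):
    """Pick one move name from (name, weight) pairs, proportionally to its weight."""
    names = [name for name, _ in candidates]
    weights = [weight for _, weight in candidates]
    return rng.choices(names, weights=weights, k=1)[0]


candidates = [
    ("share add1/add2", move_weight(1.0, 1, True, False)),
    ("share mul1/mul2", move_weight(1.0, 0, True, True)),
    ("split add3", move_weight(1.0, 2, False, False, balance_deviation=0.5)),
]
print(choose_move(candidates, random.Random(0)))
```

Weighted selection keeps every move possible, so the annealing process can still reach any binding, while unpromising moves are simply tried less often.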
Avoiding unnecessary floorplan updates
After each binding move and before executing the inner loop, the effect of the move is evaluated (Fig. 8 (3)). Appearing or vanishing resources are inserted or deleted as described before. Before entering the inner loop, a power pre-estimation is performed. Because of the sub-optimal floorplan the interconnect power is overestimated. Nevertheless, the result is a good indicator of the impact of the binding move. If the power increases and the difference exceeds a threshold limit, the binding move is rejected. In this manner most unnecessary, time-consuming floorplan updates can be avoided.
[...] means that almost no deteriorating move is accepted (Fig. 8 (6)). In addition, the probability of choosing a move from the inner loop is decreased or increased depending on the move's acceptance rate (Fig. 8 (7)). This acceptance rate is pre-initialised through the constructive heuristic (see 4.4.1).
Binding optimisation and floorplan optimisation are executed consecutively. This interconnect-unaware optimisation is the traditional procedure in low-power high-level synthesis.
EXPERIMENTAL RESULTS
Simultaneous optimisation (SIO)
Binding and floorplanning are optimised simultaneously. To achieve comparable results, the total number of moves executed in each experiment is identical. The number of moves is determined depending on the benchmark size. The experiments were performed on a 1.0 GHz Athlon-based PC with 256 MB memory. The CPU times vary from 6 seconds for diffeq to 138 seconds for turbodecoder. Fig. 9 shows the percentage power consumption of IUO and SIO compared to FP. The power of the FP circuits is defined as 100 %. The bars are divided into data path power (lower part) and interconnect power (upper part). Please note that the total power of some interconnect-unaware optimised benchmarks increases by 100 % (e.g. wavelet), which means that for these benchmarks the traditional optimisation fails. In Table 1 the exact values of the experiments are listed together with the percentage of energy and area reduction. Since the scheduling and thus the timing is fixed for each benchmark, energy reduction and power reduction are equivalent. We therefore use the terms power and energy interchangeably in the following. Compared to the traditional procedure (IUO), our proposed technique (SIO) reduces the interconnect power for all benchmarks by an average of 41.2 %, while reducing overall power by 24.1 % on average.
The functional unit power increases noticeably only for interconnect-dominated designs (7.7 % on average). Compared to IUO the area is also reduced by an average of 14.2 %.
Interconnect unaware optimisation (IUO)
6. CONCLUSION
We showed that high-level synthesis has a significant impact on the interconnect power consumption. We proposed a new power optimisation algorithm which simultaneously performs floorplanning and functional unit binding. Experimental results demonstrate the benefit of incorporating interconnect in high-level synthesis for low power and the effectiveness of the proposed technique. Compared to interconnect-unaware power-optimised circuits, we have shown that interconnect power can be reduced by an average of 41.2 %, while reducing overall power by 24.1 % on average. In fact, the energy consumption might even increase if the traditional optimisation flow is used. Our technique is implemented on top of the optimisation tool ORINOCO. Since our technique is general, it can easily be incorporated into other high-level synthesis systems.
