Abstract-An architecture-synthesis technique for the low-power implementation of real-time applications is presented. The technique uses algorithm partitioning to preserve locality in the assignment of operations to hardware units. This results in reduced usage of long high-capacitance buses, fewer accesses to multiplexors and buffers, and more compact layouts. Experimental results show average reductions in bus and multiplexor power of 57.8% and 56.0%, respectively, resulting in an average reduction of 25.8% in total power. In addition, we analyze the effect of varying levels of partitioning on power consumption and present models for estimating bus capacitance.
I. INTRODUCTION
ARCHITECTURE synthesis, or high-level synthesis, is steadily making inroads into the digital design community. So far, most of the work has focused on techniques for area and speed optimization [1]. In recent years, there has been significant interest in low-power issues due to excessive heat dissipation in increasingly complex digital systems and the rising popularity of portable devices, where extending battery life is a primary design objective. Most of the work in design automation for low power has focused on the logic, circuit, and layout levels. Relatively little research has been devoted to high-level techniques, where the impact of design decisions is much greater [2]-[4].
This work presents a synthesis approach targeted at reducing the power consumed in the interconnection network. The interconnection network includes bus wires, multiplexors, and buffers. In this paper, we use "interconnect" or "interconnect elements" to refer to the interconnection network.
The importance of interconnect optimization at the architecture level is highlighted in Wu's comparison of a maximally time-shared implementation of a quadrature mirror filter (QMF) used for subband coding with a pipelined, fully parallel version [5]. Layouts of both implementations are shown in Fig. 1. For the same supply voltage, an improvement of a factor of 10.5 was obtained at the expense of a 20% increase in area. The breakdown of the power consumption of the two versions is shown in Fig. 2. Notice that the interconnect elements consume about 43% of the total power in the time-shared case, demonstrating the importance of reducing their power. Further, these elements contribute the most to the power reduction achieved in the parallel version. Power improvement factors of 16.9, 15.1, and 12.5 (Table I) were obtained for the buses, multiplexors, and buffers, respectively. As a result, the interconnect power was reduced to 28% of the total power [Fig. 2(b)]. Both of the above facts indicate a high potential for power reduction by targeting the interconnect.
While the fully parallel implementation in this example resulted in large power gains with low area overhead, this may not always be the case. Parallel implementations may be too area intensive and may not necessarily result in reduced interconnect power. If the area overhead is too high, the increase in the average bus lengths may offset the power gain resulting from the simplified interconnect network. In this work, techniques are presented to achieve low-power designs by reducing the interconnect power without incurring the full area overhead of maximally parallel designs. The approach aims to capture some of the optimizations of the above example in an automated way while maintaining a balance between the maximally time-shared and the fully parallel implementations.
Manuscript received August 2, 1996; revised October 21, 1996. The work of R. Mehra is supported by the ARPA grant J-FBI 93-153 and that of L. M. Guerra is supported by scholarships from AT&T and the Office of Naval Research.
The authors are with the Electrical Engineering and Computer Science Department, University of California, Berkeley, CA 94720 USA.
Publisher Item Identifier S 0018-9200(97)01147-5.
II. PRELIMINARIES
Before introducing our low-power synthesis technique, we first present a brief overview of the synthesis tasks and the different architecture-level techniques for low-power design.
A. Architecture Synthesis
Architecture synthesis is concerned with deriving an architectural implementation of a given algorithm. The input is a behavioral description of the algorithm, and the synthesis process involves deciding how the operations in the algorithm will be mapped onto a set of hardware resources. A good tutorial on the main high-level synthesis tasks is presented in [1], and a number of CAD systems for high-level synthesis are described in [6].
The main tasks in the architecture-synthesis process include module selection, allocation, assignment, and scheduling. Though these terms are defined slightly differently in different systems, the basic functions performed are the same. Module selection involves selecting specific hardware modules that implement the operations specified by the algorithm. Allocation refers to the task of deciding how many instances of each hardware resource are needed. Assignment binds each of the operations to specific hardware instances, and scheduling decides when each operation will be executed. Both allocation and assignment are performed for each of the different resource types (functional units, registers, and buses) in the system. The output of the synthesis process is an architecture netlist (register-transfer level) in a language such as VHDL.
B. Architecture-Level Power Reduction Techniques
The sources of power consumption on a chip are dynamic power, short-circuit power, and leakage power. At the algorithm and architecture levels, only dynamic power is targeted for optimization. This is because short-circuit and leakage currents are influenced mainly by the circuit design style used. Further, these components can be reduced to less than 15% of the total chip power by smart circuit design techniques [7]. At the algorithm and architecture levels, therefore, the power dissipated can be described by the following equation:
$$P = \alpha \cdot C \cdot V_{swing} \cdot V_{dd} \cdot f$$
where $C$ is the physical capacitance, $\alpha$ the corresponding switching activity, $V_{swing}$ the voltage swing, $V_{dd}$ the supply voltage, and $f$ is the sampling frequency of operation. The activity, $\alpha$, and the capacitance, $C$, are often lumped together to give the effective capacitance switched per sample:
$$C_{eff} = \alpha \cdot C$$
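The dependence of dynamic power on these quantities can be illustrated numerically. The following is a minimal sketch, assuming the standard product form of activity, capacitance, voltage swing, supply voltage, and sampling frequency; the function names and all numeric values are hypothetical and are not taken from any design in this paper.

```python
def effective_capacitance(activity: float, capacitance_f: float) -> float:
    """Effective capacitance switched per sample: activity times physical capacitance."""
    return activity * capacitance_f


def dynamic_power(activity: float, capacitance_f: float,
                  v_swing: float, v_dd: float, f_sample_hz: float) -> float:
    """Dynamic power in watts, assuming the product form described in the text."""
    return effective_capacitance(activity, capacitance_f) * v_swing * v_dd * f_sample_hz


# Hypothetical values: 2 pF node, activity 0.5, full-swing 3.3 V, 1 MHz sample rate.
p_full = dynamic_power(0.5, 2e-12, 3.3, 3.3, 1e6)
# Halving the physical capacitance at a fixed voltage and sample rate halves the power.
p_half_cap = dynamic_power(0.5, 1e-12, 3.3, 3.3, 1e6)
```

Because the sample frequency is fixed for real-time applications, only the voltage terms and the effective capacitance remain as optimization handles in this sketch.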
For each resource type, namely functional units, memory (including register files), interconnect, and control, the effective capacitance depends on three factors: the resource's physical capacitance, the number of times it is accessed, and the correlation of the data that it operates on (the latter two determine the activity factor). Power reduction techniques include approaches to enable voltage scaling and techniques to reduce the effective capacitance. For the real-time applications targeted in this work, the sample frequency is a specified constant, and therefore, changing the throughput of the application is not an option. Design techniques for reducing effective capacitance can be classified along the following lines.
• Preservation of data correlations: Switching activity is dependent on correlations between successive data inputs, and increasing correlations may result in power savings.
• Distributed computing/Locality of reference: Accessing global computing resources (control, datapath, memory, I/O, and interconnect) is expensive: the time-sharing nature of these resources requires a high switching rate, and the shared nature of such a resource typically incurs a capacitive overhead. Distributing the accesses over many resources relieves both the switching requirements and the overhead.
• Application-specific processing: Specialized units consume less power than general-purpose ones due to simpler structure and reduced control required to support programmability.
• Demand-driven operation: To avoid wasteful transitions, it is important to perform operations only when needed. Power down of memory and functional units when they are not in use is the most popular technique in this category.
While low-power techniques following these themes are being increasingly used in manual designs [8]-[11], many have yet to be encapsulated into automated techniques. Previously proposed automated synthesis techniques include optimizations that enable voltage scaling [2] and those that preserve data correlations [12], [13]. In this paper, we explore the impact of using partitioning to exploit spatial locality for reduction of the interconnect power and provide synthesis techniques for this purpose. Until now, power optimization of buses, buffers, and multiplexors has not been addressed. Their optimization is important because, as demonstrated by the QMF example in Section I, the interconnect power may be a substantial percentage of the total power and can be affected significantly by architecture-level optimizations.
III. PARTITIONING FOR LOW POWER
In this section, we present a partitioning-based approach for reducing interconnect power. Section III-A explains the main idea behind our approach, Section III-B presents some of the tradeoffs involved, and Section III-C describes our partitioning methodology.
A. The Impact of Exploiting Locality
The idea of using distributed or localized computing for low power has been used previously (e.g., memory and control partitioning). The main idea behind our approach is to apply this concept to interconnect power reduction by automatically synthesizing designs with localized communications. We achieve this by dividing the algorithm into spatially local clusters and performing a spatially local assignment. A spatially local cluster is a group of algorithm operations that are close to each other in the flowgraph representation [1]. A spatially local assignment is a mapping of the algorithm operations to specific hardware units such that no operations in different clusters share the same hardware.
Partitioning the algorithm into spatially local clusters ensures that the majority of the data transfers take place within clusters and relatively few occur between clusters. The spatially local assignment restricts intracluster data transfers to buses that are local to a subset of the hardware (local buses); thus only intercluster data transfers use buses that are shared by all resources (global buses). In general, since intracluster buses are localized to a part of the chip, they are shorter than the buses in designs without spatially local assignment, while the global buses in the partitioned and nonpartitioned designs may be comparable in length. The combined result is that the shorter local buses are used more frequently than the longer, highly capacitive global buses. Further, buffer power is reduced since smaller buffers are required to drive shorter wires. The reduced hardware sharing also results in additional power savings due to fewer accesses to multiplexors. The partitioning information is passed to the architecture-netlist generation and floorplanning tools, which place the hardware units of each spatially local cluster close together in the final layout.
Consider a fourth-order parallel-form infinite impulse response (IIR) filter. Local and nonlocal assignments of operations to hardware units are shown in Fig. 3(a) and (b), respectively (the hardware units are adders and multipliers). In Fig. 3(a), the filter is partitioned into two spatially local clusters, and the operations of each cluster are mapped to mutually exclusive hardware units (one set of units is used for operations in cluster I and a separate set for those in cluster II). As a result, all communications within cluster I take place only between the hardware units of cluster I, and those within cluster II take place only between the units of cluster II. There are only two data transfers between the clusters, and these are global to the entire chip. In Fig. 3(b), on the other hand, operations are assigned to hardware units without regard to their closeness. In this case, the communications are not localized to a subset of hardware units and take place on global buses between all five units.
Notice that the local version needs four adders and two multipliers, whereas the nonlocal assignment requires just three adders and two multipliers. This increase in the number of functional units does not necessarily translate into a corresponding increase in the overall area since localization of buses makes the design more conducive to compact layout.
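The distinction between local and global data transfers, and the condition that defines a spatially local assignment, can be made concrete with a small sketch. The operation names, unit names, and flowgraph edges below are hypothetical and only loosely modeled on a two-cluster filter; they are not the flowgraph of Fig. 3.

```python
from typing import Dict, List, Tuple


def classify_transfers(edges: List[Tuple[str, str]],
                       cluster_of: Dict[str, str]) -> Tuple[int, int]:
    """Return (local, global) counts of data transfers for a given clustering."""
    local = sum(1 for src, dst in edges if cluster_of[src] == cluster_of[dst])
    return local, len(edges) - local


def is_spatially_local(unit_of: Dict[str, str], cluster_of: Dict[str, str]) -> bool:
    """True if no hardware unit is shared by operations from different clusters."""
    owner: Dict[str, str] = {}
    for op, unit in unit_of.items():
        if owner.setdefault(unit, cluster_of[op]) != cluster_of[op]:
            return False
    return True


# Hypothetical flowgraph: three operations per cluster, two intercluster edges.
cluster_of = {"m1": "I", "a1": "I", "a2": "I", "m2": "II", "a3": "II", "a4": "II"}
edges = [("m1", "a1"), ("a1", "a2"), ("m2", "a3"), ("a3", "a4"),
         ("a2", "a4"), ("a1", "a3")]
local_count, global_count = classify_transfers(edges, cluster_of)

# Each operation on its own cluster-private unit versus one adder shared across clusters.
local_assignment = {"m1": "M1", "a1": "A1", "a2": "A2",
                    "m2": "M2", "a3": "A3", "a4": "A4"}
shared_assignment = dict(local_assignment, a3="A1")
```

In this sketch the shared assignment saves one adder but fails the locality check, which mirrors the area-versus-locality tradeoff noted above.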
B. Effect of Varying Levels of Partitioning
In the previous section, the parallel-form IIR filter was partitioned into two clusters. In general, a design may be partitioned into any number of clusters. In this section, we study the effect of varying the number of clusters on three designs: a direct-form IIR filter, an eight-point discrete cosine transform (DCT), and a fifth-order wave digital filter. For each, the power dissipation of the individual components of the interconnection network, the combined bus and clock, and the total chip are shown for different numbers of clusters (Figs. 4-6). The nonpartitioned case is considered as a single-cluster implementation.
With an increasing number of clusters, several trends are seen. As expected, the local bus power decreases, and the global bus power increases. This is due to the lengths of the intracluster buses decreasing as clusters get smaller and the lengths of global buses increasing as chip area grows. Furthermore, the number of accesses to local buses decreases while the number of accesses to global buses increases. Another distinct trend is seen in the multiplexor power, which drops drastically due to increasingly restricted hardware sharing. The clock power remains constant or increases due to an increased number of units and, therefore, longer clock wiring. Although not shown, the register power decreased slightly while the power dissipation in the functional units and buffers remained the same. The reduction in register power can be explained as follows. With more clusters, the number of functional units is increased, and the number of variables to be stored in the register files associated with each unit is reduced. While the total number of register accesses remains the same (determined by the number of reads and writes required), the reduced register file sizes result in lower capacitance switched per access and thus reduced register power.
The total power drops drastically as we go from the nonpartitioned (one cluster) to the partitioned designs. After a certain number of clusters, however, the total power starts to increase. Each example has an "optimum" number of clusters: seven, four, and four for the direct-form IIR filter, the DCT, and the wave digital filter, respectively. Notice that, in Figs. 4-6, the combined bus and clock power tracks the total power dissipation, showing an optimum at the same level of partitioning. In our partitioning methodology, we use an estimate of this value to decide the optimal number of clusters. Our partitioning methodology is explained in the next section.
C. Partitioning Methodology
The core of our approach of exploiting locality for low-power synthesis is partitioning. While a detailed explanation of our partitioning methodology can be found in [14], here we present an overview of the main ideas.
Previous work in partitioning for high-level synthesis has targeted area minimization, with a significant portion of the gains resulting from reduction of the number of buses [15], [16]. For power minimization, however, the goal is to minimize the number of accesses to long global buses as opposed to the number of buses. For example, an implementation with two global buses may dissipate less bus power than another implementation with only one global bus if it has fewer accesses to the global buses.
Our partitioning methodology consists of two phases: the first phase generates several candidate partitions, and the second phase evaluates them and selects the best one.
The generation of candidate partitions is based on a spectral partitioning technique [17], [18]. The technique was introduced by Hall [17], who proved that the eigenvector of the second smallest eigenvalue of the Laplacian of a graph gives a one-dimensional (1-D) placement of graph nodes such that the sum of squares of edge lengths is minimized. The distances between the nodes in this placement quantify the relative closeness between them. Large gaps between consecutive nodes in the ordering can be used to delimit the clusters. For example, Fig. 7 shows an eighth-order cascade filter and the corresponding eigenvector placement. The spacing between nodes in the placement clearly indicates four distinct clusters that are also evident from the filter topology. The threshold used for detecting these gaps is defined in terms of the mean of all the distances between consecutive nodes and their standard deviation. In the cascade example of Fig. 7, this threshold delimits the expected four clusters.
As discussed in Section III-B, varying the number of clusters trades off between various power-consuming components. In our scheme, several different candidate partitions are generated by varying the targeted number of clusters. For example, in the cascade filter of Fig. 7, in addition to the four-cluster partition, a two-cluster partition may also be proposed. A rough estimate of the total bus and clock power is used to evaluate and compare the candidate partitions. The metric used as a measure of the global bus power is the number of global data transfers times the estimated global bus length. Similarly, the local bus power of each cluster is estimated as the number of local data transfers times the cluster's estimated bus length. An estimate of the clock power is the number of clock accesses (number of control steps per sample period) times the estimated length of the clock wiring.
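The eigenvector placement and gap detection can be sketched in a few lines. This is an illustrative sketch only: the second eigenvector is approximated by shifted power iteration rather than by whatever solver the original tool uses, the threshold is assumed to be the mean gap plus one standard deviation (the exact expression is not recoverable from the text), and the test graph of two triangles joined by one edge is hypothetical. The graph is assumed to be connected.

```python
import math
import statistics
from typing import List, Tuple


def fiedler_vector(n: int, edges: List[Tuple[int, int]], iterations: int = 2000) -> List[float]:
    """Approximate the eigenvector of the second smallest Laplacian eigenvalue."""
    adj: List[List[int]] = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    deg = [len(neighbors) for neighbors in adj]
    shift = 2 * max(deg)  # upper bound on the largest Laplacian eigenvalue
    x = [float(i) for i in range(n)]  # deterministic start vector
    for _ in range(iterations):
        mean = sum(x) / n
        x = [xi - mean for xi in x]  # remove the constant (eigenvalue 0) component
        # y = (shift * I - L) x, with (L x)_i = deg_i * x_i - sum of neighbor values
        y = [shift * x[i] - (deg[i] * x[i] - sum(x[j] for j in adj[i])) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    return x


def spectral_clusters(n: int, edges: List[Tuple[int, int]]) -> List[List[int]]:
    """Cut the 1-D placement wherever a gap exceeds mean + one standard deviation."""
    x = fiedler_vector(n, edges)
    order = sorted(range(n), key=lambda i: x[i])
    gaps = [x[order[k + 1]] - x[order[k]] for k in range(n - 1)]
    threshold = statistics.mean(gaps) + statistics.pstdev(gaps)  # ASSUMED threshold form
    clusters = [[order[0]]]
    for k, gap in enumerate(gaps):
        if gap > threshold:
            clusters.append([])
        clusters[-1].append(order[k + 1])
    return clusters


# Hypothetical flowgraph: two tightly connected triangles joined by a single edge.
demo_edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
demo_clusters = spectral_clusters(6, demo_edges)
```

A production implementation would more likely call a symmetric eigenvalue solver; power iteration is used here only to keep the sketch free of external dependencies.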
Since the length of the wiring has been shown to be proportional to the square root of the area, an estimate of the total chip area is used as a measure of the global bus and clock lengths. Similarly, a prediction of the area of each cluster is used to estimate the length of local buses in it. These areas are estimated from the maximum height of the weighted-concurrency distribution graph [19]. The distribution graph gives the amount of concurrent hardware needed by the computation in each time slot.
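As a rough illustration of this area estimate, the sketch below builds a concurrency profile from a fixed schedule and takes its peak. This simplifies the distribution graph of [19], which is not described in detail here: each operation is assumed to occupy known time slots and to carry a weight equal to the area of the unit type it needs. All names and numbers are hypothetical.

```python
import math
from typing import List, Tuple

# Each operation: (start slot, duration in slots, area weight of its unit type).
Operation = Tuple[int, int, float]


def concurrency_profile(ops: List[Operation], num_slots: int) -> List[float]:
    """Weighted amount of hardware needed in each time slot."""
    profile = [0.0] * num_slots
    for start, duration, weight in ops:
        for slot in range(start, start + duration):
            profile[slot] += weight
    return profile


def estimated_area(ops: List[Operation], num_slots: int) -> float:
    """Area estimate: maximum height of the weighted concurrency profile."""
    return max(concurrency_profile(ops, num_slots))


def estimated_wire_length(area: float) -> float:
    """Wire length measure, taken as the square root of the area."""
    return math.sqrt(area)


# Hypothetical schedule over four time slots.
demo_ops = [(0, 2, 1.0), (1, 2, 2.0), (3, 1, 1.0)]
demo_profile = concurrency_profile(demo_ops, 4)
demo_area = estimated_area(demo_ops, 4)
```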
Using the estimates explained above, the different partitioning solutions generated in phase 1 are evaluated in phase 2 to decide on the number of clusters. In a final step, the single most promising candidate partition is applied to the algorithm. The partitioning information is then passed to the synthesis tools which implement each spatially local cluster on a different set of hardware units.
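The phase-2 evaluation can be sketched as a simple cost function over candidate partitions. The sketch assumes that the three transfers-times-length terms are added without weighting, which the text does not state, and every number below is hypothetical.

```python
import math
from typing import Dict, List, Tuple


def partition_cost(global_transfers: int, total_area: float,
                   clusters: List[Tuple[int, float]], control_steps: int) -> float:
    """Rough bus-plus-clock metric for one candidate partition.

    clusters holds (local_transfers, cluster_area) pairs. Lengths are taken as
    square roots of the corresponding area estimates.
    """
    global_length = math.sqrt(total_area)
    global_bus = global_transfers * global_length
    local_bus = sum(transfers * math.sqrt(area) for transfers, area in clusters)
    clock = control_steps * global_length
    return global_bus + local_bus + clock  # ASSUMPTION: unweighted sum of the terms


def best_partition(candidates: Dict[str, dict]) -> str:
    """Name of the candidate with the lowest estimated cost."""
    return min(candidates, key=lambda name: partition_cost(**candidates[name]))


# Hypothetical candidates: a nonpartitioned design and a four-cluster design.
candidates = {
    "one_cluster": dict(global_transfers=100, total_area=16.0,
                        clusters=[], control_steps=20),
    "four_clusters": dict(global_transfers=4, total_area=25.0,
                          clusters=[(24, 4.0)] * 4, control_steps=20),
}
chosen = best_partition(candidates)
```

In this example the four-cluster candidate wins despite its larger total area and longer clock wiring, because most transfers move to short local buses.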
IV. LOW-POWER SYNTHESIS SYSTEM
This section explains our locality-based synthesis methodology. Section IV-A gives an overview of the synthesis flow and Section IV-B explains the techniques for architectural power analysis.
A. Design Flow from Algorithm to Layout
This section describes how the partitioning information is used in the process of mapping a given algorithm to layout, through high-level synthesis and silicon compilation. Fig. 8 shows the overall design flow. The high-level synthesis process takes a behavioral description (e.g., in a like language) and a set of performance constraints and generates an architectural level netlist. The Lager silicon compiler [20] is then used to generate the final layout from the architectural netlist.
Our techniques have been integrated into the Hyper-LP high-level synthesis system. While the basic synthesis flow of the Hyper-LP [2] system is the same as that of the Hyper system [21] , the new partitioning step is added preceding the other synthesis steps and the assignment algorithms are modified to exploit spatial locality. The functional-unit assignment scheme of the Hyper system is constrained such that operations in each spatially local cluster are assigned to a separate set of hardware units. The clustering information is also used during bus assignment and merging to ensure that local buses are used only for data transfers within a cluster, and global buses are used only by intercluster data transfers. This ensures that intracluster data transfers occur on short local buses and only intercluster ones use the long highly capacitive global buses.
Once the architecture netlist is generated, silicon compilation performs a number of tasks such as tiling, placement, and routing to generate the final layout. The Lager silicon compiler is used for this purpose. The partitioning information is passed to the floorplanning tools which place hardware units of a given partition close together in the final layout. As much as possible, all units in the same cluster are placed in the same datapath. The output from Lager is a physical layout of the processor core.
B. Bus and Clock Capacitance Models
In order to validate the effect of our partitioning methodology, it is necessary to compare the power of partitioned and nonpartitioned versions of several designs. We use SPA, an architecture-level power estimation tool [22], for our estimations. Since our technique targets interconnect power reduction, good estimates of the interconnect components are important to derive meaningful results. We therefore spent considerable time generating layouts for several designs to validate and modify the bus-power estimation models in SPA for both the Hyper and the Hyper-LP designs. While the basic dependencies of the SPA models were maintained in the partitioned designs, the scalar factors needed to be adjusted. The models were also extended to estimate clock power. In this section, we present the models used for estimation of bus and clock power. These are heavily dependent on the architecture model and the floorplanning and routing strategy, which we briefly describe first.
The architecture model used in the Hyper and Hyper-LP systems is shown in Fig. 9. Each functional unit has a register file and, if required, a multiplexor at each input and a buffer at each output. The functional unit, along with the associated registers, multiplexors, and buffers, is called a functional unit set. Functional unit sets communicate with other sets via a dedicated network that functionally resembles a full crossbar network. In the final layout, two or more sets may be grouped into the same datapath. Each datapath has its own local controller, synchronized by a global finite state machine. Within the datapath, units are stacked in a bit-slice fashion and over-the-cell wiring is used for communication between them. Fig. 10(a) shows a typical floorplan and the interdatapath routing strategy. Fig. 10(b) shows the routing of local buses within a datapath.
1) Bus Models:
The bus power consumption is proportional to the capacitance switched per access times the number of accesses. The capacitance switched per access is composed of two parts: that due to the capacitance of the wire itself and that due to the capacitive load on the wire. The capacitance of the wire directly depends on the wire length, which is not determined until after placement and routing and is therefore estimated using an empirical model. The loading on the buses is modeled by adding a fixed load for each fanout (50 fF for our technology). Accesses to each bus are calculated using functional simulation.
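A minimal sketch of this bus model follows. The 50 fF load per fanout is taken from the text; the wire capacitance per millimeter, the fanout counts, and the helper names are assumptions made purely for illustration, and the access counts and lengths are example inputs.

```python
from typing import List, Tuple

FANOUT_LOAD_F = 50e-15      # fixed load per fanout, from the text
WIRE_CAP_PER_MM_F = 200e-15  # ASSUMPTION: hypothetical wire capacitance per mm


def bus_capacitance_per_access(length_mm: float, fanout: int,
                               wire_cap_per_mm_f: float = WIRE_CAP_PER_MM_F) -> float:
    """Capacitance switched per access: wire capacitance plus fanout loading."""
    return length_mm * wire_cap_per_mm_f + fanout * FANOUT_LOAD_F


def total_switched_capacitance(buses: List[Tuple[int, float, int]]) -> float:
    """Sum of accesses times capacitance per access over (accesses, length_mm, fanout)."""
    return sum(accesses * bus_capacitance_per_access(length, fanout)
               for accesses, length, fanout in buses)


# Hypothetical design: a short local bus with fanout 2 and a long global bus with fanout 5.
local_cap = bus_capacitance_per_access(0.27, 2)
global_cap = bus_capacitance_per_access(1.48, 5)
switched = total_switched_capacitance([(95, 0.27, 2), (3, 1.48, 5)])
```

Bus power is proportional to the switched total, so moving accesses from the global entry to the local entry lowers the estimate even when the number of buses grows.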
For the purposes of modeling length, buses are divided into two main categories: interdatapath and intradatapath. In nonpartitioned (traditional Hyper) designs, SPA must assume that each functional unit set forms its own datapath since these sets are not merged into datapaths until the floorplanning phase. In partitioned (Hyper-LP) designs, however, the merging is dictated by the partitioning. SPA uses the partitioning information to estimate which functional unit sets are combined into the same datapath.
SPA estimates intradatapath connection lengths using a linear model. The average length of over-the-cell connections is estimated to be 0.3 times the cumulative height of the units in the datapath. Fig. 11 compares the average measured length of the wires in the datapaths with those predicted by the linear model. Points on the dotted line indicate exact estimates. Note that the model provides good estimates.
The average interdatapath bus length is proportional to the square root of the chip area [23], with a proportionality constant determined empirically for our design environment. The chip area is derived from an empirical model presented in [4]. The model is based on the active area (calculated by summing up the areas of all the hardware units) and the total number of wires (the number of buses between datapaths times the wordlength).
The last two terms in the model represent the active area and the wiring area, respectively. Notice that the wiring area depends on the total number of wires and also on the active area. The coefficients of the model are derived statistically. The interdatapath bus-length model was validated for partitioned and nonpartitioned versions of three designs (Table II). Though the model was accurate to within 20% for nonpartitioned designs, it overestimated the bus lengths for the partitioned designs. This appears to be due to the fact that partitioned designs have fewer blocks with large active area, while the Hyper implementations have more blocks with lower active area. Note that the overestimation of bus lengths, and therefore the capacitance, in Hyper-LP designs leads to conservative estimates of the power relative to Hyper designs.
Once the lengths of both intercluster and intracluster buses are determined, the wiring capacitance is calculated using the average capacitance per unit area and fringe capacitance per unit length of metal1 and metal2 layers. The wiring and load capacitances are added to obtain the bus capacitance switched per access.
2) Clock Models: The bus models presented in the previous section were modified to calculate the clock power. The length of the clock wire consists of two parts, as shown in Fig. 10: the length of the routing to the border of the various datapaths and the length of the clock routing inside the datapaths. On a set of example designs, the length of the clock routing to the datapaths was found to vary between one and three times the square root of the chip area. Within each datapath, the clock wire length was approximately the height of the datapath. Based on these observations, the total length of the clock wiring is estimated as a multiple of the square root of the chip area plus the heights of the datapaths. To estimate the loading on the clock, we assume that the clock is distributed to all the registers and buffers on the chip, and that each offers a 25 fF load to the clock (as obtained from our cell library).
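The clock model can be sketched from the two observations above. The routing factor of 2 is an assumed midpoint of the observed range of one to three, not a value stated in the text; the 25 fF load per register or buffer is from the text. Function names and example numbers are hypothetical.

```python
import math
from typing import List

CLOCK_LOAD_PER_CELL_F = 25e-15  # load per register or buffer, from the text


def clock_wire_length(chip_area: float, datapath_heights: List[float],
                      routing_factor: float = 2.0) -> float:
    """Clock wire length: routing to the datapaths plus routing inside them.

    routing_factor is an ASSUMED value within the observed range of 1 to 3.
    """
    if not 1.0 <= routing_factor <= 3.0:
        raise ValueError("routing_factor outside the range observed in the text")
    return routing_factor * math.sqrt(chip_area) + sum(datapath_heights)


def clock_load(num_registers: int, num_buffers: int) -> float:
    """Total capacitive load on the clock from registers and buffers."""
    return (num_registers + num_buffers) * CLOCK_LOAD_PER_CELL_F


# Hypothetical chip: 4 mm^2 with two datapaths of 0.5 mm height each.
demo_length_mm = clock_wire_length(4.0, [0.5, 0.5])
demo_load_f = clock_load(30, 10)
```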
V. RESULTS
In this section, we present the results of our partitioning-based synthesis scheme. Implementations generated using the Hyper and Hyper-LP systems are compared using estimates obtained from SPA. Estimates of the total chip area and bus lengths are obtained using models presented in Section IV-B.
A. Cascade Filter
The first result compares the Hyper-LP and Hyper implementations of the eighth-order cascade filter (Fig. 7). Given a throughput constraint of 21 clock cycles, the Hyper implementation uses four adders and three shifters, while the Hyper-LP implementation uses one adder and one shifter for each cluster, resulting in a total of eight units. Layouts of the two implementations are shown in Fig. 12. In the Hyper implementation, two of the seven functional units are merged by the floorplanning tool. In the Hyper-LP implementation, the four datapaths pictured correspond to the four clusters. Table III compares the power dissipated in the two implementations. An overall reduction of 35% in the power consumption was realized by the Hyper-LP approach. As opposed to 106 accesses to global buses in the Hyper implementation, the Hyper-LP version has 95 accesses to local buses, which are short (0.27 mm), and only three accesses to long global buses (1.48 mm). As a result, the bus power was reduced nearly three-fold, from 2.9 mW to only 1.0 mW. The multiplexor power was also reduced three-fold, as the reduced time-sharing of units resulted in lower usage of multiplexors. Note that the contribution of the interconnect to the total power dissipation was reduced from 44% to 27%.
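The access counts and bus lengths quoted above can be combined into the transfers-times-length metric of Section III-C as a quick check. The Hyper global bus length is not given in the text, so the sketch assumes it equals the 1.48 mm Hyper-LP global bus length; this is an illustrative assumption, and the metric ignores load capacitance.

```python
# Hyper-LP: 95 local accesses on 0.27 mm buses, 3 global accesses on 1.48 mm buses.
hyper_lp_metric = 95 * 0.27 + 3 * 1.48  # about 30.1 access-mm

# Hyper: 106 global accesses. ASSUMPTION: same 1.48 mm global bus length.
hyper_metric = 106 * 1.48  # about 156.9 access-mm

# Ratio of the wire-length metrics under the assumption above (about 5.2).
metric_ratio = hyper_metric / hyper_lp_metric

# Reported bus power: 2.9 mW down to 1.0 mW, a reduction of about 65.5%.
bus_power_reduction = (2.9 - 1.0) / 2.9
```

The wire-length metric falls by a larger factor than the reported bus power, which is consistent with the bus capacitance also containing a length-independent fanout load.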
B. Other Examples
This section summarizes our experimental results for the cascade and several other digital signal processing (DSP) examples. Table IV shows the number of accesses to buses and multiplexors and the estimated bus lengths for both implementations of each example. The accesses to global buses are reduced drastically for all examples with very little change in the lengths of these buses. Exploiting spatial locality moves a large percentage of the bus accesses from the long global buses to intracluster buses whose lengths are 50% to 75% shorter than those of the global buses. In general, due to reduced hardware sharing, there is a decrease in the multiplexor accesses for all examples.
Table V shows the bus, multiplexor, and overall power dissipation for both implementations of each example. The Hyper-LP implementations uniformly dissipate less power than the Hyper implementations. The corresponding percentage improvements are summarized in Fig. 13. Power consumed by buses is reduced drastically in all examples (up to 80%), and large reductions are also seen in the multiplexor power (more than 70% reduction in three of the examples). The average reductions in bus, multiplexor, and total power are 57.8%, 56.0%, and 25.8%, respectively.
In partitioned implementations, we expect buffer power to decrease since smaller buffers can be used to drive the data transfers occurring on short local buses. However, our architecture-netlist generation tool currently uses minimum-sized buffers for all data transfers regardless of bus length, and therefore our results show negligible change in buffer power. With the necessary modifications, buffer power should contribute toward further reduction in total power.
The power reduction comes at the cost of an increase in the number of functional units. However, overhead elements such as multiplexors are reduced, and, since a large percentage of the communications are localized, the designs are more conducive to compact layout. As a result, some of the examples have lower area in the Hyper-LP implementation. All examples in this section have been optimized for power with no limitations on area. By varying the number of clusters, different design points with lower area penalty can be obtained at the cost of less reduction in power. For example, the parallel filter implementation with two clusters has lower power reductions (30.8%, 24.1%, and 3.7% in bus, mux, and total power, respectively), but the area penalty is much lower (30.4%) than that of the four-cluster implementation shown in the table.
VI. CONCLUSIONS
We have presented a technique for power reduction based on exploiting the locality in a given application. At the core of the approach is a partitioning and assignment strategy. It was seen that the proposed scheme improves the implementation in a variety of ways. The predominant effect is the reduction of accesses to highly capacitive global buses. Our results showed average reductions of 57.8%, 56.0%, and 25.8% in bus, multiplexor, and overall power, respectively, and low associated area overhead. The partitioning and assignment techniques have been integrated into the Hyper-LP system.
The concept of preserving locality is a special case of a more general class of techniques referred to as distributed computing. In general, accesses to global computing resources (control, datapath, memory, I/O, and interconnect) are expensive due to high capacitance. Dividing these resources reduces the capacitance being switched per access. This work has presented an automated partitioning-based technique for interconnect power reduction. Future directions include extending this work to other applications such as automated memory and processor partitioning.
