This paper presents a high-level synthesis approach to minimize the total power consumption in behavioral synthesis under time and area constraints. The proposed method has two stages: functional unit (FU) energy optimization and interconnect energy optimization. In the first stage, the active and inactive energies of the FUs are optimized using a multiple supply and threshold voltage scheme. A genetic algorithm (GA) based simultaneous assignment of supply and threshold voltages and module selection is proposed. The proposed GA-based search method can be applied to large problems to find a near-optimal solution in a reasonable time. In the second stage, interconnects are simplified by increasing their sharing. This is done by exploiting similar data transfer patterns among FUs. The proposed method is evaluated on several benchmarks in a 90 nm CMOS technology. The experimental results show that energy savings of more than 40% can be achieved by the proposed method.
Introduction
In recent years, low power and energy consumption has become a primary concern in VLSI design. In deep sub-micron processes, energy consumption continues to increase every year. As the supply and threshold voltages are scaled down, sub-threshold leakage increases exponentially with the decreasing threshold voltage [1]. Even though the dynamic power of a transistor decreases with scaling, it is still a critical factor. The interconnect energy has also become a critical problem. As the number of metal layers increases and designs become more complex, the interconnect energy becomes very high and comparable to the static and dynamic energies of functional units (FUs). Therefore, unlike in earlier generations of CMOS process technology, the static and dynamic energies of both FUs and interconnects have to be reduced to achieve total energy reduction.
Various methods to reduce energy consumption have been proposed. However, most of them focus on only one component of the energy consumption. A unified supply voltage scaling and re-timing process is proposed in [2], together with some heuristic algorithms for the dynamic energy minimization problem. In [3], dual threshold voltage assignment based on a priority function is discussed and an algorithm to analyze the leakage power is proposed. Assignment of a high threshold voltage to non-critical paths and a low threshold voltage to the critical path is discussed in [4]. A distributed sleep transistor technique is proposed in [5]. None of the above works considers the interconnect power, which is a major component of power consumption in deep sub-micron processes. Both [6] and [7] focus only on the interconnect power, and those methods do not reduce the dynamic and static power of FUs.
Reducing only one area of energy consumption may not always reduce the total energy since the energy consumption varies from circuit to circuit. For example, LSIs in mobile devices consume a large amount of static energy, and reducing only the dynamic energy does not make any impact on the total energy consumption. Therefore, in order to achieve total energy reduction, all the energy consuming areas have to be considered.
A method to reduce the power consumption using dual supply voltages and interconnection network simplification is proposed in [8]. However, the problem definition in [8] is too simple and does not consider conditions that are essential for a realistic implementation, such as level converters and registers. The power and area of a register are about 50% of those of an adder. Moreover, 5% to 10% of the area and power of an FU are consumed by the level converters. Therefore, these factors must be considered in high-level synthesis.
In this paper, we propose a method that considers both FU and interconnect energies to give a better overall result for a wide variety of circuits. This is an extension of the work done in [8]. We consider the reduction of static and dynamic energies together using a multi-supply-threshold voltage scheme, unlike the earlier work in [8], which does not consider the static energy. The high-level synthesis flow considers the area and power of level converters and registers. It also considers the area and power of interconnects inside the modules. In the evaluation, we consider all the additional overheads of the multi-supply-threshold voltage scheme, such as level converters and supply voltage lines. The evaluation is done for many benchmarks under different constraints, using different simulators.
Problem of Minimizing FU and Interconnect Energies
The targeted architecture model has FUs with two different supply and threshold voltages (Fig. 1 ). Level converters [9] , [10] 
The time constraint is given by Eq. (2). The control step of node $i$ is $S_{node_i}$, the delay of the FU to which node $i$ has been assigned is $D_{node_i}$, and the time constraint is $T_{max}$.
For all nodes: $S_{node_i} + D_{node_i} \le T_{max}$ (2)
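As a minimal sketch of how a schedule could be checked against the time constraint, the following assumes that the constraint has the form "control step plus FU delay must not exceed $T_{max}$" for every node, with the delay expressed in control steps. The `Node` class and the function name are hypothetical and are not taken from the original tool.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    control_step: int  # S_node_i: control step in which the operation starts
    fu_delay: int      # D_node_i: delay of the assigned FU, in control steps (assumed unit)


def meets_time_constraint(nodes, t_max):
    """Return True if every node finishes within t_max (assumed form of Eq. (2))."""
    return all(n.control_step + n.fu_delay <= t_max for n in nodes)
```

A schedule that fails this check would be rejected or repaired before its energy is evaluated.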
Active energy of the FUs ($E^{active}_{FU}$) is calculated using Eq. (3). The term $E^{active}_{node_i}$ covers the average active energy dissipation per operation in the FU of node $i$ and the active energy of the register connected to that FU. Register energy consumption data are given in Table 1. Note that the inactive energy contains the inactive energy of both the FU and its output register. Table 2 shows a part of a module library. It has modules with different supply and threshold voltages. The delay, area, dynamic energy and static power are also included for each module. The interconnects inside modules use lower metal layers such as metal 1 and metal 2. With advanced geometry scaling, the area of interconnects becomes larger than that of transistors. The area of a module can therefore increase due to the interconnects, and that area must be considered. For this reason, the areas in the module library of Table 2 and the register library of Table 1 include both interconnects and transistors. On the other hand, we do not consider the area of the interconnects that connect modules together. Process technologies of 90 nm and beyond have more than 6 metal layers, each drawn above another metal layer in 3D space. Therefore, the area of the interconnects in the higher metal layers overlaps the area of the core. Hence, the area of the interconnects that connect the modules together does not affect the total core area.
Moreover, we also include area-driven modules, time-driven modules and some intermediate modules in the module library. The synthesis then automatically chooses the combination of modules from the module library that minimizes the total energy under the time and area constraints. The final design can contain a mixture of area-driven, time-driven and intermediate modules with high or low supply and threshold voltages.
For the energy estimation, the following assumptions are made. 1. One clock cycle is equal to one control step. 2. The delays of writing data into registers and reading data from registers are negligible. Data transfer delays are also negligible.
The active and inactive energies of the multiplexers, buffers and data transfer wires are considered as interconnect energy and are calculated by Eq. (5). The terms $C_i$, $V_{dd}$ and $N_{trans}$ are, respectively, the wire capacitance of interconnect unit $i$, the supply voltage of the FU that transfers data through interconnect unit $i$, and the number of data transfers through interconnect unit $i$. Modeling $E_{inter}$ accurately is difficult in high-level synthesis, since the wire capacitance is known only after low-level design tasks such as placement and routing. Therefore, to estimate $E_{inter}$ as accurately as possible, fan-ins and fan-outs, which are known in high-level design tasks, are used. Figure 2 shows the relationship between the wire capacitance and fan-outs. The total capacitance is obtained by adding all the partial wire capacitances ($C_1 + C_2 + C_3 + C_4$) and the capacitance of the multiplexer. Figure 3 shows a simple multiplexer architecture based on pass gates. Since the number of pass gates increases in the order of $\log_2(\text{fan-ins})$, the capacitance also increases in the same order. However, as the design technology advances, the wire capacitance becomes larger than the source-to-drain capacitance of transistors. According to experiments with 90 nm CMOS design rules, the average wire capacitance is 100 times larger than the average source-to-drain capacitance of transistors in logic gates. Therefore, it is reasonable to assume that the source-to-drain capacitance is negligible. Under this assumption, the multiplexer's capacitance is neglected and only the capacitances of the partial wires are considered. Instead of adding all the wire capacitances, "number of fan-outs × average wire capacitance" is calculated. Therefore, by replacing the wire capacitance in Eq. (5) with the sum of fan-outs, the interconnect energy $E_{inter}$ is re-defined as Eq. (6), where α is a scaling factor and $FO_i$ is the sum of fan-outs associated with interconnect unit $i$. The scaling factor α indicates how much larger (or smaller) the interconnect energy is compared to the FU energy. This scaling factor cannot be accurately determined in high-level tasks, since it requires information obtained after placement and routing. Therefore, the scheduling and binding processes are combined with low-level tasks such as placement, routing, and circuit simulation to determine α.
Fig. 2 The relationship between the fan-outs and wire capacitance.
Fig. 3 The relationship between the fan-ins and multiplexer capacitance.
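Equations (5) and (6) are not legible in this copy of the text. Based only on the symbol definitions in the text, and assuming the usual $CV^2$ form of switching energy, the step from Eq. (5) to Eq. (6) may be sketched as follows. The per-unit subscripts on $V_{dd}$ and $N_{trans}$ and the absence of a constant factor are assumptions of this sketch.

```latex
% Assumed form of Eq. (5): switching energy summed over interconnect units i
E_{inter} = \sum_{i} C_i \, V_{dd,i}^{2} \, N_{trans,i}

% Approximate C_i by (sum of fan-outs) x (average wire capacitance) and
% absorb the average wire capacitance into the scaling factor \alpha,
% giving the assumed form of Eq. (6):
E_{inter} = \alpha \sum_{i} FO_i \, V_{dd,i}^{2} \, N_{trans,i}
```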
The objective function used in the synthesis flow is given by Eq. (7). It is obtained by simply adding the active, inactive and interconnect energies. The experimental results in Fig. 4 show the relationship between the objective function and the energy consumption obtained by circuit simulation for the EW and FIR filters. According to the results, the objective function has a linear relationship with the energy consumption. However, this optimization problem is extremely time consuming to solve using traditional methods such as integer linear programming (ILP). Therefore, a GA-based heuristic search method is introduced to solve very large problems in a reasonable time.
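The objective can be illustrated with a short sketch. It assumes that the interconnect term is the scaling factor α times the sum over interconnect units of fan-outs × $V_{dd}^2$ × number of transfers, which follows the symbol definitions in the text but is not confirmed by a legible equation. The function names and the tuple layout are hypothetical.

```python
def interconnect_energy(units, alpha):
    """Estimate interconnect energy from high-level parameters.

    units: iterable of (fan_out_sum, vdd, n_transfers) per interconnect unit.
    alpha: scaling factor relating interconnect energy to FU energy.
    The FO * Vdd^2 * N form is an assumption based on the text's definitions.
    """
    return alpha * sum(fo * vdd ** 2 * n for fo, vdd, n in units)


def objective(e_active_fu, e_inactive_fu, units, alpha):
    """Total energy: active FU energy + inactive FU energy + interconnect energy."""
    return e_active_fu + e_inactive_fu + interconnect_energy(units, alpha)
```

Because α is only known after layout, the same design would typically be scored under several α values rather than a single one.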
Assignment of Dual Supply and Threshold Voltages Based on a Genetic Algorithm

Overview
The genetic algorithm (GA) is a stochastic search technique based on the mechanisms of natural selection and natural genetics. It starts with an initial set of random solutions called the population. Each individual in the population is called a chromosome and represents a solution to the problem at hand. The chromosomes evolve through successive iterations, called generations. During each generation, the chromosomes are evaluated using some measure of fitness. To create the solutions for the next generation, new chromosomes, called children, are formed by either (i) merging two chromosomes from the current generation using a crossover operator or (ii) modifying a chromosome using a mutation operator.
One of the earliest GA approaches for high-level synthesis is given in [11]. It is used to solve a simple scheduling and allocation problem and does not deal with power consumption minimization. The major differences between [11] and our GA are the crossover and local search operations. We use a graph-based crossover (changing a cluster of nodes at a time), while [11] uses a single-point crossover (changing a single node at a time). Single-point crossover is not effective in high-level synthesis, since it takes a larger number of generations to find a good solution. Moreover, the local search operation used in our algorithm increases the speed of the GA significantly compared to the GA in [11].
We use the GA proposed in [12] to solve our energy consumption minimization problem. For the crossover operation, we classify all the nodes in each chromosome into two groups, based on a randomly selected cut-point, as shown in Fig. 5. Group 1 is the set of predecessor nodes of the cut-point node together with the cut-point node itself. The rest of the nodes belong to group 2. The crossover operator merges each group with the complementary group of another chromosome to generate new chromosomes, which are called child chromosomes. The term mutation refers to a sudden change that occurs in an existing chromosome to generate a new one. Figure 6 represents a mutation operation in which the FU of node N1 is replaced by a new one.
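The cut-point crossover described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation of [12]: a chromosome is modeled as a dictionary from node name to selected module, group 1 is taken to be the cut-point node plus all of its transitive predecessors, and the names `ancestors` and `cut_point_crossover` are hypothetical.

```python
import random


def ancestors(preds, node):
    """Return the node together with all of its transitive predecessors."""
    seen = {node}
    stack = [node]
    while stack:
        for p in preds.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen


def cut_point_crossover(parent_a, parent_b, preds, rng=random):
    """Swap the genes of group 1 (cut-point node and its predecessors)."""
    cut = rng.choice(sorted(parent_a))  # randomly selected cut-point node
    group1 = ancestors(preds, cut)
    # Child 1: group 1 from parent A, group 2 from parent B; child 2 is the mirror.
    child1 = {n: (parent_a if n in group1 else parent_b)[n] for n in parent_a}
    child2 = {n: (parent_b if n in group1 else parent_a)[n] for n in parent_a}
    return child1, child2
```

Children produced this way may violate the time or area constraints, so they would still need to be checked (and improved by the local search) before evaluation.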
A new generation is formed by selecting some of the chromosomes and rejecting the others, according to their fitness values, to keep the size of the population constant. Fitter chromosomes have higher probabilities of being selected. After processing for several generations, the algorithm converges to the best chromosome, which hopefully represents the optimal or a suboptimal solution to the problem. Figure 7 shows the flow chart of the GA based search algorithm.
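The selection step can be sketched with fitness-proportional sampling. The text only states that fitter chromosomes have higher selection probabilities; taking the fitness as the reciprocal of the energy is an assumption of this sketch, and the function name is hypothetical.

```python
import random


def select_next_generation(population, energies, size, rng=random):
    """Sample `size` chromosomes with probability proportional to 1 / energy.

    Energies must be positive. Sampling is done with replacement, so a fit
    chromosome may appear more than once in the next generation.
    """
    weights = [1.0 / e for e in energies]
    return rng.choices(population, weights=weights, k=size)
```

Keeping `size` equal to the current population size holds the population constant from one generation to the next.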
Local search
A local search is applied to the new chromosomes generated by the crossover and mutation operators. Each chromosome in the population obtained by the local search represents a local optimum. After these chromosomes are evaluated based on their energy consumption values, promising individuals are selected to form the next generation. The local search algorithm is as follows.
Step 1: Select one chromosome ($I_i$) from the population ($P$), where $P$ is the set of chromosomes generated by the crossover and mutation operators, and set $P = P - \{I_i\}$.
Since the module selection for every operation except operation $o_i$ is fixed, a local optimum can be found in a reasonable time. Suppose that a chromosome is as shown in Fig. 8(a). In this case, the module selection for all the operations except $o_1$
The new chromosome obtained by the local search for operation $o_1$ is shown in Fig. 8(b), where the new supply voltage $V'_{dd}$ is lower than the original one ($V'_{dd} < V_{dd}$). Therefore, the energy consumption is reduced, i.e., the solution is improved.
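The idea of the local search, re-selecting the module of one operation at a time while all other selections stay fixed, can be sketched as below. The individual steps of the original algorithm are not fully legible in this copy, so this is an assumed greedy variant. The `energy` and `feasible` callbacks and the `candidates` table are hypothetical stand-ins for the energy model, the time and area checks, and the module library.

```python
def local_search(chromosome, candidates, energy, feasible):
    """Greedy per-operation module re-selection.

    chromosome: dict mapping operation -> currently selected module.
    candidates: dict mapping operation -> list of modules that can implement it.
    energy:     function(chromosome) -> total energy (lower is better).
    feasible:   function(chromosome) -> True if time/area constraints hold.
    """
    best = dict(chromosome)
    for op in sorted(best):
        for module in candidates[op]:
            trial = dict(best)
            trial[op] = module  # change only this operation's module
            if feasible(trial) and energy(trial) < energy(best):
                best = trial
    return best
```

For example, replacing a high-voltage adder with a low-voltage one for a single operation is accepted only if the schedule remains feasible and the total energy drops.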
For small designs with less than 50 nodes, the accuracy of the GA is over 99%, according to [12] . For the larger problems, the accuracy is about 95%. A detailed discussion on the usage of GA in energy consumption minimization problem is given in [12] .
Interconnection Network Simplification Based on Regularity in Data Transfer
In deep sub-micron processes, the power consumed by interconnects is comparable to that of FUs. Therefore, interconnection network simplification is very important when reducing the total power consumption. However, simplifying the interconnects increases the number of FUs, and the additional FUs consume area and static energy. A trade-off therefore exists between the number of FUs and the complexity of the interconnects. Figure 9(a) shows an FU-energy-optimized design and Fig. 9(b) shows an interconnect-aware design. Figure 9(a) has a smaller number of FUs and a larger number of interconnects. Figure 9(b) has a smaller number of interconnects and a larger number of FUs. In the interconnect-aware design, even though the number of FUs is larger, the total power can be smaller due to the interconnect power reduction. In the proposed method, we consider the interconnect energy and the FU energy together. As a result, a better solution can be found for a wide variety of circuits.
A very interesting method to reduce the interconnect power is given by Khouri et al. [13]. The main idea of that paper is to increase interconnect sharing and thereby simplify the interconnection network. To simplify the interconnection network, redundant data transfer patterns are exploited and mapped to the same interconnects. In order to find the redundant data transfer patterns, "e-instances" are extracted from a given DFG. An e-instance is a pair of nodes connected by an edge. E-instances are classified into types called "e-templates," based on the operation types of their source and destination nodes. Figure 10(b) shows the e-templates and e-instances that are derived from the DFG in Fig. 10(a). E-instances in the same e-template share the same FUs if their operations do not overlap in time. Figures 11(a) and 11(b) show a random binding result and an e-template based binding result, respectively.
E-template based binding provides a simple interconnection network by sharing the interconnects between $O_1$ to $O_2$ and $O_4$ to $O_6$. As a result, the number of buffers, wires and multiplexers needed for the implementation is decreased and the energy consumption in the interconnects is reduced.
We adopt the idea in [13], since it effectively reduces the interconnection network complexity and increases interconnect sharing. The interconnection network simplification method given in [13] works only in a single supply and threshold voltage scheme. We extend the concept of the e-template so that it is applicable in the dual supply and threshold voltage scheme. We define the e-templates considering not only the operation types but also the FU types specified in the module library (Table 2). Unlike the original e-templates, we use a DFG after scheduling and module selection, such as the one in Fig. 12(a). Note that in Fig. 12(b), e-instances with the same operation types ($O_1 \to O_2$ and $O_4 \to O_6$) are classified into different e-templates (E1 and E2) since their FU types are different.
The proposed method minimizes the FU energy consumption first and then performs binding to estimate the interconnect energy consumption. As shown in Fig. 13, binding is performed for all the chromosomes in the population after the local search. The FU and interconnect energies are then calculated, and the chromosome that has the minimum total energy is considered the best solution. Since it is difficult to determine the α value in high-level synthesis, feedback from the layout simulation is used to choose the best solution. The optimization is done for different α values and each layout is evaluated for power consumption using Hspice. The circuit with the minimum energy consumption is then chosen, as shown in Fig. 14.
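The extended e-template classification can be sketched as a grouping of DFG edges by a key that includes both the operation types and the FU types of the source and destination nodes. The function name is hypothetical, and the operation and FU type labels used with it are made up for illustration; they are not taken from Fig. 12 or Table 2.

```python
from collections import defaultdict


def group_e_instances(edges, op_type, fu_type):
    """Group e-instances (edges) into e-templates.

    edges:   iterable of (source_node, destination_node) pairs.
    op_type: dict mapping node -> operation type (e.g. "add", "mul").
    fu_type: dict mapping node -> selected module type after module selection.
    Two e-instances share a template only if operation types AND FU types match.
    """
    templates = defaultdict(list)
    for src, dst in edges:
        key = (op_type[src], fu_type[src], op_type[dst], fu_type[dst])
        templates[key].append((src, dst))
    return dict(templates)
```

With this key, two edges that both go from an addition to a multiplication fall into different e-templates when one pair is bound to low-voltage modules and the other to high-voltage modules.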
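The outer loop over α can be sketched as follows. The `synthesize` and `simulate_power` callbacks are hypothetical placeholders for the high-level synthesis run and for the layout plus circuit simulation; they do not correspond to any real tool API.

```python
def best_design_over_alpha(alphas, synthesize, simulate_power):
    """Optimize once per alpha and keep the design with the lowest simulated power.

    alphas:         iterable of candidate scaling factors.
    synthesize:     function(alpha) -> design optimized for that alpha.
    simulate_power: function(design) -> power measured after layout.
    Returns (best_alpha, best_design, best_power).
    """
    results = []
    for alpha in alphas:
        design = synthesize(alpha)
        results.append((simulate_power(design), alpha, design))
    power, alpha, design = min(results, key=lambda r: r[0])
    return alpha, design, power
```

Only one circuit per α value goes through layout and simulation, which keeps the number of expensive low-level runs small.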
Evaluation
Evaluation Environment
For the evaluation, a 6-metal 1-poly 90 nm process with high and low threshold voltage transistors is used. A module library is created using the logic synthesis tool Design Compiler W-2004.12-SP1 and the automatic layout design tool Astro vZ-2007.03-SP3. Level converters are also included in some of the high-voltage modules. Each module is measured for active and inactive power dissipation using randomly generated test input patterns. A standard cell library is created for the FUs in the module library. The proposed synthesis tool generates an RTL circuit description in Verilog HDL after the optimization process. This Verilog HDL netlist file and the module-level standard cell library are used to create the circuit using automatic layout design. The power is calculated by circuit simulation using Hspice vZ-2007.03 and NanoSim. The flow chart of the evaluation process is shown in Fig. 15.
Experimental Results Using Dual Supply and Threshold Voltages
The evaluation is based on various benchmark examples [14]. Table 3 shows the comparison with the single supply voltage scheme without interconnection network simplification.
In the single supply voltage scheme, binding is done by minimizing the FU area. The results show that an energy saving of over 40% is possible. Table 4 shows the comparison with the single supply voltage scheme with the interconnection network simplification proposed in [13]. Compared to this, the proposed method gives 25% average energy savings and 43% maximum energy savings. The effect of interconnection network simplification in the dual-supply-threshold voltage scheme is also evaluated. Table 5 shows the power consumption in the dual-supply-voltage scheme with and without the interconnection network simplification. According to the results, energy savings of up to 36% are achieved. However, in some examples the power consumption is not reduced, because of the overhead introduced by the interconnection network simplification. When a circuit has a small number of similar data transfer patterns, interconnect sharing increases the number of FUs, because separate FUs are allocated for different data transfer patterns. These additional FUs increase the leakage current and hence the power consumption due to leakage. Therefore, the power savings depend on the trade-off between the power increase due to the leakage current and the power reduction due to the interconnection network simplification.
Fig. 16 The relationship between the α value and power consumption.
Figure 17 shows the combined effect of the power reduction techniques for each benchmark. The second bar of each benchmark shows the power after simplifying the interconnects in a single supply-threshold voltage scheme. This is the same technique as proposed in [13]. For the first three benchmarks the power is reduced, but for the next two benchmarks the power increases after the interconnect simplification. This is due to the leakage current of the FUs added by the interconnect simplification. The last bar of each benchmark shows the power consumption of the proposed method. In this case, if it is the number of low-voltage modules that increases, the power increase is small. As a result, a larger power reduction can be achieved after the interconnect simplification, unlike in the single supply-threshold voltage scheme.
In Table 3, the EW filter example does not give good results, since it does not have enough regular data transfer patterns. However, according to the other experimental results, interconnection network simplification gives an average power reduction of 10% and a maximum power reduction of 36% in the dual supply and threshold voltage scheme.
Figures 16(a)-16(f) show the relationship between the α value and power consumption for various examples. Power consumption decreases as α is increased. However, after the total power reaches a minimum, it starts to increase with α. The α that gives the minimum power is the optimal one for that particular architecture.
As shown in Fig. 16, the characteristics of the α curve are the same for the HSPICE and NanoSim (accuracy levels 3 and 4) simulators. Table 6 shows the time required for each step in the optimization process. The time required for the high-level synthesis is measured in the following environment. The CPU is an Intel Core2Duo 2.4 GHz and the system memory is 3 GB. The operating system is Windows XP. The high-level synthesis tool is written in about 20000 lines of C++ code and the compiler is Visual Studio 2003. Note that the logic synthesis and RTL generation are also done together with the high-level tasks. The time required for the logic synthesis is less than 10 seconds. The time for the low-level tasks such as placement and routing, circuit parameter extraction and circuit simulation is measured in the following environment. The CPU is an Intel Xeon 3.2 GHz and the system memory is 4 GB. The operating system is Vine Linux 4.0. According to Table 6, most of the time is consumed by the HSPICE simulation. All the other processes need less than 30 minutes to complete. Note that the circuit simulation of the matrix multiplication example cannot be done using a 32-bit operating system, since it requires more than 4 GB of memory. Therefore, this data is not included in Table 6. However, as shown in Fig. 16, less accurate simulations such as NanoSim levels 3 and 4 give the same characteristics as the HSPICE simulation. Therefore, even though the estimated power consumption is not accurate, the best α value is the same in all three simulations. As a result, we can use NanoSim level 3 or 4 instead of HSPICE for the circuit simulation, and the whole process then needs only a few hours to complete.
Conclusion
In this research, a method to reduce the total power at the high-level synthesis stage under time and area constraints is proposed. The proposed method can be divided into two stages. In the first stage, the FU energy is reduced using a dual supply and threshold voltage assignment. In the second stage, interconnection network simplification is performed. This is based on sharing interconnects among FUs by assigning the same interconnects to pairs of nodes that have the same data transfer patterns. After the power optimization process finishes, low-level tasks and circuit simulation are performed to determine the total power. This procedure is repeated for different α values to find the design with the minimum power consumption. Since the optimization process optimizes a circuit for a particular α value, the number of circuits subjected to the low-level tasks is very small. According to the experimental results, the best design can be found by simulating fewer than 10 different circuits. Low-level tasks such as logic synthesis and netlist generation are also done along with the high-level tasks, so that the low-level tasks required for the circuit simulation are further reduced.
The evaluation process considers all the additional overheads such as level converters, power supply lines, additional FUs, multiplexers and wires. Even considering all the overheads, more than 40% of energy reduction is achieved in some examples. However, if the circuit has a smaller number of regular data transfer patterns, the leakage power may increase with the interconnection network simplification. Therefore, the proposed synthesis process is suited for the circuits with a large number of similar data transfer patterns.
In the proposed method, it is possible to change the ratio of FU energy to interconnect energy by adjusting the α value. Therefore, this method can be used in different applications and technologies in which the ratio of interconnect energy to FU energy takes different values. The proposed method can also be used in FPGAs. FPGAs have programmable logic blocks and programmable interconnects that can be reconfigured to create arbitrary data paths. However, because of this high flexibility, the interconnect power consumption in FPGAs is very high and accounts for a large share of the total power consumption. To reduce the dynamic and static power consumption, dual supply voltage FPGAs [15] and multi-threshold voltage FPGAs [16] have been proposed recently. Therefore, the proposed method can be applied effectively to those FPGAs.
