Genetic approach to minimizing energy consumption of VLSI processors using multiple supply voltages by Hariyama Masanori et al.
Genetic Approach to Minimizing Energy
Consumption of VLSI Processors
Using Multiple Supply Voltages
Masanori Hariyama, Member, IEEE, Tetsuya Aoyama, and
Michitaka Kameyama, Fellow, IEEE
Abstract—This paper presents an efficient search method for a scheduling and module selection problem using multiple supply
voltages so as to minimize dynamic energy consumption under time and area constraints. The proposed algorithm is based on a
genetic algorithm so that it can find near-optimal solutions in a short time for large-size problems. An efficient search is achieved by a
crossover that avoids generating nonvalid individuals, and a local search is also utilized in the algorithm. Experimental results for
large-size problems with 1,000 operations demonstrate that the proposed method can achieve significant energy reduction up to
50 percent and can find a near-optimal solution (within 2.8 percent from the lower bound of optimal solutions) in 10 minutes. On the
other hand, the ILP-based method cannot find any feasible solution in one hour for the large-size problem, even if a state-of-the-art
mathematical programming solver is used.
Index Terms—Automatic synthesis, scheduling, module selection, data-path design.

1 INTRODUCTION
IN recent years, low power has become a primary design concern [1]. An effective way to reduce dynamic power
consumption is to lower the supply voltage of a circuit.
However, reducing the supply voltage increases the circuit
delay. The use of multiple supply voltages is a well-known
technique which reduces dynamic power consumption
without increasing the circuit delay [2]. Fig. 1 shows an
example when a time constraint and a data-flow graph
(DFG) to specify an input behavioral description are given.
A lower supply voltage $V'_{dd}$ can be applied to operations o2
and o3 because they have flexibility in the supply
voltages to which they can be assigned. A higher supply
voltage $V_{dd}$ is applied to all the other operations
because they have no such flexibility.
The dynamic power consumption reduces to 70 percent if
$V'_{dd} = V_{dd}/2$ because power consumption is proportional to
$V_{dd}^2$ [3]. The major concern of this technique is that the
number of functional units, that is, the chip area, increases
due to the delay of operations to which lower supply
voltages are applied. For example, three functional units are
required in the case of Fig. 1b, while only two functional
units are required in the case of Fig. 1a. Therefore, an area
constraint, as well as a time constraint, is important for low
power design using multiple supply voltages.
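The 70 percent figure can be checked with a short calculation. The following is only a sketch under assumptions that are not stated explicitly in the text: the DFG of Fig. 1 is taken to contain five operations, each consuming the same energy $E$ at $V_{dd}$, and the schedule length is fixed by the time constraint, so that average power is proportional to total energy.

```latex
E_{\text{low}} = E \left( \frac{V'_{dd}}{V_{dd}} \right)^{2} = \frac{E}{4},
\qquad
\frac{E_{\text{dual}}}{E_{\text{single}}}
  = \frac{3E + 2 \cdot \tfrac{E}{4}}{5E}
  = \frac{3.5}{5} = 0.70 .
```

Under these assumptions, moving o2 and o3 to the lower voltage removes three quarters of their energy, which is 30 percent of the total.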
Several studies on high-level synthesis using multiple
supply voltages have been reported [4], [5], [6], [7], [8].
Algorithms for time-constrained problems are presented in [4],
[5]. A time- and area-constrained problem is also discussed in
[7], [8] because a multiple-supply-voltage approach tends
to result in area overheads, as described above. For this
problem, the integer linear programming (ILP) method is
usually used. However, the ILP method is practical only for
small-size DFGs.
This paper presents an efficient search method for the
dynamic energy consumption minimization problem under
time and area constraints that is applicable to
large-size DFGs. The proposed algorithm is based on a
genetic algorithm (GA). A critical problem for a GA is the
generation of nonvalid individuals, which can slow down or
even prevent convergence of the algorithm.
In our problem, typical crossover methods such as the
one-point crossover generate a large number of nonvalid
individuals that do not satisfy the precedence constraint since
they do not consider dependencies between nodes in DFGs. To
solve the problem, we propose a crossover based on the
data-flow graph representation. Moreover, we combine a GA and
a local search heuristic, which finds local optima in a limited
search space, to make the search more efficient. Experimental
results for large-size problems with 1,000 operations
demonstrate that the proposed method can achieve significant
energy reduction up to 50 percent and can find a
near-optimal solution (within 2.8 percent from the lower bound
of optimal solutions) in 10 minutes. On the other hand,
the ILP-based method cannot find any feasible solution in
one hour for the large-size problem even if a state-of-the-art
mathematical programming optimizer, which includes
a lot of efficient algorithms to reduce computational
complexity, is used to accelerate the exploration of the search
space.
642 IEEE TRANSACTIONS ON COMPUTERS, VOL. 54, NO. 6, JUNE 2005
• M. Hariyama and M. Kameyama are with the Graduate School of
Information Sciences, Tohoku University, Aoba6-6-05, Aramaki, Aoba,
Sendai, Miyagi, 980-8579, Japan.
E-mail: {hariyama, kame}@kameyama.ecei.tohoku.ac.jp.
• T. Aoyama is with the System Devices Research Laboratories, NEC Corp.,
Japan. E-mail: t-aoyama@cd.jp.nec.com.
Manuscript received 3 Feb. 2004; revised 29 Sept. 2004; accepted 7 Oct. 2004;
published online 15 Apr. 2005.
For information on obtaining reprints of this article, please send e-mail to:
tc@computer.org, and reference IEEECS Log Number TCSI-0036-0204.
0018-9340/05/$20.00 © 2005 IEEE Published by the IEEE Computer Society
Authorized licensed use limited to: TOHOKU UNIVERSITY. Downloaded on February 5, 2009 at 00:54 from IEEE Xplore.  Restrictions apply.
2 ENERGY MINIMIZATION PROBLEM
2.1 Data Flow Graph
An input behavioral description is given by a DFG, as
shown in Fig. 2. A DFG is a directed acyclic graph $G(V, E)$,
where $V$ is a set of nodes and $E$ is a set of edges. Each $v_i \in V$
represents an operation ($o_i$) in the behavioral description. A
directed edge $e_{ij}$ from $v_i \in V$ to $v_j \in V$ exists in $E$ if the
result of an operation $o_i$ is used as an input of an operation
$o_j$. In this case, $v_i$ is called an immediate predecessor of $v_j$,
and the set of all immediate predecessors of $v_i$ is denoted by
$Pred_{v_i}$. Each operation $o_i$ is executed in $D_{o_i}$ control
steps. The value of $D_{o_i}$ depends on the functional unit
performing operation $o_i$.
2.2 Datapath Architecture
Fig. 3 shows a datapath architecture, where functional units
and registers are connected by multiple buses to support
parallel data transfer. The architecture model is very
flexible. The number of FUs, types of FUs, the number of
registers, and the number of buses can be changed as long
as area and time constraints are satisfied. Connections
between FUs are not restricted and arbitrary point-to-point
interconnection between FUs can be implemented. More-
over, the datapath architecture allows both a nonpipelined
datapath and a pipelined one with an arbitrary degree of
spatial parallelism.
We focus on the minimization of dynamic energy
consumption that is caused by signal transitions in circuits.
The technique of gating a clock is used to prevent registers
from loading unnecessary new values so that unnecessary
signal transitions in functional units fed by the registers are
suppressed. The gated-clock datapath architecture also
simplifies the objective function of the energy consumption
minimization problem, as described later.
The use of multiple supply voltages is a well-known
technique to obtain low energy implementation at reduced
performance overhead. In the context of high-level synthesis,
one way to utilize multiple supply voltages is module
selection, that is, the process of mapping operations from
the DFG to component templates from the RTL library that
contains multiple versions of each component corresponding
to different supply voltages. Note that only a functional unit
template, not a specific instance, is associated with each
operation. Table 1 shows an example of the RTL component
library. The OP type denotes an operation type that can be
performed by the functional unit templates. For example,
functional unit templates of types F1 and F3 can perform
addition (denoted by “ADD”) and multiplication (denoted by
“MUL”), respectively. The delay denotes the number of steps
for one operation. The energy denotes the average energy
consumption for one operation. Each functional unit template
has an OP type, a supply voltage, an area, a delay, and an
energy. The library can also contain different circuit
implementations for each OP type. For example, an addition
can be implemented by using a ripple-carry adder, a
carry-lookahead adder, or a carry-select adder.
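A library of this kind maps naturally onto a small record type. The Python sketch below is illustrative only: the class name `FUTemplate`, the helper `templates_for`, and all numeric values are assumptions made for this example (they are placeholders and not the entries of Table 1); only the field list, the two supply voltages, and the pairing of F1/F2 with ADD and F3 with MUL follow the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FUTemplate:
    name: str      # template type, e.g. "F1"
    op_type: str   # operation type it can perform, e.g. "ADD"
    vdd: float     # supply voltage in volts
    area: int      # area in arbitrary units
    delay: int     # number of control steps for one operation
    energy: float  # average energy for one operation (arbitrary units)


# Hypothetical library: placeholder numbers, NOT the values of Table 1.
LIBRARY = [
    FUTemplate("F1", "ADD", 5.0, area=1, delay=1, energy=1.00),
    FUTemplate("F2", "ADD", 3.0, area=1, delay=2, energy=0.36),
    FUTemplate("F3", "MUL", 5.0, area=4, delay=2, energy=4.00),
    FUTemplate("F4", "MUL", 3.0, area=4, delay=4, energy=1.44),
]


def templates_for(op_type, library=LIBRARY):
    """Return every template that can perform operations of the given type."""
    return [t for t in library if t.op_type == op_type]
```

Module selection then amounts to choosing, for each operation, one element of `templates_for(op_type)`.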
2.3 Problem Definition
For the energy consumption minimization problem, we
make the following assumptions:
Assumption 1. All operations are synchronized with a clock, and
a single clock cycle is called a “step.”
HARIYAMA ET AL.: GENETIC APPROACH TO MINIMIZING ENERGY CONSUMPTION OF VLSI PROCESSORS USING MULTIPLE SUPPLY... 643
Fig. 1. Power consumption reduction using multiple supply voltages.
(a) Single supply voltage. (b) Dual supply voltages.
Fig. 2. Data flow graph.
Fig. 3. Architecture model.
TABLE 1
RTL Component Library
Assumption 2. The execution time of each operation is given by
kTclk for an integer value k, where Tclk is the clock period.
Assumption 3. The delay involved in a register-to-register
transfer is negligible.
Assumption 4. The energy consumed by registers and an
interconnection network is negligible. The areas are also
negligible.
Assumption 5. Static power consumption is negligible.
Scheduling and module selection are discussed in our
approach. Basically, scheduling refers to the process of
mapping operations to control steps. As can be seen from
Table 1, multicycle operations are used for our problem.
Thus, the scheduling is extended to determine a start
control step of each operation.
The goal is to minimize the total energy consumed when
all the operations are performed. The total energy is simply
given by the sum of energy consumption for all the
operations because the gated-clock datapath architecture
is employed as described above. The energy consumption
for each operation depends on the functional unit to which






where EFi is the energy consumed by a functional unit of
type Fi and NFi is the number of all the functional units of
type Fi used in the processor.
We describe the following constraints:
Time Constraint. Any operation must finish by Tmax, that
is, the maximum number of control steps available to
execute the operations in the DFG. Therefore, (2) must be
satisfied for any operation oi.
$$Sched_{o_i} + D_{o_i} - 1 \le T_{max}, \qquad (2)$$
where $Sched_{o_i}$ is the start control step of operation $o_i$.
Area Constraint. The total area of all the functional units
must not exceed $A_{max}$, that is, a maximum chip area:
$$\sum_{1 \le i \le m} (A_{F_i} \times N_{F_i}) \le A_{max}, \qquad (3)$$
where $A_{F_i}$ is the area of a functional unit of type $F_i$.
Precedence Constraint. An operation oj must not start
before an operation oi has finished if oi is a predecessor
of oj (i.e., oi 2 Predoj ).
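The three constraints can be checked mechanically for a candidate solution. The following Python sketch is only an illustration: the function name, the dictionary-based data layout, and the choice to take the number of units of each type as the peak number of operations of that type busy in the same control step are assumptions made here, not part of the formulation above.

```python
from collections import Counter


def is_feasible(preds, start, template, delay, area, t_max, a_max):
    """Check the time, precedence, and area constraints for one solution.

    preds[o]    : immediate predecessors of operation o
    start[o]    : start control step of o (1-based)
    template[o] : functional unit type assigned to o
    delay[f]    : delay in control steps of template type f
    area[f]     : area of one functional unit of type f
    """
    finish = {o: start[o] + delay[template[o]] - 1 for o in start}

    # Time constraint (2): every operation must finish by t_max.
    if any(f > t_max for f in finish.values()):
        return False

    # Precedence constraint: o may start only after each predecessor finished.
    if any(start[o] <= finish[p] for o in start for p in preds.get(o, ())):
        return False

    # Assumed unit count per type: peak number of busy operations in a step.
    needed = Counter()
    for step in range(1, t_max + 1):
        busy = Counter(template[o] for o in start if start[o] <= step <= finish[o])
        for f, n in busy.items():
            needed[f] = max(needed[f], n)

    # Area constraint (3): sum over types of area times number of units.
    return sum(area[f] * n for f, n in needed.items()) <= a_max
```

A solution that fails any of the three checks is what the later sections call a nonvalid or infeasible individual.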
Thus, the energy consumption minimization problem is
defined as the problem to schedule operations and assign a
functional unit to each operation so as to minimize the
energy consumption under given time and area constraints.
This problem can be formulated as an integer linear programming
(ILP) problem, and the ILP method is usually used to solve it.
However, the ILP method is impractical for large problems
since its execution time grows
rapidly with the problem size. Instead, we propose an
efficient search method which is based on genetic algorithms
and is applicable to large-scale problems, as
described in the next section.
3 GA-BASED EFFICIENT SEARCH METHOD
3.1 Basic Genetic Algorithm
A genetic algorithm is a stochastic search technique based on
the mechanism of natural selection and natural genetics. A
genetic algorithm starts with an initial set of random solutions
called population. Each individual in the population is called
a chromosome, which represents a solution to the problem at
hand. The chromosomes evolve through successive itera-
tions, called generations. During each generation, the
chromosomes are evaluated, using some measures of fitness.
To create the next generation, new chromosomes, called
children, are formed by either 1) merging two chromosomes
from the current generation using a crossover operator or
2) modifying a chromosome using a mutation operator. A
new generation is formed by 1) selecting, according to the
fitness values, some of the parents and children and
2) rejecting others so as to keep the population size constant.
Fitter chromosomes have higher probabilities of being
selected. After several generations, the algorithms converge
to the best chromosome, which hopefully represents the
optimal or suboptimal solution to the problem. The flowchart
of the basic genetic algorithms is shown in Fig. 4.
3.2 Chromosome Representation
The approach for energy consumption minimization con-
sists of scheduling and module selection as described in the
previous section. Because the chromosome representation
for the problem must contain the information for both
scheduling and module selection, we can use the following
string for the problem with n nodes:
$$x_1\; y_1\; x_2\; y_2\; x_3\; y_3\; \ldots\; x_n\; y_n,$$
where $x_i$ is the start control step of operation $o_i$ and
corresponds to scheduling, and $y_i$ is the functional unit
template which is assigned to operation $o_i$ and corresponds
to module selection.
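The interleaved string can be built from, and split back into, a schedule and a module selection. The short Python sketch below illustrates this; the function names `encode` and `decode` and the use of dictionaries keyed by operation name are assumptions for the example.

```python
def encode(ops, start, template):
    """Flatten a schedule and module selection into [x1, y1, ..., xn, yn]."""
    chromosome = []
    for o in ops:
        chromosome += [start[o], template[o]]
    return chromosome


def decode(ops, chromosome):
    """Inverse of encode: recover the start-step map and the template map."""
    xs, ys = chromosome[0::2], chromosome[1::2]
    return dict(zip(ops, xs)), dict(zip(ops, ys))
```

For instance, two operations starting at steps 3 and 1 and mapped to F1 and F2 are encoded as `[3, "F1", 1, "F2"]`.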
Fig. 4. Flowchart of the basic genetic algorithm.
The earliest and latest bounds of xi are determined by
the As-Soon-As-Possible (ASAP) and As-Late-As-Possible
(ALAP) algorithms, respectively. The ASAP algorithm
maps each operation onto the earliest possible step [9],
while the ALAP one maps each operation onto the latest
possible step [10]. Let $E_i$ and $L_i$ be the start control steps
into which operation $o_i$ is scheduled by the ASAP and
ALAP algorithms, respectively. The set of control steps from $E_i$
to $L_i$ is called the mobility range of operation $o_i$ and is
defined as follows:
$$mrange(o_i) = \{ S_j \mid E_i \le j \le L_i \}.$$
Fig. 5 shows the mobility range of every operation in the
DFG shown in Fig. 2 for a time constraint of six steps, where
$S_j$ is the $j$th control step. For example, the mobility range of
o3, $mrange(o_3)$, is $\{S_1, S_2, S_3, S_4\}$ since its ASAP and ALAP
labels are $E_3 = 1$ and $L_3 = 4$, respectively. Fig. 6 shows a
scheduling example for the mobility range shown in Fig. 5.
The start control steps of operations o1 and o2 are $S_3$ and $S_1$,
respectively.
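The ASAP and ALAP bounds can be computed with one forward and one backward pass over a topological order of the DFG. The Python sketch below is an illustration under stated assumptions: the function name and data layout are invented here, each operation is given the delay of its fastest template (as the Appendix assumes for $E_i$ and $L_i$), and `graphlib.TopologicalSorter` from the Python standard library (3.9 or later) supplies the ordering.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def mobility_ranges(preds, min_delay, t_max):
    """Return {op: (E, L)}, the ASAP and ALAP start steps (1-based).

    preds[o]     : immediate predecessors of o (every operation is a key)
    min_delay[o] : delay of o on its fastest template, in control steps
    t_max        : time constraint in control steps
    """
    # static_order() yields each node after all of its predecessors.
    order = list(TopologicalSorter(preds).static_order())
    succs = {o: [] for o in order}
    for o, ps in preds.items():
        for p in ps:
            succs[p].append(o)

    asap, alap = {}, {}
    for o in order:  # forward pass: start right after the last predecessor
        asap[o] = max((asap[p] + min_delay[p] for p in preds[o]), default=1)
    for o in reversed(order):  # backward pass: finish before any successor
        latest_finish = min((alap[s] - 1 for s in succs[o]), default=t_max)
        alap[o] = latest_finish - min_delay[o] + 1
    return {o: (asap[o], alap[o]) for o in order}
```

An operation whose two bounds coincide lies on a critical path and has no freedom in scheduling; drawing the initial population from within these ranges is one way to use the mobility range when generating individuals.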
The possible range of $y_i$ is the following set
$FU_{t_i}$, where the type of an operation $o_i$ is denoted by $t_i$.
$FU_{t_i}$: the set of functional unit templates from the given
RTL component library that can perform operations of
the type $t_i$.
Fig. 7 shows a module selection example for the RTL
component library shown in Table 1, where $F_j$ is the type of
functional unit template. In the figure, the set of functional
unit templates that can perform operation o1, $FU_{t_1}$, is
$\{F_1, F_2\}$. The functional units to which operations o1 and o2
are assigned are of the types $F_1$ and $F_2$, respectively.
Fig. 8 shows a scheduling and module selection example.
The start control step of operation o3 is $S_2 (= x_3)$ and the
functional unit to which operation o3 is assigned is $F_2 (= y_3)$.
The chromosome representation of this example is shown in
Table 2.
3.3 Crossover Based on DFG Representation
For our problem, typical crossover methods such as the one-
point crossover generate a large number of nonvalid
individuals which slow down or even prevent convergence
of algorithms, where the nonvalid individuals are defined as
individuals which do not satisfy the precedence constraint.
For example, let us explain the one-point crossover, where
one cut-point is randomly selected on the chromosomes and
the left parts of two parents are exchanged to generate
children. Let the DFGs Parent1 and Parent2 be parents
(Fig. 9). Their chromosomes are shown in Table 3. Suppose
that a cut-point is selected between O3 and O4. Then, the
nodes are classified into two groups: Group1 and Group2
corresponding to the left part and right part on the
chromosomes, respectively (Fig. 10). Note that O1, O2, and
O3 are put into Group1 although they have dependencies
with O4 and O5 in Group2. The chromosomes of resulting
children are shown in Table 4. Both children Child1 and
Child2 are nonvalid individuals because they do not satisfy the
precedence constraint, as shown in Fig. 11. A grouping
strategy that ignores the dependencies results in a high
probability of generating nonvalid individuals since the
two parents generally have different schedules.
To solve this problem, we propose a crossover method
that groups as many nodes with dependencies as possible.
It is based on the idea that nodes in the same group should
satisfy the precedence constraint. Given DFGs of parents,
Parent1 and Parent2, the algorithm is described as follows:
Step 1: Randomly select a cut-point node on the DFGs of the
parents. That is, randomly select the number CP of the
cut-point node from 1 to n, where n is the total number of
nodes in the DFG.
Step 2: Classify the nodes of Parent1 and Parent2 into two
groups. Let G11 and G21 be, respectively, the set of the
predecessors of the cut-point node $O_{CP}$ and the set of the
remaining nodes (those not in G11) in Parent1. Let G12 and G22 be,
Fig. 5. Mobility range of each operation in the DFG shown in Fig. 2.
Fig. 6. Scheduling example.
Fig. 7. Module selection example.
Fig. 8. Scheduling and module selection example.
respectively, the set of the predecessors of the cut-point
node $O_{CP}$ and the set of the remaining nodes (those not in G12) in Parent2.
Step 3: Exchange Group1 of Parent1 with that of Parent2.
That is, Child1 is generated by merging G11 and G22 and
Child2 is generated by merging G12 and G21.
For example, assume that the node O4 is selected as a
cut-point node for parents shown in Fig. 9. Then, Group1
and Group2 are determined as shown in Fig. 12. Note that
only O4 in Group1 has dependencies with O6 in Group2 and
the precedence constraint is satisfied. The DFG representa-
tion of chromosomes generated by the graph-based cross-
over is shown in Fig. 13.
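The three steps can be sketched in Python as follows. This is an illustration rather than the authors' implementation: the function names and the dictionary encoding of an individual are assumptions, and Group1 is taken to be the cut-point node together with all of its direct and indirect predecessors, which is one reading of Step 2 that agrees with the example where O4 itself belongs to Group1.

```python
import random


def ancestors_with(node, preds):
    """Return the node together with all of its (transitive) predecessors."""
    group, stack = set(), [node]
    while stack:
        o = stack.pop()
        if o not in group:
            group.add(o)
            stack.extend(preds.get(o, ()))
    return group


def graph_crossover(parent1, parent2, preds, rng=random):
    """Exchange Group1 between two parents defined on the same DFG.

    parent1, parent2 : dict op -> (start_step, template), same keys
    preds[o]         : immediate predecessors of operation o
    """
    cut = rng.choice(sorted(parent1))        # Step 1: random cut-point node
    group1 = ancestors_with(cut, preds)      # Step 2: Group1; the rest is Group2
    # Step 3: Child1 = G11 + G22 and Child2 = G12 + G21.
    child1 = {o: (parent1 if o in group1 else parent2)[o] for o in parent1}
    child2 = {o: (parent2 if o in group1 else parent1)[o] for o in parent1}
    return child1, child2
```

Because every dependency inside Group1 is inherited from a single valid parent, only the edges that cross from Group1 to Group2 can be violated, so a validity check on the children is still needed.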
3.4 Combination of Local Search and GA
In order to achieve a more efficient search, a genetic
algorithm is combined with a local search. A local search
technique is used to find local optima in a given problem
search space and a genetic algorithm is used to search the
space of local optima in order to find the global optimum.
Fig. 14 shows the structure of the combination of a local
search and genetic algorithm. The local search is applied to
new children generated by a crossover and mutation
operators. All the individuals in the population obtained
by the local search represent local optima. They are
evaluated based on their energy consumption values.
Promising individuals are selected from the set of local
optimal solutions to form the next generation.
We describe a local search for our problem. The local
search is applied to all individuals in every generation. The
algorithm is shown as follows:
Step1: Select one individual ($I_i$) from the population ($P$),
where $P$ is a set of individuals generated by crossover
and mutation operators. $P = P - \{I_i\}$.
Step2: Select one operation ($o_i$) from $O_{I_i}$, where $O_{I_i}$ is the set
of nodes in the individual ($I_i$). $O_{I_i} = O_{I_i} - \{o_i\}$.
Step3: Search for a feasible scheduling and module selection
for operation $o_i$ that improves the solution, while the
scheduling and module selection for all the operations
except operation $o_i$ are fixed.
Step4: If $O_{I_i} \ne \emptyset$, then go to Step2.
Step5: If $P \ne \emptyset$, then go to Step1.
Since the scheduling and module selection for every
operation except operation oi are fixed, the local optima
are found in reasonable time. Suppose that an individual
shown in Fig. 15a is given. Let us explain the local search for
operation o1. In this case, the scheduling and module
selection for all the operations except operation o1, that is,
operations o2, o3, o4, and o5, are fixed. A feasible scheduling
and module selection for only operation o1 are searched.
The resulting individual obtained by the local search for
operation o1 is shown in Fig. 15b, where $V'_{dd} < V_{dd}$. The
functional unit which is assigned to operation o1 changes
from a high-voltage unit to a low-voltage one. The energy
consumption for operation o1 is reduced, that is, the
solution is improved.
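Steps 2 to 4 for a single individual can be sketched in Python as below. The sketch rests on assumptions that go beyond the text: the function and parameter names are invented here, the candidate choices per operation (for example, its mobility range combined with its usable templates) and the feasibility test are supplied by the caller, the input individual is assumed to be feasible, and the first feasible choice with the lowest energy is accepted.

```python
def local_search(individual, candidates, energy, feasible):
    """Improve one individual by re-deciding one operation at a time.

    individual : dict op -> (start_step, template), assumed feasible
    candidates : dict op -> iterable of (start_step, template) options
    energy     : dict template -> energy for one operation
    feasible   : callable(individual) -> bool, checks all constraints
    """
    best = dict(individual)
    for op in list(individual):
        # Try the options for this operation from lowest energy upward.
        for start, tmpl in sorted(candidates[op], key=lambda c: energy[c[1]]):
            if energy[tmpl] >= energy[best[op][1]]:
                break                      # no option with lower energy is left
            trial = {**best, op: (start, tmpl)}
            if feasible(trial):            # all other operations stay fixed
                best = trial
                break
    return best
```

Since only one operation changes at a time and a change is accepted only if it lowers that operation's energy, the total energy of the returned individual never exceeds that of the input.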
4 EXPERIMENTAL RESULTS
We describe some details for our algorithm.
• To ensure that only valid individuals are
placed into the initial population, we utilize the
mobility range of operations.
• Individuals having lower energy consumption are
given higher fitness values. Nonvalid individuals
are given the lowest fitness value.
• A roulette wheel approach is adopted as the
selection procedure. It selects a new population
with respect to the probability distribution based on
fitness values.
• Individuals amounting to 10 percent of the population size
are placed into the next generation without any
genetic operation. They are selected according to
TABLE 2
Chromosomes of the DFG Shown in Fig. 8
Fig. 9. Scheduling and module selection example (left: Parent1, right:
Parent2).
TABLE 3
Chromosome Representation of the DFGs Shown in Fig. 9
Fig. 10. Two groups divided by a cut-point for Table 3.
TABLE 4
Chromosomes Generated by One-Point Crossover
for Chromosomes Shown in Table 3
The cut-point is between operations o3 and o4.
their fitness values. An individual whose fitness
value is higher has a higher priority.
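The selection scheme in the last two items can be sketched in Python as follows. This is a minimal illustration, assuming that the best 10 percent are copied first and the remaining slots are filled by roulette wheel draws with replacement; the function name and parameters are invented for the example, and `random.choices` from the standard library performs the fitness-proportional draw.

```python
import random


def next_generation(population, fitness, size, elite_frac=0.10, rng=random):
    """Copy the fittest elite_frac unchanged, fill the rest by roulette wheel.

    population : list of individuals
    fitness    : list of non-negative fitness values (higher is better),
                 with at least one value greater than zero
    size       : number of individuals in the next generation
    """
    ranked = sorted(range(len(population)), key=lambda i: fitness[i], reverse=True)
    n_elite = int(size * elite_frac)
    elite = [population[i] for i in ranked[:n_elite]]
    # choices() draws with replacement, with probability proportional to weight.
    rest = rng.choices(population, weights=fitness, k=size - n_elite)
    return elite + rest
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible, which is convenient when results are averaged over repeated trials.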
Let us evaluate our algorithm (hybrid of the GA using
the graph-based crossover and the local search) on some
high-level synthesis benchmarks (EW filter, FIR filter, and
HAL) and compare it with GA using the one-point
crossover and GA using the graph-based crossover.
The algorithms are executed on a PC (Athlon at 800 MHz,
768 MB of memory, OS: Windows XP). We assume that the
library shown in Table 1 is given, which contains two
different supply voltages, 5 V and 3 V.
The results for some high-level synthesis benchmarks
(EW filter, FIR filter, and HAL) are tabulated in Table 5. Our
algorithm is denoted by hybrid. $T_{max}$ is the time constraint
and $A_{max}$ is the area constraint. $T_{max}$ is given for three
different values ($T_c$, $1.5T_c$, and $2T_c$, where $T_c$ is the critical
path delay). $E_1$ is the energy consumption corresponding to
the supply voltage of 5 V. $E_2$ is the energy consumption of
the best solution obtained by our algorithm for 50 trials. The
reduction ratio is the percentage of $E_2$ over $E_1$. We also
evaluate our algorithm with the average and variance of
50 trials. When $T_{max} = T_c$, the average energy reduction is
84.8 percent compared to $E_1$. Similarly, when $T_{max} = 1.5T_c$,
the average energy reduction is 58.5 percent and, for
$T_{max} = 2T_c$, the average energy reduction is 50.0 percent.
To obtain the optimal solution, the energy minimization
problem is formulated by using the integer linear program-
ming, as described in Appendix A, and is solved by using
the integer program solver package (CPLEX 7.1, ILOG
Corp.). GAP is the difference between the reduction ratio
and the percentage of the optimal solution over the E1. The
results show that the proposed crossover is useful com-
pared with the one-point crossover and the optimal
solutions are obtained for typical high-level synthesis
benchmarks.
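Written as formulas, the two quantities reported in Table 5 are as follows, where $E_{opt}$ is a symbol introduced here (it does not appear in the text) for the energy of the optimal ILP solution:

```latex
\text{reduction ratio} = 100 \times \frac{E_2}{E_1} \;[\%],
\qquad
\text{GAP} = 100 \times \frac{E_2}{E_1} - 100 \times \frac{E_{opt}}{E_1}
           = 100 \times \frac{E_2 - E_{opt}}{E_1} \;[\%].
```

A GAP of zero therefore means that the best solution found has the same energy as the optimal solution.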
To evaluate our algorithm for large-size problems, the
high-level synthesis benchmarks (EW filter, FIR filter, and
HAL) are extended to the large-size examples (EWF30,
FIR90, and HAL100). The DFG of EWF30 consists of
30 DFGs of the EW filter. Similarly, the DFGs of FIR90
and HAL100 consist of 90 DFGs of the FIR filter and 100 DFGs
of HAL, respectively. The search time is set to 10 minutes.
The results for each example under the search time of
10 minutes are tabulated in Table 6. The time constraint is
2Tc. E2 is the energy consumption of the best solution
obtained by our algorithm for 50 trials. The average energy
reduction is 58.7 percent compared to E1. The lower bound
is obtained by an optimal linear programming solution.
Note that the ILP method cannot obtain optimal solutions
for these examples in one hour. GAP is the difference
between the reduction ratio and the percentage of the lower
Fig. 11. DFG representation of chromosomes shown in Table 4 (left:
Child1, right: Child2).
Fig. 12. Groups obtained by the graph-based crossover.
Fig. 13. DFG representation of chromosomes generated by the graph-
based crossover.
Fig. 14. Flowchart of the genetic algorithm with a local search.
Fig. 15. Example of a local search for an operation o1.
bound over the E1. The results show that our solution is
within 2.8 percent of the lower bound and our algorithm is
practical for large-size problems in quality and search time.
On the other hand, the ILP method cannot find any feasible
solutions in one hour for these large-size problems.
5 CONCLUSION
We have presented an efficient search algorithm based on a genetic
algorithm for the energy consumption minimization problem
under time and area constraints. We have also demonstrated
that our algorithm is applicable to large-size problems.
Our algorithm consists of two schemes. One is a
DFG-representation-based crossover that seldom generates
nonvalid individuals. The other is a combination of a local search
and GA. For large-size examples, high-quality solutions are
obtained by our algorithm in a short time.
The architecture model in this paper is simple but effective
for cases where the power dissipation caused by functional
units accounts for most of the total power dissipation. To handle
the case where the power dissipation caused by interconnections
and registers is dominant, we need two additional
tasks, interconnection allocation and register allocation,
TABLE 5
Results for the Set of Benchmarks
TABLE 6
Results for the Large-Size Examples
because interconnections are determined after interconnec-
tion allocation and because the number of registers is also
determined after register allocation. Integrating these tasks
and scheduling requires an enormously large search space
and is one of the next challenging issues.
A problem formulation that considers static power
due to leakage current is also important, and we are now
extending our method to this problem.
We expect to achieve it by changing the library
representation and the objective function such that
• the FU library includes the power dissipation in both
active and inactive steps and
• the objective function includes the power dissipation
in inactive steps for each functional unit.
APPENDIX
PROBLEM FORMULATION BASED ON INTEGER LINEAR
PROGRAMMING
Let us consider energy minimization under a time
constraint and an area one. Our approach consists of
module selection and scheduling. As described above,
module selection refers to the process of mapping opera-
tions from the DFG to component templates from the given
RTL component library that contains multiple versions of
each component corresponding to different supply vol-
tages. Basically, scheduling refers to the process of mapping
operations to control steps. As can be seen from Table 1,
multicycle operations are used for our formulation. Thus,
scheduling is extended to determining the initial control
step of each operation [13].
We use the following notations for the ILP-based
formulation:
$A_{max}$: The maximum chip area, that is, an area constraint.
$T_{max}$: The maximum number of control steps, that is, a
time constraint.
$S_j$: The $j$th control step, $1 \le j \le T_{max}$.
$O_{max}$: The total number of operations in the given DFG.
$o_i$: An operation in the DFG, $1 \le i \le O_{max}$.
$F_{max}$: The maximum number of possible types of
functional unit.
$F_k$: Possible functional unit types, $1 \le k \le F_{max}$.
$d_k$: Delay time of the functional unit of the type $F_k$.
$E_i$: The earliest control step of $o_i$ that is obtained using
as-soon-as-possible (ASAP) scheduling, assuming that each
operation is mapped to the fastest functional unit template
available in the library.
$L_i$: The latest control step of $o_i$ that is obtained using
as-late-as-possible (ALAP) scheduling, assuming that each
operation is mapped to the fastest functional unit template
available in the library.
$x_{ij}$: A 0-1 integer variable for scheduling, $1 \le i \le O_{max}$,
$E_i \le j \le L_i$. If the initial control step of $o_i$ is $S_j$, $x_{ij} = 1$;
otherwise, $x_{ij} = 0$.
$y_{i,k}$: A 0-1 integer variable for module selection,
$1 \le i \le O_{max}$, $1 \le k \le F_{max}$. If $o_i$ is mapped to the functional
unit of the type $F_k$, $y_{i,k} = 1$; otherwise, $y_{i,k} = 0$.
$E_{F_i}$: The energy consumed by a functional unit of the
type $F_i$ for an operation.
$N_{F_i}$: The number of functional units of the type $F_i$.
$A_{F_i}$: The area of a functional unit of the type $F_i$.
$t_i$: The type of an operation $o_i$.
$FU_{t_k}$: The set of functional unit templates from the given
RTL component library that can perform operations of the
type $t_k$.
$FUINDEX_{t_k}$: The set of all the functional unit indices in
$FU_{t_k}$ (i.e., $FUINDEX_{t_k} = \{ l \mid F_l \in FU_{t_k} \}$).
$optype_k$: The operation type of the functional unit
template of the type $F_k$. For example, in Table 1, $optype_1$
and $optype_3$ are “ADD” and “MUL,” respectively.
$OP_{F_k}$: The set of operations $\{ o_i \mid optype_k = t_i \}$, that is, the
set of operations that can be performed by the $F_k$-type
functional unit.
$OPINDEX_{F_k}$: The set of all the operation indices in
$OP_{F_k}$ (i.e., $OPINDEX_{F_k} = \{ i \mid o_i \in OP_{F_k} \}$).
To simplify the formulation, we assumed that energy
consumed by registers and the interconnection network is
negligible. Then, the objective function of the energy
minimization problem is defined as the energy consumed
by functional units to perform all the operations in the DFG
since the gated-clock datapath architecture is employed as
described above. The energy consumption minimization





subject to the following six constraints:
Constraint 1. The number of the initial steps of $o_i$ must be one
and the initial step must be in the range from $E_i$ to $L_i$.
$$\sum_{E_i \le j \le L_i} x_{ij} = 1, \quad \text{for } 1 \le i \le O_{max}. \qquad (5)$$




$$\sum_{k \in FUINDEX_{t_i}} y_{i,k} = 1, \quad \text{for } 1 \le i \le O_{max}. \qquad (6)$$
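Constraint 3. An operation without successors must not
finish later than $T_{max}$.
$$\sum_{E_i \le j \le L_i} (j \times x_{ij}) + \sum_{k \in FUINDEX_{t_i}} (d_k \times y_{i,k}) - 1 \le T_{max}, \quad \text{for all } o_i \text{ without successors}. \qquad (7)$$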
Note that Constraints 2 and 3 ensure that the total
number of control steps does not exceed $T_{max}$.
Constraint 4. An operation oj must be performed after
completion of an operation oi if oi is a predecessor of oj










for $i$ and $j$ that satisfy $o_i \in Pred_{o_j}$, (8)
where $Pred_{v_i}$ denotes the set of all immediate
predecessors of $v_i$.
Constraint 5. The number of operations that can be
performed by the Fk-type functional unit in each step
must not exceed the maximum number NFk of functional






for $1 \le j \le T_{max}$, $1 \le k \le F_{max}$. (9)
Constraint 6. The total area of the functional units must not
exceed the maximum chip area $A_{max}$.
$$\sum_{1 \le i \le m} (A_{F_i} N_{F_i}) \le A_{max}. \qquad (10)$$
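The area constraint (10) is a simple weighted sum and can be evaluated as follows. The area values and unit counts in this sketch are hypothetical and carry no particular unit; `total_area` and `satisfies_constraint_6` are names introduced only for the example.

```python
def total_area(A, N):
    """Sum of A_Fi * N_Fi over all functional-unit types i."""
    return sum(A[i] * N[i] for i in A)


def satisfies_constraint_6(A, N, A_max):
    """Check that the total functional-unit area does not exceed A_max."""
    return total_area(A, N) <= A_max


# Hypothetical areas A_Fi and unit counts N_Fi for m = 3 unit types.
A = {1: 100, 2: 60, 3: 400}
N = {1: 2, 2: 1, 3: 1}
```

Here the total area is 2 x 100 + 1 x 60 + 1 x 400 = 660, so the allocation is feasible for any $A_{max}$ of at least 660.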
ACKNOWLEDGMENTS
This work was supported in part by the Industrial
Technology Research Grant Program from the New Energy
and Industrial Technology Development Organization
(NEDO) of Japan.
REFERENCES
[1] A.P. Chandrakasan, S. Sheng, and R.W. Brodersen, “Low-Power
Digital CMOS Design,” IEEE J. Solid State Circuits, pp. 473-484,
Apr. 1992.
[2] K. Usami and M. Horowitz, “Clustered Voltage Scaling Technique
for Low-Power Design,” Proc. Int’l Workshop Low Power Design,
1995.
[3] A. Raghunathan, N.K. Jha, and S. Dey, High-Level Power Analysis
and Optimization. Kluwer Academic, 1997.
[4] S. Raje and M. Sarrafzadeh, “Variable Voltage Scheduling,” Proc.
1995 Int’l Workshop Low Power Design, 1995.
[5] J.-M. Chang and M. Pedram, “Energy Minimization Using
Multiple Supply Voltages,” IEEE Trans. VLSI Systems, pp. 436-
443, Dec. 1997.
[6] W.-T. Shiue and C. Chakrabarti, “Low Power Scheduling with
Resources Operating at Multiple Voltages,” IEEE Trans. Circuits
and Systems II, vol. 47, pp. 536-543, June 2000.
[7] M. Johnson and K. Roy, “Low-Power Data-Path Scheduling under
Resource,” Proc. IEEE Int’l Conf. Computer Design, 1996.
[8] Y.-R. Lin, C.-T. Hwang, and A.C.-H. Wu, “Scheduling Techniques
for Variable Voltage Low Power Design,” ACM Trans. Design of
Automation Electronic Systems, pp. 227-248, July 1997.
[9] C. Tseng and D.P. Siewiorek, “Automated Synthesis of Data Paths
in Digital Systems,” IEEE Trans. Computer-Aided Design, vol. 5,
pp. 379-395, 1986.
[10] S.Y. Kung, H.J. Whitehouse, and T. Kailath, VLSI and Modern
Signal Processing, pp. 258-264. Englewood Cliffs, N.J.: Prentice
Hall, 1985.
[11] D.E. Goldberg, Genetic Algorithms in Search, Optimization, and
Machine Learning. Addison-Wesley, 1989.
[12] D. Gajski, High-Level Synthesis. Kluwer Academic, 1992.
[13] C.T. Hwang, J.H. Lee, and Y.C. Hsu, “A Formal Approach to the
Scheduling Problem in High-Level Synthesis,” IEEE Trans.
Computer-Aided Design, vol. 10, no. 4, pp. 464-475, 1991.
Masanori Hariyama received the BE degree in
electronic engineering, the MS degree in
information sciences, and the PhD degree in
information sciences from Tohoku University,
Sendai, Japan, in 1992, 1994, and 1997,
respectively. He is currently an associate pro-
fessor in the Graduate School of Information
Sciences, Tohoku University. His research
interests include VLSI computing for real-world
application, such as robots, high-level design
methodology for VLSIs, and reconfigurable computing. He is a member
of the IEEE and the IEEE Computer Society.
Tetsuya Aoyama received the BE degree in
electronic engineering, and the MS degree in
information sciences from Tohoku University,
Sendai, Japan, in 2001 and 2003, respectively.
He joined NEC Corporation, System Devices
Research Laboratories in 2003. His research
interests include low-power hardware design
and high-level design methodology for VLSIs.
Michitaka Kameyama received the BE, ME,
and DE degrees in electronic engineering from
Tohoku University, Sendai, Japan, in 1973,
1975, and 1978, respectively. He is currently a
professor in the Graduate School of Information
Sciences, Tohoku University. His general re-
search interests include intelligent integrated
systems for real-world application and robotics,
VLSI processor architecture and high-level
synthesis, and multiple-valued VLSI computing.
He received the Outstanding Paper Awards at the 1984, 1985, 1987,
and 1989 IEEE International Symposiums on Multiple-Valued Logic, the
Technically Excellent Award from the Society of Instrument and Control
Engineers of Japan in 1986, the Outstanding Transactions Paper Award
from the IEICE in 1989, the Technically Excellent Award from the
Robotics Society of Japan in 1990, and the Special Award at the Ninth
LSI Design of the Year in 2002. He is a fellow of the IEEE and a member
of the IEEE Computer Society.
650 IEEE TRANSACTIONS ON COMPUTERS, VOL. 54, NO. 6, JUNE 2005
