



















A Period Graph Throughput Estimator for 
Multiprocessor Systems1
Neal K. Bambha and Shuvra S. Bhattacharyya
Department of Electrical and Computer Engineering, and 
Institute for Advanced Computer Studies
University of Maryland, College Park
Abstract
A critical challenge in synthesis techniques for iterative applications is the efficient analysis of performance in
the presence of communication resource contention. To address this challenge, we introduce the concept of the period
graph. The period graph is constructed from the output of a simulation of the system, with idle states included in the
graph, and its maximum cycle mean is used to estimate overall system throughput. As an example of the utility of the
period graph, we demonstrate its use in a joint power/performance optimization solution that uses either a
genetic algorithm or a simulated annealing algorithm. We analyze the fidelity of this estimator, and quantify the
speedup and optimization accuracy obtained compared to simulation.
1  Introduction
In many practical multiprocessor systems, there is contention for one or more shared communication resources.
One example of this is a shared bus: a processor must first gain access to the bus before it can execute an
interprocessor communication (IPC) operation. One consequence of this contention is that under self-timed, iterative
execution, there is no known method for deriving an analytical expression for the throughput of the system [19], and
thus simulation is required to get a clear picture of application performance. However, simulation is computationally
expensive, and it is highly undesirable to perform simulation inside the innermost optimization loop during synthesis.
To avoid such a simulation, an accurate and efficient estimator for throughput is required. This paper presents an
efficient estimator for the throughput of these systems. Our work is in the context of self-timed execution of iterative
dataflow specifications, which is an efficient and popular design methodology in the domain of digital signal
processing (DSP) [13]. An iterative dataflow specification consists of a dataflow representation of the body of a loop
that is to be iterated a large or indefinite number of times (e.g., across a vast stream of speech samples). In self-timed
execution, the assignment of tasks (dataflow graph nodes) to processors, and the execution ordering of tasks on each
processor, are determined at compile-time. At run-time, processors synchronize with one another only based on
inter-processor communication requirements, and do not necessarily synchronize at the end of each loop iteration.
In this paper, we assume that a deterministic protocol is used to arbitrate contention for communication resources.
We assume that a schedule has already been computed so the order of the tasks on the processors is known, and that we
are adjusting some task parameters that vary the task execution times in order to perform an optimization of the
system. We assume that reasonably accurate estimates are available for the task execution times, and for the variation
of execution times with parameter changes. Later in the paper, we specifically address the problem of finding an
optimum set of supply voltages for the processors in order to reduce power while satisfying a throughput constraint.

1. Technical Report UMIACS-TR-2000-49, Institute for Advanced Computer Studies, University of Maryland
at College Park, 2000. This research was sponsored in part by the US National Science Foundation.
2  Previous work
The estimates for task execution times can be obtained through several methods. The most straightforward is for the
programmer to provide them while developing a library of primitive blocks, as is done in the Ptolemy system [3].
Analytical techniques also exist. Li and Malik [14] have proposed algorithms for estimating the execution time of
embedded software in an efficient manner. Much work has been done on scheduling and binding methods for high level
synthesis [16][8]. These techniques attempt to optimize the schedule makespan, which is a suitable performance metric
for non-iterative applications or fully-static implementations, but is not ideally suited to the iterative, self-timed
context that we address in this paper. Supply voltage reduction has been used for some time in memories and consumer
electronics [15]. Chandrakasan et al. [5][6] have presented a method based on reduced voltage level operation combined
with architectural-level parallelism, showing that the throughput can be maintained while reducing power. Tiwari et
al. [21] presented a technique for estimating the power given a set of software instructions. This technique can be
used in conjunction with the approaches proposed in this paper to obtain more accurate or automated estimates for the
power consumption of the tasks in the period graph.
The period graph model is inspired by the synchronization graph model [19] for self-timed multiprocessor systems.
The synchronization graph has proven useful for a variety of techniques for minimizing synchronization overhead, for
handling interprocessor communication buffers, and for scheduling communication operations [11, 19]. The period graph
model differs from the synchronization graph model in that it explicitly models steady-state behavior under
communication resource contention, which is not accounted for in synchronization graphs.
A preliminary, partial summary of this paper has been published in [2].
3  Period Graph
If contention is resolved deterministically, and execution times are constant, then self-timed evolution may lead to an
initial transient state, but the execution will eventually become periodic. This holds because the multiprocessor may
be modeled as a finite-state system, and thus aperiodic behavior, which implies the presence of infinitely many
distinct states, cannot hold. In DSP systems, although execution times are not always constant, or known precisely,
they typically adhere closely to their respective estimates with high frequency. Under such conditions, the periodic
execution pattern obtained from the estimates is representative of actual system behavior. Due to the largely
deterministic nature of DSP applications, such system-level performance analysis, and optimization based on task-level
estimates, is common practice in the DSP design community [13].
For self-timed systems, when we apply execution time estimates to estimate overall throughput, it is necessary to
simulate (using the execution time estimates) past the transient state until a periodic execution pattern (steady
state) emerges. Unfortunately, the duration of the transient may be exponential in the size of the application
specification [19], and this makes simulation-intensive, iterative synthesis approaches highly unattractive.
The objective in this paper is to greatly reduce the rate at which simulation must be carried out during iterative
synthesis through the use of a novel period graph model. Given an assignment of task execution times, and a self-timed
schedule, the associated period graph is constructed from the periodic, steady-state pattern of the resulting
simulation. The maximum cycle mean (MCM) of the period graph (with certain adjustments) is then used as a
computationally efficient means of estimating the iteration period (the reciprocal of the throughput) as changes are
explored within a neighborhood of that assignment. In this context, the MCM is the maximum over all directed cycles of
the sum of the task execution times divided by the sum of the edge delays. The MCM can be computed in low polynomial
time [12].
The first step in the construction of the period graph is the identification of the period from the simulator output.
This can be performed by tracing backward through the simulation and searching for the latest intermediate time
instant $t_a$ at which the system state $S(t_a)$ equals the state $S(t_f)$ obtained at the end of the simulation (here,
$t_f$ denotes the simulation time limit). If no match is found, then the end of the first period exceeds $t_f$, and
thus, the simulation needs to be extended beyond $t_f$. Otherwise, the region (Gantt chart) that spans the interval
$[t_a, t_f]$ constitutes a (minimal) period of the simulated steady state.
Here, the system state $S(t)$ contains the execution state of each processor, which is either "idle" or represented by
an ordered pair $(A, \tau)$, where $A$ is the task being executed at time $t$, and $\tau$ denotes the time remaining
until the current invocation of $A$ is completed. The state $S(t)$ also contains the current buffer sizes of all IPC
buffers, as well as any information (e.g., request queue status) that is used by the protocol for resolution of
communication contention. Our approach for efficiently determining this period is as follows:
• Perform a simulation of the schedule for some time $T_{sim}$. Define a constant $C$, which is an initial estimate for
the number of complete cycles (invocations) of the graph that must be simulated in order to find a period. This
constant represents the length of the initial transient, before the output becomes periodic. If this initial estimate
is too low, it will be increased during the algorithm. Let $N$ be the number of processors, and let $n_j$ be the number
of tasks scheduled on processor $j$, where $j \in [1, N]$. Tasks include IPC tasks as well as computational tasks.
Label these tasks $V_{1j}, V_{2j}, \ldots, V_{n_j j}$, and consider the case where the system executes these tasks
infinitely. The invocation number of a task is defined as the number of times a given task has executed, and is denoted
with a superscript. For example, $V_{a(j)}^{b(j)}$ denotes the $b$th invocation of task $a$ on processor $j$. Define a
simulation array $Sim_j[i]$ for each processor, where $i \in [1, M_j]$ and $M_j$ is the number of tasks on processor
$j$ that were output by the simulator. The elements of the simulation array are the task invocations output by the
simulator, ordered such that $Start(Sim_j[i]) > Start(Sim_j[i+1])$.








• Create two idle vectors of length $n_j$ for each processor spanning one invocation. Label the first idle vector
$Idle_j^1[k]$, where $k \in [1, n_j]$. Label the second idle vector $Idle_j^2[k]$.
• Examine the IPC buffer vector at some fixed point of each idle vector. The IPC buffer vector consists of the number
of tokens queued on all the IPC edges of the graph enumerated in some order. The IPC buffer vector must be output by
the simulator at least once every graph iteration. For example, the simulator could output an IPC buffer vector for
each processor every time the processor executes the first task scheduled on it. In this way, each idle vector would
be associated with one IPC buffer vector. Label these vectors $IPCBuf1_j[q]$ and $IPCBuf2_j[q]$, where $q \in [1, E]$
and $E$ is the number of edges in the IPC graph. The IPC buffer vector represents the state of the communication
buffers in the system. Let $Tokens(e, t)$ be the number of data tokens on edge $e$ at time $t$. Let $TaskNum_j(t)$ be
the number of the node that is executing on processor $j$ at time $t$. Pseudo-code for constructing the period is shown
in Figure 2.
Our experience suggests that in practice, most graphs have periods spanning only a few invocations, so the above
method for finding the period is efficient. For a system with a period that spans $N$ invocations and with at most $L$
tasks per processor, this method requires $LN(N+1)$ comparisons.
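The backward search for the period can be illustrated with a small sketch. It assumes the simulator output has already been sampled into a list of hashable state snapshots, one per time step, each combining the per-processor execution state and the IPC buffer counts. The function name `find_period` and the toy trace are hypothetical and are not taken from the paper's pseudo-code.

```python
from typing import Hashable, Optional, Sequence, Tuple


def find_period(states: Sequence[Hashable]) -> Optional[Tuple[int, int]]:
    """Return (t_a, t_f), where t_f is the last sample and t_a is the latest
    earlier sample whose state equals the state at t_f. Return None if no
    earlier sample matches, meaning the simulation must be extended."""
    if not states:
        return None
    t_f = len(states) - 1
    final_state = states[t_f]
    for t_a in range(t_f - 1, -1, -1):  # trace backward from the end
        if states[t_a] == final_state:
            return t_a, t_f
    return None


# Toy trace (assumed data): each snapshot is
# (per-processor state, IPC buffer token counts).
# A processor state is "idle" or a pair (task, time remaining).
trace = [
    ((("A", 2), "idle"), (0,)),
    ((("A", 1), "idle"), (0,)),
    ((("B", 1), ("C", 2)), (1,)),
    ((("A", 1), ("C", 1)), (0,)),
    ((("B", 1), ("C", 2)), (1,)),
    ((("A", 1), ("C", 1)), (0,)),
]

period = find_period(trace)
```

Here the first two samples form the transient, and the steady state repeats every two samples. Comparing whole snapshots is simpler than comparing idle vectors and IPC buffer vectors per invocation, at the cost of more comparisons.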














































































Figure 2. Pseudo-code for constructing the period (surviving statements):
    for (t = 0; t < T_sim; t++)
    for j = 1 ... N
    a_j = TaskNum_j(t)
    invocation_j[a]++
    b_j = invocation_j[a]
    TaskNum_j(t) > TaskNum_j(t - 1)
    Idle_j^1[k] = Finish(Sim_j[k]) - Start(Sim_j[k + 1])
    Idle_j^2[k] = Finish(Sim_j[span*n_j + k]) - Start(Sim_j[span*n_j + 1 + k])
    for q = 1 ... E
    IPCBuf1[q] = Tokens(q, Start(Sim_1[1]))
    Π = 1, IPCBuf1 = IPCBuf2






















Figure 1(c) shows the periodic steady state that results from the schedule of Figure 1(a) and the execution times
shown in Figure 1(b); and Figure 1(d) shows the resulting period graph. The nodes in Figure 1(d) that contain
stripes correspond to idle time ranges in the period, and solid black circles on edges represent delays, which model
inter-iteration dependencies. Note that the steady state period may span multiple graph iterations (2 in this
example), and in the period graph, this translates to multiple instances of each application graph task.
For clarity in this illustration, we have assumed negligible latency associated with IPC. As described below,
non-negligible IPC costs can easily be accommodated in the period graph model by introducing send and receive tasks at
appropriate points.
As illustrated in Figure 1, the period graph consists of all the tasks comprising the period that was detected, with
the idle time ranges between tasks (including those that are caused by communication contention) also treated as nodes
in the graph. The nodes are connected by edges in the order that they appear in the period. An edge is placed from the
last node in the period for each processor to the first node in the period. This edge is given a delay value of one (to
model the transition between period iterations), while all of the other intraprocessor edges have delay values of zero.
This is done for all the processors in the system. Our model utilizes send and receive nodes for IPC. For each IPC
point, a send node is placed on the processor that is sending data, and a corresponding receive node is placed on the
processor that will receive the data. The period graph is completed by adding an edge from each send node to its
corresponding receive node.
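To make the MCM estimate concrete, the following sketch computes the maximum over all directed cycles of (sum of node execution times) / (sum of edge delays) for a small period graph. It uses a binary search on the ratio with a Bellman-Ford style positive-cycle test, which is one standard way to compute such a cycle ratio and is not necessarily the method of [12]. The graph, node names, and execution times are made-up illustration data, and the sketch assumes every cycle carries at least one delay.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str, int]  # (source node, sink node, delay)


def has_positive_cycle(nodes: List[str], edges: List[Edge],
                       exec_time: Dict[str, float], lam: float) -> bool:
    """True if some cycle has sum(exec_time) - lam * sum(delay) > 0."""
    dist = {v: 0.0 for v in nodes}  # a virtual source reaches every node
    for _ in range(len(nodes)):
        changed = False
        for u, v, delay in edges:
            # Charge the execution time of u to its outgoing edge.
            w = exec_time[u] - lam * delay
            if dist[u] + w > dist[v] + 1e-12:
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            return False  # longest paths converged, so no positive cycle
    return True  # still relaxing after |V| passes


def max_cycle_mean(nodes: List[str], edges: List[Edge],
                   exec_time: Dict[str, float], iterations: int = 60) -> float:
    """Binary search for the largest cycle ratio (assumes delay >= 1 per cycle)."""
    lo, hi = 0.0, float(sum(exec_time.values()))
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if has_positive_cycle(nodes, edges, exec_time, mid):
            lo = mid
        else:
            hi = mid
    return hi


# Hypothetical two-processor period graph. "I" is an idle node.
nodes = ["A", "I", "B", "C", "D"]
exec_time = {"A": 2.0, "I": 1.0, "B": 3.0, "C": 4.0, "D": 1.0}
edges = [
    ("A", "I", 0), ("I", "B", 0), ("B", "A", 1),  # processor 1, wrap edge has delay 1
    ("C", "D", 0), ("D", "C", 1),                 # processor 2, wrap edge has delay 1
    ("A", "D", 0), ("D", "B", 0),                 # IPC edges
]

mcm = max_cycle_mean(nodes, edges, exec_time)
```

In this example the critical cycles are A, I, B and A, D, B, each with total time 6 and one delay, so the estimated iteration period is 6 time units.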
4  Fidelity of the estimator
We calculate the fidelity of the period graph estimator as the task execution times are varied. Here, we use the
example of varying the processor voltages in order to change the task execution times. When the voltage on a processor
is varied, the execution time of a computational task varies according to

$$t = K \frac{V_{dd}}{(V_{dd} - V_t)^2} \qquad (1)$$

where $V_{dd}$ is the supply voltage, $V_t$ is the threshold voltage, and $K$ is a constant [6]. We use a fixed value
for the threshold voltage. The execution time $t$ of each of these states in the original (non-scaled) period graph is
referenced to voltage $V_{ref}$. The change in execution time of each computation node is found by taking the
derivative:

$$\Delta t = \left. \frac{\partial t}{\partial V_{dd}} \right|_{V_{dd} = V_{ref}} (V_{new} - V_{ref}) \qquad (2)$$

where $V_{new}$ is the new voltage. It is not obvious, however, how one should adjust the idle times in the period
graph. We separate the idle nodes into two sets: contention idles and data idles. When a node has the necessary data to
execute (the necessary data has already been produced), but is idle waiting for access to the bus, the associated idle
node is classified as a contention idle. When a node is idle waiting for its predecessors' data, the associated idle
node is classified as a data idle. By experimenting with a large number of application graphs, we found that we could
capture the effects of contention by treating the two kinds of idle nodes differently when the execution times are
scaled. Using these rules, the fidelity is calculated as follows:
• Given an application graph, construct a valid schedule. We used the dynamic level scheduling algorithm given in [17].
Next, construct the period graph as discussed earlier. Generate a set of voltage vectors (assignments of voltages to
the processors in the target architecture). For each voltage vector, perform a simulation to determine the throughput,
with the execution times of the tasks on each processor given by (1) according to the voltage on the processor. Also,
obtain an estimate for the throughput by calculating the MCM of the voltage-scaled period graph, in which the execution
times of the computation nodes are given by (1), and the execution times of the idle nodes are as explained above.
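The voltage dependence of execution time can be sketched numerically. The code below assumes the standard CMOS delay model associated with [6], in which delay is proportional to V_dd / (V_dd - V_t)^2. The threshold voltage of 0.8 V, the reference voltage of 5 V, and all function names are illustrative assumptions. The sketch compares exact rescaling of a reference execution time with the first-order (derivative-based) update.

```python
def exec_time(v_dd: float, v_t: float, k: float = 1.0) -> float:
    """Assumed delay model: t = K * V_dd / (V_dd - V_t)^2."""
    return k * v_dd / (v_dd - v_t) ** 2


def d_exec_time(v_dd: float, v_t: float, k: float = 1.0) -> float:
    """Derivative of the model above with respect to V_dd."""
    return -k * (v_dd + v_t) / (v_dd - v_t) ** 3


def scaled_time(t_ref: float, v_ref: float, v_new: float, v_t: float) -> float:
    """Exact rescaling of an execution time measured at v_ref."""
    return t_ref * exec_time(v_new, v_t) / exec_time(v_ref, v_t)


def linearized_time(t_ref: float, v_ref: float, v_new: float, v_t: float) -> float:
    """First-order update: t_ref + (dt/dV at v_ref) * (v_new - v_ref)."""
    k = t_ref / exec_time(v_ref, v_t)  # calibrate K from the reference point
    return t_ref + d_exec_time(v_ref, v_t, k) * (v_new - v_ref)


V_T = 0.8     # assumed threshold voltage (volts)
V_REF = 5.0   # assumed reference supply voltage (volts)
T_REF = 10.0  # execution time at V_REF, arbitrary units

exact = scaled_time(T_REF, V_REF, 4.75, V_T)        # 5% voltage reduction
approx = linearized_time(T_REF, V_REF, 4.75, V_T)
```

For a 5% voltage reduction the two values differ by well under one percent of the execution time, which is consistent with using a linearized update only within a small neighborhood of the reference voltage.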





In the fidelity expression (equation 3), one set of values denotes the simulated throughputs, and the other set denotes
the corresponding estimates from the period graph.
Figure 3 plots the fidelity for a six-processor system in which the voltage on the individual processors can vary by
plus or minus five percent. The x-axis represents the sum of the absolute values of the voltage changes over all
processors. Each point on the graph is a fidelity calculation for a set of voltage vectors. A value of one is a
"perfect" fidelity. It can be seen that in the range shown, the fidelity is always greater than 0.65. It is also
important that the estimator have a small error at each point. Figure 4 plots the estimation error (6).
It can be seen that the error increases as the voltage vector moves away from the reference point, and that the
estimator is slightly biased. For the range shown in the graphs, where each processor voltage is changed by a maximum
of fifteen percent, the error is less than four percent.
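A common way to define the fidelity of an estimator is the fraction of pairs of design points that the estimator ranks in the same order as the reference measurement. The sketch below assumes that pairwise sign-agreement definition. It is offered as an illustration and may differ in normalization from the expression used for Figure 3. The function names and sample data are hypothetical.

```python
from itertools import combinations
from typing import Sequence


def sign(x: float) -> int:
    """Return +1 if x > 0, -1 if x < 0, and 0 if x == 0."""
    return (x > 0) - (x < 0)


def fidelity(simulated: Sequence[float], estimated: Sequence[float]) -> float:
    """Fraction of pairs (i, j) whose ordering agrees between the simulated
    values and the estimates (assumed definition)."""
    pairs = list(combinations(range(len(simulated)), 2))
    agree = sum(
        1 for i, j in pairs
        if sign(simulated[i] - simulated[j]) == sign(estimated[i] - estimated[j])
    )
    return agree / len(pairs)


# Made-up throughput values for three voltage vectors.
simulated = [1.0, 2.0, 3.0]
perfect = fidelity(simulated, [10.0, 20.0, 30.0])  # same ranking
partial = fidelity(simulated, [10.0, 30.0, 20.0])  # one pair swapped
```

A value of one means the estimator never misorders two candidate solutions, which is the property an optimizer relies on when it uses the estimate in place of simulation.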
5  Using the Period Graph in a Joint Power/Performance Algorithm
An effective way to reduce power consumption of a processor core in CMOS technology is to lower the supply voltage
level, which exploits the quadratic dependence of power on voltage [6]. Reducing the supply voltage also has the
effect of increasing the execution times of the tasks, as described by (1).























$$\mathrm{sign}(x) = \begin{cases} +1 & \text{if } x > 0 \\ -1 & \text{if } x < 0 \\ 0 & \text{if } x = 0 \end{cases}$$













Figure 3. Plot of fidelity (equation 3) for six processor system vs. magnitude of voltage change.
[Plot title: Fidelity - 6 processors each changing at most 15%. X-axis: Sum of Absolute Value of % Change in Voltage on all Processors.]

Figure 4. [X-axis: Sum of Absolute Value of % Change in Voltage on all Processors.]






























The power dissipated in a CMOS processor is modeled as $P = \alpha C_L V_{dd}^2 f$,
where $f$ is the clock frequency, $C_L$ is the load capacitance, and $\alpha$ is the switching activity [6]. To
accommodate the possibility of putting processors in states of lower switching activity during idle periods, our model
includes one switching activity parameter for the idle states, and another for the computational tasks. A more detailed
power analysis could assign a different switching activity to each computational task if that data were available. A
different power optimization technique, which can be used in conjunction with the voltage scaling technique presented
here, utilizes a nearly complete processor shutdown during the idle periods [10][20]. In our model, this would
correspond to a near-zero switching activity for the idle states. Our model for the power is the average energy
consumption per graph iteration period. This corresponds in a typical DSP system to the average energy required to
process one sample. Here, the energy of a task equals its power times its execution time.
In a system consisting of multiple processors, one has the ability to choose, within a certain range, the
operating voltage on each processor. This opens up an additional degree of freedom that can be exploited to reduce
the system power consumption. By choosing a lower voltage on a processor that is executing tasks that are not on the
critical path, the throughput can remain unchanged while the overall power consumption is reduced. In general, a
combination of raising voltages on some processors while lowering others can yield the most attractive
power/performance solution.
When applying voltage scaling to a multiprocessor system, the valid solution space is typically much too large
to search by brute-force methods. In addition, since there is no general analytical formula for calculating the
throughput of these systems in the presence of communication resource contention, each candidate solution must be
simulated or estimated using some heuristic.
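The energy-per-period power model can be written out as a short sketch. It assumes the usual CMOS dynamic power expression P = alpha * C_L * V_dd^2 * f with separate switching activities for task nodes and idle nodes. For simplicity it holds the clock frequency fixed and takes node durations as given. All names and numeric values are illustrative assumptions.

```python
from typing import List, Sequence, Tuple

# (kind, processor index, duration in seconds); kind is "task" or "idle"
Node = Tuple[str, int, float]


def node_power(alpha: float, c_load: float, v_dd: float, freq: float) -> float:
    """Dynamic power in watts: alpha * C_L * V_dd^2 * f."""
    return alpha * c_load * v_dd ** 2 * freq


def energy_per_period(nodes: List[Node], voltages: Sequence[float],
                      alpha_task: float, alpha_idle: float,
                      c_load: float, freq: float) -> float:
    """Sum of power * duration over every node in one period."""
    total = 0.0
    for kind, proc, duration in nodes:
        alpha = alpha_task if kind == "task" else alpha_idle
        total += node_power(alpha, c_load, voltages[proc], freq) * duration
    return total


# Hypothetical period: two nodes on processor 0, one on processor 1.
period_nodes = [("task", 0, 2e-3), ("idle", 0, 1e-3), ("task", 1, 3e-3)]
voltages = [5.0, 4.0]

energy = energy_per_period(period_nodes, voltages,
                           alpha_task=0.5, alpha_idle=0.1,
                           c_load=1e-9, freq=1e8)
```

Setting `alpha_idle` close to zero models the near-complete shutdown case, in which idle nodes contribute almost nothing to the energy per period.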
6  Genetic algorithm formulation
To demonstrate the general utility of the period graph based performance estimation approach, we incorporate
it into two significantly different probabilistic search techniques to derive two different algorithms for system
voltage scaling. The first algorithm presented utilizes the framework of genetic algorithms (GAs) [1]. The specific
GA explored here consists of an inner GA nested within an outer GA. The inner GA performs a local search around a
point from the population of the outer GA, using the MCM of the period graph in its objective function as an estimate
for the throughput. A period constraint $T_{constraint}$ is given as an input to the optimization problem, where the
period is the reciprocal of the throughput. The objective function calculates the power consumption associated with a
solution by calculating the total energy per period, as discussed earlier. If the period $T_{solution}$ associated with
a solution violates the period constraint $T_{constraint}$, the power consumption is multiplied by a large penalty
factor $\exp(100(T_{solution} - T_{constraint}))$. The GA attempts to minimize this objective function.
In the outer loop, a population of $N_{outer}$ voltage vectors is generated. A simulation is run and a period graph is
constructed for each of these outer loop voltage vectors. For each of the outer loop voltage vectors, a new inner
population is generated in which the voltage on each processor stays within a user-defined threshold of the voltage on
that processor in the outer population (7). The inner population size is $N_{inner}$. The inner GA then performs a
local search using this population for a number of generations $Generations_{inner}$ in an attempt to find a locally
optimal voltage vector. The inner GA uses the MCM of the period graph in its objective function. After an invocation of
the inner GA is finished, one simulation is performed using the resulting voltage vector, and the actual throughput for
this vector is used to compute its fitness. The outer loop voltage vector is then replaced with this locally-optimized
voltage vector for use in the next outer loop generation. The outer loop is run for a number of generations
$Generations_{outer}$.
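The penalized objective described in this section can be sketched directly. The function below multiplies the energy per period by exp(100 (T_solution - T_constraint)) when the period constraint is violated and leaves it unchanged otherwise. The function and argument names are hypothetical, and the numbers are made up for illustration.

```python
import math


def objective(energy: float, t_solution: float, t_constraint: float) -> float:
    """Energy per period, penalized when the period constraint is violated."""
    if t_solution > t_constraint:
        return energy * math.exp(100.0 * (t_solution - t_constraint))
    return energy


feasible = objective(2.0, t_solution=1.00, t_constraint=1.0)   # no penalty
violating = objective(2.0, t_solution=1.01, t_constraint=1.0)  # scaled by e
```

Because the penalty grows exponentially with the size of the violation, a solution that misses the period constraint by even a small margin is strongly disfavored, while feasible solutions are compared on energy alone.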
7  Simulated annealing algorithm
Simulated annealing is another well-known method for searching large design spaces. Using a standard simulated
annealing package [4], we have implemented an alternative version of period-graph-based voltage scaling optimization.
The objective function here is the same as for the genetic algorithm. The system is first simulated at an initial
voltage vector, and the period graph is built. In order to ensure that the period graph will be an accurate enough
estimator, a resimulation threshold is maintained. The difference between the current input to the objective function,
and the voltage vector corresponding to the simulation used to compute the current period graph, is calculated. If
this difference exceeds the resimulation threshold (8), the graph is resimulated using the current input. The period
graph is rebuilt, and the current input becomes the new reference voltage vector. For a threshold of zero, the graph
will be resimulated every time, and the period graph will offer no speed advantage. The larger the value of the
threshold, the less often the graph will be resimulated, and the faster the optimization algorithm will perform.
However, when the threshold is too large, the fidelity of the period graph estimate will be unacceptably low and the
quality of the final result will suffer. Based on our experiments with a number of graphs, the optimal value of the
threshold is highly application-dependent, but a value of 10% generally gives good results.
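The resimulation-threshold logic can be sketched as a small caching wrapper around the simulator. The distance measure used here, the largest relative change on any one processor, is an assumption made for illustration, as are the class name, the stand-in simulator, and the voltage values. Using `>=` in the comparison means a threshold of zero triggers a resimulation on every call.

```python
from typing import Callable, Optional, Sequence, Tuple

Voltages = Sequence[float]
Estimator = Callable[[Voltages], float]


def relative_change(v: Voltages, v_sim: Voltages) -> float:
    """Assumed distance: largest relative voltage change on any processor."""
    return max(abs(a - b) / b for a, b in zip(v, v_sim))


class PeriodEstimator:
    """Re-runs the (expensive) simulation only when the input has drifted
    at least `threshold` away from the last simulated voltage vector."""

    def __init__(self, simulate_and_build: Callable[[Voltages], Estimator],
                 threshold: float) -> None:
        self.simulate_and_build = simulate_and_build  # hypothetical hook
        self.threshold = threshold
        self.v_sim: Optional[Tuple[float, ...]] = None
        self.estimate: Optional[Estimator] = None
        self.simulations = 0

    def period(self, v: Voltages) -> float:
        if self.v_sim is None or relative_change(v, self.v_sim) >= self.threshold:
            self.estimate = self.simulate_and_build(v)  # rebuild period graph
            self.v_sim = tuple(v)
            self.simulations += 1
        return self.estimate(v)


def toy_simulate_and_build(v_sim: Voltages) -> Estimator:
    """Stand-in for simulation plus period graph construction."""
    return lambda v: sum(1.0 / x for x in v)


est = PeriodEstimator(toy_simulate_and_build, threshold=0.10)
for v in [(5.0, 5.0), (5.2, 5.0), (4.0, 5.0)]:
    est.period(v)
```

With a 10% threshold, the second vector (a 4% change) reuses the existing period graph, while the third (a 20% change) forces a new simulation, so only two simulations run for three evaluations.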
8  Results
Figure 5 shows an example of the reduction in power resulting from the genetic optimization algorithm applied to the
FFT3 application graph (Figure 10). The parameters of the GA were $N_{outer} = N_{inner} = 50$,
$Generations_{outer} = 10$, and $Generations_{inner} = 20$. The local search voltages were constrained to be within
five percent of the corresponding outer loop voltages. The period constraint was calculated by simulating the system
with all six processors operating at voltage $V_{ref}$. For this example, the system power consumption was reduced by
43%, while maintaining the same throughput. To evaluate the advantage of the period graph approach over using
brute-force simulation, a second nested GA was implemented. This algorithm was identical to the algorithm discussed
above, except that the inner loop did not use the period graph estimate for the throughput. Instead, each voltage
vector was evaluated by simulation. This algorithm consumed 26 times more CPU time, and produced similar results, as
shown in Figure 6.

Figure 5. [Plot title: Genetic algo. fft3 (fixed throughput constraint) using period graph. Y-axis: P/P0. X-axis: Iteration Number (1000 generations/iteration) (6 minutes cputime/iteration).]

Figure 6. [X-axis: Iteration Number (1000 generations/iteration) (126 minutes cputime/iteration).]

Figure 7 summarizes the power reduction results for the simulated annealing algorithm applied to a fast Fourier
transform (FFT3) application graph, for different values of the resimulation threshold. It can be seen that as the
threshold is increased, the algorithm progresses more quickly. The simulated annealing algorithm begins with a
'melting' routine, where the temperature is increased until a phase change is detected. The initial flat part of the
curves corresponds to the time spent in the melting routine. We have found that for values of the threshold above 20%,
the period graph is not an accurate enough estimator and the algorithm does not converge.































                       Resimulation threshold
Application (nodes)    0       2%      5%      10%     25%
fft1 (28)              0.96    0.95    0.65    0.6     1
fft2 (28)              0.97    0.9     0.71    0.97    1
fft3 (28)              1       0.77    0.59    0.59    1
mus (20)               0.89    0.71    0.67    0.82    1
meas (12)              0.77    0.73    0.81    0.82    1
qmf (14)               0.84    0.65    0.67    0.73    1
rand1 (30)             0.91    0.77    0.53    0.65    1
rand2 (100)            1       0.85    0.77    0.73    1
rand3 (200)            1       1       1       0.94    1























Table 1 lists the simulated annealing results for several applications using different values of the resimulation
threshold. At the start of the optimization, all processor voltages were set at 5 volts. The throughput at this point
was used as the throughput constraint. In the table, the first three rows correspond to three different FFT
implementations, mus refers to a music synthesis algorithm, qmf refers to a quadrature mirror filter bank, meas is a
measurement application, and the last three rows correspond to graphs that were generated using Sih's algorithm for
randomly generating application graphs [18]. The numbers in parentheses give the numbers of nodes in these
applications. The optimization was performed for a fixed time of 30 minutes in each case.
The optimum resimulation threshold was between 2% and 10% in all cases. For a threshold of 25%, the period graph was
not a good estimator and none of the results returned during the optimization algorithm satisfied the throughput
constraint. For the largest graph, the fixed simulation time was not long enough to make much improvement, but the best
result occurred for a threshold of 10%, where the simulations are less frequent. Table 2 summarizes the power reduction
for the genetic algorithm with and without using the period graph, with a fixed compile time of one hour.
9  Conclusion
This paper has explored a period graph model that enables efficient voltage scaling optimization for self-timed
implementations of iterative applications. The period graph can be used as a computationally efficient estimator for
the throughput in multiprocessor systems in which communication contention renders exact analysis too time consuming.
This model is especially useful in iterative synthesis techniques, such as those based on probabilistic search.
Our paper has demonstrated effective voltage scaling techniques based on incorporating the period graph into genetic
algorithm and simulated annealing formulations. Other optimizations, such as exploiting memory/speed tradeoffs in
the individual tasks, are also possible. These may be more appropriate to the genetic algorithm and simulated
annealing framework, as a larger set of independent moves is available during optimization. Other useful directions
for further work include integrating the period graph model into the scheduling phase, rather than restricting its
use to voltage scaling of fixed schedules, and the investigation of adaptive methods for dynamically adjusting the
frequency of resimulation. The application graphs are shown in Figures 8, 9, 10, 11, 12, 13, 14, and 15.
10  References
[1] T. Back, U. Hammel, and H. Schwefel. "Evolutionary computation: Comments on the history and current state."
IEEE Transactions on Evolutionary Computation, April 1997.
[2] N. K. Bambha and S. S. Bhattacharyya. A joint power/performance optimization technique for multiprocessor
systems using a period graph construct. In Proceedings of the International Symposium on Systems Synthesis, Madrid,
Spain, September 2000. To appear.
[3] J. T. Buck, S. Ha, E. A. Lee, and D. G. Messerschmitt. Ptolemy: A framework for simulating and prototyping
heterogeneous systems. International Journal of Computer Simulation, January 1994.
[4] Carter, Everett, Taygeta Scientific Inc. http://www.taygeta.com/annealing/simanneal.html
[5] A. P. Chandrakasan, M. Potkonjak, R. Mehra, J. Rabaey, and R. Brodersen, "Optimizing power using
transformations," IEEE Trans. Computer-Aided Design, vol. 14, no. 1, 1995.











































Table 2. Genetic algorithm (optimized power)/(initial power) for fixed compile time.


























Figure 12. Meas application graph.
Figure 13. rand1 (30 nodes) application graph.















































[6] A. P. Chandrakasan, S. Sheng, and R. W. Brodersen, "Low-power CMOS digital design," IEEE Journal of Solid-State
Circuits, 27(4):473-484, 1992.
[7] J. M. Chang and M. Pedram, "Register allocation and binding for low power," Design Automation Conf., June
1995.
[8] A. Dasgupta and R. Karri, "Simultaneous scheduling and binding for power minimization during microarchitectural
synthesis," in Proc. Intl. Symp. Low Power Design, Apr. 1995.
[9] L. Goodby, A. Orailoglu, and P. M. Chau, "Microarchitectural synthesis of performance-constrained low-power
VLSI designs," in Proc. Int. Conf. Computer Design, Oct. 1994.
[10] C. Hwang and A. C.-H. Wu. "A predictive system shutdown method for energy saving of event-driven
computation." International Conference on Computer-Aided Design, 1997.
[11] M. Khandelia and S. S. Bhattacharyya. Contention-conscious transaction ordering in embedded multiprocessors.
In Proceedings of the International Conference on Application Specific Systems, Architectures, and Processors, pages
276-285, Boston, Massachusetts, July 2000.
[12] E. L. Lawler. Combinatorial Optimization. Holt, Rinehart and Winston, 1976.
[13] E. A. Lee and S. Ha. Scheduling strategies for multiprocessor real time DSP. Global Telecommunications
Conference, November 1989.
[14] Y. S. Li and S. Malik. Performance analysis of embedded software using implicit path enumeration. In
Proceedings of the Design Automation Conference, 1995.














































































































[15] P. Macken, M. Degrauwe, M. Van Paemel, and G. Oguey, "A voltage reduction technique for digital systems,"
in Proc. IEEE Intl. Solid-State Circuits Conf., 1990.
[16] A. Raghunathan and N. K. Jha, "Behavioral synthesis for low power," in Proc. Intl. Conf. Computer Design, Oct.
1994.
[17] G. C. Sih and E. A. Lee. A compile-time scheduling heuristic for interconnection-constrained heterogeneous
processor architectures. IEEE Transactions on Parallel and Distributed Systems, 4(2):75-87, February 1993.
[18] G. C. Sih, "Multiprocessor Scheduling to Account for Interprocessor Communication," Ph.D. thesis, Dept.
EECS, U. C. Berkeley, 1991.
[19] S. Sriram and S. S. Bhattacharyya, Embedded Multiprocessors: Scheduling and Synchronization. Marcel Dekker,
Inc., 2000.
[20] M. Srivastava, A. P. Chandrakasan, and R. W. Brodersen. "Predictive system shutdown and other architectural
techniques for energy efficient programmable computation." IEEE Transactions on VLSI Systems, 4(1):42-55, 1996.
[21] V. Tiwari, S. Malik, and A. Wolfe, "Power Analysis of Embedded Software: A First Step Towards Software
Power Minimization," IEEE Trans. VLSI, December 1994.
