FSMD Partitioning for Low Power using Simulated Annealing
Nainesh Agarwal and Nikitas Dimopoulos
Dept. of Elec. and Comp. Engineering
University of Victoria
Victoria, BC, Canada
{nagarwal,nikitas}@ece.uvic.ca
Abstract—It is well known that significant power savings
can be obtained by disabling or shutting down parts of a
circuit during idle periods. One method is to use a high-level
partitioning technique which considers both the controller and
the datapath together. The FSMD is split into two or more
simpler communicating processors. These separate processors
can then be clock gated or power gated to achieve dramatic
power savings since only one processor is active at any given time.
Here, we propose a technique which uses simulated annealing
to efficiently partition an FSMD for power gating. We formulate
the partitioning as a non-linear programming problem and use
this model to partition four application circuits. We then
develop a framework to estimate the potential power savings.
The estimation framework shows that up to 69% static power
savings and 30% dynamic power savings can be expected.
I. INTRODUCTION
To reduce power dissipation in VLSI circuits, power gating
has been shown to be an effective technique [1]. Power gating
relies on the detection of idle periods in various parts of the
circuit, during which the supply voltage to the appropriate
circuit components can be switched off to conserve leakage power.
Here we focus on the class of sequential circuits characterized
by an FSMD (Finite State Machine with Datapath)
representation. Partitioning is one technique used to facilitate
logic isolation in FSMD circuits [2]. Once partitioned, the
isolated circuit components are switched off [3] or clock gated
[2], [4] to conserve static or dynamic power, respectively. Two
methods are generally employed for the partitioning of these
sequential circuits. The first method relies on disabling parts of
the FSM (Finite State Machine) controller. Here, the controller
is partitioned into two or more mutually exclusive FSMs. Each
partition is then selectively clock gated [4] or power gated [3].
Thus, only one FSM is active at any given time, while the
others are idle and their clocks are stopped, or their power is
gated off. The second method tries to discover idle periods
in one or more datapath components of the circuit. These
components can then be clock gated or power gated. In [5],
idle periods in the ALU are discovered and for these periods
the ALU is power gated. In [6], individual registers are clock
gated, while in [7], individual registers are power gated.
Although gating parts of either the controller or the datapath
has been shown to be highly effective in reducing power,
further savings can be achieved if both the controller and
the datapath are considered together. This was proposed in
[2], where on average 42% power reduction was observed.
However, in [2], a simple heuristic was used in a branch and
bound method to partition the FSMD. Further, the method was
more suited for a clock gating environment. We use a more
detailed model with the aim of achieving better power
reduction. Also, our model is well suited for a power gating
environment where static power is of significant concern. A
comparison between our results and those from [2] is difficult
since the development platform and the benchmark circuits
used differ significantly. We are working on a framework to
allow a suitable comparison between the two methods.
We formulate FSMD partitioning as a non-linear programming
problem. Our objective is to maximize the isolation of
circuit components by minimizing the communication between
the partitioned FSMDs. This maximizes the number of components
that can be put to sleep, thus reducing the overall
power dissipation. In [4] an ILP-based partitioning technique is
proposed for just the FSM component of the circuit. This ILP
formulation relies on state transition probabilities, which can
be difficult to obtain in practice. We do not use state transition
probabilities in our model.
II. FSMD PARTITIONING
The proposed partitioning technique works at the behavioral
level, before synthesis. The FSMD described at the behavioral
level is split into two or more separate FSMD units. At any
given time, only one FSMD is active while the others are
powered off. This results in significant power savings (static
and dynamic) [2].
If there are data components, namely registers, that are used
in multiple partitions, they must be kept alive whenever any of
those partitions is active. Ideally, data components should be
isolated as much as possible so that they can be turned off for
as long as possible. Further, when data components are shared between
FSMDs, their updated values need to be communicated to the
newly activated FSMD. This communication overhead results
in power dissipation that should be minimized.
Another important criterion is to minimize the number of
transitions between partitions. Each time a transition occurs we
not only have the communication penalty, but also encounter a
startup delay whereby the capacitances of the newly activated
FSMD are charged up. To reduce this performance penalty a
lookahead mechanism can be used, which is described briefly
in section IV.

To minimize these adverse effects we need to efficiently
partition the FSMD to reduce the number of shared data
components. To achieve such an efficient partition we have
formulated a non-linear programming problem and have used
the simulated annealing (SA) algorithm to solve it. Our
objective is to minimize the number of shared components
between partitions and also minimize the number of possible
transitions between the partitions.
III. PROBLEM FORMULATION
Let P be a finite state machine with datapath (FSMD) consisting
of a set of N finite states, S = {s_1, ..., s_N}, and a set of
transitions. We represent the transitions (edges) of the FSMD
by binary variables E_ij, where E_ij is 1 if and only if there
exists an edge or transition from state s_i to s_j.
Further, let VAR be the set of storage variables which are
used in the various expressions and statements of the FSMD.
Each partition of P is a subset of S, along with the
transitions related to the states in that subset. Let the number of
partitions be M. The partitions can then be identified as
P_k for all k ∈ [1,M]. It is our goal here to partition machine
P into submachines (partitions) P_k such that the interaction
between these partitions is minimized.
Given an FSMD, we first determine the set of variables that
are shared between the various states. A variable v ∈ VAR
is considered shared between states s_i and s_j if it
is read or written in state s_i and read or written in state s_j.
We refer to the bits of such a shared variable v as transition
bits. We represent the total number of transition bits between
states s_i and s_j as T_ij.
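As an illustration of this definition, the transition-bit matrix can be derived from a record of which variables each state accesses. The following Python sketch uses a hypothetical three-state machine; the names `widths`, `uses` and `transition_bits` are illustrative assumptions and are not part of CoDeL. The diagonal is set to zero here, which is also an assumption (a state never lies across a partition boundary from itself, so the diagonal does not affect the partitioning).

```python
# Hypothetical example data: bit widths of the storage variables in VAR.
widths = {"a": 8, "b": 16, "c": 4}

# Hypothetical access sets: variables read or written in each state.
uses = [
    {"a", "b"},  # state s_1
    {"b", "c"},  # state s_2
    {"c"},       # state s_3
]


def transition_bits(uses, widths):
    """Return the N x N matrix T, where T[i][j] is the total bit width of
    the variables accessed in both state i and state j (zero for i == j)."""
    n = len(uses)
    T = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                shared = uses[i] & uses[j]  # variables shared by both states
                T[i][j] = sum(widths[v] for v in shared)
    return T


T = transition_bits(uses, widths)
print(T)
```

In this toy machine, states s_1 and s_2 share the 16-bit variable `b`, and states s_2 and s_3 share the 4-bit variable `c`.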
We introduce binary variables s_ik for all i ∈ [1,N] and
k ∈ [1,M], which are 1 if and only if state s_i belongs to
partition k. Here, N = |S| is the number of states of the
original machine P.
The total number of transition bits between partitions can
now be counted using the following
$$T_{\text{total}} = \sum_{i,j=1}^{N} T_{ij} \left[ 1 - \sum_{k=1}^{M} s_{ik} s_{jk} \right], \tag{1}$$
where M is the number of partitions. Also, we must adhere to
the following constraint
$$\forall i \in [1,N] : \sum_{k=1}^{M} s_{ik} = 1, \tag{2}$$
which requires that each state s_i belong to one and only one
partition k.
Equation 1 can be simplified to
$$T_{\text{total}} = \sum_{i,j=1}^{N} T_{ij} - \sum_{i,j=1}^{N} T_{ij} \left[ \sum_{k=1}^{M} s_{ik} s_{jk} \right]. \tag{3}$$
The first term in equation 3 is constant since it represents all
the shared variable bits in the original machine. Therefore it
can be ignored in the optimization. We now get
$$T_{\text{total}} \approx - \sum_{i,j=1}^{N} T_{ij} \left[ \sum_{k=1}^{M} s_{ik} s_{jk} \right]. \tag{4}$$
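The summations in equation 4 can be switched and the
equation can be specified more concisely in matrix form as
$$T_{\text{total}} \approx -\operatorname{trace}\left( s^{\mathsf{T}} T s \right), \tag{5}$$
where s is the N × M binary matrix with entries s_ik and T is
the N × N matrix with entries T_ij.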
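The step from equation 1 to the trace form can be checked numerically. The sketch below is only an illustration on hypothetical data (the matrix `T` and the assignment `part` are made up); it evaluates equation 1 directly and compares it with the constant first term of equation 3 minus the trace term of equation 5.

```python
def cut_bits(T, part, M):
    """Equation 1: sum T[i][j] over state pairs lying in different partitions."""
    n = len(T)
    total = 0
    for i in range(n):
        for j in range(n):
            # sum_k s_ik * s_jk is 1 when i and j share a partition, else 0
            same = sum((part[i] == k) * (part[j] == k) for k in range(M))
            total += T[i][j] * (1 - same)
    return total


def trace_form(T, part, M):
    """trace(s^T T s), with s the N x M one-hot assignment matrix."""
    n = len(T)
    s = [[1 if part[i] == k else 0 for k in range(M)] for i in range(n)]
    return sum(s[i][k] * T[i][j] * s[j][k]
               for k in range(M) for i in range(n) for j in range(n))


# Hypothetical transition-bit matrix for a three-state machine.
T = [[0, 16, 0], [16, 0, 4], [0, 4, 0]]
part = [0, 0, 1]          # states 1 and 2 in partition 0, state 3 in partition 1
M = 2

const = sum(map(sum, T))  # first (constant) term of equation 3
lhs = cut_bits(T, part, M)
rhs = const - trace_form(T, part, M)
print(lhs, rhs)
```

Both expressions count the same 8 bits crossing the partition boundary, so minimizing the negative trace is equivalent to minimizing equation 1.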
Similarly, the total number of edges between all partitions
can be counted using the following
$$E_{\text{total}} = \sum_{i,j=1}^{N} E_{ij} \left[ 1 - \sum_{k=1}^{M} s_{ik} s_{jk} \right], \tag{6}$$
which is simplified for the minimization to
$$E_{\text{total}} \approx -\operatorname{trace}\left( s^{\mathsf{T}} E s \right). \tag{7}$$
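The objective function to be minimized can now be formulated
as a combination of the number of transition bits and the
number of edges. It can be stated as
$$\min \left\{ -\operatorname{trace}\left( s^{\mathsf{T}} T s \right) - \lambda \operatorname{trace}\left( s^{\mathsf{T}} E s \right) \right\}. \tag{8}$$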
The edges are weighted by
$$\lambda = \sum_{v \in VAR} |v|, \tag{9}$$
which is the sum of all register bits in the original machine P.
This is because, in the worst case, all register bits may need
to be communicated from one partition to the other.
Equation 8 can be simplified further and can be stated as
$$\min \left\{ -\operatorname{trace}\left( s^{\mathsf{T}} \left[ T + \lambda E \right] s \right) \right\}. \tag{10}$$
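A minimal sketch of how the objective in equation 10 might be evaluated for a candidate partition is given below. It relies on constraint 2: since every state belongs to exactly one partition, the inner sum over k equals 1 exactly when two states share a partition. The matrices `T` and `E`, the variable widths and the function name `objective` are hypothetical illustration data, not taken from the benchmark circuits.

```python
def objective(T, E, part, lam):
    """Equation 10: -trace(s^T [T + lam*E] s) for a state-to-partition map.

    part[i] is the partition index of state i, so the term for (i, j)
    contributes only when both states are in the same partition.
    """
    n = len(T)
    total = 0
    for i in range(n):
        for j in range(n):
            if part[i] == part[j]:
                total += T[i][j] + lam * E[i][j]
    return -total


# Hypothetical three-state machine with a cyclic state sequence 1 -> 2 -> 3 -> 1.
T = [[0, 16, 0], [16, 0, 4], [0, 4, 0]]
E = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
widths = {"a": 8, "b": 16, "c": 4}
lam = sum(widths.values())      # equation 9: all register bits, here 28

cost_a = objective(T, E, [0, 0, 1], lam)   # states 1, 2 together
cost_b = objective(T, E, [0, 1, 1], lam)   # states 2, 3 together
print(cost_a, cost_b)
```

Here the first candidate has the lower (more negative) cost because it keeps the heavily shared states 1 and 2 in the same partition.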
In each iteration of the simulated annealing algorithm, the
partition of a randomly chosen state is modified. We have
added a quality constraint to this update procedure such that
each partition contains at least a minimum share of the states of
the original machine P. This constraint can be specified as
$$\forall k \in [1,M] : \sum_{i=1}^{N} s_{ik} \geq 0.6 \, \frac{N}{M}. \tag{11}$$
This implies that for M = 2 partitions, each partition must
contain at least 30% of the total number of states. The
parameter 0.6 is based on experimentation and is used to
disallow trivial partitions with only one or two states. We
are currently examining a broader set of benchmarks to gain a
better understanding of how this parameter should be set.
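The annealing procedure described in this section can be sketched as follows. This is a minimal illustration, not the Matlab implementation used for the reported results. The single-state move, the constraint of equation 11 and the geometric cooling factor of 0.8 follow the text; the Metropolis acceptance rule, the initial temperature `t0`, the stopping temperature `t_min`, the number of moves per temperature and the round-robin starting partition are all assumptions. The weight matrix `W` stands for T + λE and is hypothetical.

```python
import math
import random


def cost(W, part):
    """Equation 10 with W = T + lambda*E: negated weight kept inside partitions."""
    n = len(W)
    return -sum(W[i][j] for i in range(n) for j in range(n) if part[i] == part[j])


def feasible(part, M, ratio=0.6):
    """Equation 11: every partition holds at least ratio * N / M states."""
    n = len(part)
    return all(part.count(k) >= ratio * n / M for k in range(M))


def anneal(W, M, t0=100.0, t_min=1e-3, moves=50, alpha=0.8, seed=0):
    rng = random.Random(seed)
    n = len(W)
    part = [i % M for i in range(n)]        # assumed round-robin start
    assert feasible(part, M), "starting partition must satisfy equation 11"
    cur = cost(W, part)
    best, best_cost = part[:], cur
    t = t0
    while t > t_min:
        for _ in range(moves):
            i = rng.randrange(n)            # pick a random state
            old = part[i]
            part[i] = rng.choice([k for k in range(M) if k != old])
            if not feasible(part, M):
                part[i] = old               # reject moves violating equation 11
                continue
            new = cost(W, part)
            # Metropolis rule (assumed): always accept improvements,
            # accept worse moves with probability exp(-delta / t).
            if new <= cur or rng.random() < math.exp((cur - new) / t):
                cur = new
                if cur < best_cost:
                    best, best_cost = part[:], cur
            else:
                part[i] = old
        t *= alpha                          # geometric cooling with factor 0.8
    return best, best_cost


# Hypothetical 6-state machine: two tightly coupled clusters, weakly linked.
W = [[0 if i == j else (10 if (i < 3) == (j < 3) else 1) for j in range(6)]
     for i in range(6)]
best, best_cost = anneal(W, M=2)
print(best, best_cost)
```

Tracking the best assignment seen so far is a design choice of this sketch; it guarantees that the returned partition is never worse than the starting one.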
For the simulated annealing algorithm, the cooling schedule
used is T_{i+1} = 0.8 T_i, where T_i denotes the annealing
temperature at step i. This was found experimentally to give a
good convergence profile.

IV. IMPLEMENTATION
Although we are working on a formal architecture to implement
the ideas presented, here we provide some guidelines and
issues that need to be considered for efficient implementation.
Power gating of a circuit block is performed by using an
appropriate header or footer transistor [5]. To begin gating, a
“sleep” signal is applied to the gate of this transistor to turn
off the supply voltage to the circuit block. To revive the block
for use, the “sleep” signal is de-asserted and power is restored.
In a power gating environment, the process of deactivating
or activating circuit components is not instantaneous, as it is
in clock gating. In clock gating the circuit state can be switched
immediately through the clock enable signal [4]. In
power gating, however, deactivating or activating the circuit
block involves the discharging or charging of capacitances,
which can be time-consuming. In particular, the delay in
activation can lead to system stalls. To handle this delay, two
methods are generally employed. First, while a sub-circuit
is being restored, idle waiting cycles are inserted into the
system until the sub-circuit is fully activated. Alternatively, to
reduce delay overhead, the sub-circuit can be activated ahead
of time, prior to its usage. However, this causes additional
power dissipation. To make the recovery process more efficient,
a branch prediction scheme can be used to reduce the cases
where a sub-circuit is activated in anticipation but not used
[7]. In our power estimation framework presented here, we
have assumed the first case, where delay cycles are added
while a sub-circuit is activated. We are currently working on
implementing a branch prediction lookahead scheme, which
will reduce the requirement for delay cycles.
Once a sub-circuit is activated it needs to update any
changes to its data elements that may have occurred in the
other partitions while it was asleep. Therefore, upon activation,
the sub-circuit needs to receive all shared data element values
from the previously active partition. To provide this functionality,
each partition needs the addition of an entry and an exit state,
s_entry and s_exit, respectively. When a partition is activated, it
is in its entry state, while the partition being deactivated is in
its exit state. It is here that the transfer of data takes place from
the previous to the new partition. This requires one additional
clock cycle. During this time both partitions need to be active,
leading to increased power dissipation and circuit delay.
V. EVALUATION FRAMEWORK
To test the effectiveness of our FSMD partitioning approach
we have implemented four integer algorithms:
• A simple 8-bit counter.
• A 5/3 Discrete Wavelet Transform using lifting [8], [9].
• A multiplierless approximation to the eight-point Discrete
Cosine Transform (DCT) [10].
• A transform used in H.264 (MPEG4 Part 10) [11].
The designs are implemented using CoDeL [12], which
produces synthesizable FSMD descriptions in VHDL. CoDeL
is a procedural language in which the order of the statements
implicitly represents the sequence of activities. It extracts
the data and control ﬂow from the program automatically,
assigns the necessary hardware blocks and exploits inherent
parallelism. To generate the FSMD, CoDeL assigns a state
to each group of statements that can be executed in parallel.
The CoDeL compiler has now been augmented to provide the
required parameters for our models. These include the state
transition bits, T_ij, and the edges, E_ij.
The solution to our model is obtained using a simulated
annealing algorithm [13] implemented in Matlab [14].
For effective power estimation, trace data is used from
circuit simulation using Synopsys. The trace data provides in-
formation on the state transition sequence during computation.
VI. POWER ESTIMATION
Here we present a framework for estimating the potential
power savings from partitioning an FSMD. The savings in
power dissipation can be broken down into savings in static
power and savings in dynamic power.
Using experimentation we have found that, at least in the
circuit implementations we have used, the static power of the
circuit is roughly proportional to the amount of sequential
logic in the circuit (see footnote 1). Thus, by examining the
number of sequential elements in the partitions, and the
proportion of time they are put to sleep, we can estimate the
static power savings. Formally, the static power (SP) savings
can be expressed as
$$\text{SP Savings} = 1 - \sum_{k=1}^{M} P(P_k) \cdot \frac{\sum_{v_i \in VAR_k} |v_i|}{\sum_{v_i \in VAR} |v_i|} \tag{12}$$
where VAR_k ⊆ VAR is the set of registers in partition k, and
|v_i| is the bit length of register v_i. The parameter P(P_k) is the
proportion of total time spent in partition k. This is obtained
through trace analysis using behavioral simulation.
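As a worked instance of equation 12, consider the DCT circuit with M = 2, using the values listed for it in Table I: register bit sums of 385 (VAR), 321 (VAR_1) and 97 (VAR_2), and time proportions P(P_1) = 0.6 and P(P_2) = 0.4.

```latex
\text{SP Savings}
  = 1 - \left( 0.6 \cdot \frac{321}{385} + 0.4 \cdot \frac{97}{385} \right)
  \approx 1 - (0.500 + 0.101)
  \approx 0.399
```

This agrees with the 39.9% static power savings reported for the DCT circuit.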
The dynamic power dissipation in a circuit is due to
switching activity. After partitioning, the largest component
of power savings is from the reduction of clocking of register
components. All other activity in the circuit is necessarily
the same as the unpartitioned FSMD to achieve the desired
functionality. This includes register value updates and arith-
metic computations. Thus, the dynamic power savings can be
estimated by examining the reduction in the number of register
bits that need to be clocked after FSMD partitioning. However,
we need to take into account the overhead added due to data
communication whenever a change in active partition occurs.
This overhead is given by
$$\text{Overhead} = 0.5 \sum_{k=1}^{M} \sum_{l=k+1}^{M} \left[ NPC_{kl} \cdot \sum_{v_i \in TVAR_{kl}} |v_i| \right] \tag{13}$$
where NPC_kl is the number of partition changes between
partitions k and l over a time period T, TVAR_kl =
VAR_k ∩ VAR_l is the set of variables shared between
partitions k and l, and |v_i| is the bit length of variable v_i. The
factor 0.5 captures the assumption that, on average, roughly
half of the bit values will be modified.
Footnote 1: A power-area relationship has also been exploited in [3], [4].

TABLE I
POWER SAVINGS BASED ON ESTIMATION FRAMEWORK

Circuit   VAR  VAR1  VAR2  TVAR12  P(P1)  P(P2)  NPC12   f·T  SP (%)  DP (%)
Counter    16    16     0       0   0.33   0.67      3     6    66.7    30.0
DCT       385   321    97      33    0.6    0.4      2    10    39.9    17.6
H.264     129    65    65       1    0.4    0.6      2     5    49.6    22.3
DWT       289   288    97      80   0.99   0.01      6  1681     1.0     0.4

VAR, VAR1, VAR2 and TVAR12: sum of register bit lengths in the set (Σ|v_i|).
f·T: cycles. SP and DP: power savings.

TABLE II
POWER SAVINGS USING MULTIPLE PARTITIONS FOR THE DCT CIRCUIT

No. of Partitions (M)   SP (%)   DP (%)
2                         39.9     17.6
3                         69.1     30.7
4                         57.0     24.2

In synchronous circuits, it is well known that the contin-
uously switching clock can account for as much as 45% of
the system power [15]. Therefore, we must adjust the total
dynamic power savings by this factor. The dynamic power
(DP) savings can now be estimated as
$$\text{DP Savings} = 0.45 - \sum_{k=1}^{M} 0.45 \cdot P(P_k) \cdot \frac{\sum_{v_i \in VAR_k} |v_i|}{\sum_{v_i \in VAR} |v_i|} - \frac{0.45 \cdot \text{Overhead}}{f \cdot T \sum_{v_i \in VAR} |v_i|} \tag{14}$$
where f is the circuit frequency and T is the run time.
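The overhead and dynamic power estimates of equations 13 and 14 can be evaluated with a few lines of code. The sketch below is an illustration only; the function names and the dictionary layout keyed by partition pairs are assumptions. The numbers are the DCT values listed in Table I for M = 2.

```python
def overhead(npc, tvar_bits):
    """Equation 13: npc and tvar_bits map each partition pair (k, l), k < l,
    to its number of partition changes and its shared register bits."""
    return 0.5 * sum(npc[pair] * tvar_bits[pair] for pair in npc)


def dp_savings(p, var_k_bits, var_bits, ovh, cycles, clock_share=0.45):
    """Equation 14: p[k] is the time share of partition k, var_k_bits[k] its
    register bits, cycles is f*T and clock_share is the 45% clock fraction."""
    active = sum(pk * bk / var_bits for pk, bk in zip(p, var_k_bits))
    return (clock_share
            - clock_share * active
            - clock_share * ovh / (cycles * var_bits))


# DCT circuit, M = 2 (values from Table I).
p = [0.6, 0.4]            # P(P1), P(P2)
var_k = [321, 97]         # register bits in VAR1, VAR2
ovh = overhead({(0, 1): 2}, {(0, 1): 33})   # NPC12 = 2, 33 shared bits
dp = dp_savings(p, var_k, var_bits=385, ovh=ovh, cycles=10)
print(f"Overhead = {ovh}, DP savings = {100 * dp:.1f}%")
```

For these inputs the overhead is 33 bit transfers and the estimate comes to about 17.6%, matching the DP entry for the DCT circuit.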
VII. RESULTS
In Table I we present the estimated power savings using the
framework presented in section VI. We find that for the counter
circuit about 67% static power savings and 30% dynamic power
savings can be expected. For the more useful
applications of DCT and H.264 the static power savings are
about 40% and 50%, respectively. The DWT circuit shows
only about 1% static power savings. The reason for the low power
savings for the DWT circuit is a loop in partition 1 which
executes 90% of the time. As a result, the circuit operates
in partition 1 for most of the run time, which yields low
savings from partitioning.
In Table II we examine the use of multiple partitions (M ≥
2) for the DCT circuit. We find that three partitions give the
largest savings in power since a larger part of the overall
circuit can be isolated and put to sleep. Using four partitions,
however, reduces the overall power savings as there is now a
large amount of shared data between the partitions, leading to
significant communication overhead.
From the results, it is clear that significant power savings
are possible using the proposed partitioning approach. It is
also evident that the optimum number of partitions depends
on the circuit and the number of variables shared between the
partitions.
VIII. CONCLUSION
In this paper, we have presented an FSMD partitioning
technique, which efficiently decomposes the controller and the
datapath into multiple partitions. We have formulated a non-linear
model and solved it using the simulated annealing algorithm.
An estimation framework is developed to evaluate the
potential power savings from the partitioning provided. Using
this framework we find that, in most cases, significant power
savings can be expected using our partitioning approach.
ACKNOWLEDGEMENTS
This work was supported by grants from the Natural Sci-
ences and Engineering Research Council of Canada (NSERC)
and by the University of Victoria through the Lansdowne
Chair.
REFERENCES
[1] M. Powell, S.-H. Yang, B. Falsafi, K. Roy, and T. N. Vijaykumar,
“Gated-Vdd: a circuit technique to reduce leakage in deep-submicron
cache memories,” in ISLPED 2000. ACM Press, 2000, pp. 90–95.
[2] E. Hwang, F. Vahid, and Y.-C. Hsu, “FSMD functional partitioning for
low power,” in DATE 1999, 1999.
[3] B. Liu, Y. Cai, Q. Zhou, J. Bian, and X. Hong, “FSM decomposition
for power gating design automation in sequential circuits,” in ASICON
2005, Oct 2005.
[4] F. Gao and J. P. Hayes, “ILP-based optimization of sequential circuits
for low power,” in ISLPED 2003, 2003.
[5] Z. Hu, A. Buyuktosunoglu, V. Srinivasan, V. Zyuban, H. Jacobson, and
P. Bose, “Microarchitectural techniques for power gating of execution
units,” in ISLPED ’04. ACM Press, 2004, pp. 32–37.
[6] N. Agarwal and N. J. Dimopoulos, “Efficient automated clock gating
using CoDeL,” in Proc. SAMOS VI, vol. 4017, July 2006, pp. 79–88.
[7] ——, “Automated power gating of registers using CoDeL and FSM
branch prediction,” in Proc. SAMOS VII, July 2007.
[8] D. L. Gall and A. Tabatabai, “Subband coding of digital images using
symmetric kernel filters and arithmetic coding techniques,” in Proc. of
the Intl. Conf. on Acoustics, Speech Signal Processing, April 1988, pp.
761–764.
[9] W. Sweldens, “The lifting scheme: A new philosophy in biorthogonal
wavelet constructions,” in Proc. SPIE 2569, 1995, pp. 68–79.
[10] J. Liang and T. Tran, “Fast multiplierless approximation of the DCT with
the lifting scheme,” in Proc. SPIE Apps. of Dig. Img. Process. XXIII,
Aug 2000.
[11] H. S. Malvar, A. Hallapuro, M. Karczewicz, and L. Kerofsky, “Low-
complexity transform and quantization in H.264/AVC,” IEEE Trans.
Circuits Syst. Video Techn., vol. 13, no. 7, pp. 598–603, 2003.
[12] R. Sivakumar, V. Dimakopoulos, and N. Dimopoulos, “CoDeL: A rapid
prototyping environment for the specification and automatic synthesis
of controllers for multiprocessor interconnection networks,” in Proc.
SAMOS III, July 2003, pp. 58–63.
[13] J. Vandekerckhove, “General simulated annealing algorithm,”
http://www.mathworks.com/matlabcentral/fileexchange/loadCategory.do.
[14] MathWorks, “Matlab,” http://www.mathworks.com/.
[15] G. Palumbo, F. Pappalardo, and S. Sannella, “Evaluation on power
reduction applying gated clock approaches,” in ISCAS 2002, vol. 4, May
2002.