In this paper, we address the low-power realization of FSMs using decomposition and a gated-clock architecture. We decompose an N-state machine into two interacting machines with $N_1$ and $N_2$ states such that $N = N_1 \times N_2$. Our cost function is the number of self-edges, which is to be maximized. For all self-edge conditions, the inputs and the clock of the respective machine are disabled to reduce the switching activity and thereby the power dissipation. We describe a greedy algorithm that maximizes the cost function. We attempt to keep the area the same by keeping the number of flip-flops to a minimum. We compared the results of our algorithm with JEDI [7]. In one case, we achieved a power reduction of up to 67% with less area as well. Based on the results, we conclude that our approach is suitable for machines with a large number of states and a small number of outputs.
Introduction
Methods for low-power realization of FSMs are of great interest, since FSMs are important components of digital systems and power is a design constraint. The power dissipated by an FSM can be controlled by the way codes are assigned to its states. Such attempts are reported in [4] and [6]. In [4], a weighted graph whose nodes are the states is constructed from the steady-state probability distribution. A high weight on an edge between two nodes implies that they should be given codes with a small Hamming distance, since there is a high probability of transition between them. In [6], a heuristic algorithm is given to embed the state transition graph (STG) in a hypercube such that a minimum number of flip-flops switch whenever there is a state transition.
Recently, attempts using decomposition for low-power realization of FSMs have also been reported [1, 2, 3]. In [2, 3], an STG is partitioned into several pieces, each piece being implemented as a separate machine with a wait state. In this case, only one of the sub-machines is active and the other sub-machines are in the reset state. Therefore, the clock for the 'inactive' sub-machines can be gated and the primary inputs can be disabled, which reduces the switching activity and hence the total power dissipation. In [2], the STG is partitioned into two sub-machines of unequal sizes such that the smaller sub-machine is active most of the time and the clock for the larger sub-machine is disabled. The cost function involves the probability distribution to find the portion of the STG that has higher transition probabilities. They use the Kernighan-Lin algorithm to find a partition that maximizes the cost function. The drawback of the latter techniques is the use of more flip-flops, which may result in a large area. This is because, if an N-state machine is decomposed into two machines with $N_1$ and $N_2$ states respectively, where $N_1 + N_2 = N$, the total $\lceil \log_2 N_1 \rceil + \lceil \log_2 N_2 \rceil$ typically exceeds $\lceil \log_2 N \rceil$, implying that more bits are required to encode the decomposed machine and hence more flip-flops for its implementation. Moreover, they add one wait state to each sub-machine. In the case of [1], when machine M is partitioned into smaller machines, bits are assigned not only to the states within each machine but also to the machines themselves. Although assigning a code to identify the machines results in simple decoding logic, the use of multiplexers to disable the primary inputs and clock may result in a large area.
Silicon Automation Systems, India.
In our approach, we decompose the N-state machine into two interacting machines with approximately $\sqrt{N}$ states each. Such an attempt has been reported in [5]. Unlike [5], which uses one-hot encoding for speed and area, we use the minimum number of bits to encode the states of the decomposed machine. Therefore, the number of flip-flops required is approximately the same as for the implementation of the original machine. Moreover, we formulate the problem as an orthogonal partitioning problem, for which we propose a cost function that is the number of self-edges in each machine. The self-edges of each sub-machine can be implemented separately, forming the clock and primary-input disabling logic for that sub-machine. In our approach, we try to keep the area the same while still reducing power dissipation. This can be achieved because the number of flip-flops stays the same and the combinational logic that earlier implemented the next-state functions is split into parts: two self-edge logic blocks that are always active, and two combinational logic blocks that evaluate the next-state functions for each machine and are only partially active. Since the self-edge conditions are implemented separately and the clock is disabled for those conditions, they are treated as don't-cares by the logic blocks that evaluate the next-state functions. We decompose the original N-state machine into two $\sqrt{N}$-state machines that interact with each other and run concurrently. When one or both machines have a self-loop, the clock and primary inputs are disabled for the respective machine or machines. Our clock disabling circuit involves simple 2-input AND gates. Note that, since the original machine is decomposed into two interacting machines and each machine has its own self-edge block, both machines require primary-input disabling AND gates and a clock disabling AND gate.
Therefore, the implementation of an FSM with $N_i$ primary inputs requires $2 \times N_i$ 2-input AND gates for the primary-input disabling logic and two 2-input AND gates for the clock disabling logic.
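The flip-flop and gate counts discussed above can be checked with a short calculation. The following is a minimal sketch, not part of the original work: the function names and the choice of three primary inputs are hypothetical, and the additive split with one wait state per sub-machine is only an illustration of the schemes in [2, 3].

```python
def state_bits(num_states: int) -> int:
    """Minimum number of flip-flops needed to encode num_states states."""
    return max(1, (num_states - 1).bit_length())


def orthogonal_overhead(n1: int, n2: int, num_inputs: int) -> dict:
    """Flip-flop and disabling-gate counts for two interacting machines.

    Assumes minimum-length encoding of each machine, one 2-input AND gate
    per primary input per machine, and one clock AND gate per machine.
    """
    return {
        "flip_flops": state_bits(n1) + state_bits(n2),
        "input_and_gates": 2 * num_inputs,
        "clock_and_gates": 2,
    }


# A 128-state machine needs 7 flip-flops when encoded directly.
original = state_bits(128)

# Orthogonal decomposition into 12 x 11 states (12 * 11 >= 128).
# num_inputs=3 is a hypothetical value chosen for illustration.
orthogonal = orthogonal_overhead(12, 11, num_inputs=3)

# Illustrative additive split 64 + 64 with one wait state per sub-machine.
additive = state_bits(64 + 1) + state_bits(64 + 1)

print(original, orthogonal, additive)
```

Under these assumptions the orthogonal decomposition uses 8 flip-flops against 7 for the original machine, whereas the additive split with wait states uses 14.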
The rest of the paper is organized as follows. In section 2, we give formal definitions. In section 3, we illustrate orthogonal partitioning and the gated-clock architecture with an example. In section 4, we explain the motivation for the cost function. In section 5, we describe a greedy algorithm to maximize the cost function. In section 6, we describe the experimental results, and in section 7, we conclude the paper with future directions.
Formal Definitions
Any finite state machine M can be described by a 6-tuple
$M = (\Sigma, \Delta, Q, \delta, \lambda, q_{reset})$
where $\Sigma$ is the input alphabet, $\Delta$ is the output alphabet, $Q$ is the set of states, $\delta : \Sigma \times Q \to Q$ is the next-state function and $\lambda : \Sigma \times Q \to \Delta$ is the output function. The state $q_{reset}$ is the reset state. Consider an FSM M which is to be decomposed into two interacting machines as shown in Figure 1(b). Such a decomposition can be obtained using the approach outlined as follows. Suppose $Q = \{q_1, q_2, \ldots, q_N\}$.
A partition $\Pi(Q)$ of a set $Q$ is a set of disjoint and nonempty subsets of $Q$ whose union is $Q$. The zero partition of $Q$ is denoted by $\Pi_0(Q)$, and is the partition whose elements are the singleton subsets of $Q$.
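These definitions translate directly into set operations. The following is a minimal sketch; it additionally assumes, in line with the way the meet is used in the algorithm section, that two partitions are orthogonal when their meet is the zero partition. The function names and the second example partition are hypothetical.

```python
def is_partition(blocks, states):
    """True if blocks are non-empty, pairwise disjoint, and cover states."""
    seen = set()
    for block in blocks:
        if not block or seen & set(block):
            return False
        seen |= set(block)
    return seen == set(states)


def zero_partition(states):
    """The partition whose blocks are the singleton subsets of states."""
    return {frozenset([q]) for q in states}


def meet(pa, pb):
    """Blocks of the meet are the non-empty pairwise intersections."""
    return {
        frozenset(a) & frozenset(b)
        for a in pa
        for b in pb
        if frozenset(a) & frozenset(b)
    }


def is_orthogonal(pa, pb, states):
    """Assumed definition: the meet of pa and pb is the zero partition."""
    return meet(pa, pb) == zero_partition(states)


states = [1, 2, 3, 4]
pa = [{1, 2}, {3, 4}]
pb = [{1, 3}, {2, 4}]  # hypothetical second partition

print(is_partition(pa, states), is_orthogonal(pa, pb, states))
```

A partition is never orthogonal to itself unless it is already the zero partition, since any block with two states survives the meet.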
Let $\Pi_A(Q) = \{A_1, A_2, \ldots$
It is clear from the definition of orthogonal partition and the decomposition model that one block of a partition is a state of the sub-machine corresponding to that partition. A block of a partition contains many states of the original machine. Therefore, if there are transitions among states that are now placed in one block, each such transition becomes a self-edge. When there is a self-edge for some input, the next state is the same as the present state and the clock for the flip-flops can be disabled. Since the clock of the flip-flops is disabled, the next-state value computed by the next-state functions is of no use. Therefore, the primary inputs can also be disabled to reduce unnecessary switching activity. This is shown in Figure 2. In Figure 2, SL1 and SL2 represent the self-edge conditions for machines $M_1$ and $M_2$ respectively. CL1 and CL2 are combinational logic blocks computing the next-state functions for $M_1$ and $M_2$ respectively, while OL computes the primary outputs from the primary inputs and the present states of both machines. For machine $M_1$, SL1 implements the function $\delta_{sl} : \Sigma \times \Pi_2 \times \Pi_1 \to \Pi_1$ such that $\delta(\sigma, q_{2i}, q_{1j}) = q_{1j}$, $\forall q_{1j} \in Q_1$, $q_{2i} \in Q_2$, where $Q_1$, $Q_2$ are the sets of states of machines $M_1$, $M_2$ respectively. On the other hand, CL1 implements a function $\delta(\sigma, q_{2i}, q_{1j}) = q_{1k}$, where $q_{1j} \neq q_{1k}$ and $q_{1j}, q_{1k} \in Q_1$. In the following subsection, we explain the realization using this architecture.
Let $A_1$, $A_2$ denote the blocks of $\Pi_A$, i.e. $A_1 = (1,2)$, $A_2 = (3,4)$. Similarly, let $B_1$, $B_2$ denote the blocks of $\Pi_B$. Since there are 2 states in each machine, 1 bit per machine suffices for state assignment.
The state assignment is as shown in Table 1 (see also Figure 4). The combinational block SL1 implements row 1 and row 3 of the state transition table (STT), which are self-edges. SL1 computes '0' for these conditions, and the clock and primary inputs are disabled using AND gates. CL1 implements the min-term corresponding to the state transition shown in row 2 of the STT. The primary inputs and the state bits from both machines are the inputs to SL1 and CL1.
Note that, in this case, we did not require the state bit from $M_B$ as an input to CL1. In general, it can be an input to CL1 as well as SL1.
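The gating behaviour described above can be mimicked in software. The following is a minimal behavioural sketch, not the circuit of the paper: `gated_step` and `toy_next_state` are hypothetical names, and the toy next-state function is invented for illustration because the state transition table of the example is not reproduced here.

```python
def gated_step(state, inputs, other_state, next_state_fn, clk=1):
    """One clock cycle of a sub-machine with self-edge gating.

    The self-edge block (SL) outputs 0 when the next state equals the
    present state. That signal drives 2-input AND gates on the clock and
    on every primary input, so nothing toggles on a self-edge.
    """
    target = next_state_fn(inputs, other_state, state)
    sl = 0 if target == state else 1
    gated_clk = clk & sl                              # clock AND gate
    gated_inputs = tuple(bit & sl for bit in inputs)  # input AND gates
    # When sl is 1 the gated inputs equal the real inputs, so the
    # next-state logic sees them unchanged and the flip-flops load target.
    new_state = target if gated_clk else state
    return new_state, gated_clk, gated_inputs


def toy_next_state(inputs, other_state, state):
    """Hypothetical 1-bit machine: toggles when input and other state are 1."""
    return state ^ (inputs[0] & other_state)


print(gated_step(0, (1,), 1, toy_next_state))  # state change, clock passes
print(gated_step(0, (1,), 0, toy_next_state))  # self-edge, clock and input held at 0
```

In the second call the input bit is 1, yet the gated input seen by the next-state logic is 0, which is the source of the reduced switching activity.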
Motivation for Cost Function
When the machine makes a state transition, the state bits change. Whenever a state bit changes, there is switching activity not only in the flip-flop but also in the combinational logic. Therefore, there have been attempts to minimize the Hamming distance among the codes assigned to the states. If there are many self-edges, the present and next states are often the same. Even then, power is dissipated because of the clock, and the combinational logic remains active, since a change in some of the inputs can cause switching activity. Therefore, if the inputs and the clock are disabled whenever there is a self-loop, a considerable amount of power can be saved. While partitioning, we try to maximize the number of self-edges. The self-edges are then implemented separately to disable the clock and primary inputs. The more self-edges there are, the more power is saved. Therefore, our cost function is the number of self-edges.
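The cost function can be evaluated for any candidate pair of partitions. The following is a minimal sketch under two assumptions not stated in the text: each listed transition counts once, and the total cost is the sum over both machines. The 4-state ring counter and the two candidate second partitions are hypothetical.

```python
def block_index(partition):
    """Map each state to the index of the block that contains it."""
    return {q: i for i, block in enumerate(partition) for q in block}


def count_self_edges(transitions, partition):
    """Number of STG edges whose source and destination share a block."""
    index = block_index(partition)
    return sum(1 for src, dst in transitions if index[src] == index[dst])


def cost(transitions, pa, pb):
    """Assumed total cost: self-edges of machine A plus those of machine B."""
    return count_self_edges(transitions, pa) + count_self_edges(transitions, pb)


# Hypothetical 4-state ring counter: 1 -> 2 -> 3 -> 4 -> 1.
ring = [(1, 2), (2, 3), (3, 4), (4, 1)]
pa = [{1, 2}, {3, 4}]

print(cost(ring, pa, [{1, 3}, {2, 4}]))
print(cost(ring, pa, [{1, 4}, {2, 3}]))
```

Both candidate second partitions are orthogonal to the first, yet one yields a cost of 2 and the other a cost of 4, which is why the choice of partitions matters.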
Greedy Algorithm for Orthogonal Partitioning
The problem of finding two interacting machines is the problem of finding two partitions. We are interested in finding orthogonal partitions $\Pi_A$, $\Pi_B$ such that the number of self-edges in the digraphs corresponding to the machines is maximum. It is obvious that, if two elements are in the same block of $\Pi_A$, then they must be in different blocks of $\Pi_B$; otherwise their meet will not be $\Pi_0$. The greedy algorithm builds partition $\Pi_A$ by forcing tightly connected states into the same block, so that the edges between them are replaced by a self-loop. While building the second partition $\Pi_B$, states are added one at a time to blocks by doing a local search (on assignments of the state to blocks) to determine which assignment creates the minimum number of additional edges in the directed graph of $\Pi_B$.
The pseudo code for the algorithm is shown in Figure 5. The algorithm is the same as the algorithm given in [5]. The algorithm simultaneously generates the orthogonal partitions $\Pi_A$ and $\Pi_B$. We assume that we have been provided positive integers $n_1$, $n_2$ with $n_1 n_2 \ge N$. The algorithm generates partitions such that $\Pi_A$ consists of $n_1$ blocks, each containing at most $n_2$ states, and $\Pi_B$ consists of $n_2$ blocks, each containing at most $n_1$ states.
The routine put_in_most_suitable_block() chooses one block among all the available blocks of $\Pi_B$ such that, when a state is put in that block, the additional number of edges created in the directed graph corresponding to $\Pi_B$ is minimized. The routine find_most_adjacent_state() returns the state with the maximum number of fan-in edges from the
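Since the pseudo code of Figure 5 is not reproduced here, the following is only a simplified sketch of a greedy procedure in the same spirit, not the algorithm of [5]. It makes several assumptions: adjacency counts edges in both directions, the first unassigned state seeds each block of the first partition, and the second partition is built by maximizing edges into an existing block as a proxy for minimizing additional edges. All function names are hypothetical.

```python
def adjacency(transitions, q, block):
    """Number of STG edges between state q and the states in block."""
    return sum(
        1
        for s, d in transitions
        if (s == q and d in block) or (d == q and s in block)
    )


def greedy_orthogonal_partitions(states, transitions, n1, n2):
    """Build pa (n1 blocks of <= n2 states) and pb (n2 blocks of <= n1)."""
    assert n1 * n2 >= len(states)
    unassigned = list(states)

    # First partition: grow each block with the most adjacent state.
    pa = []
    for _ in range(n1):
        if not unassigned:
            break
        block = [unassigned.pop(0)]
        while len(block) < n2 and unassigned:
            best = max(unassigned, key=lambda q: adjacency(transitions, q, block))
            unassigned.remove(best)
            block.append(best)
        pa.append(block)

    # Second partition: states of one pa block go to distinct pb blocks,
    # which guarantees that the meet is the zero partition.
    pb = [[] for _ in range(n2)]
    for block in pa:
        free = list(range(n2))
        for q in block:
            j = max(free, key=lambda k: adjacency(transitions, q, pb[k]))
            free.remove(j)
            pb[j].append(q)
    return pa, [b for b in pb if b]


# Hypothetical 4-state ring counter: 1 -> 2 -> 3 -> 4 -> 1.
ring = [(1, 2), (2, 3), (3, 4), (4, 1)]
pa, pb = greedy_orthogonal_partitions([1, 2, 3, 4], ring, n1=2, n2=2)
print(pa, pb)
```

On this small ring counter the sketch places every transition inside a block of one of the two partitions, so each transition is a self-edge in one of the two machines. Ties are broken by list order, so other inputs may give less favourable results.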
Experimental Results
We implemented the extraction of the self-edge logic block and the disabling of the clock and primary inputs in an enhanced version of a program called DECOMP, and used the same greedy algorithm as DECOMP [5]. We ran DECOMP on a set of 6 MCNC benchmarks, which are listed in Table 3.
In Table 3 , Ni, No, N p , N , would mean number of inputs, number of outputs, number of prodcut terms and number of states respectively. To synthesize the circuits, we used JEDI for state assignment, script. rugged for minimization and mapped the circuits using m u library in SIS [8] . We used power-estimate program in SIS to compute the power dissipation. The results are shown in Table 4 . In case of 12-bit counter modulol2, we get the power reduction by about 33% with less area as well. The power reduction is significant in case of c128 which is 128 state counter. It is almost 67% with the area reduction 62% as well. This Table 3 : MCNC benchmark information is because optimal decomposition in terms of 12-state and 1 I-state counters can be obtained and they can be encoded nicely using JEDI. In case of s1494 we do not get the significant power reduction because the output logic for them is bigger and it is always active. Note that, although in case of s1494, there is an increase in area, still there is power reduction. This is because of the disabling of inputs and clock. We have similar observations in case of s820 and s832. In case of s5 10, the number of outputs are 7 as compared to 19 of s1494 and also the number of product terms are less, we could get the power reduction by 100 uWatt.
Conclusion and Future Directions
In this paper, we described a gated-clock architecture for low-power realization of decomposed FSMs. We used a greedy algorithm to maximize the cost function. From the empirical results, one may deduce that our algorithm works well for FSMs with a large number of states and a small number of outputs, where it produces a significant power reduction. We would like to emphasize that for FSMs with simple structures such as loops, we obtained a significant power reduction along with an area reduction. For small examples, the disabling logic itself is an overhead. For large examples (in terms of number of states), if the number of primary outputs is large, we do not get a significant power reduction. This is because the combinational block that computes the primary outputs is always active.
There are several future directions. We have proposed a simple cost function, which is the number of self-edges. We have not taken into account the probability distribution, which can tell exactly how long each combinational block is active. A cost function that takes such a probability distribution into account may be better than the plain number of self-edges. In that case, the self-edges can be distributed between the two machines such that the area of the self-edge logic, which is always active, is small. It is also worthwhile to examine the case where the sizes of the two machines are different, i.e. $N_1 \gg N_2$.
The problem of how many self-edges can be created for a given graph $G(V, E)$ when it is decomposed using orthogonal partitions can be of interest because one can then predict more accurately how much power can be saved. Working on this problem has also given us the insight to look at state assignment as an orthogonal partitioning problem. The state assignment for an N-state machine can be considered as finding $\lceil \log_2 N \rceil$ partitions with two blocks each, such that their meet is the zero partition. To the best of our knowledge, the state assignment problem has not so far been looked at from this perspective. It can be worthwhile to do so because one can go on building the partitions incrementally, keeping track of the cost function.
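The view of state assignment as a set of two-block partitions can be made concrete. The following is a minimal sketch with hypothetical function names and a hypothetical 2-bit encoding of four states: each code bit splits the states into a "0" block and a "1" block, and the encoding distinguishes all states exactly when the meet of these partitions is the zero partition.

```python
def bit_partitions(codes):
    """One two-block partition per code bit (blocks for bit value 0 and 1)."""
    width = len(next(iter(codes.values())))
    return [
        [frozenset(q for q, c in codes.items() if c[bit] == v) for v in "01"]
        for bit in range(width)
    ]


def meet_all(partitions, states):
    """Meet of several partitions, built by repeated block intersection."""
    blocks = {frozenset(states)}
    for part in partitions:
        blocks = {b & p for b in blocks for p in part if b & p}
    return blocks


def is_valid_assignment(codes):
    """True when the meet of the bit partitions is the zero partition."""
    zero = {frozenset([q]) for q in codes}
    return meet_all(bit_partitions(codes), codes) == zero


codes = {1: "00", 2: "01", 3: "10", 4: "11"}  # hypothetical encoding
print(bit_partitions(codes)[0], is_valid_assignment(codes))
```

With this encoding the first bit induces the blocks (1,2) and (3,4). Because the meet is built one partition at a time, a cost function could in principle be tracked as each bit is added.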
