demonstrated that in on-demand RF read architectures, there is scope for further rfa reduction and proposed several bypass-aware instruction scheduling techniques aimed at reducing the number of accesses to the register file. Our experiments on the Intel XScale processor pipeline with on-demand RF read running MiBench benchmarks show that rfa can be reduced by up to 26% and by 12% on average. Further, one of our scheduling techniques, RFPN2, is an effective heuristic that reduces the number of rfa (11.4% on average) without much loss in performance (less than 1% on average) and within a reasonable compilation time. We have demonstrated that our compilation technique consistently reduces the number of rfa on various bypass configurations.
I. INTRODUCTION
Finite-state machine (FSM) synthesis is a well-studied problem. It consists of state minimization (SM) and state encoding (SE) procedures. SM finds a functionally equivalent FSM that has the minimum number of states. SE assigns a distinct binary code to each state of the FSM such that the sequential circuit modeled by the FSM is efficient in terms of area, performance, and/or power.
The SM problem can be solved optimally for completely specified FSMs [6], and there are standard approaches to solving the SM problem for incompletely specified machines [9]. On the other hand, many techniques have been proposed to solve the SE problem for different optimization objectives and implementation technologies.
In area-driven FSM synthesis, De Micheli et al. proposed an SE algorithm to minimize area in a programmable logic array implementation by generating a minimum (multivalued) symbolic cover of the FSM followed by a step of satisfying the encoding constraints [13]. Successive extensions also introduced output constraints and more efficient algorithms to satisfy the input and output encoding constraints [18], [19]. MUSTANG [3] is one of the earliest state encoding techniques for multilevel logic minimization; it assigns a weight to each pair of symbols and gives adjacent codes to pairs of states with large weight. JEDI [11] adopts a weighted graph model similar to the one in MUSTANG, but it uses a simulated annealing algorithm to perform the embedding. Instead, MUSE [4] and MIS-MV [12] apply multivalued minimization followed by extraction and satisfaction of encoding constraints, generalizing it to multilevel logic.
In power-driven FSM synthesis, the optimization objective of SE is often formulated as minimizing the total switching activity in the circuit, in light of the fact that dynamic power is proportional to the switching activity. Roy and Prasad proposed a simulated annealing algorithm to reduce the switching activity in conventional area-driven state encoding algorithms [15]. Washabaugh et al. assigned weights to each pair of states based on the state transition probability between them. Then, they used a branch-and-bound algorithm to minimize the total weighted sum of state transitions [20]. Based on the same weighted state transition graph (STG) model, Olson and Kang applied a genetic algorithm to optimize the SE solution; in addition, they also considered area minimization during encoding to achieve different area-power tradeoffs [14]. Later, Benini and De Micheli proposed POW3, a greedy algorithm that assigns binary codes to states bit by bit. At each step, the codes are selected to minimize the number of states with different partial codes [2]. This algorithm has low run-time and very good synthesis results. Moreover, Iman and Pedram developed a complete low-power FSM synthesis framework that minimizes the switching activity both in the state registers and in the combinational circuit [8].
All previous FSM synthesis approaches start with SM and look for the best encoding solution in the minimized FSM. Such a serial strategy, however, is a heuristic that disregards the fact that a state-minimized FSM is not necessarily the best starting point for SE. Concurrent SM and SE were proposed in [1], [5], [7], and [10], with little success. Hallbauer [7] proposed a method for asynchronous circuits based on pseudodichotomies that tries to perform SM while heuristically reducing the encoding length, with no reported results. To explore the solution space in the nonminimized FSM, the method of Lee and Perkowski [10] employed a branch-and-bound technique; however, it is only feasible for very small machines (no more than 16 states). Avedillo et al. [1] presented a heuristic method in which the encoding is incrementally generated and may create incompletely specified codes for the states in the original FSM. Although reasonably efficient, the experimental results on a subset of the Microelectronics Center of North Carolina (MCNC) benchmarks do not show improvements over a serial synthesis strategy. Fuhrer and Nowick proposed OPTIMIST [5], a concurrent SM and SE algorithm that provides an exact solution to FSM optimization for two-level logic implementation. However, the largest FSM in the reported table of results has only nine states, and the authors pointed out that their algorithm does not scale well.
To overcome the pitfalls of locality in a serial strategy and the computational bottlenecks in concurrent approaches, we propose FSM reengineering, a performance enhancement method for FSM synthesis. Our method consists of three phases. First, we encode a minimized FSM. Then, based on the encoded solution, we reconstruct a functionally equivalent FSM with an increased number of states. Finally, we reencode the new enlarged FSM. This method enables the SE algorithms to explore a larger solution space composed of functionally equivalent but nonminimized FSMs. Unlike the concurrent SM and SE approaches, our method has much lower run-time complexity and, therefore, is capable of handling large FSMs. Furthermore, this method can be applied on top of an existing FSM synthesis flow to improve the solution quality. An extended version of this paper can be found in [21].
II. PROBLEM FORMULATION
We use the same weighted STG to model an FSM as in [20], where an encoded FSM is represented by a graph $G = (V, E, \{C_i\}, \{w_{ij}\})$: a node $v_i \in V$ represents a state $s_i$ with code $C_i$; a directed edge $(v_i, v_j) \in E$ represents a transition from state $s_i$ to state $s_j$; weight $w_{ij}$ is a positive number assigned to each edge and depends on the optimization objective.
We use $H(v_i, v_j)$ to denote the Hamming distance between the state codes $C_i$ and $C_j$ for a particular encoding solution. The weighted sum of an encoded FSM can be expressed as follows:
$$\sum_{(v_i, v_j) \in E} w_{ij}\, H(v_i, v_j) \qquad (1)$$
Equation (1) has a significant implication for state encoding because in many SE algorithms, both power- and area-driven, such a weighted sum is used as the objective function in logic optimization. For example, in power-driven state encoding such as POW3 [2], GALOPS [14], SYCLOP [15], and SABSA [20], the weight is defined as the state transition probability, whereas in area-driven state encoding such as MUSTANG [3] and JEDI [11], the weight is defined as attrition between states and calculated by the adjacency matrices. In this paper, we simply refer to the weighted sum of Hamming distances as the "cost" in the FSM optimization.
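As a minimal sketch of how the cost in (1) can be evaluated, the following Python fragment computes the weighted sum of Hamming distances for an encoded STG. The function names, the dictionary-based graph representation, and the three-state machine are illustrative assumptions, not part of the original implementation.

```python
def hamming(a: str, b: str) -> int:
    """Number of bit positions in which two equal-length codes differ."""
    if len(a) != len(b):
        raise ValueError("codes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def encoding_cost(codes: dict, weights: dict) -> float:
    """Weighted sum of Hamming distances over all edges, as in (1).

    codes   maps a state name to its binary code string (C_i)
    weights maps an edge (source, target) to its positive weight (w_ij)
    """
    return sum(w * hamming(codes[i], codes[j]) for (i, j), w in weights.items())


# Hypothetical three-state machine; the weights are made-up values.
codes = {"a": "00", "b": "01", "c": "11"}
weights = {("a", "b"): 0.5, ("b", "c"): 0.3, ("c", "a"): 0.2}
cost = encoding_cost(codes, weights)  # 0.5*1 + 0.3*1 + 0.2*2
```

With transition probabilities as weights, this value corresponds to the switching-activity style of cost; with other weight definitions, the same routine applies unchanged.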
The FSM reengineering problem is formulated as follows:
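Given an encoded FSM $M$ and its corresponding weighted graph $G = (V, E, \{C_i\}, \{w_{ij}\})$, construct and encode a functionally equivalent FSM $M'$ so that the total cost reduction in the new graph $G' = (V', E', \{C'_i\}, \{w'_{ij}\})$,
$$\sum_{(v_i, v_j) \in E} w_{ij}\, H(v_i, v_j) \;-\; \sum_{(v'_i, v'_j) \in E'} w'_{ij}\, H(v'_i, v'_j),$$
is maximized.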
III. FSM REENGINEERING
In this section, we present the algorithms in the FSM reengineering method. We first show the overall flow of the method. Then, we present the state-splitting technique to reconstruct a functionally equivalent machine. Two heuristic algorithms are used to select states for splitting and to decide how to partition their previous states. Finally, we propose a genetic algorithm to perform state splitting.
A. Overall Flow
Fig. 1 outlines the overall flow of the proposed FSM reengineering method. This flow can be applied to FSM synthesis with different optimization objectives. First, we use an existing SE algorithm to obtain an "optimal" solution (assuming the algorithm always produces the best possible solution it can achieve) for a given FSM. Then, we analyze the solution and find the state pairs that contribute the most to the total cost. The second phase is FSM reconstruction based on the encoded FSM. We leverage the state-splitting technique to mitigate the cost between frequently transitioned states with a large Hamming distance, while maintaining the functionality of the FSM. In the third phase, the reconstructed FSM is reencoded using the same SE algorithm, although a different one may also be used.
In the rest of this section, we illustrate the three key steps in the second phase: 1) select the best candidate state for splitting; 2) decide how to split the selected state; 3) estimate the (maximum) cost reduction after state splitting.
B. Heuristic for Selecting States to Split
Due to the topological constraints in the STG, there exist connected states in the encoded FSM that are separated by a large Hamming distance. This can happen for two reasons: 1) there are other neighboring state pairs with larger weights and a higher priority to be assigned adjacent codes and 2) SE algorithms do not guarantee the minimum cost in all the encoded solutions because SE is $\Sigma_2^P$-complete [17]. State splitting makes it possible to assign different codes to the original state and its new companion split state, so that each of them will have a smaller Hamming distance from its neighboring states.
This can be seen in Fig. 2. In this STG, no matter what code we assign to state $S$, it will have a Hamming distance ≥ 3 from at least one of its previous states. (To see this, notice that both codes 11111 and 00000 are assigned to its previous states.) After we split $S$, we can assign code 11110 to state $S'$ and code 00001 to its equivalent state $S''$. This ensures that $S'$ will have a Hamming distance of one from all its previous states, and $S''$ will have a Hamming distance of two from $S_4$ and a distance of one from all the remaining previous states. One can easily verify that the FSM after state splitting is equivalent to the original FSM.
Intuitively, a state with a large (average) Hamming distance from its previous states can benefit the most from state splitting because it will have fewer previous states in the reconstructed FSM, which allows the encoding algorithm to find a code that minimizes the total weighted sum of Hamming distances.
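For each state $s_i$, we define
$$r_i = \frac{1}{|P_{s_i}|} \sum_{s_j \in P_{s_i}} H(v_j, v_i),$$
where $P_{s_i}$ denotes the set of previous states of $s_i$.
This value measures the average Hamming distance between state $s_i$ and all its previous states. Our heuristic picks the state with the largest r-value for splitting; if there is a tie, random selection is used.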
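The selection heuristic can be sketched as follows. This is a minimal illustration under two assumptions that the text does not settle: the average is taken unweighted over the distinct previous states, and self-loops are ignored. The names `r_value` and `pick_state_to_split`, and the four-state example, are hypothetical.

```python
import random


def hamming(a: str, b: str) -> int:
    """Number of bit positions in which two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))


def r_value(state, codes, edges):
    """Average Hamming distance between `state` and its previous states.

    Assumption: unweighted average; self-loops are not counted.
    """
    preds = {i for (i, j) in edges if j == state and i != state}
    if not preds:
        return 0.0
    return sum(hamming(codes[p], codes[state]) for p in preds) / len(preds)


def pick_state_to_split(codes, edges, rng=random):
    """Return the state with the largest r-value; ties are broken at random."""
    scores = {s: r_value(s, codes, edges) for s in codes}
    best = max(scores.values())
    tied = sorted(s for s, r in scores.items() if r == best)
    return rng.choice(tied)


# Hypothetical encoded STG: s4 has three previous states with distant codes.
codes = {"s1": "000", "s2": "111", "s3": "001", "s4": "110"}
edges = [("s1", "s4"), ("s3", "s4"), ("s2", "s4"), ("s3", "s1"), ("s1", "s3")]
candidate = pick_state_to_split(codes, edges)
```

In this example, s4 has r-value (2 + 3 + 1) / 3 = 2, whereas s1 and s3 each have r-value 1 and s2 has no previous state, so s4 is selected.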
C. Heuristic for Splitting a Selected State
Once a state $S$ is selected for splitting, we replace it by two states $S'$ and $S''$, such that: 1) both have the same next-state transitions as the original state, which preserves the functionality, and 2) each of $S'$ and $S''$ inherits a subset of the previous states of $S$, as shown in Fig. 2. Ideally, we want to split the state in such a way that the total cost is maximally reduced in the new FSM after reencoding.
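The splitting operation itself can be sketched on the weighted edge set. The fragment below is only an illustration: how the outgoing weights of the original state are divided between the two copies is not specified in the text, so the sketch assumes they are shared in proportion to the incoming weight each copy receives, and it assumes the split state has no self-loop. The function name and the example graph are hypothetical.

```python
def split_state(edges, state, part1, part2):
    """Replace `state` by two copies that share its previous states.

    edges        maps (source, target) -> weight
    part1, part2 partition the previous states of `state`
    Returns the new edge dictionary and the names of the two copies.
    """
    if (state, state) in edges:
        raise ValueError("self-loops on the split state are not handled here")
    part1, part2 = set(part1), set(part2)
    preds = {i for (i, j) in edges if j == state}
    if not part1 or not part2 or part1 & part2 or part1 | part2 != preds:
        raise ValueError("part1 and part2 must partition the previous states")

    copy1, copy2 = state + "'", state + "''"
    in1 = sum(edges[(p, state)] for p in part1)
    in2 = sum(edges[(p, state)] for p in part2)
    share1 = in1 / (in1 + in2)  # assumed rule for dividing outgoing weight

    new_edges = {}
    for (i, j), w in edges.items():
        if j == state:    # incoming edge: redirect to one of the copies
            new_edges[(i, copy1 if i in part1 else copy2)] = w
        elif i == state:  # outgoing edge: duplicate so both copies behave alike
            new_edges[(copy1, j)] = w * share1
            new_edges[(copy2, j)] = w * (1 - share1)
        else:
            new_edges[(i, j)] = w
    return new_edges, copy1, copy2


# Hypothetical STG with integer weights (e.g., transition counts).
edges = {("p", "s"): 3, ("q", "s"): 1, ("s", "t"): 4, ("t", "p"): 3, ("t", "q"): 1}
new_edges, s1, s2 = split_state(edges, "s", {"p"}, {"q"})
```

Both copies keep the transition to t, so the next-state behavior is unchanged, while each copy now has a single previous state and can be given a code close to it.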
Let $P_S$ be the set of previous states of $S$. A greedy heuristic to partition $P_S$ is shown in Fig. 3. The algorithm first chooses two states $s_1$ and $s_2$ from $P_S$ with the largest Hamming distance and puts them into two partitions $PT_1$ and $PT_2$, respectively (lines 3-4). For each of the other states $t \in P_S$, we assign $t$ to $PT_1$ if it is closer to $s_1$ in terms of the Hamming distance, or to $PT_2$ if it is closer to $s_2$ (lines 6-9). We define the center of a partition to be the code that has the minimum total Hamming distance from all the states in the partition. We can calculate the center of each partition using the majority function on all the codes in that partition. After the centers $c_1$ and $c_2$ of the two partitions are computed (line 11), we repartition set $P_S$ based on these new centers and continue if the new partition results in a reduced total Hamming distance (line 13).
This procedure is illustrated in Fig. 2. In the original FSM, state $S$ has six encoded previous states. $S_1$ and $S_6$ have the largest Hamming distance and are put into two partitions. The remaining states from $S_2$ to $S_5$ are partitioned according to lines 7-10 in Fig. 3. Then, the center of the first partition (which contains $S_1$) is calculated as "11110"; the center of the second partition (which contains $S_6$) is "00001." Then, all the states are again assigned based on their Hamming distance to each center. In this round, state $S_4$ is moved from the first partition to the second partition because it has a smaller Hamming distance to the center of the second partition. There will be no more partitioning afterward because the total Hamming distance has reached a minimum.
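The partitioning heuristic described above can be sketched as follows. This is not a transcription of Fig. 3: the function names, the tie-breaking rules (a tied bit in the majority becomes "1"; a state equidistant from both references goes to the first partition), and the six codes in the example are assumptions made for illustration, and the codes are not those of Fig. 2. Distinct codes are assumed.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))


def majority_center(code_list):
    """Bitwise majority of a non-empty list of codes (assumed: ties give '1')."""
    n = len(code_list)
    width = len(code_list[0])
    return "".join(
        "1" if 2 * sum(c[k] == "1" for c in code_list) >= n else "0"
        for k in range(width)
    )


def assign(codes, ref1, ref2):
    """Assign each state to the closer reference code (ties go to the first)."""
    p1 = [s for s, c in codes.items() if hamming(c, ref1) <= hamming(c, ref2)]
    p2 = [s for s in codes if s not in p1]
    return p1, p2


def spread(codes, part, center):
    """Total Hamming distance from the states in `part` to `center`."""
    return sum(hamming(codes[s], center) for s in part)


def partition_previous_states(codes):
    """Greedy two-way partition of the previous states (needs >= 2 states)."""
    # Seed the partitions with the two states that are farthest apart.
    a, b = max(combinations(codes, 2),
               key=lambda p: hamming(codes[p[0]], codes[p[1]]))
    p1, p2 = assign(codes, codes[a], codes[b])
    c1 = majority_center([codes[s] for s in p1])
    c2 = majority_center([codes[s] for s in p2])
    best = spread(codes, p1, c1) + spread(codes, p2, c2)
    while True:
        # Repartition around the current centers; keep it only if it helps.
        q1, q2 = assign(codes, c1, c2)
        if not q1 or not q2:
            break
        d1 = majority_center([codes[s] for s in q1])
        d2 = majority_center([codes[s] for s in q2])
        total = spread(codes, q1, d1) + spread(codes, q2, d2)
        if total >= best:
            break
        p1, p2, c1, c2, best = q1, q2, d1, d2, total
    return (p1, c1), (p2, c2)


# Hypothetical codes for six previous states.
prev_codes = {"S1": "11111", "S2": "11110", "S3": "11101",
              "S4": "10011", "S5": "00011", "S6": "00000"}
result = partition_previous_states(prev_codes)
```

In this example, S4 is first grouped with S1 because it is closer to 11111 than to 00000. After the centers 11111 and 00011 are computed, S4 moves to the second partition, which lowers the total distance from 6 to 5, and a further round brings no improvement.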
D. Genetic Algorithm for Selecting and Splitting States
In this section, we describe the genetic algorithm for state splitting shown in Fig. 4. First, we disqualify all the states with a single previous state (lines 1-3) because splitting a state with only one previous state does not help reduce the cost. Then, we insert all the candidate states into a queue.
A state-splitting scheme is represented by a Boolean vector of the same length as the aforementioned candidate queue. A bit "1" at the ith position of the vector indicates that the ith candidate state is split, and a bit "0" means that the scheme chooses not to split this state. Each vector is referred to as a chromosome. According to each chromosome, we split the states (lines 7-9) and calculate its fitness (line 10), which is defined as the total cost according to that chromosome. The smaller the cost, the better the chromosome. We start with an initial population of N randomly generated chromosomes (line 5). Children are created by a roulette wheel method, in which the probability that a chromosome is selected as one of the two parents is proportional to its fitness (line 13). With a certain ratio, crossover is performed among parents to produce children by exchanging substrings in their chromosomes. A simple mutation operation flips a bit in a chromosome with a given probability known as the bit mutation rate (line 14). When the population pool is full, i.e., the number of new chromosomes reaches N, the algorithm pauses to evaluate the fitness of each individual for the creation of the next generation. This process is repeated MAX_GEN times, and the best chromosome is chosen as the state-splitting strategy.
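A compact sketch of such a genetic search is given below. It is not the algorithm of Fig. 4: the cost function is passed in as a callable that stands in for "split the marked states, reencode, and evaluate the cost"; the roulette weights are assumed to be the gap between a chromosome's cost and the worst cost in the population, so that cheaper chromosomes are selected more often; and all parameter names and default values are hypothetical.

```python
import random


def genetic_split_selection(n_candidates, cost, pop_size=20, max_gen=30,
                            crossover_rate=0.8, mutation_rate=0.05, seed=0):
    """Search for a 0/1 vector over the candidate states that minimizes `cost`."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_candidates)]
                  for _ in range(pop_size)]
    best = min(population, key=cost)
    for _ in range(max_gen):
        costs = [cost(c) for c in population]
        worst = max(costs)
        # Assumed roulette weights: lower cost gives a larger selection chance.
        wheel = [worst - c + 1e-9 for c in costs]
        children = []
        while len(children) < pop_size:
            p1, p2 = rng.choices(population, weights=wheel, k=2)
            child = list(p1)
            if n_candidates > 1 and rng.random() < crossover_rate:
                cut = rng.randrange(1, n_candidates)  # single-point crossover
                child = p1[:cut] + p2[cut:]
            # Flip each bit independently with the bit mutation rate.
            child = [b ^ 1 if rng.random() < mutation_rate else b for b in child]
            children.append(child)
        population = children
        gen_best = min(population, key=cost)
        if cost(gen_best) < cost(best):
            best = gen_best
    return best


# Toy stand-in for the real cost: made-up per-state gains from splitting
# (a negative gain means that splitting this state makes the encoding worse).
gains = [1.5, -0.5, 0.8, 0.0, -1.0, 2.0]


def toy_cost(chromosome):
    return 10.0 - sum(g * b for g, b in zip(gains, chromosome))


scheme = genetic_split_selection(len(gains), toy_cost, pop_size=20, max_gen=30)
```

In a real flow, `toy_cost` would be replaced by a routine that applies the splits encoded in the chromosome and calls the SE algorithm, which is why the cost of each evaluation dominates the run-time of this approach.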
IV. EXPERIMENTAL RESULTS
Our experiments show how the proposed FSM reengineering approach can enhance the performance of a state encoding algorithm. POW3 [2] and JEDI [11], which are two of the most popular SE algorithms for power- and area-driven syntheses, respectively, are used in our experiments. We implemented the FSM reengineering method in C and demonstrated it on a subset of MCNC91 FSM benchmarks. These benchmarks, 25 in total, include all those that can be modeled using Markov chains [20] and, thus, are suitable for state splitting (e.g., their states have multiple incoming edges). The FSM benchmarks are synthesized by SIS [16].
A. Power-Driven Optimization
We use the total switching activity in sequential logic, which is the same cost function as in POW3, as our cost function for power-driven FSM reengineering. We first report both the switching activity in FSMs encoded by POW3 and the reduction by FSM reengineering in Table I.
The second column in the table shows the number of states in each of the 25 FSM benchmarks. The third column shows the number of states split by the reengineering method using the heuristic algorithm (heu) and genetic algorithm (ga), respectively. We notice that, for several benchmarks, the genetic algorithm creates many more split states than the heuristic one. This is because the heuristic state-splitting algorithm is a greedy approach that only makes progress when there is enough reduction in the switching activity, whereas the genetic algorithm works more globally. For example, in s1488, five states need to be split to achieve a 0.2% reduction in the switching activity. The greedy heuristic algorithm, in this case, cannot find any single state whose splitting may lead to a sufficient switching activity reduction. The reengineering method achieves a switching activity reduction of 8.5% using the genetic algorithm (column 5) and 5.7% using the heuristic algorithm (column 6), on average.
In the last three columns, we report the run-times of POW3 and our method. Note that both heuristic and genetic algorithms invoke POW3 encoding in the cost estimation step (see Fig. 1 ). This is acceptable because the run-time of POW3 is small. In case the SE algorithm is time consuming, we can employ some fast heuristic to estimate the switching activity reduction. Overall, the heuristic algorithm takes an average run-time of less than 1 s; the genetic algorithm completes most of the benchmarks within 1 min.
Finally, we report the total power in the synthesized circuits in Table II . This includes the power in sequential and combinational logic. The area and delay overhead in the reconstructed FSMs are shown in columns 5-6 and 7-8, respectively. We see that averages of 5.5% and 3.3% power reduction are achieved in the reengineered FSM using genetic and heuristic algorithms, at the cost of only 1.3% and 0.9% area increase and 1.3% and 0.8% delay increase, respectively. We also note that on three benchmarks (namely ex5, mark1, and planet), the total power reduction is negative after FSM reengineering. This is because in FSM reengineering, we used the state switching activity as a cost function, which only accounts for power in sequential logic. In this case, power increases in combinational logic due to the area increase, and it exceeds the power saving in sequential logic. With a cost function that better reflects the power consumption in the circuits, the FSM reengineering method can achieve a better correlation between optimized cost and real circuit results. A † in the table means that the power value is not available from SIS.
B. Area-Driven Optimization
In this section, we show the enhancement due to FSM reengineering on the area-driven SE algorithm JEDI. Table III reports the number of literals, which is the metric for area optimization in multilevel logic circuits, and the mapped area in the synthesized circuits, as well as the power and delay overhead. The cost function computed based on attrition matrices in the STG model does not correlate very well with the number of literals in the output logic functions, unlike the switching activity, which correlates well with power. This makes genetic-algorithm-based FSM reengineering less effective, since the cost function is used as fitness to produce each new generation in the genetic algorithm. Using the number of literals directly would be more accurate; however, calculating it after each state splitting is computationally expensive. Therefore, we only report the results of the heuristic algorithm. Compared to power reduction, the area reduction results are less impressive due to the less accurate cost function.
V. CONCLUSION
We introduced an FSM reengineering method to improve FSM synthesis. It reduces the effect of locality in a serial synthesis strategy and has lower complexity than concurrent SM and SE. Using the existing power and area cost functions, we have observed improvements over the existing power- and area-driven SE algorithms.
