Introduction
The inclusion of embedded memory blocks in FPGA devices has motivated a revival of interest in memory-based implementations. Manufacturers such as Xilinx and Altera are beginning to provide tools to map logic into embedded memory blocks. Recently, many research works have focused on RAM-based implementations of Finite State Machines (FSMs) [1]-[7]. Different approaches have been proposed to improve the performance of these implementations (in terms of speed, area or power consumption) over conventional cell-based implementations [1]-[4]. In addition, as the functionality of a RAM-based FSM is defined by the data stored in the memory, these implementations offer a simple way to modify the functionality dynamically (adding or updating transitions and outputs by means of write operations). This run-time reconfiguration offers some advantages, such as being independent of placement and routing and being applicable even on FPGAs without dynamic partial reconfiguration capabilities [6], [7].
Nowadays, there is great interest in developing hardware implementations of problems traditionally solved by software in order to achieve the high performance demanded. Two prominent examples of such problems are pattern matching and packet routing in networks [8], [9]. The transformation of these problems into equivalent state machines results in FSMs with a very large number of states (from a few thousand to millions). These large FSMs require novel approaches to obtain efficient implementations.
In this application context, the authors propose a new model of state machine called Finite Virtual State Machine (FVSM) to improve the performance of RAM-based FSM implementations. This model is inspired by the memory hierarchy of computer science. To the best knowledge of the authors, the concepts of multiple memory levels and locality of memory references have never been used in RAM-based FSM designs. The proposed model uses a memory hierarchy of two levels and exploits the locality of the FSM state references. However, unlike the cache accesses done by a processor, in which the data may be unavailable, an FSM must ensure the correct value of the output in every clock cycle. In terms of the cache analogy, the proposed approach can be viewed as a cache prefetching scheme where no cache miss is allowed.
The proposed architecture has two memory levels: main and secondary memory. The main memory implements a RAM-based FSM. Taking into account that only a subset of the states is reachable in a given period of time, the FSM is decomposed into different subFSMs, each of which is composed of the states needed for the FSM operation during a different period of time. At each moment, the subFSM stored in main memory is the active one. Secondary memory, on the other hand, stores information about all subFSMs.
The FVSM is designed in such a way that any particular subFSM needed by the evolution of the FSM is dynamically transferred from secondary to main memory before any of its states are reached. Unlike other approaches [7], [9], this transfer is not controlled by an external circuit but by the active subFSM, and it is done without interrupting the proper FSM operation. The aim of the proposed approach is to increase the operation speed by exploiting the fact that the RAM-based implementation of the subFSM in main memory performs better than that of the more complex full FSM. In addition, the model offers some advantages when it is used in reconfigurable applications. Any subFSM (even the active one) can be reconfigured in secondary memory by a host while the active subFSM continues in operation in main memory. So, the reconfiguration process can be done without interrupting the FSM operation.
Definition and Implementation of an FVSM
An FSM is a 6-tuple $(X, Y, S, g, h, s_0)$ where $X$, $Y$, and $S$ are finite sets of inputs, outputs, and states, respectively; $g : S \times X \to S$ is the transition function, $h : S \times X \to Y$ is the output function, and $s_0 \in S$ is the initial state. Let 
and $u : I \times F \to I$ are the transition, output and update functions, respectively. The initial virtual state is $(i_0, f_0)$ where $i_0$ and $f_0$ are the initial instance and frame, respectively. Given the present virtual state $(i, f)$, the next frame and instance are determined by the transition function and the update function, respectively; so, the next virtual state is $(u(i, f), t(i, f, x))$ where $x \in X$. An instance change occurs when the next instance is different from the present one. The virtual state $(i, f)$ is called an update state of $i'$ if $u(i, f) = i'$ and $i' \neq i$. Our goal is to model a given FSM as an FVSM. Let us say that an FSM and an FVSM are equivalent if there exists a partial function $v$ :
. Each instance determines a different subset of FSM states which are stored in the frames of main memory. These (usually non-disjoint) subsets define for each instance a different subFSM where the frames play the role of the FSM states. Figures 1 (a) and 1 (b) show an example of an FSM and an equivalent FVSM. The subFSM defined by each instance is represented as a State Transition Graph (STG) where circles represent frames. In the STG of the instance $i$, the circle representing the frame $f$ is labeled "
The update states are denoted by double-line circles, and transitions involving instance changes by slashed arcs.
Figure 2 shows the general architecture of an FVSM. It is composed of two memories (main and secondary memory) controlled by the same clock signal. Main memory is a simple dual-port memory with asymmetric port widths configured in read-after-write mode [6]. The write port allows the modification of the active subFSM while the FSM is operating through the read port. The transitions of the active subFSM, each composed of the next frame encoding bits and the outputs, are stored in main memory. An instance change consists of transferring from secondary to main memory the data needed to transform the present instance (i.e., the active subFSM) into the next instance. These data, called an instance update, are the transitions of the virtual states in which both instances differ. The address where an instance update must be stored in main memory is called the update address. As the read port operates at transition level, the write port width must be a multiple of the read port width in order to transfer the instance update in a single write operation. The ratio between the widths of the write and read data ports (called the aspect ratio) is equal to the number of transitions of the largest instance update. FPGA resources are very suitable for implementing these asymmetric memories due to the high aspect ratio of distributed RAMs (unlimited) and block RAMs (up to 64 in each Xilinx memory block).
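The asymmetric main memory described above can be illustrated with a minimal behavioural sketch. It is not the authors' implementation: the class name `AsymmetricRam`, the helper `step`, the address layout (frame bits concatenated with input bits) and the stored values are all assumptions chosen for illustration. The narrow read port returns one transition, while the wide write port overwrites `ratio` consecutive transitions in one operation.

```python
class AsymmetricRam:
    """Behavioural model of a simple dual-port RAM with a wide write port."""

    def __init__(self, depth, ratio):
        # depth: number of transition words seen by the read port
        # ratio: aspect ratio (transitions written per write operation)
        assert depth % ratio == 0
        self.ratio = ratio
        self.words = [None] * depth

    def read(self, address):
        # Narrow port: one transition word (next_frame, output).
        return self.words[address]

    def write(self, wide_address, wide_word):
        # Wide port: 'ratio' consecutive transitions in a single operation.
        assert len(wide_word) == self.ratio
        base = wide_address * self.ratio
        self.words[base:base + self.ratio] = wide_word


def step(ram, frame, x, n_inputs):
    """Read the transition of 'frame' under input value 'x' (assumed layout)."""
    return ram.read((frame << n_inputs) | x)


# Hypothetical machine: 4 frames, 1 input bit, aspect ratio 4,
# so one write replaces the two transitions of two adjacent frames.
ram = AsymmetricRam(depth=8, ratio=4)
ram.write(0, [(1, 0), (0, 1), (1, 1), (0, 0)])
next_frame, output = step(ram, frame=1, x=0, n_inputs=1)
```

With one input bit each frame owns two transitions, so an aspect ratio of 4 makes a single write cover two frames, which is the same granularity as in the example of Fig. 1 (f).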
Each secondary memory word is composed of an instance update and its corresponding update address. It is addressed by the instance update selection signal. The operations to load an instance require two clock cycles. In the first one, the instance update and its update address are read from secondary memory by setting the instance update selection signal. In the second one, the instance update is written to main memory at the update address by setting the update enable signal. The read and write operations are pipelined by means of registers, allowing a throughput of one update per cycle. This update process is controlled by the active subFSM and it is carried out simultaneously with the FSM operation. So, the transitions of the FVSM must be extended with the instance update selection and update enable signals. Figure 1 (c) shows an execution trace of the FVSM example. Given any sequence of inputs (in), the sequences of outputs (out) generated by the FSM and its equivalent FVSM are the same. In each clock cycle (clk), the trace shows the FSM present state (ps), the FVSM present frame (pf), the FVSM present instance (pi), and the data contained in the frames of main memory (these data are detailed in Fig. 1 (d); Fig. 1 (e) shows the secondary memory content). In cycle 7, the ue signal enables the write operation of the instance update of $i_0$ to main memory. In cycle 8, the write operation is done. As Fig. 1 (f) shows, each write operation modifies two frames (i.e., four transitions). As main memory operates in read-after-write mode, the output takes the value 0 just in cycle 8, which corresponds to the new virtual state ($s_5$) stored in $f_1$ (the present frame in this cycle) by the update process. The described two-cycle update process is the same in any instance change of any FVSM.
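The two-cycle, pipelined update process can be sketched as a small cycle-level model. This is a simplified illustration under stated assumptions, not the circuit of Fig. 2: the class `UpdatePath`, its `clock` method and the string contents are hypothetical, and the read-after-write behaviour of the read port is not modelled. The sketch only shows that a word read from secondary memory in one cycle is written to main memory in the next, so back-to-back updates proceed at one per cycle.

```python
class UpdatePath:
    """Two-stage transfer of instance updates from secondary to main memory."""

    def __init__(self, main, secondary, ratio):
        self.main = main            # list of transition words (read-port granularity)
        self.secondary = secondary  # list of (update_address, instance_update) words
        self.ratio = ratio          # transitions written per wide write
        self.register = None        # pipeline register between the two memories

    def clock(self, select=None, update_enable=False):
        """Advance one clock cycle."""
        # Stage 2: write the word that was read in the previous cycle.
        if update_enable and self.register is not None:
            address, update = self.register
            assert len(update) == self.ratio
            base = address * self.ratio
            self.main[base:base + self.ratio] = update
        # Stage 1: read the selected secondary-memory word into the register.
        self.register = self.secondary[select] if select is not None else None


main = ["old"] * 8
secondary = [(0, ["a0", "a1", "a2", "a3"]), (1, ["b0", "b1", "b2", "b3"])]
path = UpdatePath(main, secondary, ratio=4)

path.clock(select=1)                      # cycle 1: read update 1
snapshot = list(main)                     # main memory is still untouched
path.clock(select=0, update_enable=True)  # cycle 2: write update 1, read update 0
path.clock(update_enable=True)            # cycle 3: write update 0
```

In this sketch the `select` and `update_enable` arguments stand in for the instance update selection and update enable signals, which in the FVSM are produced by the transitions of the active subFSM rather than by external code.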
In an FPGA, the speed of a distributed RAM decreases with depth because it is composed of smaller RAM components connected via multiplexers and decoders, whose complexity grows with depth. This speed degradation can also be important when using small memory blocks or memory blocks in power-aware design [10]. The critical path of the FVSM architecture is imposed by the memory with the greater depth in read operation (the depth of the main memory in write operation is always less than in read operation, so it has no influence on the critical path). The main memory depth in read operations increases with the number of inputs and the number of frames. The number of frames depends on the number of virtual states of the largest instance and the required aspect ratio. This ratio grows with the number of inputs and the number of virtual states of the largest instance update. The secondary memory depth, on the other hand, grows with the number of instance updates. An optimization algorithm, called the virtualization algorithm, is used to find the FVSM implementation with the highest speed. This algorithm minimizes the maximum depth of both memories (hereinafter called the FVSM depth) taking into account the above-mentioned parameters.
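As a rough illustration of the quantity being minimized, the sketch below computes memory depths from encoding widths. It assumes that each memory is addressed by the concatenation of its encoding bits and the input bits, so that depth is a power of two of the address width; the function names and the numeric values are hypothetical and do not come from the tables of this letter.

```python
def fsm_depth(n_inputs, state_bits):
    """Depth of a conventional RAM-based FSM memory (assumed 2**address_bits)."""
    return 2 ** (n_inputs + state_bits)


def fvsm_depth(n_inputs, frame_bits, update_bits):
    """FVSM depth: the larger of the main (read) and secondary memory depths."""
    main_read_depth = 2 ** (n_inputs + frame_bits)  # frames x input combinations
    secondary_depth = 2 ** update_bits              # one word per instance update
    return max(main_read_depth, secondary_depth)


# Hypothetical parameter values, for illustration only.
plain = fsm_depth(n_inputs=4, state_bits=10)
virtual = fvsm_depth(n_inputs=4, frame_bits=6, update_bits=8)
```

Under these assumed values the FVSM memories are sixteen times shallower than the single FSM memory, which is the kind of reduction that shortens the critical path.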
Virtualization Algorithm
The optimization problem of finding the best FVSM implementation is solved by a branch-and-bound algorithm whose candidate solutions are FVSMs. The algorithm dynamically constructs the tree of candidate solutions, each of which is defined by a subset of FSM states which will be the update states of the FVSM. The fact that the secondary memory has only one read port imposes restrictions on the implementability of the FVSM: not all candidate solutions can be implemented in the proposed architecture. For example, in the FSM of Fig. 1 (a), the states $s_2$ and $s_5$ cannot simultaneously be update states of an FVSM implementable in the proposed architecture because this would require reading two different instance updates from the secondary memory in the same clock cycle (transition from $s_0$ to $s_1$). So, only candidate solutions that are implementable are feasible solutions. The initial candidate solution (the root of the tree) is an FVSM where all FSM states are update states. Each child node of a candidate solution is created by selecting a different state to be deleted from the set of update states. So, the main memory depth of a node is less than or equal to that of its child nodes. This depth is used as a lower bound of the FVSM depth of any descendant node. Therefore, the subtree of a candidate solution can be discarded if its main memory depth is greater than or equal to the FVSM depth of the best known solution.
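The search described above can be sketched generically as follows. This is only an outline under stated assumptions: the function `virtualize` and its three callbacks are hypothetical, and the toy depth and feasibility functions are invented so that the main memory depth never decreases when an update state is removed. How the real depths and the implementability check are computed from an FSM is not covered here.

```python
def virtualize(states, main_depth, secondary_depth, is_implementable):
    """Branch and bound over sets of update states; returns (best_set, best_depth)."""
    best_set, best_depth = None, float("inf")
    root = frozenset(states)        # root: every FSM state is an update state
    stack, seen = [root], {root}
    while stack:
        node = stack.pop()
        bound = main_depth(node)    # lower bound for this node and its descendants
        if bound >= best_depth:
            continue                # discard the whole subtree
        if is_implementable(node):
            depth = max(bound, secondary_depth(node))
            if depth < best_depth:
                best_set, best_depth = node, depth
        for state in node:          # children: delete one update state
            child = node - {state}
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return best_set, best_depth


# Toy model (assumed): fewer update states give larger instances and a deeper
# main memory, while secondary memory grows with the number of update states.
toy_states = {0, 1, 2, 3}
toy_main = lambda u: 4 * (5 - len(u))
toy_secondary = lambda u: 4 * len(u)
toy_ok = lambda u: not {1, 3} <= u  # states 1 and 3 cannot both be update states

best_set, best_depth = virtualize(toy_states, toy_main, toy_secondary, toy_ok)
```

Infeasible nodes are still expanded, since deleting an update state may turn an unimplementable candidate into an implementable one; only the depth bound is used to cut subtrees.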
Each update state determines a different instance which must be loaded at any output transition of this state. An instance is composed of all the virtual states that can be reached from its update state up to any different update state. As the next instance must be loaded after its update state is reached, this state must also be included in the present instance. For example, in Fig. 1 (b), when $i_0$ is the present instance, $i_1$ is loaded after $s_2$ is reached; so, $i_0$ must be composed of the virtual states $s_0$, $s_1$, $s_5$, and $s_2$.
Experimental Results
Table 1 compares the speed of implementations of RAM-based FSM and FVSM architectures. The FSM architecture is parametrized by the number of state encoding bits (seb) and the number of inputs (in). The parameters of the FVSM architecture are in, the number of frame encoding bits (feb), and the number of instance update encoding bits (iueb). Each case of Table 1 corresponds to the parameter values required to implement the same state machine in both architectures. The column imp represents the percentage increase in clock frequency of the FVSM over the RAM-based FSM architecture. Neither architecture has been parametrized by the number of outputs because their performance is not affected by this parameter (the depth of the memories is independent of it). Both architectures were implemented in a Xilinx xc2v8000-4 FPGA using distributed RAM. The results show the important speed improvement that can be achieved by using the FVSM architecture (Table 1 only contains those cases whose improvement is equal to or greater than 20%). As we can see in Table 1, the same state machine could be implemented with different values of the feb and iueb parameters depending on the effectiveness of the virtualization algorithm. Therefore, this algorithm is key to improving the operation speed.
Table 1 Comparison between FSM and FVSM architecture (columns: in, seb, feb, iueb, imp (%)).
In order to demonstrate that the speed improvements of the FVSM architecture shown in Table 1 can be achieved in state machine implementations, a set of 20 FSM testbenches has been virtualized and implemented in the same FPGA device. In almost half the cases, the speed improvement is greater than 33%. For each case, Table 2 shows the number of states (s), the number of transitions (t), the number of inputs (in), the parameters of the FSM and FVSM implementations (seb, feb and iueb) and the speed improvement (imp). This speed improvement depends on the FVSM parameter values reached by the virtualization algorithm for each testbench. The values of the parameters in, seb, feb, and iueb shown in Table 2 correspond to those values shown in bold type in Table 1.
Conclusions and Future Work
The experimental results show that the proposed approach offers a significant improvement in operation speed with respect to conventional RAM-based FSM implementations. In general, larger improvements are obtained for large FSMs, which makes this approach very suitable for hardware implementation of problems traditionally solved by software (like pattern matching and packet routing). Now that the usefulness of the approach has been shown, the authors are studying how to take further advantage of the capabilities of the proposed model. In many cases, the virtualization obtained by the branch-and-bound algorithm is not optimal. New strategies to refine the algorithm are being studied. On the other hand, as explained in Sect. 2, the aspect ratio of an FVSM grows with the number of FSM inputs. Different approaches are being studied to reduce the number of inputs connected to the main memory (such as FSMs with input multiplexing [1] or dummy states [5]). This eases the implementation of FVSMs by using embedded memory blocks, which have a limited aspect ratio. Another approach for reducing the required ratio consists of performing write operations at a clock frequency that is a multiple of that of read operations. This allows the reduction of the write data port width.
Furthermore, the authors are considering the use of more than one port in secondary memory (e.g., Xilinx true dual-port memory blocks). This architecture improvement would relax the restriction imposed by the single port on the implementability of FVSMs. The number of update states could then be higher and thus the instances would be smaller. Therefore, a speed increase would be obtained. In addition, the power consumption of an FVSM implementation can be reduced with respect to the conventional RAM-based implementation if the secondary memory is only enabled when a transfer is required. Based on this strategy, the authors are studying the use of the FVSM in low-power design. Moreover, the authors are studying the addition of a new level of hierarchy to the model, stored in external off-chip memory, in order to extend the capacity of the target device. This could allow the implementation of large FSMs that exceed the capacity of the target device.
