A method is proposed targeting a decrease in the number of LUTs in circuits of FPGA-based Mealy FSMs. The method improves hardware consumption for Mealy FSMs with the encoding of collections of output variables. The approach is based on constructing a partition for the set of internal states. Each state has two codes, which diminishes the number of arguments in input memory functions. An example of synthesis is given, along with results of investigations. The method targets rather complex FSMs, having more than 15 states.
Introduction
A lot of digital systems include control units (Baranov, 2008; Gajski et al., 2009) . As follows from the works of Czerwiński and Kania (2013) or Minns and Elliot (2008) , different models of finite state machines (FSMs) are used very often for representing and designing control units. In many practical cases, the model of a Mealy FSM is used for these purposes (Sklyarov et al., 2014; Micheli, 1994) . That is why we choose the Mealy FSM model in this research.
It is very important to diminish the amount of hardware consumed by an FSM logic circuit (Gajski et al., 2009; Czerwiński and Kania, 2013) . Solution methods for this problem strongly depend on specific features of logic elements used for implementing the circuits (Czerwiński and Kania, 2013; Sklyarov et al., 2014) . In our article, we discuss a case when field programmable gate arrays (FPGAs) are used to implement Mealy FSM logic circuits. We chose FPGAs because they are very popular and are used very often for implementing FSM logic circuits (Maxfield, 2004; Grout, 2008) .
It is enough to use only two components of FPGA fabric to implement any logic circuit. These components are logic elements (LEs) and a matrix of programmable interconnections (Altera, 2018; Xilinx, 2018). An LE includes a look-up table (LUT) element, a programmable flip-flop and multiplexers. The LUT is a memory block having S L address inputs and a single output. The LUT can keep a truth table of an arbitrary Boolean function having up to S L arguments. It is possible to bypass the flip-flop of an LE. Consequently, the output of the LE could be either combinational or registered.
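The view of a LUT as a small memory addressed by its inputs can be illustrated with a minimal Python sketch. The helper names `build_lut` and `read_lut` are hypothetical and are not part of any vendor tool; the sketch only models the stored truth table, not the flip-flop or the multiplexers of an LE.

```python
from itertools import product

def build_lut(func, num_inputs):
    """Store the truth table of func; cell k holds the output for address k."""
    table = []
    for bits in product((0, 1), repeat=num_inputs):  # addresses in ascending order
        table.append(int(bool(func(*bits))))
    return table

def read_lut(table, bits):
    """Read the stored output for a tuple of input bits (most significant first)."""
    address = 0
    for b in bits:
        address = (address << 1) | b
    return table[address]

# Illustrative example: a 3-input majority function needs 2**3 = 8 memory cells.
majority = build_lut(lambda a, b, c: (a + b + c) >= 2, 3)
```

A function with more arguments than the LUT has address inputs cannot be stored this way in a single LUT, which is why decomposition becomes necessary.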
The LUT has a rather small amount of inputs (S L ≤ 6) (Altera, 2018; Xilinx, 2018) . This peculiarity leads to applying functional decomposition (Scholl, 2001; Kam et al., 1997; Nowicka et al., 1999) of Boolean functions having more than S L arguments. The decomposition leads to multilevel circuits with complex interconnections. In turn, it leads to increasing the propagation time and power consumption of the circuit (Barkalov et al., 2015) . It is very important to decrease the power consumption for FSM circuits (Kubica and Kania, 2017) as well as for other digital systems (Sajewski, 2017) .
To improve the characteristics of FSM circuits, it is necessary to reduce the number of arguments in Boolean functions representing an FSM logic circuit (Sklyarov et al., 2014) . As a rule, various methods of state assignment are used to solve this problem (Minns and Elliot, 2008; Kam et al., 1997) . JEDI (Lin and Newton, 1989 ) is one of the best among these methods. JEDI is used, for example, in CAD tools such as SIS (Sentowich et al., 1992) and ABC (ABC System, 2018) .
Also, a hardware reduction can be obtained due to a structural decomposition of the FSM circuit (Barkalov et al., 2012). In this case, the designers use methods such as the replacement of logical conditions (Sklyarov et al., 2014; Baranov, 1994), the encoding of collections of microoperations (Sklyarov et al., 2014; Baranov, 1994), and the transformation of object codes (Barkalov and Barkalov, Jr., 2005). These methods are based on the representation of an FSM circuit as a multi-level circuit. Each level of the FSM circuit is represented by a system of additional functions. They are much simpler than the functions implemented by a single-level circuit. The composition of additional functions represents the system of functions of a single-level circuit.
In this article, we propose a design method targeting a hardware reduction in LUT-based Mealy FSMs. The method is based on a three-level structure of an FSM circuit and an encoding of collections of output variables.
Background of Mealy FSMs
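A Mealy FSM is defined as the sextuple S = (X, Y, A, δ, λ, a 1 ) (Baranov, 2008; Micheli, 1994), where X = {x 1 , . . . , x L } is a set of input variables, Y = {y 1 , . . . , y N } is a set of output variables, A = {a 1 , . . . , a M } is a set of internal states, δ is the transition function, λ is the output function and a 1 ∈ A is the initial state.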
The sextuple S can be represented by a state transition table (STT) (Micheli, 1994) . The STT includes the following columns: a m is the current state; a s is the state of the transition; X h is a conjunction of some elements of the set X (or their complements) determining the transition from a m into a s ; Y h is the collection of outputs generated during the transition a m , a s ; h is the number of the transition.
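As an illustration, an STT can be held as a list of rows from which the sets A, X and Y are collected. The sketch below is only one possible representation; the class `Transition`, the function `derive_sets` and the four-row table `toy_stt` are hypothetical and do not reproduce Table 1.

```python
from dataclasses import dataclass, field

@dataclass
class Transition:
    current: str                                    # a_m, the current state
    target: str                                     # a_s, the state of transition
    condition: dict = field(default_factory=dict)   # X_h as {input name: required value}
    outputs: frozenset = frozenset()                # Y_h, outputs generated on the transition

def derive_sets(stt):
    """Collect the sets of states, inputs and outputs used in the table."""
    states, inputs, outputs = set(), set(), set()
    for row in stt:
        states.update((row.current, row.target))
        inputs.update(row.condition)
        outputs.update(row.outputs)
    return states, inputs, outputs

# Hypothetical table with H = 4 rows (the row number h is the list index + 1).
toy_stt = [
    Transition("a1", "a2", {"x1": 1}, frozenset({"y1", "y2"})),
    Transition("a1", "a3", {"x1": 0}, frozenset({"y3"})),
    Transition("a2", "a3", {}, frozenset({"y1"})),
    Transition("a3", "a1", {}, frozenset()),
]
A, X, Y = derive_sets(toy_stt)
```

An empty `condition` models an unconditional transition, and an empty `outputs` set models a transition that generates no output variables.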
Consider the STT of a Mealy FSM S 1 (Table 1) . It has H = 20 rows. The following sets can be derived from Table 1 :
When the set of states is constructed, the state assignment should be executed (Micheli, 1994; Baranov, 1994). During this step, each state a m ∈ A is represented by its code K(a m ) having R bits. The variables T r ∈ T are used for state assignment, where T is a set of state variables. The method of one-hot state assignment is very popular in the FPGA-based design of FSMs (Garcia-Vargas et al., 2007; Tiwari and Tomko, 2004). But if embedded memory blocks (EMBs) are used, then a binary assignment is preferable, when we have
$$R = \lceil \log_2 M \rceil. \quad (1)$$
A register (RG) is used to keep the state codes. It includes flip-flops with the mutual synchronization pulse Clock and mutual clearing pulse Start. As a rule, D flip-flops are used for implementing the RG (Baranov, 2008). To change the content of the RG, input memory functions D r ∈ Φ are used, where Φ = {D 1 , . . . , D R }.
To design an FSM logic circuit, it is necessary to construct a structure table (ST) of the Mealy FSM. It is the extension of the initial STT by the following three columns: K(a m ) is the code of the current state; K(a s ) is the code of the state of transition; Φ h is a collection of input memory functions equal to 1 to load into RG the code K(a s ). The ST forms a basis for deriving the functions
$$\Phi = \Phi(T, X), \quad (2)$$
$$Y = Y(T, X). \quad (3)$$
They are used for implementing the FSM logic circuit. Let us construct the ST for the Mealy FSM S 1 represented by Table 1. Since M = 10, we have R = 4. This yields the sets T = {T 1 , . . . , T 4 } and Φ = {D 1 , . . . , D 4 }. Use the trivial state assignment resulting in the following state codes: K(a 1 ) = 0000, . . . , K(a 10 ) = 1001. These codes are used in the structure table (Table 2).
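The trivial assignment used here can be reproduced with a few lines of Python. This is a minimal sketch; the function names are hypothetical, and the code width is computed with integer arithmetic that equals the ceiling of log2 M for M > 1.

```python
def code_width(num_states):
    """Number of bits R needed for a binary state assignment of M states."""
    return max(1, (num_states - 1).bit_length())

def trivial_assignment(num_states):
    """Give state a_m the binary representation of m - 1 as its code K(a_m)."""
    width = code_width(num_states)
    return {f"a{m}": format(m - 1, f"0{width}b") for m in range(1, num_states + 1)}

codes = trivial_assignment(10)   # M = 10 as in the discussed example
```

For M = 10 this gives R = 4, K(a 1 ) = 0000 and K(a 10 ) = 1001, matching the codes stated in the text.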
An ST is used to derive functions (2) and (3). For example, observe the symbol D 1 in rows 14-16 of Table 2 . This gives the equation
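Functions (2) and (3) depend on terms
$$F_h = A_m X_h \quad (h = 1, \ldots, H). \quad (4)$$
In (4), the symbol A m stands for the conjunction of state variables T r ∈ T corresponding to the code K(a m ) from the row h of the ST. Functions (2) and (3) determine the structure of the LUT-based Mealy FSM U 1 (Fig. 1). We use the symbol LUTer for circuits implemented with LUTs.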
In U 1 , the block LUTerΦ implements the system (2). If a function D r ∈ Φ is generated as an output function of some LUT, then this output is connected with D flip-flops. These flip-flops form a distributed register (RG) keeping the state codes. Pulse Start is used for zeroing the RG. The pulse Clock allows changing the content of the RG. The block LUTerY implements the system (3).
Let us analyse the design methods targeting the hardware reduction in FPGA-based Mealy FSM circuits.
State of the art
Four basic optimization problems arise in the process of FSM design (Sklyarov et al., 2014) . They are the following: (i) a decrease in the chip area occupied by an FSM circuit (the problem of hardware reduction); (ii) a reduction in signal propagation time; (iii) a reduction in power consumption; (iv) an improvement in testability. In this article, we consider the first of these problems. Functions (2) and (3) could depend on up to R + L arguments. Our analysis of the library LGSynth93 (LGSynth93, 1993) shows that for some benchmarks we have L + R > 15. At the same time, we get S L ≤ 6 for modern LUTs (Altera, 2018; Xilinx, 2018) . Therefore, the following condition could be met for FSM U 1 :
If (5) is fulfilled, the problem of hardware reduction arises for a particular FSM. There are four main groups of methods for solving this problem:
(a) the appropriate state assignment (Baranov, 2008; Micheli, 1994; Kam et al., 1997) ;
(b) the functional decomposition of Boolean functions (2) and (3) representing an FSM circuit (Scholl, 2001; Nowicka et al., 1999; Rawski et al., 2005a; 2005b; Sasao, 2011) ;
(c) the replacement of LUTs by embedded memory blocks (Sklyarov et al., 2014; Barkalov et al., 2015; Sutter et al., 2002; Cong and Yan, 2000; Sklyarov, 2000; Garcia-Vargas et al., 2007; Tiwari and Tomko, 2004; Rawski et al., 2011);
(d) the structural decomposition of the FSM circuit (Sklyarov et al., 2014; Barkalov and Titarenko, 2009; Kołopieńczyk et al., 2017).
Known methods of state assignment aim at state codes that diminish the number of arguments in functions (2) and (3). Modern FPGAs have a lot of flip-flops. Therefore the one-hot state assignment is very popular in FPGA-based design (Sklyarov et al., 2014). In this case, we have R = M and only a single variable T r ∈ T forms a conjunction A m (m = 1, . . . , M). This reduces the number of arguments in terms (4). But the results of Sklyarov (2000) show that the binary encoding of states produces better results than the one-hot encoding if M > 10.
It seems to us that JEDI is the best among the known state assignment algorithms (Czerwiński and Kania, 2013). It is distributed with the system SIS (Sentowich et al., 1992). JEDI targets multi-level implementations of FSM circuits. In the case of the input dominant version of JEDI, it maximizes the size of common cubes in functions (2) and (3). The output dominant version of JEDI maximizes the number of common cubes in these functions.
There are different strategies of state assignment used in standard industrial packages. For example, seven different methods are used in the design tool XST of Xilinx (Xilinx, 2018). Among them, there are one-hot, compact, Gray, Johnson and others. It is really difficult to say which would be the best for a particular FSM.
In the case of functional decomposition (Scholl, 2001; Rawski et al., 2005a; 2011; Sasao, 2011) , an original function is broken down into smaller and smaller components. The process is terminated when each component depends on no more than S L arguments. Three main approaches are used for the decomposition: serial, parallel and balanced.
Each step of serial decomposition leads to an increase in the number of circuit levels. In turn, this results in a reduction in the maximum operating frequency of the FSM circuit. In the parallel decomposition, these characteristics are minimized.
The balanced decomposition leads to a solution minimizing disadvantages and maximizing strong sides of the previous two approaches. This approach is used, for example, by the systems DEMAIN (DEMAIN, 2018) and PKmin (PKmin, 2018) .
There are a lot of EMBs in modern FPGA chips (Sentowich et al., 1992) . Using EMBs allows improving characteristics of FSM circuits (Sklyarov, 2000) . A lot of EMB-based design methods can be found in the literature (Sklyarov et al., 2014; Cong and Yan, 2000; Sklyarov, 2000; Garcia-Vargas et al., 2007; Tiwari and Tomko, 2004; Rawski et al., 2005a; 2011) .
All these methods use the property of the configurability of EMBs (Nowicka et al., 1999) . This property allows changing the numbers of cells and their outputs (Grout, 2008) . Consequently, the modern EMBs are very flexible and could be tuned to meet the requirements of a particular FSM.
Let V 0 be the number of memory cells having only a single output. Assume that
In this case, only a single EMB is necessary to implement an FSM logic circuit. Our investigations (Kołopieńczyk et al., 2017) show that the condition (6) is met for 68% of benchmarks from the library LGSynth93 (LGSynth93, 1993).
If (6) is violated, then an FSM circuit could be implemented as: (i) a network of EMBs or (ii) a network of LUTs and EMBs. A survey of various approaches to EMB-based design is provided by Garcia-Vargas and Senhadji-Navarro (2015) . But these methods could be used only if there are "free" EMBs, which are not used for implementing other parts of a digital system.
Our article is connected with structural decomposition of FSM circuits.
In this case, an FSM circuit is represented by several blocks (Barkalov et al., 2015) . Some blocks implement functions different from (2) or (3). We discuss the encoding of collections of output variables (COVs). Let us explain this approach.
Each row of ST includes a COV. The following COVs could be derived from Table 2 :
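The COV Y 1 corresponds to the transition from a 10 into a 1 when no output variables are generated. Therefore, there are Q = 10 different COVs in the case of S 1 . These COVs can be encoded by binary codes having R Q bits, where
$$R_Q = \lceil \log_2 Q \rceil. \quad (7)$$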
Use variables z r ∈ Z for encoding COVs, where |Z| = R Q . This approach leads to the LUT-based Mealy FSM U 2 with a decomposed output block (Fig. 2). In this FSM, the LUTerΦ implements the system (2). The LUTerZ implements functions
$$Z = Z(T, X). \quad (8)$$
The LUTerY implements the functions
$$Y = Y(Z). \quad (9)$$
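The encoding of COVs can be sketched as follows: collect the distinct collections that occur in the rows of the ST, count them, and give each one a binary code. The function `encode_covs` and the sample rows are hypothetical; the codes are assigned in order of first appearance, which is an assumption and not necessarily the assignment used for S 1 .

```python
def encode_covs(rows_outputs):
    """Return ({COV: binary code}, R_Q) for the collections found in the rows."""
    distinct = []
    for outputs in rows_outputs:
        cov = frozenset(outputs)
        if cov not in distinct:          # keep the order of first appearance
            distinct.append(cov)
    q = len(distinct)                    # Q, the number of different COVs
    r_q = max(1, (q - 1).bit_length())   # equals ceil(log2 Q) for Q > 1
    codes = {cov: format(i, f"0{r_q}b") for i, cov in enumerate(distinct)}
    return codes, r_q

# Hypothetical column Y_h of a small structure table (not Table 2).
sample_rows = [{"y1", "y2"}, {"y3"}, {"y1", "y2"}, set(), {"y2", "y4"}]
cov_codes, r_q = encode_covs(sample_rows)
```

With Q = 10 different collections, as in the discussed example, the same computation gives R Q = 4.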
Let us compare FSMs U 1 and U 2 . Two FSMs are called equivalent if they are designed using the same STT. Obviously, there are the same amounts of LUTs in the blocks LUTerΦ in equivalent FSMs U 1 and U 2 . Assume that
$$N \gg R_Q. \quad (10)$$
In this case, the number of LUTs in LUTerZ is much lower than the number of LUTs in LUTerY of U 1 . Assume that
$$R_Q \le S_L. \quad (11)$$
In this case, only N LUTs are sufficient to implement the circuit of LUTerY of U 2 . Obviously, the method should be applied if the number of elements in the block LUTerY of U 1 significantly exceeds the total number of LUTs in the blocks LUTerZ and LUTerY of equivalent U 2 . Our investigations of the library LGSynth93 (LGSynth93, 1993) show that circuits of FSMs U 2 always require fewer LUTs than the circuits of equivalent FSMs U 1 . However, the circuits of U 2 have more structural levels than their counterparts U 1 . This may lead to a decreased performance of FSMs U 2 compared with equivalent FSMs U 1 . An overview of various methods of structural decomposition is presented in the works of Sklyarov et al. (2014) and Barkalov et al. (2015). All the known methods are based on the introduction of additional variables and on reducing the number of functions depending on both state and input variables.
To design an FSM U 2 , it is necessary to transform its initial ST. To do it, the column Y h should be replaced by a column Z h . The column Z h includes variables z r ∈ Z equal to 1 in the code of the COV from the h-th row of the initial ST.
Using (7), we can find that R Q = 4 for FSM U 2 (S 1 ). We use the symbol U i (S j ) to show that an FSM U i is designed using an STT of the Mealy FSM S j . The transformed ST of U 2 (S 1 ) is represented by Table 3.
In this article we propose a design method allowing us to reduce the number of LUTs in blocks LUTerΦ and LUTerZ of Mealy FSM U 2 . The method is based on introducing new variables τ r ∈ T . Let us discuss the proposed method.
Main idea of the proposed method
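Let us find a partition Π A = {A 1 , . . . , A I } of the set A such that the following condition holds for each block:
$$R_i + L_i \le S_L \quad (i = 1, \ldots, I). \quad (12)$$
In (12), the symbol R i stands for the number of additional state variables τ r ∈ T necessary for encoding the states a m ∈ A i . The symbol L i stands for the number of input variables x e ∈ X i determining transitions from states a m ∈ A i . Each state a m ∈ A i is encoded by a code C(a m ) having R i bits, where
$$R_i = \lceil \log_2 (|A_i| + 1) \rceil. \quad (13)$$
It is necessary to have a code showing that a m ∉ A i . This necessity explains 1 in (13). Codes C(a m ) are generated based on codes K(a s ).
There are R 0 variables in the set T , where
$$R_0 = \sum_{i=1}^{I} R_i. \quad (14)$$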
The first R 1 variables are used to encode the states a m ∈ A 1 , the next R 2 variables encode the states a m ∈ A 2 and so on.
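The widths of the class codes can be checked numerically. The sketch below assumes that (13) has the form R i = ⌈log2(|A i | + 1)⌉ and that R 0 is the sum of all R i ; both assumptions agree with the values R 1 = R 2 = 2, R 3 = 3 and R 0 = 7 reported for the example. The function names are hypothetical.

```python
def class_code_width(class_size):
    """R_i under the assumed form of (13): ceil(log2(|A_i| + 1)).

    For a positive integer n, n.bit_length() equals ceil(log2(n + 1)),
    so no floating-point logarithm is needed. The extra code is the one
    reserved for states that do not belong to the class.
    """
    if class_size < 1:
        raise ValueError("a class must contain at least one state")
    return class_size.bit_length()

def total_width(class_sizes):
    """R_0: total number of additional state variables over all classes."""
    return sum(class_code_width(size) for size in class_sizes)

# Class sizes of the discussed example: two classes of 3 states, one of 4.
widths = [class_code_width(size) for size in (3, 3, 4)]
r0 = total_width((3, 3, 4))
```

A class of three states fits into two bits because four codes are available and one of them is reserved; a fourth state forces a third bit.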
Each class A i ∈ Π A determines a structure table ST i .
The set X i ⊆ X includes input variables from the column X h of ST i . The set Z i ⊆ Z includes additional variables from the column Z h of ST i . The set Φ i ⊆ Φ includes input memory functions from the column Φ h of ST i . Let us point out that current states a m ∈ A have codes C(a m ), whereas the states of transitions a s ∈ A have codes K(a s ).
Using this preliminary information, we propose the structural diagram of Mealy FSM U 3 (Fig. 3) . It includes three levels of logic.
Each block LUTeri corresponds to the table ST i (i = 1, . . . , I). The LUTeri generates the systems of functions:
$$\Phi_i = \Phi_i(T_i, X_i), \quad (15)$$
$$Z_i = Z_i(T_i, X_i). \quad (16)$$
In (15) and (16), the symbol T i stands for the subset of T whose elements are used to encode the states a m ∈ A i .
The block LUTerTZ generates variables z r ∈ Z and D r ∈ Φ. Each LUT of this block executes function OR:
In (17) 
The block LUTerT transforms codes K(a s ) into codes C(a s ).
At each instant, only a single LUTeri is "active." This means that there are 1's at some of its outputs. At the same time, there are only 0's at the outputs of other blocks. These blocks are "idle." The following relation is used to show that a block LUTeri is idle:
If (12) is true, then a single LUT is sufficient to implement any function
In this case, there are exactly R + R Q LUTs in the circuit of LUTerTZ.
In this case, there are only R 0 LUTs in the circuit of LUTerT . If conditions (11), (21) and (22) are met, then each function of the blocks LUTerY, LUTerTZ and LUTerT is implemented by a single LUT. Accordingly, this is the best case for applying our approach. In this case, it is important to reduce the number of functions implemented by each block LUTeri. This is possible through finding an appropriate partition Π A . In what follows, we discuss an approach to find this partition.
Construction of partition Π A
The problem can be formulated as follows. It is necessary to find a partition Π A of the set A having a minimum number of blocks I and such that the condition (12) is met for each block. In this article we propose a sequential algorithm to solve this problem. The algorithm minimizes the appearance of the same input variables in different sets X i ⊂ X. In the best case, the following relation takes place:
$$X_i \cap X_j = \emptyset \quad (i \ne j;\ i, j \in \{1, \ldots, I\}). \quad (23)$$
The second evaluation counts the number of variables z r ∈ Z common for Z(a m ) and Z i :
$$|Z(a_m) \cap Z_i|. \quad (25)$$
There are two stages in generating each block. At the first stage, we choose a basic element (BE) for the block A i . We take the state a m ∈ A * as a BE, if the following condition takes place
In (26) 
If there is more than one such state, then we should choose a state with the following property:
If the evaluations (28) are the same for several states from P (A i ), then the state with the minimum value of the subscript is selected.
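The two-stage construction can be illustrated by a simplified greedy sketch. It is not the exact algorithm of this article: the selection formulas are not reproduced here, so the sketch assumes that the basic element is the unassigned state with the most input variables, that the next state is the candidate sharing the most inputs with the block, that ties go to the state listed first, and that a block is valid while R i + L i ≤ S L with R i = ⌈log2(|A i | + 1)⌉. The evaluation based on shared variables z r is omitted. All names and the sample data are hypothetical.

```python
def code_width(class_size):
    # Assumed R_i = ceil(log2(class_size + 1)); one code marks "not in this class".
    return class_size.bit_length()

def build_partition(state_inputs, lut_inputs):
    """Greedy partition of states; state_inputs maps a state to its set X(a_m)."""
    remaining = list(state_inputs)       # keeps the given order for tie-breaking
    blocks = []
    while remaining:
        # Stage 1 (assumed rule): basic element = state with the most inputs.
        base = max(remaining, key=lambda s: len(state_inputs[s]))
        if code_width(1) + len(state_inputs[base]) > lut_inputs:
            raise ValueError(f"state {base} alone violates the LUT input limit")
        block, used_inputs = [base], set(state_inputs[base])
        remaining.remove(base)
        # Stage 2 (assumed rule): add the state sharing the most inputs
        # while the block still fits into a single LUT.
        while True:
            width = code_width(len(block) + 1)
            candidates = [s for s in remaining
                          if width + len(used_inputs | state_inputs[s]) <= lut_inputs]
            if not candidates:
                break
            best = max(candidates, key=lambda s: len(used_inputs & state_inputs[s]))
            block.append(best)
            used_inputs |= state_inputs[best]
            remaining.remove(best)
        blocks.append(block)
    return blocks

# Hypothetical FSM with five states and a LUT with four inputs.
sample = {"a1": {"x1", "x2"}, "a2": {"x1"}, "a3": {"x3", "x4"},
          "a4": {"x3"}, "a5": set()}
partition = build_partition(sample, 4)
```

For this sample the sketch produces the blocks [a1, a2, a5] and [a3, a4], whose input sets {x1, x2} and {x3, x4} do not intersect.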
Let us find the partition Π A for U 3 (S 1 ). The process is represented by Table 4 . The partition is constructed for S L = 5.
Let us explain the columns of Table 4. The column a m contains states of the FSM. The column |X(a m )| contains the number of input variables for each state a m ∈ A. The columns BE i (i ∈ {1, 2, 3}) show the basic element chosen at each stage. The symbol "I" stands for (24), the symbol "II" for (25). The sign ⊕ means that the state from the corresponding row is included in the set A i . The sign "-" means that a m ∉ A * , where a m is a state from the corresponding row. The row A i includes states a m ∈ A i . The states are listed in the order of selection. As follows from Table 4, there are M = 10 steps of selection. As a result, there is a partition
, a 3 , a 6 } and A 3 = {a 1 , a 7 , a 9 , a 10 }. Using Table 3 , one could find the following sets: (Fig. 4) .
Using (13), we find that R 1 = R 2 = 2 and R 3 = 3. It gives R 0 = 7 and T = {τ 1 , . . . , τ 7 }. There are 8 LUTs As follows from Fig. 4 , only x 3 is shared between LUTer1 and LUTer2. Therefore, our approach allows obtaining circuits with more regular connections than, e.g., for FSM U 1 . The same is true for pulses Start and Clock, which are distributed only among the LUTs of LUTerTZ.
Proposed design method and an example of synthesis
In this article, we propose a design method for Mealy FSM U 3 . It includes the following steps:
1. Finding set A from the state transition table.
2. Executing the state assignment.
3. Encoding collections of output variables.
4. Constructing the structure table of FSM U 1 .
5. Constructing the transformed structure table.
6. Constructing the partition Π A .
7. Constructing tables ST i for classes A i ∈ Π A .
8. Finding systems (15) and (16) for each class A i .
9. Finding systems (17) and (18) for LUTerTZ.
10. Constructing tables for LUTerY and LUTerT .
11. Implementing FSM circuit with particular LUTs.
Let us discuss an example of synthesis for Mealy FSM U 3 (S 1 ). We have already executed the first six design steps of this example. Table 2 represents the FSM structure table; Table 3 represents the transformed ST; partition Π A follows from Table 4.
Use state variables τ 1 , τ 2 ∈ T 1 for encoding states a m ∈ A 1 , τ 3 , τ 4 ∈ T 2 for a m ∈ A 2 and τ 5 , τ 6 , τ 7 ∈ T 3 for a m ∈ A 3 . There are state codes C(a m ) shown in Table 5 .
Tables ST 1 -ST 3 are constructed using the transformed ST (Table 3).
For example, Table 6 represents ST 1 . It has H 1 = 7 rows. The following minimized Boolean functions could be derived from it:
Acting in the same manner, it is possible to construct other tables (ST 2 and ST 3 ) and functions. These functions are used to construct systems (17) and (18).
Each LUT of LUTerTZ executes the function OR. It clearly follows from equations (17) . Accordingly, it is a trivial thing to find equations for output functions of LUTerTZ.
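The OR-combination performed by LUTerTZ can be illustrated with a short sketch. It relies on the property stated earlier that only one block LUTeri is active at a time and that idle blocks produce only zeros. The function names and the sample vectors are hypothetical.

```python
def is_idle(block_output):
    """A block is idle when all of its outputs are 0."""
    return not any(block_output)

def combine_blocks(block_outputs):
    """Bitwise OR of equally long output vectors, one vector per block LUTeri."""
    width = len(block_outputs[0])
    result = [0] * width
    for output in block_outputs:
        if len(output) != width:
            raise ValueError("all blocks must drive the same set of functions")
        result = [r | o for r, o in zip(result, output)]
    return result

# Hypothetical snapshot: LUTer2 is active, LUTer1 and LUTer3 are idle.
snapshot = [[0, 0, 0, 0], [1, 0, 1, 0], [0, 0, 0, 0]]
combined = combine_blocks(snapshot)
```

Because the idle blocks contribute zeros, the combined vector is simply the output of the active block.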
Functions (9) depend on variables z r ∈ Z. The corresponding table of LUTerY is shown in Table 7.
This table could be viewed as N truth tables for output functions. Let us point out that all output functions are equal to zero for codes 1010-1111.
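Reading the table of LUTerY as N truth tables can be sketched as follows. The function `output_truth_tables` and the small code table are hypothetical and do not reproduce Table 7; unused codes are left at zero, as the text states for codes 1010-1111.

```python
def output_truth_tables(cov_codes, outputs):
    """Build one truth table per output variable from the codes of the COVs.

    cov_codes maps a binary code string to the set of outputs in that COV.
    Each table is indexed by the integer value of the code.
    """
    width = len(next(iter(cov_codes)))
    tables = {y: [0] * (2 ** width) for y in outputs}
    for code, cov in cov_codes.items():
        for y in cov:
            tables[y][int(code, 2)] = 1
    return tables

# Hypothetical encoding with R_Q = 2 and three COVs; code 11 is unused.
toy_codes = {"00": set(), "01": {"y1", "y2"}, "10": {"y2"}}
tables = output_truth_tables(toy_codes, ["y1", "y2"])
```

Each resulting list is exactly the content of one LUT with R Q address inputs.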
The table of LUTerT is constructed based on the table of state codes C(a m ).
It has columns a m , K(a m ), T , m. In the discussed case, there are M = 10 rows in this table (Table 8). Let us explain how to fill this table. For example, we have C(a 4 ) = 01 (Table 5). So, τ 1 = 0, τ 2 = 1, τ 3 = τ 4 = 0 (a 4 ∉ A 2 ) and τ 5 = τ 6 = τ 7 = 0 (a 4 ∉ A 3 ). All other rows are filled in the same manner. The table of LUTerT corresponds to R 0 tables of LUTs.
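Filling one row of this table amounts to placing the class code C(a m ) into the positions of its own class and zeros everywhere else. A minimal sketch under that reading is given below; the function `extended_code` is hypothetical, and the widths 2, 2 and 3 are the values R 1 , R 2 and R 3 of the example.

```python
def extended_code(class_index, local_code, widths):
    """Concatenate per-class fields; only the field of the own class is non-zero."""
    if len(local_code) != widths[class_index]:
        raise ValueError("the code length must equal the width of its class")
    if set(local_code) == {"0"}:
        raise ValueError("the all-zero code is reserved for states outside the class")
    fields = ["0" * w for w in widths]
    fields[class_index] = local_code
    return "".join(fields)

# State a4 belongs to the first class (index 0) and has C(a4) = 01.
row_a4 = extended_code(0, "01", [2, 2, 3])
```

The result is the bit string τ 1 . . . τ 7 = 0100000, which agrees with the values listed for a 4 .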
To implement an FSM circuit, it is necessary to use standard CAD tools (Altera, 2018; Xilinx, 2018) . They form bit-streams for each LUT based on the technology mapping of the FSM circuit (Maxfield, 2004; Grout, 2008) . We do not discuss this step for our example. 
Results
To investigate the efficiency of the proposed method, we use standard benchmarks from the LGSynth93 library. It includes 48 benchmarks related to the practice of FSM design. These benchmarks are presented in the KISS2 format.
We choose this set of benchmarks because it includes both simple (M < 10) and quite complex (M > 50) FSMs. Also, it is very often used to study the efficiency of various methods of FSM design (Czerwiński and Kania, 2013; Tiwari and Tomko, 2004) .
To use these benchmarks, we applied the CAD tool named K2F. It translates the KISS2 file into VHDL model of an FSM. To synthesize and simulate the FSM, we use the Active-HDL environment. To get the FSM circuit, we use the Xilinx ISE package. Its version 14.1 was used for synthesis and implementation of the FSM for a given control algorithm.
We compared our approach with four other methods, namely: (i) Auto of ISE 14.1, (ii) Compact of ISE 14.1, (iii) JEDI, (iv) DEMAIN. The results of investigations are shown in Table 9 . The system DEMAIN is used for synthesis of combinational circuits. We use this system to decompose the Boolean functions representing circuits of benchmarks FSMs.
For each method, we found two characteristics of benchmark FSMs. They are the number of LUTs in the FSM circuit (columns "LUTs") and the FSM maximum operating clock frequency (column "Freq.") measured in MHz.
The results of summation for both the numbers of LUTs and the frequencies are included in the row "Total." We have taken the summarized characteristics of U 3 as 100%. The row "Percentage" shows the percentage of summarized characteristics with respect to the benchmarks synthesized as U 3 .
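The computation behind the row "Percentage" can be sketched as follows. The totals in the example are invented placeholders chosen only to make the arithmetic visible; they are not values from Table 9, and the function name is hypothetical.

```python
def percentages(totals, reference):
    """Express each total as a percentage of the total of the reference method."""
    ref = totals[reference]
    return {name: round(100.0 * value / ref, 1) for name, value in totals.items()}

# Placeholder totals of LUT counts, NOT taken from Table 9.
placeholder_totals = {"Auto": 120, "JEDI": 110, "U3": 100}
shares = percentages(placeholder_totals, "U3")
```

A value above 100% for the number of LUTs means that the method needs more LUTs than U 3 ; for frequencies, a value above 100% means a faster circuit.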
As can be seen in Table 9, the proposed method allows minimizing the number of LUTs in FSM circuits. To support this conclusion, Table 10 was included. It contains the results for the 10 most complex benchmarks of the library LGSynth93 (LGSynth93, 1993). Our approach requires 19% fewer LUTs in comparison with JEDI and 25% fewer in comparison with DEMAIN. Thus, the gain practically doubled for complex benchmarks of LGSynth93 with respect to the average gain for all benchmarks.
As follows from Table 9 , our approach produces FSMs which are a bit slower than the FSMs produced by U 1 Auto (2%), JEDI (5%) and DEMAIN (5%). But this drawback is diminished for complex benchmarks (Table 10 ). For the complex benchmarks, our approach obtains the operating frequency only 4% lower than JEDI and practically the same as DEMAIN.
To compare the methods, we use Table 9 to construct two line graphs shown in Figs. 5 and 6. Figure 5 shows the difference in the numbers of LUTs for the methods from Table 9. Figure 6 shows the difference of frequencies. In both graphs, the number of states is shown on the x-axis, and the method with the minimum number of LUTs (Fig. 5) or the maximum frequency (Fig. 6) is used as a reference. As can be seen, the differences for benchmarks with up to about 15 states are minor and the models are quite similar (Fig. 5). In this range, model U 3 is not the winner (the differences between U 3 and the smallest one are up to about 5 LUTs). Beyond this range, U 3 is the best model (at the bottom of the chart): for benchmarks with more than 15 states, U 3 uses the lowest number of LUTs.
Differences between models are quite significant and start from about 10 LUTs and end (owing to the lack of benchmarks with more states) at about 75 LUTs.
As can be seen in Fig. 6, the JEDI model is the fastest. The U 3 model is slower by about 50-150 MHz for benchmarks having fewer than 20 states and by 5 to 35 MHz for benchmarks having more than 20 states. One can observe an interesting property of the U 3 model for LGSynth benchmarks with over 20 states: its frequency gap to the fastest model becomes much smaller. Thus, the U 3 model is better for rather complex FSM structures. Let us point out that these conclusions are valid only for LGSynth93 benchmarks and the device XC5VLX30FF324 used for implementing FSM circuits. In the case of FPGA-based design, it is almost impossible to make predictions for the general case. However, it is evident from our investigations that our approach could give better results for FSMs with M > 15.
Conclusion
The paper presents an original approach targeting LUT-based Mealy FSMs. The method is based on structural decomposition of the FSM circuit. As a result, the circuit has three levels of logic. The proposed method also uses an encoding of output variable collections.
The initial structure of additional state variables has been proposed. It leads to a reduction in the number of arguments in input memory functions in comparison with known design methods. As a result, a single LUT is sufficient to implement any function for any sub-table.
