Two methods are proposed for reducing the number of look-up table (LUT) elements in the logic circuits of compositional microprogram control units (CMCUs) with code sharing. The methods assume that field-programmable gate arrays (FPGAs) are used to implement the combinational part of the CMCU, whereas embedded memory blocks (EMBs) are used to implement its control memory. Both methods are based on the existence of classes of pseudoequivalent operational linear chains in the microprogram to be implemented. Conditions for applying the proposed methods and design examples are shown. Results of the conducted experiments are given.
Introduction
Very often a system includes a control unit (CU) to coordinate the interplay of all system blocks (Navabi, 2007). The particular model chosen to represent a control unit depends strongly on the peculiarities of the microprogram to be implemented (Barkalov and Titarenko, 2008).
If the number of operator vertices of a control algorithm is at least twice the number of its operational linear chains (OLCs), then the model of the compositional microprogram control unit (CMCU) can be used to interpret this control algorithm (Adamski and Barkalov, 2006). Let us point out that the CMCU model can be used in any digital system, not only in computers. Field-programmable gate arrays (FPGAs) are widely used for the implementation of control units (Navabi, 2007; Maxfield, 2004). As a rule, these devices include a large number of look-up table (LUT) elements with a very limited number of inputs (Altera, 2010; Xilinx, 2010), as well as configurable embedded memory blocks (EMBs).
Reducing the amount of hardware in the logic circuit of a control unit remains a task of great importance (Maxfield, 2004; Kam et al., 1998; Kania, 2004; Solovjev and Klimowicz, 2008). Its solution decreases such characteristics as the cost of the circuit, the number of chips, power consumption and so on (Micheli, 1994). In the case of an FPGA, this problem can be solved by decreasing the number of input variables in each of the functions to be implemented (Barkalov and Titarenko, 2008). The limited number of inputs per LUT (up to six) makes functional decomposition of the implemented functions necessary (Scholl, 2001). In turn, this slows down the control unit because the number of levels in its combinational part increases. The number of levels can be decreased by using EMBs to implement some parts of a control unit (Borowik et al., 2007). This approach is used in the CMCU, too (Barkalov and Titarenko, 2008; Titarenko and Bieganowski, 2009). It is known that in the case of finite-state machines (FSMs) (Baranov, 2008), an appropriate state assignment (Micheli, 1994; Barkalov et al., 2006; Czerwiński and Kania, 2004; Escherman, 1993) is an effective tool for hardware optimization. In the case of the CMCU, only its model with code sharing offers such a possibility. In this article we propose two solutions to the hardware reduction problem for the CMCU with code sharing implemented using FPGA chips based on LUT elements and EMBs.
Background of CMCU with code sharing
Let a microprogram to be implemented be represented by a graph-scheme of algorithm (GSA) (Baranov, 2008) with the set of vertices $B = \{b_0, b_E\} \cup E_1 \cup E_2$ and the set of arcs $E$. Here $b_0$ is an initial vertex, $b_E$ is a final vertex, $E_1$ is a set of operator vertices and $E_2$ is a set of conditional vertices. An operator vertex $b_q \in E_1$ contains a collection of microoperations $Y(b_q) \subseteq Y$, where $Y = \{y_1, \ldots, y_N\}$ is a set of microoperations. A conditional vertex $b_q \in E_2$ contains some element $x_e \in X$, where $X = \{x_1, \ldots, x_L\}$ is a set of logical conditions.
A. Barkalov et al.
Let the set C = {α 1 , . . . , α G } be formed for a GSA Γ, where α g ∈ C is an operational linear chain. An OLC α g ∈ C is a sequence of operator vertices such that each pair of its adjacent components corresponds to some arc from the set E. Each OLC α g ∈ C has only one output O g and an arbitrary number of inputs (Adamski and Barkalov, 2006) .
We call a GSA Γ linear if the number of its operator vertices is at least twice the number of its operational linear chains.
Let each vertex $b_q \in E_1$ correspond to the microinstruction $MI_q$ with address $A(b_q)$, and let this address have $R$ bits, where
$$R = \lceil \log_2 M \rceil, \quad M = |E_1|. \tag{1}$$
Let each OLC $\alpha_g \in C$ include $F_g$ components, and let $|C| = G$. Let $Q = \max(F_1, \ldots, F_G)$. Encode each OLC $\alpha_g \in C$ by the binary code $K(\alpha_g)$ using variables $\tau_r \in \tau$, where $|\tau| = R_1$ and
$$R_1 = \lceil \log_2 G \rceil. \tag{2}$$
Encode each component of the OLC $\alpha_g \in C$ by the binary code $K(b_q)$ using variables $T_r \in T$, where $|T| = R_2$ and
$$R_2 = \lceil \log_2 Q \rceil. \tag{3}$$
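The bit widths defined above are easy to check numerically. The following minimal Python sketch computes R, R1 and R2 and tests whether R1 + R2 = R, the relation the design examples later use to decide that code sharing makes sense. The function names and the chain lengths are hypothetical; the lengths are merely chosen to agree with the values M = 21, R = 5 and the 3-bit OLC codes quoted for Γ1, and the sketch assumes that every operator vertex belongs to exactly one OLC.

```python
def code_bits(count: int) -> int:
    """Return ceil(log2(count)) for count >= 1, using exact integer arithmetic."""
    return (count - 1).bit_length()


def code_sharing_widths(chain_lengths):
    """Compute R, R1, R2 from the list of OLC lengths F_1, ..., F_G."""
    m = sum(chain_lengths)   # M: assumes each operator vertex lies in exactly one OLC
    g = len(chain_lengths)   # G: number of OLCs
    q = max(chain_lengths)   # Q: length of the longest OLC
    r, r1, r2 = code_bits(m), code_bits(g), code_bits(q)
    return {"M": m, "R": r, "R1": r1, "R2": r2, "code_sharing": r1 + r2 == r}


# Hypothetical chain lengths (8 chains, 21 operator vertices, longest chain 4).
widths = code_sharing_widths([3, 3, 3, 3, 2, 2, 1, 4])
print(widths)
```

With many very short chains the sum R1 + R2 exceeds R, which is the situation in which code sharing would waste address bits.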
The encoding of components should be executed in such a manner that the condition
$$K(b_{g,i+1}) = K(b_{g,i}) + 1 \quad (i = 1, \ldots, F_g - 1) \tag{4}$$
is met for each OLC $\alpha_g \in C$, where $b_{g,i}$ denotes the $i$-th component of $\alpha_g$. If the condition
$$R_1 + R_2 = R \tag{5}$$
holds, then the GSA Γ can be interpreted by the CMCU with code sharing $U_1$ (Fig. 1). In the CMCU $U_1$, the address of the microinstruction corresponding to component $b_q$ of the OLC $\alpha_g \in C$ is represented as
$$A(b_q) = K(\alpha_g) \,\&\, K(b_q), \tag{6}$$
where & is the concatenation operator. The block of input addresses (BIA) generates the input memory functions used to load a component code into the counter CT and an OLC code into the register RG, respectively. The control memory CM keeps the microoperations $y_n \in Y$ and the special variables $y_0$ and $y_E$. If $y_0 = 1$, then the current content of CT is incremented; otherwise, both CT and RG are loaded from the BIA. The first case corresponds to a transition from any component of the OLC $\alpha_g$ except its output. The second case corresponds to a transition from the OLC output. If $y_E = 1$, then the flip-flop TF is cleared, Fetch = 0, and the operation of the CMCU is terminated. This corresponds to the vertex $b_E$ of the GSA. The pulse Start is used to load zero codes into both RG and CT, which corresponds to the address of the first microinstruction. At the same time, the flip-flop TF is set, Fetch = 1, and microinstructions can be read from CM. The pulse Clock is used for timing the CMCU.

The CMCU $U_1$ can be viewed as a Moore FSM in which the chains $\alpha_g \in C$ correspond to the internal states of the FSM. The codes $K(\alpha_g)$ do not depend on the codes of components. Therefore, all well-known state assignment methods can be used for the optimization of the BIA circuit. It is shown by Kołopieńczyk (2008) that for linear GSAs the model of the CMCU with code sharing always requires fewer LUT elements than the classical Moore FSM. In this paper, we propose two approaches to reducing the number of LUT elements in the logic circuit of the BIA block.
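The addressing behaviour described above can be sketched as a small behavioural simulation. This is not the authors' circuit: `run_cmcu`, the memory layout and the `next_entry` callback (a stand-in for the BIA that ignores the logical conditions X) are hypothetical names introduced only for illustration.

```python
def run_cmcu(control_memory, next_entry, r2_bits, max_steps=100):
    """Trace the microinstruction addresses visited by a CMCU with code sharing.

    control_memory: dict mapping address -> (microoperations, y0, yE)
    next_entry:     callable(rg) -> (new_rg, new_ct); hypothetical stand-in for the BIA
    r2_bits:        width R2 of the component code held in the counter CT
    """
    rg, ct = 0, 0                       # pulse Start loads zero codes into RG and CT
    trace = []
    for _ in range(max_steps):
        address = (rg << r2_bits) | ct  # address = OLC code concatenated with component code
        microops, y0, y_e = control_memory[address]
        trace.append((address, microops))
        if y_e:                         # yE = 1: fetching stops (vertex bE reached)
            break
        if y0:                          # y0 = 1: stay in the same OLC, increment CT
            ct += 1
        else:                           # y0 = 0: OLC output, reload RG and CT from the BIA
            rg, ct = next_entry(rg)
    return trace


# Hypothetical microprogram: OLC 0 has three components, OLC 1 has two.
memory = {
    0b000: (("y1",), 1, 0),
    0b001: (("y2",), 1, 0),
    0b010: (("y3",), 0, 0),             # output of OLC 0
    0b100: (("y4",), 1, 0),
    0b101: (("y5",), 0, 1),             # output of OLC 1, followed by the final vertex
}
trace = run_cmcu(memory, next_entry=lambda rg: (1, 0), r2_bits=2)
print([address for address, _ in trace])
```

Inside a chain only the counter changes, so the BIA is consulted once per OLC rather than once per microinstruction.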
Proposed approaches
Our approaches are based on the existence of pseudoequivalent OLCs in the GSA Γ. Recall that the OLCs $\alpha_i, \alpha_j \in C$ are pseudoequivalent if their outputs are connected with the input of the same vertex (Adamski and Barkalov, 2006). Let $\Pi_C = \{B_1, \ldots, B_I\}$ be the partition of the set $C$ into classes of pseudoequivalent OLCs. Let us construct a set $\Pi_0 \subseteq \Pi_C$, where $B_i \in \Pi_0$ if the outputs of the OLCs $\alpha_g \in B_i$ are not connected with the final vertex $b_E$. Encode each class $B_i \in \Pi_0$ by a binary code $K(B_i)$ with $R_3$ bits, where
$$R_3 = \lceil \log_2 |\Pi_0| \rceil. \tag{7}$$
Reduction in the number of LUT elements for control units with code sharing
Let variables $z_r \in Z$ be used for such encoding, where
$$|Z| = R_3. \tag{8}$$
Let Condition (9) hold for all OLCs $\alpha_g \in C_1$, where $C_1 \subseteq C$ is the set of OLCs from classes $B_i \in \Pi_0$. It was shown by Kołopieńczyk (2008) that the number of LUTs needed to implement the BIA can be reduced by about 50% when the codes $K(B_i)$ are used to make transitions instead of OLC addresses. The results of Kołopieńczyk (2008) were compared with the base structure of a compositional control unit with code sharing. The method presented by Kołopieńczyk (2008) generates the codes $K(B_i)$ using a combinational circuit called the block of code transformer (BCT) (Fig. 2(a)). In this paper, we propose to store the codes $K(B_i)$ in control memory in order to remove the BCT partially (Fig. 2(c)) or completely (Fig. 2(b)).
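The partition into pseudoequivalent classes amounts to grouping the OLCs by the vertex that follows their output. A minimal sketch is given below; the OLC and vertex labels are hypothetical, and the codes are assigned in a trivial (enumeration) order rather than by any optimizing assignment.

```python
from collections import defaultdict

FINAL_VERTEX = "bE"  # label used here for the final vertex of the GSA


def pseudoequivalent_classes(output_target):
    """Group OLCs whose outputs are connected with the input of the same vertex."""
    classes = defaultdict(list)
    for olc, target in output_target.items():
        classes[target].append(olc)
    return dict(classes)


def encode_classes(classes):
    """Keep the classes not leading to the final vertex and give each a binary code."""
    pi0 = [members for target, members in classes.items() if target != FINAL_VERTEX]
    r3 = (len(pi0) - 1).bit_length()          # ceil(log2(number of classes))
    width = max(r3, 1)                        # print at least one digit
    codes = {tuple(members): format(i, f"0{width}b") for i, members in enumerate(pi0)}
    return r3, codes


# Hypothetical GSA: the vertex reached from the output of each OLC.
targets = {"a1": "x1", "a2": "x2", "a3": "x2", "a4": "b9",
           "a5": "b9", "a6": "b9", "a7": "x3", "a8": "bE"}
r3, codes = encode_classes(pseudoequivalent_classes(targets))
print(r3, codes)
```

Here eight OLCs need 3-bit codes, while the four classes that can make transitions need only 2 bits, which is the source of the reduction in the number of BIA inputs.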
In our methods we utilize free areas of control memory that usually remain unused because an embedded memory block can be organized only in a few fixed configurations. For example, in the Xilinx Virtex-II Pro family, the embedded memory block can be organized as 16K × 1 bit, 8K × 2 bits, 4K × 4 bits, 2K × 9 bits, 1K × 18 bits, or 512 × 36 bits (Xilinx, 2010). There are two possible ways of putting the codes $K(B_i)$ into control memory: by adding control microinstructions (microprogram expansion) or by adding a field to each microinstruction (microinstruction extension). Let us denote a circuit with microprogram expansion by $U_2$ and a circuit with microinstruction extension by $U_3$. The microinstruction formats used in $U_2$ and $U_3$ are shown in Fig. 3.
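To illustrate where the free bits come from, the sketch below picks the narrowest of the block configurations listed above that can hold a given control memory and reports how many output bits stay unused. The function name and the sample sizes are hypothetical, and the selection rule (narrowest fitting word) is an assumption made for the example.

```python
# (depth in words, width in bits) for the configurations quoted in the text
EMB_CONFIGS = [(16384, 1), (8192, 2), (4096, 4), (2048, 9), (1024, 18), (512, 36)]


def pick_configuration(words_needed, bits_needed):
    """Return (depth, width, free_bits) for the narrowest configuration that fits,
    or None when a single block cannot hold the control memory."""
    fitting = [(depth, width) for depth, width in EMB_CONFIGS
               if depth >= words_needed and width >= bits_needed]
    if not fitting:
        return None
    depth, width = min(fitting, key=lambda config: config[1])
    return depth, width, width - bits_needed


# Hypothetical control memory: 32 microinstructions of 15 bits each.
print(pick_configuration(32, 15))
```

For this sample the 1K × 18 organization is chosen and three output bits of every word are free, so a class code of up to three bits could be stored at no extra memory cost.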
The most significant bit of each format contains the value of the variable $y_0$. If $y_0 = 1$ (Fig. 3(a)), then the microinstruction contains the field $FY$ with the code of the collection of microoperations to be executed. If $y_0 = 0$ (Fig. 3(b)), then the microinstruction is a control microinstruction containing a field with the class code $K(B_i)$. In the CMCU $U_2$, the block BIA implements the systems
$$\Phi = \Phi(Z, X), \tag{10}$$
$$\Psi = \Psi(Z, X), \tag{11}$$
while the other elements of $U_2$ have the same meaning as their counterparts for $U_1$. The functions (10)-(11) are generated when the concatenation of the contents of RG and CT represents the address of a control microinstruction. In this case, the data-path of the controlled digital system is in the idle state. This can be achieved if the synchronization of the data-path is controlled by the variable $y_0$. In this article, we propose a synthesis method for $U_2$ which includes the following steps:
1. Construction of the sets $C$, $C_1$, $\Pi_C$ and $\Pi_0$ for the graph-scheme of algorithm Γ.
2. Including an additional component into each OLC $\alpha_g \in C_1$.
3. Encoding of the OLCs $\alpha_g \in C$, the OLC components and the classes $B_i \in \Pi_0$.
4. Construction of the control memory content.
5. Construction of the transition table of the CMCU $U_2$.
6. Implementation of the CMCU logic circuit.
This approach can be applied only if Condition (9) is satisfied. Otherwise, it leads to an increase in the value of R 2 , the violation of (5), and code sharing makes no sense (Adamski and Barkalov, 2006) . If Condition (9) is violated, we propose to use the CMCU U 3 with the extended microinstruction format shown in Fig. 3(c) .
If $y_0 = 1$, then the word corresponds to an operational microinstruction (OMI), and the content of the field $FB$ is ignored. If $y_0 = 0$, then the word corresponds to the output of a particular OLC $\alpha_g \in C_1$. In this case, the system data-path executes the microoperations represented by the field $FY$, and the CMCU transition depends on the code $K(B_i)$ kept in the field $FB$.
In the CMCU U 3 , all elements have the same meaning as their counterparts for U 2 . The only difference between U 2 and U 3 is in the organization of their control memory blocks. We should point out that the number of inputs for LUT elements of the BIA is decreased if R 3 < R 1 . This can result in a decrease in the numbers of LUT elements and their levels in the circuit of the BIA in comparison with the CMCU U 1 .
Our analysis of CMCU U 2 and U 3 shows that the latter requires more bits in an output word of its control memory. In the case of one-hot encoding of microoperations (Adamski and Barkalov, 2006) , the CMCU U 2 requires control memory with
bits. At the same time, this value for U 3 is determined as
The number 2 in both (12) and (13) is added to take into account the bits for keeping the additional bits y 0 and y E . Let us point out that such components of the CMCU as the BIA, CT, RG and TF are implemented using LUT elements, whereas control memory is implemented using EMBs of the same FPGA chip. These blocks have a fixed number of outputs denoted here as t. In reality, t = 1, 2, 4, 8, 16 (Maxfield, 2004) . This means that
bits are free and can be used to represent the code K(B i ).
If the condition
$$R_4 \geq R_3 \tag{15}$$
is violated, we can use a code transformer to implement $(R_3 - R_4)$ bits of the code $K(B_i)$. This leads to the CMCU $U_4$ (Fig. 2(c)), where the block of code transformer implements some bits of the code $K(B_i)$.
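The split of the class code between control memory and the code transformer can be sketched as follows. The helper names are hypothetical, and the way free bits are counted (padding of the word up to the next multiple of t outputs) is an assumption rather than a formula taken from the text; it is used here because it reproduces the value R4 = 1 quoted later for Γ2 when a word of N + 2 = 15 bits is stored in chips with t = 4 outputs.

```python
def free_bits(word_bits: int, t: int) -> int:
    """Unused output bits when a word is stored in memory blocks with t outputs each
    (assumed rule: the word is padded to the next multiple of t)."""
    return -word_bits % t


def split_class_code(r3: int, r4: int):
    """Return (bits of K(B_i) kept in control memory, bits left to the code transformer)."""
    in_memory = min(r3, r4)
    return in_memory, r3 - in_memory


r4 = free_bits(13 + 2, t=4)      # N = 13 microoperations plus y0 and yE (assumed word layout)
print(r4, split_class_code(2, r4))
```

With R3 = 2 and R4 = 1 one bit of the class code is stored in memory and one is produced by the code transformer, which agrees with the sets Z = {z1} and V = {v1} used in the second design example.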
In the case of the CMCU $U_4$, the BIA implements the systems
$$\Phi = \Phi(Z, V, X), \tag{16}$$
$$\Psi = \Psi(Z, V, X), \tag{17}$$
and the BCT implements the system
$$V = V(\tau). \tag{18}$$
The other components of $U_4$ have the same meaning as their counterparts for $U_3$. Let us point out that the variables $z_r \in Z$ represent the $R_4$ leftmost bits of the code $K(B_i)$, whereas the variables $v_r \in V$ represent the remaining
$$R_5 = R_3 - R_4 \tag{19}$$
bits. Obviously, the CMCU $U_4$ is reduced to $U_3$ if Condition (15) is true. In this article, we propose a synthesis method for $U_4$ which includes the following steps:
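1. Construction of the sets $C$, $C_1$, $\Pi_C$ and $\Pi_0$ for the GSA Γ.
2. Encoding of the OLCs $\alpha_g \in C$, their components and the classes $B_i \in \Pi_0$.
3. Construction of the control memory content.
4. Construction of the transition table of the CMCU $U_4$.
5. Construction of the table of the BCT.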
Examples of the application of the proposed methods
Let a microprogram to be implemented be represented by a GSA Γ 1 (Fig. 4) . Applying the approaches of Barkalov and Titarenko (2008) , the following sets can be found for the GSA Γ 1 :
Because $M = 21$ and $R = 5$, Condition (5) holds and code sharing makes sense. Because $\alpha_8 \notin C_1$, Condition (9) is satisfied and all the OLCs $\alpha_g \in C_1$ can be modified. After the modification, we have the OLC
Let us encode the OLCs $\alpha_g \in C$ in an arbitrary manner, namely, $K(\alpha_1) = 000, \ldots, K(\alpha_8) = 111$. Let the code 00 be assigned to the first component of any OLC $\alpha_g \in C$, the code 01 to the second, the code 10 to the third, and the code 11 to the fourth.

Then, the memory cell with this address includes a binary code corresponding to $y_7$, $y_8$, $y_E$. Finally, the memory cell with the address 01011 corresponds to $MC_4$. Because this control microinstruction follows the vertex $b_{10}$, which is the output of an OLC $\alpha_g \in B_2$, this memory cell includes the code $K(B_2) = 01$. The contents of all other cells can be found in the same manner.
The transitions from the outputs of the OLC α g ∈ C 1 can be represented by the following system of transition formulae (Baranov, 2008) : As follows from (20), the outputs of the pseudoequivalent OLC α g ∈ B i are described by one line of the system. This allows replacing the outputs of OLC α g ∈ B i by the corresponding class B i ∈ Π c . This leads to the system of generalized transition formulae:
The system of transition formulae can be transformed into a transition table of the CMCU with the following columns:
Here Ψ h (Φ h ) is the collection of input memory functions that are equal to 1 for the h-th transition of the CMCU (h = 1, . . . , H) . The transition number h is determined by a conjunction of some logical conditions X h (h = 1, . . . , H). In our example, H = 10 and the table of CMCU transitions is represented by Table 1 . 
This table is used to derive the systems (10) and (11). The following equations can be derived, for example, from Table 1 :
The first term in function D 1 corresponds to the lines 5-7 of Table 1 , and the only term in function D 5 corresponds to the lines 9 and 10 of Table 1. The implementation of the CMCU U 2 logic circuit is reduced to that of the systems (10)-(11) using LUT elements and the implementation of its control memory using EMBs of FPGA chips. Some industrial or academic packages such as SIS, Xilinx ISE or Altera Quartus II (Altera, 2010; Xilinx, 2010; Sentovich et al., 1992) can be used to solve this problem.
Let us discuss a slightly different case, in which a microprogram is represented by a GSA $\Gamma_2$ (Fig. 6). Applying the approaches of Barkalov and Titarenko (2008), the following sets can be found for the GSA $\Gamma_2$: $C = \{\alpha_1, \ldots, \alpha_7\}$,
As we can see, the GSA Γ 2 includes N = 13 different microoperations.
The analysis of the GSA Γ 2 shows that R 1 +R 2 = R, thus the application of code sharing makes sense. Because the condition (9) is violated for the OLC α 2 , α 4 , α 6 ∈ C 1 , the model U 3 should be applied. Let q = 3, t = 4, where t is the number of PROM outputs. It can be found that R 3 = 2, n 2 = 17, R 4 = 1 and the condition (15) is violated. This means that the model U 4 (Γ 2 ) should be used, where V = {v 1 }, Z = {z 1 }. Let us discuss this design example.
Let K(α 1 ) = 000, . . . , K(α 7 ) = 110, and let the code 00 be assigned to the first component of any OLC α g ∈ C, . . ., the code 11 to the fourth component of any OLC α g ∈ C. Now microinstruction addresses for the CMCU U 4 (Γ 2 ) are shown in Fig. 7 .
Let us encode the classes B i ∈ Π 0 in a trivial way (Table 2) . As follows from the analysis of the GSA Γ 2 ,
The part of the control memory content for the OLC α 2 ∈ B 2 is shown in Table 3 . Using the same rules as in the case of the CMCU U 2 (Γ 1 ), the following system of the generalized transition formulae can be constructed for the GSA Γ 2 :
This system is used to construct the transition table of the CMCU U 4 (Γ 2 ) with H 4 (Γ 2 ) = 11 lines (Table 4) . 
This table is used to derive the systems (16) and (17). After minimization, the following equations can be found, for example, from Table 4 :
The table of the BCT includes the following columns:
In our example, this table has G 4 (Γ 2 ) = 3 lines (Table 5 ). 
In the ordinary case, G 4 (Γ i ) is equal to the number of the OLC α g ∈ B i , where K(B i ) includes v r = 0 (r = 1, . . . , R 5 ). This table is used to derive the system (18). After minimization, we can get the following equation from Table 5 :
The implementation of the CMCU U 4 logic circuit is reduced to that of the systems (16)- (18) using LUT elements and that of control memory using EMBs of FPGA chips. Some industrial or academic packages (Altera, 2010; Xilinx, 2010; Sentovich et al., 1992) can be used to solve this problem.
Analysis of the proposed methods
The proposed methods were tested using Xilinx ISE 12.1. The synthesis was performed for a Xilinx Virtex 5 FPGA (xc5vlx30-3ff324) and a Xilinx XC9500 CPLD (xc9536-5PC44) with area optimization (opt mode area level 1). The outcomes of our research are shown in Table 6. The number of slices (column $S_F$) and the cycle time (column $T_C$) were taken from the synthesis reports of Xilinx ISE. We decided to measure results in slices because the methods presented in Table 6 use both LUTs and flip-flops. We assume that a slice is utilized when a LUT or a flip-flop in that slice is utilized. The circuit $U_5$ with the Moore FSM was created in VHDL according to the guidelines of Xilinx (2006). The control memory for all CMCU models in Table 6 was implemented using one EMB.
All tested models were generated using original software developed by the authors. The input to our application is a file in the KISS format (Yang, 1991), and the output is an FSM or a CMCU model in VHDL. The test files (linear GSAs) were taken from Kołopieńczyk (2008).
Comparing the results for the CMCU models, we can see that the proposed methods give, on average, about a 50% reduction in area (using the same number of flip-flops and embedded memory blocks) in comparison with the base model. The results are similar to those presented by Kołopieńczyk (2008). In comparison with the Moore FSM with "compact" state encoding, the area can be reduced on average by about 70%. We should recall here that these results were obtained for synthetic linear GSAs and can vary in real circuits.
It is noteworthy that these methods can also be applied in the case of complex programmable logic devices (CPLDs). In this case, the logic circuit of the BIA is implemented using programmable array logic (PAL) macrocells, and control memory can be implemented using PROM/RAM chips external to the CPLD chip. This is possible because the proposed method decreases the number of transition table lines from $H_1(U_1)$ to $H_2(U_2)$, where
$$H_1(U_1) = \sum_{B_i \in \Pi_0} |B_i| \, E_i, \tag{24}$$
$$H_2(U_2) = \sum_{B_i \in \Pi_0} E_i. \tag{25}$$
In (24)- (25), E i is the number of transitions from the output of any OLC α g ∈ B i , where B i ∈ Π 0 . For GSA Γ 1 , we have H 1 = 24 and H 2 = 10.
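The effect of replacing OLC outputs by their classes on the size of the transition table can be sketched in a few lines. The sketch assumes that the base model needs one table line per transition of every OLC (so a class with |B_i| chains contributes |B_i| · E_i lines), whereas the class-based model needs E_i lines per class; the class sizes and transition counts below are hypothetical and do not describe Γ1.

```python
def transition_table_lines(classes):
    """classes: list of (number of OLCs in the class, transitions per OLC output).

    Returns (lines when every OLC is listed separately, lines when classes are used)."""
    per_olc = sum(size * transitions for size, transitions in classes)
    per_class = sum(transitions for _, transitions in classes)
    return per_olc, per_class


# Hypothetical partition: three classes with 2, 3 and 1 pseudoequivalent OLCs.
h1, h2 = transition_table_lines([(2, 3), (3, 2), (1, 4)])
print(h1, h2)
```

The saving grows with the size of the classes; a partition in which every class holds a single OLC gives no reduction at all.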
To estimate the hardware amount of the BIA, let us use the following symbols: q is the number of terms in a PAL macrocell, E i (D r ) is the number of the terms in function D r ∈ Φ ∪ Ψ for CMCU U i , Q i (D r , q) is the number of macrocells required for the implementation of function D r ∈ Φ ∪ Ψ of CMCU U i using PAL macrocells with q terms:
The formula (26) is based on results presented by Kania (2004) . From Table 1 we obtain that:
Then $Q_2(D_r, 3) = 1$ for $r = 1, 2, 4, 5$ and $Q_2(D_3, 3) = 2$. Thus, in the case of the CMCU $U_2(\Gamma_1)$, the logic circuit for the BIA block needs $Q_2(\Gamma_1) = 6$ macrocells and has two levels.
If we construct the transition table for the CMCU U 1 (Γ 1 ), we will find that
and $Q_1(\Gamma_1) = 17$ macrocells. Let us point out that this circuit has two levels, too. The following conclusion can be drawn: the transition from $U_1(\Gamma_1)$ to $U_2(\Gamma_1)$ decreases the hardware amount by a factor of 2.8. Of course, more cycles are needed to execute the algorithm $\Gamma_1$ using $U_2$ than in the case of $U_1$. In this article, we do not estimate this delay for either the FPGA or the CPLD implementation.

Let us estimate the number of PAL macrocells with $q = 3$ in the logic circuits of the CMCU $U_4(\Gamma_2)$ and $U_1(\Gamma_2)$. The values of $E_4(D_r)$ follow from Table 4, and from Table 5 it can be found that $E_4(v_1) = 2$. Using (26), we can determine $Q_4(D_1, 3) = Q_4(D_5, 3) = Q_4(v_1, 3) = 1$ and $Q_4(D_r, 3) = 2$ for $r = 2, 3, 4$. Thus, $Q_4(\Gamma_2) = 9$ macrocells.
If we construct the transition table for the CMCU U 1 (Γ 2 ), we will find that
for r = 2, 3, 4. Thus Q 1 (Γ 2 ) = 15 macrocells. Therefore, the combinational part of the CMCU U 4 (Γ 2 ) is implemented using 1.67 times fewer PAL macrocells. Let us point out that cycle times for both CMCU U 1 (Γ 2 ) and U 4 (Γ 2 ) are the same. Besides, they use the same number of PROM chips with t = 4 outputs to implement their control memories.
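The macrocell totals and ratios quoted above can be re-derived with a short sketch. The per-function counts are the ones stated in the text; the `macrocells` helper is an assumption, namely the commonly used estimate for PAL macrocells with q terms in which every additional macrocell spends one term on the cascaded partial result, and it is not necessarily identical to formula (26).

```python
def macrocells(terms: int, q: int) -> int:
    """Estimated number of q-term PAL macrocells for a function with `terms` terms
    (assumed expansion rule: 1 cell up to q terms, then q - 1 new terms per extra cell)."""
    if terms <= q:
        return 1
    return -(-(terms - q) // (q - 1)) + 1   # integer ceiling of (terms - q) / (q - 1), plus 1


# Per-function macrocell counts stated in the text for q = 3.
u2_gamma1 = {"D1": 1, "D2": 1, "D3": 2, "D4": 1, "D5": 1}
u4_gamma2 = {"D1": 1, "D2": 2, "D3": 2, "D4": 2, "D5": 1, "v1": 1}

q2_gamma1 = sum(u2_gamma1.values())          # macrocells of the BIA in U2 for the first example
q4_gamma2 = sum(u4_gamma2.values())          # macrocells of the BIA and BCT in U4 for the second
print(q2_gamma1, round(17 / q2_gamma1, 1))   # 17 macrocells are reported for U1 in the first example
print(q4_gamma2, round(15 / q4_gamma2, 2))   # 15 macrocells are reported for U1 in the second
```

Under the assumed rule, a function with two terms fits in one macrocell (as stated for v1), while functions with four or five terms need two.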
Conclusion
The aim of the proposed methods is to reduce the number of LUT elements in the combinational part of compositional microprogram control units implemented with FPGA. These methods are based on the existence of classes of pseudoequivalent operational linear chains. The encoding of these classes allows a reduction in the number of variables in each of input memory functions. At the same time, these methods allow reducing the number of terms in input memory functions. Therefore the proposed method can be applied in the case of a CPLD, too.
The first approach introduces additional control microinstructions. If Condition (9) holds, each OLC can be transformed in this way. This leads to the optimization of hardware, but it also has a negative effect: additional idle cycles appear in the data-path of the digital system. The second approach is based on the inclusion of the class code into the microinstruction. It does not affect the performance, but sometimes an additional block of code transformer is needed, so the hardware amount can be larger than for the first approach. Also, note that the second approach is more flexible because it does not depend on Condition (9) and can be applied to any GSA.
Our experiments show that the proposed methods always produce logic circuits with fewer LUT elements compared to the use of the known model of the CMCU with code sharing U 1 . These methods produce CMCU circuits with less hardware, when they are implemented with a CPLD, too. This shows once more that reduction methods targeted at CPLDs can be applied for the FPGA and vice versa. This was first mentioned by Kania (2004) . Recall that these methods are useful only if a microprogram to be implemented is represented by a linear GSA.
The proposed methods can be applied to a linear GSA with many long chains of operations. Such long chains can be found in many computational tasks, e.g., the MD5 (Rivest, 1992) and SHA (Eastlake and Jones, 2001) algorithms. The MD5 algorithm has 64 unconditional steps, and each step depends on the results of previous calculations. This means that the operations cannot be executed in parallel, and the only way to improve the throughput is to use a pipelined data-path (Jarvinen et al., 2005). We should note here that the control units described in our article execute microinstructions sequentially, but the microoperations for the data-path are executed in parallel. One microinstruction can activate an unlimited number of microoperations in parallel. A sequential control unit does not limit the data-path to be sequential.
