Abstract
Introduction
Two of the most crucial problems in system LSIs are their long design time and short life cycles. A solution to these problems may be reconfigurable architecture. Reconfigurable LSIs will reduce the hardware development time drastically, since one LSI can be used for various applications.
In this paper, we consider a realization of combinational logic functions by reconfigurable architecture. Various methods exist to realize multiple-output logic functions by reconfigurable architecture. Among them, random access memories (RAMs) and programmable logic arrays (PLAs) directly implement logic functions. However, when the number of input variables n is large, the necessary hardware becomes too large. Thus, field programmable gate arrays (FPGAs) are often used. Unfortunately, FPGAs require layout and routing in addition to logic design. Also, the area for programming and interconnections is much larger than the logic area. Thus, FPGAs require a large chip area.
When the speed of operation is not so important, a general-purpose microprocessor can be used to implement logic functions. However, the microprocessor implementation is often 100 to 1000 times slower than direct circuit realizations. Also, the power dissipation is rather high.
Here, we assume the following applications:
• The system need not be so fast as custom logic circuits, but must be faster than the software realization.
• The system is too large to implement by a PLA or a RAM directly.
• The system must be reconfigurable.
In this paper, we consider a method to implement logic functions by using a sequential network, where the speed must be faster than the conventional software realization. Note that the branching program method requires O(n) computation time. To reduce the instruction fetch time, special sequential machines that traverse the BDD structure have been proposed [20, 6]. In this case, the necessary memory is proportional to the number of nodes in the BDD. If k variables are evaluated at the same time, then the evaluation will be k times faster than the branching program method. This corresponds to using a multiple-valued decision diagram (MDD) instead of a BDD [12, 8].
If we partition the BDD into several pages, and operate each page in parallel, then we have the pipelined architecture [8], which is several times faster than the naive realization.
In this paper, we propose the LUT cascade method, which uses lookup tables (LUTs) as basic logic elements. In this method, a cascade of LUTs is used to implement logic functions, which makes the sequencer simple enough to be implemented by a reconfigurable network. Also, with more memory, we can design a faster system.
In the branching program method, the number of memory references is proportional to the number of input variables. On the other hand, in the LUT cascade method, the number of memory references is equal to the number of levels of the cascade. Experimental results show that the number of levels of the cascade is about one tenth of the number of input variables. Thus, we can expect that the LUT cascade method will be about ten times faster than the branching program method.
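The difference in memory references can be illustrated with a small simulation. The following is a minimal sketch, not the design procedure of this paper: it assumes a function that can be computed by folding a small state over the inputs (here, the number of 1s modulo 3), and the names `build_cascade`, `eval_cascade`, and `step` are hypothetical. Each cell is a lookup table that reads u rail bits from the preceding cell plus fresh primary inputs, and one evaluation costs one table read per cell.

```python
from itertools import product

def build_cascade(n, k, u, step, init=0):
    """Build LUT cells for a function obtained by folding `step` over n input bits.

    Each cell has at most k inputs: the first cell reads k primary inputs,
    every later cell reads u rail bits (the state) and up to k - u fresh inputs.
    Assumption: every state value fits in u bits.
    """
    cells, pos, first = [], 0, True
    while pos < n:
        fresh = min(k if first else k - u, n - pos)
        states_in = [init] if first else range(2 ** u)
        table = {}
        for state in states_in:
            for bits in product((0, 1), repeat=fresh):
                s = state
                for b in bits:
                    s = step(s, b)
                table[(state, bits)] = s
        cells.append((pos, fresh, table))
        pos += fresh
        first = False
    return cells

def eval_cascade(cells, x, init=0):
    """Evaluate the cascade on input x; return (final state, memory references)."""
    state, refs = init, 0
    for pos, fresh, table in cells:
        state = table[(state, tuple(x[pos:pos + fresh]))]
        refs += 1  # one memory reference per cascade level
    return state, refs

# Illustrative parameters (not taken from the paper's benchmarks).
n, k, u = 18, 6, 2
cells = build_cascade(n, k, u, lambda s, b: (s + b) % 3)
x = [1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1]
state, refs = eval_cascade(cells, x)
print("cascade levels / memory references:", refs)
print("branching program memory references (one per variable):", n)
```

With these illustrative parameters the cascade has four levels, so an evaluation needs four memory references where a branching program would need one per input variable.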
As for the amount of memory, the branching program method requires memory that is proportional to the number of nodes in the BDD. On the other hand, the LUT cascade method requires more memory than the branching program method.
Cascade Realization of Logic Functions
In this part, we will show a method to implement a logic by 1x1. 

Figure 2.2: Functional decomposition using BDD.
The column multiplicity of a decomposition chart depends on the partition $X = (X_1, X_2)$ of the input variables. Theorem 2.2 gives tighter bounds on s than Theorem 2.1.
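To see how the column multiplicity changes with the partition, the following minimal sketch counts the distinct column patterns of a decomposition chart directly from a function given as a Python callable. The function name `column_multiplicity` and the example function are assumptions made for illustration; a practical tool would read the multiplicity from the width of the BDD rather than enumerate the chart.

```python
from itertools import product

def column_multiplicity(f, n, bound):
    """Count distinct columns of the decomposition chart of f.

    bound: indices of the variables in X1 (they label the columns);
    the remaining variables form X2 (they label the rows).
    """
    free = [i for i in range(n) if i not in bound]
    columns = set()
    for a in product((0, 1), repeat=len(bound)):
        column = []
        for b in product((0, 1), repeat=len(free)):
            x = [0] * n
            for i, v in zip(bound, a):
                x[i] = v
            for i, v in zip(free, b):
                x[i] = v
            column.append(f(x))
        columns.add(tuple(column))
    return len(columns)

# Example function chosen for illustration: f = x0 x1 OR x2 x3.
f = lambda x: (x[0] & x[1]) | (x[2] & x[3])
mu_good = column_multiplicity(f, 4, [0, 1])  # X1 = {x0, x1}
mu_bad = column_multiplicity(f, 4, [0, 2])   # X1 = {x0, x2}
print(mu_good, mu_bad)
```

For this function the partition with $X_1 = \{x_0, x_1\}$ gives two column patterns, while $X_1 = \{x_0, x_2\}$ gives four, so the first partition needs one wire between the blocks and the second needs two.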
Lemma 2.1 [2, 5] Let the partition of $X$ be $(X_1, X_2)$
Since $\bar{u}$ is hard to obtain, we approximate it by the average value of the logarithm of the widths of all the levels in the BDD. From these relations, we can easily estimate the number of LUTs and the level of the cascade.
Representation of Multiple-output Function
Although the method described in the previous section is useful for a single-output function, it is hard to apply to multiple-output functions. In the case of an m-output function, the number of terminal nodes of the MTBDD [16] can be as many as $2^m$, which may be too large to construct. Also, representations using the characteristic function (CF) of a multiple-output function have been developed [1]. However, in many cases, BDDs for CFs are too large to construct. For this reason, we use the following method to represent a multiple-output function.
(End of Example)
An ECFN is an (n + w)-input single-output function that represents an n-input m-output function by time domain multiplexing. When constructing a BDD for an ECFN, we can reduce the size of the BDD by mixing the auxiliary variables and ordinary input variables. We can also reduce the sizes of BDDs by considering the encoding methods [19].
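The time domain multiplexing idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's definition: the auxiliary variables are taken to be a plain binary code of the output index, unused codes are mapped to 0, and the names `make_ecfn` and `evaluate_all` are hypothetical.

```python
def make_ecfn(outputs):
    """Combine m single-output functions into one function F(x, z).

    z is a list of w auxiliary bits; F(x, z) returns f_i(x), where i is the
    integer encoded by z. Assumptions: binary encoding, unused codes give 0.
    """
    m = len(outputs)
    w = max(1, (m - 1).bit_length())  # number of auxiliary variables

    def F(x, z):
        i = int("".join(str(bit) for bit in z), 2)
        return outputs[i](x) if i < m else 0

    return F, w

def evaluate_all(F, w, m, x):
    """Obtain all m outputs by evaluating F once per auxiliary code."""
    results = []
    for i in range(m):
        z = [(i >> (w - 1 - j)) & 1 for j in range(w)]
        results.append(F(x, z))
    return results

# Three illustrative outputs of two inputs: AND, OR, XOR.
outputs = [lambda x: x[0] & x[1], lambda x: x[0] | x[1], lambda x: x[0] ^ x[1]]
F, w = make_ecfn(outputs)
values = evaluate_all(F, w, len(outputs), [1, 0])
print(w, values)
```

The loop in `evaluate_all` makes the cost visible: a single-output network for F has to be evaluated once for every output of the original function.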
Level Reduction by Output Partition
To evaluate an m-output function by using the network for the ECFN, we have to iterate logic evaluation m times by changing the values of the auxiliary variables. Thus, when m is large, the evaluation time tends to be long. To solve this difficulty, we use a parallel process to make it faster:
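To represent a multiple-output logic function $F = \{f_0, f_1, \ldots, f_{m-1}\}$, partition the output set $F$ into $F_1, F_2, \ldots, F_r$, where $F_1 \cup F_2 \cup \cdots \cup F_r = F$ and the groups are mutually disjoint. Each group has fewer outputs than $F$, so the ECFN for a group needs fewer auxiliary variables and may depend on fewer input variables. Thus, by Theorem 2.2, $n$ and $\bar{u}$ are also decreased. So, in many cases, the levels of the network are also reduced.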
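As a rough illustration of the effect of grouping, the sketch below splits the output indices into r contiguous groups whose sizes differ by at most one. This plain split is an assumption made for the example and is not the partition algorithm of this paper; the names `partition_outputs` and `sequential_evaluations` are hypothetical.

```python
def partition_outputs(m, r):
    """Split output indices 0..m-1 into r groups whose sizes differ by at most one."""
    base, extra = divmod(m, r)
    groups, start = [], 0
    for g in range(r):
        size = base + (1 if g < extra else 0)
        groups.append(list(range(start, start + size)))
        start += size
    return groups

def sequential_evaluations(groups):
    """Evaluations needed per input vector when all groups run in parallel."""
    return max(len(group) for group in groups)

# Illustrative numbers: 10 outputs split into 4 groups.
groups = partition_outputs(10, 4)
print(groups)
print("evaluations without partition:", 10)
print("evaluations with partition:", sequential_evaluations(groups))
```

A partition that also takes the shared structure of the outputs into account could reduce the BDD widths further, which this simple split does not attempt.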
We partition the output set into $F_1, F_2, \ldots, F_r$ so that each group has nearly the same number of elements. If we evaluate them in parallel, then the evaluation will be r times faster. Furthermore, since the levels of the network decrease in many cases, the speed-up can be more than r times. The output partition can be done as
Architecture for Reconfigurable Hardware
The cascade of LUTs shown in Fig. 2.3 can be simulated by the architecture shown in Fig. 5.1. In this architecture, the memory stores the data for LUTs, while the control part (a sequencer) stores the information of the interconnections among LUTs. Since the network structure is very simple, the control part is also simple. We can make the operation fast by using special hardware tailored to the given logic function.
7. If i = s is assigned, stop.
In this way, we can evaluate f by accessing the memory three times. (End of Example)
Memory-Packing
Theorem 5.1 Suppose that an n-variable logic function is realized by the cascade of k-LUTs shown in Fig. 2.3. Let L (bits) be the size of memory available, and let $\mu_{\max}$ be the width of the BDD. Then, we have
$k > u + 1$, $u = \lceil \log_2 \mu_{\max} \rceil$, n + u -2 + u -l ) < L 2 k ( k -1
Experimental Results
*: Contains redundant variables. We used the number of dependent variables to obtain the bounds.
Realization of Cascades
N : Number of LUTs.
The sizes of BDDs for ECFNs are, in most cases, smaller than those of the corresponding MTBDDs and BDDs for CFs. Blank entries show that the BDDs were too large to construct. We optimized the BDD for the ECFN by mixing the input variables and auxiliary variables. We find the ordering of the variables by using a heuristic that reduces the total number of nodes in the QROBDD [16]. Note that this heuristic will reduce $\bar{u}$ in Theorem 2.2. In the table, $\mu_{\max 1}$ denotes the width of the shared BDDs (SBDDs), and $\mu_{\max 2}$ denotes the width of the BDD for the ECFN. $s_1$ and $s_2$ show the lower and upper bounds on the number of levels obtained from Theorem 2.1 and Theorem 2.2, respectively. We can see that $s_2$ is tighter than $s_1$. Also, s denotes the number of levels in a cascade, and N denotes the number of LUTs. In this experiment, encodings of outputs [19] are not optimized.
Comparison with Murgai-Hirose-Fujita's Method
Murgai-Hirose-Fujita [14] have developed a logic simulation system which realizes a given function by using k-LUTs (k = 15). Their paper [14] does not show the numbers of levels of the networks. So, we performed a similar experiment by using MIS-FPGA, and obtained N, the number of LUTs, and s, the number of levels. In this experiment, we used the following script:
> xl_imp -n 2
> xl_partition -n 15
> simplify
> xl_partition -n 15
The results are shown in the last two columns of Table 6.1. In most cases, MIS-FPGA produced networks with more LUTs, but fewer levels. Note that Murgai-Hirose-Fujita [14] use an event-driven method, so the evaluation time is proportional to N, the number of LUTs.
Table 6.2 compares the numbers of LUTs and levels of cascades when the outputs are partitioned into four and eight groups by using Algorithm 4.1. Partitioning the outputs into four groups reduced the number of levels to half. In this case, the parallel evaluation is more than eight times faster than the original one.
Prototype of Reconfigurable Hardware
In order to evaluate the performance of the architecture shown in Section 5, we developed reconfigurable hardware using a commercially available FPGA board as follows:
• FPGA: Altera EPF10K200S
• Clock frequency: 40 MHz
• RAM: Static, 4 MBytes
• Interface: PCI
In this prototype, we implemented neither memory-packing nor output partition.
Comparison with Branching Programs
We converted QROBDDs of benchmark functions into branching programs, and implemented them on a special machine that traverses BDDs. Note that the branching program based on a QROBDD does not require an index, and requires only one memory reference for one variable.
The evaluation time of both the cascade method and the branching program method is proportional to the number of memory references. Since we use the same FPGA board, the ratio of the evaluation time for the branching program method to that for the cascade method is $n + \lceil \log_2 m \rceil$ to $s$. For the functions in Table 6.1, the cascade method is, on the average, 9.25 times faster than the branching program method.
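The ratio above is simple to compute for a given function. The following minimal sketch uses a hypothetical helper name, and the sample values are made up for illustration; they are not entries of Table 6.1.

```python
def memory_reference_ratio(n, m, s):
    """Ratio of memory references: branching program versus LUT cascade.

    n: number of input variables, m: number of outputs,
    s: number of levels of the cascade.
    """
    aux = (m - 1).bit_length()  # ceil(log2 m) auxiliary variables, for m >= 1
    return (n + aux) / s

# Hypothetical function: 32 inputs, 8 outputs, a 4-level cascade.
ratio = memory_reference_ratio(32, 8, 4)
print(ratio)
```

For these sample values the branching program traverses 32 + 3 = 35 variables per evaluation, while the cascade reads the memory 4 times.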
Conclusions
In this paper, we have shown a method to represent a multiple-output logic function by a cascade of k-LUTs. We also developed a reconfigurable hardware consisting of a memory and a sequencer. The features of the method include:
1. The system uses a cascade of LUTs: the hardware is simple to implement. The design consists of iterative decompositions of BDDs for ECFNs.
2. It is faster than branching programs.
3. It uses time domain multiplexing, which reduces the number of output pins.
4. The system uses BDDs for ECFNs, which are smaller than the corresponding SBDDs: the input variables and the auxiliary variables are mixed to reduce the BDDs.
5. Given the size of memory, we can find the best value of k to optimize the hardware.
6. By partitioning the outputs into r groups, the hardware becomes at least r times faster.
In this paper, we only considered the case where the values of k are the same for all the stages of a cascade. However, in general, the value of k can be different for different stages. By using this technique, we can implement larger functions on a smaller memory.
