Abstract
Introduction
Digital filters are important elements in signal processing [9], and can be classified into two types: FIR (Finite Impulse Response) filters and IIR (Infinite Impulse Response) filters. FIR filters have a nonrecursive structure, and so are always stable. Also, FIR filters can have linear phase characteristics, so they are useful for waveform transmission.
To realize FIR filters on FPGAs, we can use Distributed Arithmetic to convert the multiply-accumulate operations into table-lookup operations [7, 13, 14]. For table lookup, embedded memory blocks in FPGAs can be used. In this paper, we propose a method to implement the combinational part of the FIR filter by an LUT cascade, a series connection of memories, shown in Fig. 3.1. LUT cascade realizations require much less memory than a single-memory realization. Our method is useful in the design of FIR filters using embedded memories in FPGAs [2] or dedicated LUT cascade chips [8].
The rest of the paper is organized as follows. Section 2 introduces FIR filters. Section 3 introduces LUT cascades and functional decomposition. Section 4 defines the WS function, and shows its realization by an LUT cascade. Section 5 introduces the arithmetic decomposition of WS functions. Section 6 shows experimental results, and Section 7 concludes the paper.
FIR Filter
Definition 2.1 The FIR filter computes
$$Y(n) = \sum_{i=0}^{N-1} h_i \cdot X(n-i), \tag{2.1}$$
where $X(i)$ is the value of the input X at time i, and $Y(i)$ is the value of the output Y at time i. $h_i$ is a filter coefficient represented by a q-bit fixed-point binary number, and N is the number of taps in the filter.
verter. In this case, the inputs to $h_0, h_1, \ldots, h_{N-1}$ are either 0 or 1, so the multipliers can be replaced by AND gates. In Fig. 2.2, ACC denotes the shifting accumulator, which accumulates the numbers while doing shifting operations. The ACC can be implemented by, for example, the network in Fig. 2.3. This method reduces the amount of hardware to 1/q, but increases the computation time q times. The combinational part in Fig. 2.2 has N inputs and q outputs. This part realizes the WS function, which will be defined later.
In FIR filters that have linear phase characteristics, the coefficients satisfy the relation $h_i = h_{N-i-1}$. Such a filter is symmetric. A symmetric filter can be implemented by Fig. 2.4. It requires less hardware than Fig. 2.2. In this case, we use an $(N+1)/2$-input adder for q-bit numbers. The ⊕ symbol in Fig. 2.4 denotes a serial adder.
In Fig. 2.5, the combinational part is implemented by a ROM. For example, when the number of inputs is three, the ROM stores the precomputed values shown in [...] often used to implement convolution operations, since many multipliers and multi-input adders can be replaced by one memory [1, 3, 6, 7, 13, 14]. This method is applicable only when the coefficients $h_i$ are constants. In FIR filters, the coefficients $h_i$ are constants, so we can apply this method. Note that a ROM with n inputs and q outputs requires $q2^n$ bits.
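The table-lookup idea can be illustrated with a small software model. The following is a minimal sketch, not the circuit of Fig. 2.5: the function names, the integer coefficients, and the restriction to unsigned input samples (which avoids the sign-bit correction needed for 2's complement inputs) are all assumptions made for illustration. The ROM stores, for every input bit pattern, the sum of the selected coefficients, and a shift-and-accumulate loop plays the role of the ACC.

```python
def build_da_rom(coeffs):
    """Precompute, for every input bit pattern, the sum of the selected coefficients."""
    n = len(coeffs)
    rom = []
    for address in range(1 << n):
        # Bit i of the address selects coefficient h_i.
        rom.append(sum(h for i, h in enumerate(coeffs) if (address >> i) & 1))
    return rom


def fir_output_bit_serial(rom, samples, width):
    """Compute sum(h_i * x_i) by feeding one bit plane of the samples per step.

    samples are unsigned integers below 2**width (assumption of this sketch).
    """
    acc = 0
    for bit in range(width):
        address = 0
        for i, x in enumerate(samples):
            address |= ((x >> bit) & 1) << i
        acc += rom[address] << bit  # shift and accumulate, one lookup per bit plane
    return acc


coeffs = [3, -5, 7]   # hypothetical integer-scaled coefficients
samples = [6, 1, 4]   # hypothetical 3-bit unsigned input samples
rom = build_da_rom(coeffs)
direct = sum(h * x for h, x in zip(coeffs, samples))
assert fir_output_bit_serial(rom, samples, width=3) == direct
```

With three inputs the table has $2^3 = 8$ entries, which matches the $2^n$ growth of the ROM size noted above.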
LUT Cascade and Functional Decomposition
This chapter describes a relationship between LUT cascades and functional decompositions. 
. In this case, the column multiplicity is five.
, and let the column multiplicity of the decomposition chart be µ. Then, 
WS Function and Its LUT Cascade Realization
and represent a value as a q-bit binary number. Here, $h_i$ denotes a coefficient represented as a q-bit binary number. We use the fixed-point 2's complement representation, and assume that q includes one bit for the sign. Let the binary representation of $h_i$ be $(h_i)_2$; then F satisfies the relations:
. . .
Note that the WS function is generated first by rounding the coefficients into q bits, and then adding the coefficients.
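To make the notion concrete, the following minimal sketch models a WS function in software and counts the distinct column patterns of its decomposition chart. It is an illustration under stated assumptions, not the paper's construction: the function names and the coefficient values are hypothetical, the coefficients are taken as integers already rounded to q bits, and q-bit 2's complement addition is modeled as wrap-around modulo $2^q$.

```python
from itertools import product


def ws_function(coeffs, q):
    """Return F(x) = (sum of h_i * x_i) mod 2**q, given as a q-bit pattern.

    The modulo models q-bit 2's complement addition (assumption of this sketch).
    """
    mask = (1 << q) - 1

    def F(bits):
        return sum(h for h, x in zip(coeffs, bits) if x) & mask

    return F


def column_multiplicity(F, n, i):
    """Count distinct columns when the first i inputs label the columns
    and the remaining n - i inputs label the rows."""
    columns = set()
    for x1 in product((0, 1), repeat=i):
        column = tuple(F(x1 + x2) for x2 in product((0, 1), repeat=n - i))
        columns.add(column)
    return len(columns)


q = 3
coeffs = [1, -2, 3, 1, -1, 2]   # hypothetical 3-bit 2's complement coefficients
F = ws_function(coeffs, q)
for i in range(1, len(coeffs)):
    # Each column is fixed by the partial sum of the first i inputs modulo 2**q.
    assert column_multiplicity(F, len(coeffs), i) <= 2 ** q
```

In this model each column is determined by the partial sum of its column variables modulo $2^q$, so the count can never exceed $2^q$, whatever the number of inputs.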
The next lemma shows that the column multiplicity of the decomposition chart for any q-output WS function is at most $2^q$.
(Proof) When $i \le q$, the column multiplicity is at most $2^q$. So, we will examine the case of $i > q$. Consider the first row of the decomposition chart, i.e., the row for $X_2 = (0, 0, \ldots, 0)$. The number of different elements is at most $2^q$, since each element of the decomposition chart is a vector of q bits. Thus, for two different vectors $a, b \in \{0, 1\}^i$, there exist two columns that correspond to the assignments $a, b$
where the symbol + denotes the integer addition of binary numbers. Therefore, we have the relation $F(a, c) = F(b, c)$. Since this relation holds for all j > 0, the two column patterns that correspond to the vectors a and b are the same.
From the above, we can show that the column multiplicity of the decomposition chart is at most $2^q$. □
Example 4.1
Theorem 4.2 An arbitrary n-input q-output WS function can be implemented by an LUT cascade of at most
$$\left\lfloor \frac{n-k-1}{k-q} \right\rfloor + 2$$
cells with k inputs and q outputs.
(Proof) When we implement the WS function by the method of Theorem 3.2, we have the LUT cascade shown in Fig. 4.1. Let q be the number of rails in the cascade, and s be the number of cells. Let t be the number of inputs to the final cell, then we have the relations
Since a k-input q-output cell requires $q2^k$ bits, we have the following:
Corollary 4.1 To implement an n-input q-output WS function by an LUT cascade with k-input q-output cells, we need at most
$$q\left(\left\lfloor \frac{n-k-1}{k-q} \right\rfloor + 2\right)2^k$$
bits.
Lemma 4.2 The ROM for an n-input q-output WS function requires $q2^n$ bits.
Corollary 4.2 Assume that an LUT cascade uses k-input q-output cells, then the ratio of the amount of memory for the LUT cascade to that for the ROM for an n-input q-output WS function is
$$\left(\left\lfloor \frac{n-k-1}{k-q} \right\rfloor + 2\right)2^{k-n}.$$
By setting $k = q + 1$ in Corollary 4.2, we have the ratio $(n-q)2^{q-n+1}$. This shows that the larger the value of $n - q$, the greater the memory reduction obtained by using an LUT cascade.
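The memory estimates of this section are easy to evaluate numerically. The sketch below is a hedged illustration: it reads the cell-count bound of Theorem 4.2 as $\lfloor (n-k-1)/(k-q) \rfloor + 2$ (the floor is an assumption of this sketch), the function names are hypothetical, and the parameter values are chosen only as an example.

```python
def cascade_cells(n, k, q):
    """Upper bound on the number of k-input q-output cells (floor form assumed)."""
    if not q < k:
        raise ValueError("each cell needs more inputs than rails: require q < k")
    if n <= k:
        return 1  # a single cell covers all inputs
    return (n - k - 1) // (k - q) + 2


def cascade_bits(n, k, q):
    """Memory of the cascade: each cell stores q * 2**k bits."""
    return q * cascade_cells(n, k, q) * 2 ** k


def rom_bits(n, q):
    """Memory of a single ROM with n inputs and q outputs."""
    return q * 2 ** n


n, q = 16, 8          # illustrative values, not taken from the experiments
k = q + 1
ratio = cascade_bits(n, k, q) / rom_bits(n, q)
assert cascade_cells(n, k, q) == n - q
assert ratio == (n - q) * 2.0 ** (q - n + 1)
```

For these values the cascade needs 8 cells of 4096 bits each, or 32768 bits in total, against 524288 bits for the single ROM, a ratio of 1/16.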
Arithmetic Decomposition of WS Functions
Consider filter realizations where q is the number of quantization bits. Experimental results show that q-output WS functions require (q + 1)-input q-output cells. Thus, when q is large, large embedded memories in an FPGA are required. To implement WS functions with many outputs on a small FPGA, we can decompose WS functions into smaller ones.
A 2q-output WS function can be decomposed into a pair of WS functions as follows: Let $h_i$ be a coefficient of the 2q-output WS function. Then, $h_i$ can be written as
where $h_{Ai}$ denotes the most significant q bits, and $h_{Bi}$ denotes the least significant q bits. In this case, we can implement the 2q-output WS function by using a pair of WS functions and an adder, as shown in Fig. 5.1. Note that the adder has 2q inputs and q outputs.
This is an arithmetic decomposition of a WS function. In a similar way, a 4q-output WS function can be decomposed into four WS functions as follows: Let $h_i$ be a coefficient of the 4q-output WS function. Then, $h_i$ can be written as
where $h_{Ai}$, $h_{Bi}$, $h_{Ci}$, and $h_{Di}$ denote q-bit numbers. As shown in Fig. 5.2, we realize the 4q-output WS function by using four WS functions and adders. Note that block A realizes a q-output WS function, while blocks B, C, and D realize $(q + \lceil \log_2 N \rceil)$-output WS functions. Note that the output adder has 4q inputs and 2q outputs.
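The arithmetic decomposition can be checked with a small integer model. This is a sketch under assumptions rather than the paper's exact formulation: coefficients are treated as integers scaled so that a coefficient equals the sum of its q-bit digits weighted by powers of $2^q$, the lower digits are unsigned while the top digit carries the sign, and the partial sums are kept at full precision instead of being truncated as in Fig. 5.1 and Fig. 5.2. All names and values are hypothetical.

```python
def split_digits(h, q, parts):
    """Split integer h into `parts` q-bit digits, least significant first.

    Lower digits are unsigned; the top digit keeps the sign (arithmetic shift).
    """
    mask = (1 << q) - 1
    digits = []
    for _ in range(parts - 1):
        digits.append(h & mask)
        h >>= q
    digits.append(h)
    return digits


def decomposed_sum(coeffs, bits, q, parts):
    """Evaluate sum(h_i * x_i) from `parts` smaller weighted sums plus shifts."""
    digit_rows = [split_digits(h, q, parts) for h in coeffs]
    total = 0
    for j in range(parts):
        # One smaller WS function per digit position.
        partial = sum(row[j] for row, x in zip(digit_rows, bits) if x)
        total += partial << (j * q)  # the output adder aligns the partial sums
    return total


coeffs = [1234, -567, 89, -4321]   # hypothetical 16-bit integer coefficients
bits = (1, 1, 0, 1)
direct = sum(h for h, x in zip(coeffs, bits) if x)
assert decomposed_sum(coeffs, bits, q=8, parts=2) == direct
assert decomposed_sum(coeffs, bits, q=4, parts=4) == direct
```

Because the lower digits are unsigned, a partial sum over N taps can grow by up to $\lceil \log_2 N \rceil$ bits, which is consistent with the wider outputs of blocks B, C, and D noted above.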
Experimental Results
Method of Experiment
To confirm the theoretical results in the previous chapters, we designed many WS functions for FIR filters by using LUT cascades. To design LUT cascades, we used binary decision diagrams (BDDs) instead of decomposition charts [10, 12] . In this case, the width of the BDD corresponds to the column multiplicity of the decomposition chart. The parameters of the filters are as follows: In total, we designed more than 100 different WS functions. We rounded the filter coefficients by ignoring the lower bits when the (q + 1)-th bit is 0, and by adding 1 to the q-th bit when the (q + 1)-th bit is 1.
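The rounding rule described above can be sketched as follows. This is an illustrative model only: it assumes real-valued coefficients in the range [-1, 1), a q-bit 2's complement result scaled by $2^{q-1}$, and no saturation on overflow. The function name is hypothetical.

```python
import math


def round_to_q_bits(h, q):
    """Round a real coefficient to a q-bit integer scaled by 2**(q - 1).

    One extra bit is kept first; that (q+1)-th bit then decides whether
    1 is added to the q-th bit. Overflow near +1 is not handled here.
    """
    v = math.floor(h * (1 << q))   # value with q + 1 bits, lower bits ignored
    return (v >> 1) + (v & 1)      # add 1 when the (q+1)-th bit is 1


assert round_to_q_bits(0.25, 4) == 2      # 0.25 * 8 = 2.0, (q+1)-th bit is 0
assert round_to_q_bits(0.3125, 4) == 3    # 0.3125 * 8 = 2.5, (q+1)-th bit is 1
assert round_to_q_bits(-0.3125, 4) == -2  # 2's complement: -2.5 goes to -2
```

The same function also shows why a short word length can remove taps entirely: a small coefficient such as 0.001 becomes 0 for q = 8, but 33 for q = 16.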
Mapping to FPGAs
We used Altera Cyclone II FPGAs [2], which contain M4K embedded memory blocks in addition to LUT-type logic elements (LEs). The M4K has 4096 bits and 512 parity bits, and can be configured as memories with different numbers of inputs and different numbers of outputs. We implemented LUT cascades by using M4Ks. The environment of FPGA mapping is shown in Table 6.1.
When the number of quantization bits is less than 16, many of the filter coefficients $h_0, h_1, \ldots, h_{n-1}$ are rounded to zero. Thus, the WS function depends on only a part of the input variables, and the filters do not work properly. So, we have to increase the number of quantization bits. Since a q-output WS function requires (q + 1)-input q-output cells, the implementation requires a large embedded memory in an FPGA. To reduce the size of the memory cells, we used the arithmetic decomposition shown in Fig. 5.2, where q = 4. Since each WS function has at most $q + \lceil \log_2 N \rceil$ outputs, where q = 4 and N = 17, the WS function can be realized with cells with at most $q + \lceil \log_2 N \rceil + 1 = 10$ inputs, as proven in Theorem 4.1.
Table 6.2 shows realizations of filters in the form of Fig. 5.2, where the number of quantization bits is 16. The table shows the size of each cascade, the maximum width of the BDDs, the memory bits for each cascade, the total memory bits, the number of M4Ks, the number of LEs, and the operating frequency. In all cases, cascade realizations reduced the size of memory. A single-memory realization requires $2^{17} \times 16 = 2^{21}$ bits, i.e., 2 Mbits, while LUT cascade realizations require 29 to 40 kbits. These realizations used 4 to 5 % of the total M4Ks of the FPGA, and less than 1 % of the LEs. Note that the target FPGA (Cyclone II) contains 250 M4Ks. Thus, the total amount of memory is about 1 Mbit. So, the single-memory realization does not fit into the target FPGA.
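The memory figures quoted above can be re-derived with a few lines of arithmetic. The following sketch only restates numbers from the text. The variable names are hypothetical, and the ceiling on $\log_2 N$ is an assumption needed to arrive at the stated 10 inputs.

```python
# Single-memory realization: 17 inputs, 16 output bits.
single_rom_bits = 16 * 2 ** 17

# Capacity of the target FPGA: 250 M4K blocks of 4096 data bits each.
fpga_bits = 250 * 4096

# Cell inputs after the arithmetic decomposition with q = 4 and N = 17.
q, N = 4, 17
ceil_log2_n = (N - 1).bit_length()   # ceil(log2(N)) for N >= 2
cell_inputs = q + ceil_log2_n + 1

assert single_rom_bits == 2 ** 21    # 2 Mbits
assert fpga_bits < single_rom_bits   # the single ROM does not fit
assert cell_inputs == 10
```

The available 1,024,000 bits are roughly half of what the single ROM would need, while the reported 29 to 40 kbits of the cascades fit with a wide margin.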
Analysis of Results
The operating frequency is 101 to 109 MHz.
In Fig. 2.4, the variable $x_0$ that corresponds to $h_0$ is placed on the root side of the BDD, and the variable $x_{(N-1)/2}$ that corresponds to $h_{(N-1)/2}$ is placed on the leaf side of the BDD. This initial ordering of the variables produced smaller BDDs, and thus smaller LUT cascades. The WS functions with larger q and larger N tend to have larger
Conclusion
In this paper, we defined the WS function that represents the combinational part of the distributed arithmetic in an FIR filter. Also, we showed methods to realize a WS function by LUT cascades and adders. Major results are
When the number of quantization bits is large, we have to partition the outputs into several groups by arithmetic decomposition.
Also, when the number of taps is large, we have to partition the inputs into groups. For each group, we can implement a WS function, and finally, we can obtain the sum by using an adder [1]. This greatly reduces the necessary amount of memory. Note that LUT cascades can be used for these WS functions and the adder.
In this paper, we defined the WS function as a model of the combinational part of the distributed arithmetic for digital filters. Note that WS functions can be used to implement Discrete Cosine Transform (DCT), Discrete Fourier Transform (DFT) and other convolution operations.
