The multiplier-free design of transforms implemented in LUT-based FPGAs is presented. To fit the bit-level grain size of the FPGA device at the algorithm level, the authors use a modified distributed arithmetic (DA), named adder-based DA, to formulate bit-level transform expressions, and then further minimise hardware cost with the proposed vertical subexpression sharing. For implementation, the required input buffer design is also considered by exploiting FPGA device characteristics and a cyclic formulation. The proposed design can save more than two-thirds of the hardware cost compared with ROM-based DA.
Introduction
Transforms are widely used in digital signal processing applications, such as multimedia, wireless and communication systems, as basic computation units for further processing. Transforms such as the discrete cosine transform (DCT) [1] compute multiple inner products, and the data inputs to the transforms are independent of each other. Direct computation of inner products often uses multipliers. However, since multipliers cost too much silicon area, a more efficient way is to use distributed arithmetic (DA) [2], here called ROM-based DA, to implement these fixed-coefficient computations with ROM tables. Owing to its bit-level reformulation, DA is potentially suitable for the fine-grain architecture of FPGA hardware.
Field programmable gate arrays (FPGAs) are modern logic devices that can be programmed by users to implement their own logic functions; among them, the lookup table (LUT)-based architecture is the most popular. An FPGA chip consists of many configurable k-LUTs (called CLBs in Xilinx FPGA chips), each of which can implement an arbitrary function with up to k inputs. For example, in the Xilinx XC4000 series, k is equal to 5, equivalent to a memory size of 16w × 2b [3]. However, such limited memory size constrains the number of filter taps available to implement ROM-based DA in a single FPGA chip.
Besides, the routing channels in FPGA are often very limited. These limitations often result in inefficient hardware utilisation of the FPGA chip for ROM-based DA.
To eliminate these drawbacks we adopt a modified DA formulation, called adder-based DA, first proposed in [4], to implement these transforms and filtering methods on FPGA chips. The adder-based DA preserves the advantages of the bit-level computation held by the ROM-based DA, but uses an adder network instead of ROM to compute the summation on-line. This bit-slice datapath logic is [...] Although it lacks the advantage of precomputation and storage in ROM, the adder-based DA has the benefit of sharing by scheduling the computation. Here we adopt and further modify the concept of common subexpression sharing [5-7] to efficiently minimise the number of adders. Common subexpression sharing shares common additions and subtractions between different constant coefficient multiplications. Previous techniques based on word-level sharing are not suitable for bit-level grain FPGA devices. Thus, we use DA bit-level input and adopt bit-level sharing, i.e. vertical subexpression sharing. By combining these techniques, inner products can be implemented hardware-efficiently in FPGAs.
Though the above techniques can efficiently reduce hardware cost, transform design still needs special consideration of its input interface circuits. These interface circuits, such as input delay chains and parallel-to-serial (P/S) converters, still entail large area costs for transform designs. Two techniques are proposed to overcome this problem. In the first, we propose a skewed buffer design to implement the P/S converter. In the second, for transforms that can be formulated as filter-like operations, we propose a cyclic formulation that reformulates the transform equations into filter-like operations.
Adder-based DA and subexpression sharing
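In the following, we assume that the size of the inner product is $L$, the word length of the variable input $X$ is $W$, and the word length of the coefficient $A$ is $W_c$.
Adder-based DA constructs an adder network to compute the term $S_j = \sum_{i=1}^{L} A_{i,j} X_i$, where $A_{i,j}$ is the $j$th bit of coefficient $A_i$. The inner product
$$Y = \sum_{i=1}^{L} A_i X_i \qquad (1)$$
then becomes
$$Y = \sum_{j=1}^{W_c} S_j 2^{-j}, \qquad S_j = \sum_{i=1}^{L} A_{i,j} X_i \qquad (2)$$
The major difference between the conventional ROM-based DA and our proposed DA is illustrated in Fig. 1. To reduce the routing requirement we reformulate the above equation into bit-serial form by decomposing $S_j$ and exchanging the summation,
$$Y = \sum_{t=0}^{W-1} \left( \sum_{j=1}^{W_c} S_{j,t} 2^{-j} \right) 2^{-t} \qquad (3)$$
where $S_{j,t} = \sum_{i=1}^{L} A_{i,j} X_{i,t}$ and $X_{i,t}$ is the $t$th bit of $X_i$.
The bit-serial design accumulates and shifts the term $\sum_{j=1}^{W_c} S_{j,t} 2^{-j}$ at each cycle $t$ to obtain the inner product.
The implementation of eqn. 3 depends on both $L$ and $W_c$. Adder-based DA only computes the nonzero bits, so designs with fewer nonzero coefficient bits save more area.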
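The bit-serial accumulation described above can be sketched in a few lines of Python. This is only an illustrative model, not the authors' hardware: it assumes unsigned integer operands and integer bit weights $2^{j}$ and $2^{t}$ rather than fractional ones, and the names `adder_based_da` and `nonzero_bits` are hypothetical.

```python
def adder_based_da(coeffs, inputs, wc, w):
    """Bit-serial inner product sum(A_i * X_i) for unsigned integers.

    Assumes every coefficient is below 2**wc and every input below 2**w.
    """
    # taps[j] lists the inputs whose coefficient has a nonzero bit j;
    # zero bits contribute no adder, as in adder-based DA.
    taps = [[i for i, a in enumerate(coeffs) if (a >> j) & 1]
            for j in range(wc)]
    acc = 0
    for t in range(w):  # one input bit plane per cycle, LSB first
        bits = [(x >> t) & 1 for x in inputs]
        # Adder network: S_{j,t} = sum_i A_{i,j} X_{i,t}, weighted by 2^j.
        plane = sum(sum(bits[i] for i in taps[j]) << j for j in range(wc))
        acc += plane << t  # shift-accumulate over the cycles
    return acc


def nonzero_bits(coeffs, wc):
    """Count nonzero coefficient bits, a rough proxy for adder cost."""
    return sum((a >> j) & 1 for a in coeffs for j in range(wc))


coeffs, inputs = [6, 3, 5], [10, 20, 30]
print(adder_based_da(coeffs, inputs, wc=3, w=5))  # same as 6*10 + 3*20 + 5*30
print(nonzero_bits(coeffs, wc=3))
```

The count returned by `nonzero_bits` is only a proxy: it reflects the observation that coefficients with fewer nonzero bits need a smaller adder network.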
Review of previous common subexpression sharing approaches
Computation of adder-based DA can be further minimised by common subexpression sharing [5-7], which shares the common subexpression among several multiplication-accumulation operations so that the total operation count is reduced. For example, Fig. 2 shows the FIR filter coefficients represented in canonical signed digit (CSD) form. The circled groups of digits have the same subexpression. The filtering operation is 
we can rewrite the filtering operation as
Thus, by sharing the common subexpression w[i], the number of additions is reduced from six to four.
The drawbacks of previous common subexpression sharing techniques are: longer delay word length, additional intermediate delays and word-level input. The longer I/O tap delay word length is due to the transposed direct form architecture. Additional intermediate delays, which arise from sharing between different delayed versions of the input, can easily offset the advantages of subexpression sharing. Word-level input results in word-level subexpressions, which are not suitable for FPGAs.
The proposed vertical subexpression sharing
To eliminate the above drawbacks we propose vertical subexpression sharing. Fig. 3 shows an example of vertical subexpression sharing, where the corresponding computation is $Y = X_1 \times 1011_2 + X_2 \times 0101_2 + X_3 \times 0011_2$. With adder-based DA we can formulate the bit-level output, from which we get the final adder network design shown in Fig. 3. In this figure the carry output of each full adder (FA) in the adder network is routed back to its own carry input for bit-level operation. Vertical sharing ensures bit-level computations and communications and is suitable for bit-level grain FPGAs. Besides, with vertical sharing, the summation of the tap results is shared with the computation of the tap multiplications. Thus a lower total hardware cost is attained.
[Figure: Adder network of adder-based DA with the shift adder part]
To show the advantage of the proposed method we use the library data of the XC4000E-3 [3] for delay and area calculations. For the same example with 8-bit input, the hardware cost to implement Fig. 3 using the proposed approach is three CLBs, and the delay is a 6.25 ns cycle time with eight computation cycles for one output. The word-level design in Fig. 2, using the previous approaches, uses 26 CLBs and needs a 36.5 ns cycle time and one cycle to compute one output. Both designs use ripple carry adders for fair comparison and minimum cost. Compared with previous approaches, the proposed method achieves lower area-time complexity in the bit-level FPGA environment (3 CLBs × 50 ns = 150 CLB·ns against 26 CLBs × 36.5 ns = 949 CLB·ns per output).
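As a minimal sketch of the idea, the example $Y = X_1 \times 1011_2 + X_2 \times 0101_2 + X_3 \times 0011_2$ can be modelled in Python. Reading the coefficient bits column by column, the two least significant columns both contain $X_1 + X_3$, so that sum can be formed once and reused. The grouping chosen here is an assumption made for illustration and need not match the exact network of Fig. 3; the function names are hypothetical, and unsigned inputs are assumed.

```python
def shared_network(b1, b2, b3):
    """One bit-plane of the adder network for coefficients 1011, 0101, 0011."""
    shared = b1 + b3      # addition 1: common term X1 + X3 (columns 0 and 1)
    col0 = shared + b2    # addition 2: column 0 holds X1, X2 and X3
    col1 = shared         # column 1 reuses the shared term, no extra adder
    col2 = b2             # column 2 holds X2 only
    col3 = b1             # column 3 holds X1 only
    # The final shift-adder part combines the columns with their bit weights.
    return col0 + (col1 << 1) + (col2 << 2) + (col3 << 3)


def inner_product(x1, x2, x3, w=8):
    """Bit-serial evaluation over w cycles, one input bit per cycle."""
    acc = 0
    for t in range(w):
        plane = shared_network((x1 >> t) & 1, (x2 >> t) & 1, (x3 >> t) & 1)
        acc += plane << t
    return acc


print(inner_product(200, 17, 255))  # same as 11*200 + 5*17 + 3*255
```

In this sketch the network needs two additions per bit-plane, against three if columns 0 and 1 were computed independently.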
FPGA-based transform design

DA architecture for transforms
To simplify the explanation, we will use the 1-D 8-point DCT shown in Fig. 4 as our transform example. This architecture can also be made suitable for ROM-based DA by replacing the adder network with ROM. The input buffer, or P/S converter, is a shift register chain that converts the word-serial bit-parallel input to word-parallel bit-serial output. The computation result of the adder network is passed through an accumulator to get the final product.
In the following, we will use the Xilinx XC4000-3 [3] to calculate the hardware cost and delay.
Adder network design
The structure of the adder network is like that in Fig. 3 . To implement the adder network, half of the CLB is used for one bit adder and the output D flip-flops (DFFs) in each CLB store the carry output of the bit-serial adder. Most of the interconnections in the adder network are locally connected due to its tree structure.
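[Fig. 4: The DA architecture for 1-D 8-point DCT (input buffer, adder network, output)]
The hardware saving due to the adder network depends on the coefficients. For the 1-D DCT example, we first use the split kernel method, i.e.
$$\begin{bmatrix} Y_0 \\ Y_2 \\ Y_4 \\ Y_6 \end{bmatrix} = \begin{bmatrix} \cos 4\theta & \cos 4\theta & \cos 4\theta & \cos 4\theta \\ \cos 2\theta & \cos 6\theta & -\cos 6\theta & -\cos 2\theta \\ \cos 4\theta & -\cos 4\theta & -\cos 4\theta & \cos 4\theta \\ \cos 6\theta & -\cos 2\theta & \cos 2\theta & -\cos 6\theta \end{bmatrix} \begin{bmatrix} X_0 + X_7 \\ X_1 + X_6 \\ X_2 + X_5 \\ X_3 + X_4 \end{bmatrix}$$
$$\begin{bmatrix} Y_1 \\ Y_3 \\ Y_5 \\ Y_7 \end{bmatrix} = \begin{bmatrix} \cos \theta & \cos 3\theta & \cos 5\theta & \cos 7\theta \\ \cos 3\theta & -\cos 7\theta & -\cos \theta & -\cos 5\theta \\ \cos 5\theta & -\cos \theta & \cos 7\theta & \cos 3\theta \\ \cos 7\theta & -\cos 5\theta & \cos 3\theta & -\cos \theta \end{bmatrix} \begin{bmatrix} X_0 - X_7 \\ X_1 - X_6 \\ X_2 - X_5 \\ X_3 - X_4 \end{bmatrix}$$
where $\theta = \pi/16$, $X_i$ is the input sequence and $Y_i$ is the output sequence. Expanding the coefficients to bit level, scaling by $1/\sqrt{2}$, and combining with vertical subexpression sharing, we need only 16 addition terms for the total computation. The scaling factor is easily removed if the 1-D DCT is applied to a 2-D DCT.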
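The split kernel decomposition can be checked numerically with a short Python sketch. It assumes the standard 8-point DCT-II kernel $\cos((2n+1)k\theta)$ with $\theta = \pi/16$, a $1/\sqrt{2} = \cos 4\theta$ factor on $Y_0$, and no overall normalisation constant; the helper names are hypothetical.

```python
import math

THETA = math.pi / 16


def c(k):
    """cos(k * theta) with theta = pi/16."""
    return math.cos(k * THETA)


# Even-output kernel acts on sums X_n + X_{7-n}; odd-output kernel on differences.
EVEN = [[c(4), c(4), c(4), c(4)],
        [c(2), c(6), -c(6), -c(2)],
        [c(4), -c(4), -c(4), c(4)],
        [c(6), -c(2), c(2), -c(6)]]
ODD = [[c(1), c(3), c(5), c(7)],
       [c(3), -c(7), -c(1), -c(5)],
       [c(5), -c(1), c(7), c(3)],
       [c(7), -c(5), c(3), -c(1)]]


def dct8_split(x):
    """8-point DCT through the two 4x4 split kernels."""
    s = [x[n] + x[7 - n] for n in range(4)]
    d = [x[n] - x[7 - n] for n in range(4)]
    y = [0.0] * 8
    for r in range(4):
        y[2 * r] = sum(EVEN[r][n] * s[n] for n in range(4))
        y[2 * r + 1] = sum(ODD[r][n] * d[n] for n in range(4))
    return y


def dct8_direct(x):
    """Reference: direct evaluation of the unnormalised DCT-II kernel."""
    out = []
    for k in range(8):
        scale = c(4) if k == 0 else 1.0  # 1/sqrt(2) factor on the DC term
        out.append(scale * sum(x[n] * math.cos((2 * n + 1) * k * THETA)
                               for n in range(8)))
    return out


x = [3.0, -1.0, 4.0, 1.0, -5.0, 9.0, 2.0, -6.0]
print(max(abs(a - b) for a, b in zip(dct8_split(x), dct8_direct(x))))
```

The split halves the kernel size: each output needs four coefficient products instead of eight, which is what keeps the bit-level adder network small.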
The vertical subexpression sharing of the DCT coefficients is shown in Fig. 5 for each output. From the tables, the network for outputs $Y_0$, $Y_2$, $Y_4$ and $Y_6$ needs six bit-serial adders and the network for outputs $Y_1$, $Y_3$, $Y_5$ and $Y_7$ needs 10 bit-serial adders. The common subexpression sharing in this transform is different from that in the filter shown in the previous section. First, there is no relationship between the inputs, so we cannot share similar addition patterns between different inputs, such as the two terms of $Y_0$ in Fig. 5. Second, we use the adder-based DA formulation, so all the shared terms are in the vertical direction, and the remaining nonzero bits are fed directly into the final shift-adders without using more adders. This embeds the summation of each tap result in the sharing term and thus saves hardware. Table 1 shows the hardware comparison of the two DA methods for the DCT example. The adder network cost of the proposed design is reduced to just 12.5% of the cost of the ROMs in ROM-based DA. The savings mainly come from the bit-level formulation and subexpression sharing. Also, the two DFFs in the CLB are very useful for constructing the bit-serial adders without increasing cost.
The delay time of the ROM-based DA design, including the final accumulator, is 10.4 ns from the ROM address input to the accumulator output. The delay time of adder-based DA with vertical subexpression sharing is 8.35 ns from the network input to the accumulator output. The cycle counts of the two designs are the same for the same input and output precision, so the proposed design is faster.
Table 2 shows the hardware cost comparison of different approaches for transform length $N$, where $K$ is the number of address bits of each partitioned ROM, $B$ is the word length of the ROM table ($B \ge W_c$) and $\lceil \cdot \rceil$ is the ceiling function. We choose $K = 5$ since one CLB can implement a 32w × 1b RAM. Only the cost of the adder network and the ROM table is shown here, since other parts are identical. The hardware cost of the adder network, which varies according to the coefficient distribution, is estimated by assuming uniformly distributed coefficients. Fig. 6 shows that the proposed adder network requires only 4% of the ROM table cost. Exponential growth of the ROM table cost inhibits its applicability. Fig. 7 shows that the proposed network with vertical subexpression sharing can reduce the hardware cost by up to 70% compared with that without sharing. Compared with previous subexpression sharing, Fig. 8 shows that up to 70% of the hardware cost can be saved, where the proposed design includes the P/S converter cost for fair comparison. The main savings come from the DA bit-level formulation instead of a word-level one.
[Figure: Hardware cost ratio between proposed design and ROM-based DA]
Hardware cost analysis
The hardware complexity of the peripherals in the proposed transform design is: P/S converter:
and S/P converter: $N \times W/2$. The P/S converter uses the skewed buffer design introduced later, and the controller cost is ignored. Fig. 9 shows the overall hardware cost ratio between the proposed design and ROM-based DA, where $B$ is assumed to be $W_c$ and the partition strategy is the same as in the previous paragraph. At least two-thirds of the hardware cost can be saved.
Interface circuit design
Input buffer (P/S converter) design with skewed buffer approach
In previous designs [8-10], direct implementation of the P/S converter with DFFs seems to be the only way for FPGA-based transform designs; these converters cannot be efficiently implemented with RAM as proposed in [11, 12] for filters. However, DFF implementation is very inefficient since the DFF is a very scarce resource in FPGAs (only two DFFs in one CLB). Fig. 10 shows the proposed skewed buffer design. To illustrate the process, Fig. 11 shows the RAM contents for $L = 4$ and $W_c = 4$. Each small block surrounded by double lines is one RAM bank implemented by a CLB. The 4 × 4 regions constitute one RAM block in Fig. 10. Initially, the input is rotate-shifted by a $W_c$-bit barrel shifter and stored in the RAM banks, so that successive inputs are skewed by one bit. The RAM content is then accessed in diagonal order to implement the P/S conversion. After the content is read out, new input can be written to the same address, since the old data is no longer used. This read-write operation can be performed as a half-cycle read followed by a half-cycle write. The read-out data is rotated to match the input. Fig. 12 shows that at least half the CLB count can be saved by using the presented approach.
where $W = \exp(-j2\pi/5)$. By using the approach in [13], a cyclic convolution form can be expressed as:
Fig. 13 shows the transform architecture with cyclic formulation. This design is also an adder-based DA design, but with an added I/O permutation stage to reorder the input for cyclic convolution. Thus, the P/S converter can be efficiently implemented by the RAM-based CLB technique proposed in [11, 12], where one bit in RAM can act as a delay without extra circuitry.
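To illustrate how an I/O permutation turns a prime-length DFT into a cyclic convolution, the following Python sketch reindexes a 5-point DFT with powers of a primitive root (Rader's reindexing). It is an assumption that this matches the formulation of [13]; the generator $g = 2$ and the function names are chosen here for illustration only.

```python
import cmath


def dft_direct(x):
    """Reference N-point DFT with W = exp(-j*2*pi/N)."""
    n = len(x)
    w = cmath.exp(-2j * cmath.pi / n)
    return [sum(x[m] * w ** (m * k) for m in range(n)) for k in range(n)]


def dft5_cyclic(x, g=2):
    """5-point DFT computed as a 4-point cyclic convolution plus the DC term."""
    n = 5
    w = cmath.exp(-2j * cmath.pi / n)
    g_inv = next(h for h in range(1, n) if (g * h) % n == 1)
    perm_in = [pow(g, q, n) for q in range(n - 1)]       # input order 1, 2, 4, 3
    perm_out = [pow(g_inv, p, n) for p in range(n - 1)]  # output order 1, 3, 4, 2
    a = [x[i] for i in perm_in]                          # permuted inputs
    b = [w ** pow(g_inv, q, n) for q in range(n - 1)]    # fixed kernel
    y = [0j] * n
    y[0] = sum(x)                                        # DC output
    for p in range(n - 1):
        conv = sum(a[q] * b[(p - q) % (n - 1)] for q in range(n - 1))
        y[perm_out[p]] = x[0] + conv
    return y


x = [1.0, 2.0, 3.0, 4.0, 5.0]
print(max(abs(u - v) for u, v in zip(dft5_cyclic(x), dft_direct(x))))
```

Because the kernel `b` is fixed and only its alignment with the inputs rotates from one output to the next, the computation has the filter-like structure that lets a RAM-based delay line replace the DFF-based P/S converter.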
The overall cost for a prime length DFT is 
Conclusion
In this paper, we have proposed hardware-efficient FPGA designs and implementations for transforms by considering both the algorithm and architecture levels. At the algorithm level, we use adder-based DA instead of ROM-based DA, and propose a modified common subexpression sharing technique for more hardware sharing. This leads to large savings. The bit-level algorithm formulation matches the bit-level grain of the FPGA well. As for interface considerations, by exploiting the features of the FPGA architecture, we propose two transform designs: a skewed buffer design and a cyclic formulated transform. At least two-thirds of the hardware cost is saved compared with ROM-based DA. In terms of tradeoffs, the first design is suitable for high-speed designs while the second is suitable for area-critical designs.
Acknowledgments
This work was supported by the National Science Council, R.O.C., under grant NSC-87-2215-E009-039.
