In this paper, several bit-serial, high-order implementations of cascade, lattice and direct-form FIR filters using Distributed Arithmetic (DA) are studied. Although lattice and cascade structures present many interesting properties related to quantization error and stability, their DA versions have not been thoroughly compared. The three types of filter, each with its particular bit-serial DA error model, have been built using an ALTERA 10K50 FPGA, and their area-time figures are analysed. The main results show that a 60th-order bit-serial cascade or direct-form implementation running at nearly 4 MHz and a 40th-order lattice structure running at 7.5 MHz can be implemented. Moreover, in contrast to the first two structures, the lattice filter presents the lowest quantization error.
INTRODUCTION
Distributed Arithmetic (DA) is a well-known method [1], [2] to save resources in the multiply-and-accumulate (MAC) structures used to implement DSP functions. This arithmetic trades memory for combinational elements, which makes it ideal for implementing custom DSP (CDSP) in LUT-based FPGAs [3]. In addition to choosing a DA implementation, the designer can select anything from a bit-serial to a fully parallel implementation to trade bandwidth for resource utilisation [4].
Cascade and lattice structures present several interesting properties, such as low quantization error and high stability in their coefficients. Moreover, lattice cells can be easily expanded without a full redesign [5]. The goal of this paper is therefore to implement FPGA-based direct-form, cascade and lattice high-order FIR filters by using bit-serial DA. The resulting topologies are compared in both area and speed. Pipeline techniques and scalable parameters are included in the designs by using a hardware description language. Finally, DA error models of the three structures are described. To the best of our knowledge, the FPGA-related technical literature has mainly studied DA direct-form filters [6], [7]. Furthermore, most of these works do not implement high-order filters.
In the next section, DA fundamentals and the proposed architectures for each kind of filter are reviewed. In Section III the results of the FPGA implementation of the structures are presented. Finally, an error model is discussed in Section IV.
DA STRUCTURES
The FIR filter operation (eq. 1) can be expressed by eq. 2 using the 2's complement representation of the N-bit input samples x[n].
The terms in brackets in eq. 2 can be pre-calculated, saved into a memory and addressed by the bits of the input samples (Table I).
Considering that each input bit can take two values ('0' or '1'), each product term takes one of 2^T possible values, where T is the number of taps.
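Since eq. 1 and eq. 2 are referred to but not reproduced here, the following is a sketch of the standard DA formulation they most likely correspond to. The notation is an assumption: h_k are the coefficients, T the number of taps, and x_{k,b} bit b of the N-bit sample x[n-k], taken as a 2's complement integer.

```latex
% Assumed form of eq. 1: direct FIR convolution
y[n] = \sum_{k=0}^{T-1} h_k \, x[n-k]

% 2's complement expansion of each N-bit sample (integer scaling assumed)
x[n-k] = -x_{k,N-1}\,2^{N-1} + \sum_{b=0}^{N-2} x_{k,b}\,2^{b}

% Assumed form of eq. 2: the bracketed sums depend only on one bit of each sample
y[n] = -2^{N-1}\left[\sum_{k=0}^{T-1} h_k\,x_{k,N-1}\right]
       + \sum_{b=0}^{N-2} 2^{b}\left[\sum_{k=0}^{T-1} h_k\,x_{k,b}\right]
```

Each bracketed sum is a function of T single bits only, which is why it can be read from a table instead of being computed with multipliers.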
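The look-up idea can be illustrated with a minimal Python sketch. It assumes integer coefficients and signed integer samples; the names `build_da_lut` and `da_fir_output` are hypothetical and do not come from the paper.

```python
def build_da_lut(coeffs):
    """Pre-calculate all 2**T product terms.

    Entry `addr` holds the sum of coeffs[k] over every k whose bit is set
    in `addr`, i.e. one bracketed term of the DA expansion.
    """
    taps = len(coeffs)
    return [
        sum(coeffs[k] for k in range(taps) if (addr >> k) & 1)
        for addr in range(1 << taps)
    ]


def da_fir_output(lut, samples, n_bits):
    """Compute sum(coeffs[k] * samples[k]) using only look-ups, shifts and adds.

    samples[k] plays the role of x[n-k] and must fit in n_bits 2's complement.
    """
    acc = 0
    for b in range(n_bits):
        # Build the LUT address from bit b of every sample.
        addr = 0
        for k, x in enumerate(samples):
            addr |= ((x >> b) & 1) << k
        term = lut[addr] << b
        # The MSB is the sign bit, so its product term is subtracted.
        if b == n_bits - 1:
            acc -= term
        else:
            acc += term
    return acc
```

For T taps the table has 2^T entries, which is the memory-for-logic trade described above and the reason large filters need the table to be partitioned.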
DA direct-form implementation
In a DA bit-serial implementation of a FIR filter, each product term is addressed once per bit (the MSB is the sign bit). After the last product term has been obtained, it is added, with its appropriate shift, to the product terms previously accumulated. The structure that represents the direct-form FIR filter is shown in Fig. 1. Considering a 4-input LUT FPGA, product terms with more than four inputs need to be divided into r parts so that 4 ≥ T/r, where T is the number of taps of the filter. In short, the adders in the tree structure add the r LUT outputs. Finally, a shift accumulator is required to add and shift each product term. Fig. 1 represents a bit-serial implementation of a filter with 8-bit input samples. Thus the output of the filter is obtained every eight clock cycles. When the sign bit arrives, a subtraction is performed in the shift accumulator instead of an addition. Finally, a symmetrical filter is implemented by using carry-save adders before the LUT. Detailed information on the operation can be found in [1], [4].
Additionally, pipelining the structure can extend the range of processing speed. The operation frequency (fs) can be expressed by eq. 4, where L is the latency and n the number of bits of each input sample.
Despite the register increment in the pipelined DA version, the final area increases only slightly, owing to the FPGA structure.
order sections to obtain a reduction in area. The structure of these sections can be DA adapted with the symmetry equation (eq. 5) that represents the k-th section of a T-order filter and its expansion in DA product terms (eq. 6). The last equation represents the basic cell of a cascade structure that can be designed by using a bit-serial approach (Fig. 2).
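The partitioning and shift-accumulate steps can be sketched in Python as a behavioural model, not as the hardware description used by the authors. The function names, the integer scaling and the LSB-first bit order are assumptions made for illustration.

```python
LUT_INPUTS = 4  # assumption: each LUT is addressed by at most four taps


def partition_luts(coeffs, lut_inputs=LUT_INPUTS):
    """Split T coefficients into r = ceil(T / lut_inputs) small tables."""
    luts = []
    for start in range(0, len(coeffs), lut_inputs):
        group = coeffs[start:start + lut_inputs]
        luts.append([
            sum(c for k, c in enumerate(group) if (addr >> k) & 1)
            for addr in range(1 << len(group))
        ])
    return luts


def bit_serial_fir(coeffs, samples, n_bits=8):
    """Return (output, clock_cycles) for one output sample.

    One bit of every sample is consumed per modelled clock cycle, LSB first.
    """
    luts = partition_luts(coeffs)
    acc = 0
    cycles = 0
    for b in range(n_bits):
        # Adder tree: sum the outputs of the r partial LUTs.
        tree_sum = 0
        for i, lut in enumerate(luts):
            group = samples[i * LUT_INPUTS:(i + 1) * LUT_INPUTS]
            addr = 0
            for k, x in enumerate(group):
                addr |= ((x >> b) & 1) << k
            tree_sum += lut[addr]
        # Shift accumulator: subtract when the sign bit arrives.
        if b == n_bits - 1:
            acc -= tree_sum << b
        else:
            acc += tree_sum << b
        cycles += 1
    return acc, cycles
```

With ten taps the model builds three tables (two with 16 entries and one with 4) instead of a single table of 1024 entries, and it needs eight cycles per output for 8-bit samples.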
DA lattice filter implementation
The recursive equations that describe the lattice cell structures (eq. 8) are used to obtain cascade implementations of M cells [5]. The f term and the g term represent the forward and the backward prediction, respectively, in a linear prediction filter structure.
Using the DA equations (eq. 9) we can reproduce the f and g terms with two LUTs, where g' represents the g(n-1) term.
The bit-serial implementation of the lattice structure reaches a real-time operation close to 7.5 MHz. In both the cascade and the direct-form structures, it continuously decreases to 4 MHz.
[Figure: operating frequency (MHz) of the lattice, cascade and direct-form implementations.]
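Because eq. 8 is not reproduced here, the sketch below assumes the textbook FIR lattice recursion: f_m[n] = f_{m-1}[n] + k_m g_{m-1}[n-1] and g_m[n] = k_m f_{m-1}[n] + g_{m-1}[n-1], with f_0[n] = g_0[n] = x[n]. The function name and the floating-point arithmetic are illustrative only; it does not model the DA LUTs.

```python
def lattice_fir(reflection, x):
    """Run an M-cell FIR lattice over the sequence x and return f_M[n].

    reflection[m] is the coefficient of cell m+1 (k_1 ... k_M).
    """
    # g_delayed[m] stores g_m[n-1], the delayed backward term feeding cell m+1.
    g_delayed = [0.0] * len(reflection)
    out = []
    for sample in x:
        f = g = float(sample)  # f_0[n] = g_0[n] = x[n]
        for m, k in enumerate(reflection):
            delayed = g_delayed[m]
            g_delayed[m] = g  # keep the current g for the next sample
            # Forward and backward terms of the next cell.
            f, g = f + k * delayed, k * f + delayed
        out.append(f)
    return out
```

Under this recursion a two-cell lattice with k_1 = 0.5 and k_2 = 0.25 has the impulse response 1, 0.625, 0.25, which matches the expansion 1, k_1(1 + k_2), k_2. Adding a cell only appends one coefficient, which is the expandability property mentioned in the introduction.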

BIT-SERIAL DA ERROR MODEL
In this section a DA error model of each structure is proposed. We assume that the rounding error (e[n]) and the input data (x[n]) are uncorrelated [5].
In DA bit-serial cascade structures (Fig. 2), the error is modelled by eq. 11, where e_m and e_a (in grey) are the LUT and shift-accumulator rounding errors. Furthermore, as a result of the 4-input LUT structures, the partition of the memories in the FPGA case is limited by r<T/4 (T is the order of the filter). Eq. 14 shows the variance of the error in the direct-form structure.
(Eq. 14)
Fig. 4 shows the improved lattice cell with the error sources e_m and e_a (in grey). The error and the variance in this cell can be expressed by: (Eq. 15)
As an example, a T-order FIR filter with p = pm = pa = 8 bits can be used to compare the three models. The results for the direct-form, cascade and lattice implementations are T·4.2384e-07 + 1.6953e-06, T·3.3907e-06 and T·2.5868e-11, respectively. As a consequence, the lattice filter presents the lowest error, while the cascade form presents the highest. Finally, the direct-form structure also has a high error compared with the lattice cells.
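To see how the three expressions compare at the filter orders discussed in this paper, they can be evaluated numerically. The constants below are copied from the text for 8-bit precision; the function name and the reading of "T·" as multiplication by the filter order are assumptions.

```python
def error_variance(order):
    """Evaluate the three 8-bit error-variance expressions for a given order T."""
    return {
        "direct": order * 4.2384e-07 + 1.6953e-06,
        "cascade": order * 3.3907e-06,
        "lattice": order * 2.5868e-11,
    }


for order in (40, 60):
    variances = error_variance(order)
    ranking = sorted(variances, key=variances.get)  # lowest error first
    print(order, ranking, variances)
```

For both orders the ranking from lowest to highest is lattice, direct-form, cascade, in agreement with the comparison stated above.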
CONCLUSIONS
A distributed arithmetic bit-serial approach has been presented to implement three classical structures of a FIR filter: direct-form, cascade and lattice. Moreover, we have developed a comparative DA error model of the three structures. The conclusions from the research presented are summarised as follows:
We can implement a bit-serial 40th-order lattice filter in a 10K50 device with a real-time operating frequency of 7.5 MHz. The pipelined cascade and direct-form bit-serial implementations reach 4.5 MHz and a 60th order in their symmetrical implementations.
We have presented an improved lattice cell that reduces the memory occupation by using the input carry in the shift accumulator. The cell can be used to decrease or to increase the order of the lattice filter with the same performance. We have also offered a DA error model, which shows that the lattice structure presents the lowest rounding error while the cascade has the highest.
The direct form is more scalable than the other structures, and the result precision (single, double or full precision) can be selected easily, whereas the inner precision of the cascade and lattice structures is more difficult to change.
