In this paper the design of a digit-serial N-tap FIR filter with programmable coefficients is presented. The design considers the general case of a W-bit sample word and an M-bit coefficient word. The processing of the data within the filter takes place with full precision. The output data is truncated to W bits. The design introduces a new digit-serial multiplier that guarantees minimum processing time and reduces the hardware requirements. Using sign-amplitude representation for the coefficients and two's complement for the input samples simplifies the circuit configuration and allows the use of one common two's complement circuit for all the filter sections. A 100-tap, 8-bit word length version of the filter is implemented using an ALTERA FPGA device. The filter can be used in real-time processing with sample rates ranging from 1.5 to 21 MHz.
INTRODUCTION
Bit-parallel designs process all of the bits of an input simultaneously at a significant hardware cost. In contrast, a bit-serial structure processes the input one bit at a time, generally using the results of the operations on the first bits to influence the processing of subsequent bits. The advantage enjoyed by the bit-serial design is that all of the bits pass through the same logic, resulting in a substantial reduction in the required hardware. Typically, the bit-serial approach requires 1/M of the hardware required for the equivalent M-bit parallel design. The price of this logic reduction is that the serial hardware takes M clock cycles to execute, while the equivalent parallel structure executes in one. The time-hardware product for the serial structure, however, is often smaller than for equivalent parallel designs because the logic delays between registers are generally significantly smaller. This means that the serial machine can operate at a higher clock frequency. In the case of FPGAs, signal routing contributes significant propagation delays and often uses up logic cells. The serial structures tend to have very localized routing, often to only one destination. In contrast, the parallel machines usually need signals extended across the width of the processing element. The limited and slow routing resources in FPGAs make the serial processing elements even more attractive. In some cases, the overall throughput for a serial design implemented in an FPGA can actually exceed that of an equivalent parallel design in the same device.

S. Masupe, MIEEE
University of Botswana, Faculty of Engineering and Tech, Botswana

In the case of DSP, the output sample is the sum of a number of terms, where each term is the product of a W-bit sample word and an M-bit coefficient word.
Accordingly, the processing of the data in DSP has two levels: the first is the calculation of each term (the multiplication process), and the second is the system level, where the output is obtained by accumulating the results of the first level. The fully serial implementation uses one bit-serial multiplier to calculate the terms in serial form and one accumulator to accumulate the results. Such an implementation guarantees minimum hardware but gives a system that cannot be used in any real-time application. On the other hand, using a parallel multiplier for each term of the filter expression, together with a parallel adder network to form the output, results in a very large amount of hardware with a speed higher than many applications need. Much of the research in the field of DSP concentrates on finding algorithms and implementation techniques that result in real-time systems with reasonable hardware complexity and that can be implemented using FPGA devices.
The organization of this paper is as follows. The next section summarizes the FIR filter architecture and the different canonical and inverted form topologies used to design FIR filters, and explains the expression and the topology considered in our implementation. Section 3 covers all the design and implementation aspects: the digit-serial architectures are briefly presented, the basic digit-serial adder cell is explained, the proposed digit-serial multiplier is given, and finally the input/output/control block is described. Section 4 deals with the FPGA implementation, while Section 5 gives some results and conclusions.
FIR STRUCTURE
The N-tap FIR digital filter is normally described by the equation:
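In its standard form, writing the coefficients as $H_i$ (the symbol used in the following paragraph) and the input and output sequences as $x(n)$ and $y(n)$, this is the convolution sum:

```latex
y(n) = \sum_{i=0}^{N-1} H_i \, x(n-i)
```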
In our application, the coefficient H_i is M bits wide and is represented in sign-amplitude form. The input sample has a word length of W bits and is represented in two's complement form:
The filter equation, accordingly, takes the form:
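As a sketch of these forms, assume each coefficient carries a sign bit $s_i$ and M amplitude bits $h_{i,k}$, and each sample has bits $x_k$ with $x_{W-1}$ as the sign bit. These bit symbols and the integer bit weights are illustrative assumptions, not notation taken from the text. Under them, the two representations and the resulting filter expression read:

```latex
\begin{aligned}
H_i  &= (-1)^{s_i} \sum_{k=0}^{M-1} h_{i,k}\, 2^{k}, \\
x    &= -x_{W-1}\, 2^{W-1} + \sum_{k=0}^{W-2} x_k\, 2^{k}, \\
y(n) &= \sum_{i=0}^{N-1} (-1)^{s_i}\, |H_i|\, x(n-i).
\end{aligned}
```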
Using sign-amplitude representation for the coefficients and two's complement form for the input samples results in the following simple multiplication algorithm:
i. If the coefficient H_i is positive: multiply H_i by the sign-extended input sample x_i.
ii. If the coefficient H_i is negative: multiply the amplitude of H_i by the sign-extended two's complement of the input sample x_i.
The result of the multiplication will be in the two's complement form. The transposed forms (symmetrical and nonsymmetrical) are considered in this paper. 
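The two rules above can be sketched in a few lines of Python. This is a behavioural illustration only, assuming an unsigned M-bit amplitude, a separate sign bit, and a (W+M)-bit result; the function names are hypothetical and not taken from the paper.

```python
def to_twos_complement(value, width):
    """Encode a signed integer as an unsigned `width`-bit two's complement pattern."""
    return value & ((1 << width) - 1)


def from_twos_complement(bits, width):
    """Decode an unsigned `width`-bit two's complement pattern to a signed integer."""
    if bits & (1 << (width - 1)):
        return bits - (1 << width)
    return bits


def sign_amplitude_multiply(sign, amplitude, sample_bits, w, m):
    """Multiply a sign-amplitude coefficient by a W-bit two's complement sample.

    Returns the (w + m)-bit two's complement pattern of the product.
    """
    out_width = w + m
    mask = (1 << out_width) - 1
    # Sign-extend the sample to the full output width.
    x = to_twos_complement(from_twos_complement(sample_bits, w), out_width)
    if sign:
        # Negative coefficient: use the two's complement of the extended sample.
        x = (~x + 1) & mask
    # The amplitude is unsigned, so a plain multiply modulo 2^(w+m) suffices.
    return (amplitude * x) & mask


# Example: coefficient -5 (sign = 1, amplitude = 5) times sample -3, with W = M = 8.
pattern = sign_amplitude_multiply(1, 5, to_twos_complement(-3, 8), 8, 8)
print(from_twos_complement(pattern, 16))  # 15
```

Because the amplitude is always non-negative, the multiplier itself never has to handle a signed coefficient; the sign only selects which form of the sample is fed in.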
THE PROPOSED FPGA-BASED FIR FILTER STRUCTURE
The N-tap FIR filter structure is shown in Fig. 3.
It consists of one input/output unit and an array of N multiply-accumulate (MAC) blocks with coefficient storage. In the case of a nonsymmetrical filter, each multiply-accumulate cell receives two sets of signals, one from the input/output unit and one from the previous cell. The cell processes the received data and propagates the result in serial form to the next cell. In the case of a symmetrical filter, each cell receives three sets of data: one set from the input/output unit, the second an accumulated value from the previous cell, and the third an accumulated value from the next cell. The cell processes the received data and generates two accumulated values; one of them propagates to the next cell and the other to the previous one. In the following, the basic blocks are considered.
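The nonsymmetrical chain of cells can be illustrated with a word-level Python model of a transposed-form FIR filter. It is a minimal sketch under assumptions: the cell indexing (cell 0 produces the output, cell i takes the stored value of cell i+1) and the function names are chosen here for illustration, and all serial timing is ignored.

```python
def fir_direct(coeffs, samples):
    """Reference: y(n) = sum_i coeffs[i] * x(n - i), with x(n) = 0 for n < 0."""
    out = []
    for n in range(len(samples)):
        out.append(sum(c * samples[n - i]
                       for i, c in enumerate(coeffs) if n - i >= 0))
    return out


def fir_transposed(coeffs, samples):
    """Chain of MAC cells: every cell sees the same sample and a neighbour's sum."""
    n_taps = len(coeffs)
    # acc[i] is the value stored by cell i; acc[n_taps] is a constant 0
    # feeding the last cell of the chain.
    acc = [0] * (n_taps + 1)
    out = []
    for x in samples:
        # Each cell adds its own product to the value its neighbour stored
        # during the previous sample period.
        acc = [coeffs[i] * x + acc[i + 1] for i in range(n_taps)] + [0]
        out.append(acc[0])
    return out


coeffs = [3, -1, 4, 2]
samples = [1, 0, -2, 5, 7, -3]
print(fir_transposed(coeffs, samples) == fir_direct(coeffs, samples))  # True
```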
The Multiplier-Accumulator (MAC) Unit
The basic unit in our design is the multiplier/accumulator unit. For an N-tap filter, at sampling instant n, the ith MAC block receives the input sample x(n-i) and calculates the partial value y_i, where:
\[ y_i(n) = a_i\,x(n-i) + \sum_{j=i+1}^{N-1} a_j\,x(n-j) \tag{4} \]
The first term in equation (4) represents the multiplication operation, and the second term represents the accumulated sum of all the terms of the filter equation starting from the last term (a_{N-1} x(n-N+1)) up to a_{i+1} x(n-i-1). This means that the MAC unit implements the multiplication and accumulation operations. To achieve its function, the MAC is composed, in general, of three blocks: a two's complement circuit controlled by the sign of the coefficient; a serial multiplier to implement the bit-serial multiplication; and a storage section. The storage section consists of one parallel-in (or serial-in) parallel-out register to store one of the coefficients and one serial-in serial-out register to store the accumulated value. As mentioned in Section 1, to keep full precision along the whole datapath the length of the register storing the accumulated value is (W_indata + W_coefficient + log2 N), and the serial multiplier needs the same number of cycles to complete the multiply/accumulate operation. To reduce the hardware requirements and the processing time, we propose the use of one common two's complement circuit as part of the input/output unit and a new serial multiplier (a digit-serial multiplier slice) that needs half the number of cycles to get the result. The proposed digit-serial multiplier generates two bits of the partial value y_i simultaneously at each cycle.
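The register length and cycle counts quoted above can be checked with a short calculation. The sketch below assumes that log2 N is rounded up to a whole number of bits and that a partial final digit still costs a full cycle; both roundings and the function names are assumptions made for illustration.

```python
def accumulator_width(w, m, n_taps):
    """Full-precision accumulator length: W + M + ceil(log2 N) bits."""
    # (n_taps - 1).bit_length() equals ceil(log2(n_taps)) for n_taps >= 1.
    return w + m + (n_taps - 1).bit_length()


def cycles_per_sample(width, digit_size):
    """Clock cycles needed to stream `width` bits, `digit_size` bits per cycle."""
    return -(-width // digit_size)  # ceiling division


width = accumulator_width(8, 8, 100)   # 8-bit samples, 8-bit coefficients, 100 taps
print(width)                           # 23
print(cycles_per_sample(width, 1))     # 23 cycles, bit-serial
print(cycles_per_sample(width, 2))     # 12 cycles, digit-serial with K = 2
```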
In the following we are going to start by introducing the serial-bit multiplier and then the other blocks that form the MAC cell.
Digit-Serial Architectures
In digit-serial computation, data words of size L bits are partitioned into digits of size K bits (the digit size, K, is a divisor of the word size, L) and are processed serially one digit at a time with the least significant digit first. A complete word is processed in P = L/K clock cycles, and consecutive words follow each other continuously. The time of P cycles is called a sample period. In every digit-serial operator, it is necessary to add a control signal to indicate a new word entry. The digit-serial processors include parallel-serial and serial-parallel converters to process data in digit format and to present the result in parallel format. A set of digit-serial architectures can be designed by using different digit sizes.
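The partitioning of an L-bit word into P = L/K digits, least significant digit first, can be sketched as follows. The helper names are hypothetical; they model only the parallel-serial and serial-parallel conversions, not the control signal.

```python
def to_digits(word, length, digit_size):
    """Split a `length`-bit word into K-bit digits, least significant digit first."""
    assert length % digit_size == 0, "the digit size must divide the word size"
    mask = (1 << digit_size) - 1
    return [(word >> (digit_size * p)) & mask
            for p in range(length // digit_size)]


def from_digits(digits, digit_size):
    """Reassemble a word from its digits (serial-parallel conversion)."""
    word = 0
    for p, digit in enumerate(digits):
        word |= digit << (digit_size * p)
    return word


digits = to_digits(0b10110100, 8, 2)
print(digits)                               # [0, 1, 3, 2]
print(from_digits(digits, 2) == 0b10110100)  # True
```

With L = 8 and K = 2 the word takes P = 4 cycles; K = 1 reduces to the bit-serial case and K = L to the fully parallel one.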
To implement the multiplication operation H·X, we normally generate a partial product matrix with M rows (M is the width of H) and W+M vertical lines (W is the word length of X). The bits forming vertical line j have the same weight 2^j. In our proposal we deal with the bits of the vertical lines and split them into digits of two bits each.
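The idea of consuming the vertical lines two at a time can be illustrated with a behavioural Python model. It is a sketch under stated assumptions, not the gate-level slice of the paper: the amplitude is an unsigned M-bit value, the sample is sign-extended to W+M bits, W+M is even, and the carry is kept as a small integer rather than as individual carry-save bits. The function name is hypothetical.

```python
def digit_serial_multiply(amplitude, sample, w, m):
    """Produce the (w + m)-bit product two bits per cycle, least significant first.

    `amplitude` is an unsigned m-bit value; `sample` is a signed integer
    that fits in w bits. Returns the signed product.
    """
    width = w + m
    assert width % 2 == 0, "this sketch assumes an even number of columns"
    mask = (1 << width) - 1
    x_ext = sample & mask  # sign-extended two's complement pattern

    def ones_in_column(j):
        # Row k contributes bit (j - k) of the extended sample when h_k = 1.
        return sum(((amplitude >> k) & 1) & ((x_ext >> (j - k)) & 1)
                   for k in range(min(j, m - 1) + 1))

    carry = 0
    product_bits = 0
    for cycle in range(width // 2):
        low, high = 2 * cycle, 2 * cycle + 1
        # Column `high` weighs twice as much as column `low`.
        total = carry + ones_in_column(low) + 2 * ones_in_column(high)
        product_bits |= (total & 0b11) << low  # the two output bits of this cycle
        carry = total >> 2                     # carried into the next cycle
    # Interpret the collected bits as a two's complement number.
    if product_bits & (1 << (width - 1)):
        return product_bits - (1 << width)
    return product_bits


print(digit_serial_multiply(5, -3, 8, 8))  # -15
```

The loop runs (W+M)/2 times, which is the halving of the cycle count relative to a bit-serial multiplier that emits one product bit per cycle.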
2-bit Full Adder (2 bits Digit-Serial Adder)
Fig. 4.a shows the conventional single-bit full adder with three inputs and two outputs. The single-bit full adder adds bit a_i of word A, bit b_i of word B and an input carry bit c_in, and generates the sum bit s_i and the output carry bit c_out. The basic element used to build the MAC of the proposed system is a two-bit full adder. In the literature it is known as a digit-serial adder with digit size K=2. This adder is shown in Fig. 4.b. The digit-serial full adder has, in its general form, five inputs (a_i, a_{i+1}, x_i, x_{i+1} and c_in) and three outputs (s_i, s_{i+1} and c_out), where s_i and s_{i+1} represent the sum of the input signals and the carry c_out propagates back to the input after one delay period. In the case of FPGA devices, the digit-serial adder element can be implemented using a look-up table (LUT) or, as in our implementation, by using two single-bit full adders together with some flip-flops to store the outputs. The digit-serial adder symbol is given in Fig. 4.c.
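The two-full-adder realisation of the K=2 digit-serial adder can be modelled in Python as below. This is an illustrative sketch: the variable `carry` stands in for the carry flip-flop, and the function names are assumptions rather than names from the paper.

```python
def full_adder(a, b, c_in):
    """Single-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ c_in
    c_out = (a & b) | (c_in & (a ^ b))
    return s, c_out


def digit_serial_add(a, b, length):
    """Add two `length`-bit words two bits per cycle; result is modulo 2**length."""
    assert length % 2 == 0, "the digit size K = 2 must divide the word size"
    carry = 0   # carry flip-flop, cleared at the start of each word
    result = 0
    for cycle in range(length // 2):
        i = 2 * cycle
        # First full adder handles bit i, the second handles bit i + 1.
        s0, c_mid = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        s1, carry = full_adder((a >> (i + 1)) & 1, (b >> (i + 1)) & 1, c_mid)
        result |= (s0 << i) | (s1 << (i + 1))
    return result


print(digit_serial_add(0b10110100, 0b01101110, 8))  # 34, i.e. (180 + 110) mod 256
```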
Digit Serial Multiplier
The proposed digit-serial multiplier is shown in Fig. 5 for the case of a filter coefficient word length of 8 bits. The word length of the input sample has no effect on the hardware. When 16-bit coefficients are used, two circuits can be cascaded.
It is possible to view the proposed digit-serial multiplier as a modified form of one of the slices used to implement the Wallace-tree parallel multiplier. Here, the slice processes two columns of the partial product matrix at the same time without any horizontal propagation of the carry. The depth of the slice equals the word length of the coefficient a_i. The proposed digit-serial multiplier splits the partial product matrix of the product
Multiplexer Block
To reduce the hardware requirements, we used one common two's complement circuit as part of the Input/Output block. The Input/Output block directly forms the two's complement of the received input signal and feeds the MAC with both the direct and the two's complement forms of the input sample. The input stage of the MAC circuit is a set of M 2x1 multiplexers. The bits of the direct and the two's complement forms of the input signal are connected to the inputs of the M multiplexers. The sign bit of the corresponding coefficient controls the multiplexers.
Register Block
Each MAC has two registers:
1. (M+1)-bit register to store the coefficient a_i: The coefficient is stored in sign-amplitude form with M bits to store the amplitude. This register can be either a serial-in parallel-out (SIPO) or a parallel-in parallel-out (PIPO) register. The coefficients are stored as constant cells in the FPGA architecture. Any one or more of the coefficients can be modified by sending the appropriate bit stream(s) to the FPGA.
2. Serial-in serial-out (SISO) register: This register is used to store the intermediate values y_i. The MAC can be designed to process the data with full precision. The length of this register then varies from W+M+1 for i = N-1 to W+M+log2 N for the output MAC (i.e., for i = 0). In our case, we select double precision, i.e., the register length equals (M+W).
Fig. 5. Digit-serial array multiplier with digit size K=2.
The Input/Output and Control Unit
Both the input and the output of the filter are serial data streams with each word presented least significant bit first. The input words are of the same length as the output. For full-precision processing, the intermediate word length is equal to the sum of the word sizes of the input (number of bits excluding the sign extension required for the processor), the coefficients (length of the multipliers) and the number of levels in the digit-serial multiplier slice. The word size of the input need not be the same as that of the coefficients. The input is sign extended to bring it to the same length as the output. This requirement is due to the nature of the serial processing; the inputs need to be as long as the outputs. The extra few bits due to the digit-serial multiplier allow the column sum to grow without overflow. The multiplier array must be reset before each new word begins to shift in. The delays, however, cannot be reset since they hold the old words. In the FPGA implementation, a local reset was wired to the array columns corresponding to the multipliers instead of the global reset. That local reset was brought out as a control line in addition to the global reset that clears the filter. More taps may be obtained by cascading two filters. This is done by adding an extra word delay to the end of the delay chain to feed the serial input of the second filter. The serial outputs of the two filters are summed using a serial adder to obtain the final output. This expansion scheme can be extended to create any number of taps.
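The cascading scheme can be checked at word level with a small Python model. It is a sketch under assumptions: samples before time zero are taken as zero, the second filter sees the input delayed by the number of taps in the first filter, and the function names are chosen here for illustration.

```python
def fir(coeffs, samples):
    """y(n) = sum_i coeffs[i] * x(n - i), with x(n) = 0 for n < 0."""
    return [sum(c * samples[n - i] for i, c in enumerate(coeffs) if n - i >= 0)
            for n in range(len(samples))]


def cascaded_fir(coeffs_a, coeffs_b, samples):
    """Two filters in cascade: the second is fed from the end of the delay chain."""
    # The first filter's delay chain plus one extra word delay amounts to a
    # delay of len(coeffs_a) sample periods at the second filter's input.
    delayed = ([0] * len(coeffs_a) + list(samples))[:len(samples)]
    out_a = fir(coeffs_a, samples)
    out_b = fir(coeffs_b, delayed)
    # The two serial outputs are summed to form the final output.
    return [a + b for a, b in zip(out_a, out_b)]


first, second = [1, -2, 3], [4, 5]
x = [2, 0, -1, 7, 3, -4, 6]
print(cascaded_fir(first, second, x) == fir(first + second, x))  # True
```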
Bit-Serial Two's Complement Unit
In the proposed FIR filter, the coefficients and the sample inputs are signed numbers. As mentioned before, the system uses two's complement representation for the input samples and sign-amplitude representation for the coefficients. To allow all the MAC blocks used to build the filter to work in parallel, each block would have to contain a separate bit-serial two's complementer. This solution represents a large waste of hardware. To reduce the hardware, the circuit of the proposed MAC has two sets of inputs, each of M bits. The first set carries the bits of the input sample x(n-i) directly, and the other receives the bits of the two's complement of the input sample. The two sets of input signals are connected to a set of M 2x1 multiplexers controlled by the sign bit of the corresponding coefficient. In this way, the system uses only one common two's complementer, which is a part of the input unit of the filter. The circuit of the two's complementer consists of two flip-flops, an XOR gate and an OR gate. One of the two flip-flops acts as a detection flip-flop and must be reset before processing of each new input sample starts. This is because the detection flip-flop causes the XOR to invert the input continuously after the first "1" is detected.
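The behaviour of the bit-serial two's complementer can be sketched in Python. The model assumes the usual least-significant-bit-first rule: bits pass unchanged up to and including the first 1 and are inverted afterwards. The variable `detect` plays the role of the detection flip-flop; the output flip-flop is not modelled and the function names are hypothetical.

```python
def bit_serial_twos_complement(bits_lsb_first):
    """Two's complement of a word presented one bit at a time, LSB first."""
    detect = 0  # detection flip-flop, reset before each new word
    out = []
    for bit in bits_lsb_first:
        out.append(bit ^ detect)  # XOR gate: invert once a 1 has been seen
        detect |= bit             # OR gate: latch the first 1
    return out


def to_bits(value, width):
    """Bits of an unsigned pattern, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Unsigned value of a list of bits given least significant bit first."""
    return sum(bit << i for i, bit in enumerate(bits))


# 20 = 0b00010100 becomes 0b11101100 = 236, the 8-bit pattern of -20.
print(from_bits(bit_serial_twos_complement(to_bits(20, 8))))  # 236
```

If `detect` were not cleared between words, every bit of the following sample would be inverted from its first bit on, which is why the reset is required.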
Multiple Precision to Single Precision Block
The full-precision data has to be truncated to single precision before going into the final serial-parallel converter. For a W-bit word, the only useful bits are the W most significant bits of the result. Therefore, these W bits have to arrive at the serial-parallel converter in the right format. Control of the filter is achieved by generating an initialisation signal at each new sample time. This single clock-cycle wide pulse is delivered to the filter as the LSB of each sample is presented to the multipliers. This signal ensures that the carry signals are reset at the beginning of each process cycle. (Delayed versions of this signal are input to the serial column adder, initialising each carry-save adder in the adder tree.)
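Keeping the W most significant bits of a serial result amounts to discarding the first bits that arrive, since the stream is least significant bit first. A minimal Python sketch, with hypothetical helper names and a 16-bit full-precision word assumed for the example:

```python
def truncate_serial(bits_lsb_first, w):
    """Keep the last `w` bits of an LSB-first stream, i.e. the W most significant bits."""
    return bits_lsb_first[len(bits_lsb_first) - w:]


def truncate_word(value, full_width, w):
    """Word-level equivalent: arithmetic right shift by the number of dropped bits."""
    return value >> (full_width - w)


stream = [(0x1234 >> i) & 1 for i in range(16)]  # 16-bit result, LSB first
kept = truncate_serial(stream, 8)
print(sum(bit << i for i, bit in enumerate(kept)) == 0x12)  # True
print(truncate_word(0x1234, 16, 8) == 0x12)                  # True
```

For negative values the arithmetic shift rounds towards minus infinity, so truncation introduces a small negative bias rather than rounding to nearest.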
FPGA IMPLEMENTATION
The filter described above was implemented in an ALTERA EPF10K200SRC240-1 FPGA. Default placement and routing compilation options were selected. The FPGA chip accommodated the Input/Output/Control unit and 100 MAC units, allowing a 200-tap symmetrical FIR filter. The bit-serial implementation reaches real-time operation at nearly 7.5 MHz. Another version, which uses LUTs to implement the digit-serial adder, was also tested.
CONCLUSIONS
A study of a full-precision digit-serial FIR filter with programmable coefficients has been presented. Symmetrical and nonsymmetrical inverted form FIR filters have been considered. The design methodology using each of these structures has been detailed and implemented in an ALTERA FPGA device. The area-time calculation for the proposed system and some of the existing systems showed a great improvement using our system. The main improvement came from the proposed digit-serial multiplier: the hardware required to build one digit-serial multiplier slice is W/2 times less than the hardware required to build one parallel multiplier. The filter has been automatically implemented using the default parameters of the partitioning, placement, and routing tools. Thus, in critical applications, an important area reduction and speedup can be expected if some of these tasks are performed manually.
