Abstract-In this paper, we present a novel implementation for an N-tap complex finite impulse response (FIR) filter, using complex number on-line arithmetic, based on adopting a redundant complex number system (RCNS) to represent complex operands as a single number. We present cost comparisons with (i) a real number on-line arithmetic approach, and (ii) a real number parallel arithmetic approach, to demonstrate a significant improvement in cost.
I. INTRODUCTION
The N-tap finite impulse response (FIR) filter is defined as an output sequence $y_n$ (n = 1, ...) of an input sequence $x_n$ (n = 1, ...), in which

$$y_n = \sum_{k=0}^{N-1} h_k \, x_{n-k}$$
where $h_k$ ($k = 0, 1, \ldots, N-1$) are the filter coefficients. The standard implementation is shown in Figure 1. Assuming m-bit precision, it requires N m-bit multipliers and an N-operand m-bit adder. For a complex FIR filter, the filter coefficients as well as the input and output sequences are complex numbers. This significantly increases the size of the design, since an m-bit complex number multiplier is equivalent to four m-bit real number multipliers and two m-bit real number adders. Since area is a critical factor in FPGA design, we propose an approach that uses a radix-2j number system and yields a significantly lower cost than both an alternative radix-2 on-line implementation and a real number bit-parallel implementation.
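As a reference point for the cost argument, the following sketch evaluates the N-tap sum directly and expands each complex product into four real multiplications and two real additions. It is a behavioral illustration only: the function names `complex_mul_4m2a` and `fir_direct`, the 0-based indexing, and the assumption that inputs before the start of the sequence are zero are not taken from the paper.

```python
def complex_mul_4m2a(a, b):
    """Complex product using 4 real multiplications and 2 real additions."""
    ar, ai = a.real, a.imag
    br, bi = b.real, b.imag
    real = ar * br - ai * bi   # 2 multiplications, 1 addition (subtraction)
    imag = ar * bi + ai * br   # 2 multiplications, 1 addition
    return complex(real, imag)


def fir_direct(h, x):
    """Direct-form FIR: y[n] = sum_k h[k] * x[n-k], assuming x[n] = 0 for n < 0."""
    n_taps = len(h)
    y = []
    for n in range(len(x)):
        acc = 0j
        for k in range(n_taps):
            if n - k >= 0:
                acc += complex_mul_4m2a(h[k], x[n - k])
        y.append(acc)
    return y


if __name__ == "__main__":
    h = [1 + 1j, 0.5j, -0.25]              # illustrative coefficients
    x = [1 + 0j, 2j, -1 + 1j, 0.5 + 0j]    # illustrative input samples
    print(fir_direct(h, x))
```

With N taps, each output sample costs N complex multiplications, which is where the factor of four in real multipliers enters the size of a conventional design.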
II. COMPLEX NUMBER ON-LINE FLOATING-POINT ARITHMETIC
On-line arithmetic [3] is a class of arithmetic operations in which all operations are performed digit-serially, in a most-significant-digit-first (MSDF) manner. Its advantages over conventional parallel arithmetic include: (i) the ability to overlap dependent operations, since on-line algorithms produce the output serially, most significant digit first, so successive operations can begin before previous operations have completed; (ii) low-bandwidth communication, since intermediate results pass to and from modules digit-serially, so connections need only be one digit wide; and (iii) support for variable precision, since once a desired precision is obtained, successive outputs can be ignored. One of the key parameters of on-line arithmetic is the on-line delay δ, defined as the number of operand digits required to generate the first digit of the result. Each successive result digit is then generated at a rate of one per cycle. This is illustrated in Figure 2, with on-line delay δ = 4. The latency of an on-line arithmetic operator, assuming m-digit precision, is then δ + m − 1.

Complex number on-line arithmetic [5] uses a class of on-line arithmetic operators on complex number operands. For efficient representation, a Redundant Complex Number System (RCNS) [1] is adopted. An RCNS is a radix-rj system in which digits are in the set {−a, ..., 0, ..., a}, where $r \ge 2$ and $r^2/2 \le a \le r^2 - 1$. Such a number system is denoted $RCNS_{rj,a}$. The system with r = 2 and a = 3, denoted $RCNS_{2j,3}$, allows easy definition of primitive on-line arithmetic modules, as well as easy conversion to and from other representations. This number system was introduced as the quarter-imaginary number system in [4]. For the implementation of the complex FIR filter, we assume floating-point arithmetic in order to permit a relatively wide range of input values.
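To make the notation concrete, the sketch below evaluates a radix-rj digit vector as a single complex value and checks the digit bound $r^2/2 \le a \le r^2 - 1$ together with the latency expression δ + m − 1. The helper names, and the convention that the digit list starts at a caller-supplied most significant weight, are assumptions made for illustration rather than definitions from the paper.

```python
def rcns_digit_bound_ok(r, a):
    """True if a satisfies r^2/2 <= a <= r^2 - 1 for a radix-rj system."""
    return r >= 2 and r * r / 2 <= a <= r * r - 1


def rcns_value(digits, msd_weight, r=2):
    """Value of a digit vector in radix rj.

    digits[0] has weight (rj)**msd_weight; each following digit has the
    next lower integer power of rj as its weight.
    """
    radix = complex(0, r)
    total = 0j
    for i, d in enumerate(digits):
        total += d * radix ** (msd_weight - i)
    return total


def online_latency(delta, m):
    """Cycles until the last of m result digits, for on-line delay delta."""
    return delta + m - 1


if __name__ == "__main__":
    print(rcns_digit_bound_ok(2, 3))      # digit set {-3, ..., 3} for radix 2j
    print(rcns_value([1, 0, 3], 0))       # 1 + 3 * (2j)**-2
    print(online_latency(4, 24))          # delta = 4, m = 24
```

Because $(2j)^2 = -4$, digits at even powers contribute only to the real part and digits at odd powers only to the imaginary part, which is why a single digit vector can carry both components.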
Two on-line floating-point arithmetic operations are used: (i) $RCNS_{2j,3}$ on-line floating-point addition; and (ii) $RCNS_{2j,3}$ on-line floating-point constant coefficient multiplication. The recurrence algorithms and implementation parameters when mapped to a Xilinx Virtex FPGA are discussed in detail.
Using $RCNS_{2j,3}$, a floating-point complex number $x = (X_R + jX_I) \cdot (2j)^{e_x}$ can be normalized with regard to either the real component $X_R$ or the imaginary component $X_I$, depending on which has the larger absolute value. The exponent $e_x$ is shared between the real and imaginary components. Exponent overflow/underflow can be handled by setting an exception flag and allowing processing of the (erroneous) results to continue.
The output of a complex number operation can be undernormalized for several reasons:
1. The range of an output determined by the on-line algorithm allows it to be undernormalized.
2. Digit cancellation resulting from the addition/subtraction of numbers with the same exponent value.
In this paper, we assume operands of an $RCNS_{2j,3}$ on-line algorithm have non-zero most significant digits and are normalized. When the result Z exceeds the range of a normalized fraction (i.e. $\max(|Z_R|, |Z_I|) \ge 1$), the exponent is incremented. When the result is below the range of a normalized fraction (i.e. $\max(|Z_R|, |Z_I|) < \frac{1}{2}$), the exponent is decremented and leading zeros are discarded. The normalization algorithm, which takes as input the generated output digit $z_k$, the output exponent $e_z$, and the on-line delay δ of the arithmetic operation, is shown below. This is similar to the normalization algorithm presented in [2] for radix-2 on-line rotation.
    if [...] and z_k ≠ 0 and not(done) then
        e_z = e_z + 1
        done = 1
    else if k ≥ δ and z_k = 0 and not(done) then
        e_z = e_z − 1
    else if (k ≥ δ and z_k ≠ 0) then
        done = 1
    end if
III. RECODING ALGORITHMS
Although $RCNS_{2j,3}$ allows flexibility in representation, there are also several drawbacks:
• Handling digits 3 and −3 requires producing significand multiples 3X and −3X, requiring an extra addition step.
• A significand X with fractional real and imaginary components $X_R$ and $X_I$ can have integer digits, such as $(11.32\bar{1}\bar{2})_{2j} = \frac{3}{8} + \frac{3}{8}j$, which complicates ensuring that complex significands stay within the range $\max(|X_R|, |X_I|) < 1$.

To handle these cases, several recoding modules are presented: (i) digit-set recoding; and (ii) most-significant-digit recoding.
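As a check of the example in the second drawback, the value can be expanded digit by digit using $(2j)^{-1} = -\tfrac{1}{2}j$, $(2j)^{-2} = -\tfrac{1}{4}$, $(2j)^{-3} = \tfrac{1}{8}j$ and $(2j)^{-4} = \tfrac{1}{16}$. The expansion assumes that the last two fractional digits are negative (written with overbars), since that is the reading under which the digits reproduce the stated value:

```latex
\begin{aligned}
(11.32\bar{1}\bar{2})_{2j}
 &= 1\cdot(2j) + 1\cdot 1
    + 3\cdot\left(-\tfrac{1}{2}j\right)
    + 2\cdot\left(-\tfrac{1}{4}\right)
    + (-1)\cdot\tfrac{1}{8}j
    + (-2)\cdot\tfrac{1}{16} \\
 &= \left(1 - \tfrac{1}{2} - \tfrac{1}{8}\right)
    + \left(2 - \tfrac{3}{2} - \tfrac{1}{8}\right) j \\
 &= \tfrac{3}{8} + \tfrac{3}{8}\,j .
\end{aligned}
```

Both components have magnitude below 1 even though the two integer-position digits are non-zero, which is the situation the recoding modules are meant to handle.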
A. Digit-set recoding
In order to reduce the complexity introduced by handling digits −3 and 3, digit-set recoding initially recodes an $RCNS_{2j,3}$ digit $x_k \in \{-3, \ldots, 3\}$ into a pair of digits
In order to restrict $\chi_k \in \{-2, \ldots, 2\}$, two cases of pairs of values must be prevented:
To do so, $x_{k+2}$ is examined. If $x_{k+2} \le -2$ and $x_k = 2$, which could allow the first case, $x_k$ is recoded as (1, 2), otherwise as (0, 2). In the same way, if $x_{k+2} \ge 2$ and $x_k = -2$, which could allow the second case, $x_k$ is recoded as (1, 2), otherwise as (0, 2). Then it is assured that $\chi_k \in \{-2, \ldots, 2\}$. The digit-set recoding algorithm DSREC is shown below.
B. Most-significant-digit recoding
In order to handle carries produced when performing operations on significands consisting of $RCNS_{2j,3}$ digits, most-significant-digit recoding recodes the most-significant residual digits $w_{-1}, w_0 \in \{-1, 0, 1\}$, of respective weights $(2j)^1 = 2j$ and $(2j)^0 = 1$, and digits $w_1, w_2 \in \{-3, \ldots, 3\}$, of respective weights $(2j)^{-1}$ and $(2j)^{-2}$, into digits $\omega_1, \omega_2 \in \{-3, \ldots, 3\}$ of respective weights $(2j)^{-1}$ and $(2j)^{-2}$. The algorithm MSREC for recoding general digits $w_{k-2}$ and $w_k$ into digit $\omega_k$ is shown below.
MSREC(w
ez is produced such that
Each output digit at step k, namely $z_k$, is generated based on input digits $x_{k+\delta-1}$ and $y_{k+\delta-1}$. The algorithm is shown below, where $W_E[k]$ is the low-precision estimate of the even-indexed (real) component of the recurrence $W[k]$. The design of an m-digit significand and e-bit exponent $RCNS_{2j,3}$ on-line floating-point adder is shown in Figure 3. The SUBE unit computes the difference of the exponents. The ALIGN unit aligns operand y to synchronize the arrival of the input digits. The SWAP unit exchanges the operands if necessary. The PPM and MMP modules are simple full-adders that appropriately negate (indicated by "-" on the port) inputs and outputs to perform borrow-save addition. The NORM unit normalizes the result by updating the output exponent $e_z$. A summary of the cost of the individual modules is shown in Table I. The design requires 3m + 4e + 4 CLB slices. Assuming m = 24 and e = 8, the cost is 108 CLB slices.
end for
/* Recurrence */
for k = 1 to m do
$e_y$, the output $z = (Z_R + jZ_I) \cdot (2j)^{e_z}$ is produced such that
Each output digit at step k, namely $z_k$, is generated based on the parallel input vector X and input digit $y_{k+\delta-1}$. The algorithm is shown below, where $W_E[k]$ is the low-precision estimate of the even-indexed (real) component of the recurrence $W[k]$. The design of an m-digit significand and e-bit exponent $RCNS_{2j,3}$ on-line floating-point constant coefficient multiplier is shown in Figure 4. The ADDER unit computes the sum of the exponents. The digit-vector multiplier computes the product $X y_k$ at each iteration. The borrow-save adder computes the sum $W_k$ of the previous residual $W_{k-1}$ and the intermediate product $X y_k$, and stores the result in the register REG W. The NORM unit normalizes the result based on the current output exponent $e_z$, the on-line delay δ, and the output digit $z_k$. A summary of the cost of the individual modules is shown in Table II. The design requires 8m + 3e + 32 CLB slices. Assuming m = 24 and e = 8, the cost is 248 CLB slices.
Module                     Cost (CLB slices)
Digit-vector multiplier    4m
Borrow-save adder          4m
NORM                       2e
MSREC                      20
Total cost                 8m + 3e + 32
Each tap (slice) of the FIR filter, assuming floating-point operands with a 24-digit significand and an 8-bit exponent, consists of 24 digit-wide registers that store the individual digits of the inputs x(n), x(n−1), ..., x(n−N+1), and a complex number on-line floating-point constant coefficient multiplier. Each multiplier product is fed to one of the operands of a complex number on-line floating-point adder. The parallel adder in Figure 1 can be implemented as a binary tree of complex number on-line floating-point adders, each one adding two intermediate multiplier outputs or intermediate sums and producing a new intermediate sum, until the final output y(n) is computed.
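The structure described above can be sketched behaviorally as follows: a delay line holds the most recent N inputs, each tap multiplies by its constant coefficient, and the products are reduced pairwise in a binary tree. This models only the dataflow, using Python's built-in complex type rather than digit-serial RCNS operators; the names `tree_sum` and `ComplexFir` are hypothetical and the delay line is assumed to start at zero.

```python
from collections import deque


def tree_sum(values):
    """Pairwise (binary tree) reduction; returns (sum, number of adders used)."""
    level = list(values)
    adders = 0
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(level[i] + level[i + 1])
            adders += 1
        if len(level) % 2:          # odd element passes to the next level
            nxt.append(level[-1])
        level = nxt
    return level[0], adders


class ComplexFir:
    """N-tap complex FIR: delay line, N constant multipliers, adder tree."""

    def __init__(self, coeffs):
        self.coeffs = list(coeffs)
        # Delay line assumed to be initialized with zeros.
        self.delay = deque([0j] * len(self.coeffs), maxlen=len(self.coeffs))

    def step(self, x_n):
        self.delay.appendleft(x_n)  # delay[k] now holds x(n - k)
        products = [h * x for h, x in zip(self.coeffs, self.delay)]
        y_n, _ = tree_sum(products)
        return y_n
```

For N products the tree uses N − 1 two-operand adders, which is the adder count used in the cost estimates that follow.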
A. Radix 2j on-line network
An N-tap complex FIR filter can be designed as a network of radix-2j on-line floating-point arithmetic operators, where, assuming m-digit significands and e-bit exponents, a radix-2j on-line floating-point adder has a cost of 3m + 4e + 4 CLB slices, and a radix-2j on-line floating-point multiplier has a cost of 8m + 3e + 32 CLB slices. For m = 24 and e = 8, the cost of a radix-2j on-line floating-point adder is 108 CLB slices and the cost of a radix-2j on-line floating-point multiplier is 248 CLB slices. Since an N-tap complex FIR filter uses N − 1 radix-2j floating-point adders and N radix-2j floating-point multipliers, the cost is 356N − 108 CLB slices.
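The total follows from the per-operator costs given with the adder design (3m + 4e + 4) and the multiplier design (8m + 3e + 32, as in Table II). Written out as a worked equation, with $C_{2j}(N)$ introduced here only as a label for the network cost:

```latex
\begin{aligned}
C_{2j}(N) &= (N-1)(3m + 4e + 4) + N(8m + 3e + 32) \\
          &= N(11m + 7e + 36) - (3m + 4e + 4) \\
          &= 356N - 108 \qquad (m = 24,\ e = 8).
\end{aligned}
```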
B. Radix 2 on-line network
An N-tap complex FIR filter can alternatively be designed as a network of radix-2 on-line floating-point arithmetic operators, where, assuming m-digit significands and e-bit exponents, a radix-2 on-line floating-point adder has a cost of 1.5m + 3e + 2 CLB slices, and a radix-2 on-line floating-point multiplier has a cost of 3m + 3e + 2 CLB slices. For m = 24 and e = 8, the cost of a radix-2 on-line floating-point adder is 70 CLB slices and the cost of a radix-2 on-line floating-point multiplier is 98 CLB slices. Since an N-tap complex FIR filter uses 3N − 1 radix-2 floating-point adders and 4N radix-2 floating-point multipliers, the cost is 602N − 70 CLB slices.
C. Radix 2 parallel network
An N-tap complex FIR filter can alternatively be designed as a network of radix-2 parallel arithmetic operators. The library of Xilinx CORE arithmetic modules [6], which can be scaled in terms of precision, is used. Since the modules are defined for fixed-point arithmetic, appropriate exponent handling units are added to support floating-point arithmetic. For 24-bit significands and 8-bit exponents, the cost of a radix-2 parallel floating-point adder is 30 CLB slices and the cost of a radix-2 parallel floating-point multiplier is 320 CLB slices. Since an N-tap complex FIR filter uses 3N − 1 radix-2 floating-point adders and 4N radix-2 floating-point multipliers, the cost is 1370N − 30 CLB slices.
D. Cost comparison
The costs of the proposed radix-2j on-line network, the alternative radix-2 on-line network, and the radix-2 parallel network are compared for the implementation of an N-tap complex FIR filter for common values of N, including N = 8, 16, 64, and 256. In each case, we assume floating-point operands consisting of 24-digit (or 24-bit) significands and 8-bit exponents, as shown in Table III.
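The comparison can be reproduced from the three cost expressions stated in the preceding subsections (356N − 108, 602N − 70, and 1370N − 30 CLB slices). The sketch below simply evaluates those expressions for the listed tap counts; the function and dictionary names are illustrative, and the values are computed from the expressions rather than copied from Table III.

```python
def network_cost(n_taps, per_tap, offset):
    """CLB-slice cost of the form per_tap * N - offset."""
    return per_tap * n_taps - offset


# (slope, offset) pairs taken from the three cost expressions in the text.
NETWORKS = {
    "radix-2j on-line": (356, 108),
    "radix-2 on-line": (602, 70),
    "radix-2 parallel": (1370, 30),
}


def cost_table(tap_counts=(8, 16, 64, 256)):
    """Map each network name to its list of costs, one per tap count."""
    return {
        name: [network_cost(n, slope, off) for n in tap_counts]
        for name, (slope, off) in NETWORKS.items()
    }


if __name__ == "__main__":
    for name, costs in cost_table().items():
        print(f"{name:18s}", costs)
```

For large N the constant offsets become negligible, so under these expressions the radix-2j network approaches roughly 59% of the radix-2 on-line cost (356/602) and roughly 26% of the radix-2 parallel cost (356/1370).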
VII. CONCLUSION
We have demonstrated a new approach for implementing an N-tap complex FIR filter, based on complex number on-line arithmetic modules that adopt a redundant complex number system (RCNS) for efficient representation. A significant improvement in cost in comparison to a radix-2 on-line approach and a radix-2 parallel approach has been shown. This offers motivation for further research into other applications utilizing complex number operations.
