Abstract-In this paper, we present an implementation of the Complex Householder Transform using complex number on-line arithmetic, based on a redundant complex number system (RCNS) that represents complex operands as a single number. We present comparisons with (i) a real number on-line arithmetic approach, and (ii) a real number parallel arithmetic approach, to demonstrate a significant improvement in cost.
I. INTRODUCTION
The Householder Transform [4] is an important operation in numerous signal processing applications, including QR decomposition and array processing [7]. When the elements of the matrix are complex numbers, it is denoted the Complex Householder Transform (CHT) [2]. The CHT is applied to a column vector x to zero out all the elements except the first one. Given a column vector x = [x_1, ..., x_k]^T ∈ C^k, where x_1 = |x_1| e^{jθ} with θ ∈ R, the basic steps of the CHT algorithm are: (i) define a column vector u = x + e^{jθ} ||x||_2 e_1, where e_1^T = [1, 0, ..., 0]; (ii) define the k × k CHT B as:
(iii) apply the matrix B to the vector x to produce a column vector y = Bx, in which all elements are zeroed out except for the first element, i.e., y = [y_1, 0, ..., 0]^T.
In these steps, I is the k × k identity matrix and u^H is the complex conjugate transpose of u.
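The three steps can be sketched in a few lines of Python. This is a minimal sketch which assumes that B takes the standard Householder reflector form B = I − 2uu^H/(u^H u); the function name `complex_householder` and the sample vector are illustrative only.

```python
import math

def complex_householder(x):
    """Return (B, y) with y = B x for a complex vector x (assumes x[0] != 0)."""
    k = len(x)
    norm_x = math.sqrt(sum(abs(xi) ** 2 for xi in x))   # ||x||_2
    phase = x[0] / abs(x[0])                            # e^{j*theta}
    # Step (i): u = x + e^{j*theta} * ||x||_2 * e_1
    u = list(x)
    u[0] = x[0] + phase * norm_x
    # Step (ii): assumed standard reflector B = I - 2 u u^H / (u^H u)
    uhu = sum(abs(ui) ** 2 for ui in u)
    B = [[(1.0 if r == c else 0.0) - 2.0 * u[r] * u[c].conjugate() / uhu
          for c in range(k)] for r in range(k)]
    # Step (iii): y = B x
    y = [sum(B[r][c] * x[c] for c in range(k)) for r in range(k)]
    return B, y

x = [3 + 4j, 1 - 2j, 2j]
B, y = complex_householder(x)
# Under the assumed form of B, y[0] equals -e^{j*theta} * ||x||_2 and the
# remaining elements vanish up to floating-point rounding.
```

Under this assumption the only surviving element is y_1 = −e^{jθ} ||x||_2, so the full matrix B never has to be formed in hardware.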
II. COMPLEX NUMBER ON-LINE FLOATING-POINT ARITHMETIC
On-line arithmetic [3] is a class of arithmetic operations in which all operations are performed digit-serially, in a most-significant-digit-first (MSDF) manner. Advantages over conventional parallel arithmetic include: (i) the ability to overlap dependent operations, since on-line algorithms produce the output serially, most-significant digit first, enabling successive operations to begin before previous operations have completed; (ii) low-bandwidth communication, since intermediate results pass to and from modules digit-serially, so connections need only be one digit wide; and (iii) support for variable precision, since once a desired precision is obtained, successive outputs can be ignored. One of the key parameters of on-line arithmetic is the on-line delay, defined as the number of digits of the operand(s) necessary in order to generate the first digit of the result, which is generally one cycle after the result is computed. Each successive digit of the result is generated one per cycle. This is illustrated in Figure 1, with on-line delay δ = 4. The latency of an on-line arithmetic operator, assuming m-digit precision, is then δ + m − 1.
This number system was introduced as the Quarter-imaginary Number System in [5]. At the binary level, the digits will be represented using borrow-save encoding, in which each digit x_k ∈ {−3, −2, −1, 0, 1, 2, 3} is represented as 4 bits. This is an extension of radix 2 borrow-save encoding, in which each digit x_k ∈ {−1, 0, 1} is represented as 2 bits.
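Returning to the latency expression δ + m − 1 given earlier in this section, the effect of overlapping dependent operations can be illustrated with a small calculation. This is a sketch under an assumption not stated in the text: that in a chain of on-line operators each stage begins as soon as its predecessor has emitted enough digits, so that the on-line delays add while the m − 1 term is paid only once. Both function names are hypothetical.

```python
def online_chain_latency(deltas, m):
    """Latency of chained on-line operators with overlap (assumed model)."""
    return sum(deltas) + m - 1

def non_overlapped_latency(deltas, m):
    """Baseline: each operator waits for all m digits of its predecessor."""
    return sum(delta + m - 1 for delta in deltas)

m = 24                       # digits of precision
single = online_chain_latency([4], m)        # one operator, delta = 4
chained = online_chain_latency([4, 4], m)    # two dependent operators
serial = non_overlapped_latency([4, 4], m)   # same pair, no overlap
```

With δ = 4 and m = 24, a single operator takes 27 cycles, two overlapped operators take 31, and the same two operators without overlap take 54.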
The throughput of a radix 2j on-line arithmetic operator is the same as that of the radix 2 implementation of a complex arithmetic operator. A radix 2j on-line arithmetic operator generates real and imaginary digits in alternate cycles, with each radix 2j digit corresponding to two radix 2 digits. The equivalent radix 2 implementation of a complex on-line arithmetic operator generates real and imaginary digits in the same cycle. This is demonstrated below for the output of two radix 2j borrow-save encoded digits z_1 = (z
Since the throughputs are equal (differing only in their respective on-line delays), a radix 2j output can be converted to a radix 2 format for input to radix 2 on-line arithmetic operators, and vice versa, at the digit level. The algorithm below demonstrates the conversion from a radix 2j borrow-save digit
Radix 2j to Radix 2 Borrow-save Conversion
The algorithm below demonstrates conversion from real and imaginary radix 2 borrow-save digits z R,k = (z
Radix 2 to Radix 2j Borrow-save Conversion
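The correspondence between one radix 2j digit and two radix 2 digits can be illustrated at the value level. The sketch below is not the bit-level borrow-save conversion; it only assumes that fractional digit x_k has weight (2j)^{-k} = 2^{-k} · j^{-k}, so that even-indexed digits contribute to the real part and odd-indexed digits to the imaginary part. The helper names are hypothetical.

```python
def split_digit(d):
    """Split d in {-3..3} as d = 2*a + b with a, b in {-1, 0, 1}."""
    a = (d > 1) - (d < -1)
    return a, d - 2 * a

def radix2j_to_radix2(digits):
    """digits[k-1] has weight (2j)^-k; returns (real, imag) signed-digit lists
    in which index p has weight 2^-p."""
    m = len(digits)
    real = [0] * (m + 1)
    imag = [0] * (m + 1)
    for k, d in enumerate(digits, start=1):
        # j^-k is -j, -1, +j, +1 for k = 1, 2, 3, 4 (mod 4)
        sign = -1 if k % 4 in (1, 2) else 1
        a, b = split_digit(sign * d)
        target = imag if k % 2 else real
        target[k - 1] = a       # weight 2^-(k-1)
        target[k] = b           # weight 2^-k
    return real, imag

real, imag = radix2j_to_radix2([3, 2, 1, 2])
```

Each radix 2j digit fills exactly two adjacent radix 2 positions of one component, which is why the two representations deliver digits at the same overall rate.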
Using a radix 2j representation, a floating-point complex number x = (X_R + jX_I) · (2j)^{e_x} can be normalized with regard either to the real component X_R or the imaginary component X_I, depending on which has the larger absolute value.
The exponent e_x is shared between the real and imaginary components. A radix 2j fraction x is considered normalized if
The normalization algorithm, which takes as input the generated output digit z_k, the output exponent e_z, and the on-line delay δ of the arithmetic operation, is shown below.
NORM(z
Although RCNS_{2j,3} allows flexibility in representation, there are also several drawbacks:
• Handling digits 3 and −3 requires producing significand multiples 3X and −3X, requiring an extra addition step.
• A significand X with fractional real and imaginary components X_R and X_I can have integer digits, such as (11.3212)_{2j} =
Several recoding algorithms to handle these issues are described, including: (i) digit-set recoding; and (ii) most-significant-digit recoding.
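The integer-digit example (11.3212)_{2j} from the list of drawbacks can be checked numerically. The sketch below assumes the usual positional reading, in which the digits left of the radix point have weights (2j)^1 and (2j)^0 and the digits to the right have weights (2j)^{-1}, (2j)^{-2}, and so on; `radix2j_value` is a hypothetical helper.

```python
def radix2j_value(int_digits, frac_digits):
    """Evaluate a radix 2j digit string (most-significant digit first)."""
    value = 0j
    n = len(int_digits)
    for i, d in enumerate(int_digits):
        value += d * (2j) ** (n - 1 - i)     # weights (2j)^(n-1) ... (2j)^0
    for k, d in enumerate(frac_digits, start=1):
        value += d * (2j) ** (-k)            # weights (2j)^-1, (2j)^-2, ...
    return value

value = radix2j_value([1, 1], [3, 2, 1, 2])
# value equals 0.625 + 0.625j up to floating-point rounding, so both
# components are fractional even though two integer digits are non-zero.
```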
Digit-set recoding initially recodes a RCN S
In order to restrict χ_k ∈ {−2, ..., 2}, two cases of pairs of values must be prevented:
To do so, x_{k+2} is examined. If x_{k+2} ≤ −2 and x_k = 2, which could allow the first case, x_k is recoded as (1, 2), otherwise as (0, 2). In the same way, if x_{k+2} ≥ 2 and x_k = −2, which could allow the second case, x_k is recoded as (1, 2), otherwise as (0, 2). Then it is assured that χ_k ∈ {−2, ..., 2}. The digit-set recoding algorithm DSREC is shown below.
In order to handle carries produced when performing operations on significands consisting of RCNS_{2j,3} digits, most-significant-digit recoding recodes most-significant residual digits w_{−1}, w_0 ∈ {−1, 0, 1} of respective weights (2j)^1 = 2j and (2j)^0 = 1, and digits w_1, w_2 ∈ {−3, ..., 3} of respective weights (2j)^{−1} and (2j)^{−2}, into digits ω_1, ω_2 ∈ {−3, ..., 3} of respective weights (2j)^{−1} and (2j)^{−2}. The algorithm MSREC for recoding general digits w_{k−2} and w_k into digit ω_k is shown next.
MSREC(w
III. CHT IMPLEMENTATION
The CHT produces a k-element complex column vector y in which y_1 is the only non-zero element. Simplifying the computation results in
The implementation requires k complex multipliers (CMULT), k − 1 real adders (RADD), 1 real divider (RDIV), a real square root unit (RSQRT), a unit to negate a digit (NEG), and a complex-real multiplier (CRMULT), as shown in Figure 3. Since the product of a complex-conjugate multiplier is a real number, radix 2j to radix 2 converters will be used to convert the outputs into radix 2 representation for input to the real adders. Likewise, radix 2 to radix 2j converters will be used to convert the output of the real square root unit to radix 2j representation for input to the complex-real multiplier, which produces the complex output y_1. The recurrence algorithms and designs of the radix 2j on-line floating-point complex-conjugate multiplier and the radix 2j on-line floating-point complex-real multiplier are described next.
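The dataflow through the listed units can be sketched in software. This is a sketch under an assumption: that the simplified result is y_1 = −x_1 · sqrt((Σ x_i x_i^*) / (x_1 x_1^*)), which follows from y_1 = −e^{jθ} ||x||_2 with e^{jθ} = x_1/|x_1| for a standard Householder reflector. The function name and the placement of the negation before the final multiplication are illustrative choices.

```python
import math

def cht_first_element(x):
    """Compute y_1 from x using the unit types named in the text."""
    # CMULT: k complex-conjugate products x_i * conj(x_i), each a real number
    squares = [(xi * xi.conjugate()).real for xi in x]
    # RADD: k - 1 real additions accumulate the squared norm
    total = squares[0]
    for s in squares[1:]:
        total += s
    ratio = total / squares[0]      # RDIV: divide by x_1 * conj(x_1)
    root = math.sqrt(ratio)         # RSQRT
    negated = -root                 # NEG
    return x[0] * negated           # CRMULT: complex x_1 times a real value

x = [3 + 4j, 1 - 2j, 2j]
y1 = cht_first_element(x)
```

Dividing before taking the square root means only one square root unit is needed, which matches the unit count given above.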
A. Radix 2j on-line floating-point complex-conjugate multiplication
Radix 2j floating-point complex-conjugate multiplication (z = xx^*) is defined such that given inputs
ez is produced such that
The recurrence formula for radix 2j on-line multiplication (z = xy) is the following:
where
For radix 2j on-line floating-point complex-conjugate multiplication (z = xx^*), since
the recurrence can be rewritten as
For the implementation, two types of modular slices are required. An odd-indexed slice M_{2k−1} (k = 1 to m/2) consists of one borrow-save digit multiplier, a 2:1 borrow-save digit adder, a digit-wide latch, a D flip-flop, a bit-wide D flip-flop, a digit-wide D flip-flop, a TWICE unit for computing
, and a 2-to-1 MUX for appropriately shifting the residual. An even-indexed slice M_{2k} (k = 1 to m/2) consists of a digit-wide latch, a bit-wide D flip-flop, and a TWICE unit for computing 2X[k − 1]. A flag bit eo controls switching between odd-indexed and even-indexed slices (eo = 1 for an odd-indexed slice, and eo = 0 for an even-indexed slice). The CONJ unit generates digits of x^*. The digit x_k is recoded into the digit set {−2, ..., 2} using one DSREC unit. The most-significant digit of the recurrence is determined using one MSREC unit, which performs output digit selection and handles the most-significant carry-out bits of the adder. The MULT2E unit multiplies the operand exponent by two to produce the output exponent e_z. The NORM unit normalizes the result by updating the output exponent e_z. The design of an m-digit significand and e-bit exponent radix 2j on-line floating-point complex-conjugate multiplier unit is shown in Figure 4. The number of individual module types utilized, the cost per module type, and the total overall cost are summarized in Table I. Assuming m = 24 and e = 8, the cost is 224 CLB slices. The on-line delay is δ = 9.
B. Radix 2j on-line floating-point complex-real multiplication
Radix 2j floating-point complex-real multiplication (z = xy) is defined such that given complex input x = (X_R + jX_I) · (2j)^{e_x} and real input y = Y · 2^{e_y}, the output z = (
ez is produced such that
For radix 2j on-line floating-point complex-real multiplication, the recurrence from Equation 5 can be rewritten as:
(9)
For the implementation, two types of modular slices are required. An odd-indexed slice M_{2k−1} (k = 1 to m/2) consists of one borrow-save digit multiplier, a 2:1 borrow-save digit adder, a digit-wide latch, a bit-wide D flip-flop, and a digit-wide D flip-flop. An even-indexed slice M_{2k} (k = 1 to m/2) consists of two borrow-save digit multipliers, a 3:1 borrow-save digit adder, two digit-wide latches, a bit-wide D flip-flop, and a digit-wide D flip-flop. The digits x_k and y_k are recoded into the digit set {−2, ..., 2} using two DSREC units. The most-significant digits of the recurrence are determined using two MSREC units, which perform output digit selection and handle the most-significant carry-out bits of the adder. The ADDE unit adds the operand exponents to produce the output exponent e_z. The NORM unit normalizes the result by updating the output exponent e_z. The design of an m-digit significand and e-bit exponent radix 2j on-line floating-point complex-real multiplier unit is shown in Figure 5. The number of individual module types utilized, the cost per module type, and the total overall cost are summarized in Table II. Assuming m = 24 and e = 8, the cost is 356 CLB slices. The on-line delay is δ = 9.
Total cost (Table II): 12m + 3e + 44
Three approaches for the design of the CHT are compared: (i) a radix 2j approach, which uses a combination of radix 2j on-line arithmetic modules for complex inputs and radix 2 on-line arithmetic modules for real inputs; (ii) a radix 2 approach, which strictly uses radix 2 on-line arithmetic modules; and (iii) a radix 2 parallel approach, which uses the Xilinx library of floating-point parallel arithmetic operators [8]. The results in terms of cost and latency are shown next.
The costs of the proposed radix 2j on-line network, the alternative radix 2 on-line network, and the radix 2 parallel network are compared for the implementation of a CHT unit that operates on a k-element vector x, for various values of k. In each case, we assume floating-point operands consisting of 24-digit (or bit) significands and 8-bit exponents, as shown in Table III.
TABLE III
COMPARISON OF CLB COSTS FOR CHT
