Abstract-In this paper, we present a novel implementation for the inverse of an n-by-n matrix consisting of complex elements, using complex number on-line arithmetic, based on adopting a redundant complex number system (RCNS) to represent complex operands as a single number. We present comparisons with (i) a real number on-line arithmetic approach, and (ii) a real number arithmetic parallel approach, to demonstrate a significant improvement in cost and delay.
I. INTRODUCTION
The inverse of a matrix is an important operation in linear algebra, as well as in various business and input-output models. The inverse of an n-by-n matrix $A$ is denoted $A^{-1}$ and satisfies $AA^{-1} = A^{-1}A = I_n$, where $I_n$ is the n-by-n identity matrix containing 1's along the main diagonal and 0's elsewhere. The inverse serves as a way to "divide" matrices, in that if $AB = C$, then $A = CB^{-1}$ and $B = A^{-1}C$. If a matrix $A$ has no inverse (the determinant of $A$ is 0), it is called singular. Techniques for inverting an n-by-n matrix include: (i) the adjoint-matrix method [2], (ii) LU decomposition [8], and (iii) Gauss-Jordan elimination [7]. Due to its regular structure and relatively low cost, the Gauss-Jordan elimination method is used to implement the complex matrix inversion unit.
The Gauss-Jordan elimination method extends the original n-by-n matrix $A$ to an n-by-2n matrix:
$$[\,A \mid I_n\,]$$
Gaussian elimination is used to "zero-off" non-diagonal elements on the left half of the matrix, and the diagonal elements of the left half are scaled such that the resultant n-by-2n matrix is:
$$[\,I_n \mid A^{-1}\,]$$
where $A^{-1}$ is the matrix inverse of $A$.
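As an illustration of the procedure just described, the following is a minimal software sketch of Gauss-Jordan inversion on a complex matrix. It is not the hardware unit of this paper: it uses Python's built-in `complex` type rather than on-line arithmetic, the name `gauss_jordan_inverse` is hypothetical, and the partial pivoting step is an added assumption that the text does not describe.

```python
def gauss_jordan_inverse(a):
    """Invert an n-by-n complex matrix given as a list of rows."""
    n = len(a)
    # Build the n-by-2n augmented matrix [A | I_n].
    aug = [[complex(v) for v in row] +
           [1 + 0j if i == j else 0j for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting: an assumption added for numerical safety,
        # not part of the procedure described in the text.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) == 0:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # "Zero-off" this column in every other row.
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [x - f * p for x, p in zip(aug[r], aug[col])]
    # Scale each row so the left half becomes I_n.
    for i in range(n):
        d = aug[i][i]
        aug[i] = [x / d for x in aug[i]]
    # The right half now holds the inverse.
    return [row[n:] for row in aug]


A = [[1 + 1j, 2], [3j, 4 - 1j]]
A_inv = gauss_jordan_inverse(A)
```

Multiplying `A` by `A_inv` should give the identity matrix up to floating-point rounding.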
II. COMPLEX NUMBER ON-LINE FLOATING-POINT ARITHMETIC
On-line arithmetic [4] is a class of arithmetic operations in which all operations are performed digit-serially, in a most significant digit first (MSDF) manner. Advantages over conventional parallel arithmetic include: (i) the ability to overlap dependent operations, since on-line algorithms produce the output serially, most significant digit first, enabling successive operations to begin before previous operations have completed; (ii) low-bandwidth communication, since intermediate results pass to and from modules digit-serially, so connections need only be one digit wide; and (iii) support for variable precision, since once a desired precision is obtained, successive outputs can be ignored. One of the key parameters of on-line arithmetic is the on-line delay, defined as the number of digits of the operand(s) necessary to generate the first digit of the result. Each successive digit of the result is generated one per cycle. This is illustrated in Figure 1, with on-line delay δ = 4. The latency of an on-line arithmetic operator, assuming m-digit precision, is then δ + m − 1.

The radix-2j number system underlying $\mathrm{RCNS}_{2j,3}$ was introduced as the Quater-imaginary Number System in [5]. For implementation of the complex matrix inversion unit, in order to permit a relatively wide range of input values, we assume floating-point arithmetic. Three on-line floating-point arithmetic operations are used: (i) $\mathrm{RCNS}_{2j,3}$ on-line floating-point addition; (ii) $\mathrm{RCNS}_{2j,3}$ on-line floating-point multiplication; and (iii) $\mathrm{RCNS}_{2j,3}$ on-line floating-point division. The recurrence algorithms and implementation parameters when mapped to a Xilinx Virtex FPGA are discussed in detail.
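The latency expression δ + m − 1 and the benefit of overlapping dependent operations can be made concrete with a small calculation. The sketch below uses hypothetical function names and assumes that, in a chain of operators, each one starts as soon as the first digit of its predecessor is available, so that only the on-line delays accumulate.

```python
def online_latency(delta, m):
    """Cycles for one on-line operator to emit all m result digits."""
    # The first digit appears after delta cycles, then one digit per cycle.
    return delta + m - 1


def chained_latency(deltas, m):
    """Cycles for a chain of dependent on-line operators (assumed fully
    overlapped: each stage begins once its predecessor emits a digit)."""
    return sum(deltas) + m - 1


# Single operator from the Figure 1 setting (delta = 4) at m = 24 digits.
single = online_latency(4, 24)            # 27 cycles
# Three dependent operators with on-line delays 3, 9 and 9.
chain = chained_latency([3, 9, 9], 24)    # 44 cycles
# The same three operators run back to back without overlap.
no_overlap = sum(online_latency(d, 24) for d in [3, 9, 9])  # 90 cycles
```

Under this assumption the chained latency grows with the sum of the on-line delays, while the m − 1 term is paid only once.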
Using $\mathrm{RCNS}_{2j,3}$, a floating-point complex number $x = (X_R + jX_I) \cdot (2j)^{e_x}$ can be normalized with regard to either the real component $X_R$ or the imaginary component $X_I$, whichever has the larger absolute value. The exponent $e_x$ is shared between the real and imaginary components. Exponent overflow/underflow can be handled by setting an exception flag and allowing processing of the (erroneous) results to continue.
An $\mathrm{RCNS}_{2j,3}$ fraction $x$ is considered normalized if $2^{-1} \le \max(|X_R|, |X_I|) < 1$.
The output of a complex number operation can be undernormalized for several reasons:
1. The range of an output determined by the on-line algorithm allows it to be undernormalized.
2. Digit cancellation resulting from the addition/subtraction of numbers with the same exponent value.
In this paper, we assume operands of an $\mathrm{RCNS}_{2j,3}$ on-line algorithm have non-zero most significant digits and are normalized. When the result $Z$ exceeds the range of a normalized fraction (i.e. $\max(|Z_R|, |Z_I|) \ge 1$), the exponent is incremented. When the result is below the range of a normalized fraction (i.e. $\max(|Z_R|, |Z_I|) < \frac{1}{2}$), the exponent is decremented and leading zeros are discarded. The normalization algorithm, which takes as input the generated output digit $z_k$, the output exponent $e_z$, and the on-line delay δ of the arithmetic operation, is shown below. This is similar to the normalization algorithm presented in [3] for radix-2 on-line rotation.
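Independently of the digit-serial normalization algorithm, the value-level effect of the exponent adjustments described above can be sketched as follows. This is an illustration only: it operates on a whole `complex` value rather than on a stream of digits, the name `normalize` is hypothetical, and it assumes the normalized range is 1/2 ≤ max(|X_R|, |X_I|) < 1.

```python
def normalize(x, e):
    """Return (X, e) with 1/2 <= max(|Re X|, |Im X|) < 1 and the value
    X * (2j)**e unchanged. Zero is returned as is."""
    if x == 0:
        return x, e
    # Above the normalized range: divide by 2j, increment the exponent.
    while max(abs(x.real), abs(x.imag)) >= 1:
        x = complex(x.imag / 2, -x.real / 2)   # X / (2j)
        e += 1
    # Below the normalized range: multiply by 2j, decrement the exponent.
    while max(abs(x.real), abs(x.imag)) < 0.5:
        x = complex(-2 * x.imag, 2 * x.real)   # X * (2j)
        e -= 1
    return x, e


X, e = normalize(3 + 0.5j, 0)
# X == -0.75 - 0.125j and e == 2, since (-0.75 - 0.125j) * (2j)**2 == 3 + 0.5j
```

Each multiplication or division by 2j swaps the real and imaginary parts (with a sign change) and scales them by exactly 2 or 1/2, so max(|X_R|, |X_I|) halves or doubles at every step and the loops terminate.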
Although $\mathrm{RCNS}_{2j,3}$ allows flexibility in representation, there are also several drawbacks:
• Handling digits 3 and −3 requires producing significand multiples $3X$ and $-3X$, requiring an extra addition step.
• A significand $X$ with fractional real and imaginary components $X_R$ and $X_I$ can have integer digits, such as $(11.3212)_{2j}$.
To handle these cases, two recoding modules are presented: (i) digit-set recoding; and (ii) most-significant-digit recoding.
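The second drawback can be checked numerically by evaluating a radix-2j digit string exactly. The helper below is a hypothetical illustration using exact rational arithmetic. It assumes the digits of the example are all positive as printed; if any digit originally carried a negation mark, the value would differ.

```python
from fractions import Fraction


def radix_2j_value(digits, msd_weight):
    """Exact (real, imag) value of a most-significant-digit-first string
    whose first digit has weight (2j)**msd_weight."""
    re, im = Fraction(0), Fraction(0)
    for d in digits:
        # Horner step: value = value * 2j + d
        re, im = -2 * im + d, 2 * re
    # After the loop the last digit has weight (2j)**0; shift into place.
    shift = msd_weight - (len(digits) - 1)
    for _ in range(abs(shift)):
        if shift > 0:
            re, im = -2 * im, 2 * re      # multiply by 2j
        else:
            re, im = im / 2, -re / 2      # divide by 2j
    return re, im


# (11.3212) in radix 2j: two integer digits, four fractional digits.
value = radix_2j_value([1, 1, 3, 2, 1, 2], 1)
# value == (Fraction(5, 8), Fraction(5, 8)), i.e. 0.625 + 0.625j
```

Under that assumption both components are proper fractions even though the string has non-zero digits at the integer positions (2j)^1 and (2j)^0.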
A. Digit-set recoding
In order to reduce the complexity introduced by handling digits −3 and 3, digit-set recoding initially recodes
, two cases of pairs of values must be prevented:
To do so, $x_{k+2}$ is examined. If $x_{k+2} \le -2$ and $x_k = 2$, which could allow the first case, $x_k$ is recoded as (1, 2), otherwise as (0, 2). In the same way, if $x_{k+2} \ge 2$ and $x_k = -2$, which could allow the second case, $x_k$ is recoded as (1, 2), otherwise as (0, 2). Then it is assured that $\chi_k \in \{-2, \ldots, 2\}$. The digit-set recoding algorithm DSREC is shown below.
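Independently of the DSREC listing, the look-ahead rule described above can be sketched in software. The sketch rests on several assumptions that are not confirmed by the text: each digit $x_k$ is split into a transfer $t_k$ and an interim digit $w_k$ with $x_k = -4t_k + w_k$ (using $(2j)^2 = -4$), the recoded digit is $\chi_k = w_k + t_{k+2}$, and the signs of the recoded pairs are chosen here so that the value is preserved. The function names are hypothetical.

```python
from fractions import Fraction


def radix_2j_value(digits, msd_weight):
    """Exact (real, imag) value; first digit has weight (2j)**msd_weight."""
    re, im = Fraction(0), Fraction(0)
    for d in digits:
        re, im = -2 * im + d, 2 * re          # value = value * 2j + d
    shift = msd_weight - (len(digits) - 1)
    for _ in range(abs(shift)):
        if shift > 0:
            re, im = -2 * im, 2 * re          # multiply by 2j
        else:
            re, im = im / 2, -re / 2          # divide by 2j
    return re, im


def dsrec(digits):
    """Recode MSD-first digits in {-3..3} into {-2..2}. The result is two
    digits longer at the most significant end (assumed scheme)."""
    n = len(digits)

    def split(i):
        x = digits[i]
        nxt = digits[i + 2] if i + 2 < n else 0   # look-ahead digit x_{k+2}
        if x == 3:
            return -1, -1                         # 3 = -4*(-1) + (-1)
        if x == -3:
            return 1, 1                           # -3 = -4*1 + 1
        if x == 2:
            # Avoid interim 2 meeting an incoming transfer of +1.
            return (-1, -2) if nxt <= -2 else (0, 2)
        if x == -2:
            # Avoid interim -2 meeting an incoming transfer of -1.
            return (1, 2) if nxt >= 2 else (0, -2)
        return 0, x

    out = [0] * (n + 2)
    for i in range(n):
        t, w = split(i)
        out[i] += t        # transfer lands two positions more significant
        out[i + 2] += w    # interim digit stays at its own position
    return out


x = [3, -3, 2, 3, -2, 2, -3, 1]
chi = dsrec(x)   # [-1, 1, -2, 0, -2, -1, -1, 2, 1, 1]
```

Because the output gains two leading positions, its first digit has a weight $(2j)^2$ times that of the first input digit; with that offset, `radix_2j_value(chi, 2)` equals `radix_2j_value(x, 0)`.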
B. Most-significant-digit recoding
In order to handle carries produced when performing operations on significands consisting of $\mathrm{RCNS}_{2j,3}$ digits, most-significant-digit recoding recodes most-significant residual digits $w_{-1}, w_0 \in \{-1, 0, 1\}$ of respective weights (2j) 
MSREC(w
ez is produced such that
Each output digit at step $k - \delta$, namely $z_{k-\delta}$, is generated based on input digits $x_k$ and $y_k$. The design of an m-digit significand and e-bit exponent $\mathrm{RCNS}_{2j,3}$ on-line floating-point adder is shown in Figure 2. The SUBE unit computes the difference of the exponents. The ALIGN unit performs alignment of operand $y_k$ to synchronize the arrival of the input digits. The SWAP unit exchanges the operands if necessary. The PPM and MMP modules are simple full-adders that appropriately negate (indicated by "-" on the port) inputs and outputs to perform borrow-save addition. The NORM unit normalizes the result by updating the output exponent $e_z$. A summary of the cost of individual modules is shown in Table I; the total cost is $3m + 4e + 4$ CLB slices. Assuming m = 24 and e = 8, the cost is 108 CLB slices. The on-line delay is δ = 3.
B. $\mathrm{RCNS}_{2j,3}$ on-line floating-point multiplication
$\mathrm{RCNS}_{2j,3}$ floating-point multiplication ($z = xy$) is defined such that given inputs $x = (X_R + jX_I) \cdot (2j)^{e_x}$ and $y = (Y_R + jY_I) \cdot (2j)^{e_y}$, the output $z = (Z_R + jZ_I) \cdot (2j)^{e_z}$ is produced such that
$$(Z_R + jZ_I) \cdot (2j)^{e_z} = (X_R + jX_I)(Y_R + jY_I) \cdot (2j)^{e_x + e_y}.$$
The operands $x_k$ and $y_k$ are recoded into digit set $\{-2, \ldots, 2\}$ using two DSREC units. The two most significant digits of the recurrence are determined using two MSREC units, which perform output digit selection and handle potential most significant carry-out bits from the adders. The ADDE unit adds the two input exponents to produce the exponent of the output, not considering exponent overflow/underflow. The NORM unit normalizes the result by updating the output exponent $e_z$ each cycle until the output digit $z_{k-\delta}$ is non-zero. The design of an m-digit significand and e-bit exponent on-line floating-point multiplier is shown in Figure 3. The number of individual module types utilized, the cost per module type, and the total overall cost are summarized in Table II. Assuming m = 24 and e = 8, the total cost is 452 CLB slices. The on-line delay is δ = 9.
C. $\mathrm{RCNS}_{2j,3}$ on-line floating-point division
An $\mathrm{RCNS}_{2j,3}$ on-line floating-point divider can be designed as a series of modular slices, where each slice consists of two borrow-save digit multipliers, a 3:1 borrow-save digit adder, a pair of digit-wide latches, a D flip-flop, and a digit-wide register of D flip-flops. The operand digit $y_k$ and the output digit $z_{k-\delta}$ are recoded into digit set $\{-2, \ldots, 2\}$ using two DSREC units. The most significant digits of the recurrence are determined using two MSREC units, which handle potential most significant carry-out bits from the adders. The SELDIV unit selects the output digit $z_{k-\delta}$. The SUBE unit subtracts the two exponents to produce the exponent of the output, not considering exponent overflow/underflow. The digit $x_k$ is appended to the most significant end of the vector product $Z[k-1]\,y_k$. The NORM unit normalizes the result by updating the exponent $e_z$. The design of an m-digit significand and e-bit exponent on-line floating-point divider is shown in Figure 4. The number of individual module types utilized, the cost per module type, and the total overall cost are summarized in Table III. Assuming m = 24 and e = 8, the total cost is 545 CLB slices. The on-line delay is δ = 9.
Since this operation is applied toward $(n-1)$ elements within each of $(n-1)$ rows, successively $n$ times, until only diagonal elements remain in the left half of the matrix, at which point the diagonal elements are scaled to produce 1's along the diagonal, requiring $n^2$ divisions, the total number of arithmetic operations is: $n(n-1)^2$ additions/subtractions, $n(n-1)^2$ multiplications, and $2n(n-1)$ divisions. The total delay is: $n$ additions/subtractions, $n$ multiplications, and $(n+1)$ divisions. We compare a parallel radix-2 approach, an on-line radix-2 approach, and the on-line $\mathrm{RCNS}_{2j,3}$ approach.
A. $\mathrm{RCNS}_{2j,3}$ on-line network
An n-by-n complex matrix inversion unit can be designed as a network of $\mathrm{RCNS}_{2j,3}$ floating-point arithmetic operators. For 24-bit significands and 8-bit exponents, the cost and on-line delay are 108 CLB slices and 3 cycles for a complex floating-point adder, 452 CLB slices and 9 cycles for an on-line complex floating-point multiplier, and 545 CLB slices and 9 cycles for an on-line complex floating-point divider. The total cost is $560n^3 - 30n^2 - 530n$ CLB slices. The total latency, summing the on-line delays to produce the first digit, after which the remaining 23 digits are produced one per cycle, is $21n + 32$ cycles.
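The closed forms above are consistent with weighting the stated operation counts, $n(n-1)^2$ additions/subtractions, $n(n-1)^2$ multiplications and $2n(n-1)$ divisions, by the module costs, and the stated critical path of $n$ additions, $n$ multiplications and $(n+1)$ divisions by the on-line delays. The worked expansion below assumes this is how the totals were obtained.

```latex
\begin{align*}
\text{cost}(n) &= (108 + 452)\,n(n-1)^2 + 545 \cdot 2n(n-1) \\
               &= 560\,(n^3 - 2n^2 + n) + 1090\,(n^2 - n) \\
               &= 560n^3 - 30n^2 - 530n, \\[4pt]
\text{latency}(n) &= 3n + 9n + 9(n+1) + (24 - 1) \\
                  &= 21n + 32.
\end{align*}
```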
B. Radix 2 on-line network
An n-by-n complex matrix inversion unit can alternatively be designed as a network of radix-2 floating-point arithmetic operators, as described in [6]. For 24-bit significands and 8-bit exponents, the cost and on-line delay are 140 CLB slices and 3 cycles for an equivalent complex floating-point adder, 868 CLB slices and 7 cycles for an equivalent on-line complex floating-point multiplier, and 1502 CLB slices and 12 cycles for an equivalent on-line complex floating-point divider. The total cost is $1008n^3 + 988n^2 - 1996n$ CLB slices. The total latency, summing the on-line delays to produce the first digit, after which the remaining 23 digits are produced one per cycle, is $22n + 35$ cycles.
C. Radix 2 parallel network
An n-by-n complex matrix inversion unit can alternatively be designed as a network of radix-2 parallel arithmetic operators. The library of Xilinx CORE floating-point arithmetic modules [9], which can be scaled in terms of precision, is used. For 24-bit significands and 8-bit exponents, the cost and latency are 672 CLB slices and 11 cycles for an equivalent parallel complex floating-point adder, 1292 CLB slices and 17 cycles for an equivalent parallel complex floating-point multiplier, and 3492 CLB slices and 44 cycles for an equivalent parallel complex floating-point divider. The total cost is $1964n^3 + 3056n^2 - 5020n$ CLB slices. The total latency is $72n + 44$ cycles.
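The three designs can be compared side by side by evaluating cost and latency from the per-operator figures quoted in this section. The sketch below is an illustrative calculation, not part of the original evaluation. It assumes the operation counts $n(n-1)^2$, $n(n-1)^2$ and $2n(n-1)$ and the critical path of $n$, $n$ and $(n+1)$ operations stated earlier, and that the on-line designs add 23 cycles for the remaining digits. The dictionary layout and function names are hypothetical.

```python
M = 24  # significand digits

# (cost in CLB slices, delay in cycles) per operator, as quoted in the text.
DESIGNS = {
    "RCNS_2j,3 on-line": {"add": (108, 3), "mul": (452, 9),
                          "div": (545, 9), "online": True},
    "radix-2 on-line":   {"add": (140, 3), "mul": (868, 7),
                          "div": (1502, 12), "online": True},
    "radix-2 parallel":  {"add": (672, 11), "mul": (1292, 17),
                          "div": (3492, 44), "online": False},
}


def cost(name, n):
    """Total CLB slices for an n-by-n inversion network."""
    d = DESIGNS[name]
    add_mul_ops = n * (n - 1) ** 2     # additions and multiplications each
    div_ops = 2 * n * (n - 1)
    return add_mul_ops * (d["add"][0] + d["mul"][0]) + div_ops * d["div"][0]


def latency(name, n):
    """Total cycles along the critical path."""
    d = DESIGNS[name]
    cycles = n * d["add"][1] + n * d["mul"][1] + (n + 1) * d["div"][1]
    # On-line operators emit the remaining M - 1 digits one per cycle.
    return cycles + (M - 1 if d["online"] else 0)


for name in DESIGNS:
    print(f"{name:20s} n=4: {cost(name, 4):7d} slices, "
          f"{latency(name, 4):4d} cycles")
```

For n = 4 this gives 33240 slices and 116 cycles for the $\mathrm{RCNS}_{2j,3}$ on-line network, matching the closed forms $560n^3 - 30n^2 - 530n$ and $21n + 32$.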
