While random network coding has proved to be a powerful tool for disseminating information in networks, it is highly susceptible to errors caused by various sources. Recently, constant-dimension codes (CDCs), especially Kötter-Kschischang (KK) codes, have been proposed for error control in random network coding. It has been shown that KK codes can be constructed from Gabidulin codes, an important class of rank metric codes used in storage and cryptography. Although rank metric decoders have been proposed for both Gabidulin and KK codes, it is not clear whether such decoders are feasible and suitable for hardware implementations. In this paper, we propose novel decoder architectures for both codes. The synthesis results show that our decoder architectures for Gabidulin and KK codes over small fields and with limited error-correcting capabilities are not only affordable but also achieve high throughput.
INTRODUCTION
While random network coding has proved to be a powerful tool for disseminating information in networks, it is highly susceptible to errors caused by various sources such as noise, malicious or malfunctioning nodes, or insufficient min-cut [1]. Thus, error control for random network coding is critical and has received growing attention recently. Nearly optimal Reed-Solomon-like constant-dimension codes (CDCs) based on the subspace metric, called Kötter-Kschischang (KK) codes, were proposed in [1] for noncoherent error control in network coding. Later it was shown [2] that KK codes correspond to "liftings" of Gabidulin codes [3]. Just as Reed-Solomon codes achieve the Singleton bound of Hamming distance, Gabidulin codes are a class of maximum rank distance (MRD) codes, which achieve the Singleton bound of rank distance. Due to the connection between Gabidulin and KK codes, the decoding of KK codes can be viewed as a generalized Gabidulin decoding, which involves errors, erasures, and deviations.
This work was supported in part by Thales Communications, Inc. and in part by a grant from the Commonwealth of Pennsylvania, Department of Community and Economic Development, through the Pennsylvania Infrastructure Technology Alliance (PITA).
978-1-4244-4335-2/09/$25.00 ©2009 IEEE
Although the complexity of errors-only and generalized decoding of Gabidulin codes was analyzed [4], no hardware architectures for these decoders have been reported yet. Thus it remains unknown whether these decoding algorithms are feasible and suitable for hardware implementations. The feasibility of the generalized Gabidulin decoding algorithm in hardware implementations determines whether random network coding, along with error control, can be readily applied to certain applications.
In this paper, we propose decoder architectures for Gabidulin and KK codes. We first propose a high-throughput hardware architecture for errors-only Gabidulin decoding, then extend it to decode KK codes. To evaluate the performance of our decoder architectures, we implement our decoder architecture for an (8, 4) Gabidulin code over $\mathbb{F}_{2^8}$, whose code length is the longest given the field. We also implement our decoder architecture for the corresponding KK code of the (8, 4) Gabidulin code over $\mathbb{F}_{2^8}$, which can be used in network coding with various packet lengths by Cartesian product. The synthesis results show that our decoder architectures for Gabidulin and KK codes over small fields and with limited error-correcting capabilities are not only affordable but also achieve high throughput. Our decoder architectures are novel to the best of our knowledge.
PRELIMINARIES

Rank metric and Gabidulin Codes
The rank weight of a vector over $\mathbb{F}_{q^m}$ is defined as the maximal number of its coordinates that are linearly independent over the base field $\mathbb{F}_q$. The rank metric between two vectors is the rank weight of their difference [3].
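To make the definition concrete, the following minimal Python sketch computes rank weight and rank distance, specialized to $q = 2$ for simplicity. It assumes that each coordinate in $\mathbb{F}_{2^m}$ is stored as an m-bit integer holding its coordinates with respect to some fixed basis. The function names `rank_weight` and `rank_distance` are illustrative and do not come from the paper.

```python
def rank_weight(vec, m):
    """Rank over GF(2) of the coordinates of vec, each stored as an m-bit integer."""
    basis = [0] * m  # basis[b] holds a reduced element whose highest set bit is b
    rank = 0
    for v in vec:
        # Reduce v against the basis found so far, from the top bit down.
        for b in reversed(range(m)):
            if not (v >> b) & 1:
                continue
            if basis[b] == 0:
                basis[b] = v  # v is independent of all earlier coordinates
                rank += 1
                break
            v ^= basis[b]     # cancel bit b and keep reducing
    return rank


def rank_distance(x, y, m):
    """Rank weight of the difference; subtraction is XOR in characteristic 2."""
    return rank_weight([a ^ b for a, b in zip(x, y)], m)
```

For instance, the vector (1, 2, 3) over $\mathbb{F}_{2^3}$ has Hamming weight 3 but rank weight 2, because its third coordinate is the sum of the first two.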
A Gabidulin code [3] is a linear $(n, k)$ code over $\mathbb{F}_{q^m}$ defined by a parity-check matrix [...] [5], finding the root space, finding the error locators by Gabidulin's algorithm [3], and finding error locations. Note that all polynomials involved in the decoding process have non-zero terms only of degrees $[j] = q^j$; such polynomials are called linearized polynomials. The greatest value of $j$ among the non-zero terms of a linearized polynomial is defined as its $q$-degree.
The data flow of a Gabidulin decoder is given in Figure [...] for the root space of $\sigma(x)$ using the methods in [6, 7]. Then we can find the error locators $X_j$'s corresponding to $E_j$'s by solving a system of equations [...] proved [2] that the subspace distance [1] [...] Figure 2. Like interpolation in errors-and-erasures RS decoding, the general Gabidulin decoding uses minpoly($\beta$) to compute a minimum linearized polynomial, which satisfies two conditions: the elements of $\beta$ are its roots and its $q$-degree is minimal.
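One standard way to build such a minimum linearized polynomial for $q = 2$ is incremental: starting from $M(x) = x$, each new element $\beta$ that is not already a root replaces $M(x)$ by $M(x)^2 + M(\beta) M(x)$. The Python sketch below illustrates this recursion; it is not taken from the paper. As an assumption made only for convenience, it represents $\mathbb{F}_{2^8}$ in polynomial basis with reduction polynomial $x^8 + x^4 + x^3 + x + 1$ (0x11B), whereas the decoder discussed here uses a normal basis. The names `gf_mul`, `lin_eval`, and `minpoly` are hypothetical helpers.

```python
def gf_mul(a, b, poly=0x11B):
    """Multiply in GF(2^8), polynomial basis (assumed reduction polynomial 0x11B)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r


def lin_eval(coeffs, x):
    """Evaluate sum_i coeffs[i] * x^(2^i), a linearized polynomial for q = 2."""
    acc = 0
    power = x  # holds x^(2^i)
    for c in coeffs:
        acc ^= gf_mul(c, power)
        power = gf_mul(power, power)
    return acc


def minpoly(betas):
    """Monic linearized polynomial of least q-degree vanishing on all betas."""
    M = [1]  # M(x) = x
    for beta in betas:
        e = lin_eval(M, beta)
        if e == 0:
            continue  # beta already lies in the root space
        # M(x)^2 squares every coefficient and raises its q-degree by one.
        squared = [0] + [gf_mul(c, c) for c in M]
        scaled = [gf_mul(e, c) for c in M] + [0]
        M = [s ^ t for s, t in zip(squared, scaled)]
    return M
```

With two linearly independent inputs the result has $q$-degree 2 and exactly four roots in the field, namely the span of the inputs over $\mathbb{F}_2$.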
ARCHITECTURE FOR GABIDULIN DECODING
where $\tau$ is the number of errors. After solving (1) using Gabidulin's algorithm, the error locations $L_j$'s are revealed from $X_j$'s by solving [...]
In this section, we propose a novel decoder architecture for Gabidulin codes. We describe the key features of our decoder architecture below. In most practical applications, data are stored and transmitted in binary formats. Henceforth in this paper we assume $q = 2$. [...] [8, 9]). Finite field elements can be represented by vectors using different bases: polynomial basis, normal basis, and dual basis [8]. Under normal basis, squaring is simply cyclic shifting to more significant bits, which makes it very attractive in rank metric decoders since all polynomials involved are linearized polynomials. It was pointed out in [4] that using normal basis can facilitate the computation of symbolic product. It was also suggested that solving (2) can be trivial using normal basis.
Finite Field Arithmetic
There are additional savings due to normal basis. In Gabidulin's algorithm [3] to solve (1), the major complexity [...] which requires divisions and finding square roots. Actually, when $q = 2$, they can be computed in an inversionless form $A_{i,j} = A_{i-1,j} - A_{i-1,j}^{[-1]} A_{i-1,i-1}^{[-1]}$ and $Q_{i,j} = Q_{i-1,j} - Q_{i-1,j+1}^{[-1]} A_{i-1,i-1}^{[-1]}$, which requires only finding square roots. Similar to squaring, finding square root in [...]
There are serial and parallel architectures for normal basis finite field multipliers. To achieve high throughput in our decoder, we consider only parallel architectures. The complexity of a normal basis, $C_N$, is defined as the number of terms $a_i b_j$ in computing a bit of $c = ab$, where $a_i$'s and $b_j$'s are the bits of $a$ and $b$, respectively. In this paper, we focus on the field $\mathbb{F}_{2^8}$ generated by [...] $+ 1$, on which $C_N$ is minimized to 21 [8]. Most normal basis multipliers are based on the Massey-Omura (MO) architecture. According to [9], a parallel MO multiplier needs $m^2 = 64$ AND gates and at most $m(C_N + m - 2)/2 = 108$ XOR gates. Using the common subexpression elimination algorithm from [10], we reduce the number of XOR gates to 88 while maintaining the same critical path delay (CPD) as that of one AND gate and five XOR gates.
Since squaring is almost free, an efficient method to get the inverse of $\beta$ is to find $\beta^{-1} = \beta^{2^m - 2} = \beta^2 \beta^4 \cdots \beta^{2^{m-1}}$ based on multipliers. Division is simply the combination of inversion and multiplication.
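The inversion identity above can be checked with a short Python sketch. It is an illustration under stated assumptions rather than the hardware design: it uses polynomial basis with the reduction polynomial $x^8 + x^4 + x^3 + x + 1$ (0x11B) instead of the normal basis of the paper, so squaring here is an ordinary multiplication rather than a cyclic shift. The names `gf_mul`, `gf_inv`, and `gf_div` are hypothetical.

```python
def gf_mul(a, b, poly=0x11B):
    """Multiply in GF(2^8), polynomial basis (assumed reduction polynomial 0x11B)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r


def gf_inv(beta, m=8):
    """Return beta^(2^m - 2) as the product beta^2 * beta^4 * ... * beta^(2^(m-1))."""
    result = 1
    power = beta
    for _ in range(m - 1):
        power = gf_mul(power, power)     # next power beta^(2^i); a shift in normal basis
        result = gf_mul(result, power)   # accumulate the product
    return result


def gf_div(a, b):
    """Division as inversion followed by multiplication (b must be non-zero)."""
    return gf_mul(a, gf_inv(b))
```

In a normal basis only the accumulating multiplications cost logic, which is why this form suits a multiplier-based datapath.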
BMA Architectures
The modified BMA for rank metric codes in [5] is similar to the BMA for Reed-Solomon codes except that polynomial multiplications are replaced by symbolic products. The BMA in [5] requires finite field divisions, which are more time-consuming than other arithmetic operations. We first propose an inversionless variant in Algorithm 1, which is more suitable for hardware implementation. 
[...] $B^{(r)}(x)$.
To further increase the throughput, more regular architectures are necessary for shorter CPD. Following the approaches in [11] , we develop two architectures based on Algorithm 1.
In Algorithm 1, the critical path is in step 2(a). Note that $\Delta_r$ is the $r$th coefficient of the discrepancy polynomial $\Delta^{(r)}(x) = \Lambda^{(r)}(x) \otimes S(x)$. By using $\Theta^{(r)}(x) = B^{(r)}(x) \otimes S(x)$, $\Delta^{(r+1)}(x)$ can be computed as [...] which has the same CPD as step 2(c). This reformulation leads to a more regular architecture in Algorithm 2, which is analogous to the riBM architecture in [11]. Compared to Algorithm 1, its control flow is also simpler.
[...] $^{(r)}(x)$, and $\Theta^{(r+1)}(x) = x^{[1]} \otimes \Theta^{(r+1)}(x)$.
Given the similarities between steps 2(a) and 2(b), $\Lambda(x)$ and $\Delta(x)$ can be combined into one polynomial $\tilde{\Delta}(x)$, which is more regular. Similarly, $B(x)$ and $\Theta(x)$ can be combined into one polynomial $\tilde{\Theta}(x)$. In Algorithm 3 we have the RiBMA architecture, which is closely related to the RiBM architecture in [11].
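[...] $\Gamma^{(r)} \tilde{\Delta}^{(r)}(x) - \tilde{\Delta}_0^{(r)} \tilde{\Theta}^{(r)}(x)$;
(b) Set $k = k + 1$;
(c) If $\tilde{\Delta}_0^{(r)} \neq 0$ and $k > 0$, set $k = -k$, $\Gamma^{(r+1)} = \tilde{\Delta}_0^{(r)}$, and $\tilde{\Theta}^{(r)}(x) = \tilde{\Delta}^{(r)}(x)$;
(d) Set $\tilde{\Delta}^{(r+1)}(x) = \sum_i \tilde{\Delta}_{i+1}^{(r+1)} x^{[i]}$, $\tilde{\Theta}^{(r)}(x) = \sum_i \tilde{\Theta}_{i+1}^{(r)} x^{[i]}$;
(e) Set $\Gamma^{(r+1)} = (\Gamma^{(r)})^{[1]}$ and $\tilde{\Theta}^{(r+1)}(x) = x^{[1]} \otimes \tilde{\Theta}^{(r)}(x)$.
Set $\Lambda(x) = \sum_{i=0}^{t} \tilde{\Delta}_{i+t}^{(2t)} x^{[i]}$.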
Decoding Failure
A complete decoder declares decoding failure when no valid codeword is found within the decoding radius of the received word. To the best of our knowledge, decoding failures of Gabidulin and KK codes were not discussed in previous works. Similar to Reed-Solomon decoding algorithms, a rank decoder can return decoding failure when the roots of the error span polynomial $\Lambda(x)$ are not unique. That is, the root space of $\Lambda(x)$ has dimension less than the $q$-degree of $\Lambda(x)$.
Note that this applies to both Gabidulin and KK decoders. For KK decoders, another condition to declare decoding failure is when the total number of erasures and deviations exceeds $2t$.
By using the left-RRE forms instead of RRE forms, we reduce the complexity of reduction slightly. More importantly, the reduction for left-RRE forms is completely determined by the left part of $Y$, which greatly simplifies hardware implementation.
KK Codes Lifted from Cartesian Gabidulin Codes
The left-RRE form also considerably simplifies the decoding of KK codes that are lifted from Cartesian Gabidulin codes.
In network practice, the packet length is very long and $m$ is much larger than $n$. In such cases, the decoding complexity of KK codes is prohibitive due to the huge field size of $\mathbb{F}_{2^m}$. A low-complexity approach in [2] suggested that instead of using a single long Gabidulin code, the Cartesian product of many short Gabidulin codes with the same distance can be used to construct KK codes for long packets. For KK codes that are lifted from Cartesian Gabidulin codes, we can perform decoding in a serial manner with only one decoder, or in a semi-parallel way with more decoders, or even in a fully parallel fashion. It is a tradeoff between cost/area/power and throughput.
Gaussian Elimination
We first show that finding the root space and minimum linearized polynomials can be done by Gaussian elimination.
According to [2], the complexity difference between the probabilistic algorithm in [7] and Berlekamp's deterministic method [6] is small for $q = 2$, so the deterministic method is preferred since it is much easier to implement.
Berlekamp's deterministic method first evaluates the polynomial [...]
Latency Analysis
We analyze the worst-case decoding latencies of our decoder architectures, in terms of clock cycles, in Table 1 .
Note that we assume that the coefficient of the highest degree term is one. Thus it can be solved by Gaussian elimination. Furthermore, Gabidulin's algorithm is essentially a smart way of Gaussian elimination, which takes advantage of the properties of the matrix. So Gaussian elimination appears in most steps of the decoding process, including reduction for the RRE form, finding minimum linearized polynomials, finding the root space, and Gabidulin's algorithm. The reduction and finding the root space are Gaussian eliminations on matrices over $\mathbb{F}_q$, while linearized interpolation and Gabidulin's algorithm operate on matrices over $\mathbb{F}_{q^m}$. For high-throughput implementations, we adopt the pivoting architecture from [12]. It was developed for non-singular matrices over $\mathbb{F}_2$. It always keeps the pivot element on the top-left location of the matrix, by cyclically shifting the rows and columns. To apply it to singular matrices, which appear in the reduction for the RRE form and finding the root space, we adapt the architecture to detect singularity. Our architecture is also flexible about matrix sizes, which are determined by the varying numbers of errors, erasures, and deviations. Eliminations over $\mathbb{F}_{q^m}$ require divisions. By cross-multiplications, we can avoid divisions in Gaussian elimination. Divisions are used only when the row is reduced to have only one non-zero element. In Gabidulin's algorithm, the matrix is first reduced to a triangular form. Then it performs a backward elimination after getting each coefficient. Hence we introduce a backward pivoting scheme, where the pivot element is always at the bottom-right corner.
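As a software illustration of how the root space of a linearized polynomial reduces to Gaussian elimination over $\mathbb{F}_2$, the sketch below evaluates the polynomial on a basis of the field and eliminates on the images while tracking the corresponding preimages. It does not model the cyclic-shift pivoting architecture of [12]. It assumes $\mathbb{F}_{2^8}$ in polynomial basis with reduction polynomial 0x11B, and the names `gf_mul`, `lin_eval`, and `root_space` are hypothetical.

```python
def gf_mul(a, b, poly=0x11B):
    """Multiply in GF(2^8), polynomial basis (assumed reduction polynomial 0x11B)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r


def lin_eval(coeffs, x):
    """Evaluate sum_i coeffs[i] * x^(2^i)."""
    acc, power = 0, x
    for c in coeffs:
        acc ^= gf_mul(c, power)
        power = gf_mul(power, power)
    return acc


def root_space(coeffs, m=8):
    """Basis of the GF(2)-kernel of x -> L(x), found by elimination on images."""
    pivots = {}  # highest set bit of an image -> (image, preimage)
    kernel = []
    for i in range(m):
        pre = 1 << i                 # i-th basis element of the field
        img = lin_eval(coeffs, pre)  # its image under the linear map
        while img:
            top = img.bit_length() - 1
            if top not in pivots:
                pivots[top] = (img, pre)
                break
            p_img, p_pre = pivots[top]
            img ^= p_img             # row operation on the image ...
            pre ^= p_pre             # ... mirrored on the preimage
        else:
            kernel.append(pre)       # image reduced to zero: pre is a root
    return kernel
```

The decoding failure condition described earlier corresponds to `len(root_space(coeffs)) < len(coeffs) - 1`. For example, $x^{[3]} + x$ has $q$-degree 3 but only a one-dimensional root space in $\mathbb{F}_{2^8}$.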
IMPLEMENTATION RESULTS
We implement our decoder architecture in Verilog for an (8, 4) Gabidulin code, which can correct up to two errors. We also implement our decoder architecture for the corresponding KK code, which can correct $\epsilon$ errors, $\mu$ erasures, and $\delta$ deviations as long as $2\epsilon + \mu + \delta < 5$. Our designs are synthesized using Cadence RTL Compiler 7.1 and the MOSIS SCMOS TSMC 0.18 μm standard cell library [13]. The synthesis results are given in Table 2. The total area in Table 2 includes both cell area and estimated net area, and the total power in Table 2 includes both leakage and estimated dynamic power. All estimates are made by the synthesis tool. In our calculation of throughput, we consider all input bits. Each received vector of the (8, 4) Gabidulin code has 64 bits and that of the (8, 4) KK code has 128 bits.
The gate count of our generalized decoder is close to that of the (255, 239) Reed-Solomon decoder in [14], which is 115,500. Although their code lengths are quite different, both codes are the longest in each class of codes. So, for Gabidulin and KK codes over small fields, which have limited error-correcting capabilities, hardware implementations are feasible. The area and power of the decoder architectures in Table 2 are affordable except for applications with very stringent area and power requirements.
For practical network applications, the packet size is large. For example, for a packet size of 512 bytes, we can use a KK code that is based on the Cartesian product of 511 length-8 Gabidulin codes. For higher throughput, more decoders can be used to decode in parallel. We list the gate counts and throughput of serial and factor-7 parallel schemes in Table 3.
Although the area and power shown in Tables 2 and 3 are affordable, they are for short Gabidulin and KK codes over a small field. The (8, 4) Gabidulin and KK codes can correct at most two errors. Although the (8, 4) KK decoder can be used for long packets by Cartesian product, the Cartesian product of (8, 4) Gabidulin codes and its corresponding CDC also can correct at most two errors. When we increase the error correction capabilities of both Gabidulin and KK codes, longer codes are needed and thus larger fields are required. The larger field size implies a higher complexity for finite field arithmetic. It remains to be seen whether the decoder architectures continue to be affordable for longer codes over larger fields, and this will be the subject of our future work.
As in [12], the latency of Gaussian elimination for the left-RRE form is at most $n(n + 1)/2$ cycles. Additionally it takes at most $n$ cycles more to extract t. [...] (8, 4) codes, the longest latency of them is no more than $2(d - 1) + m\mu + m\delta$. The latency of RiBMA is $2t$ for $2t$ iterations.
The latency of a symbolic product $a(x) \otimes b(x)$ is determined by the $q$-degree of $a(x)$. When computing $S_{DU}(x)$, we are concerned about only the terms of $q$-degree less than $d - 1$ because only those are meaningful for the key equation. For computing $S_{PD}(x)$, the result of $\sigma_D(x) \otimes S(x)$ in $S_{DU}(x)$ can be reused, so it needs only one symbolic product.
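The symbolic product of linearized polynomials used throughout the decoder is composition, $a(x) \otimes b(x) = a(b(x))$, whose coefficients for $q = 2$ are $c_k = \sum_{i+j=k} a_i b_j^{[i]}$. The Python sketch below illustrates this formula. It is a software model under assumptions, not the hardware datapath: the field is $\mathbb{F}_{2^8}$ in polynomial basis with reduction polynomial 0x11B, and `gf_mul`, `frob`, `lin_eval`, and `symbolic_product` are hypothetical names.

```python
def gf_mul(a, b, poly=0x11B):
    """Multiply in GF(2^8), polynomial basis (assumed reduction polynomial 0x11B)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r


def frob(x, i):
    """Return x^(2^i); in a normal basis this would be an i-position cyclic shift."""
    for _ in range(i):
        x = gf_mul(x, x)
    return x


def lin_eval(coeffs, x):
    """Evaluate sum_i coeffs[i] * x^(2^i)."""
    acc, power = 0, x
    for c in coeffs:
        acc ^= gf_mul(c, power)
        power = gf_mul(power, power)
    return acc


def symbolic_product(a, b):
    """Coefficients of a(b(x)) for linearized polynomials a and b."""
    c = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):        # one pass per term of a(x)
        for j, bj in enumerate(b):
            c[i + j] ^= gf_mul(ai, frob(bj, i))
    return c
```

The outer loop runs once per term of $a(x)$, which mirrors the statement that the latency depends on the $q$-degree of $a(x)$. The product is not commutative: with $a(x) = x^{[1]}$ and $b(x) = 2x$, $a \otimes b$ has coefficients `[0, 4]` while $b \otimes a$ has `[0, 2]`.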
