Rank metric decoder architectures for random linear network coding with error control. by Chen,  Ning et al.
Durham Research Online
Deposited in DRO:
05 February 2016
Version of attached file:
Accepted Version
Peer-review status of attached file:
Peer-reviewed
Citation for published item:
Chen, Ning and Yan, Zhiyuan and Gadouleau, Maximilien and Wang, Ying and Suter, Bruce W. (2012) 'Rank
metric decoder architectures for random linear network coding with error control.', IEEE transactions on very
large scale integration (VLSI) systems., 20 (2). pp. 296-309.
Further information on publisher's website:
http://dx.doi.org/10.1109/TVLSI.2010.2096239
Publisher's copyright statement:
© 2012 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in
any current or future media, including reprinting/republishing this material for advertising or promotional purposes,
creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of
this work in other works.
Rank Metric Decoder Architectures for Random
Linear Network Coding with Error Control
Ning Chen, Member, IEEE, Zhiyuan Yan, Senior Member, IEEE, Maximilien Gadouleau, Member, IEEE,
Ying Wang, Member, IEEE, and Bruce W. Suter, Senior Member, IEEE
Abstract—While random linear network coding is a powerful
tool for disseminating information in communication networks,
it is highly susceptible to errors caused by various sources. Due
to error propagation, errors greatly deteriorate the throughput
of network coding and seriously undermine both reliability and
security of data. Hence error control for network coding is vital.
Recently, constant-dimension codes (CDCs), especially Kötter–Kschischang
(KK) codes, have been proposed for error control in
random linear network coding. KK codes can also be constructed
from Gabidulin codes, an important class of rank metric codes.
Rank metric decoders have been recently proposed for both
Gabidulin and KK codes, but they have high computational
complexities. Furthermore, it is not clear whether such decoders
are feasible and suitable for hardware implementations. In this
paper, we reduce the complexities of rank metric decoders and
propose novel decoder architectures for both codes. The synthesis
results of our decoder architectures for Gabidulin and KK codes
with limited error-correcting capabilities over small fields show
that our architectures not only are affordable, but also achieve
high throughput.
Index Terms—Constant-dimension codes (CDCs), Decoding, Error
correction coding, Gabidulin codes, Galois fields, Integrated
circuits, Kötter–Kschischang codes, Network coding, Rank metric
codes, Subspace codes.
I. INTRODUCTION
Network coding [1] is a promising candidate for a new
unifying design paradigm for communication networks, due
to its advantages in throughput and robustness to network
failures. Hence, network coding is already used or considered
in gossip-based data dissemination, 802.11 wireless ad hoc net-
working, peer-to-peer networks, and mobile ad hoc networks
(MANETs).
Random linear network coding (RLNC) [2] is arguably the
most important class of network coding. RLNC treats all packets
as vectors over some finite field and forms an outgoing
packet by linearly combining incoming packets using random
coefficients. Due to its random linear operations, RLNC not
only achieves network capacity in a distributed manner, but
also provides robustness to changing network conditions. Unfortunately,
it is highly susceptible to errors caused by various
reasons, such as noise, malicious or malfunctioning nodes,
or insufficient min-cut [3]. Since linearly combining packets
results in error propagation, errors greatly deteriorate the
throughput of network coding and seriously undermine both
reliability and security of data. Thus, error control for random
linear network coding is critical.
This work was supported in part by Thales Communications, Inc., a summer
extension grant from Air Force Research Lab, and NSF under grant ECCS-0925890.
The material in this paper was presented in part at the IEEE
Workshop on Signal Processing Systems, Tampere, Finland, October 2009.
Ning Chen was with the Department of Electrical and Computer Engineering,
Lehigh University, Bethlehem, PA 18015 USA. Now he is with the
Enterprise Storage Division, PMC-Sierra Inc., Allentown, PA 18104 USA
(e-mail: ning chen@pmc-sierra.com).
Zhiyuan Yan is with the Department of Electrical and Computer Engineering,
Lehigh University, Bethlehem, PA 18015 USA (e-mail: yan@lehigh.edu).
Maximilien Gadouleau was with the Department of Electrical and Computer
Engineering, Lehigh University, Bethlehem, PA 18015 USA. Now he is with
the Department of Computer Science, Queen Mary, University of London, E1
4NS UK (e-mail: mgadouleau@eecs.qmul.ac.uk).
Ying Wang is with Qualcomm Flarion Technologies, Bridgewater, NJ 08807
USA (e-mail: aywang11@gmail.com).
Bruce W. Suter is with Air Force Research Laboratory, Rome, New York
13441 USA (e-mail: bruce.suter@rl.af.mil).
Error control schemes proposed for RLNC assume two types
of transmission models. The schemes of the first type (see, for
example, [4]) depend on and take advantage of the underlying
network topology or the particular linear network coding oper-
ations performed at various network nodes. The schemes of the
second type [3], [5] assume that the transmitter and receiver
have no knowledge of such channel transfer characteristics.
The two transmission models are referred to as coherent and
noncoherent network coding, respectively.
It has been recently shown [3] that an error control code for
noncoherent network coding, called a subspace code, is a set
of subspaces (of a vector space), and information is encoded
in the choice of a subspace as a codeword; a set of packets
that generate the chosen subspace is then transmitted [3]. A
subspace code is called a constant-dimension code (CDC) if its
subspaces are of the same dimension. CDCs are of particular
interest since they lead to simplified network protocols due to
the fixed dimension. A class of asymptotically optimal CDCs
has been proposed in [3], and they are referred to as the KK
codes. A decoding algorithm based on interpolation for bivariate
linearized polynomials is also proposed in [3] for the KK
codes. It was shown that KK codes correspond to lifting [5]
of Gabidulin codes [6], [7], a class of optimal rank metric
codes. Gabidulin codes are also called maximum rank distance
(MRD) codes, since they achieve the Singleton bound in the
rank metric [6], as Reed–Solomon (RS) codes achieve the
Singleton bound in the Hamming metric. Due to the connection
between Gabidulin and KK codes, the decoding of KK codes
can be viewed as generalized decoding of Gabidulin codes,
which involves deviations as well as errors and erasures [5].
Gabidulin codes are significant in themselves: For coherent
network coding, the error correction capability of error control
schemes is succinctly described by the rank metric [8]; thus
error control codes for coherent network coding are essentially
rank metric codes.
The benefits of network coding above come at the price of
additional operations needed at the source nodes for encoding,
at the intermediate nodes for linear combining, and at the
destination node(s) for decoding. In practice, the decoding
complexities at destination nodes are much greater than the
encoding and combining complexities. The decoding complex-
ities of RLNC are particularly high when large underlying
fields are assumed and when additional mechanisms such as
error control are accounted for. Clearly, the decoding com-
plexities of RLNC are critical to both software and hardware
implementations. Furthermore, area/power overheads of their
VLSI implementations are important factors in system design.
Unfortunately, prior research efforts have mostly focused on
theoretical aspects of network coding, and complexity reduc-
tion and efficient VLSI implementation of network coding
decoders have not been sufficiently investigated so far. For
example, although the decoding complexities of Gabidulin and
KK codes were analyzed in [9], [10] and [3], [5], respectively,
they do not reflect the impact of the size of the underlying
finite fields. To ensure high probability of success for RLNC,
a field of size $2^8$ or $2^{16}$ is desired [11]. However, these large
field sizes will increase decoding complexities and hence com-
plicate hardware implementations. Finally, to the best of our
knowledge, hardware architectures for these decoders have not
been investigated in the open literature.
In this paper, we fill this significant gap by investigating
complexity reduction and efficient hardware implementation
for decoders in RLNC with error control. This effort is signifi-
cant to the evaluation and design of network coding for several
reasons. First, our results evaluate the complexities of decoders
for RLNC as well as the area, power, and throughput of their
hardware implementations, thereby helping to determine the
feasibility and suitability of network coding for various ap-
plications. Second, our research results provide instrumental
guidelines to the design of network coding from the perspec-
tive of complexity as well as hardware implementation. Third,
our research results lead to efficient decoders and hence reduce
the area and power overheads of network coding.
In this paper, we focus on the generalized Gabidulin decod-
ing algorithm [5] for the KK codes and the decoding algorithm
in [12] for Gabidulin codes for two reasons. First, compared
with the decoding algorithm in [3], the generalized Gabidulin
decoding [5] has a smaller complexity, especially for high-
rate KK codes [5]. Second, components in the errors-only
Gabidulin decoding algorithm in [12] can be easily adapted in
the generalized Gabidulin decoding of KK codes. Thus, among
the decoding algorithms for Gabidulin codes, we focus on the
decoding algorithm in [12].
Although we focus on RLNC with error control in this
paper, our results can be easily applied to RLNC without
error control. For RLNC without error control, the decoding
complexity is primarily due to the inversion of the global coding
matrix via Gauss–Jordan elimination, which is also considered
in this paper.
Our main contributions include several algorithmic reformu-
lations that reduce the computational complexities of decoders
for both Gabidulin and KK codes. Our complexity-saving al-
gorithmic reformulations are:
• We first adopt normal basis representations for all finite
field elements, and then significantly reduce the complex-
ity of bit-parallel normal basis multipliers by using our
common subexpression elimination (CSE) algorithm [13];
• The decoding algorithms of both Gabidulin and KK codes
involve solving key equations. We adapt the inversion-
less Berlekamp–Massey algorithm (BMA) in [14], [15]
to solving key equations for rank metric codes. Our in-
versionless BMA leads to reduced complexities as well
as efficient architectures;
• The decoding algorithm of KK codes requires that the
input be arranged in a reduced row echelon (RRE) form
[16]. We define a more general form called n-RRE
form, and show that it is sufficient if the input is in the
n-RRE form. This change not only reduces the complexity
of reformulating the input, but also enables parallel
processing of decoding KK codes based on Cartesian
products.
Another main contribution of this paper is efficient decoder
architectures for both Gabidulin and KK codes. Aiming to
reduce the area and to improve the regularity of our decoder
architectures, we have also reformulated other steps in the
decoding algorithm. To evaluate the performance of our de-
coder architectures for Gabidulin and KK codes, we implement
our decoder architecture for two rate-1/2 Gabidulin codes and
their corresponding KK codes. Our KK decoders can be used
in network coding with various packet lengths by Cartesian
product [5]. The synthesis results of our decoders show that
our decoder architectures for Gabidulin and KK codes over
small fields with limited error-correcting capabilities not only
are affordable, but also achieve high throughput. Our decoder
architectures and implementation results are novel to the best
of our knowledge.
The decoders considered in this work are bounded distance
decoders, and their decoding capability is characterized in [5,
Theorem 11]. The thrust of our work is to reduce complexities
and to devise efficient architectures for such decoders, while
maintaining their decoding capability. To this end, our reformulations
of the decoding algorithms do not affect the decoding
capability of the bounded distance decoders of Gabidulin and
KK codes. The error performance of the bounded distance de-
coders has been investigated in our previous works [17]–[19].
Hence, despite its significance, a detailed error performance
analysis is out of the scope of this paper, and we do not include
it due to limited space.
The rest of the paper is organized as follows. After briefly re-
viewing the background in Section II, we present our complexity-
saving algorithmic reformulations and efficient decoder archi-
tectures in Sections III and IV, respectively. In Section V,
the proposed architectures are implemented in Verilog and
synthesized for area/performance evaluation. The conclusion
is given in Section VI.
II. PRELIMINARIES
A. Notation
Let $q$ denote a power of a prime and $\mathbb{F}_{q^m}$ denote a finite
field of order $q^m$. We use $\mathbb{F}_q^{n \times m}$ to denote the set of all $n \times m$
matrices over $\mathbb{F}_q$ and use $\mathbf{I}_n$ to denote an $n \times n$ identity matrix.
For a set $U \subseteq \{0, 1, \dots, n-1\}$, $U^c$ denotes the complement
subset $\{0, 1, \dots, n-1\} \setminus U$ and $\mathbf{I}_U$ denotes the columns of $\mathbf{I}_n$
in $U$. In this paper, all vectors and matrices are in bold face.
The rank weight of a vector over $\mathbb{F}_{q^m}$ is defined as the maximal
number of its coordinates that are linearly independent
over the base field $\mathbb{F}_q$. The rank metric between two vectors over
$\mathbb{F}_{q^m}$ is the rank weight of their difference [20]. For a column
vector $\mathbf{X} \in \mathbb{F}_{q^m}^n$, we can expand each of its components into
a row vector over the base field $\mathbb{F}_q$. Such a row expansion
leads to an $n \times m$ matrix over $\mathbb{F}_q$. In this paper, we slightly
abuse the notation so that $\mathbf{X}$ can represent either a vector in $\mathbb{F}_{q^m}^n$
or a matrix in $\mathbb{F}_q^{n \times m}$; the meaning is usually clear
from the context.
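Given a matrix $\mathbf{X}$, its row space, rank, and reduced row echelon
(RRE) form are denoted by $\langle \mathbf{X} \rangle$, $\operatorname{rank} \mathbf{X}$, and $\mathrm{RRE}(\mathbf{X})$,
respectively. For a subspace $\langle \mathbf{X} \rangle$, its dimension is denoted
by $\dim \langle \mathbf{X} \rangle$, and $\operatorname{rank} \mathbf{X} = \dim \langle \mathbf{X} \rangle$. The rank distance of
two vectors $\mathbf{X}$ and $\mathbf{Y}$ in $\mathbb{F}_{q^m}^n$ is defined as
$$d_R(\mathbf{X}, \mathbf{Y}) \triangleq \operatorname{rank}(\mathbf{X} - \mathbf{Y}).$$
The subspace distance [3] of their row spaces $\langle \mathbf{X} \rangle$, $\langle \mathbf{Y} \rangle$ is defined as
$$d_S(\langle \mathbf{X} \rangle, \langle \mathbf{Y} \rangle) \triangleq \dim \langle \mathbf{X} \rangle + \dim \langle \mathbf{Y} \rangle - 2 \dim(\langle \mathbf{X} \rangle \cap \langle \mathbf{Y} \rangle).$$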
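As a small illustration of the two distances just defined, the following Python sketch assumes $q = 2$ and stores each row expansion as an integer bitmask. The function names are hypothetical and chosen only for this example; the subspace distance is computed through the equivalent identity $d_S = 2\dim(U + V) - \dim U - \dim V$.

```python
def gf2_rank(rows):
    """Rank over GF(2) of vectors given as integer bitmasks."""
    basis = []
    for v in rows:
        for b in basis:
            v = min(v, v ^ b)  # XOR with b only if that clears b's leading bit
        if v:
            basis.append(v)
    return len(basis)


def rank_distance(x, y):
    """d_R(x, y) = rank(x - y); in characteristic 2, subtraction is XOR."""
    return gf2_rank([a ^ b for a, b in zip(x, y)])


def subspace_distance(U, V):
    """d_S = dim U + dim V - 2 dim(U ∩ V) = 2 dim(U + V) - dim U - dim V."""
    return 2 * gf2_rank(list(U) + list(V)) - gf2_rank(U) - gf2_rank(V)


# Length-3 vectors over GF(2^4); each entry is its 4-bit expansion over GF(2).
x = [0b0001, 0b0010, 0b0011]
y = [0b0001, 0b0010, 0b0111]
print("rank weight of x:", gf2_rank(x))
print("rank distance   :", rank_distance(x, y))
print("subspace dist.  :", subspace_distance([0b1000, 0b0100], [0b1000, 0b0010]))
```

Here the third coordinate of x is the sum of the first two, so only two coordinates are linearly independent over the base field.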
A linearized polynomial [21], [22] (or q-polynomial) over
$\mathbb{F}_{q^m}$ is a polynomial of the form $f(x) = \sum_{i=0}^{p} f_i x^{q^i}$, where
$f_i \in \mathbb{F}_{q^m}$. For a linearized polynomial $f(x)$, its q-degree is
defined to be the greatest value of $i$ for which $f_i$ is non-zero.
For convenience, let $[i]$ denote $q^i$. The symbolic product of
two linearized polynomials $a(x)$ and $b(x)$, denoted by $\otimes$ (that
is, $a(x) \otimes b(x) = a(b(x))$), is also a linearized polynomial.
The q-reverse of a linearized polynomial $f(x) = \sum_{i=0}^{p} f_i x^{[i]}$
is given by the polynomial $\bar{f}(x) = \sum_{i=0}^{p} \bar{f}_i x^{[i]}$, where
$\bar{f}_i = f_{p-i}^{[i-p]}$ for $i = 0, 1, \dots, p$ and $p$ is the q-degree of $f(x)$. For
a set $\alpha$ of field elements, we use $\operatorname{minpoly}(\alpha)$ to denote its
minimal linearized polynomial, which is the monic linearized
polynomial of least degree such that all the elements of $\alpha$ are
its roots.
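To make the symbolic product concrete, here is a minimal Python sketch over $\mathbb{F}_{2^4}$. It assumes $q = 2$, a polynomial-basis representation with modulus $x^4 + x + 1$, and coefficient lists indexed by q-degree; all function names are illustrative and not taken from the paper. Expanding $a(b(x))$ gives coefficient $c_k = \sum_{i+j=k} a_i b_j^{[i]}$, which is what `sym_prod` computes.

```python
M = 4
MOD = 0b10011  # x^4 + x + 1, irreducible over GF(2) (assumed field for this sketch)


def gf_mul(a, b):
    """Multiply two elements of GF(2^M) in polynomial basis."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= MOD
    return r


def frob(a, i):
    """a^{[i]} = a^(2^i); i is reduced modulo M, so negative i also works."""
    for _ in range(i % M):
        a = gf_mul(a, a)
    return a


def lin_eval(f, x):
    """Evaluate the linearized polynomial f(x) = sum_i f[i] * x^{[i]}."""
    acc = 0
    for i, fi in enumerate(f):
        acc ^= gf_mul(fi, frob(x, i))
    return acc


def sym_prod(a, b):
    """Symbolic product a ⊗ b = a(b(x)): c_k = sum_{i+j=k} a_i * b_j^{[i]}."""
    c = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] ^= gf_mul(ai, frob(bj, i))
    return c


a = [3, 7, 1]   # a(x) = 3 x^{[0]} + 7 x^{[1]} + x^{[2]}
b = [5, 2]      # b(x) = 5 x^{[0]} + 2 x^{[1]}
c = sym_prod(a, b)
print(all(lin_eval(c, x) == lin_eval(a, lin_eval(b, x)) for x in range(1 << M)))
```

Note that the symbolic product is generally not commutative, since the coefficients of the right-hand factor are raised to $[i]$th powers.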
B. Gabidulin Codes and Their Decoding
A Gabidulin code [6] is a linear $(n, k)$ code over $\mathbb{F}_{q^m}$,
whose parity-check matrix has the form
$$\mathbf{H} = \begin{bmatrix} h_0^{[0]} & h_1^{[0]} & \cdots & h_{n-1}^{[0]} \\ h_0^{[1]} & h_1^{[1]} & \cdots & h_{n-1}^{[1]} \\ \vdots & \vdots & \ddots & \vdots \\ h_0^{[n-k-1]} & h_1^{[n-k-1]} & \cdots & h_{n-1}^{[n-k-1]} \end{bmatrix} \tag{1}$$
where $h_0, h_1, \dots, h_{n-1} \in \mathbb{F}_{q^m}$ are linearly independent over
$\mathbb{F}_q$. Let $\mathbf{h}$ denote $(h_0, h_1, \dots, h_{n-1})^T$. Since $\mathbb{F}_{q^m}$ is an
$m$-dimensional vector space over $\mathbb{F}_q$, it is necessary that $n \le m$.
The minimum rank distance of a Gabidulin code is $d = n - k + 1$,
and hence Gabidulin codes are MRD codes.
The decoding process of Gabidulin codes includes five major
steps: syndrome computation, key equation solver, finding
the root space, finding the error locators by Gabidulin's
algorithm [6], and finding error locations. The data flow of
Gabidulin decoding is shown in Figure 1.
[Fig. 1. Data flow of Gabidulin decoding: Received (r) → Syndromes (S) → BMA (σ(x)) → Roots (E) → Gabidulin's (X) → Error → Corrected]
Key equation solvers based on a modified Berlekamp–Massey
algorithm (BMA) [12] or a modified Welch–Berlekamp algorithm
(WBA) [23] have been proposed. In this paper, we focus
on the modified BMA due to its low complexity.
As in RS decoding, we can compute syndromes for Gabidulin
codes as $\mathbf{S} = (S_0, S_1, \dots, S_{d-2}) \triangleq \mathbf{H}\mathbf{r}$ for any received
vector $\mathbf{r}$. Then the syndrome polynomial $S(x) = \sum_{j=0}^{d-2} S_j x^{[j]}$
can be used to solve the key equation [12]
$$\sigma(x) \otimes S(x) \equiv \omega(x) \bmod x^{[d-1]} \tag{2}$$
for the error span polynomial $\sigma(x)$, using the BMA. Up to
$t = \lfloor (d-1)/2 \rfloor$ error values $E_j$ can be obtained by finding
a basis $E_0, E_1, \dots$ for the root space of $\sigma(x)$ using the methods
in [24], [25]. Then we can find the error locators $X_j$
corresponding to the $E_j$ by solving the system of equations
$$S_l = \sum_{j=0}^{\tau-1} X_j^{[l]} E_j, \quad l = 0, 1, \dots, d-2 \tag{3}$$
where $\tau$ is the number of errors. Gabidulin's algorithm [6]
in Algorithm 1 can be used to solve (3). Finally, the error
locations $\mathbf{L}_j$ are obtained from the $X_j$ by solving
$$X_j = \sum_{i=0}^{n-1} L_{j,i} h_i, \quad j = 0, 1, \dots, \tau-1. \tag{4}$$
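Algorithm 1 (Gabidulin's Algorithm [6]).
Input: $S_0, S_1, \dots, S_{d-2}$ and $E_0, E_1, \dots, E_{\tau-1}$
Output: $X_0, X_1, \dots, X_{\tau-1}$
1.1 Compute $\tau \times \tau$ matrices $\mathbf{A}$ and $\mathbf{Q}$ as
$$A_{i,j} = \begin{cases} E_j, & i = 0 \\ 0, & i \neq 0,\ j < i \\ A_{i-1,j} - A_{i-1,i-1} \left( \dfrac{A_{i-1,j}}{A_{i-1,i-1}} \right)^{[-1]}, & i \neq 0,\ j \ge i \end{cases}$$
$$Q_{i,j} = \begin{cases} S_j, & i = 0 \\ Q_{i-1,j} - A_{i-1,i-1} \left( \dfrac{Q_{i-1,j+1}}{A_{i-1,i-1}} \right)^{[-1]}, & \text{otherwise.} \end{cases}$$
1.2 Compute the $X_i$ recursively as $X_{\tau-1} = Q_{\tau-1,0} / A_{\tau-1,\tau-1}$
and $X_i = \left( Q_{i,0} - \sum_{j=i+1}^{\tau-1} A_{i,j} X_j \right) / A_{i,i}$, for $i = \tau-2, \tau-3, \dots, 0$.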
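Algorithm 1 can be sketched in Python as follows. This is only an illustration under stated assumptions: $q = 2$, the field $\mathbb{F}_{2^4}$ in polynomial basis with modulus $x^4 + x + 1$, and hypothetical helper names (`gf_mul`, `frob`, `gf_inv`, `gabidulin_solve`); it is not the hardware-oriented formulation developed later in the paper. The $[-1]$th power is computed as $a^{2^{m-1}}$.

```python
M = 4
MOD = 0b10011  # x^4 + x + 1 (assumed field for this sketch)


def gf_mul(a, b):
    """Multiply two elements of GF(2^M) in polynomial basis."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= MOD
    return r


def frob(a, i):
    """a^{[i]} = a^(2^i); i = -1 gives the [-1]th power a^(2^(M-1))."""
    for _ in range(i % M):
        a = gf_mul(a, a)
    return a


def gf_inv(b):
    """b^(-1) = b^(2^M - 2) = b^2 * b^4 * ... * b^(2^(M-1))."""
    acc, sq = 1, b
    for _ in range(M - 1):
        sq = gf_mul(sq, sq)
        acc = gf_mul(acc, sq)
    return acc


def gabidulin_solve(S, E):
    """Solve S_l = sum_j X_j^{[l]} E_j for X, given linearly independent E."""
    tau = len(E)
    A = [[0] * tau for _ in range(tau)]
    Q = [[0] * tau for _ in range(tau)]
    A[0] = list(E)
    Q[0] = list(S[:tau])
    for i in range(1, tau):                      # Step 1.1
        p = A[i - 1][i - 1]
        p_inv = gf_inv(p)
        for j in range(i, tau):
            A[i][j] = A[i - 1][j] ^ gf_mul(p, frob(gf_mul(A[i - 1][j], p_inv), -1))
        for j in range(tau - i):
            Q[i][j] = Q[i - 1][j] ^ gf_mul(p, frob(gf_mul(Q[i - 1][j + 1], p_inv), -1))
    X = [0] * tau
    for i in range(tau - 1, -1, -1):             # Step 1.2: back substitution
        acc = Q[i][0]
        for j in range(i + 1, tau):
            acc ^= gf_mul(A[i][j], X[j])
        X[i] = gf_mul(acc, gf_inv(A[i][i]))
    return X


# Build syndromes from known locators, then recover the locators.
E = [1, 2, 4]
X_true = [3, 9, 14]
S = [0] * 3
for l in range(3):
    for x, e in zip(X_true, E):
        S[l] ^= gf_mul(frob(x, l), e)
print(gabidulin_solve(S, E) == X_true)
```

Only the first $\tau$ syndromes are consumed by this sketch, because row $i$ of $\mathbf{Q}$ needs one fewer column than row $i-1$.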
In total, the decoding complexity of Gabidulin codes is
roughly $O(n^2(1-R))$ operations over $\mathbb{F}_{q^m}$ [9], where $R$ is the
code rate, or $O(dm^3)$ operations over $\mathbb{F}_q$ [10]. Note that all
polynomials involved in the decoding process are linearized
polynomials.
Gabidulin codes are often viewed as the counterpart in rank
metric codes of the well-known RS codes. As shown in Table I,
an analogy between RS and Gabidulin codes can be estab-
lished in many aspects. Such an analogy helps us understand
the decoding of Gabidulin codes, and in some cases allows
us to adapt innovations proposed for RS codes to Gabidulin
codes.
TABLE I
ANALOGY BETWEEN REED–SOLOMON AND GABIDULIN CODES
| | Reed–Solomon | Gabidulin |
|---|---|---|
| Metric | Hamming | Rank |
| Ring of | Polynomials | Linearized Polynomials |
| Degree | $i$ | $[i] = q^i$ |
| Key Operation | Polynomial Multiplication | Symbolic Product |
| Generation Matrix | $[g_j^{i}]$ | $[g_j^{[i]}]$ |
| Parity Check Matrix | $[h_j^{i}]$ | $[h_j^{[i]}]$ |
| Key Equation Solver | BMA | Modified BMA |
| Error Locations | Roots | Root Space Basis |
| Error Value Solver | Forney's Formula | Gabidulin's Algorithm |
C. KK Codes and Their Decoding
By the lifting operation [5], KK codes can be constructed
from Gabidulin codes. Lifting can also be seen as a gener-
alization of the standard approach to random linear network
coding [2], which transmits matrices in the form $\mathbf{X} = [\mathbf{I} \mid \mathbf{x}]$,
where $\mathbf{X} \in \mathbb{F}_q^{n \times M}$, $\mathbf{x} \in \mathbb{F}_q^{n \times m}$, and $m = M - n$.
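The lifting map can be illustrated with a short Python sketch, assuming $q = 2$ and rows stored as integer bitmasks; `lift` and the other names are hypothetical. The example also checks numerically, on one pair of matrices, the known relation that the subspace distance between two lifted matrices is twice the rank distance between the original matrices.

```python
def gf2_rank(rows):
    """Rank over GF(2) of vectors given as integer bitmasks."""
    basis = []
    for v in rows:
        for b in basis:
            v = min(v, v ^ b)  # XOR with b only if that clears b's leading bit
        if v:
            basis.append(v)
    return len(basis)


def lift(x, n, m):
    """Rows of [I_n | x] as (n + m)-bit integers; x[i] is the m-bit row i of x."""
    return [(1 << (m + n - 1 - i)) | x[i] for i in range(n)]


def subspace_distance(U, V):
    """d_S = 2 dim(U + V) - dim U - dim V."""
    return 2 * gf2_rank(list(U) + list(V)) - gf2_rank(U) - gf2_rank(V)


n, m = 2, 4
x1 = [0b0001, 0b0010]
x2 = [0b0001, 0b0110]
d_rank = gf2_rank([a ^ b for a, b in zip(x1, x2)])
d_sub = subspace_distance(lift(x1, n, m), lift(x2, n, m))
print(d_rank, d_sub)
```

Because of the identity block, every lifted matrix has rank $n$, which is why the resulting subspace code has constant dimension.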
In practice, the packet length could be very long. To accom-
modate long packets based on the KK codes, very large m and
n are needed, which results in prohibitively high complexity
due to the huge field size of $\mathbb{F}_{q^m}$. A low-complexity approach
in [5] suggested that instead of using a single long Gabidulin
code, a Cartesian product of many short Gabidulin codes with
the same distance can be used to construct constant-dimension
codes for long packets via the lifting operation.
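Let the received matrix be $\mathbf{Y} = [\hat{\mathbf{A}} \mid \mathbf{y}]$, where $\hat{\mathbf{A}} \in \mathbb{F}_q^{N \times n}$
and $\mathbf{y} \in \mathbb{F}_q^{N \times m}$. Note that we always assume the received
matrix is full-rank [5]. The row and column rank deficiencies
of $\hat{\mathbf{A}}$ are $\delta = N - \operatorname{rank} \hat{\mathbf{A}}$ and $\mu = n - \operatorname{rank} \hat{\mathbf{A}}$, respectively. In
the decoding algorithm of [5], the matrix $\mathbf{Y}$ is first turned into
an RRE form, and then the RRE form of $\mathbf{Y}$ is expanded into
$$\bar{\mathbf{Y}} = \begin{bmatrix} \mathbf{I}_{U^c} & \mathbf{0} \\ \mathbf{0} & \mathbf{I}_\delta \end{bmatrix} \mathrm{RRE}(\mathbf{Y}) = \begin{bmatrix} \mathbf{I}_n + \hat{\mathbf{L}} \mathbf{I}_U^T & \mathbf{r} \\ \mathbf{0} & \hat{\mathbf{E}} \end{bmatrix},$$
where $U^c$ denotes the column positions of leading entries in the first $n$ rows of
$\mathrm{RRE}(\mathbf{Y})$. The tuple $(\mathbf{r}, \hat{\mathbf{L}}, \hat{\mathbf{E}})$ is called a reduction of $\mathbf{Y}$ [5].
It was proved [5] that
$$d_S(\langle \mathbf{X} \rangle, \langle \mathbf{Y} \rangle) = 2 \operatorname{rank} \begin{bmatrix} \hat{\mathbf{L}} & \mathbf{r} - \mathbf{x} \\ \mathbf{0} & \hat{\mathbf{E}} \end{bmatrix} - \mu - \delta,$$
where $\mu = n - \operatorname{rank} \hat{\mathbf{L}}$ and $\delta = N - \operatorname{rank} \hat{\mathbf{L}}$. Thus the decoding
problem of minimizing the subspace distance becomes a problem
of minimizing the rank distance.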
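For a KK code $\mathcal{C}$, the generalized rank decoding [5] finds
an error word
$$\hat{\mathbf{e}} = \operatorname*{argmin}_{\mathbf{e} \in \mathbf{r} - \mathcal{C}} \operatorname{rank} \begin{bmatrix} \hat{\mathbf{L}} & \mathbf{e} \\ \mathbf{0} & \hat{\mathbf{E}} \end{bmatrix}.$$
The error word $\hat{\mathbf{e}}$ is expanded as a summation of products of column and row
vectors [5] such that $\hat{\mathbf{e}} = \sum_{j=0}^{\tau-1} \mathbf{L}_j \mathbf{E}_j$. Each term $\mathbf{L}_j \mathbf{E}_j$ is
called an erasure if $\mathbf{L}_j$ is known, a deviation if $\mathbf{E}_j$
is known, or an error if neither $\mathbf{L}_j$ nor $\mathbf{E}_j$ is known. In this
general decoding problem, $\mathbf{L}$ has $\mu$ columns from $\hat{\mathbf{L}}$ and $\mathbf{E}$ has
$\delta$ rows from $\hat{\mathbf{E}}$. Given a Gabidulin code of minimum distance
$d$, the corresponding KK code is able to correct $\epsilon$ errors, $\mu$
erasures, and $\delta$ deviations as long as $2\epsilon + \mu + \delta < d$.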
Algorithm 2 was proposed [5] for generalized decoding of
the KK codes, and its data flow is shown in Figure 2. It
requires $O(dm)$ operations in $\mathbb{F}_{q^m}$ [5].
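Algorithm 2 (General Rank Decoding [5]).
Input: received tuple $(\mathbf{r}, \hat{\mathbf{L}}, \hat{\mathbf{E}})$
Output: error word $\hat{\mathbf{e}}$
2.1 Compute $\mathbf{S} = \mathbf{H}\mathbf{r}$, $\hat{\mathbf{X}} = \hat{\mathbf{L}}^T \mathbf{h}$, $\lambda_U(x) = \operatorname{minpoly}(\hat{\mathbf{X}})$,
$\sigma_D(x) = \operatorname{minpoly}(\hat{\mathbf{E}})$, and $S_{DU}(x) = \sigma_D(x) \otimes S(x) \otimes \zeta_U(x)$,
where $\zeta_U(x)$ is the q-reverse of $\lambda_U(x)$.
2.2 Compute the error span polynomial:
a) Use the modified BMA [12] to solve the key equation
$\sigma_F(x) \otimes S_{DU}(x) \equiv \omega(x) \bmod x^{[d-1]}$ such
that $\deg \omega(x) < [\tau]$, where $\tau = \epsilon + \mu + \delta$.
b) Compute $S_{FD}(x) = \sigma_F(x) \otimes \sigma_D(x) \otimes S(x)$.
c) Use Gabidulin's algorithm [6] to find $\beta$ that solves
$S_{FD,l} = \sum_{j=0}^{\mu-1} X_j^{[l]} \beta_j$, $l = d-2, d-3, \dots, d-1-\mu$.
d) Compute $\sigma_U(x) = \operatorname{minpoly}(\beta)$ followed by
$\sigma(x) = \sigma_U(x) \otimes \sigma_F(x) \otimes \sigma_D(x)$.
2.3 Find a basis $E$ for the root space of $\sigma(x)$.
2.4 Find the error locations:
a) Solve $S_l = \sum_{j=0}^{\tau-1} X_j^{[l]} E_j$, $l = 0, 1, \dots, d-2$, using
Gabidulin's algorithm [6] to find the error locators
$X_0, X_1, \dots, X_{\tau-1} \in \mathbb{F}_{q^m}$.
b) Compute the error locations $\mathbf{L}_j$ by solving (4).
c) Compute the error word $\hat{\mathbf{e}} = \sum_{j=0}^{\tau-1} \mathbf{L}_j \mathbf{E}_j$, where
each $\mathbf{E}_j$ is the row expansion of $E_j$.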
III. COMPUTATIONAL COMPLEXITY REDUCTION
In general, RLNC is carried out over Fq , where q is any
prime power. That is, packets are treated as vectors over Fq .
Since our investigation of computational complexities is for
both software and hardware implementations of RLNC, where
data are stored and transmitted in bits, we focus on RLNC over
characteristic-2 fields in our work, i.e., q is a power of two.
In some cases, we further assume q = 2, as it leads to further
complexity reductions.
A. Finite Field Representation
Finite field elements can be represented by vectors using
different types of bases: polynomial basis, normal basis, and
dual basis [26]. In rank metric decoders, most polynomials
involved are linearized polynomials, and hence their evaluations
and symbolic products require computing $[i]$th
powers. If a field element is represented by a vector
over $\mathbb{F}_q$ with respect to a normal basis, computing its $[i]$th power
(where $i$ is a positive or negative integer) is simply a
cyclic shift of the corresponding vector by $i$ positions, which
significantly reduces computational complexities. For example,
the computational complexity of Algorithm 1 is primarily due
to the following updates in Step 1.1:
$$\begin{aligned} A_{i,j} &= A_{i-1,j} - \left( \frac{A_{i-1,j}}{A_{i-1,i-1}} \right)^{[-1]} A_{i-1,i-1} \\ Q_{i,j} &= Q_{i-1,j} - \left( \frac{Q_{i-1,j+1}}{A_{i-1,i-1}} \right)^{[-1]} A_{i-1,i-1} \end{aligned} \tag{5}$$
which require divisions and computing $[-1]$th powers. With
normal basis representation, $[-1]$th powers are obtained by a
single cyclic shift. When $q = 2$, they can be computed in
an inversionless form $A_{i,j} = A_{i-1,j} - \left( A_{i-1,j} A_{i-1,i-1} \right)^{[-1]}$,
$Q_{i,j} = Q_{i-1,j} - \left( Q_{i-1,j+1} A_{i-1,i-1} \right)^{[-1]}$, which also avoids
finite field divisions or inversions. Thus using normal basis
representation also reduces the complexity of Gabidulin's algorithm.
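The cyclic-shift property can be checked with a brief Python sketch. It assumes $q = 2$ and the field $\mathbb{F}_{2^4}$ built in polynomial basis with modulus $x^4 + x + 1$; a normal element is found by exhaustive search and coordinates are obtained from a lookup table, both of which are conveniences of this example rather than anything prescribed by the paper. All names are hypothetical.

```python
from itertools import product

M = 4
MOD = 0b10011  # x^4 + x + 1 (assumed field for this sketch)


def gf_mul(a, b):
    """Multiply two elements of GF(2^M) in polynomial basis."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= MOD
    return r


def frob(a, i):
    """a^{[i]} = a^(2^i), with i reduced modulo M."""
    for _ in range(i % M):
        a = gf_mul(a, a)
    return a


def gf2_rank(rows):
    """Rank over GF(2) of vectors given as integer bitmasks."""
    basis = []
    for v in rows:
        for b in basis:
            v = min(v, v ^ b)
        if v:
            basis.append(v)
    return len(basis)


def find_normal_basis():
    """Return (beta, beta^2, ..., beta^(2^(M-1))) for the first normal beta."""
    for beta in range(1, 1 << M):
        basis = [frob(beta, k) for k in range(M)]
        if gf2_rank(basis) == M:
            return basis
    raise ValueError("no normal element found")


BASIS = find_normal_basis()


def from_coords(c):
    """Field element with normal-basis coordinates c."""
    a = 0
    for ck, hk in zip(c, BASIS):
        if ck:
            a ^= hk
    return a


# Lookup table from field element to its normal-basis coordinate vector.
TO_COORDS = {from_coords(c): c for c in product((0, 1), repeat=M)}


def rotate(c, i):
    """Cyclic shift: the entry at position k moves to position (k + i) mod M."""
    return tuple(c[(k - i) % M] for k in range(M))


a = 0b1011
print(TO_COORDS[a], TO_COORDS[frob(a, 1)], TO_COORDS[frob(a, -1)])
```

In hardware the shift is pure wiring, which is why the $[i]$th powers are treated as free in the complexity counts.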
[Fig. 2. Data flow of KK decoding. Blocks: Received, RRE, ×h, MinPoly, Syndromes, SymProd, BMA, Gabidulin's, Roots, Error, Corrected. Signals: L̂, Ê, r̂, X̂, S, λU(x), σD(x), SDU(x), σF(x), SFD(x), β, σU(x), σ(x), E, X.]
In addition to lower complexities of finite field arithmetic
operations, normal basis representation leads to reduced
complexities in the decoding of Gabidulin and KK codes for several
reasons. First, it was shown that using normal basis can
facilitate the computation of symbolic product [9]. Second,
it was also suggested [9] that solving (4) can be trivial using
normal basis. If $(h_0, h_1, \dots, h_{m-1})$ is a normal basis,
the matrix $\mathbf{h}$, whose rows are vector representations of the $h_i$
with respect to the basis formed by the $h_i$, becomes an identity matrix
with additional all-zero columns. Hence solving (4) requires
no computation. These two complexity reductions were also
observed in [10]. Third, if a normal basis of $\mathbb{F}_{2^m}$ is used as
the $h_i$ and $n = m$, the parity check matrix $\mathbf{H}$ in (1) becomes a
cyclic matrix. Thus syndrome computation becomes part of a
cyclic convolution of $(h_0, h_1, \dots, h_{m-1})$ and $\mathbf{r}$, for which fast
algorithms are available (see, for example, [27]). Using fast
cyclic convolution algorithms is favorable when $m$ is large.
B. Normal Basis Arithmetic Operations
We also propose finite field arithmetic operations with reduced
complexities, when normal basis representation is used.
When represented by vectors, the addition and subtraction of
two elements are simply component-wise addition, which is
straightforward to implement. For characteristic-2 fields $\mathbb{F}_{2^m}$,
inverses can be obtained efficiently by a sequence of squaring
and multiplying, since $\beta^{-1} = \beta^{2^m - 2} = \beta^2 \beta^4 \cdots \beta^{2^{m-1}}$ for
$\beta \in \mathbb{F}_{2^m}$ [26]. Since the $[i]$th powers require no computation,
the complexity of inversion in turn depends on that of multiplication.
Division can be implemented by a concatenation of
inversion and multiplication: $\alpha / \beta = \alpha \cdot \beta^{-1}$, and hence the
complexity of division also depends on that of multiplication
in the end.
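The inversion identity above can be exercised with a few lines of Python. This sketch assumes the field $\mathbb{F}_{2^8}$ in polynomial basis with modulus $x^8 + x^4 + x^3 + x + 1$, chosen only because it is a familiar irreducible polynomial; the identity itself does not depend on the basis, and in a normal basis the squarings below would reduce to cyclic shifts. Function names are hypothetical.

```python
M = 8
MOD = 0x11B  # x^8 + x^4 + x^3 + x + 1, irreducible over GF(2) (assumed here)


def gf_mul(a, b):
    """Multiply two elements of GF(2^M) in polynomial basis."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= MOD
    return r


def gf_inv(b):
    """b^(-1) = b^(2^M - 2) = b^2 * b^4 * ... * b^(2^(M-1))."""
    acc, sq = 1, b
    for _ in range(M - 1):
        sq = gf_mul(sq, sq)    # next power b^(2^i); a cyclic shift in a normal basis
        acc = gf_mul(acc, sq)  # accumulate the product of the squared terms
    return acc


def gf_div(a, b):
    """Division as inversion followed by multiplication."""
    return gf_mul(a, gf_inv(b))


print(all(gf_mul(b, gf_inv(b)) == 1 for b in range(1, 1 << M)))
```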
There are serial and parallel architectures for normal basis
finite field multipliers. To achieve high throughput in our decoder,
we consider only parallel architectures. Most normal
basis multipliers are based on the Massey–Omura (MO) architecture
[26], [28]. The complexity of a serial MO normal
basis multiplier over $\mathbb{F}_{2^m}$, $C_N$, is defined as the number of
terms $a_i b_j$ in computing a bit of the product $c = ab$, where
$a = \sum_{i=0}^{m-1} a_i h_i \in \mathbb{F}_{2^m}$, $b = \sum_{j=0}^{m-1} b_j h_j \in \mathbb{F}_{2^m}$, and
$(h_0, h_1, \dots, h_{m-1})$ is a normal basis. It has been shown [29]
that a parallel MO multiplier over $\mathbb{F}_{2^m}$ needs $m^2$ AND gates
and at most $m(C_N + m - 2)/2$ XOR gates. For instance, for
the fields $\mathbb{F}_{2^8}$ and $\mathbb{F}_{2^{16}}$, their $C_N$'s are minimized to 21 and 85,
respectively [26]. Using a common subexpression elimination
algorithm [13], we significantly reduce the number of XOR
gates while maintaining the same critical path delays (CPDs)
of one AND plus five XOR gates and one AND plus seven
XOR gates as direct implementations, respectively. Our results
are compared to those in [26], [29] in Table II, where we also
provide the prime polynomial $P(x)$ for each field.
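TABLE II
COMPLEXITIES OF BIT-PARALLEL NORMAL BASIS MULTIPLIERS OVER
FINITE FIELDS (FOR THESE TWO FIELDS, ALL THREE IMPLEMENTATIONS
HAVE THE SAME CPD.)
| Field | $P(x)$ | AND | XOR, direct [26] | XOR, [29] | XOR, ours |
|---|---|---|---|---|---|
| $\mathbb{F}_{2^m}$ | - | $m^2$ | $m(C_N - 1)$ | $m(C_N + m - 2)/2$ | - |
| $\mathbb{F}_{2^8}$ | $(\sum_{i=0}^{8} x^i) - x^6 - x^4 - x^2$ | 64 | 160 | 108 | 88 |
| $\mathbb{F}_{2^{16}}$ | $(\sum_{i=0}^{16} x^i) - x^{14} - x^9 - x^6 - x^4$ | 256 | 1344 | 792 | 491 |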
The reduced gate count for normal basis multiplication is
particularly important for hardware implementations of RLNC.
This improvement is transparent to the complexity of decoders,
in terms of finite field operations. When decoders for RLNC
are realized in hardware, the reduced gate count for normal
basis multiplication will be reflected in reduced area and power
consumption.
C. Inversionless BMA
The modified BMA for rank metric codes [12] is similar to
the BMA for RS codes except that polynomial multiplications
are replaced by symbolic products. The modified BMA [12]
requires finite field divisions, which are more complex than
other arithmetic operations. Following the idea of inversion-
less RS decoder [14], we propose an inversionless variant in
Algorithm 3.
Algorithm 3. iBMA
Input: Syndromes $\mathbf{S}$
Output: $\Lambda(x)$
3.1 Initialize: $\Lambda^{(0)}(x) = B^{(0)}(x) = x^{[0]}$, $\Gamma^{(0)} = 1$, and $L = 0$.
3.2 For $r = 0, 1, \dots, 2t-1$,
a) Compute the discrepancy $\Delta_r = \sum_{j=0}^{L} \Lambda_j^{(r)} S_{r-j}^{[j]}$.
b) If $\Delta_r = 0$, then go to (e).
c) Modify the connection polynomial:
$\Lambda^{(r+1)}(x) = (\Gamma^{(r)})^{[1]} \Lambda^{(r)}(x) - \Delta_r x^{[1]} \otimes B^{(r)}(x)$.
d) If $2L > r$, go to (e). Otherwise, $L = r + 1 - L$,
$\Gamma^{(r+1)} = \Delta_r$, and $B^{(r)}(x) = \Lambda^{(r)}(x)$. Go to (a).
e) Set $\Gamma^{(r+1)} = (\Gamma^{(r)})^{[1]}$ and $B^{(r+1)}(x) = x^{[1]} \otimes B^{(r)}(x)$.
3.3 Set $\Lambda(x) = \Lambda^{(2t)}(x)$.
Using a similar approach as in [14], we prove that the
output $\Lambda(x)$ of Algorithm 3 is the same as $\sigma(x)$ produced
by the modified BMA, except it is scaled by a constant
$C = \prod_{i=0}^{t-1} (\Gamma^{(2i)})^{[1]}$. However, this scaling is inconsequential since
the two polynomials have the same root space.
Using normal basis, the modified BMA in [12] requires at
most $\lfloor (d-2)/2 \rfloor$ inversions, $(d-1)(d-2)$ multiplications, and
$(d-1)(d-2)$ additions over $\mathbb{F}_{q^m}$ [9]. Our inversionless version,
Algorithm 3, requires at most $(3/2)d(d-1)$ multiplications
and $(d-1)(d-2)$ additions. Since a normal basis inversion is
obtained by $m-1$ normal basis multiplications, the complexity
of normal basis inversion is roughly $m-1$ times that of
normal basis multiplication. Hence, Algorithm 3 reduces the
complexity considerably. Algorithm 3 is also more suitable for
hardware implementation, as shown in Section IV.
D. Finding the Root Space
Instead of finding roots of polynomials as in RS decoding,
we need to find the root spaces of linearized polynomials
in rank metric decoding. Applying the Chien search [30] used in RS
decoding would be unsuitable for two reasons. First,
it requires polynomial evaluations over the whole field, whose
complexity is very high. Second, it cannot find a set of linearly
independent roots.
A probabilistic algorithm to find the root space was pro-
posed in [25]. For Gabidulin codes, it can be further simplified
as suggested in [9]. But hardware implementations of prob-
abilistic algorithms require random number generators. Fur-
thermore, the algorithm in [25] requires symbolic long divi-
sion, which is also not suitable for hardware implementations.
According to [5], the average complexity of the probabilistic
algorithm in [25] is $O(dm)$ operations over $\mathbb{F}_{q^m}$, while that of
Berlekamp's deterministic method [24] is $O(dm)$ operations
in $\mathbb{F}_{q^m}$ plus $O(m^3)$ operations in $\mathbb{F}_q$. Since their complexity
difference is small, we focus on the deterministic method,
which is much easier to implement.
Suppose we need to find the root space of a linearized
polynomial r(x). Berlekamp's deterministic method first evaluates
r(x) on a basis (α0, α1, . . . , αm−1) of the field, giving
vi = r(αi) for i = 0, 1, . . . , m−1. It then expands the vi's
over the base field as the columns of an m×m matrix V and finds
linearly independent vectors z such that V z = 0. Interpreted as
coordinates with respect to (α0, α1, . . . , αm−1), each such z is
a root of the given polynomial. Finding z amounts to finding
the linear dependencies among the columns of V , which
can be done by Gaussian elimination.
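The deterministic method can be sketched in Python for q = 2. The following is a minimal sketch under assumptions that are not in the text: the field is GF(2^4) with modulus x^4 + x + 1, elements are 4-bit integers in the polynomial basis, and the names gf_mul, eval_linearized, and root_space_basis are illustrative. Rather than forming V explicitly, the sketch carries each evaluation r(αi) together with the combination of basis elements that produced it, so a value eliminated to zero directly yields a root.

```python
M_DEG = 4            # assumed extension degree m
MODULUS = 0b10011    # assumed field polynomial x^4 + x + 1


def gf_mul(a, b):
    """Multiply two elements of GF(2^4) given as 4-bit integers."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M_DEG):
            a ^= MODULUS
    return result


def eval_linearized(coeffs, x):
    """Evaluate sum_k coeffs[k] * x^(2^k) in GF(2^4)."""
    acc, power = 0, x
    for c in coeffs:
        acc ^= gf_mul(c, power)
        power = gf_mul(power, power)   # Frobenius map x -> x^2
    return acc


def root_space_basis(coeffs):
    """Return linearly independent roots spanning the root space of r(x)."""
    pivots = {}   # leading bit position -> (value, combination)
    roots = []
    for i in range(M_DEG):
        combo = 1 << i                        # basis element alpha_i
        value = eval_linearized(coeffs, combo)
        # Gaussian elimination over GF(2); invariant: value == r(combo).
        while value and (value.bit_length() - 1) in pivots:
            pv, pc = pivots[value.bit_length() - 1]
            value ^= pv
            combo ^= pc
        if value:
            pivots[value.bit_length() - 1] = (value, combo)
        else:
            roots.append(combo)               # r(combo) == 0
    return roots
```

For example, root_space_basis([1, 0, 1]) treats r(x) = x^[2] + x^[0] = x^4 + x, whose root space is the subfield GF(4) and therefore has dimension two.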
E. n-RRE Form
Given a received subspace spanned by a set of received pack-
ets, the input of Algorithm 2 is a three-tuple, called a reduction
of the received space represented by its generator matrix Y ;
the three-tuple is obtained based on Y when it is in its RRE
form [5]. Thus, before the decoding starts, preprocessing is
performed on the received packets so as to obtain the RRE
form of Y . We show that Y needs to satisfy only a relaxed
constraint, which does not affect the decoding outcome, while
leading to two advantages. First, the relaxed constraint results
in reduced complexities in the preprocessing step. Second and
more importantly, the relaxed constraint enables parallel pro-
cessing of decoding KK codes based on Cartesian products.
We first define an n-RRE form for received matrices. Given
a matrix Y = [Aˆ | y], where Aˆ ∈ F_q^{N×n} and y ∈ F_q^{N×m}, the
matrix Y is in its n-RRE form as long as Aˆ (its leftmost n
columns) is in its RRE form. Compared with the RRE form,
the n-RRE form is more relaxed as it puts no constraints on
the right part. We note that an n-RRE form of a matrix is not
unique.
We now show that the relaxed constraint does not affect the
decoding. Similar to [31, Proposition 7], we first show that a
reduction based on an n-RRE form of Y always exists. Given
Y = [Aˆ | y] and RRE(Aˆ) = RAˆ, where R represents the
reducing row operations, the product Y¯′ = RY = [B′ | Z′] is
in its n-RRE form. We note that B′ ∈ F_q^{N×n} and Z′ ∈ F_q^{N×m},
where the column and row rank deficiencies of B′ are given by
µ′ = n − rank B′ and δ′ = N − rank B′, respectively. We
have the following result about the reduction based on Y¯′.
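Lemma 1. Let Y¯′, µ′, and δ′ be defined as above. There
exists a tuple $(r', \hat{L}', \hat{E}') \in F_q^{n\times m} \times F_q^{n\times\mu'} \times F_q^{\delta'\times m}$ and a
set U′ satisfying $|U'| = \mu'$, $I_{U'}^T r' = 0$, $I_{U'}^T \hat{L}' = -I_{\mu'\times\mu'}$, and
$\operatorname{rank} \hat{E}' = \delta'$ so that
\[
\left\langle \begin{bmatrix} I_n + \hat{L}' I_{U'}^T & r' \\ 0 & \hat{E}' \end{bmatrix} \right\rangle = \langle \bar{Y}' \rangle = \langle Y \rangle.
\]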
See Appendix A for the proof of Lemma 1. Lemma 1 shows
that we can find an alternative reduction based on n-RRE form
of Y , instead of an RRE form of Y . The key of our alternative
reduction of Y is that the reduction is mostly determined
by the first n columns of RRE(Y ). Also, this alternative
reduction does not come as a surprise. As shown in [31, Propo-
sition 8], row operations on Eˆ can produce alternative reduc-
tions. Next, we show that decoding based on our alternative
reduction is the same as in [31]. Similar to [31, Theorem 9],
we have the following results.
Lemma 2. Let (r′, Lˆ′, Eˆ′) be a reduction of Y determined by
its n-RRE form. Then
\[
d_S(\langle X \rangle, \langle Y \rangle) = 2 \operatorname{rank} \begin{bmatrix} \hat{L}' & r' - x \\ 0 & \hat{E}' \end{bmatrix} - \mu' - \delta'.
\]
See Appendix B for the proof. Lemma 2 shows that the sub-
space decoding problem is equivalent to the generalized Gabi-
dulin decoding problem with the alternative reduction (r′, Lˆ′, Eˆ′),
which is obtained from an n-RRE form of Y .
Our alternative reduction leads to two advantages. First,
it results in reduced complexity in preprocessing. Given a
matrix Y , the preprocessing needed to transform Y into its
n-RRE form is only part of the preprocessing to transform Y
into its RRE form. We can show that the maximal number
of arithmetic operations in the former preprocessing is given
by $(N-1)\sum_{i=0}^{\operatorname{rank}\hat{A}-1}(n+m-i)$, whereas that of the
latter preprocessing is $(N-1)\sum_{i=0}^{\operatorname{rank}Y-1}(n+m-i)$. Since
rank Y ≥ rank Aˆ, the relaxed constraint leads to a lower
complexity, and the size of the saving depends on rank Y and rank Aˆ.
Second, the reduction for n-RRE forms is completely deter-
mined by the n leftmost columns of Y instead of the whole
matrix, which greatly simplifies hardware implementations. This
advantage is particularly important for the decoding of constant-
dimension codes that are lifted from Cartesian products of
Gabidulin codes. Since the row operations to obtain an n-RRE
form depend on Aˆ only, decoding [Aˆ | y0 | y1 | · · · | yl−1]
can be divided into parallel and smaller decoding problems
whose inputs are [Aˆ | y0], [Aˆ | y1], . . . , [Aˆ | yl−1]. Thus,
for these constant-dimension codes, we can decode in a serial
manner with only one small decoder, or in a partly parallel
fashion with more decoders, or even in a fully parallel fashion.
This flexibility allows tradeoffs between cost/area/power and
throughput. Furthermore, since the erasures Lˆ are determined
by Aˆ and are the same for all [Aˆ | yi], the computation of Xˆ
and λU (x) in Algorithm 2 can be shared among these parallel
decoding problems, thereby reducing overall complexity.
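The parallel decomposition can be illustrated with a small Python sketch over F2. It is a sketch under stated assumptions rather than the decoder's preprocessing unit: matrices are lists of 0/1 rows, and the function name n_rre is hypothetical. Because pivots are chosen from the leftmost n columns only, reducing [Aˆ | y0 | y1] jointly gives the same result as reducing [Aˆ | y0] and [Aˆ | y1] separately.

```python
def n_rre(rows, n):
    """Row-reduce over GF(2) using pivots from the leftmost n columns only.

    The row operations are applied to entire rows, so the right part is
    transformed but places no constraint on the result.
    """
    rows = [list(r) for r in rows]
    pivot_row = 0
    for col in range(n):
        # Look for a row at or below pivot_row with a one in this column.
        src = next((r for r in range(pivot_row, len(rows)) if rows[r][col]), None)
        if src is None:
            continue                      # no pivot: column rank deficiency
        rows[pivot_row], rows[src] = rows[src], rows[pivot_row]
        for r in range(len(rows)):
            if r != pivot_row and rows[r][col]:
                rows[r] = [a ^ b for a, b in zip(rows[r], rows[pivot_row])]
        pivot_row += 1
    return rows


# Illustrative data (not from the paper): N = 4, n = 3, two right blocks.
A = [[0, 1, 1], [1, 1, 0], [1, 0, 1], [0, 1, 1]]
y0 = [[1, 0], [0, 1], [1, 1], [0, 0]]
y1 = [[0, 0], [1, 1], [1, 0], [0, 1]]

joint = n_rre([a + u + v for a, u, v in zip(A, y0, y1)], 3)
only0 = n_rre([a + u for a, u in zip(A, y0)], 3)
only1 = n_rre([a + v for a, v in zip(A, y1)], 3)
```

Here the columns of joint that belong to y0 coincide with those of only0, and likewise for y1, which is the property that allows one decoder per right block.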
F. Finding Minimal Linearized Polynomials
Minimal linearized polynomials can be computed by solving
systems of linear equations. Given roots β0, β1, . . . , βp−1, the
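minimal linearized polynomial $x^{[p]} + \sum_{i=0}^{p-1} a_i x^{[i]}$ satisfies
\[
\begin{bmatrix}
\beta_0^{[0]} & \beta_0^{[1]} & \cdots & \beta_0^{[p-1]} \\
\beta_1^{[0]} & \beta_1^{[1]} & \cdots & \beta_1^{[p-1]} \\
\vdots & \vdots & \ddots & \vdots \\
\beta_{p-1}^{[0]} & \beta_{p-1}^{[1]} & \cdots & \beta_{p-1}^{[p-1]}
\end{bmatrix}
\begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_{p-1} \end{bmatrix}
=
\begin{bmatrix} \beta_0^{[p]} \\ \beta_1^{[p]} \\ \vdots \\ \beta_{p-1}^{[p]} \end{bmatrix}.
\tag{6}
\]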
Thus it can be solved by Gaussian elimination over the exten-
sion field Fqm . Gabidulin’s algorithm is not applicable because
the rows of the matrix are not the powers of the same element.
The complexity to solve (6) is very high. Instead, we refor-
mulate the method from [21, Chap. 1, Theorem 7]. The main
idea of [21, Chap. 1, Theorem 7] is to recursively construct
the minimal linearized polynomial using symbolic products
instead of polynomial multiplications in polynomial interpolation.
Given linearly independent roots w0, w1, . . . , wp−1, we
can construct a series of linearized polynomials as $F^{(0)}(x) = x^{[0]}$
and $F^{(i+1)}(x) = (x^{[1]} - (F^{(i)}(w_i))^{q-1} x^{[0]}) \otimes F^{(i)}(x)$ for
i = 0, 1, . . . , p − 1.
Although the recursive method in [21, Chap. 1, Theorem 7]
is for p-polynomials, we can adapt it to linearized polynomials
readily. A serious drawback of [21, Chap. 1, Theorem 7]
is that the evaluation of $F^{(i)}(w_i)$ has a rapidly increasing
complexity when the degree of $F^{(i)}(x)$ gets higher. To eliminate
this drawback, we reformulate the algorithm so that the
evaluation of $F^{(i)}(w_i)$ is done in a recursive way. Our reformulated
algorithm is based on the fact that
$F^{(i+1)}(w_j) = (x^{[1]} - (F^{(i)}(w_i))^{q-1} x^{[0]}) \otimes F^{(i)}(w_j) = (F^{(i)}(w_j))^{[1]} - (F^{(i)}(w_i))^{q-1} F^{(i)}(w_j)$.
Representing $F^{(i)}(w_j)$ as $\gamma_j^{(i)}$, we obtain Algorithm 4.
Algorithm 4 (Minimal Linearized Polynomials).
Input: Roots w0, w1, . . . , wp−1
Output: The minimal linearized polynomial F (p)(x)
4.1 Set $\gamma_j^{(0)} = w_j$ for j = 0, 1, . . . , p − 1 and $F^{(0)}(x) = x^{[0]}$.
4.2 For i = 0, 1, . . . , p − 1,
a) If $\gamma_i^{(i)} = 0$, set $F^{(i+1)}(x) = F^{(i)}(x)$ and $\gamma_j^{(i+1)} = \gamma_j^{(i)}$
for j = i + 1, i + 2, . . . , p − 1; otherwise, set
$F^{(i+1)}(x) = (F^{(i)}(x))^{[1]} - (\gamma_i^{(i)})^{q-1} F^{(i)}(x)$ and
$\gamma_j^{(i+1)} = (\gamma_j^{(i)})^{[1]} - (\gamma_i^{(i)})^{q-1} \gamma_j^{(i)}$ for j = i + 1, i + 2, . . . , p − 1.
Since powers of q require only cyclic shifting, the operations
in Algorithm 4 are simple. Also, Algorithm 4 does not
require the roots to be linearly independent. In Algorithm 4,
$F^{(i)}(w_j) = 0$ for j = 0, 1, . . . , i − 1 and $\gamma_i^{(i)} = F^{(i)}(w_i)$.
If w0, w1, . . . , wj are linearly dependent, $\gamma_j^{(j)} = 0$ and hence
wj is ignored. So Algorithm 4 integrates detection of linear
dependency at no extra computational cost.
Essentially, Algorithm 4 breaks down evaluations of high
q-degree polynomials into evaluations of polynomials with q-
degree of one. It gets rid of complex operations while main-
taining the same total complexity of the algorithm.
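The recursion of Algorithm 4 can be sketched in Python for q = 2, where (γ)^(q−1) = γ and the map x^[1] is a squaring. The sketch rests on assumptions that are not in the text: the field is GF(2^4) with modulus x^4 + x + 1 in a polynomial basis (so squaring is a multiplication rather than a cyclic shift), and all function names are illustrative.

```python
M_DEG = 4            # assumed extension degree m
MODULUS = 0b10011    # assumed field polynomial x^4 + x + 1


def gf_mul(a, b):
    """Multiply two elements of GF(2^4) given as 4-bit integers."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M_DEG):
            a ^= MODULUS
    return result


def frob(a):
    """The map a -> a^[1] = a^2 for q = 2."""
    return gf_mul(a, a)


def eval_linearized(coeffs, x):
    """Evaluate sum_k coeffs[k] * x^[k]."""
    acc, power = 0, x
    for c in coeffs:
        acc ^= gf_mul(c, power)
        power = frob(power)
    return acc


def minimal_linearized_poly(roots):
    """Algorithm 4 for q = 2; returns the coefficients of x^[k], lowest first."""
    gamma = list(roots)      # gamma[j] holds F^(i)(w_j)
    F = [1]                  # F^(0)(x) = x^[0]
    for i in range(len(roots)):
        g = gamma[i]         # (gamma_i)^(q-1) equals gamma_i when q = 2
        if g == 0:
            continue         # w_i depends on earlier roots and is ignored
        # F^(i+1)(x) = (F^(i)(x))^[1] - g * F^(i)(x); minus is XOR here.
        shifted = [0] + [frob(c) for c in F]
        scaled = [gf_mul(g, c) for c in F] + [0]
        F = [s ^ t for s, t in zip(shifted, scaled)]
        for j in range(i + 1, len(roots)):
            gamma[j] = frob(gamma[j]) ^ gf_mul(g, gamma[j])
    return F
```

With the inputs [1, 2, 3, 4], the third input is the sum of the first two, so it is skipped and the result has q-degree three.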
IV. ARCHITECTURE DESIGN
Aiming to reduce the storage requirement and total area as
well as to improve the regularity of our decoder architectures,
we further reformulate the steps in the decoding algorithms of
both Gabidulin and KK codes. Again, we assume the decoder
architectures are suitable for RLNC over Fq, where q is a
power of two.
A. High-Speed BMA Architecture
To increase the throughput, regular BMA architectures with
shorter CPD are necessary. Following the approaches in [15],
we develop two architectures based on Algorithm 3, which are
analogous to the riBM and RiBM algorithms in [15].
In Algorithm 3, the critical path is in step 3.2(a). Note
that ∆r is the rth coefficient of the discrepancy polynomial
∆(r)(x) = Λ(r)(x) ⊗ S(x). By using Θ(r)(x) = B(r)(x) ⊗
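S(x), ∆(r+1)(x) can be computed as
\[
\begin{aligned}
\Delta^{(r+1)}(x) &= \Lambda^{(r+1)}(x) \otimes S(x) \\
&= \left[ (\Gamma^{(r)})^{[1]} \Lambda^{(r)}(x) - \Delta_r x^{[1]} \otimes B^{(r)}(x) \right] \otimes S(x) \\
&= (\Gamma^{(r)})^{[1]} \Delta^{(r)}(x) - \Delta_r x^{[1]} \otimes \Theta^{(r)}(x)
\end{aligned}
\tag{7}
\]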
which has the same structure as step 3.2(c). Hence this re-
formulation is more conducive to a regular implementation.
Given the similarities between step 3.2(a) and (7), Λ(x) and
∆(x) can be combined together into one polynomial ∆˜(x).
Similarly, B(x) and Θ(x) can be combined into one poly-
nomial Θ˜(x). These changes are incorporated in our RiBMA
algorithm, shown in Algorithm 5.
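Algorithm 5 (RiBMA).
Input: Syndromes S
Output: Λ(x)
5.1 Initialize: $\tilde{\Delta}^{(0)}(x) = \tilde{\Theta}^{(0)}(x) = \sum_{i=0}^{2t-1} S_i x^{[i]}$, $\Gamma^{(0)} = 1$,
$\tilde{\Delta}^{(0)}_{3t} = \tilde{\Theta}^{(0)}_{3t} = 1$, and b = 0.
5.2 For r = 0, 1, . . . , 2t − 1,
a) Modify the combined polynomial: $\tilde{\Delta}^{(r+1)}(x) = \Gamma^{(r)} \tilde{\Delta}^{(r)}(x) - \tilde{\Delta}^{(r)}_0 \tilde{\Theta}^{(r)}(x)$;
b) Set b = b + 1;
c) If $\tilde{\Delta}^{(r)}_0 \neq 0$ and b > 0, set b = −b, $\Gamma^{(r+1)} = \tilde{\Delta}^{(r)}_0$,
and $\tilde{\Theta}^{(r)}(x) = \tilde{\Delta}^{(r)}(x)$;
d) Set $\tilde{\Delta}^{(r+1)}(x) = \sum_{i=0}^{3t-1} \tilde{\Delta}^{(r+1)}_{i+1} x^{[i]}$ and $\tilde{\Theta}^{(r)}(x) = \sum_{i=0}^{3t-1} \tilde{\Theta}^{(r)}_{i+1} x^{[i]}$;
e) Set $\Gamma^{(r+1)} = (\Gamma^{(r)})^{[1]}$ and $\tilde{\Theta}^{(r+1)}(x) = x^{[1]} \otimes \tilde{\Theta}^{(r)}(x)$.
5.3 Set $\Lambda(x) = \sum_{i=0}^{t} \tilde{\Delta}^{(2t)}_{i+t} x^{[i]}$.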
Following Algorithm 5, we propose a systolic RiBMA architecture
shown in Fig. 3, which consists of 3t + 1 identical
processing elements (BEs), whose circuitry is shown in Fig. 4.
The central control unit BCtrl, the rightmost cell in Fig. 3,
updates b, generates the global control signals ct(r) and Γ(r),
and passes along the coefficient Λ(r)0 . The control signal ct(r)
is set to 1 only if $\tilde{\Delta}^{(r)}_0 \neq 0$ and b > 0. In each processing
element, there are two critical paths, both of which consist of
one multiplier and one adder over F2m .
Fig. 3. The RiBMA architecture (a linear array of processing elements BE0, . . . , BE3t followed by the control unit BCtrl)
Fig. 4. The processing element BEi (xq is a cyclic shift, and requires no
hardware but wiring)
B. Generalized BMA
The key equation of KK decoding is essentially the same as
(2), but ω(x) has q-degree less than τ instead of ⌊(d − 1)/2⌋.
Actually, in KK decoding, we do not know the exact value of
τ before solving the key equation. All we need is to determine
the maximum number of correctable errors t′ given µ erasures
and δ deviations, which is given by t′ = ⌊(d − 1 − µ − δ)/2⌋.
Hence we adapt our BMA in Section III-C to KK decoding, as
in Algorithm 6. To apply Algorithm 6 to Gabidulin decoding,
we can simply use θ = µ+ δ = 0.
Algorithm 6 (Generalized RiBMA).
Input: S and θ
Output: Λ(x)
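6.1 Initialize as follows: t′ = ⌊(d − 1 − θ)/2⌋,
$\tilde{\Delta}^{(0)}(x) = \tilde{\Theta}^{(0)}(x) = \sum_{i=\theta}^{\theta+2t'-1} S_i x^{[i]}$, $\tilde{\Delta}^{(0)}_{2t'+t} = \tilde{\Theta}^{(0)}_{2t'+t} = 1$,
$\Gamma^{(0)} = 1$, and b = 0.
6.2 For r = 0, 1, . . . , 2t′ − 1,
a) Modify the combined polynomial: $\tilde{\Delta}^{(r+1)}(x) = \Gamma^{(r)} \tilde{\Delta}^{(r)}(x) - \tilde{\Delta}^{(r)}_0 \tilde{\Theta}^{(r)}(x)$;
b) Set b = b + 1;
c) If $\tilde{\Delta}^{(r)}_0 \neq 0$ and b > 0, set b = −b, $\Gamma^{(r+1)} = \tilde{\Delta}^{(r)}_0$,
and $\tilde{\Theta}^{(r)}(x) = \tilde{\Delta}^{(r)}(x)$;
d) Set $\tilde{\Delta}^{(r+1)}(x) = \sum_{i=0}^{2t'+t-1} \tilde{\Delta}^{(r+1)}_{i+1} x^{[i]}$ and $\tilde{\Theta}^{(r)}(x) = \sum_{i=0}^{2t'+t-1} \tilde{\Theta}^{(r)}_{i+1} x^{[i]}$;
e) Set $\Gamma^{(r+1)} = (\Gamma^{(r)})^{[1]}$ and $\tilde{\Theta}^{(r+1)}(x) = x^{[1]} \otimes \tilde{\Theta}^{(r)}(x)$.
6.3 Set $\Lambda(x) = \sum_{i=0}^{t'} \tilde{\Delta}^{(2t')}_{i+t} x^{[i]}$.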
Compared with Algorithm 5, we replace t by t′. The vari-
able t′ makes it difficult to design regular architectures. By
carefully initializing ∆˜(0)(x) and Θ˜(0)(x), we ensure that the
desired output Λ(x) is always at a fixed position of ∆˜(2t′)(x),
regardless of µ + δ. Hence, the only irregular part is the
initialization. The initialization of Algorithm 6 can be done by
shifting in at most θ cycles. Hence the RiBMA architecture in
Fig. 3 can be adapted to the KK decoder and keep the same
worst-case latency of 2t cycles.
C. Gaussian Elimination
We need Gaussian elimination to obtain n-RRE forms as
well as to find root spaces. Furthermore, Gabidulin’s algo-
rithm in Algorithm 1 is essentially a smart way of Gaussian
elimination, which takes advantage of the properties of the
matrix. The reduction (to obtain n-RRE forms) and finding
the root space are Gaussian eliminations on matrices over Fq ,
while Gabidulin’s algorithm operates on matrices over Fqm . In
this section, we focus on Gaussian eliminations over Fq and
Gabidulin’s algorithm will be discussed in Section IV-D.
For high-throughput implementations, we adapt the pivoting
architecture in [32], which was developed for non-singular
matrices over F2. It always keeps the pivot element on the
top-left location of the matrix, by cyclically shifting the rows
and columns. Our Gaussian elimination algorithm, shown in
Algorithm 7, has three key differences from the pivoting ar-
chitecture in [32]. First, Algorithm 7 is applicable to matrices
over any field. Second and more importantly, Algorithm 7 can
be used for singular matrices. This feature is necessary since
singular matrices occur in the reduction for the RRE form and
finding the root space. Third, Algorithm 7 is also flexible about
matrix sizes, which are determined by the variable numbers
of errors, erasures, and deviations.
Algorithm 7 (Gaussian Elimination for Root Space).
Input: M ∈ F_q^{m×m}, whose rows are evaluations of σ(x)
over the normal basis, and B = Im
Output: Linearly independent roots of σ(x)
7.1 Set i = 0.
7.2 For j = 0, 1, . . . , m − 1,
a) Set l = 1;
b) While M0,0 = 0 and l < m − i, set l = l + 1 and
perform shiftup(M , i) and shiftup(B, i);
c) If M0,0 ≠ 0, perform eliminate(M) and reduce(B,M),
and set i = i + 1; otherwise, perform shiftleft(M).
7.3 The first m − i rows of M are all zeros and the first
m − i rows of B are roots.
The eliminate and shiftup operations are quite similar to
those in [32, Algorithm 2]. In eliminate(M), for 0 ≤ j < m,
$M_{i,j} = M_{0,0} M_{i+1,(j+1) \bmod m} - M_{i+1,0} M_{0,(j+1) \bmod m}$ for
0 ≤ i < m − 1, and $M_{m-1,j} = M_{0,(j+1) \bmod m}$. Note that a
cyclic row shift and a cyclic column shift are already embedded
in the eliminate operation. In the shiftup(M , ρ) operation,
the first row is moved to the (m−1−ρ)th row while the
second to the (m−1−ρ)th rows are moved up. That is, for
0 ≤ j < m, $M_{i,j} = M_{0,j}$ if i = m−1−ρ, and $M_{i,j} = M_{i+1,j}$ for
0 ≤ i ≤ m − 2 − ρ. The operation reduce(B,M) essentially
mimics all row operations in eliminate without the column
shift: for 0 ≤ j < m, $B_{i,j} = M_{0,0} B_{i+1,j} - M_{i+1,0} B_{0,j}$ for
0 ≤ i < m−1, and $B_{m-1,j} = B_{0,j}$. In the shiftleft operation,
all columns are cyclically shifted to the left. In other words, for
all 0 ≤ i < m and 0 ≤ j < m, $M_{i,j} = M_{i,(j+1) \bmod m}$. By
adding a shiftleft operation, Algorithm 7 handles both singular
and non-singular matrices while [32, Algorithm 2] works for
non-singular matrices only. Since B is always full rank, the
roots obtained are guaranteed to be linearly independent.
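The four operations can be simulated in software to check the bookkeeping of Algorithm 7. The following Python sketch works over F2 and makes assumptions that are not in the text: matrices are lists of 0/1 rows, eliminate and reduce are fused into one helper so that both use the pivot row from before the update (as they would when operating in the same clock cycle), and the function names are illustrative.

```python
def shiftup(mat, rho):
    """Move the first row to row m-1-rho and move rows 1..m-1-rho up."""
    k = len(mat) - rho
    mat[:k] = mat[1:k] + mat[:1]


def shiftleft(mat):
    """Cyclically shift all columns one position to the left."""
    for r in range(len(mat)):
        mat[r] = mat[r][1:] + mat[r][:1]


def eliminate_and_reduce(M, B):
    """eliminate(M) and reduce(B, M), both driven by the old first row of M."""
    m = len(M)
    p, pb = M[0], B[0]
    new_M, new_B = [], []
    for i in range(m - 1):
        f = M[i + 1][0]
        new_M.append([(p[0] * M[i + 1][(j + 1) % m] - f * p[(j + 1) % m]) % 2
                      for j in range(m)])
        new_B.append([(p[0] * B[i + 1][j] - f * pb[j]) % 2 for j in range(m)])
    new_M.append([p[(j + 1) % m] for j in range(m)])   # pivot row goes last
    new_B.append(list(pb))
    M[:], B[:] = new_M, new_B


def root_space(M_in):
    """Return (rows z of B with z * M_in = 0, final M) following Algorithm 7."""
    m = len(M_in)
    M = [list(r) for r in M_in]
    B = [[int(r == c) for c in range(m)] for r in range(m)]
    i = 0
    for _ in range(m):
        l = 1
        while M[0][0] == 0 and l < m - i:
            l += 1
            shiftup(M, i)
            shiftup(B, i)
        if M[0][0] != 0:
            eliminate_and_reduce(M, B)
            i += 1
        else:
            shiftleft(M)
    return B[:m - i], M


# Illustrative singular input: row 2 is the sum of rows 0 and 1, row 3 is zero.
example = [[1, 1, 0, 0], [0, 1, 1, 0], [1, 0, 1, 0], [0, 0, 0, 0]]
roots, reduced = root_space(example)
```

Each returned row z is a combination of the input rows that sums to zero, and after m column steps the columns of M are back in their original order.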
We can get the root space using Algorithm 7, and we can
also use it in KK decoding to reduce the received vector to
an n-RRE form. However, Algorithm 7 only produces Eˆ′. We
extend it to Algorithm 8 below so as to obtain Lˆ′ simultane-
ously.
Algorithm 8 (Gaussian Elimination for n-RRE Forms).
Input: N × n matrix Aˆ and N ×m matrix y
Output: Lˆ′, Eˆ′, r′, and µ′
8.1 Set i = 0, and set U ′ and Lˆ′ as empty.
8.2 For each column j = 0, 1, . . . , n− 1
a) l = 1
b) While Aˆ0,0 = 0 and l < n− i
l = l+1, shiftup(Aˆ, i), shiftup(y, i), shiftup(Lˆ′, i).
c) If Aˆ0,0 is not zero, eliminate(Aˆ), reduce(y, Aˆ),
shiftup(Lˆ′, 0), i = i+ 1.
d) Otherwise, shiftleft(Aˆ), append the first column
of Aˆ to Lˆ′, set the top-right element of Lˆ′ to one,
and add j to U ′.
8.3 Set µ′ = n− i. The deviations Eˆ′ are given by the first
µ′ rows of y.
8.4 For each column j ∈ U ′, shiftup(Lˆ′, j) and shiftup(y, j).
8.5 The received vector r′ is given by y.
In Algorithm 8, we incorporate the extraction of Lˆ′, Eˆ′, and
r′ into Gaussian elimination. Our architecture has the same
worst-case latency as Algorithm 7 and requires no extra cycles
to extract Lˆ′ out of the n-RRE form. Hence the throughput also
remains the same.
Algorithm 7 is implemented by the regular architecture shown
in Fig. 5, which is a two-dimensional array of m×2m processing
elements (GE's); the leftmost m columns of processing
elements correspond to M , and the rightmost m columns to B.
Algorithm 8 can be implemented with the same architecture
with N×(n+m) GE's; the leftmost n columns of processing
elements correspond to Aˆ, and the rightmost m columns to y.
The elements for Lˆ′ are omitted in the figure. The circuitry
of the processing element GE is shown in Fig. 6. The control
signal cti for row i chooses from five inputs based on the
operation: keeping the value, shiftleft, eliminate (or reduce),
and shiftup (using the first row or the next row).
Fig. 6. The processing element GEi, j
D. Gabidulin’s Algorithm
In Algorithm 1, the matrix is first reduced to a triangular
form. It takes advantage of the property of the matrix so that
it requires no division in the first stage. In the first stage, we
need to perform elimination on only one row. We use a
pivoting scheme similar to that of Algorithm 7. When a row is reduced to
have only one non-zero element, a division is used to obtain
one coefficient of X . Then it performs a backward elimination
after getting each coefficient. Hence we introduce a backward
pivoting scheme, where the pivot element is always at the
bottom-right corner.
In Algorithm 1, there are two τ × τ matrices over Fqm ,
A and Q. In step 1.2, it requires only Qi,0’s to compute the
coefficients. To compute Qi,0 in (5), it requires only Qi−1,0
and Qi−1,1. And for Qi,j in (5), it requires only Qi−1,j and
Qi−1,j+1. Recursively, only those Qi,j’s where i+ j < τ are
necessary. Actually, given any i, entries Qi,0, Qi+1,0, . . . , Qτ−1,0
can be computed with the entries Qi−1,0, Qi−1,1, . . . , Qi−1,τ−i.
With Q0,0, Q1,0, . . . , Qi−2,0, we need to store only τ values to
keep track of Q. Hence we reduce the storage of Q from τ×τ
m-bit registers down to τ . We cannot reduce the storage of A
to τ(τ +1)/2 because we have to use the pivoting scheme for
short critical paths.
Fig. 7. Our architecture of Gabidulin's algorithm
Fig. 5. Regular architecture for Gaussian elimination (an m × 2m array of processing elements GEi,j with the control unit GCtrl)
Fig. 8. The processing element AEi,j
Fig. 9. The processing element QEi
In our decoder, Algorithm 1 is implemented by the regular
architecture shown in Fig. 7, which includes a triangular array
of τ × τ AE's and a one-dimensional array of τ QE's. The
circuitry of the processing elements AEi,j and QEi is shown
in Figs. 8 and 9, respectively. The upper MUX in AE controls the output
sent upward along the diagonal. Its control signal ctaui is
1 for the second row and 0 for other rows, since we update A
one row per cycle and keep the pivot in the upper-left
corner in Step 1.1. The control of the lower MUX in AE is
0 when working on Step 1.1 and 1 when working on Step 1.2.
Similarly, the control of the MUX in QE is 0 when working on
Step 1.1 and 1 when working on Step 1.2. In Step 1.1, however, only
some of the QE's need updating; the others maintain their
values, and their control signals ctqi are set to 2. Initially,
A0,i = Ei and qi = Si for i = 0, 1, . . . , τ − 1. Step 1.1 needs
τ substeps. In the first τ−1 substeps, ctali+1 = 0, ctau1 = 1,
ctq0 = ctq1 = · · · = ctqi = 2, and ctqi+1 = ctqi+2 = · · · =
ctqτ−1 = 0 for substep i. In the last substep, ctau1 = 0 and
all ctqi's are set to 2. This substep puts the updated A
back into its original position. In Step 1.2, the pivot is in the
bottom-right corner, where we compute the Xi's. Step 1.2 also needs
τ substeps, in which all ctali's and ctqi's are set to 1. First,
Xτ−1 is computed as $A_{\tau-1,\tau-1}^{-1} q_{\tau-1}$, where qτ−1 = Qτ−1,0.
Note that the inversion may need m − 2 clock cycles. In each
substep, the matrix A moves down the diagonal so that the Ai,i
to be inverted is always at the bottom-right corner. At the same
time, the qi's also move down. Basically, in substep p,
the architecture updates the qi's to $Q_{i-p,0} - \sum_{j=\tau-1-p}^{\tau-1} A_{i,j} X_j$
for i > p by doing one backward elimination at each substep.
E. Low Complexity Linearized Interpolation
It would seem that three registers are needed to store $F^{(i)}(x)$,
the wj's, and the $\gamma_j^{(i)}$'s, respectively, in Algorithm 4. However, we
can implement Algorithm 4 with a single register of size p + 1.
First, we note that the wj's are used to initialize the $\gamma_j^{(0)}$'s, and
only the $\gamma_j^{(i)}$'s are used in the updates. Second, after the i-th
iteration of step 4.2, the q-degree of $F^{(i+1)}(x)$ is no more
than i + 1 and we need only $\gamma_{i+1}^{(i+1)}, \gamma_{i+2}^{(i+1)}, \ldots, \gamma_{p-1}^{(i+1)}$
thereafter. Thus, we can store the coefficients of $F^{(i+1)}(x)$ and
$\gamma_{i+1}^{(i+1)}, \gamma_{i+2}^{(i+1)}, \ldots, \gamma_{p-1}^{(i+1)}$ in a register of size p + 1. We refer
to this register as η and index it 0, 1, . . . , p from left to right.
Note that $\gamma_{i+1}^{(i+1)}, \gamma_{i+2}^{(i+1)}, \ldots, \gamma_{p-1}^{(i+1)}$ are stored at the lower end
of the η register, and the coefficients of $F^{(i+1)}(x)$ are stored
at the higher end of the register. At each iteration, the content
of the η register is shifted to the left by one position, so that
$\gamma_i^{(i)}$ is always stored at η0.
Algorithm 9 (Reformulated Algorithm for Minimal Linearized
Polynomials).
Input: Roots w0, w1, . . . , wp−1
Output: The minimal linearized polynomial F (x)
9.1 Initialization: $\eta_j^{(0)} = w_j$ for j = 0, 1, . . . , p − 1, and
$\eta_p^{(0)} = 1$.
9.2 For i = 0, 1, . . . , p − 1,
a) If $\eta_0^{(i)} \neq 0$,
i) For j = 0, 1, . . . , p−1−i, $\eta_j^{(i+1)} = (\eta_{j+1}^{(i)})^{[1]} - (\eta_0^{(i)})^{q-1} \eta_{j+1}^{(i)}$;
ii) For j = p−i, p−i+1, . . . , p, $\eta_j^{(i+1)} = (\eta_j^{(i)})^{[1]} - (\eta_0^{(i)})^{q-1} \eta_{j+1}^{(i)}$;
b) Otherwise, for j = 0, 1, . . . , p, $\eta_j^{(i+1)} = \eta_{j+1}^{(i)}$.
9.3 $F(x) = \sum_{i=0}^{p} \eta_i^{(p)} x^{[i]}$.
We note that the updates involve $\eta_{p+1}^{(i)}$, which is always
set to zero (see Fig. 10). When an input wi is linearly
dependent on w0, w1, . . . , wi−1, $\eta_0^{(i)} = 0$. In this case,
the algorithm simply ignores the input, and the ηi registers
are shifted to the left by one position. Hence, whether or
not the inputs w0, w1, . . . , wp−1 are linearly independent, the
minimal linearized polynomial for the inputs will be available
after p iterations. This flexibility is important for our decoder
architecture, since the number of linearly independent inputs
varies.
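The single-register scheme can be sketched in Python for q = 2. This is a sketch under stated assumptions, not a transcription of the hardware: the field is GF(2^4) with modulus x^4 + x + 1 in a polynomial basis, the names are illustrative, and the cell j = p − 1 − i, which receives the lowest coefficient of F^(i+1)(x), is computed with its x^[1] input forced to zero. That boundary treatment is one reading of how the layout of the η register maps onto the recursion of Algorithm 4.

```python
M_DEG = 4            # assumed extension degree m
MODULUS = 0b10011    # assumed field polynomial x^4 + x + 1


def gf_mul(a, b):
    """Multiply two elements of GF(2^4) given as 4-bit integers."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M_DEG):
            a ^= MODULUS
    return result


def frob(a):
    """The map a -> a^[1] = a^2 for q = 2."""
    return gf_mul(a, a)


def eval_linearized(coeffs, x):
    """Evaluate sum_k coeffs[k] * x^[k]."""
    acc, power = 0, x
    for c in coeffs:
        acc ^= gf_mul(c, power)
        power = frob(power)
    return acc


def minimal_poly_single_register(roots):
    """Keep the gammas and the coefficients of F in one register of size p + 1."""
    p = len(roots)
    eta = list(roots) + [1]          # eta_j = w_j for j < p, eta_p = 1
    for i in range(p):
        g = eta[0]                   # (eta_0)^(q-1) equals eta_0 when q = 2
        ext = eta + [0]              # eta_{p+1} is always zero
        if g == 0:
            eta = ext[1:]            # dependent input: plain left shift
            continue
        new = []
        for j in range(p + 1):
            if j <= p - 2 - i:       # cells holding the remaining gammas
                a = ext[j + 1]
            elif j == p - 1 - i:     # assumed boundary: lowest coefficient of F
                a = 0
            else:                    # higher coefficients of F
                a = ext[j]
            new.append(frob(a) ^ gf_mul(g, ext[j + 1]))
        eta = new
    return eta                       # eta_k is the coefficient of x^[k]
```

With the inputs [1, 2, 3, 4], the third input is dependent, so the top coefficient of the returned register is zero and the polynomial has q-degree three.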
Algorithm 9 is implemented by the systolic architecture
shown in Fig. 10, which consists of p + 1 processing ele-
ments (ME’s). The circuitry of the processing element MEj
is shown in Fig. 11. The cr signal is 1 only when η0 ≠ 0.
The ctj signal for each cell is 1 only if j < p − i. Basically,
ctj controls whether the update is for $F^{(i+1)}(x)$ or for
$\gamma_{i+1}^{(i+1)}, \gamma_{i+2}^{(i+1)}, \ldots, \gamma_{p-1}^{(i+1)}$, as in Algorithm 4.
Fig. 10. Architecture of linearized polynomial interpolation (a linear array of processing elements ME0, . . . , MEp followed by the control unit MCtrl)
Fig. 11. The processing element MEj (xq is a cyclic shift, and requires no
hardware but wiring). For simplicity, we have omitted the superscripts of ηj
F. Decoding Failure
A complete decoder declares decoding failure when no valid
codeword is found within the decoding radius of the received
word. To the best of our knowledge, decoding failures of
Gabidulin and KK codes were not discussed in previous works.
Similar to RS decoding algorithms, a rank decoder can return
decoding failure when the roots of the error span polynomial
λ(x) are not unique. That is, the root space of λ(x) has a
dimension smaller than the q-degree of λ(x). Note that this
applies to both Gabidulin and KK decoders. For KK decoders,
another condition of decoding failure is when the total number
of erasures and deviations exceeds the decoding bound d− 1.
G. Latency and Throughput
We analyze the worst-case decoding latencies of our decoder
architectures, in terms of clock cycles, in Table III.
TABLE III
WORST-CASE DECODING LATENCY (IN TERMS OF CLOCK CYCLES).
GAUSSIAN ELIMINATION OVER Fqm (ROOT SPACE IN GABIDULIN AND KK
DECODERS) HAS THE LONGEST CRITICAL PATH OF ONE MULTIPLIER, ONE
ADDER, ONE TWO-INPUT MUX, AND ONE FIVE-INPUT MUX.
Step                   Gabidulin       KK
n-RRE                  -               n(2N − n + 1)/2
Syndrome S             n               n
λU (x)                 -               2t
σD(x)                  -               2t
SDU (x)                -               2(d − 1)
BMA                    2t              2t
SFD(x)                 -               d − 1
β                      -               (m + 2)(d − 1)
σU (x)                 -               2t
σ(x)                   -               d − 1
root space basis E     m(m + 1)/2      m(m + 1)/2
error locator L        2t + mt         (m + 2)(d − 1)
error word e           t               2t
As in [32], the latency of Gaussian elimination for the n-
RRE form is at most n(2N − n+ 1)/2 cycles. Similarly, the
latency of finding the root space is at most m(m+ 1)/2.
For Gabidulin’s algorithm, it needs one cycle per row for for-
ward elimination and the same for backward elimination. For
each coefficient, it takes m cycles to perform a division. Hence
it needs at most 2(d− 1)+m(d− 1) and 2(d− 1)+m(d− 1)
for β and L respectively. The latencies of finding the minimal
linearized polynomials are determined by the number of registers,
which is 2t to accommodate λU (x), σD(x), and σU (x),
whose degrees are µ, δ, and µ, respectively. The 2t syndromes
can be computed by 2t sets of multiply-and-accumulators in n
cycles. Note that the computations of S(x), λU (x), and σD(x)
can be done concurrently. The latency of RiBMA is 2t for 2t
iterations. The latency of a symbolic product a(x) ⊗ b(x) is
determined by the q-degree of a(x). When computing SDU (x),
we are concerned about only the terms of q-degree less than
d− 1 because only those are meaningful for the key equation.
For computing SFD(x), the result of σD(x)⊗S(x) in SDU (x)
can be reused, so it needs only one symbolic product. In total,
assuming n = m, the decoding latencies of our Gabidulin and
KK decoders are n(n+3)/2+(n+5)t and n(N+2)+4(n+5)t
cycles, respectively.
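As a quick arithmetic check, the closed-form latencies can be evaluated for the two code sizes implemented in Section V. The short Python sketch below uses hypothetical function names, and the choice N = n + d − 1 for the KK decoders is an assumption (the largest N permitted by the restriction N < n + d in Section V). Under that assumption the values agree with the latency and bottleneck rows of Table IV.

```python
def gabidulin_latency(n, t):
    """Worst-case Gabidulin decoding latency in cycles, assuming n = m."""
    return n * (n + 3) // 2 + (n + 5) * t


def kk_latency(n, N, t):
    """Worst-case KK decoding latency in cycles, assuming n = m."""
    return n * (N + 2) + 4 * (n + 5) * t


def root_space_latency(m):
    """Worst-case latency of finding the root space."""
    return m * (m + 1) // 2


def n_rre_latency(n, N):
    """Worst-case latency of Gaussian elimination for the n-RRE form."""
    return n * (2 * N - n + 1) // 2


# (n, t, d) for the codes over GF(2^8) and GF(2^16); N = n + d - 1 is assumed.
for n, t, d in ((8, 2, 5), (16, 4, 9)):
    N = n + d - 1
    print(n, gabidulin_latency(n, t), kk_latency(n, N, t),
          root_space_latency(n), n_rre_latency(n, N))
```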
One assumption in our analysis is that the unit that computes
$x^{q-1}$ in Figs. 9 and 11 is implemented with pure combinational
logic, which leads to a long CPD for large q's. To achieve
a short CPD for large q's, it is necessary to pipeline the
unit that computes $x^{q-1}$. There are two ways to pipeline it:
$x^{q-1} = x \cdot x^2 \cdots x^{q/2}$, which requires log2 q − 1 multiplications,
or $x^{q-1} = x^q / x$, which requires m multiplications for division.
To maintain a short CPD, $x^{q-1}$ needs to be implemented
sequentially with one clock cycle for each multiplication. Let
$c_{qm} = \min\{\log_2 q - 1, m\}$; then it requires at most $2(c_{qm} + 2)t$
clock cycles to obtain the minimal linearized polynomials λU (x),
σD(x), and σU (x). Similarly, it requires at most $c_{qm}(d-1)$
more cycles to perform forward elimination in Gabidulin's
algorithm for the error locator, and the latency of this step
will be $(m + c_{qm} + 2)(d-1)$ cycles.
In our architectures, we use a block-level pipeline scheme
for high throughput. Data transfers between modules are buffered
into multiple stages so the throughput is determined by only
the longest latency of a single module. For brevity, we present
only the data flow of our pipelined Gabidulin decoder in Fig. 12.
The data in different pipeline stages are for different decoding
sessions. Hence these five units can work on five different
sessions concurrently for higher throughput. If some block
finishes before others, it cannot start another session until all
are finished. So the throughput of our block-level pipeline
decoders is determined by the block with the longest latency.
For Gabidulin decoders, the block of finding root space is the
bottleneck that requires m(m+1)/2 cycles, the longest latency
in the worst case scenario. For KK decoders, the bottleneck is
the RRE block, which requires n(2N − n+ 1)/2 cycles.
Fig. 12. Data flow of our pipelined Gabidulin decoder (stages: received word, syndromes, BMA, roots, Gabidulin's algorithm, and error correction)
V. IMPLEMENTATION RESULTS AND DISCUSSIONS
To evaluate the performance of our decoder architectures,
we implement our architectures for Gabidulin and KK codes
for RLNC over F2. Note that although the random linear
combinations are carried out over F2, decoding of Gabidulin
and KK codes are performed over extension fields of F2.
Due to the hardware limitations caused by the architecture
in Fig. 5, we need to restrict N . Note that we assume the
input matrix is full rank, as in [5]. When N ≥ n+d, the number
of deviations δ = N − n is at least d, which is uncorrectable.
Hence in our implementation of KK decoders, we assume N
is less than n+ d.
A. Implementation Results
We implement our decoder architecture in Verilog for an
(8, 4) Gabidulin code over F_{2^8} and a (16, 8) one over F_{2^16},
which can correct errors of rank up to two and four, respectively.
We also implement our decoder architecture for their
corresponding KK codes, which can correct ε errors, µ erasures,
and δ deviations as long as 2ε + µ + δ is no more than
five or nine, respectively. Our designs are synthesized using
Cadence RTL Compiler 9.1 and FreePDK 45nm standard cell
library [33]. The synthesis results are given in Table IV. In
these tables, the total area includes both cell area and esti-
mated net area, the gate counts are in equivalent numbers
of 2-input NAND gates, and the total power includes both
leakage and estimated dynamic power. All estimations are
made by the synthesis tool. The throughput is computed as
(n×m×R)/(LatencyBottleneck × CPD).
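The throughput expression can be checked against the synthesis figures with a few lines of Python. This is a sketch with a hypothetical function name; it assumes that R is the code rate k/n = 1/2 and that, for the KK decoders, n and m are those of the underlying Gabidulin code. With the bottleneck latencies and CPDs reported in Table IV, these assumptions reproduce the listed throughputs.

```python
def throughput_mbit_s(n, m, rate, bottleneck_cycles, cpd_ns):
    """Information bits per session divided by the bottleneck time."""
    bits = n * m * rate
    # One bit per nanosecond equals 1000 Mbit/s.
    return bits / (bottleneck_cycles * cpd_ns) * 1e3


# (label, n, m, bottleneck cycles, CPD in ns), taken from Table IV.
cases = [
    ("Gab. (8, 4)", 8, 8, 36, 2.309),
    ("KK over GF(2^8)", 8, 8, 68, 2.199),
    ("Gab. (16, 8)", 16, 16, 136, 3.490),
    ("KK over GF(2^16)", 16, 16, 264, 3.617),
]
for label, n, m, cycles, cpd in cases:
    print(label, round(throughput_mbit_s(n, m, 0.5, cycles, cpd)))
```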
To provide a reference for comparison, the gate count of our
(8, 4) KK decoder is only 62% of that of the (255, 239) RS decoder
over the same field F_{2^8} in [34], which is 115,500. So for
Gabidulin and KK codes over small fields, which have limited
error-correcting capabilities, their hardware implementations
are feasible. The area and power of decoder architectures in
Table IV appear affordable except for applications with very
stringent area and power requirements.
TABLE IV
SYNTHESIS RESULTS OF DECODERS FOR GABIDULIN AND KK CODES
Finite field                     F_{2^8}              F_{2^16}
Code                             Gab.      KK         Gab.       KK
(n, k) or (n, m)                 (8, 4)    (4, 4)     (16, 8)    (8, 8)
Gates                            18465     71134      116413     421477
Area (mm^2), cell                0.035     0.133      0.219      0.791
Area (mm^2), net                 0.053     0.202      0.320      1.163
Area (mm^2), total               0.088     0.335      0.539      1.954
CPD (ns)                         2.309     2.199      3.490      3.617
Estimated power (mW), leakage    0.281     1.084      1.690      6.216
Estimated power (mW), dynamic    14.205    54.106     97.905     313.065
Estimated power (mW), total      14.486    55.190     99.595     319.281
Latency (cycles)                 70        216        236        752
Bottleneck (cycles)              36        68         136        264
Throughput (Mbit/s)              385       214        270        134
B. Implementation Results of Long Codes
Although the area and power shown in Table IV are af-
fordable and high throughputs are achieved, the Gabidulin
and KK codes used have very limited block lengths 8 and
16. For practical network applications, the packet size may
be large [11]. One approach to increase the block length of
a constant-dimension code is to lift a Cartesian product of
Gabidulin codes [5]. We also consider the hardware implemen-
tations for this case. We assume a packet size of 512 bytes,
and use a KK code that is based on Cartesian product of 511
length-8 Gabidulin codes. As observed in Section III-E, the
n-RRE form allows us to decode this long KK code in
a serial, partly parallel, or fully parallel fashion. For example,
more decoder modules can be used to decode in parallel for
higher throughput. We list the gate counts and throughput of
the serial and factor-7 parallel schemes based on the (8, 4) KK
decoder and those of the serial and factor-5 parallel schemes
based on the (16, 8) KK decoder in Table V.
In Table V, we simply use multiple KK decoders for parallel
implementations. Parallel KK decoders actually share the
same $\hat{A}$, $\hat{L}$, $\hat{X}$, and $\lambda_U(x)$. Hence, some hardware can
also be shared, such as the left part of the Gaussian elimination for
reduction in Fig. 6 and the interpolation block for $\lambda_U(x)$. With
the same latency, these hardware savings are roughly 7% of
a single KK decoder.
TABLE V
PERFORMANCE OF KK DECODERS FOR 512-BYTE PACKETS

(n, m)                       (4, 4)                  (8, 8)
Decoder                 Serial   7-Parallel     Serial   5-Parallel
Gates                    71134       497938     421477      2107385
Area (mm^2)              0.335        2.345      1.954        9.770
CPD (ns)                       2.199                   3.617
Est. power (mW)         55.190      386.330    319.281     1596.405
Latency (cycles)         34896         5112      67808        13952
Throughput (Mbit/s)        214         1498        134          670
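The latency and gate-count entries of Table V appear to follow a simple pipeline model: the first component codeword takes the full single-decoder latency, each further batch of P codewords adds one bottleneck interval, and the gate count scales linearly with the number P of decoder modules. The sketch below assumes this model. The count of 511 component codes for the (4, 4) decoder comes from the text, whereas the count of 255 for the (8, 8) decoder is inferred from the table values and is an assumption, as is the function name `packet_latency`.

```python
from math import ceil

def packet_latency(n_codewords, parallel, latency, bottleneck):
    """Cycles to decode one packet with `parallel` identical decoder modules.

    Assumed model: the first batch costs the full decoder latency and every
    later batch adds one bottleneck interval.
    """
    batches = ceil(n_codewords / parallel)
    return latency + (batches - 1) * bottleneck

# (4, 4) KK decoder: latency 216, bottleneck 68 (Table IV), 511 codes (from the text)
serial_44 = packet_latency(511, 1, 216, 68)
par7_44 = packet_latency(511, 7, 216, 68)

# (8, 8) KK decoder: latency 752, bottleneck 264 (Table IV), 255 codes (ASSUMED)
serial_88 = packet_latency(255, 1, 752, 264)
par5_88 = packet_latency(255, 5, 752, 264)

# Gate counts of the parallel schemes: P copies of one decoder, no sharing
gates_par7 = 7 * 71134
gates_par5 = 5 * 421477

if __name__ == "__main__":
    print(serial_44, par7_44, serial_88, par5_88, gates_par7, gates_par5)
```

This model ignores the hardware sharing between parallel decoders mentioned in the text, which is consistent with Table V listing exact multiples of the single-decoder gate count.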
C. Discussions
Our implementation results above show that hardware
implementations of RLNC over small fields and with limited
error control are quite feasible, unless there are very stringent
area and power requirements. However, small field sizes imply
limited block length and limited error control. As shown above,
the block length of a constant-dimension code can be increased
by lifting a Cartesian product of Gabidulin codes. While this
easily provides arbitrarily long block lengths, it does not address
the limited error control associated with small field sizes.
For example, a Cartesian product of (8, 4) Gabidulin codes has
the same error correction capability as a single (8, 4) Gabidulin code,
and their corresponding constant-dimension codes also have
the same error correction capability. If we want to increase
the error correction capabilities of both Gabidulin and KK
codes, longer codes are needed and in turn larger fields are
required. A larger field size implies a higher complexity for
finite field arithmetic, and longer codes with greater error
correction capability also lead to higher complexity. It remains
to be seen whether the decoder architectures continue to be
affordable for longer codes over larger fields, and this will be
the subject of our future work.
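To make the link between field size and error control concrete, recall two standard facts: a Gabidulin code of length n over F_{q^m} requires n <= m, and, being a maximum rank distance code, it has minimum rank distance d = n - k + 1 and corrects errors of rank up to floor((d - 1)/2). The helper below is a minimal sketch of this arithmetic only; the function name `gabidulin_capability` is hypothetical.

```python
def gabidulin_capability(n, k, m):
    """Rank-error-correction capability of an (n, k) Gabidulin code over F_{q^m}.

    Uses the MRD property d = n - k + 1 and the unique-decoding radius
    floor((d - 1) / 2). Raises ValueError if the parameters are not admissible.
    """
    if not (0 < k <= n <= m):
        raise ValueError("a Gabidulin code needs 0 < k <= n <= m")
    d = n - k + 1            # minimum rank distance of an MRD code
    return (d - 1) // 2      # number of correctable rank errors

if __name__ == "__main__":
    print(gabidulin_capability(8, 4, 8))     # (8, 4) code over F_{2^8}
    print(gabidulin_capability(16, 8, 16))   # (16, 8) code over F_{2^16}
```

For the two codes implemented here this gives capabilities of 2 and 4, respectively, and a length-16 code is rejected for m = 8, which illustrates why a greater capability at the same rate calls for a larger field.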
VI. CONCLUSION
This paper presents novel hardware architectures for Ga-
bidulin and KK decoders. Our work not only reduces the
computational complexity for the decoder but also devises
regular architectures suitable for hardware implementations.
Synthesis results using a standard cell library confirm that our
designs achieve high speed and high throughput.
ACKNOWLEDGMENT
The authors would like to thank Dr. D. Silva and Prof. F.
R. Kschischang for valuable discussions, and the reviewers
for their constructive comments.
REFERENCES
[1] R. Ahlswede, N. Cai, S.-Y. R. Li, and R. W. Yeung, “Network informa-
tion flow,” IEEE Trans. Inf. Theory, vol. 46, no. 4, pp. 1204–1216, Jul.
2000.
[2] T. Ho, M. Médard, R. Koetter, D. R. Karger, M. Effros, J. Shi, and
B. Leong, “A random linear network coding approach to multicast,”
IEEE Trans. Inf. Theory, vol. 52, no. 10, pp. 4413–4430, Oct. 2006.
[3] R. Kötter and F. R. Kschischang, “Coding for errors and erasures in
random network coding,” IEEE Trans. Inf. Theory, vol. 54, no. 8, pp.
3579–3591, Aug. 2008.
[4] N. Cai and R. W. Yeung, “Network coding and error correction,” in
Proc. IEEE Information Theory Workshop (ITW’02), Oct. 20–25, 2002,
pp. 119–122.
[5] D. Silva, F. R. Kschischang, and R. Kötter, “A rank-metric approach
to error control in random network coding,” IEEE Trans. Inf. Theory,
vol. 54, no. 9, pp. 3951–3967, Sep. 2008.
[6] E. M. Gabidulin, “Theory of codes with maximum rank distance,” Probl.
Inf. Transm., vol. 21, no. 1, pp. 1–12, Jan.–Mar. 1985.
[7] R. M. Roth, “Maximum-rank array codes and their application to
crisscross error correction,” IEEE Trans. Inf. Theory, vol. 37, no. 2, pp.
328–336, Mar. 1991.
[8] D. Silva and F. R. Kschischang, “On metrics for error correction in
network coding,” IEEE Trans. Inf. Theory, vol. 55, no. 12, pp. 5479–
5490, Dec. 2009.
[9] M. Gadouleau and Z. Yan, “Complexity of decoding Gabidulin codes,”
in Proc. 42nd Ann. Conf. Information Sciences and Systems (CISS’08),
Princeton, NJ, Mar. 19–21, 2008, pp. 1081–1085.
[10] F. R. Kschischang and D. Silva, “Fast encoding and decoding of
Gabidulin codes,” in Proc. IEEE Int. Sym. Information Theory (ISIT’09),
Seoul, Korea, Jun. 28–Jul. 3, 2009, pp. 2858–2862.
[11] P. A. Chou, Y. Wu, and K. Jain, “Practical network coding,” in Proc.
41st Ann. Allerton Conf. Communications, Control, and Computing,
Monticello, IL, Oct. 2003.
[12] G. Richter and S. Plass, “Error and erasure decoding of rank-codes with
a modified Berlekamp–Massey algorithm,” in Proc. 5th Int. ITG Conf.
Source and Channel Coding (SCC’04), Erlangen, Germany, Jan. 2004,
pp. 249–256.
[13] N. Chen and Z. Yan, “Cyclotomic FFTs with reduced additive complex-
ities based on a novel common subexpression elimination algorithm,”
IEEE Trans. Signal Process., vol. 57, no. 3, pp. 1010–1020, Mar. 2009.
[14] H. Burton, “Inversionless decoding of binary BCH codes,” IEEE Trans.
Inf. Theory, vol. 17, no. 4, pp. 464–466, Jul. 1971.
[15] D. V. Sarwate and N. R. Shanbhag, “High-speed architectures for Reed–
Solomon decoders,” IEEE Trans. VLSI Syst., vol. 9, no. 5, pp. 641–655,
Oct. 2001.
[16] P. Lancaster and M. Tismenetsky, The Theory of Matrices, 2nd ed., ser.
Computer Science and Applied Mathematics. Orlando, FL: Academic
Press, 1985.
[17] M. Gadouleau and Z. Yan, “On the decoder error probability of bounded
rank-distance decoders for maximum rank distance codes,” IEEE Trans.
Inf. Theory, vol. 54, no. 7, pp. 3202–3206, Jul. 2008.
[18] ——, “Decoder error probability of bounded distance decoders for
constant-dimension codes,” in Proc. IEEE Int. Symp. Information Theory
(ISIT’09), Seoul, Korea, Jun. 28–Jul. 3, 2009, pp. 2226–2230.
[19] ——, “On the decoder error probability of bounded rank distance
decoders for rank metric codes,” in Proc. IEEE Information Theory
Workshop (ITW’09), Taormina, Sicily, Italy, Oct. 11–16, 2009, pp. 485–
489.
[20] P. Delsarte, “Bilinear forms over a finite field, with applications to coding
theory,” J. Comb. Theory, Ser. A, vol. 25, pp. 226–241, 1978.
[21] O. Ore, “On a special class of polynomials,” Trans. Amer. Math. Soc.,
vol. 35, no. 3, pp. 559–584, 1933.
[22] ——, “Contributions to the theory of finite fields,” Trans. Amer. Math.
Soc., vol. 36, no. 2, pp. 243–274, 1934.
[23] P. Loidreau, “A Welch–Berlekamp like algorithm for decoding Gabidulin
codes,” in Proc. 4th Int. Workshop Coding and Cryptography (WCC’05),
ser. Lecture Notes in Computer Science, vol. 3969, Bergen, Norway, Mar.
14–18, 2005, pp. 36–45.
[24] E. R. Berlekamp, Algebraic Coding Theory. New York, NY: McGraw-
Hill, 1968.
[25] V. Skachek and R. M. Roth, “Probabilistic algorithm for finding roots of
linearized polynomials,” Des. Codes Cryptogr., vol. 46, no. 1, pp. 17–23,
Jan. 2008.
[26] E. D. Mastrovito, “VLSI architectures for computations in Galois fields,”
Ph.D. dissertation, Linköping Univ., Linköping, Sweden, 1991.
[27] M. Wagh and S. Morgera, “A new structured design method for
convolutions over finite fields, Part I,” IEEE Trans. Inf. Theory, vol. 29,
no. 4, pp. 583–595, Jul. 1983.
[28] J. K. Omura and J. L. Massey, “Computational method and apparatus
for finite field arithmetic,” US Patent 4 587 627, 1986.
[29] A. Reyhani-Masoleh and M. A. Hasan, “A new construction of Massey–
Omura parallel multiplier over GF(2^m),” IEEE Trans. Comput., vol. 51,
no. 5, pp. 511–520, May 2002.
[30] E. R. Berlekamp, Algebraic Coding Theory, revised ed. Laguna Hills,
CA: Aegean Park Press, 1984.
[31] D. Silva, F. R. Kschischang, and R. Koetter, “A rank-metric approach
to error control in random network coding,” in Proc. IEEE Information
Theory Workshop on Information Theory for Wireless Networks, Jul. 1–6,
2007, pp. 1–5.
[32] A. Bogdanov, M. C. Mertens, C. Paar, J. Pelzl, and A. Rupp, “A parallel
hardware architecture for fast Gaussian elimination over GF(2),” in
Proc. 14th Ann. IEEE Symp. Field-Programmable Custom Computing
Machines (FCCM’06), Napa Valley, CA, Apr. 24–26, 2006, pp. 237–
248.
[33] J. E. Stine, I. Castellanos, M. Wood, J. Henson, F. Love, W. R. Davis,
P. D. Franzon, M. Bucher, S. Basavarajaiah, J. Oh, and R. Jenkal,
“FreePDK: An open-source variation-aware design kit,” in Proc. IEEE
Int. Conf. Microelectronic Systems Education (MSE’07), San Diego, CA,
Jun. 3–4, 2007, pp. 173–174.
[34] H. Lee, “High-speed VLSI architecture for parallel Reed–Solomon
decoder,” IEEE Trans. VLSI Syst., vol. 11, no. 2, pp. 288–294, Apr.
2003.
APPENDIX A
PROOF OF LEMMA 1
Proof: This follows the proof of [31, Proposition 7] closely.
Let the RRE form and an n-RRE form of $Y$ be
$\mathrm{RRE}(Y) = \begin{bmatrix} W & \tilde{r} \\ 0 & \hat{E} \end{bmatrix}$
and
$\bar{Y}' = \begin{bmatrix} W' & \tilde{r}' \\ 0 & \hat{E}' \end{bmatrix}$,
respectively. Since the RRE form of $\hat{A}$ is
unique, $W = W'$. Thus, $\mu = \mu'$ and $\delta = \delta'$. In the proof of
[31, Proposition 7], $U$ is chosen based on $W$. Thus, we choose
$U = U'$. Since $\hat{L}$ is uniquely determined by $W$ and $\hat{L}'$ by
$W'$, we also have $\hat{L} = \hat{L}'$. Finally, choosing $r' = I_{U'^c} \tilde{r}'$, the
rest follows the same steps as in the proof of [31, Proposition 7].
APPENDIX B
PROOF OF LEMMA 2
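Proof: This follows an approach similar to that in [5, Appendix
C]. We have

$$
\begin{aligned}
\operatorname{rank}\begin{bmatrix} X \\ Y \end{bmatrix}
&= \operatorname{rank}\begin{bmatrix} I & x \\ I + \hat{L}' I_{U'}^T & r' \\ 0 & \hat{E}' \end{bmatrix} \\
&= \operatorname{rank}\begin{bmatrix} \hat{L}' I_{U'}^T & r' - x \\ I_{U'^c}^T (I + \hat{L}' I_{U'}^T) & I_{U'^c}^T r' \\ 0 & \hat{E}' \end{bmatrix} \qquad (8) \\
&= \operatorname{rank}\begin{bmatrix} \hat{L}' I_{U'}^T & r' - x \\ 0 & \hat{E}' \end{bmatrix}
 + \operatorname{rank}\begin{bmatrix} I_{U'^c}^T & I_{U'^c}^T x \end{bmatrix} \qquad (9) \\
&= \operatorname{rank}\begin{bmatrix} \hat{L}' & r' - x \\ 0 & \hat{E}' \end{bmatrix} + n - \mu'
\end{aligned}
$$

where (8) follows from $I_{U'}^T [\, I + \hat{L}' I_{U'}^T \mid r' \,] = 0$ and (9) follows
from $I_{U'}^T I_{U'^c} = 0$. Since $\operatorname{rank} X + \operatorname{rank} Y = 2n - \mu' + \delta'$, the
subspace distance is given by

$$
d_S(\langle X \rangle, \langle Y \rangle)
= 2 \operatorname{rank}\begin{bmatrix} X \\ Y \end{bmatrix} - \operatorname{rank} X - \operatorname{rank} Y
= 2 \operatorname{rank}\begin{bmatrix} \hat{L}' & r' - x \\ 0 & \hat{E}' \end{bmatrix} - \mu' - \delta'.
$$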
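The identity d_S(<X>, <Y>) = 2 rank[X; Y] - rank X - rank Y used at the end of the proof can be checked numerically on small examples. The following is a minimal sketch over GF(2) only, with each matrix row encoded as an integer bitmask. The function names and the encoding are illustrative assumptions and are not part of the decoder architectures described in this paper.

```python
def gf2_rank(rows):
    """Rank over GF(2) of a matrix whose rows are given as integer bitmasks."""
    basis = []  # independent rows; each later row has earlier leading bits cleared
    for row in rows:
        for b in basis:
            # row ^ b is smaller than row exactly when b's leading bit is set in row
            row = min(row, row ^ b)
        if row:
            basis.append(row)
    return len(basis)

def subspace_distance(X, Y):
    """d_S between the row spaces of X and Y (lists of GF(2) row bitmasks)."""
    return 2 * gf2_rank(X + Y) - gf2_rank(X) - gf2_rank(Y)

if __name__ == "__main__":
    # span{e1, e2} and span{e2, e3} in GF(2)^3 intersect in a line
    print(subspace_distance([0b100, 0b010], [0b010, 0b001]))
    # two different bases of the same plane
    print(subspace_distance([0b110, 0b011], [0b101, 0b011]))
```

The first pair of subspaces has dimensions 2 and 2 with a one-dimensional intersection, so the distance is 2 + 2 - 2 = 2, while the second pair spans the same plane and is at distance 0.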
Ning Chen (S’06-M’10) received the B.E. and M.E.
degrees from Tsinghua University, Beijing, China, in
2001 and 2004, respectively, and the Ph.D. degree
from Lehigh University, in 2010, all in electrical
engineering. Currently he is with the Enterprise Stor-
age Division, PMC-Sierra, Allentown, PA, USA.
His research interests are in the VLSI design and
implementation of digital signal processing and com-
munication systems.
Zhiyuan Yan (S’00–M’03–SM’08) received the B.E.
degree in electronic engineering from Tsinghua Uni-
versity, Beijing, China, in 1995 and the M.S. and
Ph.D. degrees, both in electrical engineering, from
the University of Illinois, Urbana, in 1999 and 2003,
respectively. During the summers of 2000 and 2002, he was
a Research Intern with Nokia Research Center, Irving,
Texas. He joined the Electrical and Computer
Engineering Department of Lehigh University, Beth-
lehem, Pennsylvania, as an Assistant Professor in
August 2003.
His current research interests are in coding theory, cryptography, wireless
communications, and VLSI implementations of communication and signal
processing systems, and he has published over 70 technical papers in refereed
journals and conference proceedings.
Dr. Yan served as technical program committee co-chair and general co-
chair for ACM Great Lakes Symposium on VLSI in 2007 and 2008, respec-
tively. Dr. Yan has been an associate editor for IEEE Communications Letters
since 2008 and is an associate editor for Journal of Signal Processing Systems.
He is a member of the IEEE Information Theory, Communications, Signal
Processing, and Computer Societies. He is also a member of Tau Beta Pi,
Sigma Xi, and Phi Kappa Phi honor societies.
Maximilien Gadouleau (S’06–M’10) received an
equivalent of the M.S. degree in electrical and com-
puter engineering from Esigelec, Saint-Etienne du
Rouvray, France, in 2004, and the M.S. and Ph.D.
degrees in computer engineering from Lehigh Uni-
versity, Bethlehem, PA, in 2005 and 2009, respec-
tively.
From 2009 to 2010, he was a postdoctoral researcher
at the Université de Reims Champagne-Ardenne,
Reims, France. In 2010, he joined the School of Elec-
tronic Engineering and Computer Science at Queen
Mary, University of London, London, UK as a postdoctoral research assistant.
His current research interests are coding theory, network coding, and cryptog-
raphy, and their relations to combinatorics and graph theory.
Dr. Gadouleau is a member of the IEEE Information Theory Society.
Ying Wang (S’00–M’06) received her Ph.D. degree
in electrical and computer engineering from the Uni-
versity of Illinois at Urbana-Champaign, IL, USA, in
2006. Currently, she is a Staff Engineer with the New
Jersey Research Center of QUALCOMM Corporate
Research and Development at Bridgewater, NJ. Her
research interests include wireless communications,
statistical signal processing, multimedia security and
forensics, and detection and estimation theory.
Bruce W. Suter (S’80–M’85–SM’92) received the
B.S. and M.S. degrees in electrical engineering in
1972, and the Ph.D. degree in computer science in
1988 from the University of South Florida, Tampa,
U.S.A.
In 1998, he joined the technical staff at the U.S.
Air Force Research Laboratory, Rome, New York,
where he was the founding Director of the Center
for Integrated Transmission and Exploitation (CITE).
He has held visiting appointments at Harvard University
and the Massachusetts Institute of Technology.
His current research interests are focused on wireless computing networks
and their applications to signal and image processing. His previous positions
include academic positions at the U.S. Air Force Institute of Technology and the
University of Alabama at Birmingham, together with industrial positions at
Honeywell Inc. and Litton Industries. He is a former associate editor of
IEEE Transactions on Signal Processing and the author of a widely accepted
monograph Multirate and Wavelet Signal Processing.
Dr. Suter is a member of Tau Beta Pi and Eta Kappa Nu. He has received
a number of awards for his engineering and research contributions. These
include the Air Force Research Laboratory (AFRL) Fellow, an AFRL-wide
award for his accomplishments in the theory, application, and implementation
of signal processing algorithms, the Arthur S. Flemming Award, a government-
wide award for his pioneering Hankel transform research, the General Ronald
W. Yates Award for Excellence in Technology Transfer for his patented Fourier
transform processor, and the Fred I. Diamond Award for best laboratory
research publication. He is the author of over a hundred technical publications.
