Abstract-In this paper, we propose a new implementation of the Extended Min-Sum (EMS) decoder for non-binary LDPC codes. A particularity of the new algorithm is that it takes into account the memory problem of non-binary LDPC decoders, together with a significant complexity reduction per decoding iteration. The key feature of our decoder is to truncate the vector messages of the decoder to a limited number $n_m$ of values in order to reduce the memory requirements. Using the truncated messages, we propose an efficient implementation of the EMS decoder which reduces the order of complexity to $\mathcal{O}(n_m \log_2 n_m)$. This complexity starts to be reasonable enough to compete with binary decoders. With proper compensation, the performance of the low complexity algorithm is quite good considering the large complexity reduction, which is shown both with a simulated density evolution approach and with actual simulations.
I. INTRODUCTION

It is now well known that binary low density parity check (LDPC) codes achieve rates close to the channel capacity for very long codeword lengths [1], and more and more LDPC solutions have been proposed in standards (DVB, WiMAX, etc.). In terms of performance, binary LDPC codes start to show their weaknesses when the codeword length is small or moderate, or when higher order modulation is used for transmission. For these cases, non-binary LDPC (NB-LDPC) codes designed in high order Galois fields have shown great potential [2]-[5].
However, the performance gain provided by LDPC codes over GF($q$) comes together with a significant increase of the decoding complexity. NB-LDPC codes can be decoded efficiently with message passing algorithms such as the belief propagation (BP) decoder, but the size of the messages varies in the order $q$ of the field. Therefore, a straightforward implementation of the BP decoder has complexity in $\mathcal{O}(q^2)$. A Fourier domain implementation of the BP decoder is possible, as in the binary case, reducing the complexity to $\mathcal{O}(q \log q)$ [2], [6], but this implementation is only convenient for messages expressed in the probability domain. This is a problem since several authors have identified that the use of a log-density-ratio (LDR) representation is mandatory to avoid complicated operations like multiplications and divisions. Any LDR-based implementation of the BP decoder also requires $q - 1$ values per message in the graph.
A. Voicila, D. Declercq (contact author), F. Verdier, and M. Fossorier are with ETIS, ENSEA/univ. Cergy-Pontoise/CNRS, 6, avenue du Ponceau, F-95000, Cergy-Pontoise, France (e-mail: declercq@ensea.fr, verdier@u-cergy.fr, mfossorier2@yahoo.com).
P. Urard is with STMicroelectronics, Crolles, France (e-mail: pascal.urard@st.com).
Digital Object Identifier 10.1109/TCOMM.2010.05.070096
In this paper, we propose a new decoding algorithm for NB-LDPC codes. Our algorithm has both low computing complexity and reduced storage requirements, and therefore becomes a good solution for hardware implementation.
In one of the algorithms presented in [7], the authors introduced the idea of using only a limited number $n_m$ of reliabilities in the messages at the input of the check node in order to reduce the computational burden of the check node update. The complexity at each check node was reduced to the order of $\mathcal{O}(n_m q)$, and the same memory storage complexity as BP was needed. In this paper, we keep the basic idea of using only $n_m \ll q$ values for the computation of messages, but we extend the principle to all the messages in the Tanner graph, that is, at the input of both the check nodes and the variable nodes. Moreover, we propose to store only $n_m$ reliabilities instead of $q - 1$ for each message. The truncation of messages from $q - 1$ to $n_m$ values has to be done in an efficient way in order to reduce its impact on the performance of the decoder. The truncation technique that we propose is described in detail in Section III, together with an efficient offset correction to compensate for the performance loss. Using the truncated message representation and a recursive implementation of the check node update, we propose a new implementation of the Extended Min-Sum (EMS) decoder whose complexity is dominated by $\mathcal{O}(n_m \log_2 n_m)$, with $n_m \ll q$. This is an important complexity reduction compared to all existing methods [7]-[9]. Our new algorithm is developed in Section IV and a study of its complexity/performance trade-off is presented in Section V. Section VI is dedicated to the non-binary adaptation of the shuffled scheduling for the special class of cycle codes. In Section VII, the robustness of the algorithm to the effects of a finite precision representation of messages is studied. In Section VIII-A, the simulation results verify that the proposed low complexity decoder still performs very close to the BP decoder that we use as benchmark. We conclude the paper in Section VIII-B with a fair comparison between the proposed non-binary decoding algorithm and the binary corrected Min-Sum (MS) algorithm [10] applied to binary irregular LDPC codes, in terms of computational complexity and error performance.
0090-6778/10$25.00 © 2010 IEEE
II. PRELIMINARIES
An NB-LDPC code is defined by a very sparse random parity check matrix $H$, whose components belong to a finite field GF($q$). The matrix $H$ consists of $M$ rows and $N$ columns; the code rate satisfies $R \geq (N - M)/N$, with equality when $H$ has full rank. Decoding algorithms of LDPC codes are iterative message passing decoders based on a factor (or Tanner) graph representation of the matrix $H$ [11]. In general, an LDPC code has a factor graph consisting of $N$ variable nodes and $M$ parity check nodes with various degrees. To simplify the notation, we will only present the decoder equations for isolated nodes with given degrees. We denote by $d_v$ the degree of a symbol node and by $d_c$ the degree of a check node. In order to apply the decoder to irregular LDPC codes, simply let $d_v$ (resp. $d_c$) vary with the symbol (resp. check) index. A single parity check equation involving $d_c$ variable nodes (codeword symbols) $c_t$ is of the form:
$$\sum_{t=1}^{d_c} h_t c_t = 0 \quad \text{in GF}(q), \tag{1}$$
where each $h_t$ is a nonzero value of the parity matrix $H$.
As for binary decoders, there are two possible representations for messages: probability weight vectors or LDR vectors. The use of the LDR form for messages has been advised by many authors who proposed practical LDPC decoders. The LDR values, which represent real reliability measures on the bits or the symbols, are less sensitive to quantization errors due to the finite precision coding of the messages [12]. Also, LDR measures operate in the logarithm domain, which avoids operations that are complicated in terms of hardware implementation, like multiplications or divisions. The following notation will be used for an LDR vector of a random variable taking values in GF($q$):
where
with ( = ) being the probability that the random variable takes on the values
[ ] ∈ ℝ. The log-likelihood-ratio (LLR) messages at the channel output are − 1 dimensional vectors in general denoted by
and are defined by $q - 1$ terms of the type (2). The values of the probability weights depend on the transmission channel statistics. The decoding algorithm that we propose is independent of the channel, and we just assume that a demodulator provides the LLR vector $L_{ch}$ to initialize the decoder. We have applied the NB-LDPC codes to communicate over two types of channels: BI-AWGN and QAM-AWGN.
For the BI-AWGN case, each symbol of the codeword, with value in $\{0, \ldots, q - 1\}$, can be converted into a sequence of $\log_2(q)$ bits
The NB-LDPC iterative decoding algorithms are characterized by three main steps corresponding to the different nodes depicted in Fig. 1: (i) the variable node update, (ii) the permutation of the messages due to the nonzero values in the matrix $H$, and (iii) the check node update, which is the bottleneck of the decoder complexity: the BP operation at the check node is a convolution of the input messages, which makes the computational complexity grow in $\mathcal{O}(q^2)$ with a straightforward implementation.
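To make the quadratic cost of the check node convolution concrete, the following minimal Python sketch combines two probability-domain messages over GF($2^m$). It assumes that field elements are labelled by the integers 0 to $q-1$ so that field addition is a bitwise XOR (a labelling assumption, not stated in the text), and it ignores the permutation by the nonzero entries of the parity check matrix. The function name is hypothetical.

```python
def check_node_convolution(p1, p2):
    """Combine two probability vectors over GF(2^m) at a check node.

    Assumption: element labels are integers 0..q-1 and field addition
    is bitwise XOR (polynomial-basis labelling).
    """
    q = len(p1)
    assert len(p2) == q and q & (q - 1) == 0, "q must be a power of two"
    out = [0.0] * q
    # Double loop over all (a, b) pairs: q * q multiply-accumulate operations.
    for a in range(q):
        for b in range(q):
            out[a ^ b] += p1[a] * p2[b]
    return out


if __name__ == "__main__":
    # A message concentrated on symbol 1 simply shifts the other message.
    print(check_node_convolution([0.0, 1.0, 0.0, 0.0], [0.1, 0.2, 0.3, 0.4]))
```

The nested loop visits every pair of symbols, which is where the $q^2$ growth comes from; the truncated messages studied in this paper shrink both loops.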
We use the following notations for the messages in the graph (see Fig. 1). Let $\{V_{p_i v}\}_{i \in \{0,\ldots,d_v-1\}}$ be the set of messages entering a variable node $v$ of degree $d_v$, and $\{U_{v p_i}\}_{i \in \{0,\ldots,d_v-1\}}$ be the output messages for this variable node. The index '$pv$' indicates that the message comes from a permutation node to a variable node, and '$vp$' is for the other direction. We define similarly the messages $\{U_{p_i c}\}_{i \in \{0,\ldots,d_c-1\}}$ (resp. $\{V_{c p_i}\}_{i \in \{0,\ldots,d_c-1\}}$) at the input (resp. output) of a degree $d_c$ check node.
In [7], the EMS algorithm reduces the complexity of the check node update by considering only the $n_m$ largest values of the messages at the input of the check node. However, the output messages of the check node are still composed of $q$ values. As a consequence, the EMS complexity of a single parity check node varies in $\mathcal{O}(n_m \cdot q)$ and all messages in the graph are stored with their full representation of $q$ real values, which implies high memory requirements.
In this paper, we present a new implementation of the EMS algorithm, whose main originality is to store exactly $n_m \ll q$ values in all vector messages $U$, $V$. As a result, both the memory requirements and the computational complexity are reduced. In the following section we present our procedure to truncate the messages from $q$ to $n_m$ values and discuss the impact on the error correction performance of the decoder.
The compensated-truncated message $B$ then has $(n_m + 1)$ components, and the additional value is seen as a constant real value that replaces the $q - n_m$ missing reliabilities. A full representation of the truncated message $B$ would then be:
This means in particular that
. Let us first analyze a possible solution to compute the value of using normalization of probability messages. We consider P the probability domain representation of the LDR vector 
Remember that A is unsorted while B is sorted, which explains the difference in these two definitions. Because P is a probability weight vector, we have:
A clever way to fix a good value on the scalar compensation is to assume that the truncated message should represent a probability weight vector with a sum equal to one, so that
= 1 is satisfied. The probability weight associated with LDR value is
. The normalization of vector P is then
and finally
As a first remark, we note that the computation of the additional term requires the − ignored values of vector A, and the computation of a non linear function. The non linear function can be expressed in terms of the max * ( 1 , 2 ) operator, used in many papers (e.g. [9] ), and in order to simplify (4), we approximate this operator by:
Equation (4) becomes:
where [ ] is the largest value among the ( − ) ignored values of vector A. By using the approximation (6) we obtain a simple computational formula for the supplementary term , since we just need to truncate the LDR vector A with its ( + 1) largest values instead of its largest values. On the other hand, this approximation introduces a degradation of the error performance of the decoder. The approximation (5) is well known to over-estimate the values of the LDR messages [13] , and needs compensation.
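As an illustration of the operator discussed above, the sketch below implements the usual definition of $\max^*$ (the Jacobian logarithm, $\log(e^{x_1} + e^{x_2})$) next to its $\max$ approximation. The function names are hypothetical and the definition used is the standard one from the literature; the paper's own equation (5) is not reproduced here.

```python
import math


def max_star(x1, x2):
    """Exact Jacobian logarithm: log(exp(x1) + exp(x2)), computed stably."""
    hi, lo = max(x1, x2), min(x1, x2)
    return hi + math.log1p(math.exp(lo - hi))


def max_approx(x1, x2):
    """Low-complexity approximation that keeps only the dominant term."""
    return max(x1, x2)


if __name__ == "__main__":
    for pair in [(0.0, 0.0), (2.0, 1.0), (10.0, 0.0)]:
        gap = max_star(*pair) - max_approx(*pair)
        print(pair, round(gap, 6))
```

The gap between the two functions is largest (log 2) when both arguments are equal and vanishes as they separate, which is why a single global correction can only be a compromise.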
In principle, the compensation of the over-estimation should be different for each message, since the accuracy of approximation (5) depends on the values it is applied to. An adaptive compensation would obviously be too complicated with regard to our goal of proposing a low complexity algorithm. We have therefore chosen to compensate the over-estimation of the additional term globally, with a single scalar offset that is constant for all messages in the graph and for all decoding iterations:
There are several ways of optimizing the value of a global offset correction in message passing decoders. We have chosen to follow the technique proposed in [7] , which consists of minimizing the decoding threshold of the LDPC code, computed with simulated density evolution. Because of the lack of space, we do not discuss in this paper the optimization of the global offset, and we recall that estimated density evolution is just used as a criterion to choose the correction factor and not to compute accurate thresholds.
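The truncation and compensation described in this section can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name and argument names are hypothetical, and the offset is assumed to be subtracted from the largest ignored value, since the equation defining the offset correction is not reproduced here.

```python
def truncate_message(ldr, n_m, offset=0.0):
    """Return (values, indices, compensation) for a truncated LDR vector.

    values / indices: the n_m largest entries of `ldr`, in decreasing order,
    together with their positions in the original (unsorted) vector.
    compensation: the (n_m + 1)-th largest entry minus `offset` (assumed
    sign convention), standing in for every discarded reliability.
    """
    if not 0 < n_m < len(ldr):
        raise ValueError("need 0 < n_m < len(ldr)")
    order = sorted(range(len(ldr)), key=lambda k: ldr[k], reverse=True)
    kept = order[:n_m]
    values = [ldr[k] for k in kept]
    compensation = ldr[order[n_m]] - offset
    return values, kept, compensation


if __name__ == "__main__":
    print(truncate_message([0.5, 4.0, -1.0, 2.5, 3.0], n_m=2, offset=1.0))
```

Only the $n_m + 1$ largest entries of the input are ever needed, which is the practical benefit of the max approximation over the exact normalization.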
IV. DESCRIPTION OF THE ALGORITHM
A. Decoding steps with messages of size $n_m \leq q$
We now present the steps of the EMS decoder that uses compensated-truncated messages of size $n_m$. We assume that the LLR vectors of the received symbols are known at the variable nodes, either stored in an external memory or computed on the fly from the channel measurements.
Using the notations of Fig. 1, the basic steps of the algorithm are:
1) Initialization: the $n_m$ largest values of the LLR vectors are copied in the graph on the $\{U_{v p_i}\}_{i \in \{0,\ldots,d_v-1\}}$ messages.
2) Variable-node update: the output vector messages $\{U_{v p_i}\}_{i \in \{0,\ldots,d_v-1\}}$ (of size $n_m$) sent from a variable node to a check node are computed given all the information propagated from all adjacent check nodes and the channel, except this check node itself.
3) Permutation step: this step permutes the messages according to the nonzero values of $H$ (see (1)). In our algorithm, it just modifies the index vectors and not the message values: each index is multiplied by the corresponding nonzero value of $H$, where the multiplication is performed in GF($q$).
4) Check-node update: for each check node, the values $\{V_{c p_i}\}_{i \in \{0,\ldots,d_c-1\}}$ sent from a check node to a permutation node are defined as the probabilities (expressed in LDR format) that the parity-check equation is satisfied if the variable node is assumed to be equal to the corresponding symbol value.
5) Inverse permutation step: this is the permutation step from check nodes to symbol nodes, so it is identical to step 3), but in the reverse order.
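The permutation steps 3) and 5) only touch the index vectors, which the following sketch illustrates for GF(8). The representation of field elements as polynomial bit masks, the primitive polynomial $x^3 + x + 1$, and both function names are assumptions made for this example; the paper does not specify them.

```python
def gf_mul(a, b, m=3, prim_poly=0b1011):
    """Multiply a and b in GF(2^m), elements given as polynomial bit masks.

    Defaults assume GF(8) with primitive polynomial x^3 + x + 1.
    """
    result = 0
    while b:
        if b & 1:
            result ^= a          # add the current shifted copy of a
        b >>= 1
        a <<= 1
        if a & (1 << m):
            a ^= prim_poly       # reduce modulo the primitive polynomial
    return result


def permute_indices(indices, h, m=3, prim_poly=0b1011):
    """Permutation step: multiply every stored symbol index by h."""
    return [gf_mul(h, k, m, prim_poly) for k in indices]


if __name__ == "__main__":
    forward = permute_indices([0, 1, 2, 3], h=2)
    print(forward)
    # The inverse step multiplies by the inverse of h (5 is the inverse of 2 here).
    print(permute_indices(forward, h=5))
```

Because the reliabilities themselves are left untouched, the cost of this step is one field multiplication per stored index.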
For steps 2) and 4), a recursive implementation combined with a forward/backward strategy is a well known efficient implementation of the node update when the associated degree is larger than four. This implementation technique has been widely presented in the literature for binary LDPC codes, and also for non-binary LDPC codes in [9]. It is based on a decomposition of the node neighborhood using dummy variables and adding corresponding edges that carry intermediate messages, which are named $I$ in this paper. This decomposition makes it possible to express the check or variable node equations using several elementary steps. One elementary step is defined by a node update that assumes only two input messages and one output message. The decomposition of a degree $d_c = 5$ check node and the associated forward/backward scheduling is depicted in Fig. 2. In this figure, 
Since the EMS algorithm only involves linear operations, the terms [ ], [ ] have the same LDR structure as defined in (2).
B. Variable node elementary step
Let us assume that an elementary step describing the variable node update has $V$ and $I$ as input messages and $U$ as output message. The vectors $V$, $I$ and $U$ of size $n_m$ are sorted in decreasing order. We also denote by $\beta_V$, $\beta_I$ and $\beta_U$ their associated index vectors. Using the BP equations in the log-domain for the variable node update [9], the goal of an elementary step is to compute the output vector containing the $n_m$ largest values among the $2 n_m$ candidates (10) (stored in an internal vector message $T$). The processing of the elementary step in the case of a variable node update is described by:
The compensation value is used when the required symbol index is not present in an input message. Whenever the V input corresponds to the LLR channel vector of the received symbol, the equation (10) becomes:
since we do not assume that LLR vectors are truncated/compensated messages.
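A possible reading of the variable node elementary step is sketched below. The candidate rule is inferred from the description above (sum the two reliabilities of a symbol, substituting the compensation value of a message when the symbol is absent from it) and is an assumption, since equation (10) is not reproduced here. All names are hypothetical, and the compensation value and post-processing of the output message are left out.

```python
def variable_node_step(V, beta_V, gamma_V, I, beta_I, gamma_I, n_m):
    """Elementary variable-node update on two truncated messages.

    V, I: reliabilities sorted in decreasing order.
    beta_V, beta_I: symbol indices associated with V and I.
    gamma_V, gamma_I: compensation values used for missing symbols.
    Returns the n_m largest candidates and their symbol indices.
    """
    pos_I = {sym: k for k, sym in enumerate(beta_I)}
    in_V = set(beta_V)
    candidates = []
    # Symbols present in V: add the matching entry of I, or its compensation.
    for val, sym in zip(V, beta_V):
        other = I[pos_I[sym]] if sym in pos_I else gamma_I
        candidates.append((val + other, sym))
    # Symbols present only in I: V contributes its compensation value.
    for val, sym in zip(I, beta_I):
        if sym not in in_V:
            candidates.append((gamma_V + val, sym))
    candidates.sort(key=lambda t: t[0], reverse=True)
    top = candidates[:n_m]
    return [v for v, _ in top], [s for _, s in top]


if __name__ == "__main__":
    print(variable_node_step([4.0, 2.0], [1, 3], 0.5,
                             [3.0, 1.0], [3, 0], 0.25, n_m=2))
```

At most $2 n_m$ candidates are formed, so the cost of the step is governed by the final selection of the $n_m$ largest values.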
C. Low complexity implementation of a check node elementary step
This section describes in detail the algorithm that we propose for an elementary component of the check node. This step is the bottleneck of the algorithm complexity and we discuss its implementation in the rest of the paper. The check node elementary step has $U$ and $I$ as input messages and $V$ as output message. All these vectors are of size $n_m$ and are sorted in decreasing order. Similar to the variable node update, we also denote by $\beta_U$, $\beta_I$ and $\beta_V$ their associated index vectors. Following the EMS algorithm presented in [7], we define 
Just as in the variable node update, when a required index is not present in the truncated vector $U$ or $I$, its compensated value is used in equation (11). Without a particular strategy, the computational complexity of an elementary step is dominated by $\mathcal{O}(n_m^2)$. We propose a low complexity strategy to skim the two sorted vectors $U$ and $I$, which requires a minimum number of operations to produce the sorted values of the output vector $V$. The main component of our algorithm is a sorter of size $n_m$, which is used to fill the output message. For clarity of presentation, we use a virtual matrix built from the vectors $U$ and $I$ (cf. Fig. 3 
Of course, the value of $n_{op}$ depends on the LDR vectors $U$ and $I$, and a strictly valid implementation of the elementary step should take into account the possibility of the worst case. However, we have found that $n_{op}$ is most of the time quite small. As a matter of fact, the distribution of $n_{op}$ has an exponential shape and decreases very rapidly, e.g. $P(n_{op} \leq n_m + 4) = 0.9816$ for a regular GF(256)-LDPC code, $n_m = 32$ and a signal to noise ratio in the waterfall region of the code. Based on this observation, it seems natural to consider that the bad situations with large $n_{op}$ are sufficiently rare that they do not really impact the decoder performance. We have verified this claim by simulations of density evolution and found that using $n_{op} = 2 n_m$ does not change the value of the decoding threshold for various LDPC code parameters. Note that with $n_{op} = 2 n_m$, the output vector $V$ could sometimes be filled with fewer than $n_m$ values, and in those cases we fill the rest of the vector with a constant value equal to the additional term. The worst case for the complexity of an elementary step is then $\mathcal{O}(n_{op} \log_2 n_m) = \mathcal{O}(2 n_m \log_2 n_m)$, which corresponds to the number of max operations needed to insert $n_{op}$ elements into a sorted list of size $n_m$. In the next section, we study in detail the complexity of our new implementation of the EMS algorithm.
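The idea of exploring the virtual matrix with a small sorter can be sketched with a binary heap. This is an illustrative variant under several assumptions, not the authors' exact procedure: the heap is seeded with the first column of the matrix, each extracted entry is replaced by its right-hand neighbour, field addition is taken to be a bitwise XOR of integer labels, compensation values for missing input symbols are ignored, and all names are hypothetical.

```python
import heapq


def check_node_step(U, beta_U, I, beta_I, n_m, n_op, fill_value):
    """Elementary check-node update on two sorted, truncated messages.

    Entry (i, j) of the virtual matrix is U[i] + I[j], attached to the
    symbol beta_U[i] XOR beta_I[j]. At most n_op entries are examined.
    """
    # Seed the sorter with the first column of the virtual matrix.
    heap = [(-(U[i] + I[0]), i, 0) for i in range(len(U))]
    heapq.heapify(heap)
    V, beta_V, seen = [], [], set()
    for _ in range(n_op):
        if not heap or len(V) == n_m:
            break
        neg_val, i, j = heapq.heappop(heap)   # largest remaining candidate
        sym = beta_U[i] ^ beta_I[j]
        if sym not in seen:                   # keep the best value per symbol
            seen.add(sym)
            V.append(-neg_val)
            beta_V.append(sym)
        if j + 1 < len(I):                    # insert the right-hand neighbour
            heapq.heappush(heap, (-(U[i] + I[j + 1]), i, j + 1))
    # Pad with the constant value when fewer than n_m entries were found;
    # None marks padded positions that carry no symbol index.
    missing = n_m - len(V)
    return V + [fill_value] * missing, beta_V + [None] * missing


if __name__ == "__main__":
    print(check_node_step([5, 3, 0], [1, 2, 3], [4, 1, 0], [0, 1, 2],
                          n_m=3, n_op=6, fill_value=-1))
```

Since both inputs are sorted, every row of the virtual matrix is non-increasing, so the heap always holds the best unexplored entry of each row and the extracted values come out in non-increasing order.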
V. COMPLEXITY AND MEMORY EVALUATION OF THE ALGORITHM
The computational complexity per bit of a single parity node and a single variable node is indicated in Table I in terms of their connection degree $d_c$ (resp. $d_v$). This complexity applies to both regular and irregular non-binary LDPC codes, the local value of the connection degree following the connectivity profile of the code. This complexity assumes the use of truncated messages of size $n_m$, and the implementation of the check node update presented in this paper. Note that we indicated the worst case complexity for the check node with = and that the average complexity is often less than that. The complexity associated with the update of vectors $U$ at the variable node output is obtained with a recursive implementation of the variable node, which is used only for connection degrees $d_v \geq 3$. As a result, the complexity of our decoding algorithm is dominated by $\mathcal{O}(n_m \log_2(n_m))$ for both parity and variable node computations. Interestingly, the complexities of a check node and of a variable node are somewhat balanced, which is a nice property that should help an efficient hardware implementation based on a generic processor model. Moreover, one can remark that the complexity of the decoder does not depend on $q$, the order of the field in which the code is considered. Let us again stress the fact that the complexity of our decoder varies in the order of $\mathcal{O}(n_m \log_2(n_m))$ with $n_m \ll q$, which is a great computational reduction compared to existing solutions [7]-[9].
Finally, for a complete characterization of the computational complexity of our non-binary LDPC decoding algorithm, we also reported in Table I the associated complexity of the permutation step ( ) and the complexity of the post-processing ( ).
The memory space requirement of the decoder is composed of two independent memory components: the memory corresponding to the channel messages $L_{ch}$ and the edge memory corresponding to the extrinsic messages $U$, $V$ with their associated index vectors. Storing each LDR value on bits in finite precision would therefore require a total number of * * * ( + log 2 ) bits for the edge memory. Thus, the memory storage depends linearly on $n_m$, which was the initial constraint that we put on the messages.
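As a rough illustration of why the storage grows linearly with the truncation size, the sketch below counts the bits needed for a single message. It assumes that each kept reliability is stored with a fixed number of bits plus a $\log_2 q$-bit symbol index, and that a full LDR message stores $q - 1$ reliabilities without indices. The function names, the 5-bit width used in the example, and the omission of the compensation value are assumptions; the total edge memory would additionally scale with the number of stored edge messages.

```python
def truncated_message_bits(n_m, q, ldr_bits):
    """Bits for one truncated message: n_m (reliability, index) pairs."""
    index_bits = (q - 1).bit_length()   # log2(q) when q is a power of two
    return n_m * (ldr_bits + index_bits)


def full_message_bits(q, ldr_bits):
    """Bits for one full LDR message of q - 1 reliabilities (no indices)."""
    return (q - 1) * ldr_bits


if __name__ == "__main__":
    # Example values: GF(256), 18 kept reliabilities, 5-bit reliabilities.
    print(truncated_message_bits(18, 256, 5), full_message_bits(256, 5))
```

The index overhead of $\log_2 q$ bits per entry is the price paid for storing an arbitrary subset of symbols, and it is quickly amortized when the truncation size is much smaller than the field order.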
Since $n_m$ is the key parameter of our algorithm that tunes the complexity and the memory of the decoder, we now need to study for which values of $n_m$ the performance loss is small or negligible. In order to give a first answer to this question, we have made an asymptotic threshold analysis of the impact of $n_m$ on the threshold value. For a rate $R = 0.5$ LDPC code with parameters $(d_v = 2, d_c = 4)$, Fig. 4 plots the estimated threshold in $(E_b/N_0)$ of our algorithm for different values of $n_m$ and two different field orders, GF(64) and GF(256). In this paper, we do not claim that the EMS algorithm verifies the necessary symmetry conditions that ensure the convergence of density evolution. Therefore, the validity of the threshold values is not proved. However, the estimated thresholds are a good indicator of the decoder behavior when the codeword length is large and the nonzero values in the matrix $H$ are chosen uniformly.
The BP thresholds are equal to $(E_b/N_0) = 0.58$ dB for the GF(64) code and $(E_b/N_0) = 0.5$ dB for the GF(256) code [7]. As expected, the thresholds become better as $n_m$ increases, and can approach the threshold of BP with much less complexity. We can use the plots in Fig. 4 as a first indication for choosing the field order of the LDPC code that corresponds to a given complexity/performance trade-off. Note, however, that this asymptotic study has to be balanced with the girth properties of finite length codes, since it has been identified in [3], [4] that ultra-sparse LDPC codes in high order fields and with high girth have excellent performance.
VI. SPECIAL CASE: FURTHER MEMORY REDUCTION FOR CYCLE CODES
It has been shown that for high order fields $q \geq 64$, the best GF($q$)-LDPC codes decoded with BP should be ultra sparse (cycle codes, $d_v = 2$) [2], [3]. In the EMS implementation, an improved memory space/performance trade-off can be achieved for the decoding of cycle codes by considering a modified scheduling of the decoding steps described in Section IV. We have adapted the shuffled scheduling proposed in [14] to the non-binary case, with the objective of a greater reduction of the storage memory. Note that the adaptation of the shuffled scheduling for NB-LDPC codes has been proposed independently in [15], but the authors did not study the memory reduction that this scheduling implies.
Using a shuffled scheduling allows to store only the messages U in the edge memory, and the intermediate messages I and the messages V can be stored locally in a processing unit. It is therefore possible to consider more than values for the I and V without increasing the storage capacity of the decoder. Let us denote by (respectively ) the number of LDR values that form the truncated versions of messages I (respectively V) inside the processing unit. By construction, the different sizes verify ≤ ≤ . The shuffled scheduling is defined as follows. For each and every check node, let { 1 , . . . , } be the set of variable nodes connected to this check node. The shuffled processing unit takes all incoming messages U that are on the edges of the check node, computes locally the V messages on the same edges with the EMS algorithm, and then updates the U messages that are on the edges of { 1 , . . . , } which are not connected to the current check node. In the case of = 2 LDPC codes, this last step is performed only with the knowledge of the channel LLRs {L } =1,..., . We can consider that the shuffled processing unit works with two types of messages: the external U vectors which determine the dimension of the edge memory and the internal V and I vectors which determine the computational complexity of decoder. Using different values for ( , , ) has then an impact on the trade-off between the overall complexity of the decoder and its performance. We now discuss this advantage of the shuffled scheduling with a comparison with the classical flooding scheduling.
Let us consider a $(d_v = 2, d_c = 4)$ code in GF(256) of length 848 equivalent bits (see Section VIII-A for more details), and let us use truncated messages of size $n_m = 18$ in a flooding implementation of the EMS decoder. We consider the two following cases for a shuffled scheduling, and the corresponding frame error rate simulations are plotted in Fig. 5:
• (a) The same computational complexity for the two schedules. In this case, the size of the vectors V and I is set to 18, and the size of the vectors U is set to 9. This choice corresponds to a memory space reduction of roughly one half, with a small error performance degradation compared to the flooding implementation (Fig. 5, curves B and C).
• (b) The same edge memory space for the two schedules. In this case, the size of the vectors U is kept at 18, but the size of the vectors V and I is increased to 36. The shuffled scheduling provides an improvement of the error performance (Fig. 5, curves A and B), without increasing the memory requirement of the decoder. Of course, this also induces an increase of the algorithm complexity.
As a conclusion, implementing the shuffled scheduling for non-binary LDPC codes has the same advantage of reducing the average number of decoding iterations as the binary shuffled scheduling (see [15] for more details), but also provides additional degrees of freedom for the storage/complexity/performance trade-off of an EMS decoder.
VII. QUANTIZATION OF THE EMS ALGORITHM
Toward a practical hardware implementation, quantization is an indispensable issue that needs to be resolved. The goal of this section is to find the best trade-off between the hardware complexity, the message storage space and the error performance of the EMS algorithm. We investigate only the impact of uniform quantization schemes. The choice of a uniform quantization scheme is motivated by the fact that the hardware implementation of the EMS algorithm does not require nonlinear operations, and that the uniform quantizer has the advantage of being simple and fast.
Let ( , ) represent a fixed-point number with bits for the integer part (dynamic range) and bits for the fractional part. So by fixed-point representation, a real number is mapped to a binary sequence x = [ x 0 . . .
. A direct consequence of the post-processing defined by equation (9) , is that we can use an unsigned fixed-point representation (12) to quantify the LDR messages of the EMS algorithm.
This representation corresponds to a limit range of the LDR values of
with a precision of 2 − . Various schemes ( , ) are examined, in order to find the best tradeoff between the number of quantization bits ( + ) and the error performance degradation of the decoder. The most representative results are summarized in Fig.6 , which presents the simulation results of the EMS algorithm for an LDPC code over GF(64) of rate = 1/2, for two sets of parameters ( , ) = (8, 16) and ( , ) = (16, 32). We remark that a fixed point quantization scheme with = 5 bits provides error performance close to the floating implementation of the EMS algorithm, while all the quantizations having = 4 bits caused an error floor region. It turns out that the apparition of this phenomenon is due to the insufficient dynamic range of the LDR messages [16] .
With the goal of speed and low storage in mind, we advise a quantization of all messages with 5 bits, using 5 bits for the integer part and no bits for the fractional part. This representation of messages provides a balanced trade-off between low storage and good performance. We have conducted the same finite precision study for various rates and code lengths and have observed that this choice is good in all cases. The EMS algorithm then requires only a few quantization bits, close to the fixed-point representation of the extrinsic messages in binary LDPC decoders [18].
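A uniform unsigned fixed-point quantizer of the kind discussed in this section can be sketched as follows. The rounding rule (round to nearest) and the saturation at both ends of the range are assumptions, as is the function name; the default arguments correspond to the recommended 5 integer bits and 0 fractional bits.

```python
import math


def quantize_unsigned(x, int_bits=5, frac_bits=0):
    """Uniform unsigned fixed-point quantization with saturation.

    The representable values are k * 2**-frac_bits for
    k = 0 .. 2**(int_bits + frac_bits) - 1.
    """
    scale = 1 << frac_bits
    max_level = (1 << (int_bits + frac_bits)) - 1
    level = math.floor(x * scale + 0.5)        # round to the nearest level
    level = min(max(level, 0), max_level)      # saturate to the valid range
    return level / scale


if __name__ == "__main__":
    for x in (3.7, 40.0, -2.0):
        print(x, quantize_unsigned(x))
```

With the default setting, every reliability is mapped to an integer between 0 and 31, so large values saturate instead of wrapping around.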
VIII. EXPERIMENTAL RESULTS OF THE EMS DECODER
A. Performance loss compared to the non-binary BP algorithm
In this section, we present the simulation results of our low complexity EMS algorithm, compared with the BP algorithm considered as reference. We have made the comparison with regular GF($q$)-LDPC codes over high order fields, of rate $R = 1/2$ $(d_v = 2, d_c = 4)$, applied on a BPSK-AWGN channel. The BP has been implemented in floating point precision, and a quantization with 5 integer bits and no fractional bits is used for the EMS algorithm, as pointed out in the preceding section. In Fig. 7, we have reported the frame error rate (FER) of a short code with length $N_b = 848$ equivalent bits, corresponding to a length $N = N_b / \log_2(q)$ non-binary LDPC code. The maximum number of iterations has been fixed to 1000, and a stopping criterion based on the syndrome check is used. Note that the average number of decoding iterations is rather low for all the simulation points below = 10 (as an example, the average number of iterations for the (2, 4) GF(64) code at a frame error rate of $6 \times 10^{-4}$ is equal to 3). We denote by $\mathrm{EMS}^{GF(q)}_{n_m, n_{op}}$ the EMS decoder over the field GF($q$) with parameters $n_m$, $n_{op}$ and = .
Let us first discuss the performance of the EMS decoder with respect to the BP decoder. For the code over GF(64), the $\mathrm{EMS}^{GF(64)}_{8,16}$ is the least complex algorithm presented. It performs within 0.25 dB of the BP decoder in the waterfall region. The $\mathrm{EMS}^{GF(64)}_{16,32}$ algorithm has a 0.06 dB performance loss in the waterfall region and performs even better than the BP decoder in the error floor region. The fact that the EMS can beat the BP decoder in the error floor is not surprising and is now well known in the literature. This behavior comes from the fact that for small code lengths, an EMS algorithm corrected by an offset could be less sensitive to pseudo-codewords than the BP.
Note that with this example, the only advantage of using a GF(256) code in terms of performance/complexity trade-off is that it provides an error floor region lower than the 16,32 with a gain of 0.19 dB. The good performance of the GF(64) code in the waterfall region is determined by the value of the $n_{op} = 32$ parameter, which is sufficiently close to the field order to provide a good threshold. At low FER, the performance gap between the two codes becomes smaller, which seems to indicate that the GF(256) LDPC code will perform better than the GF(64) LDPC code at very low FER (FER $< 10^{-7}$), without increasing the decoder complexity. Note that this observation balances the conclusions of Section V, and stresses another advantage of considering very high order field non-binary LDPC codes. Moreover, the EMS is quite robust, since the complexity reduction from $q = 256$ to $n_{op} = 32$ is a lot higher than from $q = 64$ to $n_{op} = 32$, and the performance loss stays acceptable. Note that the other approaches proposed in the literature [8], [9] were not illustrated on high order fields and that, to our knowledge, the EMS decoder is the first decoder that proposes a good performance/complexity trade-off for field orders $q \geq 64$.
In order to quantify the influence of the offset parameter on the decoder's performance, we have also reported in Fig. 7 the simulation results of the EMS decoder in the particular case when the offset is zero (EMS without offset). We remark that the error performance of the $\mathrm{EMS}^{GF(256)}_{16,32}$ algorithm is greatly improved by using a proper offset, and that its influence is less significant in the case of $\mathrm{EMS}^{GF(64)}_{8,16}$. Generally, the influence of the offset parameter on the error performance of the decoder depends on the loss of information induced by the truncation procedure ($q - n_m$). If the difference $q - n_m$ is non-negligible, the use of a proper offset is recommended.
For lack of space, we present only the results for the code/decoder parameters of Fig. 7, but we have conducted extensive simulations for various other code/decoder parameters and the same kind of behavior has been observed. As seen in the results presented in this section, the error performance of a hardware implementable version of the EMS is quite close to the performance of the floating point BP algorithm. Its good performance and its reduced complexity and memory space requirement make the EMS algorithm a good candidate for the hardware implementation of non-binary LDPC decoders.
In order to improve the performance of the decoder without sacrificing much complexity, it would be interesting to study more precisely whether the performance degradation compared to BP comes from the truncation of the messages or from the approximate operator used at the check node update. A correction strategy more elaborate than a single offset correction (dynamic offset along the iterations, nonlinear correction, etc.) could be more effective on either approximation.
B. Comparison with binary decoders
The main idea of this section is to compare, in terms of computational complexity and error performance, the proposed EMS algorithm to its binary equivalent, the corrected Min-Sum (MS) algorithm [10]. The complexity of the corrected MS algorithm for a single check node of degree $d_c$ is equal to: $3(d_c - 2)/d_c$ operations per bit, $(2d_c - 1)/d_c$ operations per bit to compute the sign of the output, and 2 real additions that correspond to the correction operation. Also, for a bit node of degree $d_v$ the complexity is equal to $(2d_v - 1)/d_v$ real additions per bit. For a fair comparison of computational complexity, we have decided to compare only the operations that are common to both algorithms. We thus compare the number of operations of the EMS algorithm (see Table I) with the operations of the MS algorithm, and the number of real additions required by the two algorithms (per iteration). The operations specific to each algorithm are not taken into account in the complexity comparison (the additions over GF($q$) for the EMS algorithm and the sign computation for the MS).
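As a rough illustration of the per-bit counts quoted above, the following Python sketch evaluates the three expressions for given node degrees. The function name, the dictionary keys, and the degrees used in the example are hypothetical, and `d_c` and `d_v` are assumed names for the check node and bit node degrees. The 2 real additions of the correction step are left out, since the text does not say how they are normalized per bit.

```python
from fractions import Fraction

def corrected_ms_counts_per_bit(d_c, d_v):
    """Evaluate the per-bit operation counts quoted in the text.

    d_c and d_v are assumed names for the check node and bit node degrees.
    Exact fractions are used so that no rounding is introduced.
    """
    return {
        # 3(d_c - 2)/d_c operations per bit at a check node
        "check_ops": Fraction(3 * (d_c - 2), d_c),
        # (2 d_c - 1)/d_c operations per bit for the sign of the output
        "sign_ops": Fraction(2 * d_c - 1, d_c),
        # (2 d_v - 1)/d_v real additions per bit at a bit node
        "bit_node_additions": Fraction(2 * d_v - 1, d_v),
    }

# Illustrative degrees only; they are not taken from the codes of [17].
counts = corrected_ms_counts_per_bit(d_c=6, d_v=3)
```

Since the binary codes considered here are irregular, an actual comparison would presumably average these quantities over the degree distribution of the code.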
The comparison has been made for short and moderate code lengths over BI-AWGN and QAM-AWGN channels. The choice of the code length is motivated by the fact that non-binary LDPC codes can achieve performance very close to the Shannon limit for these lengths. The binary codes that we used are from [17]: irregular codes of 504 bits (short length) and 1008 bits (moderate length), with code rate $R = 0.5$. The corresponding non-binary codes have equivalent lengths of 84 symbols over GF(64) (short length) and 126 symbols over GF(256) (moderate length). The non-binary codes are regular ($d_v = 2$, $d_c = 4$), with code rate $R = 0.5$.
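The equivalence between the binary and non-binary code lengths can be checked with a few lines of Python. The sketch below is only a sanity check: the helper names are hypothetical, and the rate expression is the standard design rate $1 - d_v/d_c$ of a regular code, which assumes a full-rank parity-check matrix.

```python
from fractions import Fraction

def bits_per_symbol(q):
    """Number of bits carried by one GF(q) symbol, for q a power of two."""
    m = q.bit_length() - 1
    if q < 2 or (1 << m) != q:
        raise ValueError("q must be a power of two")
    return m

def equivalent_binary_length(num_symbols, q):
    """Length in bits of a code of num_symbols symbols over GF(q)."""
    return num_symbols * bits_per_symbol(q)

def design_rate(d_v, d_c):
    """Design rate of a regular (d_v, d_c) code, assuming a full-rank matrix."""
    return 1 - Fraction(d_v, d_c)

short_bits = equivalent_binary_length(84, 64)       # GF(64): 6 bits per symbol
moderate_bits = equivalent_binary_length(126, 256)  # GF(256): 8 bits per symbol
rate = design_rate(2, 4)
```

With these values, the short and moderate non-binary codes correspond to 504 and 1008 bits, and the regular code has design rate 1/2, matching the binary codes used in the comparison.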
In Fig. 8, we have reported the frame error rate (FER) of binary and non-binary short length codes. We denote by EMS GF(q) the EMS decoder over the field GF($q$) with parameters = = = . Let us first discuss the performance of the EMS algorithm with respect to the corrected MS algorithm. The EMS GF(64) 18 algorithm performs better than the corrected MS with a gain of 0.375 dB in the waterfall region. Furthermore for a smaller value of algorithm, which has a complexity equivalent to the binary decoder. The loss of performance in the waterfall region is explained by the small value of $n_m = 6$ (approximately 10% of $q$), which is not sufficiently close to the field order to provide a good threshold.
For short code lengths, the EMS GF(64) 18 and EMS GF(64) 12 decoders have better error performance than the MS decoder on a very good binary LDPC code (for this rate and length), and at the same time the complexity of our non-binary decoder remains reasonably close to the complexity of the binary decoder.
Over QAM-AWGN channels, non-binary LDPC codes with a field order greater than or equal to the size of the constellation have the advantage that the encoder/decoder works directly with symbols. All mapping choices of the codeword symbols to the constellation points are equivalent and lead to the same performance. This means that there is no loss of performance due to the demapping process at the receiver. This is a clear advantage compared to binary codes. In Fig. 9, we have plotted the simulation results of the EMS algorithm and the binary MS algorithm for the moderate length codes, over a 256-QAM-AWGN channel. We have used a Bit-Interleaved Coded Modulation scheme to transmit the binary code over the 256-QAM-AWGN channel and a field order $q = 256$ for the non-binary LDPC codes. Note that the non-binary LDPC codes have been optimized with the technique described in [4].
Fig. 9. Comparison between EMS decoding algorithm and binary MS algorithm, for an LDPC code (R=0.5, 1008 bits) over 256-QAM-AWGN channel.

Over the 256-QAM-AWGN channel, the EMS GF(256) 36 algorithm performs 0.5 dB better than the corrected MS algorithm, which is quite an important improvement. Concerning the complexity comparison, the EMS algorithm has approximately 25 times the complexity of the binary algorithm. The EMS GF(256) 6 and EMS GF(256) 12 algorithms have a performance loss in the waterfall region due to the small value of $n_m$. The EMS GF(256) 6 has roughly the same complexity as the MS decoder. As in the BI-AWGN channel case, the EMS decoder on non-binary LDPC codes performs better than the MS algorithm on binary LDPC codes, with a reasonable increase in complexity. Our efficient decoder shows that non-binary LDPC codes could be a reliable alternative for coding schemes with short to moderate codeword lengths. Note that the EMS decoder converges quite fast, since the average number of decoding iterations when a syndrome stopping criterion is used is typically half that of the binary case. For example, with ($q = 64$, $n_m = 18$) at FER $= 10^{-5}$, the average number of iterations for the EMS algorithm is equal to 3.3, while for its binary equivalent (Min-Sum) the average number of iterations is 6.8. This remark remains valid in the case of a 256-QAM-AWGN transmission, where for the EMS GF(256) 36 algorithm (Fig. 9) the average number of iterations is equal to 5 at FER $= 10^{-5}$, while for the Min-Sum algorithm the average number of iterations is approximately 9.5.
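To make the role of the syndrome stopping criterion concrete, here is a minimal Python sketch for the binary case, in which decoding stops as soon as the hard decisions satisfy every parity check. All names are hypothetical, and the `step` callback stands in for one decoder iteration; it is not the EMS or MS update itself. Over GF($q$), the check sums would instead be computed in the field, with the non-zero coefficients of the parity-check matrix.

```python
def syndrome_is_zero(parity_checks, hard_decisions):
    """Return True when every parity check is satisfied (binary case).

    parity_checks is a list of checks, each a list of bit indices.
    """
    return all(
        sum(hard_decisions[i] for i in check) % 2 == 0
        for check in parity_checks
    )

def decode_with_early_stop(step, state, parity_checks, max_iterations=50):
    """Run a decoder until the syndrome is zero or the budget is exhausted.

    step is a hypothetical callback performing one decoding iteration and
    returning (new_state, hard_decisions).
    """
    hard = None
    for iteration in range(1, max_iterations + 1):
        state, hard = step(state)
        if syndrome_is_zero(parity_checks, hard):
            return hard, iteration
    return hard, max_iterations

# Toy stand-in for a decoder: it replays a fixed sequence of hard decisions.
checks = [[0, 1, 2], [1, 2, 3]]
replay = [[1, 0, 0, 0], [1, 1, 0, 0], [1, 1, 0, 1]]

def replay_step(index):
    return index + 1, replay[index]

decoded, iterations_used = decode_with_early_stop(replay_step, 0, checks, 3)
```

Averaging the returned iteration count over many frames gives the kind of figure reported above.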
IX. CONCLUSION
We have presented in this paper a general low complexity decoding algorithm for non-binary LDPC codes, using log-density-ratios as messages. The main originality of the proposed algorithm is to truncate the vector messages to a fixed number of values $n_m \ll q$, in order to solve the complexity problem and to reduce the memory requirements of non-binary LDPC decoders. We have also shown that by using a correction method for the messages, our EMS decoding algorithm can approach the performance of the BP decoder and in some cases even beat it. The complexity of the proposed algorithm is dominated by $\mathcal{O}(n_m \log_2(n_m))$. For values of $n_m$ providing near-BP error performance, this complexity is smaller than the complexity of the BP-FFT decoder, and far lower than that of the other solutions proposed in the literature. Note that the single parameter $n_m$ tunes both the computational complexity and the memory requirements, and thus efficiently sets the performance/complexity trade-off. We have also proposed a non-binary adaptation of the shuffled scheduling in order to introduce a new degree of freedom in the algorithm, which allows a reduction of the memory requirements for cycle codes.
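The truncation idea summarized above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `n_m` is an assumed name for the number of values kept, a larger message value is assumed to mean a more reliable symbol, the helper names are hypothetical, and the offset correction is not modeled. The second helper compares growth terms only, assuming the commonly quoted $q \log_2 q$ growth for FFT-based BP and ignoring all constant factors.

```python
import heapq
import math

def truncate_message(ldr_values, n_m):
    """Keep the n_m largest entries of a length-q message.

    Returns (value, symbol_index) pairs sorted by decreasing value.
    Assumption: a larger value means a more reliable symbol.
    """
    indexed = ((value, symbol) for symbol, value in enumerate(ldr_values))
    return heapq.nlargest(n_m, indexed)

def dominant_term(n):
    """Growth term n * log2(n), with all constant factors ignored."""
    return n * math.log2(n)

# Toy message over a hypothetical field of order q = 8, truncated to n_m = 3.
message = [0.0, -3.5, -0.2, -7.1, -1.0, -4.4, -0.9, -2.0]
kept = truncate_message(message, n_m=3)

# Growth terms for a GF(64) decoder that keeps 18 values per message.
ems_term = dominant_term(18)
fft_term = dominant_term(64)  # assumed q * log2(q) growth for BP-FFT
```

Storing each kept value together with its symbol index reflects the fact that a truncated message no longer covers the whole field, so the decoder has to remember which symbols the retained values refer to.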
We have compared the error performance of our algorithm with the non-binary BP and binary corrected MS algorithms, in order to demonstrate that the proposed low complexity, low memory EMS decoding algorithm is a good candidate for a hardware implementation. Since its complexity and memory requirements have been greatly reduced and the performance degradation is small or negligible, the EMS algorithm applied to non-binary LDPC codes built over very high order fields could be an alternative to existing solutions.
Although the EMS algorithm could be applied to irregular LDPC codes as described in this paper, an interesting issue would be to study if the number of values kept in messages needs to be optimized with respect to the degree of the variable nodes. This issue is of particular importance since good irregular LDPC codes are usually more dense than regular ones, increasing thereby the memory requirements for message storage.
