Abstract-Non-binary low-density parity-check codes have superior communications performance compared to their binary counterparts. However, to be an option for future standards, efficient hardware architectures must be developed. State-of-the-art decoding algorithms lead to architectures suffering from low throughput and high latency. The check node function accounts for the largest part of the decoder's overall complexity. In this paper, a new hardware aware check node algorithm is proposed. It has state-of-the-art communications performance while reducing the decoding complexity. Moreover, the presented algorithm allows for partially or even fully parallel processing of the check node operations, which is not possible with currently used algorithms. It is therefore an excellent candidate for future high throughput hardware implementations.
I. INTRODUCTION
Upcoming standards like 5G will increase the demands on throughput and latency of communication systems. Applications like the Tactile Internet [1] in particular require a significantly reduced latency. At the same time, error free transmission has to be guaranteed, which requires forward error correction schemes. Low-Density Parity-Check (LDPC) codes were first proposed by R.G. Gallager in 1963 [2] and rediscovered by D. MacKay and others in 1996 [3]. In the almost two decades since, a lot of research has been carried out in this field. Today many commercial standards (WiMAX, WiFi, DVB-C2, DVB-S2X, DVB-T2) make use of LDPC codes. Very long binary LDPC codes have been proven to perform close to the Shannon limit. However, when considering short blocks of only a few hundred bits for low latency applications, they suffer from a degradation in communications performance of more than 0.5 dB. The extension of binary LDPC codes to Galois Fields (GF(q)) with q > 2 is a promising approach to solve this problem. Moreover, the symbols of high-order modulation schemes can be directly mapped to the decoder's input symbols. Thus an additional gain in communications performance is observed for systems combining high-order modulation and Non-Binary Low-Density Parity-Check (NB-LDPC) codes [4]. The performance gain of NB-LDPC codes comes at the cost of significantly increased decoding complexity. The decoding can be performed by message passing algorithms like Belief Propagation (BP). However, the complexity increases with the size of the GF(q). A straightforward implementation of the BP algorithm has a complexity of $O(q^2)$ [5]. In recent years several approaches have been proposed to reduce the decoding complexity without sacrificing the communications performance. Algorithms working in the Fourier domain [6] [7] have excellent communications performance, but are still too complex for efficient hardware architectures.
Symbol flipping algorithms [8] [9] generally have low complexity but suffer from heavily degraded communications performance. Approaches based on stochastic decoding [10] [11] have been presented as an alternative decoding method but introduce very high decoding latency. An extension of the well-known binary Min-Sum algorithm to the non-binary domain, called the Extended Min-Sum (EMS) algorithm [12] [13] [14], gives the best compromise between hardware complexity and communications performance. Therefore, in this paper we focus on the EMS algorithm, which is the most promising starting point for efficient architectures.
To achieve the required throughput of today's applications, executing the decoding algorithms in software is not sufficient. Dedicated hardware architectures become mandatory. The largest complexity in the EMS algorithm is the computation of the Check Node (CN). State-of-the-art architectures apply a so-called Forward-Backward (FWBW) scheme [15] to process the CN. A serial calculation is carried out to reduce the hardware cost and to allow for reuse of intermediate results during the computation. However, this scheme introduces high latency and degrades the throughput. This effect increases significantly when the size of the GF(q) or the CN degree grows.
In this paper we propose a new hardware aware algorithm to perform the CN processing within the EMS decoding, which we call Syndrome-Based (SYN) CN processing. To reduce the initially high complexity we present several algorithmic modifications. While achieving slightly better communications performance than state-of-the-art hardware aware decoding algorithms, the SYN CN processing has a lower complexity. Moreover, it allows for increased parallelism of the CN computation. Thus the SYN CN processing is the first hardware aware algorithm for NB-LDPC decoding that enables low latency and high throughput decoder architectures.
The paper is structured as follows. In Section II we review the decoding of NB-LDPC codes making use of the EMS algorithm. Section III describes the state-of-the-art FWBW hardware aware decoding approach. Our new algorithm is presented in Section IV and further optimizations are discussed in Section V. Finally, Section VI presents a comparison with other decoding methods in terms of complexity and communications performance. Section VII concludes the paper.
II. EMS DECODING
This section reviews the EMS algorithm to give an overview of the complete decoding process.
978-1-4799-8078-9/15/$31.00 ©2015 IEEE 22nd International Conference on Telecommunications (ICT 2015)
Let us consider an (N, K) NB-LDPC code over GF(q) where N is the codeword length and K the number of information symbols. The code is defined by a sparse parity check matrix H with N columns, M = N − K rows and elements $h_{m,n}$. The transmitted codeword consists of the codeword symbols $c = (c_1, c_2, \ldots, c_N)$, $c_i \in$ GF(q). The decoder receives the noisy representation of the codeword symbols, $y = (y_1, y_2, \ldots, y_N)$. The decoding process can be partitioned into four main parts: the initialisation, the Variable Node (VN) update, the CN update and the permutations in the Permutation Nodes (PNs). It is important to mention that, in contrast to a binary LDPC decoder, a set of q LLRs instead of a single Logarithmic Likelihood Ratio (LLR) is exchanged between the nodes on a single edge. This fact accounts for the significant increase in complexity compared to binary LDPC decoding.
The first step in the EMS algorithm is the calculation of the symbol LLRs for the received codeword. For each received symbol $y_i$ a set of q LLR values is calculated. Under the assumption that all GF(q) symbols are equiprobable, the initialisation LLRs for VN v are calculated as follows:
The VN function is to sum up all values for a certain Galois field element received from the connected CNs and the corresponding channel information $L_v[q]$. This, however, has to be performed for all q elements of the GF(q). To achieve the same message structure as before (LLR = 0 for the most reliable symbol, increasing LLR values for less reliable symbols), a normalisation of the $U_{vp}$ messages with respect to the most reliable symbol has to be applied at the output of the VN. The next step in the decoding is the permutation according to the parity check matrix H. The permutation of the VN v outputs $U_{vp}$ is defined as:
where $h_{c,v}$ is the Galois field element at row c and column v of H, and $U_{pc}$ represents the input for CN c.
In the CN update $d_c$ edges from $U_{pc}$ are processed. The outputs for CN c are calculated as follows:
with $\sum_{t=1;\, t \neq p}^{d_c} x_t = x$
For every Galois field element x all possible input combinations fulfilling the parity check constraint are evaluated. The parity check constraint is given by the sum of the corresponding Galois field elements. From all valid combinations the one with the highest reliability (smallest LLR) is chosen. Again, this task has to be performed for all q elements of the Galois field. The CN computation is the most complex part of the decoding and has a complexity of $O(q^2)$ when calculated in a straightforward manner with the FWBW scheme [15].
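The CN update described above can be illustrated with a brute-force Python sketch. It is not the paper's implementation: the function name and data layout are hypothetical, the field is assumed to be GF(2^m) with symbols labelled as integers so that field addition is a bitwise XOR, the permutation is assumed to be already applied to the inputs, and the reliability of a combination is assumed to be the sum of its LLRs.

```python
from itertools import product

def cn_update_bruteforce(inputs, p):
    """Brute-force CN output for edge p (illustrative only).

    inputs[t][x] is the LLR of GF symbol x on edge t
    (0 = most reliable, larger = less reliable).
    Assumption: GF(2^m), so adding field elements is a bitwise XOR.
    """
    q = len(inputs[0])
    # All edges except the one we compute the output for.
    others = [u for t, u in enumerate(inputs) if t != p]
    out = [float("inf")] * q
    # Evaluate every combination of one symbol per remaining edge.
    for combo in product(range(q), repeat=len(others)):
        x = 0
        llr = 0.0
        for u, sym in zip(others, combo):
            x ^= sym        # GF(2^m) sum of the chosen symbols
            llr += u[sym]   # reliability of the combination
        # Keep the most reliable combination that sums to x.
        if llr < out[x]:
            out[x] = llr
    return out
```

The loop visits q^(d_c − 1) combinations per output edge, which is why this direct evaluation is only practical for very small fields and node degrees.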
Before one iteration is completed the outputs of CN c must be reverse permuted.
This closes the loop and another iteration starts. The processing of a block is stopped as soon as a valid codeword is detected.
To make this decision the estimated symbols $\hat{x}$ need to be computed for each VN v:
In state-of-the-art EMS decoding one important simplification is applied. As shown in [12], the sets of q messages exchanged between VNs and CNs can be truncated to carry only the $n_m$ most reliable values per edge without sacrificing the communications performance. This approach significantly reduces the implementation complexity and is used for the algorithms presented in the following sections.
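The truncation step can be sketched in a few lines of Python. The function name and the (symbol, LLR) tuple layout are assumptions made for illustration; ties are broken by the symbol label, which the text does not specify.

```python
def truncate_message(llrs, n_m):
    """Keep the n_m most reliable entries of a full q-element message.

    llrs[x] is the LLR of GF symbol x (smaller = more reliable).
    Returns (symbol, llr) tuples sorted by increasing LLR.
    """
    pairs = sorted(enumerate(llrs), key=lambda p: (p[1], p[0]))
    return [(sym, llr) for sym, llr in pairs[:n_m]]
```

For example, `truncate_message([3.0, 0.0, 5.0, 1.5], 2)` keeps symbols 1 and 3, the two entries with the smallest LLRs.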
III. FWBW CN PROCESSING
In this section we review the state-of-the-art decoding method for the CN calculation, the FWBW scheme.
The FWBW method applies a divide and conquer approach to the CN processing. Each CN processes $d_c$ edges at a time following Eq. (4). The FWBW scheme splits the processing into three layers of $d_c - 2$ so-called Elementary Check Nodes (ECNs). Each ECN processes only two edges at a time. Fig. 1 shows the resulting structure for a six input CN. Intermediate results of the ECNs are reused in the later stages, which avoids recomputations. This significantly reduces the problem size and allows for efficient architectures for the ECN [16]. The processing of the ECN is based on the assumption that the input sets are sorted according to their LLR. Therefore the search space for the best elements can be reduced systematically. In [17] the authors present a low complexity scheme for the ECN processing which can be implemented efficiently. However, due to the serial processing it requires $n_m$ clock cycles to perform the ECN task. For the complete CN an additional latency penalty of $2(d_c - 2) + 4$ clock cycles is introduced by the structure shown in Fig. 1; see [16] for a detailed timing analysis. The overall computation time in clock cycles for one CN is then calculated as follows:
This is the drawback of the FWBW approach. With increasing $n_m$ and $d_c$ the processing time for a CN rises and leads to high latency and low throughput of the complete NB-LDPC decoder. Thus the FWBW method for the CN processing is only an option for moderate sizes of the GF ($n_m$ is closely coupled to q) and low Code Rates (CRs), as $d_c$ increases significantly for high rate codes. Moreover, a parallelization of the FWBW processing is hardly possible, which makes low latency decoding infeasible.
IV. SYNDROME-BASED CN PROCESSING
As discussed before, the state-of-the-art way of CN processing with the FWBW scheme has several drawbacks. Architectures making use of the FWBW scheme suffer from low throughput and high latency. Today's approaches to solve these issues are limited to small GF(q)s with q ≤ 16, which have only a small gain in Frame Error Rate (FER) compared to their binary counterparts [14]. Significantly higher communications gains can only be achieved with high-order Galois fields. Therefore we propose a new algorithm, the so-called SYN CN processing, which can also be applied to Galois field sizes of practical interest (q ≥ 64). The SYN CN can be implemented not only serially, but also in a partially or even fully parallel fashion to achieve high throughput and low latency.
The basic structure of the SYN CN processing is depicted in Fig. 2. In the first step of the algorithm the syndromes are calculated. In contrast to the classical syndrome definition, in the following we use the term syndrome for the sum of one (GF(q), LLR) tuple from each input. Since $n_m$ tuples can be chosen for each input U, there is not just one syndrome but a set of syndromes, which we call S. Individual syndromes are distinguished by the elements which are chosen for the sum.
One syndrome SYN is then defined as follows:
The syndrome set S contains all valid syndromes:
Calculating the syndromes in S as the sum of elements over all input edges (Eq. (8) and Eq. (9)) disregards one of the basic concepts of BP algorithms: input and output of the same edge must not be correlated. Thus an additional step in the SYN CN processing is the decorrelation of input and output:
The result is a dedicated syndrome set $S_i$ for every output i, which has no correlation with input i.
Once the $S_i$ sets are computed, they are sorted by their syndrome reliability, represented by the LLRs. This gives direct access to the $n_m$ most reliable syndromes, which constitute the CN output sets $V_{cp}$.
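The three steps (syndrome calculation, decorrelation, sorting) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name is hypothetical, the field is assumed to be GF(2^m) so that adding and subtracting symbols are both a bitwise XOR, decorrelation is assumed to remove the edge's own symbol and LLR from each syndrome, and repeated GF symbols in an output are assumed to be dropped in favour of the most reliable one.

```python
from itertools import product

def syn_cn(inputs, n_m):
    """Unoptimized syndrome-based CN sketch.

    inputs: d_c lists of (symbol, llr) tuples, sorted by increasing LLR.
    Returns d_c output lists with up to n_m (symbol, llr) tuples each.
    """
    d_c = len(inputs)
    # Step 1: syndrome set S, one tuple chosen from every input edge.
    S = []
    for choice in product(*inputs):
        sym, llr = 0, 0.0
        for s, l in choice:
            sym ^= s
            llr += l
        S.append((sym, llr, choice))
    outputs = []
    for i in range(d_c):
        # Step 2: decorrelation, remove the contribution of edge i.
        S_i = [(sym ^ ch[i][0], llr - ch[i][1]) for sym, llr, ch in S]
        # Step 3: sort by reliability and keep the n_m best symbols.
        S_i.sort(key=lambda t: (t[1], t[0]))
        out, seen = [], set()
        for sym, llr in S_i:
            if sym not in seen:  # assumption: one entry per GF symbol
                seen.add(sym)
                out.append((sym, llr))
            if len(out) == n_m:
                break
        outputs.append(out)
    return outputs
```

The sketch builds all n_m^(d_c) syndromes and sorts d_c separate sets, which makes the cost of the unoptimized algorithm visible.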
The proposed algorithm is an alternative to the conventional FWBW processing. It is the first efficient approach for high-order Galois field decoding, allowing for massively parallel implementations and thus high throughput and low latency. However, without special treatment the calculation of the syndrome set S and the sorting of the $S_i$ sets introduce a high complexity. It has to be reduced to make the algorithm attractive for an efficient hardware implementation.
V. COMPLEXITY REDUCTION OF THE SYNDROME-BASED CN PROCESSING
In this section we present an approach to reduce the complexity of the previously introduced SYN CN algorithm. The target is to allow the algorithm to be implemented efficiently in hardware. Therefore we discuss algorithmic modifications that simplify the syndrome set generation and the sorting while maintaining the communications performance.
A. Reducing the syndrome set cardinality:
The first step to optimize is the calculation of the syndrome set S. For the output computation only the most reliable values of S are used, which makes the computation of all other syndromes superfluous. Thus a smart reduction of the cardinality of S, i.e. |S|, can significantly reduce the overall complexity of the algorithm without sacrificing the communications performance.
The first step for a reduction of |S| is the separation of syndromes with high reliability from ones with low reliability. In the following, a concept similar to the configuration sets introduced in [18] [19] is applied to the computation of S. We define $d_c + 1$ deviation sets $D_i$ with $i \in \{0, \ldots, d_c\}$ and separate the syndrome set into sub-sets:
Each set contains only syndromes deviating in exactly i elements from the most reliable element, as shown in Fig. 3. The subset $D_0$ contains only one syndrome, which is the sum of the most reliable elements from all inputs. These sub-sets structure the data in a way that allows for easier access to syndromes with high reliability. Fig. 4 shows the average LLR values of the syndromes in the sorted deviation sets $D_i$. One can observe that the distribution of reliable LLRs depends on the Signal-to-Noise Ratio (SNR). However, syndromes with more than two deviations, i.e. $D_3$ and $D_4$, have such a low reliability that they rarely contribute to the generation of the outputs. Thus we can limit the calculation of sub-sets $D_i$ to the ones with a small number of deviations.
Another parameter for the reduction of |S| is the maximum allowed rank $d_i$ of elements contributing to deviation $D_i$. The maximum allowed rank for a certain deviation can be set dynamically based on the LLR value of the elements, or it can be fixed as a predefined parameter. For each deviation a different maximum rank can be set, see Fig. 3, e.g. the higher the number of allowed deviations, the lower the maximum rank of the deviations.
Combining both proposed techniques strictly reduces the cardinality of S and thus the computational complexity. The most reliable syndromes are calculated and only unreliable ones are removed. The parametrization for the number of deviations and their maximum rank is a critical step in the algorithm. 
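The combined effect of both parameters can be illustrated with the following Python sketch. The function name, the dictionary `max_rank` and the 0-based rank convention (rank 0 is the most reliable element of an edge) are assumptions for illustration; GF(2^m) with XOR addition is assumed as well. Each syndrome also records the edges on which it deviates.

```python
from itertools import combinations, product

def deviation_sets(inputs, max_rank):
    """Build reduced deviation sets D_0, D_1, ... (illustrative).

    inputs: d_c lists of (symbol, llr) tuples sorted by increasing LLR.
    max_rank: {i: highest 0-based rank allowed in D_i} (hypothetical).
    Returns {i: [(symbol, llr, deviating_edges), ...]}.
    """
    d_c = len(inputs)
    # D_0: sum of the most reliable element of every edge.
    base_sym, base_llr = 0, 0.0
    for u in inputs:
        base_sym ^= u[0][0]
        base_llr += u[0][1]
    D = {0: [(base_sym, base_llr, ())]}
    for i, r_max in max_rank.items():
        D[i] = []
        # Choose which i edges deviate, then a rank >= 1 on each.
        for edges in combinations(range(d_c), i):
            for ranks in product(range(1, r_max + 1), repeat=i):
                if any(r >= len(inputs[e]) for e, r in zip(edges, ranks)):
                    continue  # rank not available on this edge
                sym, llr = base_sym, base_llr
                for e, r in zip(edges, ranks):
                    # Swap the best element of edge e for rank r.
                    sym ^= inputs[e][0][0] ^ inputs[e][r][0]
                    llr += inputs[e][r][1] - inputs[e][0][1]
                D[i].append((sym, llr, edges))
    return D
```

With three inputs of three elements each, `max_rank = {1: 2, 2: 1}` yields 1 + 6 + 3 = 10 syndromes instead of the full 27.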
B. Simplifying sorting:
One big drawback of the original SYN CN algorithm presented in Section IV is that every syndrome set $S_i$ must be sorted separately to output the $n_m$ most reliable syndromes. This is due to the decorrelation step applied before. To avoid the sorting of the decorrelated syndrome sets $S_i$, a simple but effective approach can be used. Instead of decorrelating every value, only syndromes using the most reliable element (LLR = 0) from the currently handled edge are considered. All other syndromes are not used for the current edge output. With this approach the order of the syndromes is not changed and it is sufficient to sort S instead of the $d_c$ sets $S_i$. In addition, the LLR values are not modified in the decorrelation step, which saves a real valued subtraction for every output message. Finally, only the most reliable input element and not the complete input sets must be stored for the decorrelation. Each syndrome is annotated with additional information about which of the input edges contributed to the syndrome with a deviation. SRC in Fig. 5 stores the edges where deviations occurred and $ADDR_i$ represents the current edge. A simple comparison evaluates whether a deviation from the current edge was involved in the syndrome calculation and thus whether the syndrome is valid for the current edge. Only if no deviation occurred on the current edge is the decorrelated message marked as valid and used for the output $V_i$.
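The validity check and the subtraction-free decorrelation can be sketched in Python as follows. The function name and the representation of SRC as a tuple of edge indices are hypothetical; GF(2^m) with XOR addition is assumed, and repeated GF symbols are assumed to be dropped in favour of the most reliable one.

```python
def outputs_from_sorted_syndromes(S_sorted, inputs, n_m):
    """Generate all CN outputs from one sorted syndrome list.

    S_sorted: (symbol, llr, src) tuples sorted by increasing LLR, where
              src holds the edges on which the syndrome deviates.
    inputs:   inputs[i][0] is the most reliable (symbol, llr) tuple of
              edge i, with llr == 0.
    """
    outputs = []
    for i, u in enumerate(inputs):
        best_sym = u[0][0]
        out, seen = [], set()
        for sym, llr, src in S_sorted:
            if i in src:
                continue  # deviation on the current edge: not valid
            # Remove the best symbol of edge i; the LLR stays
            # unchanged because that element has LLR 0.
            out_sym = sym ^ best_sym
            if out_sym not in seen:  # assumption: one entry per symbol
                seen.add(out_sym)
                out.append((out_sym, llr))
            if len(out) == n_m:
                break
        outputs.append(out)
    return outputs
```

Because the order of S is preserved, a single sorted list serves all $d_c$ outputs; only the validity test differs per edge.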
Even though the sorting has been reduced to the syndrome set S, there is more potential for simplification. Sorting S can be divided into sorting the deviation sets $D_i$ and merging them. Especially for $D_1$ the sorting can be further simplified. This is possible because of the prior knowledge we have of the input data. We implicitly know that the sets $U_{pc}$ are sorted according to their LLRs. The sorting of $D_1$ is thus limited to merging $d_c$ sorted sets. Performed serially, this is a trivial task that introduces only minimal hardware overhead.
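A software analogue of this merge can be written with Python's standard `heapq.merge`. The sketch assumes GF(2^m) with XOR addition and that the most reliable element of every edge has LLR 0, so that the LLR of a single-deviation syndrome equals the LLR of the deviating element; the function name and the tuple layout are hypothetical.

```python
import heapq

def sorted_d1(inputs):
    """Return D_1 sorted by LLR as (llr, symbol, edge) tuples.

    inputs: d_c lists of (symbol, llr) tuples sorted by increasing LLR,
            with LLR 0 for the first element of each list (assumption).
    """
    base_sym = 0
    for u in inputs:
        base_sym ^= u[0][0]

    def edge_stream(e):
        # Single deviations on edge e, already in increasing LLR order.
        u = inputs[e]
        for sym, llr in u[1:]:
            yield (llr, base_sym ^ u[0][0] ^ sym, e)

    # Merging d_c sorted streams needs no general-purpose sort.
    return list(heapq.merge(*(edge_stream(e) for e in range(len(inputs)))))
```

Each edge contributes an already sorted stream, so only the current head of every stream has to be compared at each step.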
For the higher-order deviations $D_i$ with $i \geq 2$, the sorting can also be simplified because of the sorted input sets. Sorted sub-sets can be generated with little effort and only have to be merged to obtain the final set. An example of the subset generation for $D_2$ with $d_2 = 2$ is given in Fig. 7, which can be extended easily to other deviations and ranks. Once the sub-sets $D_i$ are sorted, the outputs are generated by merging them iteratively as shown in Fig. 6. The proposed algorithmic modifications result in a slightly different data flow, see Fig. 2 and Fig. 8. In summary, three algorithmic transformations were introduced in this section to reduce the complexity:
• Significant reduction of |S|.
• Simplified sorting of S instead of $S_i$.
• No LLR subtractions and no storage for $U_i$ in the decorrelation step.
VI. COMMUNICATIONS PERFORMANCE AND COMPLEXITY COMPARISON
In the following section we discuss the quality of the proposed algorithm in terms of communications performance, and its decoding complexity in terms of required basic operations.
All results for the communications performance are obtained with a bit true C++ model using Binary Phase Shift Keying (BPSK) modulation and an Additive White Gaussian Noise (AWGN) channel. The simulated NB-LDPC code has a codeword length N of 16 GF(64) symbols and is based on the approach presented in [20]. Both the FWBW and the SYN CN make use of truncated vectors of size $n_m = 13$ and perform a maximum of 10 iterations with a two-phase scheduling. A bit true fixed-point model with 8 bits for the LLR representation is implemented. Fig. 9 shows the performance of different implementations of the EMS algorithm compared with an optimal EMS decoding (no message truncation). For the syndrome based CN we have considered up to four deviations.
A second comparison shows the difference with respect to the state-of-the-art hardware aware FWBW decoding. After ten iterations the syndrome based CN computation has a superior performance compared to the FWBW implementation, see Fig. 10 .
To compare the complexity of the proposed algorithm with state-of-the-art decoding methods, Table I lists the number of required basic operations. On the one hand, the SYN CN algorithm requires additional GF(q) additions. In a hardware architecture they can be implemented with simple XOR logic and thus are very cheap in terms of required area. On the other hand, the SYN approach uses only a small fraction of adders.
Regarding future hardware architectures, the proposed algorithm generates a whole new design space. Both serial and partially parallel architectures can be explored, see Table II. A serial processing may achieve higher efficiency in terms of throughput per area than state-of-the-art architectures, but suffers from the same drawbacks, i.e. high latency and low throughput. Another possibility is the parallelization of the syndrome generation and sorting, leading to a high throughput and low latency architecture. First hardware experiments show promising results for the area efficiency of the proposed architectures. In the future, an extensive study on architectures will be carried out.
VII. CONCLUSION
We have presented a new hardware aware algorithm for the CN processing of NB-LDPC decoders. Our investigations have shown slightly better communications performance compared to state-of-the-art hardware aware decoding algorithms. In addition, a comparison of the algorithmic complexity reveals that the proposed SYN CN processing has a lower complexity. The SYN CN algorithm is the first hardware aware CN processing that allows for low latency and high throughput decoder architectures for high-order Galois fields. In the next step we will design hardware architectures based on the proposed algorithm to further explore its potential.
