The first task of non-binary (NB) decoders [low-density parity check (LDPC) or turbo] is to generate the log-likelihood ratio (LLR) of the received NB symbols defined over Galois fields GF(q), q > 2. In the extended min-sum decoding algorithm, the intrinsic information associated with a given received symbol is a sorted list of the $n_m$ most reliable NB GF symbols along with their associated reliability values. In this letter, we present a fully parallel LLR generation algorithm, an enabler for very high throughput decoding that processes one received symbol per clock cycle. We provide complexity figures for LLR architectures designed over GF(q) of sizes 64, 256, and 1024, as well as different values of $n_m$. Compared with the related state-of-the-art architecture, field-programmable gate array (FPGA) synthesis results show that the proposed parallel architecture improves the hardware efficiency by a factor ranging from 7 up to 15.
I. INTRODUCTION
Binary Low-Density Parity-Check (LDPC) codes [1] and binary Turbo-Codes [2] have been adopted in many communication standards such as WiMAX, WiFi, DVB-C, DVB-S2X, and DVB-T2, among others. Nevertheless, with short codewords, the performance of binary codes starts to degrade compared to the theoretically achievable performance. A solution to mitigate this problem is to replace binary codes by Non-Binary (NB) codes defined over GF(q), $q = 2^m$ [3], [4]. NB-LDPC codes are efficient for short and moderate codeword lengths [3]. This high error correction capability is obtained thanks to higher girths, which are an inherent feature of NB-LDPC codes. These characteristics position NB-LDPC codes as serious competitors of classical binary LDPC and Turbo-Codes in future wireless communication and digital video broadcasting standards. However, this competitive edge does not come for free; it entails a high computational complexity that makes their hardware implementation a very challenging task. In [5] and [6], the authors propose the Extended Min-Sum (EMS) algorithm to decode NB-LDPC codes with a reduced complexity. The principle of the EMS is to truncate the LLR messages associated with a symbol from $q$ (the Galois field size) down to $n_m$ ($n_m \ll q$). The Bubble Check algorithm [7] reduces the check node (CN) complexity from $O(q^2)$ down to $O(4 n_m)$.
The first step in the EMS algorithm is the generation of the $n_m$ most reliable candidate GF(q) symbols. Feeding the decoder with $n_m$ symbols in one clock cycle is a necessity for highly parallel decoder architectures. In this letter, we propose a parallel LLR generation architecture able to generate a predefined number, $n_m$, of symbols at each clock cycle in the specific case where the channel generates binary LLRs (Binary Phase Shift Keying (BPSK) modulation, for example). Compared to the state of the art [8], the proposed architecture offers a hardware efficiency gain factor ranging from 7 to 15.
The rest of this letter is organized as follows. Section II reviews the basic notions and definitions related to the LLR computation. Section III discusses the proposed parallel architecture and Section IV presents the synthesis results. Finally, Section V concludes the letter.
II. DEFINITION OF THE LOG-LIKELIHOOD RATIO
In this letter, we consider the transmission of a non-binary codeword over GF(q), $q = 2^m$, using BPSK modulation over an Additive White Gaussian Noise (AWGN) channel of noise variance $\sigma^2$. A GF symbol $X$ of the codeword is represented by a binary vector of size $m$, i.e., $X = (x_0, x_1, \ldots, x_{m-1})$. Each coordinate of this vector is modulated as $B(x_i) = (-1)^{x_i}$, $i = 0, 1, \ldots, m-1$, and transmitted over the AWGN channel. At the receiver side, the received message is $R = (r_0, r_1, \ldots, r_{m-1})$, where $r_i = B(x_i) + w_i$ and the $w_i$, $i = 0, 1, \ldots, m-1$, are independent realizations of a Gaussian variable $\mathcal{N}(0, \sigma^2)$. Following [9], and knowing the received vector $R$ and that the GF symbols are equiprobable, we define the LLR $L(X)$ of a GF symbol $X$ as
$$L(X) = \log \frac{P(R \mid \bar{X})}{P(R \mid X)}, \tag{1}$$
where $\bar{X}$ represents the hard decision over $R$, i.e., $\bar{x}_i = 1$ if $r_i < 0$, 0 otherwise, $i = 0, 1, \ldots, m-1$. The expression (1) can be developed as
$$L(X) = \sum_{i=0}^{m-1} \log \frac{P(r_i \mid \bar{x}_i)}{P(r_i \mid x_i)}. \tag{2}$$
Since, in the AWGN channel,
$$P(r_i \mid x_i) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(r_i - B(x_i))^2}{2\sigma^2} \right), \tag{3}$$
each term of (2) is zero when $x_i = \bar{x}_i$ and equals $2|r_i|/\sigma^2$ otherwise.
Thus,
$$L(X) = \sum_{i=0}^{m-1} \left| \frac{2}{\sigma^2} r_i \right| \Delta(x_i, \bar{x}_i), \tag{4}$$
with $\Delta(a, b) = 0$ if $a = b$, 1 otherwise. In the following, the LLR value $\frac{2}{\sigma^2} r_i$ in (4) will be denoted by $y_i$. Note that the decision value $\bar{y}_i$ on $y_i$ remains equal to $\bar{x}_i$.
Example: let $Y = (-6, +9, -2, +12, +11, -7)$ be the received quantized LLR values of a symbol over GF(64). Then $\bar{Y} = (1, 0, 1, 0, 0, 1)$. Let $X_0 = (0, 0, 0, 0, 0, 0)$, then $L(X_0) = 6 + 2 + 7 = 15$, since $X_0$ differs from $\bar{Y}$ in positions 1, 3 and 6. The $n_m = 5$ best possible decisions based on $Y$ are the couples $U_k = (U_k^\oplus, U_k^+)$, $k = 0, \ldots, 4$, each made of a GF symbol $U_k^\oplus$ and its LLR value $U_k^+$, listed by increasing LLR.
In this example, $U_0 = ((1, 0, 1, 0, 0, 1), 0)$, $U_1 = ((1, 0, 0, 0, 0, 1), 2)$, $U_2 = ((0, 0, 1, 0, 0, 1), 6)$, $U_3 = ((1, 0, 1, 0, 0, 0), 7)$ and finally $U_4 = ((0, 0, 0, 0, 0, 1), 8)$.
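The example above can be checked with a short exhaustive search. This is only a reference sketch, assuming small $m$ so that all $2^m$ symbols can be enumerated; the function name `best_symbols` is hypothetical, and this is not the parallel architecture proposed in the letter.

```python
from itertools import product

def best_symbols(y, n_m):
    """Return the n_m (symbol, LLR) couples with the smallest LLR, by exhaustive search."""
    # Hard decision on each binary LLR: 1 if negative, 0 otherwise.
    hard = tuple(1 if v < 0 else 0 for v in y)
    scored = []
    for x in product((0, 1), repeat=len(y)):
        # LLR of x: sum of |y_i| over the positions where x differs from the hard decision.
        llr = sum(abs(v) for v, xi, hi in zip(y, x, hard) if xi != hi)
        scored.append((llr, x))
    scored.sort()  # ties, if any, are broken by symbol value (an arbitrary choice)
    return [(x, llr) for llr, x in scored[:n_m]]

Y = (-6, 9, -2, 12, 11, -7)
U = best_symbols(Y, 5)
for symbol, llr in U:
    print(symbol, llr)
```

Such a brute-force reference is useful mainly as a golden model when testing a hardware description of the generator.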
III. PROPOSED ARCHITECTURE
This section describes the proposed parallel architecture, which generates the $n_m$ most reliable LLR values with their associated GF symbols in parallel. The key idea is to work in three steps. The first step is to replace the received unstructured set of LLRs $Y = (y_0, y_1, \ldots, y_{m-1})$ by an equivalent structured set $S^+ = (s_0^+, s_1^+, \ldots, s_{m-1}^+)$ containing the absolute values of $Y$ sorted in ascending order. The second step is to find the set $\hat{I}$ of the $n_m$ smallest LLR values along with their associated GF values from the structured set $S^+$. Finally, the last step is to transform the set of solutions $\hat{I}$ of the structured problem back to the set of solutions $U$ of the initial problem. To ease the reading of the letter, GF symbols and vectors in the structured domain are denoted with a hat.
A. From Unstructured to Structured Channel Observations
The first step is then to sort the absolute values of the received LLRs $|y_i|$, $i = 0, \ldots, m-1$, using the odd-even sorting algorithm [10] of size $m$ (other types of sorting algorithm can also be selected). Fig. 1(a) shows the symbol used to represent a Comparator-Swap (CS), while the one shown in Fig. 1(b) represents a simple Comparator (C). Fig. 2 shows the fully parallel pipelined sorter that receives the absolute values $|y_i|$, sorts them, and generates the couples $(s_i^+, \pi(i))$, $i = 0, \ldots, m-1$, with $s_i^+ = |y_{\pi(i)}|$.
The permutation $\pi$ is used later in step 3 to transform the solution of the structured problem back to the solution of the initial problem.
With the example of the previous section, from the set of binary LLRs $Y = (-6, +9, -2, +12, +11, -7)$, we obtain the set $S^+ = (2, 6, 7, 9, 11, 12)$ and the associated permutation $\pi = (2, 0, 5, 1, 4, 3)$.
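The sorting step can be illustrated in software with a compare-swap network that carries each index along with its magnitude. The sketch below uses an odd-even transposition network, chosen here for brevity; the exact network of [10] used in Fig. 2 may differ, and the names `compare_swap` and `odd_even_sort_abs` are hypothetical.

```python
def compare_swap(a, b):
    """Comparator-swap (CS): return two (value, index) couples in ascending order of value."""
    return (a, b) if a[0] <= b[0] else (b, a)

def odd_even_sort_abs(y):
    """Sort |y_i| in ascending order with an odd-even transposition network.

    Returns (s_plus, pi) such that s_plus[i] == abs(y[pi[i]]).
    """
    m = len(y)
    couples = [(abs(v), i) for i, v in enumerate(y)]
    for layer in range(m):                 # m layers are sufficient for m inputs
        start = layer % 2                  # alternate between even and odd pairs
        for j in range(start, m - 1, 2):   # CSs within a layer are independent (parallel in hardware)
            couples[j], couples[j + 1] = compare_swap(couples[j], couples[j + 1])
    s_plus = [c[0] for c in couples]
    pi = [c[1] for c in couples]
    return s_plus, pi

S_plus, pi = odd_even_sort_abs((-6, 9, -2, 12, 11, -7))
print(S_plus, pi)
```

Each pass of the outer loop corresponds to one layer of comparators, so in a pipelined implementation the loop depth maps to the number of pipeline stages.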
B. Solution of the Structured Problem
The first step is to compute the LLR values associated with a predefined set of symbols $\hat{J}^\oplus$, where $\hat{J}^\oplus$ is defined offline in such a way that the $n_m$ most reliable intrinsic symbols for any $S^+$ are always computed. The design of $\hat{J}^\oplus$ takes advantage of the fact that the LLR values of $S^+$ are positive and sorted. Then, the LLR values $\hat{J}^+$ of $\hat{J}^\oplus$, computed from $S^+$, are sorted in increasing order and the set of solutions $\hat{I} = (\hat{I}^\oplus, \hat{I}^+)$ of the structured problem is simply extracted as the $n_m$ smallest values of $\hat{J}^+$.
[...] and, by construction, $s_1^+ \le s_2^+$, thus $\hat{A}^\oplus$ will always be selected before $\hat{B}^\oplus$. The notion of dominance can be defined formally by the existence of an injective function $\pi$ between the set of indices $\phi_{\hat{A}^\oplus}$ of the positions of 1s in $\hat{A}^\oplus$ and the set of indices $\phi_{\hat{B}^\oplus}$ of the positions of 1s in $\hat{B}^\oplus$ so that, for all $e \in \phi_{\hat{A}^\oplus}$, $\pi(e) \ge e$. The explicit construction of $\pi$ is not described due to lack of space, but it can be constructed offline using a simple greedy algorithm.
The relation of dominance gives only a partial order over the set of binary vectors of size $m$. For example, $\hat{C}^\oplus = (0, 0, 1)$ and $\hat{D}^\oplus = (1, 1, 0)$ can give either $L(\hat{C}^\oplus) > L(\hat{D}^\oplus)$ or $L(\hat{C}^\oplus) < L(\hat{D}^\oplus)$ (for example, with $S^+ = (2, 3, 12)$ and $S^+ = (2, 3, 4)$, respectively). For given values of $m$ and $n_m$, it is possible to generate formally the set $\hat{J}_m^\oplus(n_m)$ of all vectors of size $m$ that are dominated by at most $n_m$ vectors (note that a vector is dominated by itself). Thus, for any realization of $S^+$, the set $\hat{J}_m^\oplus(n_m) = \{\hat{J}_0^\oplus, \hat{J}_1^\oplus, \ldots, \hat{J}_{n_J - 1}^\oplus\}$, of cardinality $n_J$, is guaranteed to contain the $n_m$ binary vectors of size $m$ associated with the $n_m$ smallest LLR values. Note that the proposed notion of dominance has already been presented in a different context in [11] to define the position of frozen bits in the construction process of polar codes. Table I shows all the elements of the set $\hat{J}_6^\oplus(12)$, which constitute all the possible candidates needed to extract $n_m = 12$ symbols over GF(64). In this case, the set $\hat{J}_6^\oplus(12)$ is of cardinality $n_J = 17$. Fig. 3 shows the evolution of $n_J$ as a function of $n_m$ for several values of $m$. It is important to note that $n_J$ is greater than $n_m$, but less than $2 n_m$ for $m \le 10$, i.e., it increases almost linearly with $n_m$ for practical values of $n_m$.
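The construction of the candidate set can be sketched as follows. Vectors are represented by their supports (tuples of the positions of 1s), and the injective-function condition is tested by a greedy pairing of the largest positions; this pairing is one possible realization of the greedy algorithm mentioned in the text, not necessarily the authors' own. The names `dominates` and `candidate_set` are hypothetical.

```python
from itertools import combinations

def dominates(a, b):
    """True if support a dominates support b (tuples of the positions of 1s)."""
    if len(a) > len(b):
        return False  # no injective map from a into b can exist
    # Greedy pairing: the j-th largest position of a must not exceed the j-th largest of b.
    return all(x <= y for x, y in zip(sorted(a, reverse=True), sorted(b, reverse=True)))

def candidate_set(m, n_m):
    """Supports of all m-bit vectors dominated by at most n_m vectors (themselves included)."""
    supports = [c for k in range(m + 1) for c in combinations(range(m), k)]
    return [b for b in supports
            if sum(dominates(a, b) for a in supports) <= n_m]

J = candidate_set(6, 12)
print(len(J))  # cardinality n_J for m = 6, n_m = 12
```

The enumeration is exponential in $m$, which is acceptable here because the set is computed offline once per $(m, n_m)$ pair.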
2) Computation of the List of LLR Values $\hat{J}^+$: Based on $S^+$ and the set $\hat{J}^\oplus$, the $n_J$ values of $\hat{J}^+$ are computed using a set of adders wired according to the additions of the $s_i^+$ values indicated in the third column of Table I. The next step is to sort the predefined potential candidates $\hat{J}^+ = \{\hat{J}_i^+\}_{i=0,1,\ldots,n_J-1}$ to generate the list of $n_m$ sorted LLRs $\hat{I} = \{(\hat{I}_k^\oplus, \hat{I}_k^+)\}_{k=0,\ldots,n_m-1}$.
3) Sorting of the Potential Candidates:
The sorting process is first presented for the case where $m = 6$ and $n_m = 12$, then the method is generalized to any values of $m$ and $n_m$.
The first three outputs are always $\hat{I}_0 = \hat{J}_0$, $\hat{I}_1 = \hat{J}_1$ and $\hat{I}_2 = \hat{J}_2$. For the remaining 9 outputs, we propose the [...]. The sorter architecture is based on the odd-even algorithm [10]. It is composed of $N_l = 7$ layers of CSs. The overall complexity is $N_{cs} = 28$ CSs (CSs and Cs are not differentiated) with a critical path of $T = N_l \times T_{CS} = 7 \times T_{CS}$, where $T_{CS}$ is the critical path of one CS.
Going back to the previous example, the obtained set $\hat{I}$ is equal to the $n_m = 12$ couples (see Table I) {[...] 9), $(\hat{J}_8^\oplus, 9)$, $(\hat{J}_5^\oplus, 11)$, $(\hat{J}_{11}^\oplus, 11)$, $(\hat{J}_6^\oplus, 12)$, $(\hat{J}_9^\oplus, 13)$, $(\hat{J}_{15}^\oplus, 13)$}. To generate the sorting architecture that extracts $\hat{I}$ from $\hat{J}$ in the general case, we propose a generic method. It consists in starting with an odd-even sorter of size $2^{\lceil \log_2(n_J) \rceil}$, i.e., the lowest power of 2 greater than or equal to $n_J$, then pruning each CS that receives pre-ordered inputs or has unused outputs. Table II summarizes the sorter complexity for several values of $n_m$ and different Galois fields. Note that the mapping between the elements of $\hat{J}$ and the inputs of the odd-even sorter significantly impacts the overall complexity. Finding the optimal mapping is still an open question. When defined, this optimal mapping may lead, for some GF and $n_m$ configurations, to lowering the complexity figures of $N_{cs}$ and/or $N_J$ proposed in Table II.
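The adder and sorting stages for the running example can be modelled in a few lines. In this sketch, the 17 candidate supports for $m = 6$, $n_m = 12$ were derived from the dominance rule and are listed in an illustrative order that is not claimed to match the indexing of Table I; a plain software sort stands in for the pruned sorting network, and `solve_structured` is a hypothetical name.

```python
# Supports (positions of 1s) of the 17 candidates for m = 6, n_m = 12.
# Assumption: derived from the dominance rule; the ordering need not match Table I.
CANDIDATES = [(), (0,), (1,), (2,), (3,), (4,), (5,),
              (0, 1), (0, 2), (1, 2), (0, 3), (1, 3), (2, 3), (0, 4), (0, 5),
              (0, 1, 2), (0, 1, 3)]

def solve_structured(s_plus, n_m, candidates=CANDIDATES):
    """Return the n_m couples (support, LLR) with the smallest LLR among the candidates."""
    # Adder stage: the LLR of a candidate is the sum of s_plus over its support.
    j_plus = [(sum(s_plus[i] for i in supp), supp) for supp in candidates]
    # Sorting stage: a full stable sort replaces the pruned odd-even network.
    j_plus.sort(key=lambda c: c[0])
    return [(supp, llr) for llr, supp in j_plus[:n_m]]

I_hat = solve_structured((2, 6, 7, 9, 11, 12), 12)
llrs = [llr for _, llr in I_hat]
print(llrs)
```

The resulting LLR values end with 9, 9, 11, 11, 12, 13, 13, in agreement with the couples listed in the text.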
C. Inverse Permutation of the GF Values of $\hat{I}$
The computation of the intrinsic message $U$ from $\hat{I}$ is straightforward. The couples $U_k = (U_k^\oplus, U_k^+)$, $k = 0, 1, \ldots, n_m - 1$, are obtained as $U_k^\oplus = (\hat{I}_k^\oplus(\pi^{-1}(i)) \oplus \bar{Y}(i))_{i=0,1,\ldots,m-1}$ and $U_k^+ = \hat{I}_k^+$. Following the previous example, $\pi = (2, 0, 5, 1, 4, 3)$ gives $\pi^{-1} = (1, 3, 0, 5, 4, 2)$. For $k = 0$, $U_0 = (\bar{Y}, 0)$. For $k = 1$, $\hat{I}_1^\oplus = (1, 0, 0, 0, 0, 0)$, then $(\hat{I}_1^\oplus(\pi^{-1}(i)))_{i=0,1,\ldots,m-1}$ is equal to $(0, 0, 1, 0, 0, 0)$, thus $U_1^\oplus = (0, 0, 1, 0, 0, 0) \oplus (1, 0, 1, 0, 0, 1) = (1, 0, 0, 0, 0, 1)$. Since $U_1^+ = \hat{I}_1^+ = 2$, then $U_1 = ((1, 0, 0, 0, 0, 1), 2)$. The same process is repeated to obtain $U_k$, $k = 2, \ldots, n_m - 1$.
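The inverse permutation step can be written directly from the formula above. The sketch assumes that the XOR is taken with the hard decision on $Y$, as in the worked example; the function name `to_unstructured` is hypothetical.

```python
def to_unstructured(i_hat_bits, pi, y):
    """Map a structured GF symbol back to the original bit order and XOR it with the hard decision."""
    m = len(y)
    hard = [1 if v < 0 else 0 for v in y]   # hard decision on each binary LLR
    inv = [0] * m
    for i, p in enumerate(pi):
        inv[p] = i                          # inv is the inverse permutation of pi
    return tuple(i_hat_bits[inv[i]] ^ hard[i] for i in range(m))

Y = (-6, 9, -2, 12, 11, -7)
PI = (2, 0, 5, 1, 4, 3)
U0 = to_unstructured((0, 0, 0, 0, 0, 0), PI, Y)
U1 = to_unstructured((1, 0, 0, 0, 0, 0), PI, Y)
print(U0, U1)
```

In hardware this step reduces to a multiplexing network driven by $\pi$ followed by $m$ XOR gates per output symbol.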
IV. COMPLEXITY ANALYSIS
This section presents the implementation results, on a Xilinx Kintex7 (xc7k325t -2 fbg676) FPGA device, of the LLR generator over GF(64) for $n_m = 4$ and 12. It is easy to modify the proposed architecture to fit all the cases of $n_m < 12$ since it is a matter of removing some CSs. Table III compares the hardware cost of the proposed architecture and the systolic one [8] in terms of Look-Up Tables (LUTs). The complexity of the systolic architecture is constant since the number of stages is fixed and equal to $m$. The only thing that changes with $n_m$ is the size of the FIFOs implemented in each stage, which does not significantly impact the overall complexity of the systolic architecture. In order to compare the efficiency of both architectures, we evaluated a metric called Hardware Efficiency (HE). HE is defined as the number of sorted symbols per second per LUT, i.e., $\mathrm{HE} = (n_m / n_c) \times F_{clk} / C$, where $n_c$ denotes the number of clock cycles required to generate the $n_m$ outputs, $F_{clk}$ the clock frequency (in MHz) and $C$ the number of LUTs required by the FPGA implementation. For the proposed parallel architecture, $n_c = 1$, while in [8], $n_c$ is equal to $n_m + 1$. To better illustrate the efficiency comparison, we have evaluated the HE Ratio (HER), defined as the ratio of the HE of the proposed architecture to the HE of [8]. The rightmost column of Table III shows that the proposed architecture outperforms the systolic architecture by an order of magnitude.
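Substituting the two values of $n_c$ into the HE definition gives a closed form for the HER. The subscripts "par" and "sys", used here for the clock frequency and LUT count of the proposed and systolic designs, are notation introduced for illustration and do not appear in the letter.

```latex
\mathrm{HER}
= \frac{\mathrm{HE}_{\mathrm{par}}}{\mathrm{HE}_{\mathrm{sys}}}
= \frac{n_m \, F_{\mathrm{par}} / C_{\mathrm{par}}}
       {\dfrac{n_m}{n_m + 1} \, F_{\mathrm{sys}} / C_{\mathrm{sys}}}
= (n_m + 1) \, \frac{F_{\mathrm{par}}}{F_{\mathrm{sys}}} \, \frac{C_{\mathrm{sys}}}{C_{\mathrm{par}}}
```

Under these definitions, two designs with equal clock frequency and equal LUT count would differ by a factor $n_m + 1$, which is the contribution of parallelism alone.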
V. CONCLUSION
This letter has presented the design and implementation of a parallel, low-hardware-cost LLR generator. The theoretical complexity and the performance of the proposed architecture have been analyzed and compared to those of the systolic architecture. For any value of $n_m$, the proposed architecture requires the lowest area and offers the highest frequency, yielding a hardware efficiency gain ranging from 7 up to 15. The architecture has been developed and presented in the context of non-binary codes. It could also be used for a high-speed Chase algorithm [12].
