Abstract-Non-binary LDPC codes offer higher performance than their binary counterparts but suffer from higher decoding complexity. One way to reduce this complexity is to use the Extended Min-Sum (EMS) algorithm. The first step of this algorithm requires generating, for each received symbol, the first $n_m$ largest Log-Likelihood Ratios (LLRs), sorted in increasing order. For the case where GF($q$) symbols are transmitted using BPSK modulation, we propose a simple systolic architecture that generates the sorted list of symbols.
I. INTRODUCTION
NON-BINARY Low Density Parity Check (NB-LDPC) codes over GF($q$), with $q > 2$, are known to outperform binary LDPC codes at short and medium code lengths [1]. Moreover, for high spectral efficiency modulations, channel symbols can be mapped directly onto code symbols, avoiding the information loss caused by the marginalisation process used in the binary LDPC case (gains of 1.6 dB are reported for the MIMO case in [2]).
These advantages of non-binary coding come at the expense of increased hardware complexity. However, a recent decoding algorithm, the Extended Min-Sum (EMS) algorithm proposed by Declercq [3], significantly reduces the computational complexity of the decoding process while maintaining good performance. Instead of processing the LLR values associated with the complete list of GF($q$) symbols, this algorithm sorts the LLR messages, processes only the greatest values, and applies an offset to compensate for the truncated messages.
One of the key components of an NB-LDPC decoder implementing the EMS algorithm is the LLR generation circuit and its accompanying sorter. This paper considers the design and hardware implementation of an efficient circuit that generates the LLR values sorted in increasing or decreasing order.
The rest of this paper proceeds as follows. In Section II, we review some basic notions and definitions related to the LLR computation and define a new simplified formula allowing the generation of the LLR values in an ordered list. Section III presents the proposed algorithm, Section IV describes the hardware architecture of the LLR circuit, and Section V concludes the paper.

Manuscript received February 7, 2011. The associate editor coordinating the review of this letter and approving it for publication was M. Ardakani.
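Let $\tilde{X}$ be the symbol of GF($q$) that maximizes $\ln(P(Y|X))$, i.e. $\tilde{X} = \arg\max_{X \in GF(q)} P(Y|X)$. Using equation (1), $\tilde{X}$ is given by $\tilde{X} = (\tilde{x}_i = \mathrm{HD}(y_i))_{i=0..m-1}$, where $\mathrm{HD}(y_i)$ denotes the hard decision on $y_i$, i.e. $\mathrm{HD}(y) = 0$ if $y < 0$, $\mathrm{HD}(y) = 1$ otherwise.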
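With the hypothesis that the GF($q$) symbols are equiprobable, the reliability $L(X)$ of a symbol $X$ may be defined as the log-likelihood ratio of the probability of $X$ relative to the probability of symbol $\tilde{X}$:

$$L(X) = \ln\left(\frac{P(Y|X)}{P(Y|\tilde{X})}\right) \qquad (2)$$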
which, using (1), can be developed as:
By definition of $\tilde{X}$, $L(X)$ is a non-positive number. In order to deal with positive numbers, the quantity $\bar{L}(X) = -L(X)$ will be considered in what follows, as in [4]. Using (3), $\bar{L}(X)$ can be written as:
1089-7798/11$25.00 © 2011 IEEE

where $\Delta_i = x_i \oplus \tilde{x}_i$ (XOR), i.e. $\Delta_i = 0$ if $x_i$ and $\tilde{x}_i$ have the same sign, 1 otherwise.
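As a sketch of where such an expression comes from, the derivation below assumes an AWGN channel with noise variance $\sigma^2$ and a BPSK mapping $B(x) = 2x - 1$. These assumptions, the notation $L(X)$ and $\bar{L}(X)$, and the resulting factor $2/\sigma^2$ are illustrative only: equation (1) is not reproduced here, so the original scaling may differ.

```latex
\begin{align*}
L(X) &= \ln P(Y|X) - \ln P(Y|\tilde{X})
      = \sum_{i=0}^{m-1}
        \frac{(y_i - B(\tilde{x}_i))^2 - (y_i - B(x_i))^2}{2\sigma^2} \\
% Bits with x_i = \tilde{x}_i cancel. For the others, B(\tilde{x}_i) has the
% sign of y_i and B(x_i) the opposite sign.
     &= \sum_{i=0}^{m-1}
        \frac{(|y_i| - 1)^2 - (|y_i| + 1)^2}{2\sigma^2}\,\Delta_i
      = -\sum_{i=0}^{m-1} \frac{2\,|y_i|}{\sigma^2}\,\Delta_i , \\
\bar{L}(X) &= -L(X)
      = \sum_{i=0}^{m-1} \frac{2\,|y_i|}{\sigma^2}\,\Delta_i \;\ge\; 0 .
\end{align*}
```

Under these assumptions, each bit that disagrees with the hard decision adds a penalty proportional to $|y_i|$, and $\bar{L}(\tilde{X}) = 0$.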
This approach, which inverts the logic of the conventional LLR computation, associates the lowest LLR value with the most reliable GF($q$) symbol. Its main advantage is that it avoids the normalization step needed to maintain the numerical stability of the extrinsic LLRs during the decoding process.
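The inverse-logic convention can be illustrated with a minimal sketch of the direct (exhaustive) computation. It assumes that the reliability of a symbol is the sum of $|y_i|$ over the bits that differ from the hard decision, with any constant channel factor dropped. The function names and the sample observations are hypothetical.

```python
from itertools import product


def hard_decision(y):
    """Hard decision as defined in the text: 0 if y_i < 0, 1 otherwise."""
    return tuple(0 if y_i < 0 else 1 for y_i in y)


def inverse_llr(x, y):
    """Positive reliability of symbol x: sum of |y_i| over the bits where x
    differs from the hard decision (constant channel factor assumed dropped)."""
    x_tilde = hard_decision(y)
    return sum(abs(y_i) for x_i, h_i, y_i in zip(x, x_tilde, y) if x_i != h_i)


def all_llrs_sorted(y):
    """Direct approach: evaluate all q = 2**m symbols, then sort by reliability."""
    m = len(y)
    return sorted((inverse_llr(x, y), x) for x in product((0, 1), repeat=m))


# Hypothetical channel observations for one GF(16) symbol (m = 4).
y = [0.75, -0.25, 1.5, -0.5]
table = all_llrs_sorted(y)
print(table[0])   # most reliable symbol: value 0, equal to the hard decision
print(table[:3])  # the three lowest values and their symbols
```

With this convention, the hard-decision symbol always sits at the head of the list with value 0, so no normalization against a running maximum is needed.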
The first step of the EMS algorithm [3] requires the generation of the first $n_m$ minimum values of $\bar{L}(X)$, which is not a trivial problem. An elegant algorithm has been proposed by Fossorier et al. in [5]. Unfortunately, this algorithm is software oriented: it builds the LLR list dynamically and is poorly suited to hardware implementation, especially when a parallel approach is considered. A hardware-oriented algorithm has therefore become necessary now that hardware implementation of NB-LDPC decoders is feasible.
III. PROPOSED ALGORITHM
Let us tackle the problem of generating the $n_m$ lowest $\bar{L}(X)$ values and their associated GF($q$) symbols, sorted in increasing order. In this paper, we propose an iterative construction of the list by considering, at stage $k$, only the first $k$ coordinates of $X$ ($k$ varying from 1 to $m$). Let us define ( ) =
The second step is the merging of 0 2 and
In the general case, during the expansion process, the
where & stands for the binary append operator. Since both expanded lists (labelled 0 and 1) are sorted, the creation of the merged list is simple. The direct approach (computing all $q$ LLRs and sorting them to extract the first $n_m$ values) leads to a complexity in $q \log_2(q)$. With the proposed method, the overall complexity is reduced to $n_m \log_2(q)$. Moreover, the proposed algorithm can be implemented in a simple systolic hardware architecture.
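The expand-and-merge construction can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that appending the hard-decision bit leaves the reliability unchanged while appending the opposite bit adds $|y_i|$ (constant channel factor dropped). The function name and test values are hypothetical.

```python
def sorted_llr_list(y, n_m):
    """Build the n_m lowest (llr, symbol) couples stage by stage.

    At each stage the current sorted list is expanded into two sorted lists
    (hard-decision bit appended, or flipped bit appended with penalty |y_i|),
    which are then merged and truncated to n_m entries.
    """
    current = [(0.0, ())]  # empty symbol, zero penalty
    for y_i in y:
        hd = 0 if y_i < 0 else 1
        list0 = [(llr, sym + (hd,)) for llr, sym in current]
        list1 = [(llr + abs(y_i), sym + (1 - hd,)) for llr, sym in current]

        merged = []
        i = j = 0
        # Classic two-way merge of sorted lists, stopped after n_m outputs.
        while len(merged) < n_m and (i < len(list0) or j < len(list1)):
            take0 = j >= len(list1) or (
                i < len(list0) and list0[i][0] <= list1[j][0]
            )
            if take0:
                merged.append(list0[i])
                i += 1
            else:
                merged.append(list1[j])
                j += 1
        current = merged
    return current


# Hypothetical observations for m = 4 bits, keeping the n_m = 5 best couples.
for llr, symbol in sorted_llr_list([0.75, -0.25, 1.5, -0.5], 5):
    print(llr, symbol)
```

Each of the $m = \log_2(q)$ stages performs at most $n_m$ merge steps, which is consistent with the $n_m \log_2(q)$ complexity stated above.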
IV. HARDWARE ARCHITECTURE OF THE LLR CIRCUIT
In this section, we describe the systolic hardware architecture that serially generates the sorted LLR list. This architecture is composed of $m$ stages working in pipeline mode, and each stage consists of one Processing Element (PE). Figure 1-a illustrates the architecture for $m = 4$. As shown in this figure, the $k$-th PE receives two inputs: the channel binary observation $y_{k-1}$ with its corresponding sign, and the list produced by the $(k-1)$-th PE. As an output, the $k$-th PE generates the list to be fed to the $(k+1)$-th PE of the next stage.
A. Structure of the PE
The PE constitutes the core of the proposed LLR circuit. It consists of two expansion modules, two First-In-First-Out (FIFO) memories and one comparator selecting the minimum of the two FIFO outputs, as shown in Fig. 1-b, where the third-stage PE is illustrated. The inputs and outputs of this stage correspond to the intermediate LLR computation of the example described in the previous section. The third PE serially receives (one new couple every clock cycle) the ordered list produced by stage 2.

The first step is the expansion of 2 into lists in the list 3 and a pull signal is set to 1 to get a new couple at the output of the considered FIFO. ℎ element of will be retrieved from FIFO 0, which freed room for the (3⌊ ( )/3⌋ + 1) ℎ element coming from −1 and so on. If ( − 1) = ( )/2 it can be shown that, in that case, 0 ( ) = ( )/4. In summary, 0 ( ) varies between 0 ( ) = ( )/4 and 0 ( ) = ( )/3.

Figure 2 shows the timing diagram of the LLR computation for the example described in Section III. As shown in this figure, the LLR circuit operates synchronously with a clock signal. One control signal indicates the start of the LLR computation and another indicates the availability of data at the input. The circuit has a latency of $m = 4$ cycles, and the start of output generation is indicated by a dedicated signal. After $n_m + 4 - 1 = 13$ cycles, the $n_m$ (LLR, GF) couples have been generated. As previously mentioned, the LLR circuit operates in a pipelined manner: it can start processing the second symbol while the last stage of the circuit is still producing the (LLR, GF) couples of the first symbol. Thus, as shown in Fig. 2, the first (LLR, GF) couple associated with the second symbol is produced after a delay of one cycle. This delay cycle is required to reinitialize the FIFOs.
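The behaviour of one PE (two expansion paths, two FIFOs and a comparator) can be approximated in software with a streaming model. This is a behavioural sketch under stated assumptions, not a cycle-accurate description of the circuit: the FIFOs are unbounded Python deques, the penalty of a flipped bit is taken as $|y_i|$, and all function names are hypothetical.

```python
from collections import deque


def processing_element(stream, y_i, n_m):
    """Behavioural model of one PE.

    Each loop iteration stands for one clock cycle: at most one couple is read
    from the previous stage and expanded into both FIFOs, then the comparator
    emits the smaller of the two FIFO heads.
    """
    hd = 0 if y_i < 0 else 1
    fifo0, fifo1 = deque(), deque()
    source = iter(stream)
    exhausted = False
    emitted = 0
    while emitted < n_m:
        if not exhausted:
            try:
                llr, sym = next(source)
                fifo0.append((llr, sym + (hd,)))                  # no penalty
                fifo1.append((llr + abs(y_i), sym + (1 - hd,)))   # flipped bit
            except StopIteration:
                exhausted = True
        if not fifo0 and not fifo1:
            break
        # Comparator: pull from FIFO 1 only if its head is strictly smaller.
        if fifo1 and (not fifo0 or fifo1[0][0] < fifo0[0][0]):
            yield fifo1.popleft()
        else:
            yield fifo0.popleft()
        emitted += 1


def llr_circuit(y, n_m):
    """Chain one PE per bit, mimicking the systolic arrangement of stages."""
    stream = iter([(0.0, ())])
    for y_i in y:
        stream = processing_element(stream, y_i, n_m)
    return list(stream)


# Hypothetical observations for m = 4 bits, keeping n_m = 5 couples.
for couple in llr_circuit([0.75, -0.25, 1.5, -0.5], 5):
    print(couple)
```

Because a new couple is pushed into FIFO 0 before each comparison, the head of FIFO 0 is never larger than any couple still to arrive, so emitting the smaller head keeps the output sorted. Bounding the FIFO depths, as discussed above, is not modelled here.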
B. Implementation results
A generic version of the proposed systolic architecture has been developed and successfully validated. The architecture accepts any Galois field size $q$ and any value $n_m \le q$. The internal binary word sizes of the (LLR, GF) couple are also parameterized. This architecture has been used in an FPGA-based GF(64) NB-LDPC code implementation, with $m = 6$, $n_m = 12$, and 6-bit words for both the LLR and the GF fields. On a Virtex 4 (XC4VLX15), it requires 275 slices and operates at a frequency of 149 MHz.
V. CONCLUSIONS
We have presented a novel and efficient hardware design of the LLR computation circuit. The proposed circuit is the first of its kind to be designed. It performs the LLR computation in a systolic way and produces an ordered list of (LLR, GF) couples. This architecture has been implemented in the design of a GF(64) NB-LDPC decoder. It can also be used for other codes, such as the soft Reed-Solomon decoder described in [6].
