Abstract-This paper presents the architecture, performance and implementation results of a serial GF(64)-LDPC decoder based on a reduced-complexity version of the Extended Min-Sum algorithm. The main contributions of this work correspond to the variable node processing, the codeword decision and the elementary check node processing. Post-synthesis area results show that the decoder area is less than 20% of a Virtex 4 FPGA for a decoding throughput of 2.95 Mbps. The implemented decoder presents performance at less than 0.7 dB from the Belief Propagation algorithm for different code lengths and rates. Moreover, the proposed architecture can be easily adapted to decode very high Galois Field orders, such as GF(4096) or higher, by slightly modifying a marginal part of the design.
NB-LDPC codes also outperform binary LDPC codes in the presence of burst errors [7], [8]. Further research on NB-LDPC considers their definition over finite groups, which is a more general framework than finite Galois fields [9]. This leads to hybrid [10] and split or cluster NB-LDPC codes [11], increasing the degree of freedom in terms of code construction while keeping the same decoding complexity.
From an implementation point of view, NB-LDPC codes significantly increase complexity compared to binary LDPC, especially at the receiver side. The direct application of the Belief Propagation (BP) algorithm to -LDPC leads to a computational complexity dominated by and considering values of results in prohibitive complexity. Therefore, an important effort has been dedicated to designing reduced-complexity decoding algorithms for NB-LDPC codes. In [12] and [13], the authors present an FFT-based BP decoding that reduces complexity to the order of , where is the check node degree. This algorithm is also described in the logarithm domain [14], leading to the so-called log-BP-FFT. In [15], [16], the authors introduce the Extended Min-Sum (EMS), which is based on a generalization of the Min-Sum algorithm used for binary LDPC codes ([17], [18] and [19]). Its principle is the truncation of the vector messages from to values , introducing a performance degradation compared to the BP algorithm. However, with an appropriate estimation of the truncated values, the EMS algorithm can approach, or even in some cases slightly outperform, the BP-FFT decoder. Moreover, the complexity/performance trade-off can be adjusted with the value of the parameter, making the EMS decoder architecture easily adaptable to both implementation and performance constraints. A complexity comparison of the different iterative decoding algorithms applied to NB-LDPC is presented in [20]. Finally, the Min-Max algorithm and its selective-input version are presented in [21].
In recent years, several hardware implementations of NB-LDPC decoding algorithms have been proposed. In [22] and [23], the authors consider the implementation of the FFT-BP on an FPGA device. In [24] the authors evaluate implementation costs for various values of by the extension of the layered decoder to the NB case. An architecture for a parallel or serial implementation of the EMS decoder is proposed in [16].
Also, the implementation of the Min-Max decoder is considered in [25], [26] and optimized in [27] for GF(32). Finally, a recent paper, published during the reviewing process of our manuscript, presents an implementation of an NB-LDPC decoder based on the Bubble-Check algorithm and a low-latency variable node processing [28].
Even if the theoretical complexity of the EMS is in the order of , for a practical implementation, the parallel insertion needed to reorder the vector messages at the elementary check node (ECN) increases the complexity to the order of . An algorithm to reduce the EMS ECN complexity is introduced in [29] for a complexity reduction in the order of . The complexity of this architecture was further reduced without sacrificing performance with the L-Bubble Check algorithm [30].
As the EMS decoder considers log-likelihood ratios (LLR) for the reliability messages, a key component in the NB decoder is the circuit that generates the a priori LLRs from the binary channel values. An LLR generator circuit is proposed in [31] , but this algorithm is software oriented rather than hardware oriented, since it builds the LLR list dynamically. In [32] , an original circuit is proposed as well as the accompanying sorter which provides the NB LLR values to the processing nodes of the EMS decoder.
In this paper, we present a design and a reduced-complexity implementation of the L-Bubble Check EMS NB-LDPC decoder focusing our attention on the following points: the Variable Node (VN) update, the Check Node (CN) processing as a systolic array of ECNs and the codeword decision-making. Table I summarizes the notation used in the paper.
The paper is organized as follows: Section II introduces ultra-sparse quasi-cyclic NB-LDPC codes, which are the ones considered by the decoder architecture. This section also reviews NB-LDPC decoding with particular attention to the Min-Sum and the EMS algorithms. Section III is dedicated to the global decoder architecture and its scheduling. The VN architecture is detailed in Section IV. The CN processor and the L-Bubble Check ECN architecture are presented in Section V. Section VI is dedicated to performance and complexity issues and, finally, conclusions and perspectives are discussed in Section VII.
II. NB-LDPC CODES AND EMS DECODING
This section provides a review of NB-LDPC codes and the associated decoding algorithms. In particular, the Min-Sum and the EMS algorithms are described in detail.
Definition of NB-LDPC Codes
An NB-LDPC code is a linear block code defined on a very sparse parity-check matrix whose nonzero elements belong to a finite field , where . The construction of these codes is expressed as a set of parity-check equations over , where a single parity equation involving codeword symbols is:
, where are the nonzero values of the -th row of and the elements of are . The dimension of the matrix is , where is the number of parity-check nodes (CN) and is the number of variable nodes (VN), i.e., the number of symbols in a codeword. A codeword is denoted by , where , is a symbol represented by bits as follows: . The Tanner graph of an NB-LDPC code is usually much sparser than that of its binary counterpart for the same rate and binary code length ([33], [34]). Also, the best error-correcting performance is obtained with the lowest possible VN degree, . These so-called ultra-sparse codes [33] reduce the effect of stopping and trapping sets, and thus the message-passing algorithms come closer to optimal maximum-likelihood decoding. For this reason, all the codes considered in this paper are ultra-sparse. To obtain both good error-correcting performance and a hardware-friendly LDPC decoder, we consider the optimized non-binary protograph-based codes [35], [36] with proposed by D. Declercq et al. [37]. These matrices are designed to maximize the girth of the associated bipartite graph and to minimize the multiplicity of the cycles with minimum length [38]. This NB-LDPC matrix structure is similar to that of most binary LDPC standards (DVB-S2, DVB-T2, WiMax, etc.) and allows different decoder schedulings: parallel or serial node processors. Finally, the nonzero values of are limited to only distinct values and each parity check uses exactly those distinct values. This limitation in the choice of the values reduces the storage requirements.
A. Min-Sum Algorithm for NB-LDPC Decoding
The EMS algorithm [15] is an extension of the Min-Sum ( [39] , [40] ) algorithm from binary to NB LDPC codes. In this section we review the principles of the Min-Sum algorithm, starting with the definition of the NB LLR values and the exchanged messages in the Tanner graph. Considering a BPSK modulation and an additive white Gaussian noise (AWGN)  channel, the received noisy codeword  consists of  binary symbols independently affected by noise: , where , , , is the realization of an AWGN of variance and represents the BPSK modulation that associates symbol "-1" to bit 0 and symbol "+1" to bit 1.
Definition of NB LLR Values:
The first step of the Min-Sum algorithm is the computation of the LLR value for each symbol of the codeword. With the hypothesis that the symbols are equiprobable, the LLR value of the symbol is given by [21] (
where is the symbol of that maximizes , i.e., . Note that and, for all , . Thus, when the LLR of a symbol increases, its reliability decreases. This LLR definition avoids the need to re-normalize the messages after each node update computation and reduces the effect of quantization when considering a finite-precision representation of the LLR values.
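As an illustration of this LLR definition, the following Python sketch computes the LLR of every symbol from the received samples of one symbol. It is a minimal sketch under stated assumptions, not the circuit of [32]: the function name `nb_llr`, the bit ordering (bit i of the integer symbol index is the i-th binary coordinate) and the floating-point arithmetic are choices made for this example only. It relies on the BPSK mapping stated above (bit 0 to -1, bit 1 to +1) and on the AWGN model, for which each bit that differs from the hard decision adds 2|y_i|/sigma^2 to the LLR.

```python
def nb_llr(y, sigma2):
    """LLR of each GF(2^m) symbol for one received symbol (behavioural sketch).

    y      : list of m real channel samples (BPSK, bit 0 -> -1, bit 1 -> +1)
    sigma2 : noise variance of the AWGN channel
    Returns a list of 2^m non-negative LLRs; the hard-decision symbol gets 0.
    Assumption of this example: bit i of the symbol index is coordinate i.
    """
    m = len(y)
    hard = [1 if sample > 0 else 0 for sample in y]  # most likely bit per coordinate
    llr = []
    for symbol in range(1 << m):
        total = 0.0
        for i in range(m):
            bit = (symbol >> i) & 1
            if bit != hard[i]:
                # Penalty for contradicting the hard decision on coordinate i.
                total += 2.0 * abs(y[i]) / sigma2
        llr.append(total)
    return llr
```

With this convention the most reliable symbol always has an LLR of zero and all other LLRs are positive, which matches the property stated above that a larger LLR means a less reliable symbol.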
As developed in [32] , can be expressed as
Using (3) Step 1: VN computation: for all
Step 2: Determination of the minimum V2C LLR value (6)
Step 3: Normalization
3) CN Equations in the Min-sum Algorithm: With the forward-backward algorithm [43] a CN of degree can be decomposed into ECNs, where an ECN has two input messages and and one output message (see Fig. 7 ). (8) where is the addition in .
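The ECN operation can be sketched in Python as follows. This is a behavioural sketch of the full (non-truncated) Min-Sum ECN, written under the assumption that symbols are represented as integers so that the addition in the Galois field reduces to a bitwise XOR; the function name `ecn_min_sum` is hypothetical and the multiplications by the parity-check coefficients are left out.

```python
def ecn_min_sum(u, v):
    """Full Min-Sum elementary check node (behavioural sketch).

    u, v : lists of q LLRs indexed by symbol value (q a power of two)
    Returns e with e[x] = min over all (a, b) with a XOR b == x of u[a] + v[b].
    """
    q = len(u)
    e = [float("inf")] * q
    for a in range(q):
        for b in range(q):
            x = a ^ b  # addition in GF(2^m) on the binary representation
            candidate = u[a] + v[b]
            if candidate < e[x]:
                e[x] = candidate
    return e
```

The double loop makes the quadratic cost in the field order explicit, which is the cost that the truncated messages of the EMS algorithm are meant to avoid.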
4) Decision-making Equations in the Min-sum Algorithm:
The decision is expressed as
B. The EMS Algorithm
The main characteristic of the EMS is to reduce the size of the edge messages from to ( ) by considering the sorted list of the first smallest LLR values (i.e., the set of the most probable symbols) and by giving a default LLR value to the others. Let be the EMS message associated to the symbol knowing (the so-called intrinsic message). is composed of couples , where is a element and is its associated LLR:
. The LLR verifies . Moreover, . In the EMS, a default LLR value is associated to each symbol of that does not belong to the set , where is a positive offset whose value is determined to maximize the decoding performance [15] .
The structure of the V2C and the C2V messages is identical to the structure of the intrinsic message . The output message of the VN should contain only, in sorted order, the first smallest LLR values and their associated GF symbols . Similarly, the output message of the CN contains only the first smallest LLR values (sorted in increasing order), their associated GF symbols , and the default LLR value . Except for the approximation of the exchanged messages, the EMS algorithm does not differ from the Min-Sum algorithm, i.e., it corresponds to (5)-(9).
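The message truncation described above can be sketched as follows. The sketch assumes that the default LLR is the largest retained LLR plus the offset, which is one natural reading of the description but is an assumption of this example; the names `ems_truncate`, `n_m` and `offset` are likewise chosen here for illustration.

```python
def ems_truncate(llr, n_m, offset):
    """Truncate a full LLR vector to an EMS message (behavioural sketch).

    llr    : list of q LLRs indexed by symbol value
    n_m    : number of (LLR, GF) couples kept in the message
    offset : positive offset added to the last kept LLR (assumed rule)
    Returns (couples, default) where couples is the sorted list of the n_m
    smallest (LLR, symbol) pairs and default is the LLR given to all others.
    """
    couples = sorted((value, symbol) for symbol, value in enumerate(llr))[:n_m]
    default = couples[-1][0] + offset
    return couples, default
```

For example, truncating the vector [1.0, 0.0, 3.0, 2.0] to two couples with an offset of 0.5 keeps the symbols 1 and 0 and gives the default value 1.5 to the two remaining symbols.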
III. ARCHITECTURE AND DECODING SCHEDULING
This section presents the architecture of the decoder and its characteristics in terms of parallelism, throughput and latency.
A. Level of Parallelism
We propose a serial architecture that implements a horizontal shuffled scheduling with a single CN processor and VN processors. The choice of a serial architecture is motivated by the area constraints, as our final objective is to include the decoder in an existing wireless demonstrator platform [44] (see Section VI). The horizontal shuffled scheduling provides faster convergence because during one iteration a CN processor already benefits from the processing of a former CN processor. This simple serial design constitutes a first FPGA implementation to be considered as a reference for future parallel or partial-parallel enhanced architecture designs.
B. The Overall Decoder Architecture
The overall view of the decoder architecture is presented in Fig. 1 . A single CN processor is connected to VN processors and RAM V2C memory banks. The CN processor receives in parallel V2C messages and provides, after computation, C2V messages. The C2V messages are then sent to the VN processors to compute the V2C messages of their second edge.
Note that, for the sake of simplicity, we have omitted the description of the permutation nodes that implement the multiplications. The effect of this multiplication is to replace the value by where the GF multiplication requires only a few XOR operations.
1) Structure of the RAMs:
The channel information and the message associated to the variables are stored in memory banks RAM and RAM V2C respectively. 4 Each memory bank contains information related to variables. In the case of RAM , the received values associated to the variable are stored in consecutive memory addresses, each of size bits, where is the number of bits of the fixed-point representation of (i.e., the size of RAM is words of bits). Similarly, each RAM V2C is also associated to variables. The information related to is stored in consecutive memory addresses, each location containing a couple ( ), i.e., two binary words of size , where is the number of bits to encode the values. To reduce memory requirements, for each symbol , only the channel samples and the extrinsic messages are stored in the RAM blocks. The intrinsic LLR are stored after their computation but they are overwritten by the V2C messages during the first decoding iteration. Each time an intrinsic LLR is required for the VN update, it is re-computed in the VN processor by the LLR generator circuit. Such approach avoids the memorization of all the LLR of the input message ( messages) and thus, saves significant area when considering high-order Galois Fields (
). The partition of the variables in the memories is a coloring problem: the variables associated to a given CN should be stored each in a different memory bank to avoid memory access conflicts (i.e., each memory bank must have a different color). A general solution to this problem has been studied in [45] . Since the NB-LDPC matrices considered in our study are highly structured (see [37] ), the problem of partitioning is solved by the structure of the code.
2) Wormhole Layer Scheduling: The proposed architecture considers a wormhole scheduling. The decoding process starts by reading the stored and V2C information sequentially and sends, in clock cycles, the whole message to the CN. After a maximum delay , the CN starts to send the messages to the VN processors, again with a value , at each clock cycle. After a delay of (see Section V-B), the VNs send the new messages to the memory. The process is pipelined, i.e., every clock cycles, a new CN processing is started. The total time to process decoding iterations is (10) where is given in clock cycles. Fig. 2 illustrates the scheduling of the decoding process.
3) The Decoding Steps: The decoding process iterates times performing CN updates and VN updates at each iteration. During the last iteration a decision is taken on each symbol. The codeword decision is performed in the VN processors. This concludes the decoding process and the decoder then sequentially outputs to the next block of the communication chain. Note that the interface of the decoder is then rather simple:
1) Load and store them in RAMy ( clock cycles).
2) Compute intrinsic information from to initialize the messages.
3) Perform the decoding iterations.
4) During the second edge processing of the last iteration, use the decision process to determine .
5) Output the decoded message ( clock cycles) and wait for the new input codeword to decode.
IV. VARIABLE NODE ARCHITECTURE
Although most papers on NB-LDPC decoder architectures focus on the CN, the implementation of the VN architecture is almost as complex as, if not more complex than, that of the CN in terms of control. In the proposed decoder, the VN processor works in three different steps: 1) the intrinsic generation; 2) the VN update; and 3) the codeword decision. During the first step, prior to the decoding iterations, the Intrinsic Generation Module (IGM) circuit is active and generates the intrinsic message from the received samples. During the VN update, all the blocks of the VN processor, except the Decision block, are active. Finally, during the last decoding iteration, the Decision block is active (see Fig. 3).
A. The Intrinsic Generator Module (IGM)
The role of the IGM is to compute the intrinsic messages. In [32] , the authors propose an efficient systolic architecture to perform this task. The purpose is to iteratively construct the intrinsic LLR list considering, at the beginning, only the first coordinate, then the first two coordinates and so on, up to the complete computation of the intrinsic vector. The systolic architecture works as a FIFO that can be fed when needed. Once the input symbols are received, and after a delay of clock cycles ( ), the IGM generates a new output at every clock cycle. When pipelined, this module generates a new intrinsic vector every clock cycles. Each intrinsic message is stored in the corresponding V2C memory location in order to be used during the first step of the iterative decoding process.
In the present design, in order to minimize the amount of memory, the intrinsic messages are not stored but re-generated when needed, i.e., during each VN update of the iterative decoding process. This choice was dictated by the limited memory resources of the existing FPGA platform. In another context, it could be preferable to generate the intrinsic messages only once, store them in a specific memory and retrieve them when needed.
B. The VN Update
In the VN processor, the blocks involved in the VN update are the following: the elementary LLR generator (eLLR), the Sorter, the IGM, the Flag memory, and the Min block.
The task of the VN update is simple: it extracts in sorted order the smallest values, and their associated GF( ) symbols, from the set indexed by to generate the new V2C message.
The set of values can be divided into two disjoint subsets and , with the subset of defined as . In this set, , with such that . The second set, contains the symbols not in . If , then takes the default value (see Section II-C). The generation of is done serially in three steps:
1) is sent to the eLLR module to compute according to (4). The value of is also used to put a flag from 0 to 1 in the Flag memory of size to indicate that this value now belongs to . To be specific, the Flag memory is implemented as two memory blocks in parallel, working in ping-pong mode to allow the pipeline of two consecutive C2V messages without conflicts.
2) is added to to generate . Note that is no longer sorted in increasing order.
3) The Sorter reorders serially the values in in increasing order. The architecture of this Sorter is described in Section IV-C. The IGM is used to generate the second set . Each output value of the IGM is first added to . Then, if belongs to (i.e., the flag value at address in the flag memory equals "1"), the value is discarded and a new value is provided by the IGM component to the Min component.
The Min component serially selects the input with the minimum LLR value from and . Each time it retrieves a value from a set, it triggers the production of a new value of this set until all the values of are generated.
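The VN update described above can be modelled in software as follows. This is a behavioural sketch, not a description of the hardware: the function name `vn_update` and its arguments are hypothetical, the IGM is emulated by sorting the full intrinsic vector, the Sorter by a list sort and the Min component by a lazy merge of two sorted streams. The final subtraction of the smallest LLR follows the normalization step of the Min-Sum algorithm.

```python
import heapq
from itertools import islice

def vn_update(c2v, c2v_default, intrinsic, n_m):
    """Behavioural model of the VN update for one edge.

    c2v         : list of (LLR, symbol) couples received from the CN
    c2v_default : default LLR of the symbols absent from c2v
    intrinsic   : list of q intrinsic LLRs indexed by symbol value
    n_m         : number of couples in the generated V2C message
    """
    q = len(intrinsic)
    flags = [False] * q  # plays the role of the Flag memory
    first_set = []
    for llr, symbol in c2v:
        flags[symbol] = True
        first_set.append((llr + intrinsic[symbol], symbol))  # eLLR + adder
    first_set.sort()  # Sorter: the sums are no longer in increasing order

    # IGM emulation: intrinsic couples in increasing LLR order.
    igm_stream = sorted((value, symbol) for symbol, value in enumerate(intrinsic))
    # Second set: unflagged symbols, shifted by the default C2V value.
    second_set = ((value + c2v_default, symbol)
                  for value, symbol in igm_stream if not flags[symbol])

    # Min component: serially pick the smallest head of the two sorted streams.
    merged = list(islice(heapq.merge(first_set, second_set), n_m))
    smallest = merged[0][0]
    return [(llr - smallest, symbol) for llr, symbol in merged]
```

Because both streams are already sorted, the merge only ever compares the two current heads, which mirrors the serial behaviour of the Min component.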
C. The Architecture of the Sorter Block in the VN
The Sorter block in the VN processor is composed of stages, where is the smallest integer greater than or equal to (see Fig. 4). The ( ) stage serially receives two sorted lists of size , and provides a sorted list of size . The first received list goes into and the second list goes into . Then, the min_select block compares the first values of the two FIFOs, pulls the minimum one from the corresponding FIFO and outputs it. In practice, a stage starts to output the sorted list as soon as the first element of the second list is received. The latency of a stage is then clock cycles, plus one cycle for the pipeline, i.e., clock cycles. The size of is double (i.e., ) in order to allow receiving a new input list while outputting the current sorted list.
As an example, to order a list of values, the Sorter consists of 4 stages. The first stage receives 16 sequences of size and outputs 8 sorted lists of size (i.e., the elements are ordered by couples). The second stage outputs 4 lists of size , the third stage outputs 2 lists of size 8 and, finally, the last stage outputs the whole sorted list of size . The global latency of the Sorter is then expressed as (11) Note that the sorter is able to process continuously blocks of size power of two, i.e., for , it is able to process a new block every 16 clock cycles and the latency is .
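The stage-by-stage behaviour of the Sorter can be illustrated with a short software model. The sketch below only reproduces the data flow (each stage merges pairs of sorted lists into lists of twice the size); it does not model the FIFOs or the cycle-level latency, and the function names are chosen for this example. It assumes an input length that is a power of two.

```python
import heapq

def merge_stage(lists):
    """One Sorter stage: merge consecutive pairs of sorted lists."""
    return [list(heapq.merge(lists[i], lists[i + 1]))
            for i in range(0, len(lists), 2)]

def staged_sort(values):
    """Sort a power-of-two-sized list with cascaded merge stages.

    Returns the sorted list and the number of stages that were needed.
    """
    if len(values) == 0 or len(values) & (len(values) - 1):
        raise ValueError("the number of values must be a power of two")
    lists = [[value] for value in values]  # sequences of size 1
    stages = 0
    while len(lists) > 1:
        lists = merge_stage(lists)
        stages += 1
    return lists[0], stages
```

For a 16-element input the model goes through four stages, producing 8 lists of size 2, then 4 lists of size 4, then 2 lists of size 8 and finally the sorted list of size 16, as in the example above.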
D. Decision Circuit Architecture
The architecture of the simplified codeword decision circuit is presented in Fig. 5 . The optimal decoding is given by (12) Since the decision is done during the second branch update, we can replace in (12) by (see (5) ). Thus, we can write (13) The processing of this equation is rather complex, since it requires either an exhaustive search for all values of , or a complex content addressable memory (CAM) to search for the common values in the V2C and C2V messages. At this point, any method leading to a hardware simplification without significant performance degradation can be accepted. In a very pragmatic way, we tried several methods and we propose to replace, in (13) , by in order to reduce the size of the CAM from to 3. Let be the set of the common values between the C2V and V2C messages, indexed by (14) The decided symbol is defined as (15) where refers to the associated value. Fig. 5 presents the architecture of the Decision circuit and Fig. 6 shows performance simulation of the decision circuit comparing CAM sizes 3 and 12 for 8 and 20 decoding iterations. Note that reducing the CAM size from 12 to 3 does not introduce any performance loss when considering 20 decoding iterations. 
E. The Latency of the VN
The critical path in the VN is the one containing the Sorter block, because this block waits for the arrival of the last C2V message to start its processing. The latency is then determined by the latency of the Sorter, i.e., , plus a clock cycle for the adder and another one for the Min block. 
V. THE CHECK NODE PROCESSOR
The CN processor receives messages , performs its update based on the parity test described in (8) , and generates messages to be sent to the corresponding VNs. The processing of the received messages is executed according to the Forward-Backward algorithm [43] which splits the data processing into 3 layers of ECNs, as shown in Fig. 7 . The main advantage of this architecture is that it can be easily modified to implement different values of (i.e., to support different code rates).
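The Forward-Backward decomposition can be sketched in Python as follows. The sketch uses a full Min-Sum ECN on untruncated vectors and omits the multiplications by the parity-check coefficients, so it illustrates the structure of the computation rather than the implemented processor; the names `ecn` and `check_node` are hypothetical. The counter shows that a CN with d inputs uses 3(d - 2) ECN operations in this decomposition.

```python
def ecn(u, v):
    """Full Min-Sum ECN: out[a ^ b] = min(u[a] + v[b]) (symbols as integers)."""
    q = len(u)
    out = [float("inf")] * q
    for a in range(q):
        for b in range(q):
            out[a ^ b] = min(out[a ^ b], u[a] + v[b])
    return out

def check_node(inputs):
    """Forward-Backward CN update (behavioural sketch, d >= 3 inputs).

    inputs : list of d LLR vectors, one per connected VN
    Returns (outputs, ecn_count); outputs[k] combines all inputs except k.
    """
    d = len(inputs)
    if d < 3:
        raise ValueError("the decomposition needs at least three inputs")
    ecn_count = 0
    fwd = {0: inputs[0]}  # fwd[i] combines inputs 0..i
    for i in range(1, d - 1):
        fwd[i] = ecn(fwd[i - 1], inputs[i])
        ecn_count += 1
    bwd = {d - 1: inputs[d - 1]}  # bwd[i] combines inputs i..d-1
    for i in range(d - 2, 0, -1):
        bwd[i] = ecn(bwd[i + 1], inputs[i])
        ecn_count += 1
    outputs = [bwd[1]]
    for k in range(1, d - 1):  # merge layer
        outputs.append(ecn(fwd[k - 1], bwd[k + 1]))
        ecn_count += 1
    outputs.append(fwd[d - 2])
    return outputs, ecn_count
```

The forward, backward and merge loops correspond to the three layers of ECNs, and changing the number of inputs only changes the loop bounds, which reflects how the architecture adapts to different check node degrees.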
Each ECN receives two vector messages and , each one composed of (LLR,GF) couples, and outputs a vector message whose elements are defined by (8) [15] , [16] . This equation corresponds to extracting the minimum values of a matrix , defined as , for . In [16] , the authors propose the use of a sorter of size which gives a computational complexity and constitutes the bottleneck of the EMS algorithm. In order to reduce this computational complexity, two simplified algorithms were proposed [29] , [30] . In [29] the Bubble-Check algorithm simplifies the ECN processing by exploiting the properties of the matrix and by considering a two-dimensional solution of the problem. This results in a reduction of the size of the sorter, theoretically in the order of . It is also shown in [29] that no performance loss is introduced when considering a size of the sorter smaller than the theoretical one.
In [30] , the authors suppose that the most reliable symbols are mainly distributed in the first two rows and two columns of matrix and propose to use the so called L-Bubble Check which presents an interesting performance/complexity tradeoff for the EMS ECN processing. As depicted in Fig. 8 , the values in the sorter are initialized with the matrix values , , and only a maximum of values in are considered in the ECN processing. Simulation results provided in [30] showed that the complexity reduction introduced by the L-Bubble Check algorithm does not introduce any significant performance loss. For this reason, we adopt the L-Bubble Check algorithm for the implementation of the present NB-LDPC decoder.
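The restriction exploited by the L-Bubble Check can be illustrated with a simplified software model. The sketch below is not the four-register bubble sorter of [30]: it simply enumerates the candidates located in the first two rows and the first two columns of the sum matrix, sorts them, and keeps the smallest ones with distinct GF symbols. The function name `l_region_ecn` and the exhaustive sort are assumptions of this example, intended only to show which matrix entries are considered.

```python
def l_region_ecn(u, v, n_m):
    """ECN restricted to the L-shaped region of the sum matrix (sketch).

    u, v : input messages as lists of (LLR, symbol) couples sorted by LLR
    n_m  : number of couples wanted in the output message
    Returns up to n_m (LLR, symbol) couples with distinct symbols,
    sorted by increasing LLR.
    """
    candidates = []
    for i, (llr_u, sym_u) in enumerate(u):
        for j, (llr_v, sym_v) in enumerate(v):
            if i < 2 or j < 2:  # first two rows or first two columns only
                candidates.append((llr_u + llr_v, sym_u ^ sym_v))
    candidates.sort()

    output, seen = [], set()
    for llr, symbol in candidates:
        if symbol in seen:
            continue  # redundant GF symbol, discarded at the output
        seen.add(symbol)
        output.append((llr, symbol))
        if len(output) == n_m:
            break
    return output
```

For two inputs of length n the region contains at most 4n - 4 candidates instead of n squared, which is the source of the complexity reduction; the hardware sorter visits these candidates incrementally rather than sorting them all at once.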
A. The L-Bubble ECN Architecture
The L-Bubble ECN architecture is depicted in Fig. 9 . The input values are stored in two RAMs and to be read during the ECN processing. At each clock cycle, each RAM receives a new (LLR, GF) couple and outputs a couple from a predetermined address. The LLR values of the couples read from the RAMs are added and the associated GF symbols are Xored (added modulo 2) to generate an element that feeds the sorter. This sorter is composed of four registers (B@ind) with (from left to right), four multiplexers and one Min operator that outputs the (LLR, GF) couple having the minimum LLR value.
The values fetched from the memories are denoted by and , the values are named bubbles and feed the registers. The bubbles are tagged as follows: , , , . This addressing scheme is based on the position of the bubbles in the matrix. The complete ECN operation can be summarized as: 1) Read and from memories and . . However, redundant associated GF symbols may appear, which are deleted at the output of the ECN [16] . In order to compensate this redundancy, operations are performed in the ECN. Simulation results showed that the best performance/complexity trade-off is obtained for . The critical path of the CN processor is then imposed by the ECN computation composed of RAM access, an adder, two serial comparators, and an index update operation.
B. Multiplication and Division in
As described in Section II, the messages crossing the edges between VNs and CNs are multiplied by predetermined coefficients when entering the CN and divided by the same coefficients (i.e., multiplied by ) when leaving the CN towards the VN. In order to perform these multiplications in , we have designed two wired multipliers dedicated to perform the multiplication over . Each multiplier implemented on Virtex IV consumes 14 slices and operates at 900 MHz. The operands of the multiplier are the (respectively, the ) and the predefined coefficients stored in read only memory (ROM) called (respectively ). Each ROM contains a binary matrix, where each entry contains the six coefficients.
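For illustration, a multiplication over GF(64) can be written with shifts and XORs only, as in the following sketch. The primitive polynomial x^6 + x + 1 is an assumption of this example, since the polynomial used in the design is not stated here, and the function names are hypothetical. The inverse is computed by exponentiation for simplicity, whereas a hardware design would typically rely on pre-computed coefficients.

```python
PRIM_POLY = 0b1000011  # x^6 + x + 1, assumed for this example

def gf64_mul(a, b):
    """Multiply two GF(64) elements given as 6-bit integers."""
    result = 0
    for _ in range(6):
        if b & 1:
            result ^= a  # add the current shifted copy of a
        b >>= 1
        a <<= 1
        if a & 0b1000000:  # degree 6 reached: reduce modulo the polynomial
            a ^= PRIM_POLY
    return result

def gf64_inv(a):
    """Multiplicative inverse in GF(64), computed as a^62."""
    if a == 0:
        raise ZeroDivisionError("zero has no inverse in GF(64)")
    result = 1
    for _ in range(62):
        result = gf64_mul(result, a)
    return result
```

When one operand is a fixed coefficient, each output bit of the product is an XOR of a few input bits, which is why a wired multiplier needs only a small number of gates.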
C. Timing Specifications
This section describes the timing and scheduling details of the CN processor in the NB-LDPC EMS decoder. We first consider the scheduling at the ECN level and then at the CN processor, which is composed of three layers of serially concatenated ECNs. Fig. 10 depicts the operations executed in the ECN at each clock cycle (CC). In this figure, WM stands for write memory, RM for read memory, Ind upd for index update, and NV for non-valid output. The input data is represented by D and corresponds to two (LLR, GF) incoming couples. Finally, E represents the output (LLR, GF) couple.
1) ECN Timing Specifications:
The Sorter is represented by a vertical rectangle where a blank cell represents an empty register and a dark cell a filled one. At CC0, the vectors and receive their first inputs to be stored in the RAMs at CC1. At CC2, the stored messages are read, fed to the adder, and then to the sorter. As shown in Fig. 10, the first register is filled (dark cell) with the adder output and this (LLR, GF) couple directly goes to the output (E1) as it corresponds to the minimum LLR value. The latency of the ECN is 2 cycles. During the next three CCs, the ECN receives three new data couples and outputs three NV outputs. This 3-CC latency is denoted as sorter filling latency (SFL). After the SFL, at CC4, the four registers in the sorter are filled and the second valid data couple is output.
The number of cycles needed to generate valid outputs is then . However, due to the redundant symbols that may appear when adding two input messages in and , some extra cycles are allowed in order to guarantee the generation of different symbols. To be specific, we consider , as detailed in Section III-B2.
2) CN Timing Specifications:
The Forward-Backward implementation of the CN processor consists of three layers of serially concatenated ECNs (see Fig. 7). Let ECNe denote the ECN of layer , where the numbering is considered from left to right and top to bottom. The execution progress for each CC is depicted in Fig. 11. The inputs and (resp. and ) feed ECN (resp. ECN ). Note that only these two ECNs have both inputs directly connected to the RAMs. All the other ECNs have at least one input generated by an adjacent ECN. Because of the latency constraints of the ECN, ECN , and ECN provide their first output at CC2. These outputs activate ECN and ECN , which deliver their first output at CC4.
Note that each ECN is in SFL after the generation of its first output. This means that at each of the following three CCs, an State 4: Generating a valid output and the sorter is filled. At this state, all the generated outputs are valid.
The global CN execution is represented in Fig. 11. At each CC, the state of each ECN in the Forward/Backward architecture is indicated. For example, at CC0, no ECN is active (State 1). The decoding process of the whole CN is constrained by ECN and ECN . For these ECNs, the latency to output the first value is . The SFL then follows (i.e., 3 CCs) and during the next CCs, the rest of the message is output. The latency of the CN is then given by (17)
VI. PERFORMANCE AND COMPLEXITY
A. Decoding Throughput
We consider a GF order of 64 for the implementation of the NB-LDPC decoder. The following code lengths and rates are chosen for the decoder synthesis:
• symbols, ,
• symbols, ,
• symbols, ,
The decoding throughput of the architecture (in bits per second) is where is the number of cycles to decode a frame (see (10)) and is the clock frequency. For example, for = 192 symbols, = 2/3, and with , , , and , the latency values for the CN and VN processing are and clock cycles. The delay is clock cycles, which constitutes a maximum decoding latency of clock cycles to decode a frame and Mbps. Note that is the maximum decoding throughput assuming that there is a ping-pong input and output RAM to avoid idle times between the input loading of a new codeword and the output of a decoded one.
The serial architecture has been synthesized on a Xilinx Virtex4 XC4VLX200 FPGA. Table II presents the synthesis results 7 for three different frame lengths and code rates considering 8 decoding iterations and 6-bit quantization for input data (intrinsic LLR) as well as for the check-to-variable and variable-to-check messages. The proposed architecture can be easily adapted for any quasi-cyclic ultra-sparse (i.e., ) -LDPC code.
B. Emulation Results
To obtain performance curves in record time we have implemented the complete digital communication chain on an FPGA device. For this, the hardware description of the different parts of the digital communication chain is required, namely the source, the encoder, the channel and the decoder. The source generates random bits that are encoded, BPSK modulated, corrupted by additive white Gaussian noise (AWGN), then demodulated and decoded. To emulate the effect of AWGN in the baseband channel, we consider the Hardware Discrete Channel Emulator as in [46]. We use the Xilinx ML507 FPGA DevKit which contains a Virtex5. The PowerPC processor is available as hardcore IP in the FPGA and can be used for software development. For practical purposes, we developed a human machine interface (HMI) for the control of the emulation chain and the generation of performance curves. This HMI consists of a web server/FTP and its main advantage is being multiplatform, i.e., all the control can be done through a web server. More details about the emulator platform can be found in [47].
Table III summarizes the post-synthesis area results. LDPC-IP stands for the digital communication chain including the NB-LDPC decoder. The PowerPC is mainly implemented as hardcore IP, which explains why its cell requirement is negligible. The digital chain is a multi-clock system, where the LDPC-IP block is clocked at a frequency of 50 MHz. 8 We compared emulation and software throughputs for different scenarios (i.e., different code rates and frame lengths). The speedup factor between software simulation 9 and hardware emulation was greater than 100 for all cases. The performance results obtained with the hardware emulator platform were compared to the EMS and BP simulation results. The number of iterations for the BP was fixed to 100. Fig. 12 considers a frame length of symbols and a code rate . The curves show the good agreement between simulation and emulation results.
Also, a gain of about 0.5 dB can be obtained when increasing the number of iterations from 8 to 20. The emulation results show that no error floor appears (up to a FER of ). Note that the performance of the implemented decoder is within 0.5 dB of the BP performance.
7 These synthesis results do not include the ping-pong input and output RAM.
8 Note that the maximum frequency of the LDPC-IP block is 65 MHz. However, we select a frequency of 50 MHz because the design tools find a place-and-route solution faster for a system with lower frequency constraints.
9 Performed on an Intel Bi-Quad processor with 24 GB of RAM and 6144 MB of cache.
Figs. 13-14 consider with and symbols, respectively. They both confirm the good agreement between emulation and simulation, and show that the performance of the implemented decoder is within 0.7 dB of the BP performance. The generalization of the decoder to different frame lengths and code rates is thus also validated.
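The front end of the emulation chain described in this section (random source, BPSK mapping, AWGN, LLR computation and 6-bit quantization) can be illustrated in software. The following is only a minimal sketch under stated assumptions: it is uncoded, it works on bits rather than GF(64) symbols, and the function names, the quantizer step of 0.5 and the saturation rule are hypothetical choices that are not taken from the paper or from the emulator of [46], [47].

```python
import math
import random


def bpsk_awgn_llrs(bits, ebn0_db, rate, rng):
    """Map bits to BPSK, add AWGN, return channel LLRs (positive favours bit 0)."""
    ebn0 = 10.0 ** (ebn0_db / 10.0)
    # Noise standard deviation for unit-energy BPSK symbols at code rate `rate`.
    sigma = math.sqrt(1.0 / (2.0 * rate * ebn0))
    llrs = []
    for b in bits:
        x = 1.0 - 2.0 * b                    # bit 0 -> +1, bit 1 -> -1
        y = x + rng.gauss(0.0, sigma)        # AWGN channel
        llrs.append(2.0 * y / sigma ** 2)    # LLR of a BPSK symbol over AWGN
    return llrs


def quantize(llr, n_bits=6, step=0.5):
    """Uniform, saturating quantizer to a signed n_bits integer (assumed scheme)."""
    max_level = 2 ** (n_bits - 1) - 1        # 31 for 6 bits
    level = int(round(llr / step))
    return max(-max_level, min(max_level, level))


def hard_decision_ber(n_bits, ebn0_db, rate=0.5, seed=1):
    """Uncoded bit error rate measured on the quantized LLRs."""
    rng = random.Random(seed)
    bits = [rng.getrandbits(1) for _ in range(n_bits)]
    llrs = bpsk_awgn_llrs(bits, ebn0_db, rate, rng)
    quantized = [quantize(v) for v in llrs]
    # A negative quantized LLR is decided as bit 1, anything else as bit 0.
    errors = sum(1 for b, v in zip(bits, quantized) if (v < 0) != (b == 1))
    return errors / n_bits


if __name__ == "__main__":
    for snr_db in (0.0, 2.0, 4.0):
        print(snr_db, hard_decision_ber(20000, snr_db))
```

In a full chain, the quantized values would feed the computation of the intrinsic messages of the decoder, and the frame error rate would be counted after decoding instead of the uncoded bit error rate shown here.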
C. Comparison With Other NB-LDPC Decoder Implementations
Table IV compares the synthesis results presented in [23], [26], [27] with our approach. Note that the GF order and the decoding algorithm are not the same for each implementation, so the comparison is only approximate, but it allows us to place our work within the state of the art of NB-LDPC decoder implementations. In general, as we consider , a complexity increase and a significant performance gain are expected compared to [23], where , and [26], [27], where . The best speed-over-area ratio is obtained by the 31-parallel ASIC implementation in [27], where the authors propose a trellis-Min-max algorithm for the CN processing. However, a performance loss of about 0.1 dB is to be expected compared to -EMS decoding 10 . The serial implementation in [23] considers and results in a 1-Mbps throughput and a synthesis on a Virtex2P device that consumes 4660 slices. This area is taken as the reference for the normalized area comparison in Table IV. Considering BP decoding, the GF(64) decoder would lead to an increase of complexity from to (i.e., a factor of 64). However, as we consider the EMS algorithm (with ), the area is increased by only a factor of 4 for the serial GF(64) decoder and the performance is within 0.5 dB of the BP performance for . Note that the speed/area parameter is around 1 for [23], [26] and 0.74 for our design. As [23] and [26] consider GF orders of 8 and 32, respectively, while our work considers , this comparison shows the interest of our work in terms of performance/area/throughput trade-off. Moreover, the reduced area required by the serial architecture suggests that a more complex semi-parallel architecture can be implemented, increasing the throughput of the decoder. Also, some effort should be dedicated to increasing the maximum frequency of the design, knowing that the critical path is in the ECN.
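The two figures quoted in this comparison can be checked with a few lines of arithmetic. The sketch below rests on two assumptions that are not spelled out in the text as reproduced here: that the BP complexity grows with the square of the field order, and that the speed/area parameter of Table IV is the throughput in Mbps divided by the area normalized to the 4660 slices of [23]. The function names are hypothetical; the 2.95 Mbps throughput is the value given in the abstract and the area factor of 4 is the approximate value quoted above.

```python
def bp_complexity_ratio(q_ref, q_new):
    """Complexity ratio between two field orders, ASSUMING a q**2 scaling for BP."""
    return (q_new / q_ref) ** 2


def speed_over_area(throughput_mbps, normalized_area):
    """Throughput per unit of normalized area (assumed definition of the metric)."""
    return throughput_mbps / normalized_area


# GF(8) reference of [23] versus the GF(64) decoder under BP decoding.
print(bp_complexity_ratio(8, 64))                # 64.0

# Reference design of [23]: 1 Mbps for a normalized area of 1.
print(round(speed_over_area(1.0, 1.0), 2))       # 1.0

# Proposed serial decoder: 2.95 Mbps for roughly 4 times the reference area.
print(round(speed_over_area(2.95, 4.0), 2))      # 0.74
```

Under these assumptions, both the factor of 64 and the value of 0.74 are reproduced, which suggests that the metric is read correctly, although Table IV remains the authoritative source.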
While revising our paper, the work of [28] was published. There are many similarities between this work and ours: [28] uses the Bubble Check algorithm with the forward-backward implementation and both papers use a reduced-complexity VN processor. However, there are also significant differences: 1) in [28], the CN architecture is based on the Bubble-Check algorithm, while our CN architecture is based on the more efficient and simplified algorithm called L-Bubble Check; 2) [28] proposes an interesting pre-fetching technique that reduces the critical path of the Bubble Check; 3) the VN architecture in [28] uses the first values of the intrinsic message ( ) for both the computation of the V2C messages and the decision making, whereas in our work the VN architecture uses all 64 intrinsic values for the computation of the V2C message and only the first 3 values for the decision making. In terms of complexity, similar results are obtained for a rate-1/2 NB-LDPC decoder 11 . The (960,480) NB-LDPC decoder implemented in [28] 
D. Toward Decoding of NB-LDPC of High Field Order
Table V summarizes the complexity of the main components as a function of in the proposed architecture. Note that the Flag memory is the only component whose size scales with . As mentioned in Section IV-B, this Flag memory is used to determine whether a given intrinsic message belongs to the received messages. This task can also be performed with a content-addressable (associative) memory (CAM) of words of size . If we do so, all the elements in the architecture scale with , i.e., , except for the GF multiplier, which scales in but represents a small part of the overall decoder. In other words, doubling the size of the field order would only have a small impact on the architectural cost. Thus, the use of a CAM for the Flag memories opens the way to efficient decoding of high-order NB-LDPC codes, such as GF(256) or even higher.
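The difference between the two ways of testing whether a symbol belongs to a received message can be sketched as follows. This is an illustrative model only: the class names, the parameter names `q` (field order) and `n_m` (number of stored symbols), the word width of log2(q) bits and the value `n_m = 12` used in the example are assumptions introduced for this sketch and are not taken from Table V.

```python
class FlagMemory:
    """One flag bit per field element, so the storage grows linearly with q."""

    def __init__(self, q):
        self.flags = [False] * q

    def load(self, symbols):
        self.flags = [False] * len(self.flags)
        for s in symbols:
            self.flags[s] = True

    def contains(self, symbol):
        return self.flags[symbol]          # direct addressing by the symbol

    def size_bits(self):
        return len(self.flags)


class CamMemory:
    """Stores up to n_m symbols and is searched by content (assumed organisation)."""

    def __init__(self, q, n_m):
        self.word_bits = (q - 1).bit_length()   # bits needed to store one symbol
        self.n_m = n_m
        self.words = []

    def load(self, symbols):
        if len(symbols) > self.n_m:
            raise ValueError("more symbols than CAM words")
        self.words = list(symbols)

    def contains(self, symbol):
        # A hardware CAM compares all words in parallel; this loop models the result.
        return any(w == symbol for w in self.words)

    def size_bits(self):
        return self.n_m * self.word_bits


if __name__ == "__main__":
    received = [5, 17, 42]
    for q in (64, 4096):
        flag, cam = FlagMemory(q), CamMemory(q, n_m=12)
        flag.load(received)
        cam.load(received)
        print(q, flag.size_bits(), cam.size_bits(), flag.contains(17), cam.contains(18))
```

With these hypothetical sizes, the flag storage grows from 64 to 4096 bits when moving from GF(64) to GF(4096), whereas the content-searched storage grows from 72 to 144 bits, which illustrates the scaling argument made above without claiming to match the actual figures of the design.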
VII. CONCLUSION
This paper is dedicated to the architecture design of a GF(64) NB-LDPC decoder based on a simplified version of the EMS algorithm. The implementation is also generalized for other code rates and lengths and, in all cases, the hardware performance is within 0.7 dB of the BP decoding performance. The integration of the decoder in a hardware emulator platform provided emulation results showing that no error floor appears up to a FER of . A general comparison of our synthesis results with existing works shows the interesting performance/area/throughput trade-off of our design. Moreover, as highlighted in the previous section, replacing the Flag memory in the VN by a CAM of size , makes the architecture complexity scale in , (with ). In other words, decoding very high-order field NB-LDPC codes, such as GF(256) or even GF(4096), is feasible with the proposed architecture.
11 The implementation of a rate-2/3 decoder is not considered in [28].
12 Note that the size of the codeword does not have any impact on the processing hardware but only on the memory size.
From this work we can draw important conclusions about the implementation of EMS-like algorithms for NB-LDPC. First, the design of the VN is as complex as the design of the CN, even though most papers in the literature focus on the CN implementation, which is considered the bottleneck of the decoder. Note that the high complexity of the VN is due to the use of ordered lists to represent the messages, which constitutes a high overhead cost. Second, many computations in the CN are useless: among the inputs, fewer than are used in the output. It should therefore be possible to decrease the number of computations needed in the CN to generate an output. To conclude, efficient decoding of NB-LDPC is still an open field. Other techniques should be invented to represent messages and/or to process parity-checks and variable updates.
ACKNOWLEDGMENT
This work is supported by INFSCO-ICT-216203 DAVINCI "Design And Versatile Implementation of Non-binary wireless Communications based on Innovative LDPC Code" (www.ictdavinci-codes.eu), funded by the European Commission under the Seventh Framework Program (FP7). The work has also used resources of the CPER PALMYRE II, with funding from FEDER and the Brittany region. The authors would also like to thank Dr. Yvan Eustache for the synthesis and emulation results.
