Non-binary low-density parity-check (NB-LDPC) codes have shown superior error-correcting performance, but their high decoding complexity and low throughput have kept them from practical applications. This paper presents a novel architecture for a high-throughput, low-complexity irregular quasi-cyclic NB-LDPC decoder over GF(16) based on the Extended Min-Sum (EMS) algorithm. Two clocks are used: the low-frequency clock at 60 MHz serves as the system clock, and the high-frequency clock at 480 MHz drives the check nodes and variable nodes, so that they can be reused 8 times during one system clock period. As a result, the complexity is greatly reduced. Synthesis results show that the throughput reaches 68.57 Mbps at 5 iterations. FPGA test results show that the decoder has little performance degradation compared with its floating-point model and provides about 0.5 dB coding gain over a binary LDPC code of the same block length. The proposed architecture can be conveniently extended to higher-order Galois fields such as GF(64) or GF(256), and it can be applied to different code lengths and rates by modifying a small part of the decoder. Compared with previous works, the proposed decoder is more efficient for practical applications.
Introduction
Non-binary low-density parity-check (NB-LDPC) codes over high-order Galois fields GF(q) offer better performance than binary codes when the block length is short or moderate. Compared with their binary counterparts (B-LDPC), NB-LDPC codes can efficiently correct burst errors and come closer to the Shannon limit with a good error floor [1].
For decoding, the belief-propagation (BP) algorithm is commonly adopted for NB-LDPC codes in order to approximate maximum-likelihood decisions. Because vectors of $q$ messages need to be processed, the complexity is much higher than in the binary case, incurring a long latency and a much degraded throughput. The complexity is dominated by $O(q^2)$. Several methods have been proposed to reduce the decoder complexity. In [2], the FFT-BP algorithm, which is based on frequency-domain computation, is presented to reduce the complexity to $O(q \log_2 q)$, but it still requires a large number of multiplications, which makes it unsuitable for FPGA or ASIC implementation. The EMS algorithm was first proposed by Declercq and Fossorier in [2]; in it, each symbol consists of $n_m$ vectors, where $n_m \ll q$. The complexity of the EMS algorithm is dominated by $O(n_m \log n_m)$, and it provides a new trade-off between complexity and performance for NB-LDPC decoders. In [3], the Min-Max (MM) algorithm is presented to further reduce the decoder's complexity, but its performance is seriously degraded, so the MM algorithm is not adopted in this paper.
In [4], the forward-backward (F-B) architecture is proposed for implementing the EMS algorithm. In this method, a check node message update is built from several similar elementary check units (ECUs) to reduce the computational complexity. However, each ECU processes just two symbols per system cycle, which means $3(d_c - 2)$ ECU computations are needed for one check node (CN) to variable node (VN) update, denoted C2V. In this paper, we introduce a higher-frequency clock for the ECU, running at 8 times the system clock, so that just one ECU is required for a C2V update. We further apply this higher-frequency clock to the variable node to check node (V2C) update to reduce the complexity. Based on this architecture, a high-throughput, low-complexity NB-LDPC decoder is presented. For demonstration, a (324,648) NB-LDPC decoder over GF(16) is implemented on a Virtex-5 FPGA platform. The core consumes only 4547 slice registers and 12356 slice LUTs, and operates at a 60 MHz system clock and a 480 MHz ECU clock.
This paper is organized as follows: Section 2 introduces the F-B EMS algorithm. Section 3 describes the architecture of the proposed decoder as well as its ECN operation schedule. Section 4 gives the performance of the decoder and its resource consumption. Section 5 concludes the paper.
Extended MIN-SUM algorithm
In general, the complexity of an NB-LDPC decoder is dominated by $O(q^2)$. The EMS algorithm, however, considers only $n_m$ elements for one symbol. The vector messages are truncated from $q$ to $n_m$ elements ($n_m \ll q$), and the complexity is reduced to $O(n_m \log n_m)$.
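The truncation step can be illustrated with a minimal sketch. The function name `truncate_message` and the example LLR values are hypothetical and are not taken from the decoder described in this paper; the sketch only assumes that a message is a length-$q$ LLR vector indexed by symbol.

```python
from typing import List, Tuple


def truncate_message(llr: List[float], n_m: int) -> List[Tuple[int, float]]:
    """Keep the n_m most reliable (symbol, LLR) pairs of a length-q LLR vector."""
    if not 0 < n_m <= len(llr):
        raise ValueError("n_m must satisfy 0 < n_m <= q")
    # Pair each GF(q) symbol index with its LLR and sort by LLR, largest first.
    ranked = sorted(enumerate(llr), key=lambda pair: pair[1], reverse=True)
    return ranked[:n_m]


# Hypothetical LLRs for one GF(16) symbol (q = 16), for illustration only.
example_llr = [0.0, -3.1, -0.4, -7.2, -5.5, -1.3, -9.0, -2.2,
               -6.1, -4.8, -8.3, -0.9, -3.7, -2.9, -5.0, -7.7]
truncated = truncate_message(example_llr, n_m=4)
```

With $q = 16$ and $n_m = 4$, each message shrinks from 16 entries to 4 (symbol, LLR) pairs, already sorted in descending LLR order.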
In the EMS algorithm, a message in the Tanner graph consists of $n_m$ vectors, and each vector consists of a possible symbol and its corresponding log-likelihood ratio (LLR) [7]. The $n_m$ vectors are sorted in descending order of LLR. In this paper, we use V and P to represent the C2V and V2C vector messages respectively. For $1 \le i \le n_m$, llr0[i] and sym0[i] denote the $n_m$ largest LLR values from the channel and their corresponding symbols; $v_{llr}(i)$ and $v_{sym}(i)$ denote the $n_m$ largest LLR values of the C2V message and their corresponding symbols; $p_{llr}(i)$ and $p_{sym}(i)$ denote the $n_m$ largest LLR values of the V2C message and their corresponding symbols [8, 9].
The EMS algorithm can then be described as follows:
1) Initialization
In this step, the channel values are stored into the LLR memory and the V2C message buffers are also loaded with the channel values. The initial V2C message of a check node is then given by:
Here $a_i$ denotes the $i$-th possible symbol together with its corresponding LLR value.
2) C2V update
For every check node of degree $d_c$, each C2V message is calculated from the other $d_c - 1$ V2C messages, which we denote as $P_1(a) \sim P_{d_c-1}(a)$. The update equation is given by:
3) V2C update
For every variable node of degree $d_v$, each V2C message is calculated from the other $d_v - 1$ C2V messages and the channel values. We denote the other $d_v - 1$ C2V messages as $V_1(a) \sim V_{d_v-1}(a)$. The vectors used for the decision are also calculated at this stage; we denote them by S. The update equations are as follows:
4) Decision
We use $\hat{C} = [C_1\ C_2\ C_3 \ldots C_N]$ to represent the decided code-word for the current iteration. Here
Repeat 2) ∼ 4) until $H^T \ast \hat{C} = 0$ or the maximum iteration number has been reached.
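The stopping test can be sketched in software as follows. This is an illustrative sketch only: the primitive polynomial $x^4 + x + 1$, the toy parity-check row, and the names `gf_mul` and `syndrome` are assumptions, since the text does not specify them, and the check is written in the usual convention that every parity-check row must sum to zero over GF(16).

```python
from typing import List

Q = 16
PRIM_POLY = 0b10011  # x^4 + x + 1 (assumed; the polynomial is not given in the text)

# Exponent/log tables: EXP[i] = alpha^i and LOG[alpha^i] = i.
EXP = [0] * (Q - 1)
LOG = [0] * Q
value = 1
for i in range(Q - 1):
    EXP[i] = value
    LOG[value] = i
    value <<= 1
    if value & Q:  # degree reached 4, so reduce modulo the polynomial
        value ^= PRIM_POLY


def gf_mul(a: int, b: int) -> int:
    """Multiply two GF(16) elements using the log/antilog tables."""
    if a == 0 or b == 0:
        return 0
    return EXP[(LOG[a] + LOG[b]) % (Q - 1)]


def syndrome(H: List[List[int]], c: List[int]) -> List[int]:
    """Return one GF(16) syndrome symbol per parity-check row."""
    result = []
    for row in H:
        acc = 0
        for h, sym in zip(row, c):
            acc ^= gf_mul(h, sym)  # addition in GF(2^m) is bitwise XOR
        result.append(acc)
    return result


def is_codeword(H: List[List[int]], c: List[int]) -> bool:
    """Stopping criterion: all syndrome symbols are zero."""
    return all(s == 0 for s in syndrome(H, c))


# Toy single-row check matrix (hypothetical, not the code used in this paper).
H_toy = [[1, 2, 3]]
valid = is_codeword(H_toy, [2, 4, 6])
invalid = is_codeword(H_toy, [2, 4, 7])
```

In this toy example `valid` is true and `invalid` is false, so decoding would stop early on the first word and continue iterating on the second.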
For the implementation of the check node unit (CNU) or variable node unit (VNU), the F-B structure is commonly adopted. Fig. 1 depicts the F-B based CNU with $d_c = 6$. The CNU can be implemented with $3(d_c - 2)$ ECUs.
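The $3(d_c - 2)$ count can be checked with a small software sketch of the forward-backward schedule. The function `forward_backward` and its `ecu` argument are hypothetical names; the elementary operation is passed in as a placeholder, and the example uses plain XOR on integers in place of a real ECU working on truncated messages.

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def forward_backward(inputs: List[T],
                     ecu: Callable[[T, T], T]) -> Tuple[List[T], int]:
    """Combine all inputs except the i-th for every i; also count ECU calls."""
    d_c = len(inputs)
    if d_c < 3:
        raise ValueError("this sketch assumes d_c >= 3")
    calls = 0

    def step(a: T, b: T) -> T:
        nonlocal calls
        calls += 1
        return ecu(a, b)

    # forward[i] combines inputs[0..i]; d_c - 2 ECU calls.
    forward = [inputs[0]]
    for i in range(1, d_c - 1):
        forward.append(step(forward[-1], inputs[i]))

    # After the reversal, backward[k] combines inputs[k+1..d_c-1]; d_c - 2 ECU calls.
    backward = [inputs[-1]]
    for i in range(d_c - 2, 0, -1):
        backward.append(step(inputs[i], backward[-1]))
    backward.reverse()

    # Merge stage: the two edge outputs are free, the rest need d_c - 2 ECU calls.
    outputs = [backward[0]]
    for i in range(1, d_c - 1):
        outputs.append(step(forward[i - 1], backward[i]))
    outputs.append(forward[d_c - 2])
    return outputs, calls


# Stand-in inputs: single GF(16) symbols combined by XOR, with d_c = 6.
outs, n_calls = forward_backward([1, 2, 4, 8, 3, 5], lambda a, b: a ^ b)
```

For $d_c = 6$ the sketch performs 12 elementary operations, matching $3(d_c - 2)$: four in the forward pass, four in the backward pass and four in the merge stage.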
Proposed architecture of decoder
The CN processor receives $d_c$ V2C messages, performs its update and generates $d_c$ C2V messages to be sent to the corresponding $d_c$ VNs. The CN processor is mainly based on the ECNs described above. In an ECN, two input messages are fed into the module, which generates the $n_m$ largest candidates with different symbols. In [6], the authors proposed the Bubble Check algorithm for ECN implementation. That algorithm reduces the complexity of the ECN without introducing significant performance loss, but because it is based on a serial sorter its throughput is hard to improve, and when the Galois field order is high, such as GF(256), its performance loss becomes evident compared with the conventional EMS algorithm.
In this paper, we propose an in-parallel-out-serial (IPOS) architecture to construct a low-complexity, high-throughput NB-LDPC decoder. The system clock is 60 MHz and the ECN clock is optimized to reach 480 MHz through synthesis tools and careful design. Fig. 2(a) shows the structure of the proposed ECN with $n_m = 4$. The merge unit selects the symbol with the larger LLR when the input messages carry different symbols; when input message symbols are equal, the output LLR is the sum of all elements. The comparison unit sorts the input messages in descending order. The ECN unit outputs the $n_m = 4$ elements with the largest LLRs. The ECN unit itself has a fully parallel structure, but as a part of the CN it works serially during the CN update period, hence the name IPOS architecture. With this strategy, only one ECN is needed to complete a CN update, so the overall complexity of the decoder is greatly reduced without throughput degradation.
Fig. 2(b) depicts the working schedule of the IPOS CN structure in our decoder. The input signal "Load" serves as the flag that starts the CN core. The ECN unit is controlled by an FSM, and 14 high-frequency clock cycles are needed for a CN update. After that, all the results are written into the message memory for the upcoming VN update.
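The input/output behaviour of an ECN, producing the $n_m$ largest candidates with different symbols from two input messages, can be sketched with an exhaustive software reference. This is not the IPOS hardware structure nor the Bubble Check algorithm: the name `ecn_reference` is hypothetical, the sketch assumes that GF(16) symbols add by bitwise XOR and that candidate LLRs add, and it keeps only the largest LLR for each output symbol.

```python
from typing import List, Tuple

Message = List[Tuple[int, float]]  # (symbol, LLR) pairs, largest LLR first


def ecn_reference(u: Message, v: Message, n_m: int) -> Message:
    """Exhaustive reference: the n_m best candidates with distinct symbols."""
    # Every pair of inputs yields one candidate: symbols add in GF(2^m)
    # (bitwise XOR) and the reliabilities (LLRs) add.
    candidates = [(su ^ sv, lu + lv) for su, lu in u for sv, lv in v]
    candidates.sort(key=lambda cand: cand[1], reverse=True)

    output: Message = []
    seen = set()
    for sym, llr in candidates:
        if sym in seen:
            continue  # a more reliable candidate for this symbol was already kept
        seen.add(sym)
        output.append((sym, llr))
        if len(output) == n_m:
            break
    return output


# Hypothetical truncated input messages with n_m = 4.
u_msg = [(3, 0.0), (7, -1.0), (1, -2.5), (12, -4.0)]
v_msg = [(5, 0.0), (2, -0.5), (9, -3.0), (14, -3.5)]
ecn_out = ecn_reference(u_msg, v_msg, n_m=4)
```

The reference evaluates all $n_m^2$ candidate pairs, which is what a hardware ECN is designed to avoid; it is useful mainly as a bit-accurate model against which a hardware implementation could be compared.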
As shown in Fig. 3(a), the overall implementation of the decoder mainly consists of the following components: Control Unit, LLR Memory, Message Memory between CNs and VNs, Transposition Unit and Inverse Transposition Unit, CN Unit, VN Unit and Decision Unit. The Control Unit makes the other modules work in the proper sequence. In the first stage, it loads the channel values into the LLR Memory. The LLR Memory stores the channel values corresponding to each bit of the code-word. The Message Memory stores the temporary results exchanged between CNs and VNs. The Message Memory is initialized to zero and the LLR Memory is initialized to the channel values.
After the initialization of the decoder, the messages from the LLR Memory are read and fed into the Processing Array to perform the EMS decoding algorithm. The Decision Unit checks whether the decoded code-word is valid. When the maximum iteration number has been reached, the algorithm stops and the current code-word is output to the ports. Note that a message crossing an edge between a VN and a CN should be multiplied by the check matrix coefficient $h_{j,k} = \alpha^{a_{jk}}$ before entering the CN, and should be divided by the same coefficient, that is, multiplied by $h_{j,k}^{-1} = \alpha^{q-1-a_{jk}}$, when leaving the CN for the VN. These functions are implemented
Implementation results and performance analysis
This section analyzes the implementation results of the proposed NB-LDPC decoder and compares them with other works, using a (324,648) QC-NB-LDPC code over GF(16). Its parity-check matrix can be divided into $3 \times 6$ sub-matrices of dimension $27 \times 27$. It is an irregular code, chosen to achieve better decoding performance. The proposed EMS decoder is implemented on a Virtex-5 FPGA platform with $n_m = 4$. The decoder consumes only 4547 slice registers and 12356 slice LUTs, and operates at a 60 MHz system clock and a 480 MHz ECU clock. The throughput reaches 68.57 Mbps at 5 iterations. The bit error rate (BER) of the proposed decoder, measured by FPGA testing, is shown in Fig. 3(b), and the performance of a binary LDPC code of the same length is also provided for comparison. The results show that our decoder has little performance degradation compared with its floating-point model and provides about 0.5 dB coding gain over a binary LDPC code of the same block length decoded with the belief propagation (BP) algorithm. There is also little difference between the MATLAB simulation and the FPGA emulation. For a fair comparison, we propose a normalized hardware efficiency $p$ to represent decoder quality, $p = (\text{Throughput}/\text{Total Gates}) \ast n_m \log n_m$, since the throughput already takes the code length, code rate, iteration number and field order $q$ into account and the basic complexity is dominated by $O(n_m \log n_m)$.
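The arithmetic behind the code parameters and the efficiency metric can be sketched as follows. The function names are hypothetical, the sketch assumes that (324,648) counts information bits and code bits and that the parity-check matrix has full rank, and it takes the logarithm in $p$ to base 2 because the base is not stated.

```python
import math
from typing import Dict


def code_dimensions(block_rows: int, block_cols: int,
                    block_size: int, q: int) -> Dict[str, int]:
    """Sizes of a QC code whose check matrix is a grid of square blocks over GF(q)."""
    bits_per_symbol = q.bit_length() - 1        # q = 2^m, so m bits per symbol
    n_symbols = block_cols * block_size         # code-word length in GF(q) symbols
    m_symbols = block_rows * block_size         # number of parity-check rows
    n_bits = n_symbols * bits_per_symbol
    # Assumes the check matrix has full rank, so every row removes one symbol.
    k_bits = (n_symbols - m_symbols) * bits_per_symbol
    return {"n_symbols": n_symbols, "m_symbols": m_symbols,
            "n_bits": n_bits, "k_bits": k_bits}


def hardware_efficiency(throughput: float, total_gates: float, n_m: int) -> float:
    """p = (Throughput / Total Gates) * n_m * log2(n_m); the log base is assumed."""
    return (throughput / total_gates) * n_m * math.log2(n_m)


dims = code_dimensions(block_rows=3, block_cols=6, block_size=27, q=16)
```

Under these assumptions the $3 \times 6$ grid of $27 \times 27$ blocks gives 162 symbols of 4 bits each, that is, 648 code bits and 324 information bits, which is consistent with the (324,648) notation and a code rate of 1/2.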
In Table I, the comparison of the proposed decoder with other works shows the merit of our architecture in terms of the performance/complexity/throughput trade-off. Compared with [10, 11, 12], our decoder achieves better performance at a lower resource cost, and it provides high throughput for practical applications.
Conclusion
In this paper, an IPOS-structured ECN is proposed to construct a low-complexity, high-throughput partially parallel QC-NB-LDPC decoder based on the F-B EMS algorithm, and two clocks (60 MHz for the system and 480 MHz for the CN/VN) are used to reduce the number of ECNs. Synthesis results show that the throughput reaches 68.57 Mbps at 5 iterations. FPGA test results show that the decoder has little performance degradation compared with its floating-point model and provides about 0.5 dB coding gain over a binary LDPC code of the same block length. The structure can be applied to different code lengths and rates by modifying a small part of the decoder. The hardware efficiency of the proposed decoder architecture is much higher than that of previous comparable decoder architectures in the open literature, making it more efficient for practical applications.
