Abstract-In this brief, a method for compressing the messages between check nodes and variable nodes is proposed. This method is named compressed nonbinary message passing (CNBMP). CNBMP reduces the number of messages exchanged between one check node and the connected variable nodes from d_c × q to 5 × q, which has a large impact on the decoder: the storage and routing areas are reduced and the throughput is increased. Unlike other methods, CNBMP does not introduce any approximation or modification in the information, and the processed operations are exactly the same as those of the original decoders; hence, no performance degradation is introduced. To demonstrate its advantages, an architecture applying CNBMP to the trellis min-max algorithm was derived, showing that most of the storage resources are also reduced from d_c × q to 5 × q. This architecture was implemented for an (837, 726) nonbinary low-density parity-check code using a 90-nm CMOS technology, reaching a throughput of 981 Mb/s with an area of 10.67 mm², which is 3.9 times more efficient than the best solution found in the literature.
I. INTRODUCTION
The two main bottlenecks of nonbinary low-density parity-check (NB-LDPC) decoder architectures are the storage resources and the maximum throughput. In spite of their significant benefits, such as better behavior in the error floor region and more robust correction of burst errors, NB-LDPC codes cannot compete with their binary counterparts in terms of complexity or throughput/area efficiency.
Several alternatives to the original Q-ary sum-of-product algorithm [1] were proposed during the last decade to achieve the best possible correction performance while reducing complexity. The most remarkable ones are the extended min-sum (EMS) [2] and min-max (MM) [3] algorithms, which reduced the complexity of the check node processor and the storage resources. However, a parallel implementation of these algorithms was prohibitive in terms of arithmetic resources and of wiring between the check node and variable node processors. For this reason, all the architectures derived from these two algorithms applied the forward-backward metrics, which consist of a serial computation of the check node information. All the decoders based on the forward-backward metrics suffer from a very large number of clock cycles per iteration, limiting the maximum throughput to a few megabits per second [4].
To increase the degree of parallelism while keeping the same error correction, a new version of the EMS algorithm named trellis-EMS (T-EMS) was proposed in [5]. This method allowed hardware designers to implement a fully parallel check node in a layered architecture [6]. This implementation did not sacrifice efficiency in terms of throughput/area compared with other serial implementations based on trellis [7] and increased throughput more than three times. Further improvements were introduced with trellis min-max (T-MM) in [8]. Despite this, the decoder from [8] required 14.7 mm² of area with a 90-nm CMOS process and reached a throughput of 660 Mb/s, which is far from the results of modern binary LDPC decoders for the same technology (9.6 mm², 45.42 Gb/s) [9]. While binary architectures exchange only a number of messages equal to the degree of the check node (d_c) between the check node and the variable node, nonbinary decoders require q times more wires/connections; the same happens for the memories and registers, which account for about 80% of the decoder's area. In this brief, a method for reducing the number of messages exchanged in nonbinary decoders between the check node and the variable node is introduced. This method neither changes the computation of the decoding algorithm nor reduces the information transferred between nodes, so it does not introduce any performance degradation. The proposal compresses the information transmitted in the message passing, reducing the size of the messages from d_c × q to 5 × q. This has a great impact on both area and throughput, especially for high-rate codes. For example, an implementation for the same code as in [7] and [8] achieves 981 Mb/s of throughput with an area of 10.6 mm² for a 90-nm CMOS process.
The rest of this brief has four sections. Section II includes a summary of the NB-LDPC message-passing of the decoding algorithms. Section III describes the proposal of this brief. Section IV shows the impact of the new message-passing in a hardware implementation and compares the results with other existing architectures. Section V outlines the conclusions.
II. NONBINARY LDPC MESSAGE PASSING
Let H be the M × N parity check matrix with coefficients h_{i,j} ∈ GF(q) that defines an (N, K) NB-LDPC code. N(m) and M(n) are the sets of positions of the nonzero elements of a row m (check node) and a column n (variable node), respectively. The sizes of the sets N(m) and M(n) are the degree of the check node (d_c) and the degree of the variable node (d_v), respectively. The degrees d_c and d_v represent the number of messages that each check node and variable node receive, respectively. The set of messages from check node to variable node is denoted by R and the set of messages from variable node to check node by Q. Each of these messages consists of q elements, because the operations are performed over GF(q). The method for computing each of these sets depends on the decoding algorithm applied. The algorithms that provide the best performance with the lowest complexity are T-EMS and T-MM, which differ in the processing at the check node but share the same operations at the variable node. For a better understanding of the message passing between the check node and the variable node, a short explanation of the basic operations performed in the check node is included next. For more details about the different decoding processes, we refer the reader to [5] and [8].
In addition, to perform parallel processing of the check node, we will assume delta domain [5], [6] messages as inputs and outputs at the check node.
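The sets N(m) and M(n) and the degrees d_c and d_v defined above can be illustrated with a minimal Python sketch. The function name `check_and_variable_sets` and the toy matrix are hypothetical; the GF(q) entries are written as plain integers, since only their being zero or nonzero matters here.

```python
def check_and_variable_sets(H):
    """Return (N, M) for a parity check matrix H given as a list of rows.

    N[m] lists the column indexes with a nonzero entry in row m (check node m).
    M[n] lists the row indexes with a nonzero entry in column n (variable node n).
    """
    rows, cols = len(H), len(H[0])
    N = [[n for n in range(cols) if H[m][n] != 0] for m in range(rows)]
    M = [[m for m in range(rows) if H[m][n] != 0] for n in range(cols)]
    return N, M


# Toy 2 x 4 matrix (hypothetical); integers stand in for GF(4) elements.
H = [
    [1, 2, 0, 3],
    [0, 1, 3, 2],
]
N, M = check_and_variable_sets(H)
d_c = [len(s) for s in N]  # degree of each check node
d_v = [len(s) for s in M]  # degree of each variable node
```

Each check node m then exchanges one message of q elements with every variable node in N[m], which is where the d_c × q figure comes from.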
1063-8210 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Let Q be the set of d_c messages from the variable node in the delta domain defined as
Each element Q_{m,n} includes the likelihood of being the symbol
The output messages of the check node in the delta domain are also of length d_c
The likelihood of each symbol satisfying the parity check equation of the check node is defined as
To compute the reliability of each of the q symbols in a single message, the check node update equations consider the combinations of the most reliable input messages. If only the two most reliable messages per symbol are considered, the update rules for the check node follow the next conditions.
1) If the input likelihood of the symbol α_x for the edge {m, n} is not the most reliable for α_x nor is considered to compute other
Being
2) If the input likelihood of the symbol α_x for the edge {m, n} is the most reliable for α_x, R_{m,n}(α_x) takes the value of the second most reliable message
3) If the input likelihood of the symbol α_x for the edge {m, n} is involved in the output reliability of α_y, R_{m,n}(α_x) takes the value of the most reliable message
To reduce the number of operations at the check node and share results, a set that includes common computation was proposed in [5], and defined as
where each element from the set P_m includes the two most reliable input values from α_x. Based on the set P_m, an extra set is computed in [5]. This set includes the values from R_{m,n}(α_x) in (5). The set is defined as follows:
In spite of the definition of the extra set, the output messages of the check node are R_{m,n}, which is a set of size q × d_c.
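The idea of keeping, for every symbol, only the two most reliable input values can be sketched as follows. This is an illustrative sketch, not the check node of [5] or [8]: the function name `two_most_reliable` is hypothetical, and it assumes the min-max convention in which a smaller value means a more reliable symbol.

```python
def two_most_reliable(Q):
    """Q[n][x] is the delta-domain input value of symbol index x on edge n.

    Returns one tuple per symbol: (most reliable value, second most reliable
    value, edge index of the most reliable value).
    Assumption: smaller value = more reliable.
    """
    d_c, q = len(Q), len(Q[0])
    result = []
    for x in range(q):
        # Sort the d_c candidates for symbol x by value, ties broken by edge index.
        candidates = sorted((Q[n][x], n) for n in range(d_c))
        (m1, n1), (m2, _) = candidates[0], candidates[1]
        result.append((m1, m2, n1))
    return result


# Toy input with d_c = 3 edges and q = 4 symbols (hypothetical values).
Q = [
    [0, 5, 2, 7],
    [0, 1, 6, 3],
    [0, 4, 2, 9],
]
minima = two_most_reliable(Q)
```

A hardware check node would find these minima with comparator trees rather than a sort; the sort is used here only for brevity.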
III. COMPRESSED NONBINARY MESSAGE-PASSING (CNBMP)
With the aim of reducing the size of the sets that make up the messages shared between the check node and the variable node, we propose a new ordering of the information. With these new sets, the number of messages exchanged between the check node and the variable node is reduced considerably and the set R_{m,n} is easily derived at the variable node. We name this method CNBMP.
First, we define the set C_m as follows:
Each N_x(m) element contains the index n of the edge {m, n} for the symbol α_x in which R_{m,n} is not updated following (5):
Considering that the sets E_m and P_m are computed, message R_{m,n} can be recovered at the variable node following the equations:
It is important to remark the following.
1) Whether or not CNBMP is applied, the sets P_m and E_m are computed for reasons of computational efficiency [5], so no extra operation is added.
2) It can be demonstrated that the values of the messages R_{m,n} are exactly the same applying (5)-(8) or (16)-(18), so in terms of error correction performance, CNBMP is equivalent to the original T-EMS or T-MM algorithms, as it does not include any approximation.
Note that, when CNBMP is applied, the output information of the check node is formed by the set E_m, which contains q elements, and the sets C_m and P_m, which contain 2 × q elements each. Therefore, the total cardinality of the output information is 5 × q, unlike previous proposals found in the literature.
To sum up, the check node with CNBMP does not compute (5)-(8), but (16)-(18). In addition, the message passing consists of the sets C_m, P_m, and E_m, not of R_{m,n}, which is of size d_c × q, as shown in Fig. 1.
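The lossless nature of the compressed representation can be illustrated with a small round-trip sketch. It rests on stated assumptions and does not reproduce the recovery equations themselves: for each symbol, at most two edges carry a value different from the extra value E_m(α_x), the two index slots of C_m are paired one-to-one with the two value slots of P_m, and the helper `compress` is only a stand-in that builds sets of the right shape from a full R (in the decoder, the check node produces E_m, P_m, and C_m directly). All function names are hypothetical.

```python
def compress(R, E):
    """Build (C, P) from a full output R[n][x] and per-symbol defaults E[x].

    Assumption: for each symbol, at most two edges differ from E[x].
    Unused slots point at edge 0 and carry E[x], so they change nothing.
    """
    d_c, q = len(R), len(R[0])
    C, P = [], []
    for x in range(q):
        exceptions = [n for n in range(d_c) if R[n][x] != E[x]]
        if len(exceptions) > 2:
            raise ValueError("more than two exception edges for symbol %d" % x)
        idx, val = [0, 0], [E[x], E[x]]
        for slot, n in enumerate(exceptions):
            idx[slot] = n
            val[slot] = R[n][x]
        C.append(idx)
        P.append(val)
    return C, P


def decompress(E, P, C, d_c):
    """Rebuild R at the variable node: default to E[x], then patch two edges."""
    q = len(E)
    R = [[E[x] for x in range(q)] for _ in range(d_c)]
    for x in range(q):
        for slot in (1, 0):  # slot 0 is written last, so it wins on a tie
            R[C[x][slot]][x] = P[x][slot]
    return R


# Toy check-node output with d_c = 6 edges and q = 3 symbols (hypothetical).
E = [5, 6, 7]
R = [[5, 6, 9], [2, 6, 7], [5, 1, 7], [3, 6, 7], [5, 6, 7], [5, 6, 7]]
C, P = compress(R, E)
recovered = decompress(E, P, C, d_c=len(R))
sent = len(E) + sum(len(c) for c in C) + sum(len(p) for p in P)  # 5 * q
full = len(R) * len(R[0])                                        # d_c * q
```

In this toy case `sent` is 15 elements against 18 for the full R; the gap widens as d_c grows, because the compressed size does not depend on d_c.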
IV. HARDWARE IMPACT OF CNBMP
The first improvement in the hardware architectures of NB-LDPC decoders is the reduction of the wiring. According to the implementation reports, the maximum frequency of the decoder is not limited by the logic depth, but by the length of the wiring and the routing congestion. Therefore, if CNBMP is applied, the number of wires between the check node and variable node processors is reduced and, hence, routing congestion is mitigated. The resulting wiring reduction, λ, is shown in Fig. 2, assuming that the messages at the check node are quantized with Q_b bits and that the set C_m requires log2(d_c) bits to represent each index n. As shown next, this reduction of the routing improves the maximum frequency.
The second improvement is in terms of storage resources. To perform the layered schedule, the decoder must store, in registers or memories, the information from the check node in the previous iteration in order to compute the extrinsic information. Therefore, M addresses with a word size equal to the size of the output messages from the check node are required. As previously explained, the size of the output messages without CNBMP is d_c × q × Q_b bits and the size with CNBMP is 3 × q × Q_b + 2 × q × log2(d_c) bits, so the reduction in storage resources is also λ (Fig. 2). Note that CNBMP is especially advantageous for high-rate codes, where d_c is very large. However, even with low- and medium-rate codes there are significant improvements, since the only requirement to obtain some complexity reduction is that d_c > 5. To decompress the messages at the variable node, comparators and multiplexers implement the conditions from (16)-(18) to select whether E_m(α_x), P_m^0(α_x), or P_m^1(α_x) is applied to update R_{m,n}(α_x).

In Table I, we include the hardware results of the best architectures for NB-LDPC decoding and the results of our layered T-MM decoder with CNBMP. The code under test is, for all the decoders, the (N = 837, K = 726) NB-LDPC code over GF(32), with d_c = 27 and d_v = 4 [13]. Cadence RTL Compiler was used for synthesis and SoC Encounter for place and route of the design, employing a nine-layer 90-nm CMOS process with standard cells and operating conditions of 25°C and 1.2 V. Compared with a conventional implementation of the T-MM algorithm, a CNBMP decoder requires less area in a layered schedule because of the reduction of storage resources in the check node. In addition, the clock frequency is increased owing to the reduction of the wiring congestion and of the core area in general.
In addition, we eliminate some pipeline stages in the decoder thanks to the reduction in the complexity of the check-node processor, and hence, the critical path is also reduced. These facts contribute to increasing the overall throughput of the decoder.
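The bit counts given above for the check-node output, with and without CNBMP, can be evaluated with a short sketch. The function name `message_bits` is hypothetical, the index width is assumed to be log2(d_c) rounded up to an integer number of bits, and Q_b = 6 is an illustrative quantization that is not taken from the text.

```python
def message_bits(d_c, q, Q_b):
    """Return (bits without CNBMP, bits with CNBMP) for one check node output.

    Assumption: each edge index in C_m uses ceil(log2(d_c)) bits.
    """
    index_bits = (d_c - 1).bit_length()  # integer form of ceil(log2(d_c))
    uncompressed = d_c * q * Q_b
    compressed = 3 * q * Q_b + 2 * q * index_bits
    return uncompressed, compressed


# Code under test in the text: GF(32), d_c = 27. Q_b = 6 is assumed.
uncompressed, compressed = message_bits(d_c=27, q=32, Q_b=6)
ratio = uncompressed / compressed
```

Under these assumptions the output shrinks from 5184 to 896 bits, a factor of about 5.8. The same count applies to the wires between processors and to the width of each of the M stored words.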
If we compare this work with the most efficient architectures found in [7] and [8], we can see that the maximum frequency is increased by 50% and 26%, respectively, due to the reduction of the routing congestion. On the other hand, the area is about 43% larger than the decoder from [7] and three times smaller than the one in [8]. After applying CNBMP, the area of storage resources (RAMs and registers) is reduced from 80% (2.2 × 10^6 NAND gates) of the total area in [8] to 50% (0.62 × 10^6 NAND gates). Regarding throughput, the CNBMP proposal is 1.48 times faster than the T-MM decoder in [8] and 14.8 times faster than the min-max decoder from [7]. In terms of throughput/area efficiency, the decoder with CNBMP is 3.9 times more efficient than those in [7] and [8]. For the gate count, we consider that one bit of RAM is equivalent to 1.5 NAND gates and one register to 4.5 NAND gates.
Finally, if we compare CNBMP with the binary LDPC decoder from [9], which has a gate count of 3.4 million equivalent NAND gates and a throughput of 45.42 Gb/s for a code with a similar rate and half the codeword length in bits [(2048, 1723) LDPC code], CNBMP has 2.72 times fewer gates and reaches 17.46 times lower throughput. Therefore, in terms of throughput/area efficiency, our nonbinary decoder is 6.32 times less efficient than the binary one. Although it does not reach the efficiency of a binary decoder, CNBMP reduces the difference to less than q, which is a good step forward compared with solutions like the one in [8], which has 2 × q times lower efficiency.
V. CONCLUSION
In this brief, a new message-passing definition is proposed for NB-LDPC decoders. This method reduces the number of messages exchanged between the check node and the variable node, simplifying the routing of the derived hardware architectures and saving a large share of the storage resources. Moreover, the new message passing does not modify the processing of the information at the decoder, keeping the same error correction performance as the original message passing.
