An ultra-high throughput low-density parity check (LDPC) decoder with an unrolled full-parallel architecture is proposed, which achieves the highest decoding throughput compared to previously reported LDPC decoders in the literature. The decoder benefits from a serial message-transfer approach between the decoding stages to alleviate the well-known routing congestion problem in parallel LDPC decoders. Furthermore, a finite-alphabet message passing algorithm is employed to replace the variable node update rule of standard min-sum decoders with optimized look-up tables, designed in a way that maximizes the mutual information between decoding messages. The proposed algorithm results in an architecture with reduced bit-width messages, leading to a significantly higher decoding throughput and to a lower area compared to a min-sum decoder when serial message-transfer is used. The architecture is placed and routed for the standard min-sum reference decoder and for the proposed finite-alphabet decoder using a custom pseudo-hierarchical backend design strategy to further alleviate routing congestion and to handle the large design. Post-layout results show that the finite-alphabet decoder with the serial message-transfer architecture achieves a throughput as large as 594 Gbps with an area of 16.2 mm² and dissipates an average energy of 22.7 pJ per decoded bit in a 28 nm FD-SOI library. Compared to the reference min-sum decoder, this corresponds to 3.3 times smaller area and 2 times better energy efficiency.
I. INTRODUCTION
Low-density parity check (LDPC) codes have become the coding scheme of choice in high data-rate communication systems after their re-discovery in the 1990s [1], due to their excellent error-correcting performance along with the availability of efficient high-throughput hardware implementations in modern CMOS technologies. LDPC codes are commonly decoded using iterative message passing (MP) algorithms in which the initial estimates of the bits are improved by a continuous exchange of messages between decoder computation nodes. Among the various MP decoding algorithms, the min-sum (MS) decoding algorithm [2] and its variants (e.g., offset MS, scaled MS) are the most common choices for hardware implementation. LDPC decoder hardware implementations traditionally start from one of these established algorithms (e.g., MS decoding), where the exchanged messages represent log-likelihood ratios (LLRs). These LLRs are encoded as fixed-point numbers in two's-complement or sign-magnitude representation, using a small number of uniform quantization levels, in order to realize the message update rules with low-complexity conventional arithmetic operations.
Recently, there has been significant interest in the design of finite-alphabet decoders for LDPC codes [3]-[7]. The main idea behind finite-alphabet LDPC decoders is to start from one or multiple arbitrary message alphabets, which can be encoded with a bit-width that is acceptable from an implementation complexity perspective. The message update rules are then crafted as generic mapping functions to operate on this alphabet. The main advantage of such finite-alphabet decoders is that the message bit-width can be reduced significantly with respect to a conventional decoder, while maintaining the same error-correcting performance [5], [6]. The downside of this approach is that the message update rules of finite-alphabet decoders usually cannot be described using fast and area-efficient standard arithmetic circuits.
Different hardware architectures for LDPC decoders have been proposed in the literature in order to fulfill the power and throughput requirements of various standards. More specifically, varying degrees of resource sharing result in flexible decoders with low area requirements. On the one hand, partial-parallel LDPC decoders [8], [9] and block-parallel LDPC decoders [10], [11] are designed for medium throughput, with modest silicon area. Full-parallel [12], [13] and unrolled LDPC decoders [5], [14], on the other hand, achieve very high throughput (on the order of several tens or hundreds of Gbps) at the expense of large area requirements. Even though, in principle, LDPC decoders are massively parallelizable, the implementation of ultra-high speed LDPC decoders still remains a challenge, especially for long LDPC codes with large node degrees [15], mainly due to severe routing problems.
Contributions: In this paper, we propose an unrolled full-parallel architecture based on serial transfer of the messages to enable an ultra-high throughput implementation of LDPC decoders for codes with large node degrees. Moreover, we employ a finite-alphabet LDPC decoding algorithm in order to decrease the required quantization bit-width, and thus to increase the throughput, which is limited by the serial message transfer. We also adopt an efficient pseudo-hierarchical flow for the physical implementation of the proposed decoder. To the best of our knowledge, by combining the aforementioned techniques, we present the fastest fully placed and routed LDPC decoder in the literature.
Outline: The remainder of this paper is organized as follows: Section II gives an introduction to decoding of LDPC codes, as well as more details on existing high-throughput implementations of LDPC decoders. Section III describes our proposed ultra-high throughput decoder architecture that employs a serial message-transfer technique. In Section IV, our algorithm to design a finite-alphabet decoder with non-uniform quantization is explained and applied to the serial message-transfer decoder of Section III. Section V describes our proposed approach for the physical implementation and the timing and area optimization of serial message-transfer decoders. Finally, Section VI analyzes the implementation results, and Section VII concludes the paper.
II. BACKGROUND

A. LDPC Codes and Decoding Algorithms
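A binary LDPC code is a set of codewords which are defined through an $M \times N$ binary-valued sparse parity check matrix $\mathbf{H}$ as:

$$\{\mathbf{c} \in \{0,1\}^N \mid \mathbf{H}\mathbf{c} = \mathbf{0}\}, \tag{1}$$

where all operations are performed modulo 2. The code can be represented by a Tanner graph with $N$ variable nodes (VNs) and $M$ check nodes (CNs), in which VN $n$ is connected to CN $m$ if and only if $H_{mn} = 1$; $d_v$ and $d_c$ denote the VN and CN degrees, respectively.

LDPC codes are traditionally decoded using MP algorithms, where information is exchanged as messages between the VNs and the CNs over the course of several decoding iterations. At each iteration the message from VN $n$ to CN $m$ is computed using a mapping $\Phi_v : \mathbb{R}^{d_v} \to \mathbb{R}$, which is defined as:

$$\mu_{n \to m} = \Phi_v\left(L_n, \bar{\boldsymbol{\mu}}_{\mathcal{N}(n) \setminus m \to n}\right), \tag{2}$$

where $\mathcal{N}(n)$ denotes the neighbors of node $n$ in the Tanner graph, $\bar{\boldsymbol{\mu}}_{\mathcal{N}(n) \setminus m \to n} \in \mathbb{R}^{d_v - 1}$ is a vector that contains the incoming messages from all neighboring CNs except $m$, and $L_n \in \mathbb{R}$ denotes the channel LLR corresponding to VN $n$. Similarly, the CN-to-VN messages are computed using a mapping $\Phi_c : \mathbb{R}^{d_c - 1} \to \mathbb{R}$, which is defined as:

$$\bar{\mu}_{m \to n} = \Phi_c\left(\boldsymbol{\mu}_{\mathcal{N}(m) \setminus n \to m}\right). \tag{3}$$

After the last iteration, a decision mapping $\Phi_d$ estimates each codeword bit from the channel LLR and all incoming CN messages:

$$\hat{c}_n = \Phi_d\left(L_n, \bar{\boldsymbol{\mu}}_{\mathcal{N}(n) \to n}\right). \tag{4}$$

Messages are exchanged until a valid codeword has been decoded or until the maximum number of iterations $I$ has been reached. For the widely used MS algorithm, the mappings (2) and (3) are

$$\Phi_v^{\mathrm{MS}}(L, \bar{\boldsymbol{\mu}}) = L + \sum_i \bar{\mu}_i \tag{5}$$

and

$$\Phi_c^{\mathrm{MS}}(\boldsymbol{\mu}) = \operatorname{sign} \boldsymbol{\mu} \cdot \min |\boldsymbol{\mu}|, \tag{6}$$

where $\min |\boldsymbol{\mu}|$ denotes the minimum of the absolute values of the vector components and $\operatorname{sign} \boldsymbol{\mu} = \prod_j \operatorname{sign} \mu_j$. The decision mapping $\Phi_d$ is defined as:

$$\Phi_d^{\mathrm{MS}}(L, \bar{\boldsymbol{\mu}}) = \frac{1}{2}\left(1 - \operatorname{sign}\left(L + \sum_i \bar{\mu}_i\right)\right). \tag{7}$$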
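To make the message flow concrete, the following is a minimal floating-point sketch of flooding MS decoding as described above: the VN adds the channel LLR to all incoming CN messages except the one on the outgoing edge, the CN combines the sign product with the minimum magnitude, and a hard decision follows each iteration. The function name `min_sum_decode`, the dictionary-based message storage, and the small (7, 4) Hamming parity check matrix in the usage lines are illustrative assumptions and are not taken from the decoder described in this paper.

```python
from math import prod

def sign(x):
    """Return -1, 0, or +1."""
    return (x > 0) - (x < 0)

def min_sum_decode(H, llr, max_iter=5):
    """Flooding min-sum decoding; returns (hard_decisions, is_valid_codeword)."""
    M, N = len(H), len(llr)
    rows = [[n for n in range(N) if H[m][n]] for m in range(M)]  # neighbors of CN m
    cols = [[m for m in range(M) if H[m][n]] for n in range(N)]  # neighbors of VN n
    c2v = {(m, n): 0.0 for m in range(M) for n in rows[m]}       # CN-to-VN messages
    bits = [int(l < 0) for l in llr]
    for _ in range(max_iter):
        # VN update: channel LLR plus all incoming messages except the one from CN m
        v2c = {}
        for n in range(N):
            total = llr[n] + sum(c2v[m, n] for m in cols[n])
            for m in cols[n]:
                v2c[m, n] = total - c2v[m, n]
        # CN update: sign product times minimum magnitude over all other inputs
        for m in range(M):
            for n in rows[m]:
                others = [v2c[m, k] for k in rows[m] if k != n]
                c2v[m, n] = prod(sign(x) for x in others) * min(abs(x) for x in others)
        # Decision: a negative total LLR maps to bit 1
        bits = [int(llr[n] + sum(c2v[m, n] for m in cols[n]) < 0) for n in range(N)]
        # Stop early once every parity check is satisfied
        if all(sum(bits[n] for n in rows[m]) % 2 == 0 for m in range(M)):
            return bits, True
    return bits, False

# Illustrative usage: (7, 4) Hamming code, all-zero codeword, one unreliable position.
H_example = [[1, 0, 1, 0, 1, 0, 1],
             [0, 1, 1, 0, 0, 1, 1],
             [0, 0, 0, 1, 1, 1, 1]]
decoded, valid = min_sum_decode(H_example, [2.0, 2.0, -0.5, 2.0, 2.0, 2.0, 2.0])
print(decoded, valid)
```

The sketch recomputes each CN output from scratch for readability; a hardware CN would share work across outputs, as discussed in Section III.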
B. High Throughput LDPC Decoders
Several high throughput LDPC decoders have been developed during the past decade in order to satisfy the high data rate requirements of optical and high-speed Ethernet networks. These decoders usually rely on a full-parallel isomorphic [16] architecture and a flooding schedule, which directly maps the algorithm for one iteration to hardware. More specifically, the CN and VN update equations are directly mapped to $M$ CN and $N$ VN processing units, and a hard-wired routing network is responsible for passing the messages between them. From an implementation perspective, while such an architecture enables a very high throughput by fully exploiting the inherent parallelism of each iteration, the complexity of the highly unstructured routing network turns out to be a severe bottleneck. In addition to this routing problem, such full-parallel decoders usually require one or two clock cycles for each iteration and, in the worst case, have to run the maximum number of iterations for each codeword, which is another throughput-limiting factor.
Several solutions have been proposed to alleviate the routing problem in full-parallel decoders on both an architectural and an algorithmic level. The authors of [17] , [18] suggest using a bit-serial architecture, which only requires a single wire for each variable-to-check and check-to-variable connection. While this approach can reduce the routing congestion, it also leads to a significant reduction in the decoding throughput. The decoder in [18] , for example, only achieves a throughput of 3.3 Gbps when implemented using a 130 nm CMOS technology. Another architectural technique is reported in [13] , where the long wires of the decoder are partitioned into several short wires with pipeline registers. As a result, the critical path is broken down into shorter paths, but the decoder throughput is also affected since more cycles are required to accomplish each iteration. Nevertheless, the decoder of [13] is still able to achieve 13.2 Gbps in 90 nm CMOS with 16 iterations.
On an algorithmic level, the authors of [19] propose an MP algorithm, called MS split-row threshold, which uses a column-wise division of the $\mathbf{H}$ matrix into $S_{pn}$ partitions. Each partition contains $N/S_{pn}$ VNs and $M$ CNs, and global interconnects are minimized by only sharing the minimum signs between the CNs of each partition. This algorithmic modification was used to implement a full-parallel decoder for the challenging (2048, 1723) LDPC code in 65 nm CMOS, which achieves a throughput of 36.2 Gbps with 11 iterations. Another decoder, reported in [20], uses a hybrid hard/soft decoding algorithm, called differential binary MP algorithm, which reduces the interconnect complexity at the cost of some error-correcting performance degradation. A full-parallel (2048, 1723) LDPC decoder using this algorithm was implemented in 65 nm CMOS technology, achieving up to 126 Gbps [21].
To solve the problem of throughput limitations in full-parallel decoders from potentially using multiple iterations for decoding, the work of [14] presents an unrolled full-parallel LDPC decoder. In that architecture, each decoding iteration is mapped to distinct hardware resources, leading to a decoder with $I$ iterations that can decode one codeword per clock cycle, at the cost of significantly increased area requirements with respect to non-unrolled full-parallel decoders. This unrolled architecture achieves a throughput of 161 Gbps for a (672, 546) LDPC code with $d_v = 3$ and $d_c = 6$, when implemented in a 65 nm CMOS technology. It is notable that an unrolled decoder has 50% fewer wires between adjacent stages compared to a non-unrolled decoder, since one stage of variable nodes is connected to one stage of check nodes with uni-directional data flow per decoding iteration. Even though this measure leads to a lower routing congestion, it is still challenging to fully place and route such a decoder. This routing issue becomes more and more severe when considering longer LDPC codes, and especially with the increasing CN and VN degrees needed to achieve better error-correcting performance, as required in wireline applications such as for the (2048, 1723) code with $d_v = 6$ and $d_c = 32$ used in the IEEE 802.3an standard [15].
III. SERIAL MESSAGE-TRANSFER LDPC DECODER
Unrolled full-parallel LDPC decoders provide the ultimate throughput with smaller routing congestion than full-parallel decoders. However, they are still not trivial to implement for long LDPC codes with high CN and VN degrees, which suffer from severe routing congestion. Hence, in this section, we propose an unrolled full-parallel LDPC decoder architecture that employs a serial message-transfer technique between CNs and VNs. This architecture is similar to the bit-serial implementations of [17] , [18] in the way the messages are transferred; however, as we shall see later, it differs in the fact that it is unrolled and in the way the messages are processed in the CNs and VNs.
A. Decoder Architecture Overview
An overview of the proposed unrolled serial message-transfer LDPC decoder architecture is shown in Fig. 2. As with all unrolled LDPC decoders, each decoding iteration is mapped to a distinct set of $N$ VN and $M$ CN units, which form a processing pipeline. In essence, the unrolled LDPC decoder is a systolic array, in which a new set of $N$ channel LLRs is read in every clock cycle and a decoded codeword is put out every clock cycle.
Even though both the CNs and VNs can compute their outgoing messages in a single clock cycle, in the proposed serial message-transfer architecture each message is transferred one bit at a time between the consecutive variable node and check node stages over the course of $Q_{\mathrm{msg}}$ clock cycles, where $Q_{\mathrm{msg}}$ is the number of bits used for the messages. More specifically, each CN and VN unit contains a serial-to-parallel (S/P) and parallel-to-serial (P/S) conversion unit at the input and output, respectively, which are clocked $Q_{\mathrm{msg}}$ times faster than the processing clock to collect and transfer messages serially, while keeping the overall decoding throughput constant. More details on the architecture of the CN and VN units as well as our serial message-transfer mechanism are provided in the sequel.
B. Decoder Stages
The unrolled LDPC decoder, illustrated in Fig. 2 , consists of three types of processing stages, which are described in more detail below. We note that the CN/VN processors of this reference decoder are similar to those of a standard MS decoder, and our modifications for these parts are discussed in Section IV.
1) Check Node Stage: Each check node stage consists of $M$ CN units, each of which contains three components: a CN processor, $d_c$ S/P units for the $d_c$ input messages, and $d_c$ P/S units for the $d_c$ output messages. Moreover, the complete check node stage contains a register bank that is used to store the channel LLRs, which are not directly needed by the check node stage, but nevertheless must be forwarded to the following variable node stage, and thus need to be buffered. Hence, no S/P and P/S units are required for the channel LLR buffers in the check node stages as they are still read and put out serially.
The main task of each CN processor is to identify the two smallest absolute values among all $d_c$ input messages, which are used as the minima in order to efficiently calculate (6) for all outputs. Therefore, the CN processor is essentially composed of a pruned sorting tree and an output unit that selects the first or the second minimum, along with the appropriate sign, for each output. The sorting unit consists of 4-input, 2-output compare-and-select (CAS) units that form a pruned sorter (tree) of depth $\log_2(d_c/2)$.

2) Variable Node Stage: The VN processor implements the update rule (5) by adding up all $d_v$ incoming LLR messages and the channel LLR using an adder tree. An output unit subtracts each input from the aforementioned sum in order to efficiently compute all output messages.
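The two shortcuts described above can be illustrated with a short behavioral sketch: the CN derives every output from the two smallest input magnitudes and the overall sign parity, and the VN forms one total and subtracts each input from it. This is only a functional model under stated assumptions, not the CAS tree or adder tree of the actual hardware; the function names are hypothetical, and a zero input is treated as positive, as it would be in a sign-magnitude format.

```python
def cn_update_two_min(msgs):
    """All CN outputs from the two smallest magnitudes and the sign parity."""
    mags = [abs(x) for x in msgs]
    min1_idx = min(range(len(msgs)), key=mags.__getitem__)   # position of the smallest magnitude
    min1 = mags[min1_idx]
    min2 = min(m for i, m in enumerate(mags) if i != min1_idx)  # second smallest magnitude
    neg_count = sum(x < 0 for x in msgs)                     # number of negative inputs
    out = []
    for i, x in enumerate(msgs):
        # The output on edge i excludes input i: use min2 only where min1 came from edge i
        mag = min2 if i == min1_idx else min1
        # Sign of the product of all other inputs: drop the own sign from the parity
        negative = (neg_count - (x < 0)) % 2 == 1
        out.append(-mag if negative else mag)
    return out

def vn_update_sum_minus_input(channel_llr, msgs):
    """All VN outputs from one shared sum, subtracting each input once."""
    total = channel_llr + sum(msgs)
    return [total - x for x in msgs]

# Illustrative usage with arbitrary integer messages.
print(cn_update_two_min([3, -1, 4, -2, 5, 2]))
print(vn_update_sum_minus_input(1, [2, -3, 4]))
```

Both functions need only a single pass over the inputs plus a selection per output, which is the property the hardware exploits to avoid recomputing a separate minimum or sum for every edge.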
3) Decision Node Stage: The last variable node stage is called a decision node stage because it is responsible for taking the final hard decisions on the decoded codeword bits. The structure of this stage is similar to a variable node stage, but a decision node (DN) is a simpler version of the VN processor in the sense that after computing the sum of all inputs, its sign is output as the decision bit and no subtraction is required. Moreover, no P/S units are required since the decision node stage is the last stage in the decoder pipeline and the output is only one bit per DN unit.
C. Message Transfer Mechanism
One of the modifications, compared to [14], is the serial transfer of the channel and message LLRs between the stages of the decoder, which reduces the required routing resources by a factor of $Q_{\mathrm{msg}}$. This modification is applied to make the placement and routing of the decoder more feasible, especially for large values of $d_v$ and $d_c$. To this end, as explained in the previous section, an S/P and a P/S shift register are added to each input and each output of the CN and VN units, as illustrated in Fig. 3. We see that the S/P unit consists of a $(Q_{\mathrm{msg}} - 1)$-bit shift register and $Q_{\mathrm{msg}}$ memory registers, while the P/S unit has $Q_{\mathrm{msg}}$ registers with multiplexed inputs. The serial messages are transferred with a clock, denoted by $\mathrm{CLK}_F$, that is $Q_{\mathrm{msg}}$ times faster than the slow processing clock, denoted by $\mathrm{CLK}_S$. More specifically, at each CN unit and VN unit input, data is loaded serially into the S/P shift register using the fast $\mathrm{CLK}_F$, and after the $Q_{\mathrm{msg}}$-th cycle all message bits are stored in the memory registers, clocked by the slow $\mathrm{CLK}_S$. The CN/VN processing can then be performed in one $\mathrm{CLK}_S$ cycle, and the output messages are saved in the output P/S shift registers and transferred serially to the next stage, again using $\mathrm{CLK}_F$. At the same time, a new set of messages is serially loaded into the input shift registers. We note that for simpler clock tree generation, all registers in Fig. 3 are clocked by $\mathrm{CLK}_F$, while $\mathrm{CLK}_S$ is actually implemented as a pulsed clock, which is generated using a clock-gating cell controlled by a finite state machine.
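The transfer mechanism can be mimicked with a small cycle-level model: a P/S register is loaded in parallel and shifted out one bit per fast clock cycle, while the S/P side collects the bits in a shift register that is one bit shorter than the word and captures the complete word on the last fast cycle. The class names, the MSB-first bit order, and the `capture` flag standing in for the pulsed slow clock are assumptions made for this sketch.

```python
class ParallelToSerial:
    """Q registers with multiplexed inputs: parallel load, then one bit out per fast cycle."""
    def __init__(self, q):
        self.q = q
        self.reg = [0] * q

    def load(self, word):
        # MSB first (assumed bit order)
        self.reg = [(word >> (self.q - 1 - i)) & 1 for i in range(self.q)]

    def shift_out(self):
        bit = self.reg[0]
        self.reg = self.reg[1:] + [0]
        return bit


class SerialToParallel:
    """(Q - 1)-bit shift register plus Q memory registers holding the last complete word."""
    def __init__(self, q):
        self.q = q
        self.shift = [0] * (q - 1)
        self.memory = 0

    def clock_fast(self, bit, capture):
        if capture:
            # On the Q-th fast cycle the incoming bit completes the word
            word = 0
            for b in self.shift + [bit]:
                word = (word << 1) | b
            self.memory = word
        self.shift = (self.shift + [bit])[1:]


def transfer(word, q):
    """Move one Q-bit word over a single wire in Q fast clock cycles."""
    ps, sp = ParallelToSerial(q), SerialToParallel(q)
    ps.load(word)
    for cycle in range(q):
        sp.clock_fast(ps.shift_out(), capture=(cycle == q - 1))
    return sp.memory


# Illustrative usage: a 5-bit word survives the serial link unchanged.
print(transfer(0b10110, 5) == 0b10110)
```

In this model one link needs $(Q - 1) + Q$ registers on the receiving side and $Q$ on the sending side, matching the register counts given in the text for the S/P and P/S units.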
D. Decoder Complexity Analysis
In this section, we analyze the hardware complexity, the decoding latency, and the throughput of the proposed decoder. These quantities guide the optimization measures described in the subsequent sections.
1) Hardware Complexity and Memory: Each CN and VN unit of the block diagram shown in Fig. 2 consists of the processing part (logic) and the register part (memory). While the logic complexity is estimated by synthesis results, which will be discussed in Section VI, the memory complexity can easily be characterized by counting the number of required registers.
From Fig. 2 , each CN unit has d c S/P units and d c P/S units and a set of registers for the channel LLRs. This adds
Fig. 3: The message receive and transfer mechanism using S/P and P/S shift registers for $Q_{\mathrm{msg}} = 5$.
Since the control unit registers required for each CN/VN unit are negligible compared to the registers in the data path, and given that $N d_v = M d_c$ by construction of the LDPC code, the total number of registers required by the serial message-transfer decoder is
where $I$ is the number of decoding iterations. We note that if the quantization bit-widths for the channel LLRs and the message LLRs are the same, i.e., $Q_{\mathrm{msg}} = Q_{\mathrm{ch}} = Q$, which is often the case for MS LDPC decoders, (8) can be simplified to:
From (9), one can easily see that the quantization bit-width linearly increases the memory requirement for the proposed architecture.
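2) Decoding Latency: Since each stage has a delay of two $\mathrm{CLK}_S$ cycles and there are two stages for each decoding iteration, the decoder latency is $4I$ $\mathrm{CLK}_S$ cycles or, equivalently, $4I / f_{\mathrm{CLK}_S}$ seconds, where $f_{\mathrm{CLK}_S}$ is the $\mathrm{CLK}_S$ frequency.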
3) Decoding Throughput: In the proposed unrolled architecture, one decoded codeword is output in each $\mathrm{CLK}_S$ cycle. Therefore, the decoder throughput is

$$\Theta = N f_{\max},$$

where $f_{\max}$ is the maximum frequency for $\mathrm{CLK}_S$. For the proposed decoder

$$f_{\max} = \frac{1}{\max\left(T_{\mathrm{CP,VN}},\ T_{\mathrm{CP,CN}},\ Q_{\max} T_{\mathrm{CP,route}}\right)}, \qquad Q_{\max} = \max\left(Q_{\mathrm{msg}}, Q_{\mathrm{ch}}\right), \tag{11}$$

where $T_{\mathrm{CP,VN}}$ and $T_{\mathrm{CP,CN}}$ are the lengths of the critical paths of the VN unit and the CN unit, respectively, and $T_{\mathrm{CP,route}}$ is the critical path of the (serial) routing between the decoding stages. Thus, the decoder throughput will be limited by the routing if the VN/CN delay is smaller than $Q_{\max}$ times the routing delay. Hence, on the one hand, the serial message-transfer decoder alleviates the routing problem by reducing the required number of wires, but on the other hand, the decoder throughput for large quantization bit-widths may be affected, as the serial message-transfer delay will become the limiting factor.
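The interplay between node delay and serial routing delay can be explored with a few lines of arithmetic. The sketch below assumes that the slow clock period is the larger of the node critical paths and $Q_{\max}$ times the routing critical path, that one codeword of $N$ coded bits leaves the pipeline per slow clock cycle, and that the latency is $4I$ slow clock cycles. The function names and all delay values in the usage lines are placeholders chosen for illustration, not measured results.

```python
def clk_s_period(t_vn, t_cn, t_route, q_max):
    """Minimum slow-clock period: node logic or Q_max serial transfer cycles, whichever is longer."""
    return max(t_vn, t_cn, q_max * t_route)

def throughput_bps(n_bits, t_vn, t_cn, t_route, q_max):
    """Coded throughput, assuming one codeword of n_bits per slow-clock cycle."""
    return n_bits / clk_s_period(t_vn, t_cn, t_route, q_max)

def latency_seconds(iterations, t_vn, t_cn, t_route, q_max):
    """Two stages per iteration and two slow-clock cycles per stage."""
    return 4 * iterations * clk_s_period(t_vn, t_cn, t_route, q_max)

# Placeholder delays in seconds (hypothetical values).
T_VN, T_CN, T_ROUTE = 3.0e-9, 2.5e-9, 1.5e-9
for q in (5, 3, 1):
    period = clk_s_period(T_VN, T_CN, T_ROUTE, q)
    limiter = "routing" if q * T_ROUTE >= max(T_VN, T_CN) else "node logic"
    print(q, period, throughput_bps(2048, T_VN, T_CN, T_ROUTE, q), limiter)
```

With these placeholder numbers, reducing the bit-width shortens the slow clock period only until the node logic becomes the longest path, which is why the node delay has to be re-checked whenever the message bit-width is reduced.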
IV. FINITE-ALPHABET SERIAL MESSAGE-TRANSFER LDPC DECODER
Even though the serial message-transfer architecture alleviates the routing congestion of an unrolled full-parallel LDPC decoder, it has a negative impact on both throughput and hardware complexity (more specifically, on the memory requirement, which is determined by the number of message bits), as discussed in the previous section. In our previous research [5], we have shown that finite-alphabet decoders have the potential to reduce the required number of bits while maintaining the same error rate performance. In this section, we review the basic idea and our design method for this new type of decoder and then show how the technique can be used in order to increase the throughput and reduce the area of the serial message-transfer architecture.
A. Mutual Information Based Finite-Alphabet Decoder
In order to design a finite-alphabet LDPC decoder, we employ the approach of [5]- [7] . In this approach, custom finite-alphabet update rules that can be implemented as lookup tables (LUTs) are designed in a way that maximizes the mutual information between the VN output messages and the corresponding codeword bit.
Unfortunately, the complexity of the LUTs grows, in general, exponentially with the node degree. While [7] used monotonic LUTs for both VNs and CNs, the work of [5], [6] solved the problem for the (generally more complex) CNs by using the standard MS update rule $\Phi_c(\boldsymbol{\mu}) = \Phi_c^{\mathrm{MS}}(\boldsymbol{\mu})$ for them. However, $\Phi_v\left(L_n, \bar{\boldsymbol{\mu}}_{\mathcal{N}(n) \setminus m \to n}\right)$ and $\Phi_d\left(L_n, \bar{\boldsymbol{\mu}}_{\mathcal{N}(n) \to n}\right)$ are designed as LUTs that preserve the ordering of the output messages in a sign-magnitude interpretation of their binary labels. The CNs can then be implemented efficiently using standard arithmetic circuits. We use this approach in the following.
Additionally, to reduce the complexity of the VN LUTs for our VN degree, we adopt the proposal of [5] and [6] to use a tree of small LUTs as a more general form of the linked list of LUTs proposed in [7]. We note that the tree-based approach requires more LUTs than the linked-list approach, but it is also more parallelizable, which reduces the longest path delay. Moreover, both approaches reduce the complexity of the LUT-based VN from exponential to linear in the VN degree, with only a small negative impact on the error-correcting performance of the decoder.
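The exponential-versus-linear statement can be checked by counting table entries. The sketch below compares one monolithic LUT addressed by the channel LLR and all $d_v - 1$ incoming messages with a tree of two-input LUTs. It assumes that every intermediate value in the tree is again $Q_{\mathrm{msg}}$ bits wide and that exactly one tree node takes the channel LLR as a direct input; both are simplifying assumptions for illustration, and the function names are hypothetical.

```python
def single_lut_entries(d_v, q_msg, q_ch):
    """Entries of one table addressed by the channel LLR and d_v - 1 messages."""
    return 2 ** (q_ch + (d_v - 1) * q_msg)

def lut_tree_entries(d_v, q_msg, q_ch):
    """Total entries of a tree of two-input LUTs for one VN output.

    A tree with d_v leaves (d_v - 1 messages plus the channel LLR) has
    d_v - 1 two-input nodes; one of them sees the Q_ch-bit channel LLR.
    """
    if d_v < 2:
        raise ValueError("need at least one message plus the channel LLR")
    message_only_nodes = d_v - 2
    return message_only_nodes * 2 ** (2 * q_msg) + 2 ** (q_msg + q_ch)

# Illustrative usage with d_v = 6, Q_msg = 3, Q_ch = 4.
print(single_lut_entries(6, 3, 4), lut_tree_entries(6, 3, 4))
```

Under these assumptions the monolithic table grows by a factor of $2^{Q_{\mathrm{msg}}}$ for every additional input, whereas the tree grows by one small table per additional input.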
B. Error-Correcting Performance and Bit-Width Reduction
In Fig. 4 , we compare the performance of the IEEE 802.3an LDPC code under floating-point MS decoding, fixed-point MS decoding (with Q ch = Q msg ∈ {4, 5}), and LUT-based decoding (with Q ch = 4 and Q msg = 3) when performing I = 5 decoding iterations with a flooding schedule. We observe that the fixed-point decoder with Q ch = Q msg = 5 has almost the same performance as the floating-point decoder, while the fixed-point decoder with Q ch = Q msg = 4 shows a significant loss with respect to the floating-point implementation. Thus, a standard MS decoder would need to use at least Q ch = Q msg = 5 quantization bits. The LUT-based decoder, however, can match the performance of the floatingpoint decoder with only Q ch = 4 channel quantization bits and Q msg = 3 message quantization bits. 2 
C. LUT-Based Decoder Hardware Architecture
The LUT-based serial message-transfer decoder hardware architecture is very similar to the MS decoder architecture described in Section III. However, the LUT-based decoder can take advantage of the significantly fewer message bits that need to be transferred from one decoding stage to the next. This reduces the number of $\mathrm{CLK}_F$ cycles per iteration, which in turn increases the throughput of the decoder according to (11), provided that the CN/VN logic is sufficiently fast. Moreover, the size of the buffers needed for the S/P and P/S conversions is also reduced significantly, which directly reduces the memory complexity of the decoder (see (9)).
On the negative side, we remark that the VN units for each variable node stage (decoder iteration) of the LUT-based decoder are different, which slightly complicates the hierarchical physical implementation, as we will see later. Furthermore, since $Q_{\mathrm{ch}} > Q_{\mathrm{msg}}$, we now need to transfer the channel LLRs with multiple (two) bits per cycle; otherwise, the number of $\mathrm{CLK}_F$ cycles per iteration would be dictated by $Q_{\mathrm{ch}}$ rather than by the smaller $Q_{\mathrm{msg}}$. To reflect this modification, we redefine $Q_{\max}$ in (11) as $Q_{\max} = \max\left(Q_{\mathrm{msg}}, Q_{\mathrm{ch}}/2\right)$. While this partially parallel transfer of the channel LLRs impacts routing congestion, we note that the overhead is negligible since the number of channel LLRs is small compared to the number of messages.
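The effect of sending the channel LLRs over more than one wire can be seen by counting fast clock cycles per transfer. The sketch below generalizes the redefinition above to an arbitrary number of channel LLR bits per cycle and rounds up when the channel bit-width is not a multiple of it; the rounding and the function name are assumptions added for this illustration.

```python
from math import ceil

def fast_cycles_per_transfer(q_msg, q_ch, ch_bits_per_cycle=1):
    """Fast-clock cycles needed to move one message and one channel LLR in parallel."""
    return max(q_msg, ceil(q_ch / ch_bits_per_cycle))

# MS decoder: equal bit-widths, fully serial channel LLR transfer.
print(fast_cycles_per_transfer(5, 5))
# LUT-based decoder with a fully serial channel LLR: the channel LLR dictates the count.
print(fast_cycles_per_transfer(3, 4))
# LUT-based decoder with two channel LLR bits per cycle: the message bit-width dictates it.
print(fast_cycles_per_transfer(3, 4, ch_bits_per_cycle=2))
```

The last two calls differ by one fast clock cycle per transfer, which is the saving that motivates the partially parallel channel LLR transfer.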
V. IMPLEMENTATION
Despite the use of a serial message-transfer, the physical implementation of the decoders, proposed in the previous sections, requires special scrutiny since the number of global wires is still significant and the area is large. Therefore, in this section, we propose a pseudo-hierarchical design methodology to implement the serial message-transfer architecture.
A. Physical Design Flow
Due to the large number of identical blocks in the decoder architecture, a bottom-up flow is expected to provide the best results. The CN, VN, and DN units are first placed and routed individually to build hard macros, and their timing and physical information are extracted. These macros are then instantiated as large cells in the decoder top level. We propose to treat the macros as custom standard-cells with identical height so that the placement can be performed using the standard-cell placement, rather than the less capable macro placement, of the backend tool, since in our case the number of hard macro instances is extremely large and the interconnect pattern is complex and highly irregular. We also note that for the LUT-based decoder there are different macros for each variable node stage, as discussed in Section IV, while for the MS decoder an identical VN macro is used for all of the variable node stages.

Fig. 5 illustrates the proposed physical floorplan for the decoders with the unrolled architecture. In this floorplan, the CN and VN macros within each stage are constrained to be placed into dedicated regions. This measure enforces the high-level structure of the pipeline, but leaves freedom to the placement tool to choose the location for each macro in the stage to minimize routing congestion between stages. Furthermore, the CN and VN macros are placed in dedicated rows, while the area between these rows is left for repeaters and for the register standard-cells for the channel LLRs in the check node stages. We note that the proposed floorplan provides the flexibility to exploit the automated placement algorithms for both custom and conventional standard-cells in order to alleviate the still significant routing congestion.
B. Timing and Area Optimization Flow
Although the synthesis results can give an approximate evaluation of the timing and area of the physical implementation, several iterations with different constraints are required to reach an optimal layout. To this end, we propose the methodology illustrated in Fig. 6 to effectively implement the serial message-transfer architecture. The main idea behind this methodology is that three main factors directly contribute to the decoder throughput and also indirectly to the decoder area, as discussed in Section III and specifically summarized in (11). Our goal is to maximize the throughput at a minimum area. We define the timing constraint applied to $\mathrm{CLK}_S$ as $T_{\mathrm{CSTR,CLKS}}$, and the timing constraint applied to $\mathrm{CLK}_F$ as $T_{\mathrm{CSTR,CLKF}}$. The first step is to place and route the CN/VN macros based on $T_{\mathrm{CSTR,CLKS}}$. This step is followed by the implementation of the decoder using $T_{\mathrm{CSTR,CLKF}}$. We note that the initial constraints for the backend are extracted from the synthesis timing results. The fully placed and routed design gives an accurate routing delay (maximum achievable $\mathrm{CLK}_F$ frequency), which is used to update $T_{\mathrm{CSTR,CLKF}}$ and then $T_{\mathrm{CSTR,CLKS}}$ according to (11). The updated $T_{\mathrm{CSTR,CLKS}}$ is then used to re-implement the CN and VN macros within the minimum achievable area.
We note that for a strong LDPC code with a large area and long routing delay (such as the one of the IEEE 802.3an standard), the first implementation starts with $T_{\mathrm{CSTR,CLKS}} < Q_{\max} T_{\mathrm{CSTR,CLKF}}$. After obtaining a realistic value for the $\mathrm{CLK}_F$ period (and hence for $T_{\mathrm{CSTR,CLKF}}$), $T_{\mathrm{CSTR,CLKS}}$ is updated to a larger value for the next iteration. Consequently, the CN and VN macro area, and thus the decoder area, shrinks in the second iteration, which results in a higher achievable $\mathrm{CLK}_F$ frequency and hence in smaller $T_{\mathrm{CSTR,CLKF}}$ and $T_{\mathrm{CSTR,CLKS}}$. The feedback loop reaches the optimum point after a few iterations.
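The feedback loop can be pictured as a fixed-point iteration between the macro timing constraint and the routing delay measured at the top level. The sketch below is a toy model only: the two callbacks stand in for complete backend runs, and the area and delay formulas in the usage lines are invented for illustration, with the sole purpose of reproducing the qualitative behavior that relaxed macro timing gives smaller macros and that a smaller layout gives shorter routing delay.

```python
def refine_constraints(q_max, t_clks_init, implement_macros, place_and_route_top,
                       max_rounds=10, rel_tol=1e-3):
    """Iterate the slow-clock constraint until it settles.

    implement_macros(t_clks) -> macro area obtained for that constraint
    place_and_route_top(area) -> achievable fast-clock period for that area
    """
    t_clks = t_clks_init
    history = []
    for _ in range(max_rounds):
        area = implement_macros(t_clks)          # macros built for the current constraint
        t_clkf = place_and_route_top(area)       # routing delay of the resulting layout
        history.append((t_clks, area, t_clkf))
        updated = q_max * t_clkf                 # slow period implied by the serial transfer
        converged = abs(updated - t_clks) <= rel_tol * t_clks
        t_clks = updated
        if converged:
            break
    return t_clks, history


# Hypothetical toy models (units: ns and mm^2), not backend data.
def toy_macro_area(t_clks):
    return 8.0 + 10.0 / t_clks       # tighter timing costs area

def toy_routing_delay(area):
    return 0.5 + 0.08 * area         # larger layout means longer wires

final_t_clks, steps = refine_constraints(5, 2.0, toy_macro_area, toy_routing_delay)
for t, a, f in steps:
    print(round(t, 3), round(a, 3), round(f, 3))
```

In this toy model the first round starts from a slow clock constraint that is tighter than the serial transfer allows, the second round relaxes it and shrinks the macros, and later rounds only make small corrections, mirroring the behavior described above.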
VI. RESULTS AND DISCUSSIONS
To study the impact of the serial message-transfer architecture and the finite-alphabet decoding scheme, we have implemented the proposed architecture by employing the methodology explained in Section V and analyzed the results for both MS and LUT-based decoding. We used the parity check matrix of the LDPC code defined in the IEEE 802.3an standard [15], with $Q_{\mathrm{msg}} = Q_{\mathrm{ch}} = 5$ for the MS decoder and $Q_{\mathrm{msg}} = 3$ and $Q_{\mathrm{ch}} = 4$ for the LUT-based decoder, to achieve the same error-correction performance, as described in Sections III and IV. The decoders were synthesized from a VHDL description using Synopsys Design Compiler and placed and routed using Cadence Encounter Digital Implementation. The layouts are shown in Fig. 7. The results are reported for a 28 nm FD-SOI library under typical operating conditions ($V_{DD} = 1$ V, $T = 25$ °C).
A. Delay Analysis
In the serial message-transfer architecture the critical path, and hence, the maximum decoding frequency are defined by (11) . To investigate the impact of serially transferring the messages on the decoder throughput, we consider the delay of the following register-to-register path groups for both the MS and LUT-based decoder.
1) CN Path: The CN path is the path from the S/P memory registers to the P/S shift registers within the CN unit. For both decoders, this path is essentially comprised of the logic cells for a sorter tree with a depth of 4.
2) VN Path: The VN path is the path from S/P memory registers to P/S shift registers within the VN unit. This path is dominated by an adder tree for the MS decoder and an LUT tree for the LUT-based decoder.
3) Interconnect Path: The interconnect path mainly comprises the routing wires (and buffers) that connect the CN/VN unit S/P shift registers to the VN/CN unit P/S shift registers.

Table I summarizes the critical path delays of the CN/VN and the interconnect path groups. Together with (11), the values in the table dictate the maximum achievable frequency for $\mathrm{CLK}_S$ and $\mathrm{CLK}_F$, respectively, for both of the decoders with the proposed serial message-transfer architecture. We note that the critical paths are reported after the timing and area constraints for the CN/VN macros and for the decoder top level are jointly optimized according to the flow shown in Fig. 6. From Table I, it can be seen that the CN processor in the LUT-based decoder is less complex than the one in the MS decoder, while the LUT-based VN processor is more complex than the adder-based one in the MS decoder. According to (11), we observe that in both decoders the message transfer limits the throughput to a $\mathrm{CLK}_S$ period of 5 × 1.51 ns = 7.55 ns and 3 × 1.16 ns = 3.48 ns for the MS and the LUT-based decoder, respectively, where 1.51 ns and 1.16 ns are the corresponding minimum $\mathrm{CLK}_F$ periods. Consequently, in our flow, the VN and CN units end up optimized for minimum area only, with relaxed and easy-to-meet timing constraints.

B. Area Analysis
Fig. 8 illustrates the area distribution of the decoders among their components after layout. The area utilization is approximately 67% for both decoders. While almost 62% of the layout is filled with CN/VN macros and registers, clock tree and routing buffers occupy around 5%. Furthermore, we see a 44% difference in total area between the decoders due to the fact that the absolute area for CN and VN macros is 14.12 mm² in the MS decoder, as opposed to only 9.56 mm² in the LUT-based decoder.
To understand this fact, we list the area of each CN/VN macro in Table II.³ According to this table, the finite-alphabet message passing algorithm leads to significantly smaller CN processors because of two factors: first, the bit-width reduction of the messages directly reduces the data-path area, and second, the quantized messages in the LUT-based decoder are processed directly in the sorter tree of the CNs without the need to compute their absolute values. However, the VN processors are less area-efficient in the LUT-based decoder than in the MS decoder, because LUT-based computations are, in general, less area-efficient than conventional arithmetic-based update rules. Thus, the logic area of the VNs in the LUT-based decoder is larger, even though their input/output bit-width is smaller. Another contributing factor in Table II is the register area, which is determined by the number of S/P and P/S registers. For those, the 40% bit-width reduction in the LUT-based decoder is directly noticeable in the register area of both the CN and VN units. Altogether, the CN and VN macros in the LUT-based decoder are 58% and 14% smaller, respectively, than those of the MS decoder.
Table footnotes: Throughput and energy per bit at Imax = 8. † Throughput and energy per bit at Imax = 11. ‡ Scaling is done by S³, where S is the dimension relative to 28 nm. § Scaling is done by 1/U², where U is the voltage relative to 1.0 V.
C. Power Analysis
The energy consumed by each decoder is proportional to the capacitance, which in turn is related to the decoder area. In addition, the number of CLK F cycles required by the serial message transfer to decode one codeword, which is inversely proportional to the decoding throughput at a constant frequency, directly contributes to the energy consumed per decoded bit. Therefore, we analyze both the average power and the energy efficiency of the decoders using post-layout vector-based power analysis.⁴ The results are reported in Table III. We note that the average power is calculated at a constant CLK F frequency, here f F = min(f max,MS, f max,LUT) = 662 MHz, for both decoders. According to Table III, the total power consumption of the LUT-based decoder is 16.2% smaller than that of the MS decoder. Furthermore, since the LUT-based decoder has 66.7% higher throughput than the MS decoder at the same CLK F frequency, the energy efficiency of the LUT-based decoder is almost 2 times better than that of the MS decoder.
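The relation between power, throughput, and energy per bit used in this comparison can be checked with a few lines of Python. This is only a sketch built from the ratios quoted in the text, not from the entries of Table III. It assumes that the multipliers 5 and 3 in the CLK S periods of the delay analysis are the CLK F cycles needed per message transfer, and all variable names are illustrative.

```python
# Relative energy per decoded bit at a common CLK F frequency.
# All inputs are ratios quoted in the text; names are illustrative only.

power_ratio = 1.0 - 0.162          # P_LUT / P_MS (LUT power is 16.2% smaller)

# Assumption: CLK F cycles per serial message transfer (MS: 5, LUT: 3).
cycles_ms, cycles_lut = 5, 3

# At equal CLK F, throughput is inversely proportional to cycles per transfer.
throughput_ratio = cycles_ms / cycles_lut           # T_LUT / T_MS

# Energy per bit = average power / throughput.
energy_ratio = power_ratio / throughput_ratio       # E_LUT / E_MS

print(f"throughput gain of LUT over MS: {100 * (throughput_ratio - 1):.1f}%")
print(f"energy per bit, LUT relative to MS: {energy_ratio:.3f}")
print(f"energy-efficiency gain: {1 / energy_ratio:.2f}x")
```

Under these assumptions the throughput gain evaluates to 66.7% and the energy per bit of the LUT-based decoder to about half that of the MS decoder, in line with the "almost 2 times" figure above.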
D. Summary and Final Comparison
The final post-layout results for our MS and LUT-based decoders, together with those of some other recently implemented decoders, are summarized in Table IV. Our LUT-based decoder runs at a maximum CLK F frequency of f F = 862 MHz and delivers 0.594 Tbps, while it occupies an area of 16.2 mm² and dissipates 22.7 pJ/bit. Compared to the MS decoder, the LUT-based decoder is 1.4× smaller, 2.2× faster, and thus 3.1× more area efficient. It also has 16.2% lower power dissipation and 2× better energy efficiency when the decoding throughput is taken into account.
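The speed and area-efficiency ratios quoted above can be reproduced from the delay figures given in the delay analysis. The sketch below is illustrative only. It assumes that the unrolled decoder outputs one codeword per CLK S period, so that throughput is inversely proportional to the CLK S period, and it reads the 44% total-area difference as the MS decoder being 1.44 times larger. The dictionary layout and variable names are not from the paper.

```python
# Minimum CLK F period (ns) and CLK F cycles per message transfer,
# as quoted in the delay analysis. Names are illustrative only.
decoders = {
    "MS":  {"t_fast_ns": 1.51, "cycles": 5},
    "LUT": {"t_fast_ns": 1.16, "cycles": 3},
}

for name, d in decoders.items():
    d["t_slow_ns"] = d["cycles"] * d["t_fast_ns"]   # CLK S period
    d["f_fast_mhz"] = 1e3 / d["t_fast_ns"]          # maximum CLK F frequency
    print(f"{name}: CLK S period = {d['t_slow_ns']:.2f} ns, "
          f"max CLK F = {d['f_fast_mhz']:.0f} MHz")

# Assumption: one codeword per CLK S period, so the speedup is the
# ratio of the CLK S periods (independent of the block length).
speedup = decoders["MS"]["t_slow_ns"] / decoders["LUT"]["t_slow_ns"]

# Assumption: "44% difference in total area" means A_MS / A_LUT = 1.44.
area_ratio = 1.44

# Area efficiency is throughput per unit area.
area_eff_gain = speedup * area_ratio

print(f"speedup: {speedup:.2f}x, area-efficiency gain: {area_eff_gain:.2f}x")
```

With these inputs the maximum CLK F frequencies come out at about 662 MHz and 862 MHz, the speedup at about 2.2×, and the area-efficiency gain at about 3.1×.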
The work in [14] is the only other unrolled full-parallel decoder in the literature, but it is designed for the IEEE 802.11ad [22] code, which has a shorter block length and smaller node degrees (d v = 3 and d c = 6, as opposed to d v = 6 and d c = 32 for the code used in the design reported in this paper). The works in [8] and [19] target the same IEEE 802.3an code considered in this paper, but with partial-parallel and full-parallel architectures, respectively. Compared to [8] and [19], the proposed LUT-based decoder has more than an order of magnitude higher throughput, with almost the same energy efficiency as the decoder in [19]. The area efficiency of the proposed unrolled full-parallel architecture, however, is inferior to that of the full-parallel decoder in [19] due to the repeated routing overhead between the decoder stages in our design.
VII. CONCLUSION
An ultra-high throughput LDPC decoder with a serial message-transfer architecture, based on non-uniform quantization of messages, was proposed to achieve the highest decoding throughput in the literature. The proposed decoder architecture is an unrolled full-parallel architecture with serialized messages for the CN/VN units, which is enabled by employing S/P and P/S shift registers at the inputs and outputs of each unit. The proposed quantized message passing algorithm replaces conventional MS decoding, resulting in a 40% reduction in message bit-width without any performance penalty. This algorithm was implemented by using generic LUTs instead of adders for the VNs, while the CNs remained unchanged compared to MS decoding. Placement and routing results in 28 nm FD-SOI show that the LUT-based serial message-transfer decoder delivers 0.594 Tbps throughput and is 3.3 times more area efficient and 2 times more energy efficient than the MS decoder with the serial message-transfer architecture.
ACKNOWLEDGMENT
