Iterative decoding of low-density parity-check (LDPC) codes using the message-passing algorithm has proved to be extraordinarily effective compared to conventional maximum-likelihood decoding. However, the lack of any structural regularity in these essentially random codes is a major challenge for building a practical low-power LDPC decoder. In this paper, we jointly design the code and the decoder to induce the structural regularity needed for a reduced-complexity parallel decoder architecture. This interconnect-driven code design approach eliminates the need for a complex interconnection network while still retaining the algorithmic performance promised by random codes. Moreover, we propose a new approach for computing reliability metrics based on the BCJR algorithm that reduces the message switching activity in the decoder compared to existing approaches. Simulations show that the proposed approach results in power savings of up to 85.64% over conventional implementations.
INTRODUCTION
Turbo codes [1] and LDPC codes [2] are the two best known codes that are capable of achieving low bit error rates (BER) at code rates approaching Shannon's capacity limit. However, in order to achieve the desired power and throughput for current applications (e.g., > 1 Mbps in 3G wireless systems, > 1 Gbps in magnetic recording systems), fully parallel and pipelined iterative decoder architectures are needed.
*This work was supported with funds from NSF under grants CCR 99-79381 and CCR 00-73490.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ISLPED'02, August 12-14, 2002, Monterey, California, USA.
It is well known that global interconnects primarily determine the system performance in modern semiconductor processes. This suggests exploiting data locality to reduce communication overhead in a parallel VLSI implementation, and motivates a joint code-decoder design approach to construct a class of LDPC codes having the structural regularity and locality properties favorable for a practical parallel architecture. Previous LDPC decoder architectures [3]-[6] have not addressed the interconnection network problem in random LDPC codes. Issues related to routing conflicts on the network and scheduling in the function units significantly increase power consumption and degrade system performance, and hence need to be accounted for explicitly.
All current implementations of the message-passing algorithm employ Gallager's original algorithm [2] for computing reliability messages. However, due to quantization effects, this algorithm suffers from performance loss and requires a large number of iterations to converge, which increases the switching activity, power consumption, and decoding latency.
The contributions of this paper are twofold. First, we propose a new interconnect-driven LDPC code design approach that eliminates the power problem in the interconnection network by inducing desired regularity characteristics in the code. Second, we propose an optimized version of the BCJR algorithm [7] to compute reliability messages, which minimizes the effect of quantization noise on algorithmic performance and improves the power consumption. Thus, a low-power VLSI architecture for the decoder is obtained by optimizing both the interconnection network and the function units.
The rest of the paper is organized as follows. Section 2 presents a joint code-decoder design approach for the design of low-power decoder architectures for LDPC codes. Section 3 proposes a new optimized version of the BCJR algorithm for reducing power in the function units. Section 4 presents simulation results and Section 5 provides concluding remarks.
LDPC codes are decoded iteratively using Gallager's message-passing (MP) algorithm [2]. It is based on evaluating extrinsic reliability values associated with each bit using the disjoint parity check equations that the bit participates in [2]. Each iteration is composed of a two-phase schedule in which updates of all check nodes are done in phase 1 by sending messages ($\mu_{c \to b}$) to neighboring bit nodes, and then updates of all bit nodes are done in phase 2 by sending messages ($\mu_{b \to c}$) to neighboring check nodes (see Fig. 2). Updates in each phase are independent and can be parallelized.
For randomly constructed LDPC codes, the non-zero entries in H are randomly distributed across the rows and columns (while still satisfying the regularity constraints), which makes the interconnection networks in Fig. 1 for communicating messages from check nodes to bit nodes, and vice versa, very complex. In Section 2.1, we propose an approach for designing the codes, i.e., the matrix H, such that this interconnection complexity is minimized.
Interconnect-Driven LDPC Code Design
We propose to construct short-length LDPC codes (< 2K bits) having simple graph connectivity properties while still maintaining the performance of randomly constructed codes. One easy way to achieve the desired structural properties is to partition the parity check matrix H into blocks of $p \times p$ matrices, for some appropriately chosen $p$, such that …
LOW-POWER LDPC DECODER ARCHITECTURES
After briefly introducing LDPC codes, the interconnect-driven LDPC code design approach is proposed in Section 2.1. The resulting decoder architecture and its memory synchronization scheme are presented in Sections 2.2 and 2.3. An LDPC code is described by a bipartite graph (see Fig. 2) …
The easiest way to define the $p \times p$ block matrices is by suitable permutations of the rows of the identity matrix $I_{p \times p}$, in a fashion analogous to the construction of BCH codes. Let $B^{i}_{j,k}$ be an $I_{p \times p}$ identity matrix located at the $j$th block row and $k$th block column of the parity check matrix, having its rows shifted to the right $i \bmod p$ positions for $i \in S = \{0, 1, 2, \ldots, p-1\}$. Assume there exists a $q$ such that $q^c \equiv 1 \pmod{p}$. Then the operation of multiplying by $q \bmod p$ divides $S$ into cyclotomic cosets mod $p$. A cyclotomic coset containing the integer …
Interconnect-Driven Low-Power Decoder Architecture
This subsection presents a parallel decoder architecture for the LDPC codes designed via the interconnect-driven code construction method described in Section 2.1. The codes have length $n = cp$ and their bipartite graph has $cp$ bit nodes and $rp$ check nodes.
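The block structure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the shift values in `example_shifts` are arbitrary placeholders, not the cyclotomic-coset shifts of the paper, and the helper names `shifted_identity` and `build_parity_check` are hypothetical. The sketch builds H from r block rows and c block columns of cyclically shifted $p \times p$ identity matrices and checks the resulting dimensions and node degrees.

```python
def shifted_identity(p, shift):
    """p x p identity matrix whose rows are cyclically shifted right by shift mod p."""
    s = shift % p
    return [[1 if col == (row + s) % p else 0 for col in range(p)]
            for row in range(p)]


def build_parity_check(p, shifts):
    """Assemble H from r x c blocks; shifts[j][k] is the shift of block (j, k).

    The shift table is an input here; choosing it well is the code design problem.
    """
    r, c = len(shifts), len(shifts[0])
    H = []
    for j in range(r):
        blocks = [shifted_identity(p, shifts[j][k]) for k in range(c)]
        for row in range(p):
            # Concatenate the same row of every block in this block row.
            H.append([bit for blk in blocks for bit in blk[row]])
    return H


# Placeholder shifts for r = 3 block rows and c = 5 block columns (assumed values).
example_shifts = [[0, 1, 2, 3, 4],
                  [0, 2, 4, 6, 1],
                  [0, 3, 6, 2, 5]]
p = 7
H = build_parity_check(p, example_shifts)

num_checks, num_bits = len(H), len(H[0])        # rp check nodes, cp bit nodes
row_weights = {sum(row) for row in H}           # every check node has degree c
col_weights = {sum(col) for col in zip(*H)}     # every bit node has degree r
design_rate = 1 - num_checks / num_bits         # equals 1 - r/c
```

With r = 3 and c = 5 the design rate is 1 - 3/5 = 0.4, which matches the rate quoted for the (3,5)-regular example later in the paper; the true rate can be slightly higher if H is rank deficient.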
The proposed decoder in Fig. 5 is composed of two main blocks: BLOCK1, the bit-node processing block, and BLOCK2, the check-node processing block. Two frames are processed simultaneously by the decoder: phase 2 of the first frame is performed in the check-node processing block while phase 1 of the second frame is performed in the bit-node processing block. BLOCK1 contains c Bit Function Units (BFUs) and c memory banks for storing the check-to-bit messages ($\mu_{c \to b}$) obtained from BLOCK2 during the previous iteration. Each memory bank consists of r memory blocks and a set of r counters for address generation, in one-to-one correspondence with a single column of $p \times p$ blocks in the parity check matrix. In addition, there are 2 memory blocks for storing the intrinsic reliability metrics (IRM) of the two frames obtained from the channel. The $i$th memory block in the $j$th bank containing the bit messages is denoted by MemB$_{i,j}$, the block containing the intrinsic messages by IRM, and the corresponding counters for memory access are denoted by RD1$_{i,j}$. One BFU operates on a single memory bank by …
Memory Synchronization
The operations of BLOCK1 and BLOCK2 on the two frames simultaneously are synchronized dynamically through the sets of counters RD1$_{i,j}$ and RD2$_{i,j}$ in the memory banks. The two blocks alternate their operations between frame 1 and frame 2 in each processing phase. The function units (BFUs and CFUs) first read their operands from their respective memory banks according to RD1 and RD2. The messages in these locations are no longer needed, so they can be overwritten with new results from the opposite block. For this to be possible, the messages for both frames in BLOCK1 and BLOCK2 must be consumed and produced in a coherent manner so that new messages do not overwrite unprocessed messages. It can be shown that the misalignment of the messages in both blocks can be adjusted by incrementing counters RD1$_{i,j}$ and RD2$_{i,j}$ by …
for $2 \le i \le r$ and $2 \le j \le c$, at the end of every iteration.
Since the first row and column of blocks of the parity check matrix can be taken as references, counters RD1$_{i,1}$, RD1$_{1,j}$, RD2$_{i,1}$, and RD2$_{1,j}$ are not updated.
As an example, the middle block in Fig. 3 shows the misalignment of data between the frames (with reference to the vertical and horizontal lines) as a result of BLOCK2 overwriting messages from frame 1 still unprocessed by BLOCK1. The decoder is composed of 2 blocks, the bit-node processing block and the check-node processing block, which perform the two phases of Gallager's message-passing algorithm. The decoder processes 2 frames simultaneously by alternating between each phase of both frames in BLOCK1 and BLOCK2. The dynamic addressing scheme eliminates the need for an interconnection network.
The decoder completes one decoding iteration for 2 frames in $2p$ clock cycles. For $I$ iterations per frame and clock frequency $f$, the throughput is $cf/I$ bps. The throughput can be scaled up by a factor of $\nu$ by dividing the memory blocks in each memory bank in BLOCK1 and BLOCK2 into $\nu$ sub-blocks such that every c sub-blocks in BLOCK1 (respectively r sub-blocks in BLOCK2) form a sub-bank on which a BFU (CFU) operates, as shown on the right side of Fig. 5. Note, however, that the interconnection complexity also increases.
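The throughput expression follows from counting bits and clock cycles: two frames of cp bits each are decoded in I iterations of 2p cycles, so the rate is 2cp·f / (2pI) = cf/I. A minimal sketch of that arithmetic, with a hypothetical function name and an assumed 100 MHz clock and 10 iterations chosen purely for illustration:

```python
def decoder_throughput_bps(c, p, clock_hz, iterations, nu=1):
    """Decoded bits per second for the two-frame schedule described in the text.

    c, p       : block columns and block size, so each frame has c * p bits
    clock_hz   : clock frequency f
    iterations : decoding iterations I per frame
    nu         : sub-block scaling factor (1 means no scaling)
    """
    bits_decoded = 2 * c * p          # two frames processed together
    cycles = iterations * 2 * p       # 2p clock cycles per iteration
    return nu * bits_decoded * clock_hz / cycles


# Assumed operating point (not taken from the paper): f = 100 MHz, I = 10.
example = decoder_throughput_bps(c=5, p=211, clock_hz=100e6, iterations=10)
```

Note that p cancels out, so for a fixed clock and iteration count the throughput depends only on the number of block columns c and the scaling factor.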
LOW-POWER BCJR-BASED CFU ARCHITECTURE
In this section, we present a low-power approach based on the BCJR algorithm [7] to generate the check-to-bit messages performed by the CFU blocks in Fig. …
Typically $\Psi(z)$ is implemented as a small look-up table (LUT). Equation (1), however, is prone to quantization noise due to the non-linear function $\Psi(z)$ and its inverse. If the argument of $\Psi(z)$ is scaled to increase the dynamic range, the quantization levels become coarser and deviate from the ideal line. These disadvantages translate to algorithmic performance loss: for a given SNR, more iterations are needed for convergence, which increases the decoding latency, the switching activity, and hence the power consumption of the decoder. This problem has been overlooked in all current implementations, which resort solely to scaling the input to mitigate quantization effects. We employ the BCJR algorithm [7], typically used in turbo codes, to compute the check-to-bit messages by tailoring it to the trellis of a single parity check equation shown in Fig. 7. The BCJR algorithm computes a posteriori reliability metrics (check-to-bit messages) on the trellis using the bit-to-check messages according to eqs. (3)-(5):
$$\alpha'_2 = \ln\left(e^{\alpha_1 + \lambda_2} + e^{\alpha_2 + \lambda_1}\right) \qquad (3)$$
$$\beta'_2 = \ln\left(e^{\beta_1 + \lambda_2} + e^{\beta_2 + \lambda_1}\right) \qquad (4)$$
where $\lambda_2 - \lambda_1 = \mu_{b \to c}$ and $\Lambda = \mu_{c \to b}$. Equations (3) and (4) are called the forward and backward state metric recursions, respectively. Referring to Fig. 7, the forward recursion is first performed on all trellis states from left to right and the results are stored. Next, the backward recursion is performed from right to left in parallel with the output metric computation given by (5) using the stored forward state metrics. … (5) by replacing $\Delta\lambda$ by $\Delta\beta$ in (6). Note that (6) can be implemented simply as a Compare-Select (CS) block using two comparators, two 2's complementers, a multiplexer, and a LUT that stores the sum $[z + Q(z)]_b$, as shown in Fig. 8(a), assuming comparators are implemented using adders. A similar argument holds for $\Delta\beta$ and $\Lambda$. Figure 8(b) shows the structure of the Check Function Unit (CFU) constructed from the 3 CS blocks implementing the $\Delta\alpha$, $\Delta\beta$, and $\Lambda$ equations. Note that 2 simple FIFO buffers are used to store the forward state metric differences $\Delta\alpha$ and the messages $\mu_{b \to c}$.
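The forward/backward procedure on the two-state trellis of a single parity check can be sketched in floating point as follows. This is an illustrative sketch, not the paper's fixed-point CFU: it keeps both state metrics instead of the metric differences used by the CS blocks, and it assumes the convention that a message is ln(P(bit = 0)/P(bit = 1)), which may differ in sign from the paper's convention. The function names are hypothetical. The well-known tanh rule is included only as an independent cross-check.

```python
import math


def log_sum_exp(a, b):
    """ln(e^a + e^b), computed as max plus a correction term."""
    m = max(a, b)
    if m == -math.inf:
        return -math.inf
    return m + math.log1p(math.exp(-abs(a - b)))


def spc_bcjr(bit_to_check):
    """Check-to-bit messages for one parity check via forward/backward recursions.

    State 0 / 1 is the running parity. A branch for bit value 0 carries the
    incoming message m as its metric; a branch for bit value 1 carries 0.
    """
    d = len(bit_to_check)
    # Forward recursion, left to right; the trellis starts in parity state 0.
    alpha = [(0.0, -math.inf)]
    for m in bit_to_check:
        a0, a1 = alpha[-1]
        alpha.append((log_sum_exp(a0 + m, a1), log_sum_exp(a1 + m, a0)))
    # Backward recursion, right to left; the trellis must end in parity state 0.
    beta = [(0.0, -math.inf)] * (d + 1)
    for k in range(d - 1, -1, -1):
        b0, b1 = beta[k + 1]
        m = bit_to_check[k]
        beta[k] = (log_sum_exp(b0 + m, b1), log_sum_exp(b1 + m, b0))
    # Extrinsic output: combine metrics before and after bit k, excluding bit k.
    out = []
    for k in range(d):
        a0, a1 = alpha[k]
        b0, b1 = beta[k + 1]
        bit_is_0 = log_sum_exp(a0 + b0, a1 + b1)   # parity state unchanged
        bit_is_1 = log_sum_exp(a0 + b1, a1 + b0)   # parity state flipped
        out.append(bit_is_0 - bit_is_1)
    return out


def tanh_rule(bit_to_check):
    """Reference check-node update: 2 * atanh(product of tanh(m / 2) over the other bits)."""
    out = []
    for k in range(len(bit_to_check)):
        prod = 1.0
        for j, m in enumerate(bit_to_check):
            if j != k:
                prod *= math.tanh(m / 2.0)
        out.append(2.0 * math.atanh(prod))
    return out
```

For moderate message magnitudes the two functions agree to within floating-point error, which is a quick way to sanity-check a trellis-based check-node unit before quantizing it.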
This CFU block has comparable complexity to a CFU block implementing (1) using a tree-adder, sign correction logic and a pair of LUTs. Figure 9 shows the Bit Function Unit implementing (2). Note that the only difference among the messages computed in (2) is the effect of the check node under consideration. Hence, the equation can be implemented using the Save-Add-Subtract (SAS) structure, where the operands are first saved and then the summation is computed serially (Fig. 9(a)) or using a tree-adder (Fig. 9(b)), and then the appropriate operands are subtracted to generate all the messages.
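The Save-Add-Subtract idea can be illustrated with a short sketch. It assumes the usual bit-node update in which each outgoing message is the intrinsic metric plus all incoming check-to-bit messages except the one from the destination check node; the function names are hypothetical and no quantization or saturation is modeled.

```python
def bit_node_update_sas(intrinsic, check_to_bit):
    """Save the operands, add them once, then subtract each one in turn."""
    total = intrinsic + sum(check_to_bit)        # single shared summation
    return [total - m for m in check_to_bit]     # one subtraction per output


def bit_node_update_direct(intrinsic, check_to_bit):
    """Reference: recompute the sum separately for every outgoing message."""
    return [intrinsic + sum(m for j, m in enumerate(check_to_bit) if j != k)
            for k in range(len(check_to_bit))]


messages = bit_node_update_sas(0.5, [1.0, -2.0, 0.25])
```

The shared sum is what keeps the unit small: for a bit node of degree r it needs one r-operand addition and r subtractions instead of r separate (r-1)-operand additions.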
In Fig. 6(b) we compare the quantization effects on the computation of the check-to-bit messages using the transformed equations of the BCJR algorithm. The entire dynamic range can now be represented, and the LUT provides quantized values that closely track the desired output compared to the quantized Gallager equations in Fig. 6(a). It is worth mentioning the absence of the adders required to compute the metrics needed in (1), which are subject to quantization noise under $\Psi^{-1}(\Psi(\cdot))$.
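To make the quantization issue concrete, the sketch below assumes that the non-linear function in Gallager's check-node update is Ψ(x) = -ln tanh(x/2), the form commonly used in the literature; equation (1) itself is not reproduced in this text, so that choice is an assumption. Ψ is its own inverse in exact arithmetic, but once its input and output are forced onto a uniform grid the round trip no longer returns the original value. The quantizer, its range `x_max`, and all names are hypothetical.

```python
import math


def psi(x):
    """Psi(x) = -ln(tanh(x / 2)) for x > 0; mathematically its own inverse."""
    return -math.log(math.tanh(x / 2.0))


def quantize(x, bits, x_max):
    """Uniform quantizer onto the grid {step, 2*step, ..., x_max}.

    The lowest level is clamped to one step so that psi is never evaluated at 0.
    """
    levels = 2 ** bits - 1
    step = x_max / levels
    index = min(levels, max(1, round(x / step)))
    return index * step


def quantized_round_trip(x, bits, x_max):
    """Apply psi twice with quantization before and after each lookup."""
    y = quantize(psi(quantize(x, bits, x_max)), bits, x_max)
    return quantize(psi(y), bits, x_max)
```

Evaluating `quantized_round_trip(x, bits, 8.0)` over a sweep of x for different word lengths gives a rough picture of how far the quantized Ψ⁻¹(Ψ(·)) strays from the ideal line; the exact shape depends on the assumed range and rounding.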
To compare the effect of quantization on the algorithmic performance of the message-passing algorithm using Gallager's method and the proposed method, a low-level simulator was developed for both algorithms, and the results are shown in Fig. 10. Figures 10(a)-(b) show the percentage of valid frames decoded and the number of iterations required for convergence, respectively, using the unquantized Gallager algorithm, a 6-bit quantized Gallager algorithm, and a 6-bit quantized BCJR algorithm. Figures 10(c)-(f) show the results for 5-bit and 4-bit quantization, respectively. The results demonstrate that the optimized BCJR algorithm is superior to the original algorithm, particularly for 4-bit quantization, where it attains more than 1 dB of improvement in coding gain. Moreover, the BCJR algorithm quantized to 5 bits achieves even better performance than a 6-bit quantized Gallager algorithm. The average improvement in coding gain achieved at 4-bit, 5-bit, and 6-bit quantization levels is 5.2%, 11.98%, and 119.19%, respectively. Note also the reduction in switching activity due to the decrease in the number of iterations.
SIMULATION RESULTS
To simulate the power savings of the decoder architecture resulting from applying the interconnect-driven design approach and the calculation of the reliability messages via the proposed BCJR approach, the architecture (A1) for a regular (3,5)-LDPC code of length n = 1055 and rate 0.4 constructed from cyclotomic cosets is used as an example. The results are compared with the decoder architecture (A2) for a random LDPC code of similar complexity constructed using interconnection networks like the ones shown in Fig. 1 and utilizing Gallager's original algorithm for computing reliability messages. Architecture A2 has twice the memory requirements of A1 and operates on a single frame, while A1 decodes two frames simultaneously. Note that due to the irregularity of the code matrix in A2, the two phases of computations, bit-to-check and check-to-bit, cannot be overlapped, which necessitates two copies of each memory bank that alternate between read and write. The algorithmic performance of both codes over an AWGN channel is shown in Fig. 4.
The memory blocks for A1 were designed as circular buffers, while those for A2 were built as FIFO buffers and stacks similar to [5]. The interconnection networks for A2 were built using 631-input multiplexers and 210-output demultiplexers that serve as read and write ports for the CFU blocks, and similarly for the BFU blocks using 1055-input multiplexers and 1055-output demultiplexers as read/write ports. Power estimates for the CFU and BFU blocks used in both architectures were obtained by synthesizing the blocks in Figs. 8 and 9(b) (for A1) and their counterparts for A2 using 3.3 V, 0.35 µm standard CMOS technology. Similarly, power estimates for the circular and FIFO buffers, stacks and (de)multiplexers were obtained through synthesis. The switching activity for the logic was estimated through simulations for different SNR. Figure 11 shows the power consumed by A1 and A2 as a function of SNR for the same algorithmic performance. At the same SNR, A1 implemented using 5-bit datapaths achieves comparable and even better performance than A2 implemented using 6-bit datapaths, mainly due to quantization effects in the CFU blocks as seen in Fig. 10, and partly due to the better performance of the cyclotomic code itself over the random one as shown in Fig. 4. As seen from the figure, the power savings range between 82.42% and 85.64%. The power consumed by A2 at an SNR of 4 dB is almost the same as that consumed by A1 at an SNR of 1.5 dB, which reflects the reduction in switching activity in the decoder due to faster convergence as a function of SNR. Figure 12 shows a breakdown of the power distribution as a function of SNR in A1 and A2 in terms of message computations, communication, and memory accesses.
The figure demonstrates the effectiveness of the interconnect-driven code design approach in eliminating the effect of the interconnection network, and the efficiency of the dynamic memory addressing scheme in reducing the power consumption in the memory units by approximately 80%. The power savings in the computations demonstrate that, contrary to common perception, the optimized BCJR algorithm, when tailored to the trellis of a single parity check equation, is a viable alternative to the Gallager algorithm.
CONCLUSIONS
The LDPC decoding algorithm achieves significant coding gain but at the expense of large power consumption in the decoder due to the lack of structural regularity and the inefficiency of the algorithm employed for computing reliabilities. We have shown that through an interconnect-driven code design approach, coupled with a dynamic addressing scheme and an optimized version of the BCJR algorithm for computing reliabilities, power savings of up to 85.64% can be achieved. 
