Abstract-We propose a bit-serial LDPC decoding scheme to reduce interconnect complexity in fully-parallel low-density parity-check decoders. Bit-serial decoding also facilitates efficient implementation of [...]

I. INTRODUCTION

Low-density parity-check (LDPC) codes [1] have recently been adopted for several data communication applications due to their superior coding performance and parallelizable decoder architecture. LDPC codes allow a fine-level parallel message-passing decoding in which all the check and variable nodes are updated concurrently. This parallelism can potentially be used to build a decoder with multi-Gbit/sec throughput. The major obstacle for efficient implementation of fully-parallel LDPC decoders is interconnect complexity, which is the result of the random location of 1's in the code's parity-check matrix.

In this paper, we propose a bit-serial scheme for fully-parallel [...]

A. LDPC codes and min-sum decoding

A binary $(N, N-M)$ LDPC code, C, is the null space of a sparse [...]

$M(n) = \{m : H_{mn} = 1\}$.

The following paragraphs describe min-sum (MS) decoding [2], which can be considered as an approximation to the commonly-used iterative sum-product (SP) algorithm [3]. Although the performance of MS is generally a few tenths of a dB lower than that of SP decoding, it is more robust to quantization errors when implemented with fixed-point operations [4]. Moreover, it requires much simpler hardware for the check node functions compared to SP decoding.

In MS decoding, similar to the SP algorithm, the extrinsic messages are passed between check and variable nodes in the form of log-likelihood ratios (LLRs). Let $z_{mn}^{(i)}$ represent the LLR value for bit n, sent from variable node $v_n$ to check node $c_m$ in the ith iteration [...]

In message-passing LDPC decoding, a large number of messages need to be updated and transferred between check and variable nodes in each iteration. Previous works have proposed several approaches for representing and updating these messages. In [5], analog signals are used to represent the extrinsic messages. In analog decoders the exponential voltage-current relationship of a transistor is used to realize the message-passing update functions. Although analog decoders have the advantage of low power consumption, they become impractical for decoding long LDPC codes due to noise and process mismatch.

More conventional LDPC decoders often use multi-bit digital signals to represent the messages. In partially-parallel decoders [6], [7], the messages are transferred between the nodes through memory. This architecture reduces the decoder area by sharing the processing units, but this comes at the cost of reduced throughput. To achieve higher throughput, in the fully-parallel decoder presented in [8], all check and variable nodes are directly instantiated in hardware. Using this architecture, a throughput of 1 Gbps with 64 iterations per frame is reported. The major challenge in the implementation of fully-parallel LDPC decoders is the complex and random interconnection between the variable and check nodes. This problem is worsened when multi-bit buses are used to realize the edges in the code Tanner graph.

C. Bit-serial computation

To reduce the complexity of the interconnect in fully-parallel LDPC decoders, in this paper we investigate a bit-serial approach for both communicating and computing extrinsic messages. Fig. 1 shows the difference between the conventional bit-parallel scheme and a bit-serial scheme for the simple case of transferring an n-bit number, $b_n \ldots b_2 b_1$. In Fig. 1(a) all the n bits are sent over n parallel lines in one clock cycle. In contrast, in a bit-serial scheme as in Fig. 1(b), the message is sent over a single line in n clock cycles.

Fig. 1. Two alternatives for synchronous transmission of an n-bit number. (a) Bit-parallel: n bits sent in one clock cycle over n wires. (b) Bit-serial: n bits sent in n clock cycles over one wire.

Stochastic computation [9] is similar to bit-serial computation in that it communicates extrinsic messages over single wires. It has a very simple check and variable node architecture but needs a significant amount of hardware overhead in order to translate the stochastic messages at the decoder inputs and outputs. In addition, stochastic computation uses a redundant number representation, which limits the decoder throughput.

In addition to simplifying the node-to-node interconnection, the bit-serial approach has several other advantages for fully-parallel LDPC decoders. In a bit-serial scheme, the wordlength of computations can be increased simply by increasing the number of clock cycles allocated for transmitting the messages. Using this property, the [...] the node-to-node message transfers without the need for extra routing channels. Programmability of the decoder wordlength allows one to efficiently trade off complexity for error correction performance. This in turn allows efficient implementation of gear-shift decoding [10]. Gear-shift decoding is based on the idea of changing the decoding update rule used in different iterations to simultaneously optimize hardware complexity and error correction performance. For instance, gear-shift decoding often suggests applying a complex, powerful update rule in the first few iterations followed by simpler update functions in later iterations. Bit-serial computation allows efficient shifting between update rules by changing the computation wordlength.

Bit-serial decoding, however, imposes some challenges. The immediate effect is that it reduces the decoder throughput compared with bit-parallel fully-parallel implementations, as multiple clock cycles are required for transmitting a single message. Also, some common check and variable update functions cannot be efficiently implemented bit-serially. Although bit-serial fully-parallel LDPC decoders have a lower throughput compared with bit-parallel fully-parallel LDPC decoders, we will show in this paper that their throughput can still be higher than that of hardware-sharing decoder schemes.

II. SIMPLIFIED CHECK UPDATE FUNCTION

The MS decoding algorithm, as described in Section I, is cumbersome if implemented in a bit-serial hardware decoder. In this section, we introduce an approximation to the MS algorithm that reduces the hardware complexity of check nodes while causing minimal degradation in code performance. In fact, this approximation is also applicable to bit-parallel hardware decoders.

The first step is to replace the check update rule of (1) with

$$\epsilon_{mn}^{(i)} = \min_{n' \in N(m)} \left| z_{mn'}^{(i)} \right| \prod_{n' \in N(m) \setminus n} \operatorname{sgn}\left( z_{mn'}^{(i)} \right) \qquad (3)$$

In other words, the signs of the check node outputs are calculated exactly the same as before, but now the output magnitude is the minimum of the magnitudes of all input messages. Fig. 2 [...] (3) significantly reduces the hardware complexity. The reason is that once the minimum among all input magnitudes is found, it is sent out as the magnitude of all the outgoing messages, $\epsilon_{mn}^{(i)}$, for all $n \in N(m)$.

We have observed that although the above modification to MS results in almost no performance loss under full-precision operations, it introduces a considerable loss when performed in finite precision. Fig. 3 shows the effect of the MS approximation when applied to quantized messages. In the following paragraphs we introduce a further change to the modified MS decoding algorithm that reduces the performance gap shown above.

The sign of the output messages in the new check update rule is the same as in (3). The magnitude of the output message is calculated as follows. First, for check node $c_m$, in the ith iteration, we define $M_m^{(i)} = \min_{n' \in N(m)} |z_{mn'}^{(i)}|$. We also define $1 \le T_m^{(i)} \le K$ as the number of inputs $z_{mn}^{(i)}$ to check node $c_m$ that satisfy $|z_{mn}^{(i)}| = M_m^{(i)}$. The magnitudes of the check node outputs are calculated as [...]

[Figure legend: [...] LDPC (992, 833); modified min-sum, LDPC (992, 833)]

[...] processes the magnitude of the check node inputs whereas the sign bit is generated separately using an XOR tree. Fig. 4 and Fig. 6 [...]
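As an illustration of the simplified check update rule of (3), the following is a minimal software sketch, not the hardware implementation described in the paper. It assumes integer LLR inputs for a single check node; the function names `check_update_approx` and `check_update_min_sum` are hypothetical. The second function is the standard min-sum rule (minimum taken over the other inputs only), included for comparison. The further magnitude correction based on the count of minimum-magnitude inputs is not modeled here.

```python
from functools import reduce
from operator import xor


def check_update_approx(z):
    """Approximate min-sum check update in the spirit of (3).

    z: list of integer LLR inputs to one check node (hypothetical format).
    Returns one outgoing message per input edge.
    """
    sign_bits = [1 if v < 0 else 0 for v in z]  # 1 encodes a negative LLR
    parity = reduce(xor, sign_bits)             # XOR over all input sign bits
    magnitude = min(abs(v) for v in z)          # one magnitude shared by all outputs
    outputs = []
    for s in sign_bits:
        # XOR-ing out the edge's own sign bit leaves the parity of the others.
        out_sign = -1 if (parity ^ s) else 1
        outputs.append(out_sign * magnitude)
    return outputs


def check_update_min_sum(z):
    """Standard min-sum: the magnitude is the minimum over the other inputs."""
    outputs = []
    for i in range(len(z)):
        others = z[:i] + z[i + 1:]
        negative = sum(1 for v in others if v < 0) % 2
        magnitude = min(abs(v) for v in others)
        outputs.append(-magnitude if negative else magnitude)
    return outputs


z = [3, -1, 4, -2]
print(check_update_approx(z))   # [1, -1, 1, -1]
print(check_update_min_sum(z))  # [1, -2, 1, -1]
```

In this example the two rules differ only on the edge that supplies the minimum magnitude (the second input): the standard rule excludes that edge's own input and returns the second-smallest magnitude, whereas the approximation sends the same minimum on every edge.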
Since the updated messages are carried bit-serially over single wires, the complexity of node-to-node interconnections is less than that of conventional bit-parallel fully-parallel decoders [8]. Fig. 7 shows the measured BER performance from decoder hardware as well as the bit-true simulation. Table II summarizes [...]

[...] for iterative decoders in magnetic recording channels," IEEE Transactions on Magnetics, vol. 37, pp. 748-755, March 2001.
[7] T. Zhang and K. K. Parhi, "A 54 MBPS (3, 6)-regular FPGA LDPC decoder," in IEEE Workshop on Signal Processing Systems, San Diego, CA, 2002.
