Applying a joint code and decoder design methodology, we develop a high-speed (3, k)-regular LDPC code partly parallel decoder architecture, based on which a 9216-bit, rate-1/2 (3,6)-regular LDPC code decoder is implemented on a Xilinx FPGA device. When performing a maximum of 18 iterations for each code block decoding, this partly parallel decoder supports a maximum symbol throughput of 54 Mbps and achieves BER at 2 dB over an AWGN channel.
INTRODUCTION
Thanks to its excellent performance, the Low-Density Parity-Check (LDPC) code [1] [2] has been widely considered as a next-generation error-correcting code for telecommunication and magnetic storage. Defined as the null space of a very sparse $M \times N$ parity check matrix $H$, an LDPC code is typically represented by a bipartite graph, called a Tanner graph, in which one set of $N$ variable nodes corresponds to the set of codeword bits, another set of $M$ check nodes corresponds to the set of parity check constraints, and each edge corresponds to a non-zero entry in the parity check matrix $H$. An LDPC code is known as a (j, k)-regular LDPC code if each column and each row in its parity check matrix have j and k non-zero entries, respectively. The construction of an LDPC code is typically random. As illustrated in Fig. 1, an LDPC code is decoded by the iterative belief-propagation (BP) algorithm [2] that directly matches its Tanner graph. A fully parallel decoder is realized by directly instantiating the BP decoding algorithm in hardware. Such a fully parallel decoder can achieve extremely high decoding speed, e.g., a 1024-bit, rate-1/2 LDPC code fully parallel decoder [4] with a maximum symbol throughput of 1 Gbit/s has been implemented using ASIC technology. However, the primary disadvantage of the fully parallel design is that, as the code length increases, the hardware complexity becomes more and more prohibitive for many practical purposes, e.g., the ASIC LDPC decoder [4] with only 1K-bit code length consumes 1.7M gates. Moreover, as pointed out in [4], the routing overhead is quite formidable due to the large code length and the randomness of the Tanner graph.
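To make the Tanner graph and (j, k)-regularity definitions above concrete, the following minimal Python sketch checks regularity, lists the Tanner graph edges, and tests for 4-cycles (two columns of $H$ sharing two or more rows). The function names and the toy matrix `H_TOY` are hypothetical illustrations, not taken from [5]; the toy matrix is far too small to be a useful code.

```python
from typing import List, Tuple


def is_regular(H: List[List[int]], j: int, k: int) -> bool:
    """True if every column of H has j ones and every row has k ones."""
    rows_ok = all(sum(row) == k for row in H)
    cols_ok = all(sum(col) == j for col in zip(*H))
    return rows_ok and cols_ok


def tanner_edges(H: List[List[int]]) -> List[Tuple[int, int]]:
    """One (check_node, variable_node) edge per non-zero entry of H."""
    return [(m, n) for m, row in enumerate(H) for n, h in enumerate(row) if h]


def has_4_cycle(H: List[List[int]]) -> bool:
    """A 4-cycle exists when two columns share at least two check nodes."""
    n_cols = len(H[0])
    cols = [{m for m, row in enumerate(H) if row[n]} for n in range(n_cols)]
    return any(
        len(cols[a] & cols[b]) >= 2
        for a in range(n_cols)
        for b in range(a + 1, n_cols)
    )


# Hypothetical toy (2, 4)-regular matrix, used only to exercise the helpers.
H_TOY = [
    [1, 1, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1, 1],
    [1, 1, 0, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 0, 1, 1],
]
```

The toy matrix is regular but contains 4-cycles, which is exactly the situation the joint design approach described later is meant to rule out by construction.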
A joint code and decoder design methodology [5] was recently proposed for (3, k)-regular LDPC code and partly parallel decoder design to achieve appropriate trade-offs between hardware complexity and decoding throughput. In this paper, applying the proposed joint design methodology, we develop an elaborate (3, k)-regular LDPC code high-speed partly parallel decoder architecture, based on which we implement a 9216-bit, rate-1/2 (3,6)-regular LDPC code decoder using a Xilinx Virtex FPGA device. We significantly modify the original decoder structure [5] to improve the decoding throughput and simplify the control logic design. We propose a novel concatenated scheme to realize the random connectivity by using two concatenated routing networks, where the random hardwire routings are localized to significantly reduce the routing overhead. Based on the post-routing static timing analysis, with a maximum of 18 decoding iterations, this decoder supports a maximum symbol throughput of 54 Mbps and achieves BER at 2 dB over an AWGN channel.
JOINT CODE AND DECODER DESIGN
In this section we briefly describe the joint (3, k)-regular LDPC code and decoder design methodology according to [5]. The essential objective of this joint design approach is to construct an LDPC code that not only fits a high-speed partly parallel decoder but also has a large average cycle length in its 4-cycle free Tanner graph. This joint design process is outlined as follows and the corresponding schematic flow diagram is shown in Fig. 2.
This decoder completes each decoding iteration in $2L$ clock cycles. During the 1st and 2nd $L$ clock cycles, it works in check node processing (CNP) mode and variable node processing (VNP) mode, respectively. In the CNP mode, the decoder performs both the computations of all the check nodes and the decoding information exchange between neighboring nodes. In the VNP mode, the decoder only performs the computations of all the variable nodes.
All the intrinsic, check-to-variable and variable-to-check information are quantized to 5 bits.
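As a rough illustration of the variable node computation under 5-bit quantization, the sketch below uses the textbook sum-form BP update in the log-likelihood-ratio domain. The text does not spell out the arithmetic, so the sign convention (a non-negative total maps to hard decision 0), the symmetric saturation range, and the names `saturate` and `variable_node_update` are all assumptions made for this example.

```python
from typing import List, Tuple


def saturate(value: int, bits: int = 5) -> int:
    """Clamp to a symmetric signed range, e.g. [-15, 15] for 5 bits (assumed)."""
    limit = (1 << (bits - 1)) - 1
    return max(-limit, min(limit, value))


def variable_node_update(
    intrinsic: int, check_to_var: List[int]
) -> List[Tuple[int, int]]:
    """Return one (hard_decision, variable_to_check) pair per incoming message.

    Each outgoing message excludes the check-to-variable message that arrived
    on the same edge; the hard decision uses all incoming information.
    """
    total = intrinsic + sum(check_to_var)
    hard = 1 if total < 0 else 0  # assumed convention: negative LLR -> bit 1
    return [(hard, saturate(total - c)) for c in check_to_var]
```

For a (3, k)-regular code, `check_to_var` holds 3 messages, so each call produces 3 pairs, one per neighboring check node.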
Check node processing
In the CNP mode, the decoder performs the computations of all the check nodes and the decoding information exchange between neighboring nodes. At the beginning, in each $PE_{x,y}$, the memory location with address $d-1$ in $EXTRAM_i$ contains 6-bit hybrid data consisting of a 1-bit hard decision and 5-bit variable-to-check information associated with the variable node $v_d^{(x,y)}$.
In each clock cycle, this decoder performs the read-shuffle-modify-unshuffle-write operations to convert one variable-to-check information in each $EXTRAM_i$ to its check-to-variable counterpart. We outline the datapath loop in CNP mode as follows:
1. Read: One 6-bit hybrid data $h_{x,y}^{(i)}$ is read from each $EXTRAM_i$;
2. Shuffle: Each $h_{x,y}^{(i)}$ goes through the shuffle network $\pi_i$ and arrives at $CNU_{i,j}$;
3. Modify: Each $CNU_{i,j}$ performs the parity check on the 6 input hard decision bits and generates the 6 output 5-bit check-to-variable information $\beta_{x,y}^{(i)}$;
4. Unshuffle: Send each $\beta_{x,y}^{(i)}$ back to the PE Block via the same path as its variable-to-check counterpart;
5. Write: Write each check-to-variable information $\beta_{x,y}^{(i)}$ to the same memory location in $EXTRAM_i$ as its variable-to-check counterpart.
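The five-step loop above can be sketched in software as follows. This is only a behavioral model under simplifying assumptions: a single flat permutation stands in for a shuffle network, `modify` is a caller-supplied placeholder for the check node units, and all function names are hypothetical.

```python
from typing import Callable, List, Sequence


def invert(perm: Sequence[int]) -> List[int]:
    """Inverse permutation: if perm sends src to dst, the result sends dst to src."""
    inv = [0] * len(perm)
    for src, dst in enumerate(perm):
        inv[dst] = src
    return inv


def shuffle(data: Sequence, perm: Sequence[int]) -> List:
    """Route data[src] to output position perm[src]."""
    out = [None] * len(data)
    for src, dst in enumerate(perm):
        out[dst] = data[src]
    return out


def parity_ok(hard_bits: Sequence[int]) -> bool:
    """Parity check over the hard decision bits seen by one check node."""
    return sum(hard_bits) % 2 == 0


def cnp_step(
    memories: List[List],
    addresses: Sequence[int],
    perm: Sequence[int],
    modify: Callable[[List], List],
) -> None:
    """One read-shuffle-modify-unshuffle-write pass, one memory per PE block."""
    read = [mem[addr] for mem, addr in zip(memories, addresses)]  # 1. read
    at_cnu = shuffle(read, perm)                                  # 2. shuffle
    updated = modify(at_cnu)                                      # 3. modify
    back = shuffle(updated, invert(perm))                         # 4. unshuffle
    for mem, addr, value in zip(memories, addresses, back):       # 5. write
        mem[addr] = value
```

Because the unshuffle step applies the inverse permutation, every updated value lands in the memory it was read from, at the same address.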
We implement each bi-directional I/O connection in the 3 shuffle networks by two distinct sets of wires with opposite directions so that the hybrid data from PE Blocks to CNUs and the check-to-variable information from CNUs to PE Blocks are carried on distinct sets of wires. Compared with sharing one set of wires in time-multiplexed fashion, this approach has higher wire routing overhead but eliminates the logic gate overhead due to the realization of time multiplexing and, more importantly, makes it feasible to directly pipeline the datapath loop for higher decoding throughput. Each $EXTRAM_i$ is associated with one address generator $AG_{x,y}^{(i)}$ that provides the read address in each clock cycle. The write address for writing the check-to-variable information is obtained by delaying the read address by the number of pipelining stages of the datapath loop. The connectivity among all the variable nodes and check nodes realized by this decoder is jointly specified by all the address generators and the 3 shuffle networks. Moreover, for $i = 1, 2, 3$, the submatrix $H_i$, i.e., the connectivity among all the variable nodes and the check nodes in $CG_i$, is completely determined by all the $AG_{x,y}^{(i)}$'s and $\pi_i$.
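The relationship between read and write addresses in a pipelined datapath loop can be modeled with a simple delay line. The class below is a hypothetical behavioral sketch, not the authors' RTL; the number of pipeline stages is a free parameter because the text does not state it.

```python
from collections import deque
from typing import Optional


class DelayedWriteAddress:
    """Delay line: the write address is the read address from N cycles ago."""

    def __init__(self, pipeline_stages: int) -> None:
        if pipeline_stages < 1:
            raise ValueError("pipeline_stages must be at least 1")
        # None marks cycles in which no valid write address is available yet.
        self._line = deque([None] * pipeline_stages, maxlen=pipeline_stages)

    def step(self, read_addr: int) -> Optional[int]:
        """Advance one clock cycle; return this cycle's write address."""
        write_addr = self._line[0]
        self._line.append(read_addr)  # maxlen drops the oldest entry
        return write_addr
```

With two pipeline stages, for example, the first two cycles produce no write, and from the third cycle on each write address trails the read address by two cycles.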
Implementations of $AG_{x,y}^{(i)}$ and $\pi_i$ for $i = 1, 2$: Recall that node $v_d^{(x,y)}$ corresponds to the column $h_d^{(x,y)}$ as illustrated in Fig. 3 and the decoding information associated with node ... To simplify the design process, we separately conceive the $AG_{x,y}^{(3)}$'s and $\pi_3$ so that the designs of the $AG_{x,y}^{(3)}$'s and $\pi_3$ accomplish the above 1st and 2nd requirements, respectively.
Implementations of $AG_{x,y}^{(3)}$:
We implement each $AG_{x,y}^{(3)}$ as a $\lceil \log_2 L \rceil$-bit binary counter that counts up to the value $L - 1$ and loads a constant value $t_{x,y}$ at the beginning of CNP mode. Each $t_{x,y}$ is generated at random under the following two constraints:
1. Given $x$, we have $t_{x,y_1} \neq t_{x,y_2}$, $\forall y_1, y_2 \in \{1, \ldots, k\}$;
2. Given $y$, we have $t_{x_1,y} - t_{x_2,y} \not\equiv ((x_1 - x_2) \cdot y) \bmod L$, $\forall x_1, x_2 \in \{1, \ldots, k\}$.
We can prove that the above constraints are sufficient to make $H$ always correspond to a 4-cycle free Tanner graph no matter how we implement $\pi_3$.
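A minimal sketch of how counter start values satisfying the two constraints above might be drawn by rejection sampling is given below. It reads constraint 2 as an incongruence modulo $L$ and applies both constraints to distinct index pairs only (with equal indices they could never hold). The wrap-around of the counter after $L - 1$ and every function name here are assumptions for illustration.

```python
import random
from itertools import combinations
from typing import Dict, List, Tuple

StartValues = Dict[Tuple[int, int], int]


def satisfies_constraints(t: StartValues, k: int, L: int) -> bool:
    """Check both constraints for t[(x, y)], with 1 <= x, y <= k."""
    for x in range(1, k + 1):  # constraint 1: distinct values for a given x
        if len({t[(x, y)] for y in range(1, k + 1)}) != k:
            return False
    for y in range(1, k + 1):  # constraint 2: incongruence for a given y
        for x1, x2 in combinations(range(1, k + 1), 2):
            if (t[(x1, y)] - t[(x2, y)]) % L == ((x1 - x2) * y) % L:
                return False
    return True


def draw_start_values(k: int, L: int, seed: int = 0,
                      max_tries: int = 10000) -> StartValues:
    """Rejection sampling: redraw all values until both constraints hold."""
    rng = random.Random(seed)
    for _ in range(max_tries):
        t: StartValues = {}
        for x in range(1, k + 1):
            # Sampling without replacement enforces constraint 1 directly.
            for y, value in enumerate(rng.sample(range(L), k), start=1):
                t[(x, y)] = value
        if satisfies_constraints(t, k, L):
            return t
    raise RuntimeError("no valid assignment found within max_tries")


def counter_addresses(t_xy: int, L: int) -> List[int]:
    """Addresses over L cycles, assuming the counter wraps from L - 1 to 0."""
    return [(t_xy + c) % L for c in range(L)]
```

For example, `draw_start_values(6, 256)` returns 36 start values, one per PE Block, for the code parameters used in this paper.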
Implementation of $\pi_3$:
We develop a novel concatenated configurable random shuffle network implementation scheme for $\pi_3$ as described in the following ... so that $\pi_3$ can unshuffle the $k^2$ data backward from $CNU_{3,j}$ to $PE_{x,y}$ along the same route as the forward path on distinct sets of wires.
To make the connectivity realized by $\pi_3$ random-like and change each clock cycle, we randomly generate the control words $s_x^{(r)}$ and $s_y^{(c)}$ for each clock cycle and each $R_x$ and $C_y$. Since most modern FPGA devices have multiple metal layers, the implementations of the two shuffle arrays can be overlapped from the bird's-eye view. Therefore, such a concatenated implementation scheme confines all the routing wires to a small area (in one row or one column), which significantly reduces the routing overhead.
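One way to picture a two-stage concatenated shuffle over a $k \times k$ arrangement is sketched below. The details are assumptions made only for illustration: each row and each column has a fixed local permutation, and a 1-bit control value per row or column selects between that permutation and a straight pass-through. The names `forward`, `backward`, `row_ctrl` and `col_ctrl` are hypothetical and do not come from the paper.

```python
from typing import List, Sequence


def apply_perm(items: Sequence, perm: Sequence[int]) -> List:
    """Route items[src] to output position perm[src]."""
    out = [None] * len(items)
    for src, dst in enumerate(perm):
        out[dst] = items[src]
    return out


def invert(perm: Sequence[int]) -> List[int]:
    inv = [0] * len(perm)
    for src, dst in enumerate(perm):
        inv[dst] = src
    return inv


def transpose(grid: List[List]) -> List[List]:
    return [list(col) for col in zip(*grid)]


def forward(grid, row_perms, col_perms, row_ctrl, col_ctrl):
    """Stage 1 shuffles within each row, stage 2 within each column."""
    rows = [apply_perm(r, p) if c else list(r)
            for r, p, c in zip(grid, row_perms, row_ctrl)]
    cols = [apply_perm(col, p) if c else col
            for col, p, c in zip(transpose(rows), col_perms, col_ctrl)]
    return transpose(cols)


def backward(grid, row_perms, col_perms, row_ctrl, col_ctrl):
    """Undo the column stage first, then the row stage, with the same controls."""
    cols = [apply_perm(col, invert(p)) if c else col
            for col, p, c in zip(transpose(grid), col_perms, col_ctrl)]
    return [apply_perm(r, invert(p)) if c else r
            for r, p, c in zip(transpose(cols), row_perms, row_ctrl)]
```

In this model every wire stays within a single row or a single column, and reusing the same control values on the way back returns each item to its source position.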
The operations performed in variable node processing are quite simple since the decoder only performs the variable node computations. At the beginning of variable node ... As illustrated in Fig. 5, in each clock cycle this decoder performs the read-modify-write operations to convert the 3 check-to-variable information associated with the same variable node to 3 hybrid data consisting of variable-to-check information and hard decision.
This decoder works simultaneously on 3 consecutive code frames in two-stage pipelining mode: while one frame is being iteratively decoded, the next frame is loaded into the decoder and the hard decisions of the previous frame are read out from the decoder. Thus each INTRAM contains two RAM blocks to store the intrinsic information of both the current and next frames. Similarly, each DECRAM contains two RAM blocks to store the hard decisions of both the current and previous frames. The intrinsic information input and hard decision output schemes are heavily dependent on the floor planning of the $k^2$ PE Blocks. To minimize the routing overhead, we develop a square-shaped floor planning as illustrated in Fig. 7, and the data I/O scheme is described in the following:
Intrinsic Data Input: The intrinsic information of the next frame is loaded 1 symbol per clock cycle. As shown in Fig. 7, the memory location of each input intrinsic data is determined by the input load address that has the width of $(\lceil \log_2 L \rceil + \lceil \log_2 k^2 \rceil)$ bits, in which $\lceil \log_2 k^2 \rceil$ bits specify which PE Block is being accessed and the other $\lceil \log_2 L \rceil$ bits represent the memory location in the INTRAM. The primary intrinsic data and load address input directly connect to the $k$ PE Blocks $PE_{1,y}$ for $1 \le y \le k$, and from each $PE_{x,y}$ the intrinsic data and load address are delivered to the adjacent PE Block $PE_{x+1,y}$ in pipelined fashion.
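The load address format described above can be illustrated with a short sketch. The text gives only the two field widths; placing the PE Block index in the upper bits and the INTRAM location in the lower bits is an assumption of this example, as are the function names.

```python
from math import ceil, log2
from typing import Tuple


def address_widths(L: int, k: int) -> Tuple[int, int]:
    """Return (PE Block select bits, INTRAM location bits)."""
    return ceil(log2(k * k)), ceil(log2(L))


def split_load_address(addr: int, L: int, k: int) -> Tuple[int, int]:
    """Split a load address into (PE Block index, INTRAM location).

    Assumed layout: PE Block index in the upper bits, location in the lower bits.
    """
    _, mem_bits = address_widths(L, k)
    pe_index = addr >> mem_bits
    location = addr & ((1 << mem_bits) - 1)
    return pe_index, location
```

For $L = 256$ and $k = 6$ this gives a 6-bit PE Block field and an 8-bit location field, i.e., a 14-bit load address.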
Decoded Data Output:
As shown in Fig. 7, the primary $\lceil \log_2 L \rceil$-bit read address input directly connects to the $k$ PE Blocks $PE_{x,1}$ for $1 \le x \le k$, and from each $PE_{x,y}$ the read address is delivered to the adjacent block $PE_{x,y+1}$ in pipelined fashion. Each PE Block outputs a 1-bit hard decision per clock cycle. Therefore, as illustrated in Fig. 7, the width of the pipelined decoded data bus increases by 1 after going through one PE Block, and at the rightmost side we obtain $k$ $k$-bit decoded outputs that are combined together as the $k^2$-bit primary data output.
FPGA IMPLEMENTATION
Based on the above architecture, we implemented a (3,6)-regular LDPC code partly parallel decoder for $L = 256$ using a Xilinx Virtex-E XCV2600E device. The LDPC code length is $N = L \cdot k^2 = 256 \cdot 6^2 = 9216$ and the code rate is 1/2. The target XCV2600E FPGA device contains 184 on-chip block RAMs, each of which is a dual-port 4K-bit RAM. We configure each dual-port 4K-bit RAM as two independent single-port 256 x 8-bit RAM blocks so that each $EXTRAM_i$ can be realized by one single-port 256 x 8-bit RAM block.
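The relation between clock frequency and symbol throughput for these parameters can be worked out with a small sketch. It assumes that frame loading and read-out are fully overlapped with decoding (as the frame pipelining suggests) and ignores any pipeline fill latency; the clock frequency is a free input because the text here does not state it, and the function name is hypothetical.

```python
def symbol_throughput(clock_hz: float, L: int, k: int, iterations: int) -> float:
    """Symbols per second under the simplified model described above."""
    n_bits = L * k * k                   # code length N = L * k^2
    cycles_per_frame = 2 * L * iterations  # 2L clock cycles per iteration
    return n_bits * clock_hz / cycles_per_frame
```

With $L = 256$, $k = 6$ and 18 iterations, a frame takes $2 \cdot 256 \cdot 18 = 9216$ cycles, which equals the code length $N = 9216$. Under this simplified model the throughput in symbols per second is therefore numerically equal to the clock frequency in Hz, so 54 Mbps would correspond to a clock of roughly 54 MHz.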
Since each INTRAM contains two RAM blocks for storing the intrinsic information of both the current and next code frames, we use two single-port 256 x 8-bit RAM blocks to implement one INTRAM. The DECRAM is realized by distributed RAM that provides shallow RAM structures implemented in CLBs. Because all the RAM blocks have fixed locations, the placement of the decoder is primarily carried out based on the RAM block locations, and we manually configured the placement of each PE Block according to the floor planning scheme shown in Fig. 7. Notice that such a placement scheme exactly matches the structure of the configurable shuffle network $\pi_3$. This decoder is described in VHDL, and Synopsys FPGA Express was used to synthesize the VHDL implementation. The Xilinx Development System tool suite was used to place and route the synthesized implementation for the target XCV2600E device with the speed option -7. The resource utilization statistics are listed in
