A novel, flexible and scalable parallel LDPC decoding approach for the WiMAX wireless broadband standard (IEEE 802.16e) in the multicore Cell broadband engine architecture is proposed. A multicodeword LDPC decoder performing the simultaneous decoding of 96 codewords is presented. The coded data rate achieved a range of 72-80 Mbit/s, which compares well with VLSI-based decoders and is superior to the maximum coded data rate required by the WiMAX standard performing in worst case conditions. The 8-bit precision arithmetic adopted shows additional advantages over traditional 6-bit precision dedicated VLSI-based solutions, allowing better error floors and BER performance.
Introduction: The WiMAX IEEE 802.16e standard [1] was developed to establish non-line-of-sight (NLoS) connectivity between a base station and a subscriber. Mobile communications supported by WiMAX technology can deliver very demanding coded data rates. The theoretical maximum is approximately 75 Mbit/s per channel, but real-world performance is considerably lower, typically achieving a maximum of 40 Mbit/s per channel.
From a computational perspective, the most demanding efforts lie in the error correcting code system, which is based on low-density parity-check (LDPC) codes, particularly on the decoder side. LDPCs are $(n, k)$ linear block codes with length $n$ and $k = \text{rate} \times n$, defined by sparse parity-check binary $H$ matrices. They usually take the form of bipartite Tanner graphs [2], which represent connections between two kinds of nodes: bit nodes (BN) and check nodes (CN). The belief in probabilistic information received from the channel is propagated through neighbouring nodes as defined by the Tanner graph [2], in order to reach a codeword that respects all parity-check equations.
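As a small illustration of the relation between a parity-check matrix and its Tanner graph, the following Python sketch derives the BN/CN neighbour lists from a binary matrix and checks whether a word respects all parity-check equations. The function names and the toy (7, 4) matrix are assumptions made for this example; the matrix is not a WiMAX code and is not sparse.

```python
def tanner_neighbours(H):
    """Return (cn_to_bn, bn_to_cn) adjacency lists of the Tanner graph of H.

    cn_to_bn[m] lists the bit nodes connected to check node m;
    bn_to_cn[n] lists the check nodes connected to bit node n.
    """
    n_checks, n_bits = len(H), len(H[0])
    cn_to_bn = [[n for n in range(n_bits) if H[m][n]] for m in range(n_checks)]
    bn_to_cn = [[m for m in range(n_checks) if H[m][n]] for n in range(n_bits)]
    return cn_to_bn, bn_to_cn


def satisfies_parity(H, word):
    """True if every row of H sums to zero (mod 2) over the given word."""
    return all(sum(h * b for h, b in zip(row, word)) % 2 == 0 for row in H)


# Toy (7, 4) parity-check matrix, used only for illustration (hypothetical,
# not taken from the IEEE 802.16e standard).
H_TOY = [
    [1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1],
]

cn_to_bn, bn_to_cn = tanner_neighbours(H_TOY)
n_edges = sum(len(row) for row in cn_to_bn)  # one Tanner graph edge per 1 in H
```

Each 1 in the matrix corresponds to one edge of the graph, so the number of edges equals the number of non-zero entries.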
Since LDPC decoding demands very intensive computation, most solutions are dedicated, non-scalable VLSI designs, typically using 5- to 6-bit precision arithmetic, which imposes error floor limitations on the bit error rate (BER) performance [3]. The new approach proposed in this Letter is based on the low-cost heterogeneous Cell multicore architecture [4] and uses 8-bit precision arithmetic, allowing significant coding gains in terms of BER and error floor, as shown in Fig. 1, compared with a common 6-bit precision solution [6]. Furthermore, it is a programmable solution based on a powerful data transfer mechanism that both supports the parallel processing of data on each of the independent synergistic processor elements (SPEs) of the Cell broadband engine (Cell/BE) architecture [4] and allows data locality to be exploited. The adoption of specific parallelisation techniques produces coded data rates that approach those of VLSI solutions [5, 6], and at the same time guarantees enough bandwidth for the WiMAX standard to work in worst case conditions.
Decoding algorithm: The min-sum (MS) algorithm is one of the most efficient algorithms [5, 6] used for LDPC decoding. It is a simplification of the well-known sum-product algorithm (SPA) in the logarithmic domain. Even so, the MS algorithm still requires intensive processing. For each node pair (BN $n$, CN $m$) we initialise $Lq_{nm}$ with the a priori log-likelihood ratio (LLR) information received from the channel, $LP_n$. Then we proceed to the iterative body of the algorithm by performing steps 1 and 2, where $Lr_{mn}$ denotes the message sent from CN $m$ to BN $n$, and $Lq_{nm}$ denotes the message sent from BN $n$ to CN $m$:
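1. Horizontal processing: Calculating the messages sent from CN $m$ to BN $n$:
$$Lr_{mn} = \prod_{n' \in N(m) \setminus n} \operatorname{sign}\left(Lq_{n'm}\right) \cdot \min_{n' \in N(m) \setminus n} \left|Lq_{n'm}\right| \qquad (1)$$
2. Vertical processing: Calculating the messages sent from BN $n$ to CN $m$:
$$Lq_{nm} = LP_n + \sum_{m' \in M(n) \setminus m} Lr_{m'n} \qquad (2)$$
3. Finally, we compute the a posteriori pseudo-probabilities and perform hard decoding:
$$LQ_n = LP_n + \sum_{m' \in M(n)} Lr_{m'n}, \qquad \hat{c}_n = \begin{cases} 1, & LQ_n < 0 \\ 0, & \text{otherwise} \end{cases} \qquad (3)$$
Here $N(m)$ is the set of BNs connected to CN $m$, $M(n)$ is the set of CNs connected to BN $n$, and $LQ_n$ is the a posteriori LLR of bit $n$.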
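The three steps above can be sketched in a few lines of Python. This is a minimal floating-point reference for a single codeword, assuming the standard min-sum updates and an LLR convention in which negative values favour bit 1; it is not the vectorised 8-bit Cell/BE implementation described in this Letter. The function name, the early-stopping test and the dictionary-based message storage are choices made for this example.

```python
def min_sum_decode(H, LP, max_iter=10):
    """Decode one codeword with the min-sum algorithm.

    H  : binary parity-check matrix as a list of rows (every row weight >= 2)
    LP : a priori channel LLRs, one per bit (negative favours bit 1; assumed)
    Returns the hard-decision bits after at most max_iter iterations.
    """
    n_checks, n_bits = len(H), len(H[0])
    rows = [[n for n in range(n_bits) if H[m][n]] for m in range(n_checks)]
    cols = [[m for m in range(n_checks) if H[m][n]] for n in range(n_bits)]

    # Initialisation: every BN -> CN message starts at the channel LLR.
    Lq = {(m, n): LP[n] for m in range(n_checks) for n in rows[m]}
    Lr = {}
    hard = [1 if v < 0 else 0 for v in LP]

    for _ in range(max_iter):
        # Step 1, horizontal processing: CN m -> BN n.
        for m in range(n_checks):
            for n in rows[m]:
                others = [Lq[(m, k)] for k in rows[m] if k != n]
                negatives = sum(1 for v in others if v < 0)
                sign = -1.0 if negatives % 2 else 1.0
                Lr[(m, n)] = sign * min(abs(v) for v in others)

        # Step 2, vertical processing: BN n -> CN m.
        for n in range(n_bits):
            for m in cols[n]:
                Lq[(m, n)] = LP[n] + sum(Lr[(j, n)] for j in cols[n] if j != m)

        # Step 3, a posteriori LLRs and hard decision.
        LQ = [LP[n] + sum(Lr[(m, n)] for m in cols[n]) for n in range(n_bits)]
        hard = [1 if v < 0 else 0 for v in LQ]

        # Stop early once all parity-check equations are satisfied.
        if all(sum(hard[n] for n in rows[m]) % 2 == 0 for m in range(n_checks)):
            break
    return hard


# Toy (7, 4) matrix for illustration only (hypothetical, not a WiMAX code).
H_TOY = [
    [1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1],
]

# All-zero codeword sent; bit 3 received with a weak, wrong-signed LLR.
decoded = min_sum_decode(H_TOY, [2.0, 2.0, 2.0, -0.5, 2.0, 2.0, 2.0])
```

In this toy run the three check nodes connected to bit 3 each return a positive message, which outweighs the weak negative channel LLR and corrects the bit in the first iteration.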
Parallelisation strategy: The multicore Cell/BE architecture has one PowerPC processor element (PPE) and eight SPEs that support multithreading and single instruction multiple data (SIMD) instructions, associated with an efficient low-latency data transfer mechanism based on DMA [4]. The decoding is performed simultaneously, in parallel for multiple codewords, on the several processor elements of the architecture. Each processor performs vectorised processing based on 128-bit wide instructions, computing sixteen 8-bit data elements simultaneously. The 8-bit $Lr_{mn}$ and $Lq_{nm}$ messages are updated independently for each codeword.
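To make the 16-lane, 8-bit layout concrete, the following Python sketch emulates one 128-bit vector as a list of 16 integers, with one lane per codeword. The names, the quantisation scale and the saturating behaviour are assumptions made for this example: the Letter does not state how LLRs are quantised or how overflow is handled, and a real SPE implementation would use SIMD instructions rather than Python lists.

```python
LANES = 16                      # 128-bit vector / 8-bit elements
INT8_MIN, INT8_MAX = -128, 127  # signed 8-bit range


def saturate8(x):
    """Clamp an integer to the signed 8-bit range (assumed overflow policy)."""
    return max(INT8_MIN, min(INT8_MAX, x))


def quantise_llr(llr, scale=8.0):
    """Map a real-valued LLR to an 8-bit integer; scale is a hypothetical choice."""
    return saturate8(int(round(llr * scale)))


def pack_lanes(codeword_llrs, bit_index):
    """Gather the same bit position from 16 codewords into one 16-lane vector."""
    if len(codeword_llrs) != LANES:
        raise ValueError("expected exactly 16 codewords")
    return [cw[bit_index] for cw in codeword_llrs]


def simd_add_sat(a, b):
    """Lane-wise saturating addition of two 16-lane vectors."""
    if len(a) != LANES or len(b) != LANES:
        raise ValueError("expected 16-lane vectors")
    return [saturate8(x + y) for x, y in zip(a, b)]
```

Because each lane belongs to a different codeword, a single vector operation advances the same message update for all 16 codewords at once, and no lane ever depends on another.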
The parallel approach adopted consists of exploiting the PPE to manage the I/O and to preprocess and distribute the data to be computed in the next step, which is performed on each of the available SPEs. According to Fig. 2, in a Cell/BE architecture with $K$ SPEs, the $P$th SPE is responsible for decoding codewords $c_{16P}$ up to $c_{16(P+1)-1}$, with $P \in \mathbb{N}$ and $0 \le P \le K-1$. The obtained coded data rates are given in Table 1 for different rates and numbers of iterations. The six SPEs can process 96 ($6 \times 16$) codewords in parallel, producing coded data rates of 72-80 Mbit/s for ten iterations, which compare well with hardware-dedicated VLSI solutions [5, 6]. For a number of $P$ SPEs, the global coded data rate $T$ can be obtained as:
where $n$ denotes the codeword length and $f_{op}$ denotes the SPE frequency of operation. The number of operations per decoding iteration is represented by $N_{op/iter}$, and $N_{iter}$ is the number of iterations. To analyse the results obtained, we propose the following mathematical model described in (5), which assumes a hypothetical upper bound on the coded data rate, $T_{ma}$, exclusively dependent on local SPE memory accesses, given by:
where $W_m$ represents the SPE memory bandwidth assuming that the algorithm has only memory accesses and no arithmetic operations (this value was measured experimentally, $W_m \approx 3.01$ Gword/s, that is, giga 128-bit words per second), and Edges is the total number of edges in the Tanner graph of a code. We denote by $a$ the ratio between the global coded data rate obtained and the virtual coded data rate $T_{ma}$, so that $T = a \times T_{ma}$. The ratio between the two coded data rates shows $a < 1$ and is approximately constant (with negligible variation) for all codes and rates ($a \approx 0.2134$). This relation shows that the global coded data rates in Table 1 are mainly limited by the SPE processing time rather than by memory accesses, and that the $T_{ma}$ model fits the experimental results with adequate precision. Furthermore, considerable coding gains were achieved using 8-bit arithmetic, as opposed to the 6-bit arithmetic typically used in hardware solutions. Fig. 1 shows the BER results obtained by simulating data transmission over an additive white Gaussian noise (AWGN) channel, for WiMAX codes (576, 288) and (1248, 624). For $E_b/N_0$ equal to 3 dB in code (576, 288), it can be seen that moving from a 6-bit to an 8-bit arithmetic based solution yields a coding gain of 0.5 dB.
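The codeword partitioning and the relation between the two rates can be sketched in Python as follows. The codeword range per SPE and the relation $T = a \times T_{ma}$ come from the text; the function `compute_bound_rate` is only a dimensional reading of the quantities defined above (16 codewords of $n$ bits per SPE, decoded in $N_{op/iter} \times N_{iter}$ operations at $f_{op}$ operations per second) and is an assumption, not necessarily the exact expression used by the authors. All function names are hypothetical.

```python
CODEWORDS_PER_SPE = 16  # one codeword per 8-bit lane of a 128-bit vector


def codewords_for_spe(p, k_spes):
    """Indices of the codewords decoded by SPE p, for 0 <= p <= k_spes - 1."""
    if not 0 <= p < k_spes:
        raise ValueError("SPE index out of range")
    return range(CODEWORDS_PER_SPE * p, CODEWORDS_PER_SPE * (p + 1))


def codewords_in_flight(k_spes):
    """Total number of codewords decoded in parallel on k_spes SPEs."""
    return CODEWORDS_PER_SPE * k_spes


def memory_bound_rate(t_measured, a=0.2134):
    """Invert T = a * T_ma to obtain the memory-access-only bound T_ma."""
    return t_measured / a


def compute_bound_rate(k_spes, n, f_op, n_op_iter, n_iter):
    """Assumed model: bits decoded per second when every SPE retires
    f_op operations per second and a decode costs n_op_iter * n_iter operations."""
    return k_spes * CODEWORDS_PER_SPE * n * f_op / (n_op_iter * n_iter)


total = codewords_in_flight(6)        # six SPEs, as in the reported results
third_spe = codewords_for_spe(2, 6)   # codewords c_32 ... c_47
```

With the reported $a \approx 0.2134$, measured rates of 72 to 80 Mbit/s correspond to a memory-only bound of roughly 337 to 375 Mbit/s under this reading.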
Conclusions: The advent of inexpensive multicore architectures allows the development of a novel programmable LDPC decoding solution for the WiMAX standard, with excellent coded data rates, on the Cell/BE architecture. The LDPC decoder presented exploits parallelism and data locality and is scalable to future generations of the Cell/BE architecture, which are expected to have more SPEs and should therefore improve the performance even further, processing more channels/subcarriers per second. The proposed LDPC decoder compares well with non-scalable and hardware-dedicated VLSI LDPC decoding solutions, reporting superior BER performance and coded data rates above 72 Mbit/s.
