A new strategy is proposed for implementing computationally intensive, high-throughput decoders for the long, irregular LDPC codes adopted in the DVB-S2 standard. It targets manycore graphics processing unit (GPU) architectures and performs parallel multi-threaded decoding of multiple codewords with reduced accesses to global memory. The approach is flexible and scalable, and achieves throughputs above the 90 Mbit/s required by the DVB-S2 standard, while improving error-correcting performance (BER and error floors) relative to conventional VLSI-based decoders.
Introduction:
The second generation of the digital video broadcasting standard for satellite communications (DVB-S2) [1] uses binary, irregular low-density parity-check (LDPC) codes with frame lengths of up to n = 64 800 bits. The length and irregularity of these codes make it harder to achieve real-time LDPC decoding with throughputs above 90 Mbit/s, as required by the DVB-S2 standard [1]. LDPC codes are (n, k) linear block codes with length n and k information bits, defined by a sparse binary (n − k) × n parity-check matrix H. They can also be represented as bipartite or Tanner graphs [2] with two types of node processors, check nodes (CNs) and bit nodes (BNs), connected by bidirectional edges.
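As a minimal sketch of the definitions above, the Tanner graph can be stored as one list of bit-node indices per check node (one list per row of H), and a word is a codeword when every row has even parity. The `SparseH` type, the function names and the toy (7, 4) matrix are illustrative assumptions; the toy matrix is neither sparse nor a DVB-S2 code.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sparse parity-check matrix: rows[m] holds the indices of the bit nodes
// (columns of H) connected to check node m. These lists are the graph edges.
struct SparseH {
    int n;                               // codeword length (columns of H)
    std::vector<std::vector<int>> rows;  // one list per check node, (n - k) in total
};

// Returns true when all parity-check equations hold, i.e. c * H^T = 0 over GF(2).
bool is_codeword(const SparseH& h, const std::vector<std::uint8_t>& c) {
    assert(static_cast<int>(c.size()) == h.n);
    for (const auto& row : h.rows) {
        std::uint8_t parity = 0;
        for (int bit : row) {
            parity ^= static_cast<std::uint8_t>(c[bit] & 1u);
        }
        if (parity != 0) return false;  // this check node is unsatisfied
    }
    return true;
}

// Hypothetical toy code: n = 7, k = 4, three check nodes of degree 4.
SparseH toy_h() {
    return SparseH{7, {{0, 1, 2, 4}, {1, 2, 3, 5}, {0, 2, 3, 6}}};
}
```

With this toy matrix, the word 1011001 satisfies all three checks, whereas flipping its third bit violates them.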
The information received from the channel is propagated between neighbouring nodes of the graph (belief propagation), iteration after iteration, in order to infer a valid codeword ĉ that satisfies all parity-check equations [2]. The computation is intensive because thousands of messages are processed and exchanged between adjacent nodes of the graph (several messages per node and iteration).
Although most solutions rely on dedicated VLSI architectures [3, 4], with 5- to 6-bit fixed-point arithmetic and the corresponding BER and error-floor limitations, more flexible and programmable approaches have recently been proposed [5-7]. However, they target only regular or short-length LDPC codes. The novel approach proposed here is based on ubiquitous, low-cost and homogeneous manycore GPU architectures [8] that nowadays exist in conventional personal computers. The GPU device communicates with the host CPU over the PCIe bus and processes data by exploiting a high level of parallelism based on hundreds of cores and thousands of threads launched simultaneously [8]. The proposed solution decodes irregular and very long (n = 64 800) LDPC codes. Moreover, it is programmable, which introduces flexibility and allows higher precision to be used to represent data, therefore increasing coding gains, as Fig. 1 shows. These properties represent significant advantages when compared with typical VLSI decoders [3, 4].
Min-sum decoding algorithm: Although the min-sum algorithm (MSA) is a simplification of the sum-product algorithm (SPA) [2] based on the processing of log-likelihood ratios (LLRs), it still requires intensive processing. As step 1 in Algorithm 1 illustrates, before the first iteration executes, all input $Lp_n$ data received from the channel (a priori LLRs) are used to initialise the $Lq_{nm}$ elements. Then, the MSA processes kernels 1 and 2 (steps 3 to 6) iteratively until the stop conditions occur [2].
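Algorithm 1: Min-sum algorithm
1. $Lq_{nm}^{(0)} = Lp_n$ //initialisation
2. while ($\hat{c} H^{T} \neq 0$ and $i < I$) do
3. $Lr_{mn}^{(i)} = \prod_{n' \in N(m) \setminus n} \mathrm{sign}\big(Lq_{n'm}^{(i-1)}\big) \cdot \min_{n' \in N(m) \setminus n} \big|Lq_{n'm}^{(i-1)}\big|$ //kernel 1 - horizontal processing
4. $Lq_{nm}^{(i)} = Lp_n + \sum_{m' \in M(n) \setminus m} Lr_{m'n}^{(i)}$ //kernel 2 - vertical processing
5. $LQ_n^{(i)} = Lp_n + \sum_{m' \in M(n)} Lr_{m'n}^{(i)}$ //a posteriori LLRs calculation
6. $\hat{c}_n = (LQ_n^{(i)} > 0 \; ? \; 0 : 1)$ //hard decoding
7. end while
Here N(m) denotes the set of BNs connected to CN m, M(n) the set of CNs connected to BN n, i the iteration index and I the maximum number of iterations.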
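The steps of Algorithm 1 can be sketched as a sequential, single-codeword reference in C++. This is a floating-point illustration under stated assumptions, not the GPU implementation: the `Rows` type and the function name are hypothetical, every check node is assumed to have degree at least 2, and the vertical step is computed as $LQ_n - Lr_{mn}$, which equals the sum over all other check nodes.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// rows[m] lists the bit nodes connected to check node m (row m of H).
using Rows = std::vector<std::vector<int>>;

// Min-sum decoding of one codeword. lp holds the a priori LLRs (Lp_n);
// a positive LLR favours bit 0. Stops early once all checks are satisfied.
std::vector<std::uint8_t> min_sum_decode(const Rows& rows,
                                         const std::vector<float>& lp,
                                         int max_iter) {
    const std::size_t n = lp.size();
    std::vector<std::vector<float>> lq(rows.size()), lr(rows.size());
    for (std::size_t m = 0; m < rows.size(); ++m) {
        assert(rows[m].size() >= 2);  // assumption: no degree-1 check nodes
        lr[m].assign(rows[m].size(), 0.0f);
        for (int bit : rows[m]) {
            assert(bit >= 0 && static_cast<std::size_t>(bit) < n);
            lq[m].push_back(lp[bit]);  // step 1: Lq_nm = Lp_n
        }
    }
    std::vector<std::uint8_t> c(n, 0);
    for (int it = 0; it < max_iter; ++it) {
        // Kernel 1 (horizontal): sign product and minimum magnitude over
        // all edges of the check node except the one being updated.
        for (std::size_t m = 0; m < rows.size(); ++m) {
            for (std::size_t e = 0; e < rows[m].size(); ++e) {
                float sign = 1.0f;
                float mag = std::numeric_limits<float>::infinity();
                for (std::size_t o = 0; o < rows[m].size(); ++o) {
                    if (o == e) continue;
                    if (lq[m][o] < 0.0f) sign = -sign;
                    mag = std::fmin(mag, std::fabs(lq[m][o]));
                }
                lr[m][e] = sign * mag;
            }
        }
        // A posteriori LLRs: LQ_n = Lp_n + sum of all incoming Lr.
        std::vector<float> lQ(lp);
        for (std::size_t m = 0; m < rows.size(); ++m)
            for (std::size_t e = 0; e < rows[m].size(); ++e)
                lQ[rows[m][e]] += lr[m][e];
        // Kernel 2 (vertical): exclude the edge's own contribution.
        for (std::size_t m = 0; m < rows.size(); ++m)
            for (std::size_t e = 0; e < rows[m].size(); ++e)
                lq[m][e] = lQ[rows[m][e]] - lr[m][e];
        // Hard decoding followed by the parity-check stop test.
        for (std::size_t i = 0; i < n; ++i) c[i] = (lQ[i] > 0.0f) ? 0 : 1;
        bool all_ok = true;
        for (std::size_t m = 0; m < rows.size() && all_ok; ++m) {
            std::uint8_t parity = 0;
            for (int bit : rows[m]) parity ^= c[bit];
            all_ok = (parity == 0);
        }
        if (all_ok) break;
    }
    return c;
}
```

On a GPU, the two loops marked as kernels would each become one kernel launch, with the inner work of a row or column assigned to a single thread.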
Exploiting parallelism on CUDA-based GPUs: NVIDIA GPUs based on the compute unified device architecture (CUDA) consist of several multiprocessors, each composed of stream processors that execute thousands of threads simultaneously following a single-instruction multiple-thread (SIMT) approach. The host CPU transfers data and controls the execution of the kernels on the device. During each iteration of Algorithm 1, the host launches two main kernels on the device to process the horizontal and vertical steps, respectively steps 3 and 4 (as Fig. 2 illustrates). This parallel approach adopts a thread-per-node perspective, where each thread processes a complete row (horizontal processing) or column (vertical processing) of H. At the end, data is returned from the device to the host. The GPU DVB-S2 LDPC decoder processes 16 codewords in parallel. Because data is represented with 8-bit precision, each 128-bit access to the GPU's global memory reads or writes 16 data elements, which favours parallelism. The 8-bit $Lr_{mn}$ and $Lq_{nm}$ data elements are processed independently for each of the 16 codewords. Furthermore, for the addresses of the Tanner graph edges, we developed specifically for this GPU-based approach a data memory organisation similar to the one used by Kienle et al. in ASICs [3]. Consequently, these addresses can be calculated on-the-fly using data from the GPU's fast constant memory. This avoids accessing the GPU's slow global memory to read the edge addresses, which saves considerable computation time. Therefore, significantly higher throughputs are obtained compared with previously reported approaches [5-7].
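Two ideas from this section can be sketched in host-side C++. First, sixteen 8-bit messages (one per codeword) for the same edge fit in a single 128-bit word. Second, edge addresses can be computed rather than stored: the sketch uses the rule by which the DVB-S2 standard defines the connections of its information bits, (x + (bit mod 360) · q) mod (n − k) with q = (n − k)/360. The authors' actual memory organisation and constant-memory tables are not given in the text, so `PackedMsg`, `check_address` and any base address `x` passed to it are illustrative assumptions.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kCodewords = 16;  // codewords decoded in parallel (from the text)
constexpr int kGroup = 360;     // size of a DVB-S2 bit-node group

// Messages of one Tanner-graph edge for all 16 codewords.
// 16 x 8 bit = 128 bit, so one aligned 128-bit access moves all of them.
struct alignas(16) PackedMsg {
    std::int8_t cw[kCodewords];
};
static_assert(sizeof(PackedMsg) == 16, "one edge must fill exactly 128 bits");

// Check-node index reached from information bit node `bit`, given one base
// address `x` (hypothetical here; the real values come from the code tables).
// Only x needs to be stored per group of 360 bit nodes; the other 359
// addresses follow from the arithmetic below.
int check_address(int x, int bit, int n, int k) {
    const int m = n - k;  // number of check nodes
    assert(m > 0 && m % kGroup == 0);
    assert(x >= 0 && x < m && bit >= 0 && bit < k);
    const int q = m / kGroup;  // rate-dependent constant
    return (x + (bit % kGroup) * q) % m;
}
```

For the rate-1/2 code (n = 64 800, k = 32 400), q is 90, so consecutive bit nodes within a group map to check nodes 90 positions apart, wrapping around modulo 32 400.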
Experimental results: The experimental results were obtained by decoding a subset of DVB-S2 [1] LDPC codes using a C2050 Fermi GPU programmed in C++, compiled with GCC 4.4 and CUDA 3.2 [8]. The host platform runs a GNU/Linux kernel 2.6.31-22 x86_64.
The selected GPU has 14 multiprocessors with 32 stream processors each. It was programmed by launching 128 threads per block and a number of blocks that depends on the DVB-S2 code and kernel type (1 or 2).
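One plausible way to derive the number of blocks, assuming the thread-per-node mapping described earlier and a simple ceiling division, is sketched below. The text only states that the block count depends on the code and the kernel, so the formula and the names `blocks_for` and `launch_config` are assumptions.

```cpp
#include <cassert>

constexpr int kThreadsPerBlock = 128;  // from the text

// Blocks needed so that every node gets its own thread (ceiling division).
int blocks_for(int nodes) {
    assert(nodes > 0);
    return (nodes + kThreadsPerBlock - 1) / kThreadsPerBlock;
}

struct LaunchConfig {
    int kernel1_blocks;  // horizontal processing: one thread per check node
    int kernel2_blocks;  // vertical processing: one thread per bit node
};

// Grid sizes for an (n, k) code under the assumptions stated above.
LaunchConfig launch_config(int n, int k) {
    assert(n > k && k > 0);
    return LaunchConfig{blocks_for(n - k), blocks_for(n)};
}
```

Under these assumptions, the B4 code (64 800, 32 400) would need 254 blocks for kernel 1 and 507 blocks for kernel 2, with the threads beyond the last node left idle.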
The throughputs reported in Table 1 show that more than 90 Mbit/s is achieved for all rates under test when running 20 iterations for DVB-S2 codes B2 (64 800, 21 600) and B4 (64 800, 32 400), which have rates 1/3 and 1/2, respectively. They compare well with state-of-the-art dedicated VLSI approaches [3, 4]. Fig. 1 shows that the use of 8-bit arithmetic leads to superior BER performance compared with the 6-bit arithmetic typical of VLSI solutions [3, 4]. Comparing the fixed-point data representations Q5.1 (6-bit, with 5 bits for the integer part and 1 bit for the fractional part) and Q6.2 (8-bit), the figure clearly shows that coding gains exist in the waterfall region (data transmission is simulated over an additive white Gaussian noise (AWGN) channel using QPSK modulation). Furthermore, the GPU accelerates processing, allowing error floors to be detected for normal frame length DVB-S2 codes within hours instead of weeks of computation [9] (we tested $10^8$ codewords for each $E_s/N_0$ value shown in Fig. 1, for a maximum number of iterations I = 50). Fig. 1 shows that the use of a Q6.2 (8-bit) representation produces error floors more than three orders of magnitude lower than with Q4.2 (6-bit) arithmetic for both B2 and B4 DVB-S2 codes.
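The difference between the fixed-point formats can be illustrated with a saturating quantiser. This is a minimal sketch, assuming two's complement codes with the sign bit counted in the integer part and round-to-nearest; the text does not specify the rounding or saturation rules actually used, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Quantise an LLR to a signed fixed-point code with `total_bits` bits,
// `frac_bits` of which are fractional, saturating at the representable range.
// Q6.2 corresponds to (8, 2) and Q5.1 to (6, 1).
std::int8_t quantise(float llr, int total_bits, int frac_bits) {
    assert(total_bits >= 2 && total_bits <= 8);
    assert(frac_bits >= 0 && frac_bits < total_bits);
    const int max_code = (1 << (total_bits - 1)) - 1;
    const int min_code = -(1 << (total_bits - 1));
    const float scaled = std::round(llr * static_cast<float>(1 << frac_bits));
    int code;
    if (scaled >= static_cast<float>(max_code)) {
        code = max_code;  // saturate large positive LLRs
    } else if (scaled <= static_cast<float>(min_code)) {
        code = min_code;  // saturate large negative LLRs
    } else {
        code = static_cast<int>(scaled);
    }
    return static_cast<std::int8_t>(code);
}

// Value represented by a fixed-point code.
float dequantise(std::int8_t code, int frac_bits) {
    return static_cast<float>(code) / static_cast<float>(1 << frac_bits);
}
```

Under these assumptions Q6.2 covers −32 to 31.75 in steps of 0.25, whereas Q5.1 covers −16 to 15.5 in steps of 0.5, so the 8-bit format offers both a wider range and a finer resolution.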
Conclusions:
We propose a novel GPU-based LDPC decoding solution for the DVB-S2 standard, adopted in satellite communications. This programmable parallel decoder exploits massive data-parallelism, which is well suited to the GPU, and makes a reduced number of accesses to the device's slow global memory thanks to the on-the-fly computation of the edge addresses. This accelerates processing and yields high throughputs. The approach is scalable to future generations of GPUs, which are expected to have more cores; these should improve performance either in throughput, by increasing the level of multi-codeword parallelism, or in BER, by using higher precision to represent data. It compares fairly well with non-scalable and non-reusable VLSI DVB-S2 decoders and delivers throughputs above the required 90 Mbit/s.
