Abstract: This article provides a scalable parallel approach to iterative LDPC decoding, presented in a tutorial-based style. The proposed approach can be implemented in applications supporting massively parallel computing. The proposed mapping is suitable for decoding any irregular LDPC code without a limit on the maximum node degree. The implementation of the LDPC decoder with the use of the OpenCL and CUDA frameworks is discussed, and the performance evaluation is given at the end of this contribution.
parallel approach for additional optimizations in order to make further accelerations. The parallel decoding approach is suitable for fast decoders implemented in GPUs. It is also highly applicable for accelerating bit error rate simulations used in designing new LDPC codes.
Several contributions published so far deal with fitting the LDPC decoder on the GPU platform [6] [7] [8] [9] [10] [11] [12] [13] [14]. However, the decoders are mostly limited to applications with some families of LDPC codes or bounded by the maximum node degree in the associated Tanner graph [5]. The proposed parallel approach is suitable for decoding any irregular LDPC code without a bound on the maximum node degree.

September 7, 2016

II. LDPC
A. Introduction
LDPC codes [2] represent the coding technique with the best known error-correcting capabilities. LDPC codes have surpassed other codes, including turbo codes and Reed-Solomon codes, in error-correcting performance, and they are becoming increasingly difficult to ignore in novel signal processing systems. Although the number of applications with LDPC codes has grown significantly with the increasing speed of computing resources, decoding is still a computationally intensive task, which limits the deployability of non-approximated decoding algorithms for medium and long block length codes. However, the decoding can be accelerated significantly with the use of parallel multicore computing architectures. Our work related to LDPC codes includes [16] [17] [18] [19].
B. Basic definitions
In this section, we provide basic mathematical definitions related to channel coding and their relation to LDPC codes and the presented parallel decoder. Let $C = (n, k)$ be a linear block code, where the number of code bits is denoted as $n$ and the number of information bits is denoted as $k$. The information vector of $k$ bits is denoted as $m$ and the $(k \times n)$ generator matrix is denoted as $G$. Encoding produces the codeword $c = mG$. The parity-check matrix associated with the code $C$ is denoted as $H$. Any vector $v$ is a codeword if and only if $vH^T = 0$. The product $vH^T$ is called the syndrome $s$. If the parity-check matrix $H$ of the code $C$ is sparse, the code $C$ is said to be a Low-Density Parity-Check (LDPC) code.
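The codeword test $vH^T = 0$ can be sketched in a few lines of C. This is a minimal serial illustration, not the implementation evaluated later; the function name `ldpc_syndrome` and the dense row-major storage of $H$ are assumptions made only for this example.

```c
#include <stddef.h>

/* Computes the syndrome s = v * H^T over GF(2).
 * H is stored densely, row-major, with `rows` = n - k rows and `cols` = n columns
 * (a layout assumed for this sketch only).
 * Returns 1 if v is a codeword (all-zero syndrome), 0 otherwise. */
int ldpc_syndrome(const unsigned char *H, size_t rows, size_t cols,
                  const unsigned char *v, unsigned char *s)
{
    int is_codeword = 1;
    for (size_t i = 0; i < rows; ++i) {
        unsigned char acc = 0;
        for (size_t j = 0; j < cols; ++j)
            acc ^= (unsigned char)(H[i * cols + j] & v[j]); /* addition modulo 2 */
        s[i] = acc;
        if (acc)
            is_codeword = 0;
    }
    return is_codeword;
}
```

For a sparse $H$, a practical decoder would iterate over the nonzero entries only; the dense loop is kept here for readability.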
The Tanner graph is a bipartite graph of sets of variable nodes and check nodes defined by the parity-check matrix $H$. If the element $H_{i,j} = 1$ ($i$ corresponds to the row, while $j$ corresponds to the column of the matrix $H$), an edge occurs between the check node $c_i$ and the variable node $v_j$. The Tanner graph is used for LDPC decoding, which is briefly described in the following section.
The vector of check nodes connected with the $j$-th variable node is denoted as $M_j$ and the vector of variable nodes connected with the $i$-th check node is denoted as $N_i$. Then
(1)
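As an illustration of the vectors $M_j$ and $N_i$, the following sketch collects the neighbours of one node directly from the parity-check matrix. The function names and the dense row-major storage of $H$ are hypothetical and serve only this example.

```c
#include <stddef.h>

/* Collects N_i: indices j of the variable nodes connected with check node i,
 * i.e. the positions of the ones in row i of H. Returns |N_i|. */
size_t check_neighbours(const unsigned char *H, size_t cols, size_t i, size_t *out)
{
    size_t count = 0;
    for (size_t j = 0; j < cols; ++j)
        if (H[i * cols + j])
            out[count++] = j;
    return count;
}

/* Collects M_j: indices i of the check nodes connected with variable node j,
 * i.e. the positions of the ones in column j of H. Returns |M_j|. */
size_t variable_neighbours(const unsigned char *H, size_t rows, size_t cols,
                           size_t j, size_t *out)
{
    size_t count = 0;
    for (size_t i = 0; i < rows; ++i)
        if (H[i * cols + j])
            out[count++] = i;
    return count;
}
```

The returned counts are the node degrees; for an irregular code they differ from node to node, which is why no fixed maximum degree is assumed in the interface.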
C. Decoding
Decoding is a method for correcting errors in a corrupted codeword, and the device performing decoding is called the decoder. The output of the decoder is usually called the estimation $\hat{c}$, as illustrated in Fig. 2. Two main principles can be considered for LDPC decoding:
• Hard-decision, e.g. Bit-Flipping
• Soft-decision, working with probabilities during the decoding process

Soft-decision decoding, including the Sum-Product (SP) algorithm [4] and its derivations, is assumed for the implementation of the LDPC decoder and the related benchmarks in this article. LDPC decoding is an iterative process of passing values as messages through the edges of the Tanner graph. An estimation of the codeword is calculated after each iteration, and if the estimation is a codeword of the LDPC code, decoding is stopped. If a codeword is not found after a certain number of iterations (typically 5-100), decoding is terminated as unsuccessful.
All messages passed in the Tanner graph represent probabilities, which are used for calculating the estimation after every iteration. Because the convergence of the algorithm is affected significantly by the parameters of the Tanner graph (especially the number of short cycles), performing a relatively high number of iterations brings little benefit. Therefore, the maximum number of iterations is limited.
Messages outgoing from one set of nodes are calculated from the incoming values from the opposite set of nodes. Edges serve as interfaces for passing messages between the set of variable nodes and the set of check nodes, and each message outgoing from a node is passed through one edge. Each message outgoing from a node in the Tanner graph depends on the incoming messages from the connected nodes, excluding the value received from the destination node, as Algorithm 1 describes in more detail. The process is illustrated in the following example. As can be seen in Fig. 4, the variable node $v_0$ is connected with the check nodes $c_0$, $c_2$, $c_3$, $c_5$. The value being passed from $v_0$ to $c_0$ depends on the incoming values from the nodes $c_2$, $c_3$ and $c_5$. In the second half of an iteration, the value being passed from $c_3$ to $v_0$ depends on the incoming values from $v_4$, $v_{11}$, $v_{12}$. The data flow is shown in Fig. 5.

In recent years, there has been an increasing interest in implementing LDPC decoders on a wide variety of hardware architectures, including GPUs. Several contributions deal with fitting the decoder on parallel architectures with the use of the OpenCL or CUDA frameworks and discuss the benchmarks [6] [7] [8] [9] [10] [11] [12] [13] [14]. However, the work reviewed so far deals mostly with some families of LDPC codes, and the application of parallel decoders is limited. In this article, we propose a general parallel approach for the decoder of any irregular LDPC code. The proposed approach divides calculations into a scalable number of threads. Each thread performs the calculation of the value outgoing through the edge which is associated with the thread itself (edge-level parallelization). The approach was chosen because of its suitability for any irregular LDPC matrices, scalability for any code block lengths, and deployability on many hardware architectures. It is also convenient for derived algorithms for LDPC decoding, such as Min-Sum (MS) or adaptive MS. In the previous work dealing with parallel LDPC decoding, the calculations are mostly divided on the level of rows and columns of the parity-check matrices.
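The exclusion rule from the decoding example (a message outgoing through an edge ignores the value received through that same edge) is what every edge-level thread evaluates. The following sketch shows the rule for a check node with the Min-Sum simplification mentioned above. It assumes messages in the log-likelihood ratio (LLR) domain, which differs from the probability-domain SP algorithm used for the benchmarks; the function name is hypothetical.

```c
#include <stddef.h>

/* Min-Sum check-node update (LLR domain, an assumption of this sketch).
 * in[0..deg-1] are the messages incoming from the connected variable nodes;
 * out[k] is the message sent to the k-th neighbour and is computed from all
 * incoming messages except in[k]. */
void check_node_min_sum(const float *in, float *out, size_t deg)
{
    for (size_t k = 0; k < deg; ++k) {
        float min_abs = -1.0f;  /* -1 marks "no value seen yet" */
        int negative = 0;       /* parity of the number of negative inputs */
        for (size_t q = 0; q < deg; ++q) {
            if (q == k)
                continue;       /* exclude the destination node */
            float mag = in[q] < 0.0f ? -in[q] : in[q];
            if (in[q] < 0.0f)
                negative = !negative;
            if (min_abs < 0.0f || mag < min_abs)
                min_abs = mag;
        }
        if (min_abs < 0.0f)
            min_abs = 0.0f;     /* degree-1 node: no extrinsic information */
        out[k] = negative ? -min_abs : min_abs;
    }
}
```

For a check node of degree 4, such as $c_3$ in the example, the message to $v_0$ is built from the three values received from $v_4$, $v_{11}$ and $v_{12}$ only. In the edge-level mapping, the outer loop over `k` disappears because each value of `k` belongs to a separate thread.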
B. Our approach
In this section, we describe the edge-level parallelization approach used for the LDPC decoder. The principle is also shown in an illustrative example supported by consistent figures associated with the same LDPC (14,7) code. Considering the code given by the parity-check matrix (Fig. 6) and the associated Tanner graph (Fig. 4), we define the following arrays used as address iterators for the parallel message passing algorithm (described in Algorithms 3 and 4):
• a sorted tuple of variable nodes $v = (v_j)$ starting with the lowest index and the associated tuple of check nodes $c = (c_i)$, such that for $i, j : H_{i,j} = 1$ the pair $(c_i, v_j)$ unequivocally defines an edge in the Tanner graph; $n$ is the number of variable nodes and $n - k$ is the number of check nodes
• a tuple of edges $e = (e_k) = (0, 1, 2, ..., |c| - 1)$
• a tuple $t = (t_k)$ of the numbers of edges connected with the variable node $v_k$
• a tuple $s = (s_k)$ of positions used in order to calculate the value passed through the edge $e_k$
• a tuple $u = (u_k)$ associated with the connected node $v_k$; $u_k = k - |(v_q) : q < k,\ v_q = v_k|$

The arrays defined above are used as address iterators for calculations of messages outgoing from variable nodes to check nodes (the first half of the iteration). We also show the arrays in the illustrative example. For the (14,7) code given by the parity-check matrix in Fig. 6, the arrays derived by the principle described above are shown in Table I.

[Algorithm listing, loop headers only: for all $j \in [0, |M|)$; for all $i \in M_j \setminus i$; for all $j \in [0, |M|)$; for all $i \in [0, |N|)$]

The first half of the iteration of the LDPC decoding process calculates the values passed from the variable nodes to the check nodes. With the use of the array iterators, we can perform such calculations without any complicated operations with array indices. The pseudo code is shown in Algorithm 3. The local index of the thread (according to the OpenCL terminology) is denoted as lid and the number of synchronized threads working in parallel is denoted as lgsize. Because all threads performing the calculations have to be synchronized after they finish writing to the memory, and the number of synchronizable threads is strictly limited (e.g. 1024), the calculations are divided into several steps (pages) whenever the number of edges is greater than lgsize. An illustrative example for 12 synchronizable threads is shown in Fig. 6.
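The paging described above can be emulated serially. The sketch below is not Algorithm 3 itself; it is a hedged illustration that assumes LLR-domain messages (so that a variable node simply sums its inputs) and interprets the iterators according to the field names in Listing 1: `t` as the number of edges of the node, `s` as the absolute start index of the node's edges, and `u` as the relative index of the edge within the node. The function name and the `channel` array are hypothetical.

```c
/* First half of an iteration (variable nodes to check nodes), serial emulation
 * of the paged edge-level mapping. Edge k is handled by the "thread" with local
 * index lid in the page starting at p, so that k = p + lid.
 * in[]  holds the messages received from the check nodes, stored in edge order,
 * out[] receives the messages sent to the check nodes. */
void variable_half_iteration(const int *v, const int *t, const int *s, const int *u,
                             const float *channel, const float *in, float *out,
                             int total_edges, int lgsize)
{
    for (int p = 0; p < total_edges; p += lgsize) {   /* one page per pass */
        for (int lid = 0; lid < lgsize; ++lid) {      /* emulated threads */
            int k = p + lid;
            if (k >= total_edges)
                break;                                /* last page may be partial */
            float value = channel[v[k]];
            for (int r = 0; r < t[k]; ++r)
                if (r != u[k])                        /* skip the edge itself */
                    value += in[s[k] + r];
            out[k] = value;
        }
        /* On a GPU, a work-group barrier would be placed here. */
    }
}
```

With `lgsize` equal to 12, as in the illustrated example, a graph with more than 12 edges is processed in several pages, and the inner loop over `lid` corresponds to threads that run concurrently.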
The arrays for the second half of the iteration can be derived similarly. Keeping the unique edge identifier $(c_i, v_j)$ and the associated edge index $e_k$, the arrays c, v, e are sorted starting with the lowest check node index, and the other arrays are derived considering the messages outgoing from the check nodes. Such arrays are then denoted as (e, c, v, t, s, u) in the following descriptions. As a demonstrative example, the arrays for the second half of the iteration are shown in Table II.

[Algorithm listing, loop headers only: for $it \in (0, ITERATIONS)$; for all $y_j \in y$]
The algorithm performing the second half of the iteration processes the arrays described above. Its pseudo code is shown in Algorithm 4. After finishing the second half of the iteration, we can continue with the next iteration. The whole decoding principle remains the same as described in Algorithm 1.
For example, the address iterators for the LDPC (14,7) code are listed in Table I and Table II. Both tables are particularly useful for understanding the principle and checking the correctness of the implementation. For consistency and for tutorial purposes, both tables are associated with the LDPC (14,7) code given by the parity-check matrix from Fig. 6.
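To check an implementation against the tables, the address iterators for the first half of the iteration can be generated directly from the parity-check matrix. The sketch below is one possible derivation under stated assumptions: $H$ is stored densely in row-major order, and the meaning of `t`, `s` and `u` follows the field names of Listing 1 (number of edges of the node, absolute start index, relative index within the node). The function name is hypothetical.

```c
/* Derives the address iterators sorted by variable node index.
 * Walking H column by column yields the edges ordered by v; all output arrays
 * must have room for one entry per nonzero element of H.
 * Returns the total number of edges. */
int build_variable_iterators(const unsigned char *H, int rows, int cols,
                             int *e, int *v, int *c, int *t, int *s, int *u)
{
    int k = 0;
    for (int j = 0; j < cols; ++j) {
        int start = k;                    /* first edge of variable node j */
        for (int i = 0; i < rows; ++i) {
            if (H[i * cols + j]) {
                e[k] = k;                 /* edge index */
                v[k] = j;                 /* variable node of the edge */
                c[k] = i;                 /* check node of the edge */
                s[k] = start;             /* absolute start index of the node */
                u[k] = k - start;         /* relative index within the node */
                ++k;
            }
        }
        for (int q = start; q < k; ++q)
            t[q] = k - start;             /* number of edges of the node */
    }
    return k;
}
```

The arrays for the second half of the iteration would follow from the same routine applied row by row, so that the edges are ordered by check node index.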
IV. BIT ERROR RATE SIMULATOR
Apart from the implementation of the LDPC decoder, we also considered a Bit Error Rate simulator based on the Additive White Gaussian Noise (AWGN) channel. The simulator is a highly useful tool for benchmarks and code evaluation purposes. The code evaluation requires up to billions of operations to be performed, and it is the most time-consuming part of algorithms designing new and innovative LDPC codes. Therefore, its parallelization leads to a significant acceleration of the code design process, and more precise simulations become possible. Fast simulations are also needed for evaluating candidate solutions when applying algorithms for performing LDPC code optimizations.

[Algorithm listing, surviving fragment: for (p = 0; p < totaledges; p += lgsize) do ... q_index = value ... end for]
For BER calculation, codewords are modulated and transmitted through the AWGN channel given by the parameter $\sigma$ (often recalculated to the $E_b/N_0$ ratio), as can be seen in Fig. 3. The decoder then receives noisy vectors, which are decoded, and Hamming distances between the decoded vectors and the original codewords are calculated. Due to the linearity of LDPC codes, it is enough to transmit only zero codewords and count the number of 1's at the output of the decoder (Fig. 3).
where $k$ is the length of the information message, $n$ is the length of the codeword, and $E_b$ is the energy per bit.
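The two steps of the BER calculation can be sketched as follows. The relation $\sigma^2 = 1 / (2 \frac{k}{n} \frac{E_b}{N_0})$ used here is the form commonly taken for unit-energy BPSK symbols and is an assumption of this sketch, as are the function names; $E_b/N_0$ is passed as a linear ratio, not in dB.

```c
#include <stddef.h>

/* Noise variance sigma^2 of the AWGN channel for unit-energy BPSK symbols,
 * assuming sigma^2 = 1 / (2 * (k / n) * Eb/N0), with Eb/N0 as a linear ratio. */
double awgn_variance(int k, int n, double ebn0_linear)
{
    double rate = (double)k / (double)n;
    return 1.0 / (2.0 * rate * ebn0_linear);
}

/* Counts the 1's in a decoded vector. When the all-zero codeword was
 * transmitted, this count equals the number of bit errors. */
size_t count_bit_errors(const unsigned char *decoded, size_t n)
{
    size_t errors = 0;
    for (size_t j = 0; j < n; ++j)
        errors += decoded[j] ? 1u : 0u;
    return errors;
}
```

The bit error rate then follows as the accumulated error count divided by the total number of transmitted bits.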
V. OPENCL AND CUDA IMPLEMENTATION
In current signal and data processing systems, there is a clear trend to use parallel architectures to increase the processing speed, which plays a crucial role in real-time applications and determines the deployability of computationally complex algorithms in hardware. Hardware devices supporting massively parallel processing algorithms generally include Graphics Processing Units (GPUs), which are considered in this tutorial article.
In this work, the CUDA and OpenCL frameworks are used for GPU computations. OpenCL is an open standard for parallel programming on different computational devices, such as CPUs, GPUs, or FPGAs. It provides a programming language based on the C99 standard. Unlike OpenCL, CUDA is available only for NVIDIA devices starting from the G80 series (so-called CUDA-enabled GPUs). CUDA makes it possible to write programs based on the C/C++ and Fortran languages. The OpenCL and CUDA programming models are illustrated in Fig. 7.

[Table II, recoverable rows]
c: 0 0 0 0 0 1 1 1 1 1 2 2 2 3 3 3 3 4 4 4 4 4 5 5 5 5 5 6 6 6 6
t: 5 5 5 5 5 5 5 5 5 5 3 3 3 4 4 4 4 5 5 5 5 5 5 5 5 5 5 4 4 4
A. Necessary considerations
When implementing an algorithm on a GPU platform using the OpenCL or CUDA framework, two main issues have to be considered:
• size of the local memory (OpenCL) or shared memory (CUDA),
• size of the working group (OpenCL) or block size (CUDA).

GPU devices offer several types of allocable memory, which differ in their speed and size. The memory type used to store variables is specified in the source code by a prefix according to the OpenCL or CUDA syntax rules. Generally, the largest allocable size, typically in gigabytes for current devices, is located in the global memory. However, the global memory is also the slowest one. A higher speed is provided by the local memory, but its size is typically only in kilobytes. Exceeding the limited size of the local memory usually leads to incorrect results without any warnings in the compilation report.
Another crucial issue related to an algorithm implementation in GPU devices is the working group size. Although the GPU can run thousands of threads in parallel, these threads are not synchronized with each other in terms of writing to the memory. The threads are split into working groups, and they can be synchronized only with other threads in the same working group. The size of the working groups is strictly limited (typically 1024).
B. Coding
Both frameworks process two types of code:
• host (runtime), running serially on the CPU
• kernel (device), running in parallel on the GPU

Listing 1: Types

    typedef struct Edge {
        int index;                  // e array
        int vn;                     // v array
        int cn;                     // c array
        int edgesConnectedToNode;   // t array
        int absoluteStartIndex;     // s array
        int relativeIndexFromNode;  // u array
    } Edge;

The kernel is executed by the host. In CUDA, the kernel execution is more straightforward compared to OpenCL, as can be seen in the consistent examples in Listing 2. Both codes execute the kernel berSimulate in 100 working groups (blocks) with 512 threads per working group. After the kernel finishes, the results are copied into the berOut array and processed by the host. Because the kernel function has to be considered as a function running in parallel, each thread has its own unique identifier: the combination of the global ID and the local ID in OpenCL, or the combination of the thread ID and the block ID in CUDA, which can be converted into each other. The parallel implementation of the function decodeAWGN, defined in Algorithm 2, is shown in Listing 3. The types used for the code definition and for passing messages are given in Listing 1.
Some of the main differences between the OpenCL and CUDA syntax rules are shown in Table III, which can be used when moving the source code from one framework to the other.
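The conversion between the thread identifiers can be written out explicitly. The following host-side sketch assumes a one-dimensional launch with a zero global offset; the function names are hypothetical, while the identifiers named in the comments are the standard built-ins of the two frameworks.

```c
#include <stddef.h>

/* Global thread index from the group (block) index and the local index.
 * OpenCL: get_global_id(0) == get_group_id(0) * get_local_size(0) + get_local_id(0)
 * CUDA:   blockIdx.x * blockDim.x + threadIdx.x */
size_t global_id(size_t group_id, size_t local_size, size_t local_id)
{
    return group_id * local_size + local_id;
}

/* Inverse mapping: working group (block) that owns a global index. */
size_t group_of(size_t gid, size_t local_size)
{
    return gid / local_size;
}

/* Inverse mapping: local index inside the working group (block). */
size_t local_of(size_t gid, size_t local_size)
{
    return gid % local_size;
}
```

For the launch configuration discussed above (100 working groups with 512 threads each), the global indices run from 0 to 51199.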
VI. RESULTS

A. Experimental evaluation
The developed algorithms for LDPC decoding were run on the NVIDIA Tesla K40 (Atlas) and Intel Xeon E5-2695v2 platforms [21], [22]. The NVIDIA device contains 2880 CUDA cores and runs at 745 MHz. Its peak performance for double-precision floating-point computations is 1.43 Tflops. The clock frequency of the Intel Xeon CPU is 2.4 GHz. All measurements include the time required for random number generation, realised by the Xorshift+ algorithm and the Box-Muller transform.
Benchmarks were performed through the calculation of the Bit Error Rate at $E_b/N_0 = 2$ dB for a code given by the NASA CCSDS standard [20] and its protographically expanded derivations [3]. Based on the results obtained from the NVIDIA Tesla K80, we obtained slightly better performance with the use of the CUDA framework, as shown in Fig. 9. Compared to the CPU implementation run on the Intel Xeon, the acceleration grows with the size of the working groups and the number of decoders running in parallel, up to the limit of the device, as illustrated in Fig. 8. GPUs become very effective for longer block length codes, as also shown in Table IV. The speed ratio between the GPU and the CPU (C++ compiler with O3 optimization) was 25 for a code of 262144 bits.
B. Further acceleration
To keep the generality, no simplifications in the decoding algorithm were applied, and the experimental evaluation was performed with the use of the global memory. For further acceleration, several measures can be considered, e.g. usage of the local memory, variables with a lower precision, look-up tables, or modifications of the algorithm for certain families of LDPC codes. For example, by moving a part of the variables into the local (shared) memory, the decoder works approximately 40% faster in our experience. However, it is then not possible to decode longer codewords because of the size limitations (240 kB of the local memory per working group). Another possibility for greater optimization could be the parallelization of less computationally intensive functions. After applying parallel algorithms for passing messages and for calculating the syndrome and the estimation, the most time-consuming serial operation is checking whether the syndrome is all-zero (approximately 34% of the decoding function in our experience).
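One conceivable way to parallelize the all-zero check is a pairwise OR-reduction. The sketch below is a serial emulation under assumptions of our own; it was not part of the evaluated decoder, and the function name is hypothetical. The iterations of the inner loop are independent of each other, so on a GPU they could be assigned to separate threads with a synchronization between the passes.

```c
/* Checks whether all syndrome bits are zero by a pairwise OR-reduction.
 * In each pass, flags[i] absorbs flags[i + stride]; after the last pass,
 * flags[0] holds the OR of all entries. The array is modified in place.
 * Returns 1 for an all-zero syndrome, 0 otherwise. */
int syndrome_is_zero(unsigned char *flags, int count)
{
    for (int stride = 1; stride < count; stride *= 2)
        for (int i = 0; i + stride < count; i += 2 * stride)
            flags[i] |= flags[i + stride];   /* independent per i within a pass */
    return count == 0 || flags[0] == 0;
}
```

The number of passes grows with the logarithm of the syndrome length, compared with a linear scan in the serial version.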
VII. CONCLUSIONS
The development of multicore architectures supporting parallel data processing has led to a paradigm shift. Signal processing algorithms have to be considered as working asynchronously in separate threads, while the threads are synchronized only when writing to registers (memory). Therefore, there is a need for novel approaches and frameworks allowing algorithm deployability in modern signal and data processing systems.
In this article, we touched on recent frameworks for Graphics Processing Units and probably the best known error correction coding technique, LDPC. In a tutorial-based style, we have provided a general parallel approach for decoding any irregular LDPC code and presented a demonstrative application in consistent examples associated with the LDPC (14,7) code. The presented approaches are based on the edge-level parallelization, where each thread performs the calculation of a particular value passed through the associated edge (one thread for one edge). The potential acceleration achieved by the parallelization of the calculations grows with the number of edges in the graph. This can lead to interesting applications for long block length codes providing excellent error-correcting capabilities.
Hardware devices supporting massively parallel processing algorithms generally include GPUs, which were considered in this tutorial article. Differences and similarities, in terms of the terminology and source codes, between the OpenCL and CUDA frameworks used for GPU programming were shown in the paper. Benchmarks for the OpenCL and CUDA approaches were performed on the NASA CCSDS (256,128) standard and its protographically expanded derivations [3], and the results were compared against the C++ implementation.
The results show an acceleration of up to 22 times compared with C++ with O3 optimization, and up to 58 times compared with C++ compiled without optimization.
Because the OpenCL framework has found utilization in programming FPGA-based systems [15], the proposed algorithms and their potential modifications can be easily used in a wide variety of fast signal processing systems.

[Figure panel (a): Acceleration dependence on the block (working group) for 100 decoders running in parallel.]

We would like to thank Vladimir Vasilievich Korenkov and Ivan Stekl for arranging the cooperation, Jan Busa for technical support and especially for professional LaTeX consultations, and Gheorge Adam for his professional comments and interest in this work.
