ABSTRACT Low-density parity-check (LDPC) block codes are popular forward error correction schemes due to their capacity-approaching characteristics. However, the realization of LDPC decoders that meet both low-latency and high-throughput requirements is not a trivial challenge. Usually, this has been solved with ASIC and FPGA technology, which enables meeting the decoder design constraints. But the rise of parallel architectures, such as graphics processing units, and the scaling of CPU streaming extensions have shown that multicore and many-core technology can provide a flexible alternative to the development of dedicated LDPC decoders for the compute-intensive prototyping phase of the design of new codes. In this light, this paper surveys the most relevant publications of the past decade on programmable LDPC decoders. It looks at the advantages and disadvantages of parallel architectures and data-parallel programming models, and assesses how the design space exploration is pursued with regard to key characteristics of the underlying code and decoding algorithm features. The paper concludes with a set of open problems in the field of communication systems on parallel programmable and reconfigurable architectures.
I. INTRODUCTION
Known for more than fifty years [1], low-density parity-check (LDPC) codes have only recently been exploited in real-life error-correcting code (ECC) scenarios. They were left unused for more than thirty years, mainly due to the lack of computational power required to practically demonstrate their capacity-approaching characteristics [2]. Since the 1990s, they have become one of the most widely adopted coding schemes, along with Turbo codes, for operation in forward error correcting (FEC) systems in multiple communication standards: IEEE 802.3an, 802.11n, 802.15, 802.16, ETSI 2nd Gen. DVB, 3GPP LTE (4G) and ITU-T G.9960 and G.709, to name a few.
The vast majority of LDPC decoders found in the literature target dedicated very large scale integration (VLSI) decoders [3], either using application-specific integrated circuit (ASIC) technology or designing for reconfigurable computing devices (field-programmable gate arrays (FPGAs)) [4], [5]. The development of dedicated decoders targeting an ASIC at a certain technology node or an FPGA device incurs high non-recurring engineering (NRE) costs and is usually an error-prone and protracted endeavor performed at the register-transfer level (RTL). In turn, this entails that simulation and prototyping of the implemented solution be executed and verified over an alternative computing platform. Often, the simulation platform is not an obvious choice, since it performs only a simulation role in the whole development project. However, with the computing paradigm shift towards multi-core technology and, generally speaking, with the necessity to exploit parallelism to make the most of the computational resources at hand, most processors now include wide registers for vectorized operations through streaming instructions, multithreading capabilities, and, in the graphics processing unit (GPU) case, a massively multithreaded environment with the vast majority of logic devoted to arithmetic units, capable of executing in the TFLOPs range [6].
In the meantime, the ability to use GPUs as general-purpose processors has led to an active field of research [7]-[35], [35]-[75], designated general-purpose GPU (GPGPU) computing [76]-[78], which has seen several works concerning the use of GPU devices programmed to operate as LDPC decoders. While we can only assume that central processing units (CPUs) are the first choice for code study and bit error rate (BER) performance analysis through Monte Carlo simulation, the majority of LDPC decoders found in the literature are not based on CPU architectures. With the odd exception, CPUs are mostly the underlying platform conveying proof of LDPC theory and concepts, with the decoder implementation not being the study focus, notwithstanding the fact that the advent of cross-platform parallel programming models and the growth in the register width of single instruction multiple data (SIMD) vector units have given CPUs a high level of computational power [75].
Several references found in the literature deal with LDPC decoders based on streaming architectures, namely GPUs and other accelerators such as ARM mobile systems on a chip (SoCs) [79], the Intel single-chip cloud computer (SCC), the Cell B.E. [35] or experimental stream processors-the latter examples are less prevalent than GPU-based LDPC decoders. One of the reasons for the popularity of GPUs is the compromise between the effort put into the development of a parallel algorithm that conveniently exploits the GPU single instruction, multiple thread (SIMT) architecture and the corresponding attained performance. First, the development time between testing and prototyping and the final optimized decoder ready for deployment is not as long, and does not incur NRE costs as high as hardware-dedicated solutions do. Secondly, this flexibility usually means diminished returns in the performance of the decoding solution, due to the fixed instruction set, memory hierarchy and underlying architecture that are not custom-tailored to the developed decoder. Furthermore, the introduction of data-parallel programming models such as the Compute Unified Device Architecture (CUDA) [6] and the open computing language (OpenCL) [80], together with the unification of the graphics pipeline into a single programmable processor, meant that high-productivity scientific languages such as C/C++, Fortran, Python and Ruby could be utilized, instead of protracted graphics languages. The drawback is that only with sufficient knowledge of the underlying GPU architecture will the developed LDPC decoders perform with high decoding throughputs.
But while GPUs, due to their raw computational power (peaking in the TFLOPs range) and performance-per-watt ratios orders of magnitude above CPUs, began to dip into the high-performance computing (HPC) market [81], and to some extent the datacenter market too, field-programmable gate arrays (FPGAs) were evolving too. Starting from a primitive ''glue logic'' status [82], they have given rise to the very active field of reconfigurable computing [83]-[86]. They bring more throughput per silicon area [85] and consume less energy in the process than conventional processors [84]. Furthermore, due to their chip area, FPGAs usually keep pace with Moore's law technology nodes, while dedicated solutions, more often than not, fail to migrate to faster, more efficient and smaller nodes in the same time-frame. Thus, the utilization of FPGAs as custom accelerators, usually designated reconfigurable computing, addresses some of the issues surrounding the development of ASIC technology, but also opens a whole new set of challenges posed by the availability of gate-level optimizations. Of particular interest to the work developed in this field is the use of high-level synthesis (HLS) models, which extend C/C++ and other programming languages [87]-[89], in a somewhat similar way to what CUDA and OpenCL did for GPUs, thereby avoiding the specific knowledge required to develop VHDL and Verilog RTL descriptions to generate circuits.
II. THE PROBLEM
ECC in FEC systems was deprived of the capacity-approaching capabilities of LDPC codes for over thirty years before sufficient computational power was available to enable their utilization. However, typical design approaches involve using VLSI technology, such as ASIC and FPGA development, which, thus far, requires RTL-based development to reach high decoding throughputs and low latencies.
A. MOTIVATION
The continuing trend in semiconductor manufacturing, dictated by Moore's Law, has placed tremendous challenges on processor manufacturers, as the end of Dennard scaling [90] meant performance could no longer be guaranteed to scale by increasing the clock frequency of operation. Hence, as technology progressed into the multicore and manycore realm, with massively multithreaded processing enabling computational powers in the range of GFLOPs and TFLOPs, LDPC decoders have been shown to achieve dozens to hundreds of Mbit/s. However, due to the fixed instruction set of the underlying CPU and GPU architectures, the considerations taken by LDPC designers can be substantially different in nature from those considered when developing custom-made hardware for LDPC decoders. In particular, designers are faced with the hard challenge of mapping the LDPC decoding algorithms onto a limited set of arithmetic operations offered by a fixed instruction set architecture (ISA), also constrained by the nature of the memory hierarchy of the processor and the computing system as a whole. Moreover, native support for certain arithmetic types may be missing, and careful consideration is due to how data is moved in the system as a whole, since distributed and shared memory addressing regions may exist, lying on- or off-chip. Only the correct manipulation and definition of suitable arithmetic-to-memory ratios and patterns allows for the maximization of the delivered bandwidth-in other words, the resources offered by the programmable processors should be explored wisely so as to fully exploit their computational capabilities.
B. NOTATION
The notation utilized throughout the paper is listed below, and is employed to systematize the description of the decoding solutions presented in the surveyed works in the following Sections of this paper. A basic understanding of the decoding problem can be gained from the illustration of a binary LDPC code in matrix and graph representation in Fig. 1.
FIGURE 1. Parity-check matrix and Tanner graph example for the binary case. The parity-check matrix H defines the LDPC code and is the adjacency matrix of the associated bipartite graph, designated the Tanner graph. The majority of the decoding algorithms are message-passing: an a-priori likelihood is given to each VN; VNs send q_nm messages to their adjacent CNs; CNs update r_mn messages and send them back to their adjacent VNs; in turn, VNs produce new estimates of the q_nm messages and also a new a-posteriori estimate Q_n of the VN state.
• H is the parity-check matrix of the LDPC code;
• check node (CN) corresponds to a type of node in the Tanner graph and to a parity-check equation;
• variable node (VN) corresponds to the other type of node in the Tanner graph and to a symbol in the codeword;
• an edge exists in the Tanner graph connecting CN_m to VN_n whenever there is a non-null element h_mn in the parity-check matrix H;
• p_n is the a-priori likelihood estimate for VN_n;
• q_nm is the message sent from VN_n to CN_m;
• r_mn is the message sent from CN_m to VN_n;
• Q_n is the a-posteriori likelihood estimate for VN_n.
C. EVALUATION
The characteristics and complexity of the LDPC decoding algorithms are surveyed for codes defined over binary and non-binary fields. The O(·) numerical complexity presented does not capture the full effort of transposing an LDPC decoding algorithm into an efficient LDPC decoder operating on either programmable or reconfigurable architectures. We can enumerate the most important challenges to overcome in the design of efficient and high-performance LDPC decoding solutions as follows:
i) the need to transform the node connections imposed by the Tanner graph into a suitable memory layout and efficient addressing problem, considering that, in most cases, irregular access patterns will be imposed by the Tanner graph structure;
ii) the weighing of suitable ratios of arithmetic-to-memory instructions which maximize the efficiency of the LDPC decoding for the underlying computer architectures;
iii) the influence of the selected scheduling variant of the algorithm, two-phased message passing (TPMP) or Turbo decoding message passing (TDMP), on the aforementioned items;
iv) the fact that parallelism has to be explored at different levels of complexity depending on the architecture being programmed;
v) the complex exploitation of the memory hierarchy of multi-core systems, or the complex problem of defining a suitable memory hierarchy for reconfigurable LDPC decoders.
D. FIGURES OF MERIT
The different solutions must be evaluated according to certain key figures of merit. Due to the limited information that can be collected from surveying the literature, namely the lack of profiling results, we limit our assessment mostly to decoding throughput and latency at a fixed number of decoding iterations-usually the canonical 10 iterations when applying the TPMP schedule and 5 when applying the TDMP schedule. Another metric, which aims at normalizing the results across LDPC decoders programmed on different GPU architectures, is the throughput of decoding normalized per core per frequency (TDNC).
III. DECODING ON PROGRAMMABLE ARCHITECTURES
In this Section, the survey is focused on programmable architectures for LDPC decoder solutions. The most prevalent LDPC decoders found are GPU-based, then CPU-based, and finally those based on streaming accelerators. We focus on key characteristics of the LDPC decoders, namely 1) task- and data-parallelism, 2) data representation, 3) LDPC code type and dimensions, 4) the indexing method of the messages circulating in the Tanner graph; and on figures of merit of the LDPC decoder performance with respect to computational power, i.e., 5) decoding latency and 6) decoding throughput. Finally, the programmable platform is also identified. The following subsections are devoted to discussing the methodologies developed for defining efficient programmable LDPC decoders in light of their characteristics and design space constraints. In addition, the reader is referred to Table 1 in Appendix A, which contains a tabulation of the surveyed LDPC decoders and can be found as a supplementary file to this manuscript.
A. PROGRAMMABLE LDPC DECODER MAPPING
Due to the programmable nature of the underlying computer architectures, a prototype isomorphic architecture [91] that is a direct mapping of the Tanner graph to CN units, VN units and the Tanner graph interconnection network is not truly possible [92]. Instead, a programming description is required, considering that the fixed underlying architecture and instruction set will demand the sharing of computational resources.
Only through clever usage of the instruction set functionality and exploitation of the different regions within the memory hierarchy can LDPC decoders be optimized for performance and efficiency of computation [34]. While the term occupancy usually refers to GPU computing, the concept also extends to CPU architectures. Considering the limited but fixed number of logic, arithmetic and memory resources available in programmable architectures, only if occupancy of the resources is high will the performance of LDPC decoders peak. However, occupancy is a double-edged sword, in the sense that it does not take into account over-utilization of resources leading to bottlenecks or deadlocks that nevertheless keep occupancy high. Moreover, it is difficult to assess how efficiently a hardware resource is being utilized based solely on the information authors have made available in their publications on LDPC decoders for programmable architectures. Thus, we are left with figures of merit contextualized for the LDPC decoding problem: throughput and latency.
B. TANNER GRAPH INDEXING SCHEMES
The LDPC decoder structure plays an important role in how efficiently the Tanner graph connections between nodes can be mapped. While regular codes might seem at first simpler than irregular ones, in practice they are not. Since the majority of LDPC codes also keep simplicity of encoding and simplicity of Tanner graph indexing in mind, standardized codes make use of systematic coding schemes, exploring repeat-accumulate (RA) parity sub-matrices, mainly for encoding purposes, and structured irregularity in the remaining portion of the parity-check matrix concerning the connectivity of information nodes (INs) [93]. As discussed in the literature [1], each edge defines two messages, traversing in opposite directions, as seen in Fig. 1. When mapping the Tanner graph connections to a programmable processor, we must take into account that the messages have to be laid out in memory and, thus, their location index must be known with respect to both the CN and the VN that define the edge. The memory index is usually not the same for messages traversing the same edge in different directions.
Since the LDPC code parity-check matrix is also the adjacency matrix of the Tanner graph, any given LDPC code can be stored through matrix storage methods. Due to its sparsity, sparse matrix storage methods have lower memory footprints and can be employed for any type of code, and in fact they are. Regular codes, typically those constructed through progressive edge growth (PEG) methods and made available in the Encyclopedia of Sparse Graph Codes [94] (Mackay codes), are most of the time stored in compressed row storage (CRS)- and compressed column storage (CCS)-like formats. This method of indexing the Tanner graph is shown in Fig. 2. It is readily seen that out of the four memory accesses to q_nm and r_mn messages, two can effectively be contiguous, a feature most of the surveyed decoders implement; this exposes high bandwidth due to coalesced data accesses on GPUs and high cache hit rates on CPUs. The scenario depicted in Fig. 2 defines reading accesses to be contiguous and writing accesses as non-contiguous [32]. Thus, CNs require indexes relative to their adjacent VNs and read a memory location offset from a lookup-table (LUT) (CN_idx), and VNs, likewise, read a memory offset from the other LUT (VN_idx). Essentially, VN_idx corresponds to the message positions in memory for a column-wise reshaped parity-check matrix, and CN_idx to its row-wise reshaping. As a consequence, with one entry per edge in each of the two LUTs, the number of elements required to store the connections of a binary LDPC code Tanner graph is

2 · Σ_{n=0}^{N−1} d_v(n) = 2 · Σ_{m=0}^{M−1} d_c(m),    (1)

where d_v(n) and d_c(m) denote the degrees of VN_n and CN_m, respectively.
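As a concrete illustration, the following C sketch shows one possible arrangement consistent with this description (all identifiers, such as cn_ptr and cn_idx, are hypothetical and not taken from any surveyed decoder): q_nm messages are stored in CN-major edge order, r_mn messages in VN-major edge order, and the two per-edge LUTs translate between the orderings, so that each processing phase reads contiguously and scatters its writes.

typedef struct {
    int M, N, E;        /* number of CNs, VNs and Tanner graph edges          */
    int *cn_ptr;        /* cn_ptr[m] .. cn_ptr[m+1]-1 : edges of CN_m         */
    int *cn_idx;        /* CN-major edge e -> its position in VN-major order  */
    int *vn_ptr;        /* vn_ptr[n] .. vn_ptr[n+1]-1 : edges of VN_n         */
    int *vn_idx;        /* VN-major edge e -> its position in CN-major order  */
    float *q;           /* q_nm messages, laid out in CN-major edge order     */
    float *r;           /* r_mn messages, laid out in VN-major edge order     */
} tanner_graph;

/* CN update skeleton: contiguous reads of q, scattered writes of r */
void cn_update(const tanner_graph *g, int m)
{
    for (int e = g->cn_ptr[m]; e < g->cn_ptr[m + 1]; e++) {
        float r_new = 0.0f;          /* ...min-sum/SPA update from q[cn_ptr[m]..] */
        g->r[g->cn_idx[e]] = r_new;  /* non-contiguous write through the LUT      */
    }
}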
This indexing method can also be employed for standardized codes [51], though the memory footprint of the Tanner graph mapping can then be reduced by orders of magnitude [40], [95]. In the particular case of LDPC irregular repeat-accumulate (LDPC-IRA) codes, such as those employed in the 2nd generation DVB (DVB 2) standards, shown in Fig. 3, the LDPC code Tanner graph is systematically constructed in a way that permits indexing using a barrel-shifter approach [95]. For instance, the number of elements required to index the DVB 2 Tanner graph codes is reduced by a factor related to r_f, a code construction regularity parameter [95], [96]. This allows an on-the-fly computation of the memory locations to which each message reads and writes. In particular, q_nm messages read from and write to a contiguously increasing base offset, but writes are shifted, while r_mn messages are read from an indexed base offset and written shifted while maintaining the offset. Hence, an address LUT and a shift LUT, much smaller than the CCS- and CRS-like sparse matrix storage of (1), can be employed. Likewise, quasi-cyclic LDPC (QC-LDPC) codes can also be indexed by small-sized lookup tables (LUTs), as shown in Fig. 4, by performing an on-the-fly computation of the memory location to which a message is sent after computation. This method in particular [40] defines contiguous reading of messages and indexed writing to memory, with a footprint that is independent of the LDPC code dimensions. In a way, the code protograph is sparsely indexed using the first method, though extra computation of indexes is required. The memory footprint of this method depends only on M_f and N_f, the dimensions of the protograph matrix that generates the QC-LDPC code. The great advantage of this indexing scheme is that, regardless of the final codeword length, which depends on the expansion factor z_f, the indexing LUT sizes remain the same [40].
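To make the contrast with the generic scheme concrete, the C sketch below (hypothetical names; the sign convention for the shift table varies between standards) shows the kind of on-the-fly index computation that such structured schemes rely on: only the M_f x N_f protograph shift table is stored, so the footprint does not grow with the expansion factor z_f.

/* For block-row i, block-column j and local row k of the expanded matrix,
 * a circulant shift s connects CN (i*z_f + k) to VN (j*z_f + (k + s) mod z_f);
 * a negative entry marks an all-zero sub-block of the protograph.            */
static inline int qc_adjacent_vn(const int *shift, int N_f, int z_f,
                                 int i, int j, int k)
{
    int s = shift[i * N_f + j];
    if (s < 0)
        return -1;                       /* no edge in this sub-block          */
    return j * z_f + (k + s) % z_f;      /* adjacent VN computed on the fly    */
}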
The memory footprint is not the most pressing issue in programmable architectures, as the memory addressing space of modern CPU and GPU systems can be far larger than the Tanner graph indexing footprints discussed above. However, the indexing method becomes a source of memory contention if, for every computed message, a memory index location requires loading, reducing the overall bandwidth to memory: it contributes to poorer cache hit ratios on CPUs [75] and adds further pressure to GPU memory engines [30]. The best performing LDPC decoders are those exploring structured sparse storage that exploits the Tanner graph structure, as opposed to a generic sparse matrix storage method. For instance, LDPC decoders implementing the former methodology achieve much higher throughputs than those exploring the latter [30].
Several parameters other than the Tanner graph indexing influence the decoding throughput attained; however, a clear trend is observed in this case. Of all the thread-parallelism techniques employed, only the thread-per-codeword (TpC) strategy overcomes the overhead of a generic Tanner graph indexing scheme [50], since the imposed overhead is negligible with regard to the amount of data moved in the GPU architecture. The remaining decoders see throughputs of a few to a dozen Mbit/s. Only structured indexing schemes consistently see decoding throughputs in the hundreds of Mbit/s.
C. PROGRAMMING MODELS
A prevalence of C/C++-based language families can be observed throughout the surveyed LDPC decoders, a testament to their popularity for HPC challenges such as the development of LDPC decoders churning out very high decoding throughputs.
In particular, parallelism in CPU-based decoders has been exploited using the open multi-processing (OpenMP) programming model in a number of decoders [15], [16], [47], [54], [55]. Therein, the strategy to extract parallelism is based on the automatic parallelization of the computation loops wherein the CN and the VN processing are defined. This is achieved with appropriate OpenMP directives. A minority of LDPC decoders replace the functionality provided by the OpenMP loop parallelization directives with explicit thread-partitioning using POSIX threads [37]. Despite the usage of a lower-level application programming interface (API) to perform multithreading, the opportunity to improve on the decoding throughput is not fully captured by this approach. However, POSIX threads are the basis of the Cell B.E. software development kit (SDK), which is a C-extended programming model [8], [35]. Other authors choose to put the OpenCL cross-platform capabilities to use on CPU technology [33], [34], [36], making minor adjustments to the GPU-optimized decoder for a computing substrate with lower parallel capabilities. Similar to the aforementioned OpenMP and POSIX threads strategies, the delivered throughput is limited by a number of factors, the most important of which is the OpenCL compiler's ability to pack data elements within the wide registers over which single instruction, multiple data (SIMD) computation is performed [34].
In this light, explicit utilization of SIMD instructions is pursued on both x86 and ARM processors. The former have since evolved from their assembly-accessible MMX registers [25], [32], [61] to the richer extensions provided by streaming SIMD extensions (SSE) and advanced vector extensions (AVX) at 128-, 256- and 512-bit register widths, though LDPC decoders in the literature exploit only up to 256-bit AVX registers [75]. As the instruction-set functionality of the SIMD extensions became richer, so did the abstraction concerning its use. While MMX required explicit assembly coding, SSE and AVX support high-level C/C++ intrinsics. On ARM processors, the NEON extensions provide 64- and 128-bit registers, exploited through a set of C/C++ intrinsics [37]. SIMD computation also faces another challenge. The indexing of the Tanner graph connections renders data element packing and unpacking unavoidable [75]. This means that, with MMX registers, a non-negligible overhead of data management housekeeping tasks offsets the performance gains enjoyed from SIMD computation. On the other hand, only the increased functionality and width of the SIMD units can guarantee higher performances, on a par with GPUs, due to the increased levels of data-parallelism purported by the data packing for SIMD execution [75].
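A minimal AVX2 fragment of the kind alluded to above is sketched below (illustrative only, not taken from any surveyed decoder): with 8-bit fixed-point LLRs, each 256-bit register carries 32 packed messages, and the saturating addition used in VN-side accumulation is applied to all of them with a single intrinsic.

#include <immintrin.h>
#include <stdint.h>

/* q and r hold 32*n_regs packed 8-bit LLRs; saturation prevents overflow */
void vn_accumulate_avx2(int8_t *q, const int8_t *r, int n_regs)
{
    for (int i = 0; i < n_regs; i++) {
        __m256i a = _mm256_loadu_si256((const __m256i *)(q + 32 * i));
        __m256i b = _mm256_loadu_si256((const __m256i *)(r + 32 * i));
        _mm256_storeu_si256((__m256i *)(q + 32 * i), _mm256_adds_epi8(a, b));
    }
}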
On the GPU LDPC decoder side, apart from the seminal approaches utilizing graphics programming languages [23]-[25] and early streaming computing models [97], the majority of LDPC decoders make use of CUDA and a minority of OpenCL. While in certain aspects both programming models are alike, with many language traits and constructs referring to the same hardware features of the GPU processor, only with different designations and handles [80], [98], CUDA's popularity overwhelms that of OpenCL. On Nvidia platforms the reason is clear, as OpenCL is mapped on top of CUDA, with the performance of the former only reaching that of the latter at best [99]. Also, despite the similar ways both models explore arithmetic instructions, data types and memory addressing spaces, OpenCL is a cross-platform programming model, imposing management instructions and verbose coding requirements that are superfluous when cross-platform support is not truly intended. Nevertheless, the surveyed decoders show no clear performance gap between CUDA- and OpenCL-based LDPC decoders. CUDA-based LDPC decoders range from the inception of CUDA in 2007 to the latest stable version. However, the majority of the LDPC decoders based on it explore only a limited subset of its features. In particular, more advanced features, such as kernel self-calling and reconfiguring the GPU execution grid without host CPU intervention (dynamic parallelism) [98] and advanced memory synchronization and fencing operations available to the whole thread execution grid, are not explored in the surveyed works, though they address limitations identified in some of them [17]-[19], [59], [74]. In this regard, the OpenCL specification suffers from a slower evolution pace, with features relevant to GPU programming having to be ratified for inclusion in a cross-platform model that also supports CPUs and FPGAs. Thus, OpenCL-based decoders are somewhat more limited with respect to features that CUDA addresses [33], [37], [49], [64]. Notwithstanding, as such features are not explored, there is no evident loss in choosing one instead of the other.
With more or less control of the underlying hardware, all of the utilized programming models allow the development of parallel LDPC decoding algorithms. Parallelism is naturally exposed in these algorithms [100], [101], but other parallel features pertain only to a certain algorithm expression, in turn tightly coupled to an underlying architecture. This concerns parallelism at the task level and at the data level, which are discussed next.
D. THREAD-LEVEL PARALLELISM
Several parallelism strategies have been proposed on multicore CPU and many-core GPU architectures that divide the LDPC decoding tasks between concurrent execution threads. These strategies entail constraints on other important design features of the LDPC decoders, especially with regard to data-parallelism and also to the decoding schedule of nodes, which are discussed in the following Subsections.
1) TAXONOMY
Due to the rich set of parameters explored by researchers in the development of programmable LDPC decoders, an appropriate taxonomy for LDPC decoding on programmable architectures is introduced herein. It first regards parallelism, whereby the node functionality is translated into task- and data-parallelism at certain granularity levels among physical or logical cores or between execution threads. To keep the taxonomy simple, we define it in terms of thread-parallelism, which is also in accord with the majority of the surveyed programmable LDPC decoders that rely on thread-based programs, a convenient feature as it does not require one set of terms for multi-core architectures and another for many-core ones.
2) PpE DECODING
Pixel-per-edge (PpE) is the oldest parallelism strategy, dating from the time when GPGPU was at its inception, with graphics languages being the only available way to perform non-graphical computation on GPU processors. Back then, data elements had to be mapped onto ''graphical data elements'' in order to be processed by the pixel shaders, hence the designation of this thread-parallelism strategy.
While results seemed highly promising at a time when the only prospective way to achieve high throughput would be to develop a dedicated hardware accelerator, they were still lacking the performance seen for the LDPC decoders under CUDA and OpenCL once the graphics pipeline had been unified onto a single processor [6]. Approaches such as those proposed by Falcao et al. allowed decoding throughputs of dozens to hundreds of Mbit/s [23]-[25], combining a graphics language approach with Caravela streams [97], as illustrated in Fig. 5.
3) TpE DECODING
Thread-per-edge (TpE) corresponds to the strategy with the finest granularity, which imposes a granularity tradeoff on the LDPC decoder designer. On the one hand, if consecutive threads deal with the update of messages belonging to consecutive nodes in the Tanner graph, then there is a high exposure of both spatial and temporal data locality, as Fig. 6 illustrates. On the other hand, most GPU-based LDPC decoders that exploit this granularity have been developed for pre-Fermi or Fermi architectures that do not make a caching mechanism available to threads for computation [6]-it exists only for off-chip memory transactions. In contrast, locality is automatically explored by the x86 cache system in heavily SIMD-based LDPC decoders [75], leading to over 90% cache hit rates that maximize the LDPC decoder bandwidth. Alas, this is not the case in the many-core-based decoders utilizing this strategy. Under the methodology proposed by Chang et al., throughputs peak below ∼2 Mbit/s for moderate length codes (816, 4000 and 8000 bits) [17]-[19]. This approach requires the spawning of the highest number of threads compared to the remaining approaches. For a regular LDPC code, the VN processing sees N × d_v threads spawned and, likewise, the CN processing spawns M × d_c threads, making this the thread-parallelism granularity level that puts the most pressure on the architecture through the generation of thousands of threads, even though computation within each thread is not as heavy as in the coarser thread-parallelism strategies, which are discussed next.
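A hedged CUDA sketch of the TpE granularity follows (identifiers are illustrative): one thread is launched per Tanner graph edge, and the VN-phase update uses the standard LLR identity q_nm = Q_n − r_mn, assuming the a-posteriori values Q_n were produced by a previous kernel.

__global__ void vn_update_tpe(const float *Q, const float *r, float *q,
                              const int *edge_vn, int E)
{
    int e = blockIdx.x * blockDim.x + threadIdx.x;   /* one thread per edge   */
    if (e >= E)
        return;
    int n = edge_vn[e];                              /* VN adjacent to edge e */
    q[e] = Q[n] - r[e];                              /* extrinsic q_nm (LLR)  */
}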
4) TpN DECODING
Thread-per-node (TpN) is the most prevalent strategy, with a thread being spawned per node in the LDPC code Tanner graph. While this strategy quickly depletes the number of concurrent threads that can be active simultaneously on many-core GPUs for moderate to long block lengths, this pressure is not as high as in the TpE strategy for short to moderate codes. One of the reasons behind this being by far the most popular strategy is its elegance: each node in the Tanner graph can be assigned to an execution thread in the absence of a de facto isomorphic transformation [3] into a functional unit (FU).
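A possible CUDA rendering of the TpN granularity is sketched below (array names follow the hypothetical layout sketched in Subsection III-B): each thread owns one CN and performs an uncorrected min-sum update of the r_mn messages of all its adjacent edges, reading q contiguously and scattering its writes.

__global__ void cn_update_tpn(const int *cn_ptr, const int *cn_idx,
                              const float *q, float *r, int M)
{
    int m = blockIdx.x * blockDim.x + threadIdx.x;   /* one thread per CN      */
    if (m >= M)
        return;
    for (int e = cn_ptr[m]; e < cn_ptr[m + 1]; e++) {
        float sgn = 1.0f, mag = 1e30f;
        for (int t = cn_ptr[m]; t < cn_ptr[m + 1]; t++) {
            if (t == e)
                continue;                            /* extrinsic: skip edge e */
            float v = q[t];                          /* contiguous read of q   */
            sgn *= (v < 0.0f) ? -1.0f : 1.0f;
            mag  = fminf(mag, fabsf(v));
        }
        r[cn_idx[e]] = sgn * mag;                    /* scattered MSA write    */
    }
}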
The first LDPC decoder observed to utilize this thread-granularity (see Fig. 7) was proposed by Falcao et al. for short to moderate length Mackay codes [94] (rate 1/2, 1024, 4000 and 4896 bits), reaching up to ∼1.63 Mbit/s. The decoder forcibly defined a 2-D texture mapping of the log-likelihood ratio (LLR) messages that contributed to the poor performance yielded [31]. Under a more general-purpose computing memory mapping, the authors were able to elevate the decoding throughputs to ∼87 Mbit/s for the normal frame DVB-S2 codes [30], and to ∼40 Mbit/s for rate 1/2 Mackay codes (1024 to 20000 bits) [32]. The difference in the attained performance shows how data-parallelism design decisions and Tanner graph indexing methods are pivotal to elevating the decoding throughputs attained by GPU-based LDPC decoders. This is also verified by the work of Grönroos et al. [65]. However, other TpN approaches achieve only limited decoding throughputs [41], [54], [55], without a clear pattern as to what led to such low levels of throughput performance, especially in light of the highly efficient Tanner graph indexing methods employed for cyclic and quasi-cyclic codes [39], [40].
While LDPC codes working as the ECC basis of FEC in communication systems imply an application-agnostic operation, i.e., all bits being equally protected, they also perform well in more applied settings such as video coding [59]. In their application to video coding benchmarks (Hall Monitor, Foreman, Coastguard and Soccer), excellent frame reconstruction is obtained, though for offline coding, i.e., not applied to real-time decoding [52], [53], [59]. Another application of LDPC decoding worth mentioning is its use for quantum-key distribution (QKD) reconciliation [102]. Mink et al. were able to reach high decoding throughputs, though for this purpose a lower number of iterations is required [74].
Finally, non-binary LDPC decoders that implement this strategy have also been proposed [13], [64]. Beerman et al. define a 3-dimensional grid of computation to properly exploit the parallelism exposed by the binary extension field (GF(2^m)) dimension and by the scheduling of operations within the processing of the Tanner graph nodes. Under the proposed strategy, equivalent decoding throughput (∼2 Mbit/s) is obtained for GF(2^m) spanning the binary field (GF(2)) to GF(2^8).
5) BpC DECODING
Block-per-codeword (BpC) is a strategy available to GPU execution as it is based on the concept of a thread block [6], a CUDA concept that finds its equivalent in the OpenCL workgroup [80]. The rationale stems from the execution grid being composed of threads and divided into blocks that permit a suitable exploitation of the memory hierarchy in many-core GPU architectures, as depicted in Fig. 8. Threads within the same block are allowed synchronization and fencing mechanisms for tighter cooperative computation, as they can access the shared memory space, an addressing block physically unavailable for inter-block communication. A strategy based on this granularity falls short of utilizing all the GPU stream multiprocessors (SMs), as the number of blocks required is independent of the code length. Hence, this strategy is usually accompanied by data-parallelism levels that spawn more blocks to decode more codewords in parallel 1 throughout the remaining SMs of the architecture [7], [42], [69]. Notwithstanding, there is a constraint on how many threads can compose a block, influenced not only by the capabilities of the underlying hardware-high-end GPUs can execute blocks with a higher thread count than low-end ones-but also by how the developed decoding algorithm consumes registers and shared memory [6]. For QC-LDPC codes, it might make sense to define z_f threads per block as it matches the protograph expansion factor. However, this might be too low a value, leading to poor resource utilization, or too high, preventing this strategy from being accommodated on lower-end GPUs [69]. Equivalent tradeoffs can be seen for LDPC irregular repeat-accumulate (LDPC-IRA), random LDPC codes and non-binary regular ones [103]. This thread-granularity level is also the de facto strategy able to efficiently implement the TDMP and TPMP decoding schedules, as opposed to the remaining strategies, which are usually limited to the TPMP, as explained next.
First proposed by Abburi [7], this strategy has been applied to worldwide interoperability for microwave access (WiMAX) codes (IEEE 802.16e), and also to Wi-Fi codes (IEEE 802.11n) [69], achieving moderate to high decoding throughputs (24.50∼160 Mbit/s) at relatively low latencies, under 12 ms. This method has also been explored for the quick evaluation of a QC-LDPC construction method [42].
1 Note that we adopt the designation codeword in a broader sense. A codeword can be the set of codewords that fit into the same data type; thus, it can be a single element or several elements packed into a vector data type [34].
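A hedged CUDA sketch of the BpC granularity follows (names and the use of dynamic shared memory are illustrative): blockIdx.x selects the codeword, the threads of the block stage that codeword's q messages in shared memory, and then cooperatively update its CNs.

__global__ void decode_bpc(const float *q_all, float *r_all,
                           const int *cn_ptr, const int *cn_idx,
                           int M, int E)
{
    extern __shared__ float q_sh[];                  /* E floats per block       */
    const float *q = q_all + (size_t)blockIdx.x * E; /* this block's codeword    */
    float       *r = r_all + (size_t)blockIdx.x * E;
    for (int e = threadIdx.x; e < E; e += blockDim.x)
        q_sh[e] = q[e];                              /* stage q in shared memory */
    __syncthreads();                                 /* intra-block fence        */
    for (int m = threadIdx.x; m < M; m += blockDim.x) {
        /* CN update of row m, e.g. the min-sum body sketched earlier,
         * now reading from q_sh[] instead of global memory              */
    }
}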
6) BpN DECODING
Block-per-node (BpN) is, in a sense, a particular case of the BpC approach. Non-binary LDPC codes define another dimension exposing parallelism, the GF(2^m) dimension. This strategy is adopted for the particular case of non-binary LDPC decoding, but instead of defining a block of threads depending on the LDPC Tanner graph regular features, it depends on the binary extension field (GF(2^m)) dimension [57]. This approach is tested for a GF(2^8) code, yielding throughputs in the Mbit/s range (<6) under different levels of data-parallelism for a pure sequential decoding schedule. Wang et al. also use this approach for a non-binary LDPC decoder with OpenCL-based execution on both CPU and GPU architectures. Though their approach has considerably low latencies (<5 ms), it achieves only 1.26 Mbit/s at best for GF(2^m) dimensions of 2^2, 2^3 and 2^4.
7) TpC DECODING
TpC is in all respects similar to the former approach, except that in the LDPC program description it is a single thread, rather than a block of threads, that executes a codeword. For instance, core-per-codeword (CpC) on an x86 CPU implies that one thread, corresponding to a logical core, will execute the LDPC decoder on one of the physical cores. However, the program can be explicitly defined in terms of an executing thread which decodes a codeword (see Fig. 9); hence the distinction between the two approaches, which would otherwise be blurred. Also, multi-threading is typically explored to elevate the parallelism levels and improve the decoding throughput performance by exploiting a higher occupancy of the hardware. Namely, Abburi et al. propose a Cell-based LDPC decoder where a thread per synergistic processing element (SPE) is assigned the execution of the longest length rate 1/2 802.16e codewords, peaking at 270 Mbit/s [8]. Furthermore, other authors propose this approach for the many-core GPU architecture [46] and compare the performance of their approach to their previously presented LDPC decoder [70], reducing by one order of magnitude the time required to perform BER Monte Carlo simulation for a Mackay code [94]. Also, Lin et al. were able to achieve decoding throughputs in the range 212∼550 Mbit/s, though at high latencies (53∼421 ms), using short to long length codes (204∼20000 bits) [50]. Finally, Wang et al., in order to assess the performance of a QC-LDPC convolutional code construction, developed a TpC decoder peaking at 15 Mbit/s (using 768 to 1536 bit, rate 1/2 and 2/3 codes) [71].
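A minimal CUDA sketch of the TpC granularity is given below (decode_one() and the argument names are hypothetical placeholders): each thread runs the full decoding loop on its own codeword, so no inter-thread synchronization is needed and occupancy is raised purely through data-parallelism.

/* placeholder for a complete per-codeword decoder; a hard decision stands in */
__device__ void decode_one(const float *llr, unsigned char *bits,
                           int N, int iters)
{
    (void)iters;
    for (int n = 0; n < N; n++)
        bits[n] = llr[n] < 0.0f;
}

__global__ void decode_tpc(const float *llr_in, unsigned char *bits_out,
                           int N, int iters, int n_codewords)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;   /* one thread per codeword */
    if (c >= n_codewords)
        return;
    decode_one(llr_in + (size_t)c * N, bits_out + (size_t)c * N, N, iters);
}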
8) CpC DECODING
CpC is a thread-parallelism granularity that has no equivalent in GPU computing; it is only available to CPU architectures. Herein, we consider the logical core definition of ''hyperthreaded'' processors, which treats a core as equivalent to an execution thread. Thus, a logical core will be responsible for executing a codeword or batch of codewords. However, the scenario under consideration is somewhat broader here, as several approaches can be taken to implement this task-parallelism strategy.
For the upcoming exascale computing platforms [104], Diavastos et al. studied the scalability of LDPC decoders under 1) a distributed memory cooperative execution model, 2) a shared memory cooperative model, and 3) a shared memory non-cooperative model [22]. Regarding scalability, 1) saw a reduction of the throughput to less than 1% of the single core baseline reference when all cores were committed to the computation, mainly due to high communication overheads caused by the absence of caching mechanisms, 2) reported a sub-linear scalability of up to 11× when 48 cores were committed to the computation, while 3) presented a 41× speedup when compared to the baseline [22]. Other approaches with regard to distributed computing involve the use of streaming accelerators applied to Mackay codes [94] and achieve moderate throughputs (<79 Mbit/s) at low latencies between 0.69∼1.53 ms [28], [32]. 802.16e standard codes can be decoded at throughputs of 72∼80 Mbit/s under this methodology on the Cell B.E. processor [35]. Furthermore, this approach is also explored on a mobile SoC platform (a quad-core ARM system on a chip (SoC), the Exynos 4412), whereupon short and normal frame DVB-T2 codes have been tested, reaching high latencies in the 500∼2592 ms range at throughputs of ∼3 Mbit/s [36].
While some of the aforementioned LDPC decoders do not make use of vector processing [22], SIMD processing is a widely employed technique to achieve high performance and efficiency on CPU architectures. Namely, the LDPC decoders based on the Cell B.E. make extensive use of SIMD instructions by increasing the data parallelism within each core [28], [32]. The work proposed by Falcao et al. for their x86-based decoder is a particular type of CpC strategy: the OpenMP model was used to parallelize the computation inside the CN and the VN processing, which were encapsulated by loops. Therefore, their true approach was processor-per-codeword, which in a sense is a special case of the CpC strategy [32]. Also, Intel CPU-based LDPC decoders are able to explore SSE and AVX SIMD extensions to improve the data throughput while keeping latency at low levels. Le Gal et al. proposed a CpC approach where multiple codewords are decoded simultaneously by all logical cores in the processor, using the 128-bit SSE and the 256-bit AVX registers of the CPU to set the decoding throughput within 250∼560 Mbit/s, for CMMB, 802.11n, 802.16e and digital video broadcasting - satellite 2 (DVB-S2) codes [75]. Furthermore, their approach is able to minimize the decoding latency, which is kept under 10 ms in the majority of the cases, with 802.11n and 802.16e codes in the hundreds of µs range.
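For completeness, a hedged C/OpenMP sketch of the CpC granularity is given below (decode_codeword() is a placeholder): each logical core takes whole codewords, and the per-codeword work can internally be vectorized with SSE/AVX intrinsics as in the decoders discussed above.

#include <stddef.h>

/* placeholder for a complete single-codeword decoder */
void decode_codeword(const float *llr, unsigned char *bits, int N, int iters)
{
    (void)iters;
    for (int n = 0; n < N; n++)
        bits[n] = llr[n] < 0.0f;      /* hard decision stand-in */
}

void decode_cpc(const float *llr, unsigned char *bits,
                int N, int iters, int n_codewords)
{
    /* one OpenMP thread (logical core) per codeword, or per batch of codewords */
    #pragma omp parallel for schedule(static)
    for (int c = 0; c < n_codewords; c++)
        decode_codeword(llr + (size_t)c * N, bits + (size_t)c * N, N, iters);
}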
E. DATA-LEVEL PARALLELISM
Data-parallelism expresses how the same operations can be applied to different data elements at the same time. Generally speaking, we can define it, with regard to LDPC decoding, as the number of codewords that are decoded simultaneously. The motivation for exploring data-parallelism is clear, since short to moderate length codes cannot utilize all the resources that multi-core and many-core processor architectures possess. Thus, to avoid wasting logic resources that would otherwise sit idle, several codewords are loaded and decoded simultaneously to elevate the decoding throughput. However, herein lies a tradeoff. Not only does the decoding throughput see diminishing returns as the hardware occupancy is elevated, but decoding latency, a figure of merit of the computational performance that should be kept low, also increases. Therefore, only a handful of data-parallelism strategies elevate the decoding throughput to the desired high levels without sacrificing latency beyond admissible levels for real-time operation [64], [75].
1) TAXONOMY
Similar to the thread-parallelism case, a proper taxonomy is due for data-parallelism within LDPC decoders on programmable hardware. Moreover, regarding data-parallelism, there are no methods that concern solely one type of processor but not the other: each of the presented methods is exploited on CPUs and GPUs alike. Furthermore, because data representation is tightly coupled to the design decisions regarding data-parallelism, it is discussed herein as well.
2) CODEWORD BATCH
CPU and GPU memory engines are optimized for certain alignments. Memory transaction bandwidth can be increased by moving increasingly larger data types until the memory engine saturates at the maximum permitted alignment. Instead of storing an LLR using a float data type, a float4 type can be utilized to store 4 LLRs contiguously. In fact, data-parallelism strategies go even further and apply bit-slicing operations, usually not natively supported by C/C++ languages unless through SIMD intrinsics, to pack more data elements into a vector type. Considering the negligible BER performance loss when data is no longer represented using floating-point but instead with low bitwidth fixed-point types (typically between 4 and 8 bits [105]), an int data type can be employed to store 4 data elements and an int4 128-bit vector type [99] can store 16 codewords [34]. Single codeword and codeword batch storage are depicted in Fig. 10.
When data-parallelism levels cannot be raised by increasing the number of elements in a data type, reducing each element's bitwidth would hurt the BER performance, and going as far as defining a custom data structure will fall short of improving the bandwidth once the maximum permitted alignment is surpassed [34]. At this point, increasing the number of codeword batches must see the replication of the memory layout pursued for a single data type, in one of two possible approaches.
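The packing described above can be sketched as follows (the type name is hypothetical): with 8-bit fixed-point LLRs, one 128-bit element carries the same message position of 16 different codewords, so each arithmetic or memory operation on it serves the whole batch.

#include <stdint.h>

typedef union {
    int8_t  llr[16];    /* one 8-bit LLR per codeword of the batch            */
    int32_t word[4];    /* the same 128 bits viewed as an int4-like vector    */
} llr_batch16;

/* e.g., an array llr_batch16 q[E] then updates 16 codewords per access */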
3) PADDED CODEWORD BATCHES
Under this approach, the basic level of data-parallelism within a batch is replicated as a whole with a memory stride equal to the number of data type elements in memory. Thus, codeword batches become padded in memory, as shown in Fig. 11, with the Tanner graph indexing method applied D times to D different base offsets in memory. This method does not impose any relevant constraint on the BpC, TpC or CpC approaches. In fact, throughput performance can only reach acceptable levels once data-parallelism levels are raised [36], [46]. In the TpN approach, this means the spawning of more threads to deal with the extra batches of codewords [33], [47]-[49], [63]. One of the disadvantages of this method is its inability to address unbalanced memory transactions that may occur when thread-parallelism does not account for threads having different memory access patterns. More data is accessed for higher CN and VN degrees. Thus, for irregular codes, there can be certain threads computing and moving a higher load of data than others. This problem is addressed by Kang et al. by evening out the accesses among threads in the same thread block [43].
4) INTERLEAVED CODEWORD BATCHES
This method, illustrated in Fig. 12, defines D codeword batches as the data-parallelism level and interleaves data elements from different batches at basic data type granularity in memory. The advantage drawn here is that accesses are evened out into large blocks of data moved to and from contiguous locations, regardless of the Tanner graph indexing method. This method is highly suited to SIMD computation on x86 CPUs, with cache hit ratios for short to moderate length codes reaching 90% [75]. Furthermore, this method is also highly efficient for GPU architectures, enabling real-time decoding throughputs and simultaneously real-time decoding latencies [64].
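The addressing difference between the two batch replication schemes can be summarized in two lines (E denotes the number of messages per batch, D the number of batches, e the message index and d the batch index; names are illustrative):

/* padded batches: whole layout replicated D times with stride E             */
static inline size_t addr_padded(size_t e, size_t d, size_t E)      { return d * E + e; }
/* interleaved batches: the D copies of each message are adjacent in memory  */
static inline size_t addr_interleaved(size_t e, size_t d, size_t D) { return e * D + d; }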
F. DECODING ALGORITHMS
Among the countless decoding algorithms, the most popular ones are by far the soft-decoding message-passing ones. In particular, LLR-based algorithms are adopted in the majority of decoders on programmable hardware. Not only does the support for floating-point data types make them an appealing choice due to ease of implementation, but so does their BER performance, even when a step further is taken and fixed-point data types are emulated-as these are not natively supported on CPU and GPU hardware. An exception to this observation is the decoder based on the implementation-efficient reliability ratio-based weighted bit-flipping algorithm (IRRWBF) [61], whose required fixed-point arithmetic and hard-decision operations are emulated using integer types. The most favored choice is for decoding algorithms whose required data types the underlying instruction set architecture (ISA) can provide without emulation or with little overhead. In fact, the most popular choice is the min-sum algorithm (MSA), in its uncorrected version or in its offset-corrected (OMSA) and normalized-corrected (NMSA) variations. Sum-product algorithm (SPA) decoders in the probability domain, in the pmf Fourier domain (FFT-SPA), in the LLR domain (LSPA) and in the signed-log Fourier domain (signed-log FFT-SPA) can be found, as well as the odd Min-Max and parity likelihood ratio algorithm (PLRA) decoder.
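For reference, and in the notation of Section II-B with N(m) denoting the VNs adjacent to CN_m, the uncorrected MSA CN update referred to above takes the well-known form

r_mn = ( ∏_{n'∈N(m)\n} sign(q_n'm) ) · min_{n'∈N(m)\n} |q_n'm|,

with the OMSA and NMSA variants respectively subtracting an offset from, or scaling, the minimum magnitude.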
The decoding algorithm choice can be tightly coupled to the data type representation chosen for a particular decoder design. Probability domain decoders use floating-point types, as they rely extensively on multiplication, which is not supported natively in fixed-point types on programmable hardware. As a consequence, opportunities to improve the decoding throughput by increasing data-parallelism will be limited by this design decision. GPU hardware, usually aligned for 128-bit data types, can pack only 4 floating-point words, while it can pack 16 fixed-point words with a bitwidth of 8 bits [32]; a similar trade-off is expected on CPUs, though they usually implement more sophisticated integer arithmetic than GPUs. As a consequence, all the SPA decoders explore single-precision floating-point (32-bit) and, though some MSA-based decoders also do so, the majority of them rely on 6∼8 bit fixed-point data representations. This way, parallelism can be raised by increasing the number of words inside a data type defined by the programming model and language.
G. DECODING SCHEDULES
The decoding of LDPC codes can be scheduled mainly according to two approaches. The first approach is the so-called flooding or TPMP schedule. In this type of scheduling, the exposed parallelism lies in the complete dimension of the LDPC code, since all nodes of one type can be scheduled for processing at a time. Thus, all CNs can be updated at the same time, and all VNs can also be updated at the same time, provided that the CN and the VN processing are not concurrent. As a consequence, when developing a parallel programmable decoder, a memory fencing mechanism is required which prevents the scheduling for execution of nodes that consume messages from nodes which have not yet updated their produced messages. Otherwise, write-after-read (WAR) hazards unfold. Notwithstanding, this is not particularly challenging to guarantee on CPU, GPU or other accelerator devices, as long as CN processing and VN processing are defined by different functions or kernels. This way, the function or kernel call implicitly sets a synchronization point preventing any WAR hazard. LDPC decoders using this decoding schedule are among those reaching the highest decoding throughputs, since the TPMP schedule is usually accompanied by a heavily multithreaded approach, usually TpN.
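The implicit synchronization mentioned above can be sketched on the CUDA host side as follows (kernel names and signatures are hypothetical, assuming CN and VN kernels like those sketched earlier in this Section are defined elsewhere): launches issued to the same stream execute in order, so each VN kernel only starts once the preceding CN kernel has finished, which acts as the fence between the two phases.

#include <cuda_runtime.h>

__global__ void cn_update_tpn(const int *cn_ptr, const int *cn_idx,
                              const float *q, float *r, int M);
__global__ void vn_update_tpn(const int *vn_ptr, const int *vn_idx,
                              const float *p, const float *r, float *q,
                              float *Q, int N);

void decode_tpmp(const int *cn_ptr, const int *cn_idx, const int *vn_ptr,
                 const int *vn_idx, const float *p, float *q, float *r,
                 float *Q, int M, int N, int iters)
{
    dim3 block(256);
    dim3 grid_cn((M + block.x - 1) / block.x), grid_vn((N + block.x - 1) / block.x);
    for (int it = 0; it < iters; it++) {
        cn_update_tpn<<<grid_cn, block>>>(cn_ptr, cn_idx, q, r, M);
        vn_update_tpn<<<grid_vn, block>>>(vn_ptr, vn_idx, p, r, q, Q, N);
    }
    cudaDeviceSynchronize();    /* wait for the last iteration to complete */
}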
The TDMP schedules seen in the LDPC decoders that implement them are CN-based, i.e., CNs are scheduled for execution sequentially and, after each CN is updated, its adjacent VNs (N(m)) are updated on-the-fly as well [106]. As this decoding schedule is applied to LDPC codes designed for the TDMP, such as QC-LDPC codes, z_f CNs and their adjacent VNs can be executed simultaneously, since doing so does not unfold any WAR hazard. The potential for high throughputs with this decoding schedule has been shown for CUDA-enabled GPUs [7], [48], [49], the Cell B.E. accelerator [8] and a conventional Intel x86 CPU [75]-decoding throughputs range from 140 to 900 Mbit/s. Other approaches [42], [44], [45] fall short of such high throughputs, but are still in the same range as those obtained with the TPMP schedule. An interesting result is presented for a non-binary LDPC code case, defined over GF(2^4). The authors [57] study both a sequential and a TPMP schedule, based on the BpN approach. For equivalent BER levels achieved, the authors report lower throughputs (3∼8.5 Mbit/s) for the TPMP than for the sequential approach (5∼12.5 Mbit/s).
The TPMP, or flooding, schedule is the most widely implemented decoding schedule. However, a certain misconception may lie at the heart of this design preference. This type of decoding schedule is the one permitting the highest level of simultaneous scheduling of operations: all CNs can be scheduled for execution at the same time, and the same holds for the VNs, provided the execution of CNs and VNs does not overlap in time. On the contrary, the TDMP, despite converging faster and reducing the number of decoding iterations required to reach the same BER by roughly half, can only schedule a limited number of operations. If the LDPC code has not been constructed with this scheduling in mind, there can be as little as no opportunity to schedule more than a single node at a time, though in practice this does not happen, as the widely standardized quasi-cyclic codes are designed with this in mind. However, the TPMP schedule implies a data consumption/production pattern for each individual node where each message is accessed once per decoding iteration and per processing phase-for instance, a q_nm message is produced by VN_n and is consumed by CN_m. This type of access pattern benefits little from a cache system. On the other hand, under the TDMP schedule, data locality can be exploited temporally for short to moderate code lengths [75]. For earlier GPU generations this advantage meant little, as there was no caching system available. On newer models, L1 caches can exploit this feature of the TDMP schedule. In fact, among the surveyed LDPC decoders, the highest decoding throughputs found for CPU and GPU architectures use the TDMP [48], [75].
IV. FUTURE DIRECTIONS: RECONFIGURABLE ARCHITECTURES USING HIGH-LEVEL SYNTHESIS
Efforts to survey the LDPC decoders developed for reconfigurable computing [83] would fall outside the intended scope of this manuscript. In particular, we refrain from dwelling on reconfigurable LDPC decoders that are not developed using HLS models; since, by and large, the great majority of decoders found in the literature for reconfigurable computing are developed using traditional RTL approaches, only a limited set of decoders fits this requirement [107]-[110].
A. PROGRAMMING MODELS
OpenCL has recently become supported by the major FPGA manufacturers [87], [111]. The OpenCL-based toolchain used for the development of an LDPC decoder in [33] is the Silicon-to-OpenCL academic tool [112]. The tool takes in OpenCL kernel C descriptions, though not fully compliant with the OpenCL specification [80], and generates a custom wide-pipeline accelerator.
Moreover, Vivado HLS [89] provides comprehensive support for the C/C++ programming languages, which get mapped onto circuits on the FPGA based on a number of HLS directives that instruct the tool on how to optimize different traits of the language (see Fig. 13). It supports optimizations of 1) memory blocks, 2) arithmetic functions, 3) dataflow for loops and functions, through pipelining or unrolling, and 4) the instantiation of certain IP cores in the C/C++ language for I/O interaction with other logic blocks [111].
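A brief, hedged illustration of such directive-driven optimization follows (the pragma spellings are those documented for Vivado HLS; the loop body, degree DC and identifiers are illustrative only): annotations request array partitioning and loop pipelining when the tool generates the circuit from the C description.

#define DC 6                                   /* illustrative CN degree        */

void cn_update_hls(const float q[DC], float r[DC])
{
#pragma HLS ARRAY_PARTITION variable=q complete
#pragma HLS ARRAY_PARTITION variable=r complete
cn_loop:
    for (int e = 0; e < DC; e++) {
#pragma HLS PIPELINE II=1
        r[e] = q[e];                           /* placeholder for the CN update */
    }
}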
B. PARALLELISM
Notwithstanding the fact that the OpenCL programming model defines the concept of work-items, a concept similar to execution threads, in the reconfigurable fabric the generated accelerator defines no physical or logical element that is an execution thread. In fact, computation is defined by the circuit configuration; thus, while data-parallelism concepts remain perfectly valid, there is no thread-parallelism taxonomy equivalent for the reconfigurable LDPC decoder case.
Nevertheless, we are able to classify the OpenCL LDPC decoder on FPGA, in its inception a TpN decoder, as a wide-pipeline decoder [33], and the Vivado HLS decoder as a wide-pipeline accelerator as well, though the latter approach defined the TPMP node processing phases in computation loops [58]. As a consequence, we prefer the designation loop-annotated decoder, since it is through the optimization directives written as annotations to the loops where computation occurs that the hardware generation is guided. Both approaches see modest throughputs of dozens of Mbit/s achieved for short to moderate length codes. The greatest advantage of this approach is the low latency, below 3 ms in the OpenCL decoder case and below 500 µs in the Vivado HLS case.
V. SUMMARY
LDPC decoders on programmable hardware are mostly suited to simulation purposes, as the methodology pursued in most of the literature is prone to increasing the decoding latency. Notable exceptions to this tradeoff are the works of Le Gal and Jego [75] and Wang et al. [66], which effectively keep latencies at low, real-time compliant levels. Notwithstanding, surveying the decoders in the literature, we observe that the better suited strategies for LDPC decoding are based on LLR-based decoding algorithms, mostly defining fixed-point data representations. This allows multiple messages with small bitwidths, usually around 8 bits, to be packed into wider words. Furthermore, data-parallelism levels are usually pushed beyond the wide word, or vector datatype, granularity, often at the expense of spawning more threads on the underlying decoding architecture. Task-parallelism in the literature is explored at all conceived levels, from coarse (CpC) to fine granularity (TpE), although the strategies attaining the highest performance are mostly fine-grained ones. In particular, the TpN task-parallelism granularity has proven the most prevalent method to expose parallelism for computation.
Regarding LDPC decoders on reconfigurable hardware, the surveyed LDPC decoders based on HLS programming models show that this field provides interesting prospects, but remains largely untapped. In particular, it remains unclear how to best direct an HLS compiler to generate efficient hardware [113]. Despite the incipient maturity of the tools used, the LDPC decoders [33], [58] already attain competitive decoding throughput and latency, as observed during the inception of LDPC decoding on programmable many-core and multicore architectures. Furthermore, other programming models, such as Altera OpenCL, more recent versions of the Vivado infrastructure and the Maxeler dataflow decoders [88], promise much lower NRE efforts to target LDPC decoders with higher throughputs and higher energy efficiency than programmable computer architectures [85].
The reader is referred to Appendix A, in particular to the exhaustive set of surveyed LDPC decoders (Table 1) , which can be found as a supplementary file to this manuscript. 
