Abstract: Channel coding may be viewed as the best-informed and most potent component of cellular communication systems, which is used for correcting the transmission errors inflicted by noise, interference, and fading. The powerful turbo code was selected to provide channel coding for mobile broadband data in the 3G UMTS and 4G long term evolution cellular systems. However, the 3GPP standardization group has recently debated whether it should be replaced by low density parity check (LDPC) or polar codes in 5G new radio, ultimately reaching the decision to adopt the LDPC code family for enhanced mobile broadband (eMBB) data and polar codes for eMBB control. This paper summarizes the factors that influenced this debate, with a particular focus on the application specific integrated circuit (ASIC) implementation of the decoders of these three codes. We show that the overall implementation complexity of turbo, LDPC, and polar decoders depends on numerous other factors beyond their computational complexity. More specifically, we compare the throughput, error correction capability, flexibility, area efficiency, and energy efficiency of ASIC implementations drawn from 110 papers and use the results for characterizing the advantages and disadvantages of these three codes as well as for avoiding pitfalls and for providing design guidelines.
I. INTRODUCTION

IN CELLULAR communication systems such as 3G UMTS [1] and 4G LTE [2], wireless transmission is used to convey data between user equipment and basestations, where the latter act as gatekeepers to the Internet and telephone networks. However, the received data typically differs from the transmitted data, owing to transmission errors caused by noise, interference and fading. In order to correct these transmission errors, cellular communication systems use forward error correction channel codes. More specifically, a channel encoder is used in the transmitter (be it the user equipment or the basestation) to convert each so-called information block comprising K data bits into a longer encoded block comprising N > K encoded bits, which are transmitted. In the receiver, the additional (N − K) encoded bits provide the channel decoder with redundancy that allows it to detect and correct transmission errors within the original K information bits.
If the noise, interference or fading is particularly severe, then a low coding rate R = K/N will be required for the channel decoder to successfully detect and correct all transmission errors. However, a low coding rate implies the transmission of a high number N of encoded bits, which consume precious transmission time, energy and bandwidth resources. Therefore, desirable channel codes allow the successful detection and correction of transmission errors at coding rates R that approach the theoretical channel capacity [3] .
In the past couple of decades, several near-capacity channel codes have emerged, including the classic turbo codes [4] adopted by the 3G UMTS [1] and 4G LTE [2] mobile broadband standards, the LDPC codes [5] ratified by the WiFi [6] , WiMax [7] , WiGig [8] , DVB-S2 [9] and 10GBase-T [10] standards as well as the more recent polar codes [11] . Both the turbo and LDPC codes employ an iterative decoding process, in which each successive attempt at decoding the information block informs the next, until the process converges to a legitimate codeword. By contrast, polar codes select the recovered information block from a list of candidates obtained from the associated parallel successive cancellation decoding processes, in which the decoding of each successive information bit informs the decoding of the next. For all three codes, the channel decoder has a much higher complexity than the corresponding encoder, since it uses iterative or parallel decoding processes, which rely on probabilistic representations of the encoded bits to overcome the uncertainty introduced by noise, interference and fading. Owing to this, it is the error correction performance and implementation characteristics of the channel decoder that are typically the main concerns when designing a channel code. These implementation characteristics include reconfiguration flexibility, processing throughput, processing latency, energy efficiency and hardware efficiency. At the time of writing, the 3GPP standardization group is deliberating on the 5G specifications under the terminology of New Radio (NR), where the turbo code of the operational 3G UMTS and 4G LTE systems has been replaced by an LDPC code for its enhanced Mobile Broad Band (eMBB) data mode, supported by a polar code in the eMBB control mode. To elaborate a little further, the turbo code has been replaced, because it is considered to be incapable of efficiently achieving the multi-Gbps processing throughput required for eMBB. 
In addition to having an increased throughput, the eMBB mode aims for attaining an improved coverage, which will enable faster file-downloads and more reliable video streaming, for example. Furthermore, 5G is also targeting both Ultra-Reliable Low Latency Communication (URLLC) and massive Machine Type Communication (mMTC) [12] applications. Explicitly, URLLC will offer significantly improved error correction capability and latency, in order to support mission-critical applications, such as autonomous vehicles [13] . By contrast, mMTC will offer significantly improved energy efficiency for the Internet of Things (IoT).
There are several survey papers [14]-[22] that discuss the implementation of channel decoders. For example, Roth et al. reviewed the trade-offs of area, throughput, Table II. Against the background described above, this treatise provides an overview and comparison of turbo, LDPC and polar codes, with particular focus on their capability to meet the different requirements associated with the eMBB, URLLC and mMTC applications of 5G, as detailed in Section II. The operation and characteristics of turbo, LDPC and polar codes are detailed in Section III. The processing throughput, error correction capability and flexibility of 110 published ASIC-based turbo, LDPC and polar implementations are characterised in Section IV as a function of their area and energy efficiency. Finally, Section V presents recommendations for channel decoder designers as well as conclusions and opportunities for future work. The structure of this paper is illustrated in Fig. 1.
II. THE REQUIREMENTS FOR 5G
Like 3G UMTS and 4G LTE, the aim of the 3GPP 5G NR is to continue the trend of offering substantially improved user experience and more diverse applications for cellular communications. However, this will impose stricter requirements upon all system components, including the channel code [23], as summarized in the following discussions. Note that the different eMBB, URLLC and mMTC applications of 5G impose different requirements, which may be impossible to meet simultaneously using a single channel code. Therefore, it may be expected that different channel codes are adopted for different applications.
A. Processing Throughput
The strictest throughput requirements are imposed by the eMBB applications of 3GPP 5G NR. A peak transmission throughput of 20 Gbps is targeted for these eMBB applications, which is much higher than the 1 Gbps achieved by 4G LTE. During video streaming, this significantly improved throughput will enable opportunistic forward buffering, when the channel conditions are favorable, for example. This will substantially increase the reliability of streamed video, which currently suffers from 'stutter' when the channel conditions become unfavorable.
Since all received information has to pass through the channel decoder, it must offer an information throughput of T = 20 Gbps. Achieving this information throughput will require a high degree of parallel processing. If we (perhaps optimistically) assume that I = 10 decoding steps (namely iterations or successive cancellation steps) are required and that the channel decoder processors can operate at a clock frequency of F = 1 GHz, then at least P = I · T/F = 200 parallel processors will be required. This parallelism may be implemented internally, by using an array of P processors that collaborate during the processing of each block. Alternatively, a high information throughput can be achieved using external parallelism, where separate channel decoders are used for processing multiple blocks at the same time, or where multiple blocks are 'unrolled' and pipelined through the same decoder at the same time [24], [56]. However, this external parallelism approach does not achieve the same low-latency benefit as internal parallelism, as discussed in Section II-B.
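The parallelism estimate above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: the function name `required_processors` is a hypothetical helper, and the model assumes that every processor completes one decoding step per clock cycle.

```python
import math


def required_processors(throughput_bps: float, steps: int, clock_hz: float) -> int:
    """Lower bound P = I * T / F on the number of parallel processors.

    Assumes each processor performs one decoding step per clock cycle,
    which is the simplifying assumption used in the text.
    """
    return math.ceil(steps * throughput_bps / clock_hz)


# Values quoted in the text: T = 20 Gbps, I = 10 steps, F = 1 GHz.
P = required_processors(throughput_bps=20e9, steps=10, clock_hz=1e9)
print(P)  # 200
```

Halving the clock frequency or doubling the number of decoding steps doubles the bound, which is why the figure of 200 processors should be read as optimistic.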
B. Processing Latency
The strictest latency requirements are imposed by the URLLC applications of 3GPP 5G NR. An end-to-end latency of 0.5 ms is targeted for these URLLC applications, which is much lower than the 10 ms achieved in 4G LTE [25] . This significantly improved latency will allow user inputs made on a user equipment to be delivered to the cloud, processed on a cloud computer and then returned to update the display of the user equipment, without the user perceiving an objectionable delay, for example. This will enable new applications in user-specific 3D video rendering, augmented reality, remote control and mobile gaming, among others. Furthermore, since machines are more sensitive to latency than humans, these ultra-low latencies will support new mMTC applications, such as swarm robotics, factory automation, as well as vehicular control and safety applications.
However, an end-to-end latency of 0.5 ms implies a physical layer latency of 8 to 67 μs [26], which allows data to be transmitted immediately after receiving control information, within the guard interval duration of 8 to 67 μs in the same time slot. Furthermore, the channel decoder must share this latency budget with many other physical layer components, such as channel estimation and demodulation. Owing to this, the channel decoder should target processing latencies as low as 0.665 μs [26]. This can be achieved by channel decoders employing internal parallelism, which comprise different processors that collaborate on the processing of each block. Using this approach, an information throughput of T = 20 Gbps for data blocks comprising K = 10,000 bits implies a processing latency of L = K/T = 0.5 μs, which meets the requirement described above. Note that while a high throughput can be achieved by using separate channel decoders to process multiple blocks at the same time, this approach does not improve the processing latency beyond that of each individual channel decoder. In other words, the relationship L = K/T only holds under the assumption that neither parallel processing of multiple frames at the same time nor pipelined operation of successive decoding steps is employed.
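The latency relationship L = K/T can be checked numerically as follows. This is a minimal sketch, valid only under the stated assumption of purely internal parallelism (one block in flight at a time); the names `block_latency_s` and `TARGET_LATENCY_S` are hypothetical.

```python
def block_latency_s(block_bits: int, throughput_bps: float) -> float:
    """Processing latency L = K / T of a decoder that handles one block at a time."""
    return block_bits / throughput_bps


TARGET_LATENCY_S = 0.665e-6  # decoder latency target quoted in the text

latency = block_latency_s(block_bits=10_000, throughput_bps=20e9)
print(f"{latency * 1e6:.3f} us")    # 0.500 us
print(latency <= TARGET_LATENCY_S)  # True
```

If the same 20 Gbps were instead obtained from several independent decoders, T in this formula would have to be the throughput of a single decoder, and the latency would grow accordingly.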
C. Error Correction Capability
The strictest error correction capability requirements are imposed in the URLLC applications of 3GPP 5G NR. The target for these URLLC applications is for only 1 block in every 100,000 to suffer from excessive transmission errors that cannot be corrected by the channel decoder, when the transmission throughput approaches the capacity of the communication link between the user equipment and basestation. More specifically, the URLLC channel code must not exhibit an error floor above a Block Error Rate (BLER) of $10^{-5}$ [25]. This represents an order of magnitude improvement upon that of the 4G LTE turbo code, which has to guarantee a BLER below $10^{-4}$ [27]. This additional improved error correction requirement further aggravates the improved latency requirement of Section II-B, since it makes it ten times less likely that the receiver will have to activate Hybrid Automatic Repeat reQuest (HARQ) [17], [28] to request a retransmission of erroneously decoded information, which imposes a significant additional latency. Despite this, however, HARQ will remain a vital component of 5G, in order to enable ultra-reliable communication.
D. Flexibility
The requirements for the 5G NR channel decoder are summarized in Fig. 2. The priority given to each characteristic in the design of the channel decoder depends on the scenario considered. Unlike the other requirements, flexibility is necessary across all three main dimensions of 5G NR, owing to the wide range of use cases targeted by each of eMBB, URLLC and mMTC. Owing to this, the channel code must support a wide variety of information block lengths K, as well as a wide variety of coding rates R = K/N. For example, short blocks comprising as few as tens or hundreds of information bits may be expected to be typical in URLLC, mMTC and control applications, while long blocks comprising as many as thousands of bits are typical in eMBB applications.
Likewise, low coding rates will be required in rural areas, where basestations are deployed sparsely for covering large cells, while high coding rates may be used for ultra-dense urban deployments having strong Line-of-Sight (LOS). If the channel code does not support a wide variety of information block lengths K, then each information block may have to be padded with a high number of wasteful dummy bits for ensuring that its length becomes one of the legitimate lengths supported by the channel code. Likewise, if the channel code does not support a wide variety of coding rates R, then it may be necessary to select a rate that is lower than is actually required by the current level of noise, interference and fading. Again, that implies the transmission of a high number of wasteful encoded bits. In both cases, the wasteful bits translate into wasted bandwidth. More specifically, the waste results in each transmission having a higher bandwidth, duration or energy than is actually required, preventing other users from communicating at the same frequency, time or location, without suffering from increased interference. This will therefore degrade the throughput, latency and error correction capability that can be offered by the 3GPP 5G NR. For this reason, the 5G flexibility requirement is key to unconditionally fulfilling the other challenging 5G requirements of Sections II-A to II-C.
E. Implementation Complexity
The implementation complexity of a channel code determines its hardware resource requirement and energy consumption. The strictest implementation complexity requirements are imposed by the mMTC applications of 3GPP 5G NR. In these mMTC applications, low-cost IoT devices have to operate continuously without recharging, while requiring very low chip area and energy consumption. In the uplink, these requirements are imposed upon the channel encoder, while they are imposed on the decoder in the downlink. In eMBB and URLLC applications, the hardware and energy efficiencies of the channel encoder and decoder have to be at least as good as those of 4G LTE. Here, the hardware (or area) efficiency quantifies a channel encoder's or decoder's information throughput as a ratio to its ASIC area, which is measured in Mbps/mm². Meanwhile, the energy efficiency quantifies the number of information bits that may be decoded per nJ of energy consumed, which is equivalent to Mbps/mW.
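The two efficiency metrics, and the equivalence between bits/nJ and Mbps/mW, can be illustrated with a short sketch. The function names and the example decoder figures (1000 Mbps, 2 mm², 500 mW) are hypothetical and are not taken from any of the surveyed implementations.

```python
def area_efficiency_mbps_per_mm2(throughput_mbps: float, area_mm2: float) -> float:
    """Information throughput per unit ASIC area, in Mbps/mm^2."""
    return throughput_mbps / area_mm2


def energy_efficiency_bits_per_nj(throughput_mbps: float, power_mw: float) -> float:
    """Information bits decoded per nJ of energy consumed."""
    bits_per_second = throughput_mbps * 1e6
    joules_per_second = power_mw / 1e3
    bits_per_joule = bits_per_second / joules_per_second
    return bits_per_joule / 1e9  # 1 J = 1e9 nJ


# Hypothetical decoder: 1000 Mbps, 2 mm^2, 500 mW.
print(area_efficiency_mbps_per_mm2(1000, 2))    # Mbps/mm^2
print(energy_efficiency_bits_per_nj(1000, 500)) # bits/nJ
print(1000 / 500)                               # Mbps/mW, the same number
```

The last two lines agree because 1 Mbps/mW corresponds to 10⁶ bits/s divided by 10⁻³ J/s, which is 10⁹ bits/J or 1 bit/nJ.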
III. BACKGROUND ON THE CANDIDATE CHANNEL CODES
This section provides background discussions of the three mainstream channel codes considered in this paper, namely the turbo, LDPC and polar codes. In particular, we will detail the decoders of those codes, where the decoders have much higher complexity than the corresponding encoders. The following subsections discuss each code in turn, with reference to the structures illustrated in Figure 4 .
A. Turbo Codes
Turbo codes may be considered to be very mature in mobile broadband applications, since they were selected to provide flexible channel coding in the 3G UMTS [1] and 4G LTE [2] standards, which have been widely adopted worldwide. Furthermore, turbo codes have been selected to provide channel coding in the 4G Narrow Band Internet of Things (NB-IoT) standard [29] for machine type communications.
A turbo encoder [30] comprises a parallel concatenation of two convolutional encoders [31], which are separated by an interleaver. The interleaver creates a replica of the K-bit information block, but rearranges the order of its bits according to a predetermined interleaving pattern. Following this, the two convolutional encoders encode the pair of differently ordered information blocks and each generates a K-bit parity block, as shown in Figure 3. These are typically concatenated together with a third replica of the K-bit information block, which is referred to as the systematic block. Puncturing and repetition [32] may also be used to remove or introduce additional bits for obtaining the resultant N-bit encoded block. The N-bit encoded block is then modulated and transmitted over the wireless channel, where it is exposed to noise, interference and fading. Owing to the uncertainty introduced by these effects, the demodulator will typically be unable to express absolute confidence in the values of the N bits in the encoded block. Instead, the demodulator may express its confidence in the value of each bit using the corresponding Logarithmic Likelihood Ratio (LLR) [33]. Each LLR provides the logarithm of the ratio of the probabilities of the corresponding bit having the value 0 and 1. This approach enables soft turbo decoding, which iteratively exploits all received information, despite the uncertainty introduced by the wireless channel.
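The LLR definition above may be illustrated by the following sketch. It assumes the natural logarithm and a sign convention in which positive LLRs favour the bit value 0; both are common choices, but they are assumptions here, as are the function names.

```python
import math


def llr(p0: float) -> float:
    """LLR = ln(P(bit = 0) / P(bit = 1)), with P(bit = 1) = 1 - p0.

    Requires 0 < p0 < 1; the natural logarithm is an assumed convention.
    """
    return math.log(p0 / (1.0 - p0))


def hard_decision(value: float) -> int:
    """Most likely bit value for an LLR (ties are resolved towards 0)."""
    return 0 if value >= 0 else 1


for p0 in (0.5, 0.9, 0.1):
    print(p0, round(llr(p0), 3), hard_decision(llr(p0)))
```

The sign of the LLR carries the hard decision, while its magnitude expresses the confidence: p0 = 0.5 gives an LLR of zero, representing complete uncertainty.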
More specifically, the N LLRs of the encoded block may be decomposed into a systematic block and two parity blocks, each comprising K LLRs. These may be entered into a pair of convolutional decoders, which operate on the basis of the Maximum A-Posteriori (MAP) algorithm, also known as the Logarithmic Bahl-Cocke-Jelinek-Raviv (Log-BCJR) algorithm [34]. This employs a trellis diagram [31] for describing the relationships between the encoded bits, which may be recursed in forward and backward directions [35] to obtain a block of K extrinsic LLRs pertaining to the information block. These K extrinsic LLRs are obtained by combining the received LLRs using both addition and Jacobian logarithm [34] operations, where the latter is also known as the max* operation. The max* operation used for combining two LLRs x and y is given by
$$\max{}^{*}(x, y) = \max(x, y) + \ln\left(1 + e^{-|x - y|}\right), \qquad (1)$$
which is simplified as max*(x, y) ≈ max(x, y) in the so-called max-log-MAP algorithm [34]. The order of these K extrinsic LLRs may be rearranged by the interleaver and entered into the other convolutional decoder as a block of K a priori LLRs. This supports the operation of the other convolutional decoder, enabling it to generate its own block of K extrinsic LLRs pertaining to the information block. In return, these extrinsic LLRs can be reordered by the interleaver and forwarded as a priori LLRs to the first convolutional decoder. These additional a priori LLRs may be exploited during a second decoding attempt of the first convolutional decoder, in order to improve its operation. This process may continue, with the two convolutional decoders iteratively exchanging successively ever-higher-quality extrinsic LLRs, until the process converges to a legitimate codeword. The LLRs pertaining to the information block may then be converted into the most likely bit values and output [20].
As shown in Figure 4(a), turbo codes may be considered to employ a regular structure, comprising two rows of K trellis stages corresponding to the two convolutional decoders, connected by the interleaver. In Figure 4(a), the vertical exchange of information between the trellis stages via the interleaver corresponds to the iterative exchange of extrinsic LLRs, while the horizontal exchange of information along the two rows of trellis stages corresponds to the forward and backward recursions of the Log-BCJR algorithm. Since every trellis stage is identical, the turbo code equally protects the K bits in each information block. The complexity of the turbo code structure scales with K, since the interconnections of the interleaver pattern may be described by a 1 × K vector. Owing to the regularity of this structure, turbo codes can be readily designed to reuse hardware to flexibly support a wide range of information block lengths K.
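The max* operation and its max-log-MAP simplification can be compared numerically using the sketch below. It relies on the standard Jacobian logarithm identity max*(x, y) = ln(eˣ + eʸ) = max(x, y) + ln(1 + e^(−|x − y|)); the function names are hypothetical.

```python
import math


def max_star(x: float, y: float) -> float:
    """Jacobian logarithm: ln(e^x + e^y), written in a numerically stable form."""
    return max(x, y) + math.log1p(math.exp(-abs(x - y)))


def max_log(x: float, y: float) -> float:
    """Max-log-MAP approximation, which drops the correction term."""
    return max(x, y)


for x, y in [(1.0, 2.0), (0.0, 0.0), (-3.0, 5.0)]:
    exact = max_star(x, y)
    approx = max_log(x, y)
    print(f"x={x:+.1f} y={y:+.1f} exact={exact:.4f} approx={approx:.4f}")
```

The correction term is at most ln 2 (when x = y) and vanishes as |x − y| grows, which is why the approximation trades a small loss in error correction capability for the removal of the logarithm and exponential.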
However, the forward and backward recursions within the Log-BCJR algorithm impose data dependencies, which can limit the degree of parallel processing in practical implementations. This impediment has motivated substantial research efforts, which have mitigated the data dependencies and increased the degree of parallel processing using techniques such as Non-Sliding Window (NSW) [36], radix-4 [36] and Fully-Parallel Turbo Decoder (FPTD) [37]. The impact of the turbo decoder architecture on the error correction capability, throughput, area efficiency and energy efficiency will be characterised in the scatter plots of Section IV. Furthermore, it is typically necessary to define a different interleaver pattern
In order to contrast the two extreme architectures of fully-serial and fully-parallel turbo decoding, their corresponding decoding schedules are illustrated in Tables III and IV. As shown in Table III, the fully-serial turbo decoder uses a single processor operating over K clock cycles to complete each of the forward and backward recursions on each of the upper set of trellis stages and lower set of trellis stages. In total, each decoding iteration of the fully-serial turbo decoder requires 4 × K clock cycles, owing to the data dependencies imposed by the Log-BCJR algorithm. However, these data dependencies are ignored in the fully-parallel turbo decoder of Table IV, where a separate processor is used for each individual algorithmic block, enabling the forward and backward recursions of both upper and lower sets of trellis stages to be completed at the same time. Therefore, one iteration of fully-parallel turbo decoding can be finished in a single clock cycle, using a total of 2 × K processors.
The examples of Tables III and IV illustrate the contrast between serial and parallel turbo decoders. The serial schedule efficiently propagates information along the forward and backward recursions, enabling a particular error correction capability to be achieved with fewer iterations and lower complexity than the parallel schedule. Despite this, the parallel schedule requires significantly fewer clock cycles overall, leading to superior throughput and latency. The area and energy requirements of ASIC implementations of turbo decoders tend to be dominated by the capacity of the memory, which is the same for both the serial and parallel schedules. Owing to this, the high throughput of the parallel schedule typically leads to superior area and energy efficiency. However, the reconfiguration flexibility typically remains limited when implementing a parallel schedule, due to the requirement for a fully laid out interconnection network that supports all interleaver designs, as will be detailed in Section IV.
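The clock cycle counts of the two extreme schedules can be compared using the following sketch. The per-iteration figures (4 × K cycles on one processor versus one cycle on 2 × K processors) are those stated above, whereas the block length and the iteration counts in the example are purely illustrative assumptions, chosen only to reflect that the parallel schedule needs more iterations.

```python
def serial_schedule(K: int, iterations: int) -> tuple[int, int]:
    """Return (processors, clock cycles) for fully-serial turbo decoding."""
    return 1, 4 * K * iterations


def parallel_schedule(K: int, iterations: int) -> tuple[int, int]:
    """Return (processors, clock cycles) for fully-parallel turbo decoding."""
    return 2 * K, iterations


K = 6144  # illustrative information block length
# Hypothetical iteration counts for illustration only.
print(serial_schedule(K, iterations=8))
print(parallel_schedule(K, iterations=48))
```

Even when the parallel schedule is given several times more iterations, its total cycle count remains orders of magnitude lower, at the cost of 2 × K processors and the associated interconnection network.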
Since the turbo code structure scales with the information block length K, the turbo decoder can be said to decode the information bits directly, rather than recovering the encoded bits and then extracting the information bits. Owing to this, the decoding complexity depends mainly on the information block length K and hence it does not vary significantly when the coding rate R is varied using puncturing and repetition, as shown in Figure 6 . More specifically, the computational complexity C in terms of Add Compare Select (ACS) operations can be seen to be linearly proportional to the information block length K in Figure 7 . In other words, the complexity of a turbo decoder is on the order of C = O(K). In this way, the information throughput, and hence the decoding latency, hardware efficiency and energy efficiency do not vary significantly with the coding rate R, as shown in Figure 5a . This is illustrated by the analogy of Figure 5b , which shows that the information throughput of a turbo decoder remains constant, when the coding rate R is adjusted by varying the puncturing or repetition.
B. LDPC Codes
LDPC codes belong to an even more mature channel coding family than turbo codes, since they were proposed by Gallager in his PhD thesis half a century ago. They have also been selected for a variety of standards conceived for diverse applications. Perhaps most famously, LDPC coding is employed in WiFi [6], which is designed for wireless local area networking, offering support for 12 combinations of medium information block lengths K and high coding rates R. The WiMAX standard [39] designed for mobile broadband communication employs an LDPC code that supports a similar range of information block lengths K and coding rates R, but with 144 options within that range. In the WiGig standard [8] ratified for Millimeter Wave (mmWave) communication, an LDPC code supporting 4 combinations of medium information block lengths K and high coding rates R is employed. The DVB-S2 standard [9] for satellite communication employs an LDPC code, which supports 29 combinations of high information block lengths K and medium to high coding rates R. In the Ethernet 10GBASE-T standard [40], an LDPC code supporting only a single fixed medium information block length K and a high coding rate R is employed for facilitating a high processing throughput. The impact of these different standards [8], [9], [39], [40] upon the error correction capability, throughput, area efficiency and energy efficiency will be characterized in the scatter plots of Section IV.
[Figure caption: The relationship between information throughput and decoding latency, hardware efficiency and energy efficiency, as well as an analogy using pumps, valves and pipes, to illustrate how the coding rate R of (b) turbo and (c) LDPC or polar decoders affects their encoded and information throughputs.]
An LDPC encoder obtains an N-bit encoded block U by multiplying a block of K information bits x by a (K × N) generator matrix G in GF(2), according to U = x · G. In contrast to a turbo decoder, an LDPC decoder employs an irregular factor graph structure [21], which applies unequal protection to the K bits in each information block. More specifically, N LLRs pertaining to the encoded block are forwarded to a corresponding set of N Variable Nodes (VNs), which are connected to (N − K) Check Nodes (CNs). In detail, the v-th VN is connected to d_v CNs, where d_v is referred to as the degree of the VN. Similarly, the c-th CN is connected to d_c VNs, where d_c is referred to as the degree of the CN. In a similar manner to a turbo decoder, an iterative sum product decoding process [21] is used for exchanging extrinsic LLRs between the connected VNs and CNs. More specifically, the extrinsic LLR provided by a VN for each of its connected CNs is obtained as the sum of the LLRs provided by all other connected CNs plus the LLR provided by the channel. Meanwhile, the extrinsic LLR provided by a CN for each of its connected VNs is obtained by the so-called box-plus summation of the LLRs provided by all other connected VNs. To elaborate a little further, the box-plus sum of two LLRs x and y is given by
$$x \boxplus y = 2\tanh^{-1}\left(\tanh\left(\frac{x}{2}\right)\tanh\left(\frac{y}{2}\right)\right), \qquad (2)$$
which is simplified to
$$x \boxplus y \approx \operatorname{sign}(x)\,\operatorname{sign}(y)\,\min(|x|, |y|) \qquad (3)$$
in the so-called min-sum algorithm [41]. The complexity of this factor graph structure scales with the encoded block length N, since the interconnections between the VNs and CNs in the LDPC decoder may be described by an (N − K) × N matrix, which is referred to as the Parity Check Matrix (PCM) [21]. Note that the dimensions of the PCM grow as the coding rate R is reduced.
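The VN and CN computations described above may be sketched as follows, using the standard box-plus expression x ⊞ y = 2 tanh⁻¹(tanh(x/2) tanh(y/2)) and its min-sum approximation sign(x) sign(y) min(|x|, |y|). The function names and the message-list interface are hypothetical, and the direct tanh form is only suitable for LLRs of moderate magnitude, since tanh saturates to ±1 for large inputs.

```python
import math
from functools import reduce


def box_plus(x: float, y: float) -> float:
    """Exact box-plus sum of two LLRs (moderate magnitudes only)."""
    return 2.0 * math.atanh(math.tanh(x / 2.0) * math.tanh(y / 2.0))


def min_sum(x: float, y: float) -> float:
    """Min-sum approximation of the box-plus sum."""
    return math.copysign(1.0, x) * math.copysign(1.0, y) * min(abs(x), abs(y))


def vn_extrinsic(channel_llr: float, cn_llrs: list[float], target: int) -> float:
    """VN output for CN `target`: channel LLR plus the LLRs of all other CNs."""
    return channel_llr + sum(v for i, v in enumerate(cn_llrs) if i != target)


def cn_extrinsic(vn_llrs: list[float], target: int, combine=box_plus) -> float:
    """CN output for VN `target`: box-plus sum over all other VNs (degree >= 2)."""
    others = [v for i, v in enumerate(vn_llrs) if i != target]
    return reduce(combine, others)


print(round(box_plus(1.0, 2.0), 4), min_sum(1.0, 2.0))
print(vn_extrinsic(0.5, [1.0, -2.0, 3.0], target=1))
print(cn_extrinsic([1.0, -2.0, 3.0], target=0, combine=min_sum))
```

The min-sum result always has the same sign as the exact box-plus sum but a magnitude that is at least as large, which is why practical min-sum decoders often apply a scaling or offset correction.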
The computations performed within each LDPC VN and CN typically impose a significantly lower complexity than the computations performed for each trellis stage of a turbo code, hence typically leading to a lower LDPC decoder complexity, particularly at high coding rates. However, while a turbo decoder has K random interconnections in its interleaver, an LDPC factor graph has N · mean[d_v] = K · mean[d_v]/R random interconnections, where mean[d_v] is the average number of CNs that each VN is connected to. Depending on the design of the LDPC code, an LDPC decoder may have several times more random interconnections than an equivalent turbo decoder, particularly at lower coding rates. For example, in the case of the WiFi LDPC codes, where we have mean[d_v] = 3.5 [42], the R = 1/2 interconnection complexity is 7 times higher than that of a turbo decoder having the same information block length K. Since different components of an LDPC decoder have different irregular numbers of connections, it can be a significant challenge to implement flexible LDPC decoders that support various block lengths K and coding rates R at a high throughput, as will be demonstrated in Section IV-C. This flexibility is typically achieved by defining a different PCM for each supported combination of information block length K and coding rate R, which requires the decoder to support a wide variety of PCMs.
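The factor of 7 quoted above can be reproduced with the sketch below. It assumes that the number of LDPC interconnections equals the number of factor graph edges, namely N · mean[d_v] = K · mean[d_v]/R, which is consistent with the WiFi example in the text; the function names and the block length used are illustrative.

```python
def turbo_interconnections(K: int) -> float:
    """A turbo interleaver has one random interconnection per information bit."""
    return float(K)


def ldpc_interconnections(K: int, R: float, mean_dv: float) -> float:
    """Assumed edge count of the factor graph: N * mean[d_v], with N = K / R."""
    return K / R * mean_dv


K = 972  # illustrative information block length
ldpc = ldpc_interconnections(K, R=0.5, mean_dv=3.5)
turbo = turbo_interconnections(K)
print(ldpc, turbo, ldpc / turbo)  # the ratio is 7.0
```

Because the ratio equals mean[d_v]/R, it is independent of K and grows as the coding rate is reduced.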
However, in contrast to turbo decoders, LDPC decoders only have connectivity through the interleaver. More specifically, LDPC decoders do not suffer from the data dependencies that are imposed by the serial nature of the forward and backward recursions used by the Log-BCJR algorithm during turbo decoding. Owing to this, the designers of LDPC decoders are free to implement parallel processing to a wide variety of degrees and using a wide variety of techniques, including block parallel [43], row parallel [44] and fully parallel [45] arrangements. In analogy to the fully-serial and fully-parallel turbo decoding of Tables III and IV, LDPC decoding can be completed using the more-serial layered belief propagation schedule [46] or the extreme of the fully-parallel flooding schedule [45]. As in turbo decoding, a higher degree of parallelism leads to higher throughput, area efficiency and energy efficiency, but degraded error correction performance and flexibility, as will be discussed in Section IV.
Furthermore, as shown in Figure 6 , the complexity, throughput, latency, hardware efficiency and energy efficiency of LDPC decoders are degraded at low coding rates R, owing to two fundamental reasons. Firstly, the number of rows in an LDPC PCM grows as the coding rate R is reduced, which implies having a higher number of CNs requiring computation. Secondly, the number of columns in the PCM and the number of VNs in the factor graph is given by the encoded block length N, which dictates the input and output interface to the LDPC decoder, as shown in Figure 4 (b). Since LDPC decoders must recover 1/R encoded bits in order to decode each information bit, their information throughputs scale down proportionately with the coding rate R, as illustrated by the analogy of Figure 5a . This analogy shows that the information throughput of an LDPC decoder varies when the coding rate is adjusted, since this controls the specific fraction of the N recovered bits that correspond to information bits. As the coding rate of an LDPC is increased, the dimensions of its PCM are reduced and its complexity C is decreased, as shown in Figure 6 . However, when the coding rate R is fixed, increasing the information block length K will increase the dimensions of the PCM, hence increasing the computational complexity of the LDPC decoder, as shown in Figure 7 . In general terms, it may be said that the complexity of an LDPC decoder is on the order of C = O(K/R).
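The contrasting dependence of turbo and LDPC decoders on the coding rate R can be summarised by the simple model below. It is only a sketch of the scaling laws C = O(K) and C = O(K/R) stated in the text: the proportionality constants are set to one, which is a hypothetical normalisation, and the function names are illustrative.

```python
def information_throughput(encoded_throughput_bps: float, R: float) -> float:
    """Information throughput of a decoder that recovers encoded bits at a fixed rate."""
    return R * encoded_throughput_bps


def turbo_complexity(K: int, R: float) -> float:
    """O(K) model: independent of the coding rate (unit constant assumed)."""
    return float(K)


def ldpc_complexity(K: int, R: float) -> float:
    """O(K / R) model: grows as the coding rate is reduced (unit constant assumed)."""
    return K / R


for R in (0.25, 0.5, 0.75):
    print(R, information_throughput(10e9, R), turbo_complexity(1000, R), ldpc_complexity(1000, R))
```

Under this model, halving R halves the information throughput of an LDPC decoder with a fixed encoded throughput and doubles its complexity, while leaving the turbo decoder model unchanged.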
C. Polar Codes
While turbo and LDPC codes have been researched and developed over the last 20 years, polar codes [11] were not proposed until 2009. Owing to this, polar codes were not adopted in any standards or consumer devices before the control channel of 3GPP NR eMBB. This limited their scope to proof of concept demonstrators and academic publications.
During polar encoding, the K information bits are interleaved with (M − K) frozen bits, which have a fixed value of zero. The frozen bits are positioned according to a prescribed bit pattern, which should be optimised for each supported combination of K and R. Here, the number of frozen bits (M − K) should be chosen for ensuring that the number of bits in the resultant bit sequence is a power of two, yielding $M = 2^{\lceil \log_2(N) \rceil}$, where $\lceil \cdot \rceil$ is the ceiling function. These bits are combined using XOR operations according to a Kronecker matrix [11], in analogy to Figure 4.
Following this, (M − N) of the resultant bits are punctured or shortened, in order to reduce the length of the resultant encoded block to N bits. Alternatively, repetition may be used for increasing N above M in cases where N is slightly higher than a power of two. A polar decoder comprises a structured graph of VNs and CNs, all having a degree of no more than 3, as shown in Figure 4(c). At the start of the decoding process, the N LLRs provided by the demodulator are depunctured and forwarded to the inputs on the right of Figure 4(c). During the decoding process, the LLRs are propagated through the check nodes from right to left of Figure 4(c). Here, each check node combines a pair of LLRs using the box-plus operation of Equation (2) or its min-sum approximation as shown in Equation (3). After the propagation of LLRs, hard decisions are made for each of the information bits on the left of Figure 4(c), in order from top to bottom. More specifically, these hard decisions are propagated from left to right, to the VNs, allowing them to pass LLRs to the CNs to propagate from right to left. Here, each VN either adds or subtracts an LLR from another, depending on the value of the bit propagated to it. In this way, each successive hard decision enables further hard decisions for subsequent bits. In order to guard against erroneous hard decisions affecting all subsequent decisions, a list of the L best hard decision based sequences can be maintained during the so-called Successive Cancellation List (SCL) based decoding process [47], where the best candidate is then output at the end of the process. Following this, the (M − K) frozen bits can be removed and the K recovered information bits can be output. Note that in the special case of L = 1, the SCL algorithm becomes the Successive Cancellation (SC) algorithm.
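The encoding and SC decoding steps described above may be sketched in Python as follows. This is a minimal illustration rather than a reference implementation. It assumes the natural (non-bit-reversed) ordering of the Kronecker transform, an LLR convention in which positive values favour a bit value of 0, and the min-sum approximation at the CNs. It omits puncturing, shortening, repetition and list decoding (L = 1). The function names and the frozen bit pattern used for M = 8 and K = 4 are hypothetical choices made for this example.

```python
import math


def f_minsum(a, b):
    """CN update: min-sum approximation of the box-plus of two LLRs."""
    return math.copysign(1.0, a) * math.copysign(1.0, b) * min(abs(a), abs(b))


def g(a, b, bit):
    """VN update: add LLR a to b, or subtract it, depending on the decided bit."""
    return b + (1 - 2 * bit) * a


def polar_encode(u):
    """Combine the bits of u with XORs according to the Kronecker structure."""
    n = len(u)
    if n == 1:
        return list(u)
    top = polar_encode(u[: n // 2])
    bottom = polar_encode(u[n // 2:])
    return [t ^ b for t, b in zip(top, bottom)] + bottom


def sc_decode(llrs, frozen, offset=0):
    """Return (hard decisions, re-encoded bits) for the sub-block at `offset`."""
    n = len(llrs)
    if n == 1:
        # Frozen bits are known to be zero; other bits take a hard decision.
        bit = 0 if (offset in frozen or llrs[0] >= 0) else 1
        return [bit], [bit]
    a, b = llrs[: n // 2], llrs[n // 2:]
    # LLRs propagate through the CNs first, giving the upper half's decisions.
    u1, x1 = sc_decode([f_minsum(p, q) for p, q in zip(a, b)], frozen, offset)
    # Those decisions are fed back so that the VNs can serve the lower half.
    u2, x2 = sc_decode([g(p, q, c) for p, q, c in zip(a, b, x1)],
                       frozen, offset + n // 2)
    return u1 + u2, [s ^ t for s, t in zip(x1, x2)] + x2


frozen = {0, 1, 2, 4}                 # hypothetical frozen pattern for M = 8, K = 4
u = [0, 0, 0, 1, 0, 1, 1, 1]          # frozen positions carry zeros
x = polar_encode(u)
llrs = [4.0 if bit == 0 else -4.0 for bit in x]   # noiseless LLRs for illustration
u_hat, _ = sc_decode(llrs, frozen)
print("recovered:", u_hat == u)
```

The recursion makes the data dependency explicit: the lower half of each sub-block cannot be processed until the hard decisions of the upper half are available, which is the property that limits the parallelism of SC and SCL decoders.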
In analogy to the fully-serial and fully-parallel turbo decoding schedules of Tables III and IV, polar decoding can be completed using the more-serial SC algorithm and its variants [48], or the fully-parallel belief propagation algorithm [49]. As in turbo and LDPC decoding, a higher degree of parallelism leads to a higher throughput, area and energy efficiency, at the cost of a degraded error correction performance and flexibility, as will be discussed in Section IV.
Owing to the low degrees of its VNs and CNs, the computational complexity of a polar decoder is relatively low compared to that of an LDPC decoder, as determined by the list size L, which scales the complexity linearly, as shown in Figure 6. Like LDPC codes, the structure of polar codes scales with the encoded block length N, rather than with the information block length K as in turbo codes. Owing to this, the complexity and encoded throughput of a polar decoder typically remain constant, when the encoded block length N is kept constant, but the information block length K is changed. However, since 1/R encoded bits must be decoded in order to recover each information bit, the information throughput typically scales proportionately with the coding rate R, as illustrated by the stylized analogy of Figure 5c. Owing to this, as shown in Figure 6, the complexity, hardware efficiency and latency of polar decoders are degraded for lower coding rates and improved for higher coding rates, following a similar trend to that of LDPC decoders. On the same note, similarly to LDPC codes, the computational complexity of a polar code grows with the encoded block length N, as shown in Figures 6 and 7. More specifically, the dimensions of the polar code graph and hence the complexity grow according to the order of $C = O(L \cdot N \log_2 N)$. Note however that a fast simplified successive-cancellation list based algorithm [50] may be used for reducing the computational complexity of polar decoders having low coding rates, for example. More specifically, this technique reduces the complexity of processing the frozen bits, which become more prevalent at low coding rates R.
IV. COMPARISON OF TURBO, LDPC AND POLAR DECODERS
This section provides a comprehensive comparison of 100+ state-of-the-art ASIC implementations of turbo, LDPC and polar decoders. The key performance characteristics of a channel decoder are its processing latency, information throughput, error correction capability, flexibility, area efficiency and energy efficiency, as illustrated in Figure 8. When employed in a particular application, certain minimum performance requirements are imposed upon the information throughput (and on the resultant processing latency), error correction capability and flexibility of the channel decoder ASIC implementations. Within the constraints imposed by meeting these requirements, the design of a channel decoder ASIC should focus on optimising the area- and energy-efficiency, since they directly determine the implementation and running cost or the battery lifetime of basestations and user equipment. Motivated by this, the following subsections consider the information throughput, error correction capability and flexibility of the ASIC implementations in turn, each as functions of the area- and energy-efficiency. Each subsection includes plots, which are derived from the data presented in Table VI of the Appendix. Figure 9 characterises the relationship between the information throughput and area-efficiency of 100+ ASIC implementations of channel decoders, where the shape of the data points distinguishes the turbo, LDPC and polar decoders. Where a channel decoder supports multiple coding rates R and/or block sizes K, the corresponding information throughputs have been averaged in Figure 9. Furthermore, in order to present a fair comparison amongst ASICs implemented at different technology scales, the results of Figure 9 have all been scaled to 65 nm. Explicitly, 65 nm was chosen, since it is the standard technology scale that is closest to the average of the ASICs considered in this paper.
Here, scaling is applied to the performance characteristics of an ASIC implementation using a scaling factor, which is calculated as the ratio between the ASIC's technology scale and the target technology scale of 65 nm. The scaled information throughput, area and power consumption are obtained by multiplying the ASIC's information throughput, area and power consumption by the scaling factor [54], the inverse square of the scaling factor and the inverse of the scaling factor, respectively. Owing to this, the area-efficiency and energy-efficiency are proportional to the cube and the square of the scaling factor, respectively. Figure 9 illustrates the difference between the information throughputs achieved by the turbo, LDPC and polar decoders. The considered turbo decoder implementations appear to have the largest portion of decoders associated with low information throughputs, which may be attributed to their low degrees of parallelism imposed by the data dependencies of the Log-BCJR algorithm, as discussed in Section III-A. Likewise, the parallelism of polar decoder implementations is typically limited, owing to the strict data dependencies of the SC and SCL decoding algorithms, as discussed in Section III-C. Finally, the LDPC decoder implementations have the highest portion of decoders associated with high information throughputs, owing to the high degree of parallelism achieved by the min-sum algorithm, as discussed in Section III-B.
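The technology scaling rule described above can be illustrated using the following Python sketch. The function name, the argument units and the example figures are hypothetical and do not correspond to any decoder in Table VI. Only the scaling exponents are taken from the text.

```python
def scale_to_target(throughput_gbps, area_mm2, power_w, tech_nm, target_nm=65.0):
    """Scale ASIC characteristics from tech_nm to target_nm (65 nm by default)."""
    s = tech_nm / target_nm              # scaling factor
    throughput = throughput_gbps * s     # throughput is multiplied by s
    area = area_mm2 / s ** 2             # area is multiplied by 1/s^2
    power = power_w / s                  # power is multiplied by 1/s
    return {
        "throughput_gbps": throughput,
        "area_mm2": area,
        "power_w": power,
        "area_efficiency_gbps_per_mm2": throughput / area,   # scales with s^3
        "energy_efficiency_bit_per_nj": throughput / power,  # Gbps/W, scales with s^2
    }


# Hypothetical 130 nm decoder: 1 Gbps, 4 mm^2, 0.5 W, so that s = 2.
print(scale_to_target(1.0, 4.0, 0.5, tech_nm=130.0))
```

In this hypothetical case, the unscaled area-efficiency of 0.25 Gbps/mm² becomes 2 Gbps/mm² (a factor of 2³), while the unscaled energy-efficiency of 2 bit/nJ becomes 8 bit/nJ (a factor of 2²).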
A. The Information Throughput vs. Area- and Energy-Efficiency
As mentioned in Section II-A, attaining a peak information throughput of 20 Gbps is a key requirement for 5G communication. As shown in Figure 9, the only turbo decoder that achieves the 20 Gbps information throughput is that of [37], which uses a fully-parallel architecture for the turbo decoding. Likewise, the only LDPC decoders that have reached information throughputs in excess of 20 Gbps have adopted fully-parallel architectures. In particular, a decoded throughput of T = 52 Gbps and a latency of 0.045 μs have been demonstrated in [45] for the 10GBASE-T LDPC code, which has a block length of K = 1723 bits and a coding rate of R = 0.84. Likewise, the state-of-the-art LDPC ASIC implementation of [55] achieves not only a high information throughput, but also an outstanding energy efficiency of 987 bit/nJ. However, the LDPC decoders of [45] and [55] only support this single combination of block length K and coding rate R, which reflects the relatively inflexible nature of LDPC decoders in general and the rigidity of fully-parallel LDPC decoders in particular. The only polar decoders that achieve a decoded throughput in excess of 20 Gbps have managed this by unrolling hundreds of data blocks and pipelining their successive cancellation decoding through the same hardware [56], as indicated in Table VI. Using this approach, an extremely high throughput of 208 Gbps has been demonstrated for a coding rate of R = 1/2, although the latency experienced by each K = 1024-bit data block is 3.21 μs. However, this pipelining approach severely limits the flexibility of the polar decoder. Furthermore, the SC technique for polar decoding results in a degraded error correction capability, compared to the more complex SCL decoding technique, as will be discussed in Section IV-B.
However, a high degree of parallel processing implies having a large chip area, and a high cost. For this reason, the key consideration in the selection of the 5G channel code is the area-efficiency, which quantifies the ratio of information throughput to chip area, as seen in Figure 9 . At first sight, it may be expected that the area efficiency should remain constant upon varying the degree of parallel processing, since this scales both the information throughput and the area.
However, Figure 9 shows that decoders having higher information throughputs and degrees of parallel processing tend to have higher area efficiencies. This may be explained by the observation that while the datapath area of a decoder scales linearly with the parallelism, the memory and controller parts tend to increase slower than linearly. Owing to this, decoders having a high information throughput tend to have high area-efficiencies, which are dominated by the particular parts that scale with the degree of parallelism, such as the datapath. By contrast, decoders having a low information throughput tend to have low area-efficiencies, which are dominated by the specific parts that do not scale with the degree of parallelism, such as the memory and controller. For similar reasons, the same trend is shown in Figure 9 in terms of power-efficiency, with higher information throughputs leading to higher power-efficiency.
However, there are some special cases that do not follow the tendency observed above, such as the fully-parallel turbo decoder implementation of [37]. Here, the number of clock cycles required per decoding iteration indeed scales down linearly with the degree of parallelism. However, the number of iterations required by a fully-parallel turbo decoder is higher than that of a conventional turbo decoder, when aiming for achieving the same error correction capability. Owing to this, the achievable information throughput does not scale up linearly with the degree of parallelism, leading to reduced area and power efficiency.
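The general trend between parallelism and area-efficiency discussed in this subsection can be reproduced using a deliberately simple model, sketched below in Python. It assumes that the memory and controller occupy a fixed area, which is a stronger simplification than the sub-linear growth described above, and that both the datapath area and the throughput grow linearly with the degree of parallelism. All parameter values are hypothetical. The model does not capture exceptions such as an iteration count that grows with the parallelism.

```python
def area_efficiency(parallelism, fixed_area_mm2=2.0,
                    datapath_area_mm2=0.05, unit_throughput_gbps=0.1):
    """Area-efficiency (Gbps/mm^2) of a toy decoder model.

    fixed_area_mm2: memory and controller area, assumed not to scale.
    datapath_area_mm2: datapath area added per unit of parallelism.
    unit_throughput_gbps: throughput added per unit of parallelism.
    """
    throughput = unit_throughput_gbps * parallelism
    area = fixed_area_mm2 + datapath_area_mm2 * parallelism
    return throughput / area


for p in (1, 8, 64, 512):
    print(f"parallelism {p:4d}: {area_efficiency(p):.3f} Gbps/mm^2")
```

In this model, the area-efficiency rises with the parallelism and saturates towards the ratio of the per-unit throughput to the per-unit datapath area, once the fixed area becomes negligible.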
B. Error Correction Capability vs. Area- and Energy-Efficiency
As described in Section II-C, there is a 5G requirement to facilitate reliable error correction at channel Signal to Noise Ratios (SNRs) per bit ($E_b/N_0$) that are as close as possible to the channel capacity bound, where a BLER of $10^{-2}$ is targeted as a complement to HARQ in eMBB applications and $10^{-5}$ is targeted for URLLC applications. Figure 10 characterises the discrepancy between the channel capacity $E_b/N_0$ bound and the specific $E_b/N_0$ value, where a BER of $10^{-4}$ is obtained for various channel decoder implementations, when using Binary Phase Shift Keying (BPSK) modulation for communication over an Additive White Gaussian Noise (AWGN) channel. This approach is adopted, since nearly all of the papers considered present BER plots, rather than BLER plots, and consider BPSK and AWGN. Figure 10 also presents the scaled area-efficiency, while the shape and color of the data points indicate the type of decoder and the scaled energy-efficiency. Note that the legend of Figure 10 is inherited from Figure 9, which is provided below this figure again for convenience. Note that fewer data points are presented in Figure 10 than were presented in Figure 9, since many papers do not characterise the error correction capability of their proposed decoder.
Rather than comparing the minimum $E_b/N_0$ value required for each channel decoder implementation to achieve the target BER, Figure 10 compares the discrepancy between this $E_b/N_0$ value and the channel capacity $E_b/N_0$ bound, which gives a fairer comparison of the error correction capability of various channel decoder implementations having different coding rates R. In order to illustrate this, Figure 13 plots the Discrete-input Continuous-output Memoryless Channel (DCMC) capacity associated with using BPSK for communication over an AWGN channel as a function of $E_b/N_0$. This is compared with points corresponding to each channel decoder implementation, which plot the spectral efficiency of each decoder versus the minimum $E_b/N_0$ value that it requires to achieve a BER of $10^{-4}$, when using BPSK to communicate over an AWGN channel. If ideal Nyquist pulse shaping having zero excess-bandwidth is employed, the spectral efficiency of a decoder is numerically equal to its effective throughput of $\eta = R \times \log_2(M)$ [57], where R is the coding rate of the decoder and M = 2 is the number of constellation points employed by BPSK, resulting in $\eta = R$. Note that while many of the papers considered present flexible channel decoders that support different coding rates and hence different spectral efficiencies, they typically only provide hardware performance results for a single coding rate and spectral efficiency, as plotted in Figure 13. Figure 10 shows that turbo decoders tend to have the best error correction capability, with most of them exhibiting less than 1.5 dB discrepancies from the channel capacity. This may be explained by the long block lengths K, comprising thousands of information bits that are supported in the Long Term Evolution (LTE) turbo code. This enables stronger error correction than is possible when using blocks comprising only hundreds of information bits, as is typical for the LDPC and polar decoders considered.
Owing to this, most LDPC and polar decoders exhibit more than 1.5 dB discrepancy from channel capacity, which may even exceed 2.5 dB. However, there are some specific LDPC and polar decoder implementations that offer similar error correction capability to turbo decoders. The polar decoder of [48] has a 1.5 dB discrepancy from the channel capacity, as a benefit of using the SCL decoding algorithm with a list size of L = 4, rather than the SC algorithm. Outside of the ASIC implementation literature, it has been shown [87] that polar decoders employing large list sizes can achieve superior error correction capability to LDPC and turbo decoders employing large numbers of iterations, particularly at short block lengths K. Furthermore, the LDPC decoder of [63] and several other similar LDPC decoder implementations also exhibit reliable error correction capability as well as extremely high area- and energy-efficiency. These LDPC decoders adopt fully-parallel architectures, in order to achieve the high information throughputs supported by the 10GBASE-T Ethernet standard. This allows these decoders to perform a high number of decoding iterations, hence enabling reliable error correction. Note that state-of-the-art LDPC and polar code constructions have been adopted by the 3GPP 5G New Radio standard. It is expected that the error correction performance loss compared to turbo codes will soon be eliminated when implementations of the 5G LDPC and polar codes [42] become available for comparison. Alternatively, the non-binary LDPC ASIC implementation of [66] has largely reduced the error correction performance gap from turbo decoders. Figure 10 suggests that the error correction capability is not directly correlated with the area- and energy-efficiency, as inferred from the various decoders considered here.
However, for a particular decoder, it may be expected that performing more iterations or using a larger SCL list size would improve the error correction capability, albeit at the cost of degrading the information throughput and hence both the area- and energy-efficiency.
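The capacity discrepancy metric used in this subsection can be illustrated by the following Python sketch, which numerically evaluates the DCMC capacity of BPSK over an AWGN channel and searches for the $E_b/N_0$ value at which that capacity equals the coding rate R. The function names, the integration range of ten standard deviations, the step count and the bisection bracket are all assumptions made for this illustration, and the example decoder operating point is hypothetical.

```python
import math


def bpsk_awgn_capacity(sigma, steps=4000):
    """DCMC capacity (bit/symbol) of +1/-1 BPSK in AWGN of standard deviation sigma."""
    lo, hi = 1.0 - 10.0 * sigma, 1.0 + 10.0 * sigma
    dy = (hi - lo) / steps
    loss = 0.0
    for i in range(steps + 1):
        y = lo + i * dy
        pdf = math.exp(-((y - 1.0) ** 2) / (2.0 * sigma ** 2)) / (
            sigma * math.sqrt(2.0 * math.pi))
        llr = 2.0 * y / sigma ** 2
        # log2(1 + exp(-llr)), arranged so that exp() cannot overflow
        term = (max(-llr, 0.0) + math.log1p(math.exp(-abs(llr)))) / math.log(2.0)
        weight = 0.5 if i in (0, steps) else 1.0   # trapezoidal rule
        loss += weight * pdf * term * dy
    return 1.0 - loss


def capacity_bound_ebn0_db(rate):
    """Lowest Eb/N0 (dB) at which the BPSK DCMC capacity reaches the rate R."""
    lo, hi = 0.05, 10.0            # assumed bracket for sigma
    for _ in range(50):            # bisection: capacity decreases as sigma grows
        mid = 0.5 * (lo + hi)
        if bpsk_awgn_capacity(mid) > rate:
            lo = mid
        else:
            hi = mid
    sigma = 0.5 * (lo + hi)
    # With unit symbol energy and N0 = 2 * sigma^2, Eb/N0 = 1 / (2 * R * sigma^2).
    return 10.0 * math.log10(1.0 / (2.0 * rate * sigma ** 2))


def capacity_gap_db(required_ebn0_db, rate):
    """Discrepancy between a decoder's required Eb/N0 and the capacity bound."""
    return required_ebn0_db - capacity_bound_ebn0_db(rate)


# Hypothetical decoder with R = 1/2 that needs 1.7 dB to reach the target BER.
print(f"bound: {capacity_bound_ebn0_db(0.5):.2f} dB")
print(f"gap:   {capacity_gap_db(1.7, 0.5):.2f} dB")
```

For R = 1/2, the bound evaluates to approximately 0.19 dB, so the hypothetical decoder in this example would have a discrepancy of about 1.5 dB. Since the bound itself rises with the coding rate, comparing discrepancies rather than raw $E_b/N_0$ values avoids penalising decoders that operate at high coding rates.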
C. Reconfiguration Flexibility vs. Area- and Energy-Efficiency
As described in Section II-D, 5G NR requires the channel decoder to support a wide variety of information block lengths K, as well as coding rates R = K/N. In this section, we quantify the flexibility of each channel decoder by the number of supported combinations of block length K and coding rate R. Figure 11 characterises the flexibility, scaled average area-efficiency and energy-efficiency that is achieved by each turbo, LDPC and polar decoder ASIC considered. In the case of the turbo decoders, the number of information block lengths K supported is determined by the number of interleavers supported, which is 188 for the LTE turbo code. As we mentioned in Section III-A, turbo codes achieve single-bit granularity for the encoded block length N using puncturing or repetition, and hence they support a large number of coding rates R. Rather than quantifying the very large number of combinations of K and R supported by the turbo decoders, Figure 11 quantifies only the number of supported information block lengths K instead. Hence, the flexibility of the turbo decoders considered is even better than that quantified in Figure 11. Note that the 5G LDPC and polar codes [42] flexibly support single-bit granularity for both the information- and encoded-block length, although published implementations are not currently available in the open literature for consideration in this paper.
By contrast, LDPC codes used in the existing standards require a different PCM for each supported combination of block length K and coding rate R, hence resulting in the relatively low flexibilities observed in Figure 11. In principle, polar codes can have single-bit granularity of the block length K and coding rate R, using only a single frozen bit selection sequence. However, this high degree of flexibility has not yet been demonstrated in any existing ASIC implementations. Indeed, the majority of polar decoder ASICs in the literature support only a single combination of block length K and coding rate R, as shown in Figure 11. This may be because many of the existing polar decoder ASIC implementations in the literature have focused their attention on the implementation of the SC or SCL algorithm, without any consideration of the circuits required for frozen bit insertion or rate matching. This is because these circuits can operate relatively independently of the core decoder, which is reminiscent of the rate matching circuit of a turbo decoder.
As shown in Figure 11 , the turbo decoder implementations considered have the highest flexibility among the three types of channel decoders, as discussed above. The LDPC decoder ASICs that achieve the highest area-efficiencies [45] , [58] , [63] , [67] - [72] typically support only a single combination of block length K and coding rate R. This is because these LDPC decoders adopt fully-parallel architectures, which only support a single PCM and a high coding rate R. More specifically, these ASICs are physically laid out in a manner that resembles the factor graph structure [88] described by the PCM, using hard-wired connections between registers and dedicated computational hardware for each VN and CN. While this approach prevents flexibility, it allows the LDPC decoding process to be completed using a minimal number of clock cycles and without the requirement for additional memory, switchable interconnections or a complex controller. Owing to this lack of flexibility, it is the LDPC decoders of [45] , [58] , [63] , and [67] - [72] that offer the best throughputs, latencies, hardware-efficiencies and energy-efficiencies among all of the channel decoder ASICs, as described above. By contrast, the flexible LDPC decoder ASICs of Figure 11 support more than one PCM by employing partially-parallel architectures, such as the row parallel [44] or block parallel [43] architecture. More specifically, these ASICs employ a bank of computational hardware, which can be flexibly reused at different times to perform the processing associated with different VNs and CNs of different PCMs. However, this approach requires the employment of additional memory, switchable interconnections and a complex controller. As exemplified in Figure 13 , these components typically occupy around 75% of the chip area in the case of LDPC decoders that support around 100 PCMs, as may be required for meeting the flexibility requirement of 5G. 
Owing to this, it is these additional hardware components that dominate the throughput, latency, hardware-efficiency and energy-efficiency of the resultant ASICs, particularly as the number of supported PCMs is increased, as shown in Figure 11. This may be attributed to the irregular structures and high interconnection complexity, as described in Section III. By contrast, turbo decoders have regular structures associated with significantly reduced interconnection complexities. As a result, the corresponding hardware typically occupies only 30% of the chip area, as exemplified in Figure 13. Owing to this, partially-parallel turbo decoders typically offer superior flexibilities, information throughputs, latencies, hardware efficiencies and energy efficiencies than partially-parallel LDPC decoders, as shown in Figure 11.
[Figure caption: (a) ... [86], in which the non-computational components (labeled 'MEMORIES' and 'ICNWs') occupy 30% of the area. (b) ASIC layout of the partially-parallel LDPC decoder of [52], in which the non-computational components (labeled 'ROM', 'CN Memory FIFOs', 'CV Memory', 'VNU FIFO-buffer', 'Address generator' and 'CN Memory FIFOs') occupy 75% of the area.]
D. Summary
Having compared numerous turbo, LDPC and polar decoder implementations in terms of their information throughput, error correction performance, flexibility, area efficiency and energy efficiency, some relative advantages and disadvantages can be observed. The ASIC implementations of turbo decoders offer the best error correction performance owing to their natural support for low coding rates and long block lengths, as exemplified in [64]. However, limited by the data dependencies of the Log-BCJR algorithm, most turbo decoder ASIC implementations have low area- and power-efficiencies compared to LDPC and polar decoders having the same level of information throughput. ASIC implementations of LDPC decoders have been designed for conformance with various standards. For example, the 10GBASE-T LDPC decoders of [45] and [58] both offer high information throughput, as well as high area- and energy-efficiencies, but no flexibility. Polar decoder ASIC implementations offer outstanding error correction performance for short block lengths, offering about 1 dB or more coding gain relative to LDPC decoders having short block lengths. However, polar decoders have relatively limited capability to achieve high information throughput, especially at high area- and energy-efficiencies, unless low list sizes resulting in a degraded error correction capability are adopted, as exemplified in [59]. Additionally, polar codes potentially exhibit a high flexibility, although this has not been demonstrated in the open literature as yet. In summary, these types of channel decoder implementations have relative advantages and disadvantages, which must be considered when selecting a channel code for a practical application.
V. DESIGN RECOMMENDATIONS AND CONCLUSION
In closing, we present an overview of our design recommendations for future ASIC channel decoder implementations.
We commenced by discussing the channel coding requirements of 5G, which include high throughput, low latency, strong error correction capability and low implementation complexity. We have also highlighted the challenging 5G requirements imposed on the channel code to flexibly support a wide variety of block lengths K and coding rates R, in order to ensure the efficient exploitation of the bandwidth available and to address the challenging applications of 5G. We have provided comprehensive discussions on the extent to which the 5G requirements can be met by the turbo, LDPC and polar codes. In order to support our discussions, we also presented several plots to characterise a diverse set of as many as 110 ASIC implementations of turbo, LDPC and polar decoders. We have characterised the fundamental trade-offs between the various performance characteristics of channel decoder ASICs. Furthermore, we have demonstrated that the overall implementation complexity of a channel code depends not only on its computational complexity, but also on its interconnection complexity and its inherent flexibility.
Our findings are summarised by the comparison of the three types of channel decoder implementations presented in Table V. We observed that most existing turbo decoders offer limited information throughput owing to the serial nature of the Log-BCJR algorithm. This leads to lower area- and energy-efficiencies at most of the practical coding rates R, although they maintain these efficiencies at low coding rates, in contrast to the family of LDPC and polar decoders. However, among the decoder ASICs considered in this paper, the turbo decoders have the best error correction performance and offer the highest flexibility in terms of supporting various combinations of block lengths and coding rates. By contrast, LDPC and polar decoder implementations have both a high information throughput as well as high area- and energy-efficiencies for most practical coding rates. However, the state-of-the-art in LDPC and polar coding has recently been substantially advanced during the design of 3GPP 5G New Radio. More specifically, the New Radio implementations of both codes enable flexibility both in terms of the block length and coding rate with single-bit granularity. Furthermore, the error correction performance of both codes has been improved relative to previous implementations. While polar codes offer the best error correction performance at short block lengths, turbo, LDPC and polar codes offer similar performance at block lengths above 1000 bits. By contrast, most of the LDPC and polar decoder implementations considered in this paper support only shorter block lengths of up to 1024 bits, where their error correction ability is weaker. While it may be expected that flexible LDPC and polar decoder implementations of 3GPP 5G New Radio codes will emerge in the future, many of the ASIC implementations that were available at the time of writing this paper do not meet the error correction performance and flexibility requirements of 5G NR.
In the case of the LDPC decoders, this shortfall may be explained by the limited error correction performance and flexibility of the LDPC codes used in the existing standards. Furthermore, despite their great potential, the implementation of polar decoders having full flexibility has not been demonstrated in the previous ASIC implementations. Likewise, at the time of writing only low list sizes have been considered for the ASIC implementation of SCL polar decoders, while an improved error correction performance can only be expected at higher list sizes.
Owing to the lack of some important details in some publications, the comparison of the many ASIC channel decoder implementations considered in this paper has been challenging. Therefore, we recommend that future ASIC channel decoder designers include the following important details in their implementation-oriented papers:
• Provide values for every parameter of the algorithm and implementation, especially the most important parameters considered in Table VI.
• State both the average and the worst-case processing latency of the decoder for a variety of block lengths and coding rates, or provide an approximate formulaic relationship between the information throughput and processing latency.
• Provide BLER plots and tabulate the $E_b/N_0$ value that is required to achieve a BLER of $10^{-2}$ by the decoder, for a variety of block lengths and coding rates. Identify the type of simulated channel and modulation scheme, which should preferably be BPSK modulation for transmission over an AWGN channel, since this allows direct comparison with the majority of previous publications.
• Provide the power- or energy-consumption for a variety of block lengths and coding rates.
• Quantify the flexibility of the decoder in terms of the number of supported block length and coding rate combinations.
There is still a significant amount of further work that can be completed on the implementation of channel decoders. In particular, there is a need for further research on flexible, high-performance, high-efficiency implementations of turbo, LDPC and polar decoders. For example, the hardware implementation of irregular turbo decoding [89] has not yet been demonstrated, which would improve the attainable error correction capability. Likewise, the performance of LDPC decoders can be significantly enhanced by invoking informed dynamic scheduling [90]. However, this benefit comes at the cost of increased implementation complexity owing to its adaptive decoding schedules. Polar decoding can be enhanced by implementing Soft-Input Soft-Output (SISO) decoding [91], which would facilitate ASIC-based turbo detection for the first time for polar codes. Likewise, the ASIC implementation of HARQ [92] for polar decoding is another challenge that remains unsolved at the time of writing. It is not possible to say that any one type of channel code is superior to any other, since all codes have different advantages and disadvantages.
APPENDIX
See Table VI.
