ABSTRACT In wireless communication schemes, turbo codes facilitate near-capacity transmission throughputs by achieving reliable forward error correction. However, owing to the serial data dependencies imposed by the underlying logarithmic Bahl-Cocke-Jelinek-Raviv (Log-BCJR) algorithm, the limited processing throughputs of conventional turbo decoder implementations impose a severe bottleneck upon the overall throughputs of real-time wireless communication schemes. Motivated by this, we recently proposed a fully parallel turbo decoder (FPTD) algorithm, which eliminates these serial data dependencies, allowing parallel processing and hence offering a significantly higher processing throughput. In this paper, we propose a novel resource-efficient version of the FPTD algorithm, which reduces its computational resource requirement by 50%, thereby enhancing its suitability for field-programmable gate array (FPGA) implementations. We propose a model FPGA implementation. When using a Stratix IV FPGA, the proposed FPTD FPGA implementation achieves an average throughput of 1.53 Gb/s and an average latency of 0.56 µs, when decoding frames comprising N = 720 b. These are, respectively, 13.2 times and 11.1 times superior to those of the state-of-the-art FPGA implementation of the Log-BCJR long-term evolution (LTE) turbo decoder, when decoding frames of the same frame length at the same error correction capability. Furthermore, our proposed FPTD FPGA implementation achieves a normalized resource usage of 0.42 (kALUTs/Mb/s), which is 5.2 times superior to that of the benchmarker decoder. Moreover, when decoding the shortest N = 40-b LTE frames, the proposed FPTD FPGA implementation achieves an average throughput of 442 Mb/s and an average latency of 0.18 µs, which are, respectively, 21.1 times and 10.6 times superior to those of the benchmarker decoder. In this case, the normalized resource usage of 0.08 (kALUTs/Mb/s) is 146.4 times superior to that of the benchmarker decoder.
FIGURE 2. The novel contributions of this work, compared with our previous work of [37].
on the Log-BCJR algorithm, our FPTD algorithm does not have data dependencies within each half of each turbo decoding iteration [38] . This facilitates fully-parallel processing, allowing each half-iteration to use only a single clock cycle, although this is achieved at the cost of the FPTD typically requiring seven times as many iterations for achieving the same error correction capability as the state-of-the-art turbo decoding algorithm. Despite this, our previous contribution of [37] shows that a fixed-point ASIC implementation of this FPTD algorithm is capable of achieving a processing throughput as high as 21.9 Gbit/s, which is 17.1 times superior to the state-of-the-art Log-BCJR based turbo decoder of [19] , when implemented using the same TSMC 65 nm technology and decoding the longest N = 6144-bit LTE frames.
Against this background, this paper proposes a novel fixed-point FPTD architecture, which implements the fixed-point FPTD algorithm using 50% fewer hardware resources per message bit than our previous fixed-point FPTD architecture [37], as shown in Figure 2. We also propose a novel FPGA implementation of the proposed FPTD architecture, which is suitable for MCMTC applications. The main experimental results of this work are listed as follows.
• The proposed Stratix IV FPGA implementation of the proposed FPTD architecture achieves an average throughput of 1.5 Gbit/s and an average latency of 0.56 µs, when decoding frames comprising N = 720 bits, as may be found in MCMTC applications [8], [9]. These are respectively 13.2 times and 11.1 times superior to those of the state-of-the-art FPGA implementation of the LTE turbo decoder [14] based on the Log-BCJR algorithm, when decoding the same frame length of N = 720 bits.
• The proposed FPGA implementation of the 720-bit FPTD has a normalized resource usage of 0.42 kALUTs/(Mbit/s), where the Adaptive Look-Up Tables (ALUTs) are the fundamental programmable hardware resources adopted by the FPGA. This is 5.2 times superior to the 22 kALUTs/(Mbit/s) that is obtained for the benchmarker decoder of [14]. Likewise, it is 1.3 times superior to the 0.55 kALUTs/(Mbit/s) recorded for a specific version of the benchmarker optimized for frame lengths N in the range spanning from 512 to 1024 bits.
• When decoding the shortest N = 40-bit LTE frame, the proposed FPTD achieves an average throughput of 442 Mbit/s and an average latency of 0.18 µs, which are respectively 21.1 times and 10.6 times superior to those of the benchmarker decoder of [14]. In this case, the normalized resource usage is 146.4 times lower and 19 times lower than that of the benchmarker decoder of [14] and that of a specific version optimized for frame lengths N in the range of 40 to 512 bits, respectively.
The rest of the paper is organized as follows, as summarized in Figure 3. In Section II, we discuss the motivation for using turbo codes in MCMTC applications. In Section III, we offer background discussions on our previously proposed fixed-point FPTD algorithm [37], which was designed for VLSI applications. In Section IV, we propose a novel fixed-point FPTD architecture which benefits from a 50% lower hardware resource requirement than our previous architecture [37]. In Section V, we detail the FPGA implementation of our novel resource-efficient FPTD architecture, including its top-level schematic as well as its input/output memory and Cyclic Redundancy Check (CRC) circuit. In Section VI, we characterize the proposed FPGA implementation of our resource-efficient fixed-point FPTD, in terms of its processing throughput, processing latency, hardware resource usage and energy consumption, where the first three items are compared to those of the state-of-the-art Log-BCJR turbo decoder FPGA implementations. Finally, we offer our conclusions in Section VII.
[12] and [39], for the case of BPSK transmission over an AWGN channel.
II. TURBO CODES FOR MCMTC APPLICATIONS
As described in Section I, turbo codes are attractive in applications such as MCMTC, since they offer a strong error correction capability even for short message frames. More specifically, Figure 4 shows the $E_b/N_0$ values where a Frame Error Ratio (FER) of $10^{-3}$ is achieved by the LTE turbo code for different message frame lengths $N$ in the range of 40 to 6144 bits supported by LTE. Note that these results correspond to the case of using an LTE turbo coding rate of $R = 1/3$ combined with Binary Phase Shift Keying (BPSK) modulation for transmission over an AWGN channel, and using a sufficiently high number of turbo decoding iterations for achieving iterative decoding convergence. For comparison with the LTE turbo code, Figure 4 also shows the $E_b/N_0$ values where an FER of $10^{-3}$ is achieved by an $R = 1/3$-rate Accumulate-Repeat-Accumulate (ARA)-LDPC code and an $R = 1/2$-rate Progressive-Edge-Growth (PEG)-LDPC code, as considered using similar analysis in [12] and [39], respectively. It may be seen that the LTE turbo code outperforms the ARA-LDPC and PEG-LDPC codes for all frame lengths, where the maximum gap of approximately 1 dB is achieved when the frame length is $N = 40$ bits. Note that according to [12], the performance of the LDPC code used in WiMAX is very similar to that of the PEG-LDPC code shown in Figure 4, but the frame lengths $N$ supported by the WiMAX LDPC are constrained to the range of 288 to 1152 bits for the code rate of $R = 1/2$ [40].
Additionally, Figure 4 shows a selection of capacity bounds, which provide a wider context for the performance achieved by the turbo and LDPC codes considered. More specifically, the Continuous-Input Continuous-Output Memoryless Channel (CCMC) Shannon capacity [41] and the modulation-specific Discrete-input Continuous-output Memoryless Channel (DCMC) capacity bound [42] for the combination of $R = 1/3$ channel coding, BPSK modulation and AWGN channel are represented by the pair of horizontal lines at $E_b/N_0 = -0.55$ dB and $E_b/N_0 = -0.49$ dB, respectively. However, the Shannon capacity and DCMC capacity provide bounds that only apply for infinitely long frame lengths, and therefore do not offer an accurate prediction of the achievable error correction capability of practical channel codes having short message frame lengths of the order of dozens or hundreds of bits. Motivated by this, the converse bound [43] of Figure 4 represents a lower bound on the achievable error correction capability of practical channel codes as a function of the message frame length $N$, offering a better estimation for short message frames than the DCMC and Shannon capacity bounds. Furthermore, the Kappa-beta and Gallager bounds [43]-[45] of Figure 4 offer further refinements of the estimated error correction capability that is achievable by practical channel codes. Compared to these refined bounds, the LTE turbo code can be seen in Figure 4 to offer near-optimal error correction capability for both short message frames and long message frames, which motivates the employment of turbo codes in MCMTC applications.
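The CCMC value quoted above can be checked with a short calculation. The following is a minimal sketch, assuming that the horizontal CCMC line corresponds to the capacity of a real-valued (one-dimensional) AWGN channel, as is appropriate for BPSK; the function name is hypothetical and is not taken from the paper.

```python
import math

def ccmc_ebn0_limit_db(rate):
    """Minimum Eb/N0 in dB at which a real-valued AWGN channel supports
    `rate` information bits per channel use (assumed channel model)."""
    # Real AWGN capacity: C = 0.5 * log2(1 + 2 * R * Eb/N0).
    # Setting C = R and solving for Eb/N0 gives the expression below.
    ebn0_linear = (2 ** (2 * rate) - 1) / (2 * rate)
    return 10 * math.log10(ebn0_linear)

# For the R = 1/3 coding rate considered in Figure 4
print(round(ccmc_ebn0_limit_db(1 / 3), 2))  # -0.55
```

Under this assumption, the result agrees with the $-0.55$ dB line quoted for the CCMC capacity. The DCMC value of $-0.49$ dB requires numerical integration over the BPSK-input channel and is not reproduced here.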
III. FIXED-POINT FULLY-PARALLEL TURBO DECODER
The floating-point FPTD algorithm was originally proposed in [38] . Following this, in [37] we proposed a fixed-point version of the FPTD algorithm, which was optimized for the LTE turbo code and was implemented as an ASIC. In this section, we briefly summarize our previously proposed fixed-point LTE FPTD algorithm as follows. In Section III-A, we discuss the top-level operation of the FPTD algorithm, using the schematics of Figures 5 and 6 . In Sections III-B and III-C respectively, we summarize the fixed-point algorithmic blocks of Figure 8 and the termination unit of Figure 11 , which may be employed for implementing the FPTD algorithm of Figure 5 .
A. SCHEMATIC
Figure 5 shows the schematic of the FPTD algorithm proposed in [38]. When decoding $N$-bit message frames, the FPTD algorithm comprises two rows of $N$ identical algorithmic blocks, where the blocks of the upper and lower rows are labeled as $\{u_1, u_2, \ldots, u_N\}$ and $\{l_1, l_2, \ldots, l_N\}$, respectively. The upper row is analogous to the upper decoder of the conventional Log-BCJR turbo decoder, while the lower row corresponds to the lower decoder, where the two rows are connected by an LTE interleaver. A termination unit comprising unshaded algorithmic blocks is appended to the tail of each row, in order to comply with the LTE termination mechanism [7]. As in the Log-BCJR algorithm, the FPTD algorithm operates on the basis of Logarithmic Likelihood Ratios (LLRs) [46], where each LLR $\bar{b}$ conveys soft information pertaining to the corresponding bit $b$ within the turbo encoder. Note that in the rest of this paper, the superscripts 'u' and 'l' seen in the notation of Figure 5 are used only when necessary for explicitly distinguishing the upper and lower components of the turbo code, but they are omitted in discussions that apply equally to both. When decoding message frames comprising $N$ bits, the upper and lower decoders each accept a set of $(N + 3)$ a priori parity LLRs, among other a priori LLRs, where the frame length satisfies $N \in [40, 6144]$ in the LTE turbo code. These a priori LLRs are provided by the demodulator and are stored in the corresponding registers of Figure 5, comprising a total of $(3N + 12)$ LLRs, in accordance with the LTE standard and as in the conventional Log-BCJR turbo decoder.
FIGURE 5. Schematic of the FPTD algorithm of [38].
Like the conventional Log-BCJR turbo decoder, the FPTD algorithm relies on iterative operation. However, rather than requiring 64 to 192 clock cycles per iteration, each iteration of the FPTD algorithm comprises only two clock cycles, which are referred to as half-iterations. More specifically, Figure 6(a) shows that the first half-iteration of the FPTD algorithm corresponds to the simultaneous operation of the lightly-shaded algorithmic blocks shown in Figure 5 within a single clock cycle. These lightly-shaded blocks comprise the algorithmic blocks in the upper row having odd indices $\{u_1, u_3, u_5, \ldots\}$ and the even-indexed algorithmic blocks in the lower row $\{l_2, l_4, l_6, \ldots\}$. By contrast, Figure 6(b) shows that the second half-iteration corresponds to the simultaneous operation of the remaining algorithmic blocks within a single clock cycle, which are darkly-shaded in Figure 5. During the $t$-th clock cycle of the decoding process, the $k$-th algorithmic block, where $k \in [1, N]$, processes the a priori LLRs $\bar{b}^{a,t-1}_{1,k}$, $\bar{b}^{a}_{2,k}$ and $\bar{b}^{a}_{3,k}$. Here, $\bar{b}^{a,t-1}_{1,k}$ was generated in the $(t-1)$-st clock cycle by interleaving the appropriate extrinsic message LLR provided by an algorithmic block in the other row. In addition to the a priori LLRs $\bar{b}^{a,t-1}_{1,k}$, $\bar{b}^{a}_{2,k}$ and $\bar{b}^{a}_{3,k}$, the algorithmic block also consumes a set of $M$ forward-oriented state metrics $\bar{\alpha}^{t-1}_{k-1}$ and a set of $M$ backward-oriented state metrics $\bar{\beta}^{t-1}_{k}$. Here, $\bar{\alpha}^{t-1}_{k-1}$ is generated in the previous $(t-1)$-st clock cycle by the preceding $(k-1)$-st algorithmic block in the same row, while $\bar{\beta}^{t-1}_{k}$ is generated in the previous $(t-1)$-st clock cycle by the following $(k+1)$-st algorithmic block in the same row. As shown in Figure 5, registers are required for storing $[\bar{b}^{a,t-1}_{1,k}]_{k=1}^{N}$, $[\bar{\alpha}^{t-1}_{k-1}]_{k=1}^{N}$ and $[\bar{\beta}^{t-1}_{k}]_{k=1}^{N}$ between the consecutive clock cycles, since they are generated by connected algorithmic blocks in the clock cycle before they are used.
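The odd-even schedule described above may be illustrated as follows. This is a minimal sketch rather than part of the proposed design, where the function name and the convention of counting clock cycles from $t = 1$ are assumptions made purely for illustration.

```python
def active_blocks(t, n):
    """Return (upper, lower): the 1-based indices of the algorithmic blocks
    operated in clock cycle t (t = 1, 2, ...) for an n-bit frame."""
    odd = list(range(1, n + 1, 2))
    even = list(range(2, n + 1, 2))
    if t % 2 == 1:
        # First half-iteration: odd-indexed upper blocks, even-indexed lower blocks
        return odd, even
    # Second half-iteration: the remaining blocks
    return even, odd

upper, lower = active_blocks(1, 6)
print(upper, lower)  # [1, 3, 5] [2, 4, 6]
```

In every clock cycle exactly $N$ of the $2N$ algorithmic blocks are active, which is the observation that the resource-efficient architecture of Section IV relies upon.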
B. ALGORITHMIC BLOCK
Within each of the clock cycles during which the $k$-th algorithmic block in either row of Figure 5 is activated, it accepts inputs and generates outputs according to (2), (3), (4) and (5). Here, (2) is used to obtain a metric $\bar{\gamma}^{t}_{k}(S_{k-1}, S_k)$ for each possible transition between a pair of states $S_{k-1}$ and $S_k$, as shown in the LTE state transition diagram of Figure 7. Note that each transition implies a particular binary value for the corresponding message bit $b_{1,k}$, where the systematic bits are defined as having values that are identical to the corresponding message bits. Following this, (3) and (4) are employed for obtaining the extrinsic forward-oriented state metrics $\bar{\alpha}^{t}_{k}$ and the extrinsic backward-oriented state metrics $\bar{\beta}^{t}_{k-1}$, respectively, in accordance with the transitions of Figure 7. Furthermore, the Jacobian logarithm [18], [47] is defined as
$$\max{}^{*}(\delta_1, \delta_2) = \max(\delta_1, \delta_2) + \ln\left(1 + e^{-|\delta_1 - \delta_2|}\right). \qquad (6)$$
However, its approximated version of
$$\max{}^{*}(\delta_1, \delta_2) \approx \max(\delta_1, \delta_2) \qquad (7)$$
may be employed for reducing the computational complexity of the FPTD algorithm, in analogy with the Max-Log BCJR algorithm [18], [47]. Finally, (5) is employed for obtaining the extrinsic LLR $\bar{b}^{e,t}_{1,k}$, where the associative property of the max* operator may be exploited for extending (6) and (7) to more than two operands.
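The relationship between the Jacobian logarithm and its approximation can be sketched in a few lines. The function names below are hypothetical and the floating-point arithmetic is for illustration only, since the FPTD itself operates on fixed-point values.

```python
import math
from functools import reduce

def max_star_exact(a, b):
    """Jacobian logarithm: ln(exp(a) + exp(b)) written as a max plus a correction."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))

def max_star_approx(a, b):
    """Max-Log style approximation, which drops the correction term."""
    return max(a, b)

def max_star_n(values, op=max_star_approx):
    """Extend a two-operand max* to several operands using associativity."""
    return reduce(op, values)

print(max_star_exact(0.0, 0.0))                       # ln(2), about 0.693
print(max_star_n([1.0, 2.0, 3.0]))                    # 3.0
print(max_star_n([1.0, 2.0, 3.0], max_star_exact))    # ln(e^1 + e^2 + e^3)
```

The correction term never exceeds $\ln 2$, which is why discarding it costs only a modest amount of error correction capability while removing the need for a look-up table or logarithm in hardware.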
The processing element of Figure 8 is designed for performing all operations of an algorithmic block within a single clock cycle, as required by the FPTD algorithm. During this single clock cycle, the signals propagate through six datapath stages, which impose similar propagation delays. More explicitly, these datapath stages perform addition, subtraction and maximum calculations, which can all be efficiently implemented at similar complexities using two's complement arithmetic. In particular, the variables of (2) to (5) are represented using two's complement fixed-point numbers, having the bit-widths of $(w_1, w_2)$, where the bit-widths of $(w_1, w_2) = (4, 6)$ offer an attractive trade-off between strong BER performance and low computational complexity, as shown in Figure 9. More specifically, the bit-width of $w_1 = 4$ is employed for the a priori parity LLR $\bar{b}^{a}_{2,k}$ and systematic LLR $\bar{b}^{a}_{3,k}$, as recommended in [37]. As shown at the top of Figure 8, the a priori parity LLR $\bar{b}^{a}_{2,k}$ and the systematic LLR $\bar{b}^{a}_{3,k}$ are provided by the demodulator, where it is assumed that a quantizer is employed for converting the real-valued LLRs to fixed-point LLRs. In order to prevent a significant BER performance degradation owing to quantization distortion, it is assumed that the demodulator applies noise-dependent scaling [48] to both the a priori LLRs $\bar{b}^{a}_{2,k}$ and $\bar{b}^{a}_{3,k}$. More specifically, a linear scaling factor is applied, where the $E_b/N_0$ is expressed in dB, $v = 2^{w_1 - 1}$ is the range corresponding to the resolution of the quantizer, and $y = \{0.39, 0.3, 0.27, 0.25\}$, as discussed in [48]. In contrast to the channel LLRs, the bit-width of $w_2 = w_1 + 2 = 6$ is employed for the a priori and extrinsic message LLRs $\bar{b}^{a,t-1}_{1,k}$ and $\bar{b}^{e,t}_{1,k}$, as shown in Figure 8.
FIGURE 9. BER performance of the fixed-point FPTD using the approximate max* operation of (7), message LLR scaling ($f_2 = 0.75$), state-zero state metric normalization and various bit-widths $(w_1, w_2)$. The BER performance is compared to that of the floating-point FPTD using the approximate max* operation of (7), both with and without message LLR scaling ($f_2 = 0.7$). The BER was simulated for the case of transmitting $N = 6144$-bit frames over an AWGN channel, when performing $I = 39$ decoding iterations.
In addition to the noise-dependent scaling applied to $\bar{b}^{a}_{2,k}$ and $\bar{b}^{a}_{3,k}$ by the demodulator, the BER performance of the FPTD algorithm can be improved by scaling the a priori message LLR $\bar{b}^{a,t-1}_{1,k}$, in order to counteract the degradation imposed by the approximate max* operation of (7) [37]. While a scaling factor of $f_2 = 0.7$ is beneficial for the floating-point FPTD, our results of Figure 9 show that the fixed-point FPTD benefits from applying a scaling factor of $f_2 = 0.75$, which also facilitates a low-complexity hardware implementation. More specifically, by exploiting the two's complement multiplication arithmetic illustrated in Figure 10, the message LLR scaling factor of 0.75 may be applied to the a priori LLR $\bar{b}^{a,t-1}_{1,k}$ using two steps. In the first step, a 2-bit sign-extended version of $\bar{b}^{a,t-1}_{1,k}$ is added to a replica of itself that has been shifted to the left by one bit position, yielding three times the original value. Then, in a second step, a floor truncation [49] is applied to the two least significant bits of the result, which divides it by four and maintains the same bit-width of $w_1$ as that employed before the message LLR scaling. Here, the sign extension, bit shifting and floor operations can be carried out by hard-wiring, since the scaling factor of $f_2 = 0.75$ is fixed throughout the iterative decoding process. Therefore, the only hardware required for message LLR scaling is an adder, which occupies only the first datapath stage of Figure 8.
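The two-step scaling described above can be mimicked in software. The following is a minimal sketch with a hypothetical function name, in which Python integers stand in for the two's complement signals of the hardware.

```python
def scale_by_three_quarters(llr):
    """Apply a factor of 0.75 to an integer LLR using a shift, an add and a truncation."""
    tripled = llr + (llr << 1)  # step 1: x + 2x = 3x (sign extension is implicit in Python)
    return tripled >> 2         # step 2: discard the two LSBs, rounding toward minus infinity

for x in (4, 5, -4, -5):
    print(x, scale_by_three_quarters(x))
# 4 3
# 5 3
# -4 -3
# -5 -4
```

Python's right shift on negative integers is an arithmetic shift, so it reproduces the floor truncation of two's complement hardware. Note that negative values are rounded away from zero as a result, for example $-5 \times 0.75 = -3.75$ becomes $-4$.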
As the iterative decoding process proceeds, the values of the extrinsic state metrics $\bar{\alpha}^{t}_{k}$ and $\bar{\beta}^{t}_{k-1}$ can grow without upper bound [50]. In order to prevent any potential BER error floors that may be caused by saturation or overflow, state metric normalization may be employed for reducing the magnitudes of $\bar{\alpha}^{t}_{k}$ and $\bar{\beta}^{t}_{k-1}$, in order to ensure that they remain within the range that is supported by their bit-width of $w_2$. As shown in Figure 8, state-zero normalization [37] is performed in the sixth datapath stage within each processing element. This is achieved by subtracting $\bar{\alpha}^{t}_{k}(0)$ and $\bar{\beta}^{t}_{k-1}(0)$ from all of the extrinsic state metrics in $\bar{\alpha}^{t}_{k}$ and $\bar{\beta}^{t}_{k-1}$, respectively [50], [51]. Note that this subtraction does not change the information conveyed by the extrinsic state metrics, since this is carried by their differences, rather than by their absolute values. After state-zero normalization, zero values are guaranteed for the first extrinsic state metrics, namely $\bar{\alpha}^{t}_{k}(0) = 0$ and $\bar{\beta}^{t}_{k-1}(0) = 0$. In our fixed-point FPTD algorithm, this allows the registers and additions involving $\bar{\alpha}^{t-1}_{k-1}(0)$ and $\bar{\beta}^{t-1}_{k}(0)$ to be simply removed, saving two $w_2$-bit registers and seven additions per processing element, as shown by the dotted lines in Figure 8. Furthermore, this approach guarantees a constant value of zero for one of the operands input to three of the max* operations, simplifying them to using the sign bit of the other non-zero operand for selecting which specific operand is output.
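State-zero normalization amounts to a single subtraction per state metric. The following is a minimal sketch with a hypothetical function name, operating on a plain Python list of metrics rather than on hardware signals.

```python
def normalize_state_zero(metrics):
    """Subtract the state-zero metric from every metric in the set."""
    reference = metrics[0]
    return [m - reference for m in metrics]

alpha = [5, 3, 9, -2, 0, 7, 1, 4]   # eight illustrative state metrics
normalized = normalize_state_zero(alpha)
print(normalized)  # [0, -2, 4, -7, -5, 2, -4, -1]
```

The first entry is always zero after normalization, while the pairwise differences between metrics are unchanged, which is why the information conveyed is preserved.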
C. TERMINATION UNIT
Each row of algorithmic blocks shown in Figure 5 is appended with a termination unit, comprising three termination blocks having indices of $(N + 1)$, $(N + 2)$ and $(N + 3)$. These termination blocks employ only (2), without the term involving $\bar{b}^{a}_{3,k}$, and (4), operating in a backward-oriented recursive fashion for successively calculating $\bar{\beta}_{N+2}$, $\bar{\beta}_{N+1}$ and $\bar{\beta}_{N}$. Here, we employ $\bar{\beta}_{N+3} = [0, -\infty, -\infty, \ldots, -\infty]$, since the LTE termination technique guarantees $S_{N+3} = 0$. As described in Section III-A, here $-\infty$ is replaced by a negative constant having a suitably high magnitude in the fixed-point FPTD algorithm. Note that the termination units can be operated before and independently of the iterative decoding process, since the required a priori LLRs having indices $k \in [N + 1, N + 3]$ are provided only by the demodulator, with no data dependencies on the other $N$ algorithmic blocks in the row. Owing to this, the resultant $\bar{\beta}_{N}$ value can be used throughout the iterative decoding process, with no need to operate the termination unit again, as described in Section III-A. In contrast to the processing element of Figure 8, the termination unit of Figure 11 requires eight datapath stages for implementing the three consecutive algorithmic blocks, in order to convert the termination LLRs $\bar{b}^{a}_{1,N+1}$, … into the extrinsic backward-oriented state metrics $\bar{\beta}_{N}$. As shown in Figure 11, the first datapath stage is used for calculating (2) for all three termination blocks. Then the following six datapath stages are used for calculating (4) for the three termination blocks in a backward recursive manner, where calculating (4) for each termination block requires two datapath stages. The final datapath stage is occupied by the above-mentioned state-zero normalization. Note that although the delay of the termination unit's eight datapath stages is longer than that of the six stages used by the processing element of Figure 8, the termination unit does not dictate the critical path length of the fixed-point FPTD algorithm, which remains six datapath stages. This is because the termination units only have to be operated once before the iterative decoding process commences. Intuitively, this would imply that the termination units impose a delay of two clock cycles before the iterative decoding process can begin. However, in the fixed-point FPTD algorithm, we prefer to start the operation of the termination units at the same time as the iterative decoding process. In this way, the termination units do not impose a delay of two clock cycles before the iterative decoding process can begin, but the correct backward state metrics $\bar{\beta}_{N}$ cannot be guaranteed during the first decoding iteration, which is performed during the first two clock cycles. However, our experimental results demonstrate that this does not impose any BER degradation.
IV. RESOURCE-EFFICIENT FPTD ARCHITECTURE
In this section, we propose a novel resource-efficient architecture for implementing the fixed-point FPTD algorithm of Section III. In contrast to the FPTD architecture of [37], the proposed design requires only $N$ processing elements instead of $2N$ for decoding $N$-bit frames, therefore achieving a 50% reduction in hardware resource usage. This is achieved by exploiting the odd-even operation of the FPTD algorithm, which results in only half of the algorithmic blocks being operated simultaneously, as described in Section III. The proposed area-efficient FPTD architecture employs the schematic of Figure 12, which uses the same $N$ processing elements for alternately operating the algorithmic blocks of the first half-iteration of Figure 6(a) and those of the second half-iteration of Figure 6(b). This requires multiplexers for selecting between the LLRs pertaining to the upper and lower rows of Figure 5. Apart from this however, multiplexers are not required for the forward-oriented state metrics $\bar{\alpha}_k$ and the backward-oriented state metrics $\bar{\beta}_k$, since the state metrics generated by a particular processing element in a particular clock cycle will be directly processed by the neighboring processing elements in the next clock cycle, as a natural consequence of the odd-even operation of the FPTD algorithm. Furthermore, each processing element is associated with two separate routings through the interleaver. More specifically, the interleaver connects the $k$-th processing element with both the $\pi(k)$-th and $\pi^{-1}(k)$-th processing elements, as required when interleaving LLRs from the upper row to the lower row, as well as when deinterleaving from the lower row to the upper row, respectively. Note that the extrinsic message LLR $\bar{b}^{e}_{1,k}$ is routed through both the interleaving and the deinterleaving paths in every clock cycle, which implies that each processing element receives two a priori message LLRs $\bar{b}^{a}_{1,k}$ at a time. However, only the desired one is selected by the corresponding multiplexer, in accordance with the odd and even clock cycle scheduling.
At the end of each clock cycle, the unshaded registers shown in Figure 5 are employed to cache $\bar{b}^{a,u}_{1,k}$, $\bar{\alpha}_k$ and $\bar{\beta}_k$, ready for use in the next clock cycle. Note that the interleaver may be implemented using hard wires, implying that only a single fixed frame length $N$ is supported at run time, although this particular frame length may be selected during synthesis. Our future work will consider the replacement of the hard-wired interleaver with a Beneš network, which will enable the run-time support of different frame lengths, having different interleaver designs.
In addition to the unshaded registers shown in Figure 12, … where $\pi^{-1}(k)$ is guaranteed to be even owing to the odd-even nature of the LTE interleaver. The iterative decoding process continues until a fixed clock cycle limit is reached or until the a posteriori LLRs satisfy a CRC check, which is computed within a single clock cycle, as will be described in Section V.
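The odd-even nature of the LTE interleaver can be verified numerically. The following is a minimal sketch, assuming the quadratic permutation polynomial form of the LTE interleaver with the parameters $f_1 = 3$ and $f_2 = 10$ that the LTE standard specifies for $N = 40$; the function names and the zero-based indexing are choices made here for illustration, whereas the paper indexes the blocks from one.

```python
def qpp_interleaver(n, f1, f2):
    """Zero-based LTE-style quadratic permutation polynomial interleaver."""
    return [(f1 * i + f2 * i * i) % n for i in range(n)]

def preserves_parity(pi):
    """True if every index is mapped to an index of the same parity."""
    return all(p % 2 == i % 2 for i, p in enumerate(pi))

pi = qpp_interleaver(40, 3, 10)
print(pi[:6])                 # [0, 13, 6, 19, 12, 25]
print(preserves_parity(pi))   # True
```

Since $f_1$ is odd, $f_2$ is even and $N$ is even, odd indices map to odd indices and even indices map to even indices. Shifting both sides by one to obtain one-based indices preserves this property, so the same holds for $\pi^{-1}$.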
V. FPGA IMPLEMENTATION
In this section, we propose an FPGA implementation which integrates the FPTD architecture of Section IV with the I/O mechanisms and a fully-parallel CRC, which can operate in a single clock cycle. More specifically, Figure 13 provides the top-level schematic of the proposed FPGA implementation of the resource-efficient fixed-point FPTD architecture. This schematic comprises several functional components, including the FPTD core of Section IV, the input/output memory, CRC circuit, Phase-Locked Loops (PLLs) [52] and clock cycle counter. Note that two PLLs are employed, since the input and output RAMs are operated using a different, higher clock frequency than the FPTD core. The FPTD core requires the $(3N + 12)$ a priori LLRs described in Section III-A. A bit-width of $w_1 = 4$ is employed for each of these a priori LLRs, corresponding to a total of $12N + 48$ bits. However, it is not possible to feed the FPTD core using a one-to-one mapping of the FPGA's general I/O pins, owing to the limited number of pins, especially when the frame length $N$ is large. Owing to this, the input memory shown in Figure 13 is employed to provide a serial-to-parallel conversion and the storage of these bits. More specifically, the input memory comprises a number of the FPGA's M9K Static Random Access Memory (SRAM) blocks, which are all configured in the simple dual-port mode [53]. In this mode, one port is configured as an input port with a fixed bit-width of 2, while the other port is configured as an output port with a fixed bit-width of 32. Note that 32 bits is the maximum bit-width that an M9K block can support, in addition to its four parity bits. Accordingly, the input memory requires 16 consecutive memory clock cycles to load the a priori LLRs, but necessitates only a single clock cycle to feed them to the FPTD core. This serial-to-parallel conversion ratio of 1:16 is motivated by the trade-off between the I/O pin usage and the number of clock cycles required to load a frame, since a larger ratio requires fewer clock cycles to transfer the data, but occupies more I/O pins.
More specifically, our experimental results show that the largest FPTD that can be accommodated on our testbench EP4SE820F43C3 FPGA has a frame length of N = 720 bits, which is limited by the computational resource capacity. In this case, the input memory occupies [54] .
Furthermore, each M9K memory block has a capacity of 8192 bits besides the parity bits, which is sufficient to store up to 256 frames, since each frame occupies only 32 bits per M9K memory block. However, we employ only 64 bits per M9K memory block in this work for the sake of simplicity, which allows two frames to be stored at the same time.
As shown in Figure 13, the input memory accordingly accepts a 5-bit address addr_wr_1 for the 2-bit wide write port and a 1-bit address addr_rd_1 for the 32-bit wide read port, in order to allow switching between the two independent frames. This allows the iterative decoding of one frame to be pipelined with the loading of the next frame, as will be described in Section V-C. Note that the channel LLRs provided by the input memory read port are fed directly to the FPTD core, without registers in between as a cache.
The mapping between the M9K memory blocks and the processing elements is illustrated in Figure 14. As shown in Figure 14, a total of 96 bits are required for eight neighboring processing elements, in the case where each a priori channel LLR uses $w_1 = 4$ bits. Therefore, three memory blocks are required for each set of eight processing elements, necessitating a total of $3N/8$ memory blocks for the input memory, when excluding the LLRs pertaining to the termination bits. These termination LLRs correspond to a total of 48 bits, which can be provided using the first 24 bits from each of the two above-mentioned 1:16 M9K memory blocks, as shown in Figure 14(b).
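The input memory dimensioning discussed above can be reproduced with simple arithmetic. The following is a minimal sketch with hypothetical names, assuming that the two M9K blocks carrying the termination LLRs are additional to the $3N/8$ blocks and that $N$ is a multiple of eight.

```python
def input_memory_requirements(n, w1=4, read_width=32, write_width=2):
    """Estimate the input memory dimensions for an n-bit frame (illustrative only)."""
    total_bits = (3 * n + 12) * w1              # (3N + 12) LLRs of w1 bits each
    frame_blocks = (3 * n * w1) // read_width   # 3N/8 blocks when w1 = 4
    termination_blocks = 2                      # assumption: two extra blocks, 24 bits each
    load_cycles = read_width // write_width     # 1:16 serial-to-parallel conversion
    return {
        "total_bits": total_bits,
        "blocks": frame_blocks + termination_blocks,
        "load_cycles": load_cycles,
    }

print(input_memory_requirements(720))
# {'total_bits': 8688, 'blocks': 272, 'load_cycles': 16}
```

For $N = 720$ the total of 8688 bits agrees with the $12N + 48$ expression quoted earlier, while the block count depends on the stated assumption about the termination blocks.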
In addition to the input memory, an output memory is employed for storing and outputting the $N$ hard-decision bits obtained by the FPTD core of Section IV, when the iterative decoding process is completed. Similarly to the input memory, the output memory is also implemented using dedicated M9K SRAM blocks, configured in the simple dual-port mode. However, the output memory is used for supporting a parallel-to-serial conversion, in contrast to the serial-to-parallel conversion required for the input memory. More specifically, the input port and the output port are configured to have a 32-bit and 4-bit width, respectively, giving a parallel-to-serial ratio of 8:1. In analogy to the implementation of the input memory, $N/32$ M9K memory blocks are required for the output memory, allowing the $N$ hard-decision bits $[\bar{b}^{p}_{1,k}]_{k=1}^{N}$ to be cached into the output memory in a single clock cycle, which are then output using $N/8$ output pins during eight consecutive clock cycles. For the sake of simplicity, we employ only 32 of the 8192 bits that can be stored in each M9K memory block, allowing the storage of a single decoded frame at a time. As shown in Figure 13, the output memory accordingly accepts only a 3-bit address addr_rd_2 for its read port, whereas the address for the 32-bit wide write port can be fixed to zero internally. Note that outputting these hard-decision bits can also be pipelined with the iterative decoding and loading of the next frames, as will be detailed in Section V-C.
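The 8:1 parallel-to-serial conversion performed by each output memory block can be sketched as follows. The function names are hypothetical, and the choice of reading the least significant 4-bit group at address zero is an assumption made for illustration, since the ordering is not stated in the text.

```python
def serialize_word(word32):
    """Split a 32-bit word into eight 4-bit groups, indexed by a 3-bit read address."""
    return [(word32 >> (4 * addr)) & 0xF for addr in range(8)]

def deserialize_word(groups):
    """Reassemble the 32-bit word from its eight 4-bit groups."""
    word = 0
    for addr, group in enumerate(groups):
        word |= group << (4 * addr)
    return word

groups = serialize_word(0x89ABCDEF)
print([hex(g) for g in groups])
# ['0xf', '0xe', '0xd', '0xc', '0xb', '0xa', '0x9', '0x8']
```

Eight read addresses suffice to drain one 32-bit word through a 4-bit port, which matches the 3-bit width of the read address and the eight output clock cycles described above.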
B. CRC
As described in Section IV, the FPTD core performs iterative decoding in accordance with the FPTD algorithm of Figure 5, obtaining $N$ hard-decision bits $[\bar{b}^{p}_{1,k}]_{k=1}^{N}$ in every clock cycle while it is active. In each clock cycle, these hard-decision bits are provided to an LTE CRC circuit, which detects if and when the FPTD core successfully decodes a frame before a predetermined clock cycle limit has been reached, hence enabling early stopping. However, this requires completing the computation of the CRC in a single clock cycle, which is not feasible when adopting the conventional LFSR approach to implement the CRC [55]. Motivated by this, our FPGA implementation employs a fully-parallel CRC circuit, which is capable of computing the CRC in a single clock cycle and without an excessive critical path length, according to the design guideline of [56]. More specifically, the LTE turbo code employs a 24-bit CRC, having the generator polynomial of $g(D) = D^{24} + D^{23} + \ldots$. In our fully-parallel CRC circuit of Figure 15, the 24 CRC bits are computed simultaneously using separate predefined XOR trees within a single clock cycle. Each XOR tree is constructed by unfolding the corresponding LFSR operations of a conventional CRC circuit, accepting a particular selection of the $N$ hard-decision bits $[\bar{b}^{p}_{1,k}]_{k=1}^{N}$ as its input. Note that many of the XOR operations can be shared between XOR trees that consider overlapping sets of bits. Each of the XOR trees takes no more than $N$ bits as input, therefore requiring no more than $(N - 1)$ 2-input XOR gates, which may be structured as a tree comprising no more than $\lceil \log_2(N) \rceil$ layers. Hence, $\lceil \log_2(N) \rceil$ represents an upper bound on the number of gates in the critical path of the parallel CRC. Following the XOR trees, the CRC result is obtained by performing the OR logic combination of all 24 CRC bits, using a tree comprising 23 2-input OR gates, arranged in $\lceil \log_2(24) \rceil = 5$ layers.
As shown in Figure 13, the resultant checksum is output by the FPGA to indicate the correctness of the decoded results, where a zero value indicates successful decoding. Note that in the proposed FPGA implementation of Figure 13, the CRC circuit operates simultaneously with the FPTD core, but on the hard-decision bits that were provided in the previous clock cycle. Therefore, the CRC circuit does not affect the 6-stage critical path length of the FPTD core, as discussed in Section III-B.
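The unfolding of a serial LFSR into per-bit XOR trees can be illustrated in software as follows. This is a minimal sketch under stated assumptions, not the circuit of Figure 15: the generator polynomial is truncated in this text, so the sketch assumes the 3GPP TS 36.212 polynomial gCRC24B (D^24 + D^23 + D^6 + D^5 + D + 1), and all function names are hypothetical. It relies on the linearity of the CRC over GF(2): the set of inputs feeding each XOR tree is found by passing unit vectors through the serial LFSR.

```python
# Assumption: gCRC24B from 3GPP TS 36.212, written without the implicit D^24 term.
# Any other 24-bit generator polynomial can be substituted.
CRC24B_POLY = 0x800063
CRC_WIDTH = 24


def lfsr_crc(bits, poly=CRC24B_POLY, width=CRC_WIDTH):
    """Serial reference: one LFSR step per input bit, zero initial state."""
    reg, top, mask = 0, 1 << (width - 1), (1 << width) - 1
    for b in bits:
        feedback = ((reg & top) != 0) ^ b
        reg = (reg << 1) & mask
        if feedback:
            reg ^= poly
    return reg


def unfold_xor_trees(n, poly=CRC24B_POLY, width=CRC_WIDTH):
    """For each CRC bit, list the input positions its XOR tree combines."""
    trees = [[] for _ in range(width)]
    for k in range(n):
        unit = [0] * n
        unit[k] = 1                      # contribution of input bit k alone
        r = lfsr_crc(unit, poly, width)
        for j in range(width):
            if (r >> j) & 1:
                trees[j].append(k)
    return trees


def parallel_crc(bits, trees):
    """Evaluate every XOR tree independently, as parallel hardware would."""
    reg = 0
    for j, taps in enumerate(trees):
        parity = 0
        for k in taps:
            parity ^= bits[k]
        reg |= parity << j
    return reg


def checksum(bits, trees):
    """OR-combination of all CRC bits: 0 indicates a passing CRC."""
    return int(parallel_crc(bits, trees) != 0)


def append_crc(message, poly=CRC24B_POLY, width=CRC_WIDTH):
    """Attach the CRC bits (MSB first) so that the whole frame passes."""
    r = lfsr_crc(message, poly, width)
    return list(message) + [(r >> (width - 1 - i)) & 1 for i in range(width)]


if __name__ == "__main__":
    frame = append_crc([1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1])
    trees = unfold_xor_trees(len(frame))
    print(len(frame), checksum(frame, trees))
```

The sketch does not model the sharing of XOR operations between trees, which a synthesis tool would exploit in hardware.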
C. OPERATION AND CONTROL
As shown in Figure 13, the input and output memories are clocked by an external clock clk_mem having a clock frequency of $f_{mem} = 333$ MHz, while the FPTD core and CRC circuit are clocked by another external clock clk_core having a different clock frequency $f_{core}$, which depends on the frame length $N$. Both clocks are compensated using different ones of the FPGA's dedicated PLL circuits. Figure 16 illustrates a timing diagram for an example operation of the proposed FPTD, including all three of the loading, processing and output stages. The loading stage comprises the operation of the input memory, which requires 16 memory clock cycles, as discussed in Section V-A. This loading process is controlled by the control signal llrs_load, as shown in Figure 16. Following these 16 memory clock cycles, there is a delay of approximately two memory clock cycles before the iterative decoding process begins. This delay is required by the memory read and for overcoming the phase difference between the FPTD core clock clk_core and the memory clock clk_mem. Here, the memory read is triggered at the rising edge of the start signal, which therefore feeds the FPTD core with the a priori channel LLRs, in accordance with the selected memory read address adr_rd_1. Considering these 18 memory clock cycles, the overall time used for loading is $18/f_{mem} = 54.1$ ns. The pulse of the start control signal is also used for resetting the FPTD core synchronously with clk_core, as well as for activating its iterative decoding process. The in_process output signal of Figure 13 is asserted throughout this iterative decoding process, which continues until the checksum becomes zero or until a maximum number $I_{max}$ of decoding iterations has been reached. As shown in Figure 13, the in_process signal is implemented using a single AND gate, accepting as its inputs the checksum from the CRC of Section V-B and the result of the condition check $I_{counter} < I_{max}$.
Here, $I_{counter}$ may be accumulated using a generic ripple counter, which is reset to zero when the start control signal is pulsed. The average duration of the iterative decoding process is given by $\tau_1 = \frac{2 I_{av} + OCyc}{f_{core}}$, where $I_{av}$ is the average number of iterations performed, $OCyc$ is the clock cycle overhead and $f_{core}$ is the clock frequency of the FPTD core. The worst-case duration $\tau_1^{max} = \frac{2 I_{max} + OCyc}{f_{core}}$ is incurred for frames where the iterative decoding process is terminated when the counter $I_{counter}$ reaches the maximum iteration limit $I_{max}$. In the proposed FPTD implementation, the overhead is $OCyc = 2$ clock cycles, which comprises one clock cycle for resetting the FPTD core at the beginning of the iterative decoding process and one clock cycle of delay for the CRC calculation of Figure 15 at the end.
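As a rough numerical illustration of the decoding duration, the following sketch evaluates (2·I + OCyc)/f_core for a given iteration count. The function name `decoding_duration` is hypothetical; the clock frequencies of 93 MHz and 65 MHz and the limit of 28 iterations are the values reported in Section VI for N = 40 and N = 720.

```python
def decoding_duration(iterations, f_core_hz, overhead_cycles=2):
    """Duration in seconds of the iterative decoding process.

    Two clock cycles are counted per iteration, plus the fixed overhead
    (one reset cycle and one CRC cycle in the proposed implementation).
    """
    return (2 * iterations + overhead_cycles) / f_core_hz


if __name__ == "__main__":
    # Worst case: the iteration limit of 28 is reached without a CRC pass.
    for n, f_core in ((40, 93e6), (720, 65e6)):
        print(n, decoding_duration(28, f_core) * 1e9, "ns")
```

Passing the average iteration count instead of the limit gives the average duration for the same clock frequency.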
As shown in Figure 16, the write operation for the output memory is triggered by a falling edge of the in_process signal. This causes the $N$ hard-decision bits $[b^{p}_{1,k}]_{k=1}^{N}$ obtained for the present frame to be stored in the output memory within a single memory clock cycle. Following this, the control signal result_output is asserted for eight memory clock cycles, in order to signal that the hard-decision bits are being output, as discussed in Section V-A. In addition to this, there is an output delay of approximately three memory clock cycles. Considering all of these 12 memory clock cycles, the time required for outputting the results is $12/f_{mem} = 36$ ns.
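The loading and output times quoted in this section follow directly from the memory clock frequency of 333 MHz. The worked arithmetic below uses the symbols $t_{load}$ and $t_{out}$, which are introduced here purely for illustration and do not appear in the original text.

```latex
\begin{align}
t_{load} &= \frac{16 + 2}{f_{mem}} = \frac{18}{333\,\mathrm{MHz}} \approx 54.1\,\mathrm{ns}, \\
t_{out}  &= \frac{1 + 8 + 3}{f_{mem}} = \frac{12}{333\,\mathrm{MHz}} \approx 36.0\,\mathrm{ns}, \\
t_{load} + t_{out} &= \frac{30}{f_{mem}} \approx 90.1\,\mathrm{ns}.
\end{align}
```

The sum of 30 memory clock cycles is the fixed loading and output overhead that is added to the decoding duration when the overall latency is computed.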
As illustrated in Figure 17, the loading, processing and outputting operations may be pipelined, in order to maximize the processing throughput when decoding several successive frames. More specifically, the proposed FPTD decoder is implemented such that the loading of the next frame can begin as soon as the processing operation for the present frame is started. This allows the FPTD core to start the iterative decoding process of the next frame immediately after completing the decoding process for the present frame. In this way, the average decoding throughput can be improved from $N/\tau_2$ to $N/\tau_1$, where $\tau_1$ is the average delay incurred by the FPTD core, while $\tau_2 = \tau_1 + 30/f_{mem}$ is the overall delay for completely decoding a frame, including loading, processing and outputting. Note that this throughput improvement can only be achieved in the specific case where the next frame becomes available before the iterative decoding process of the current frame has been completed. Furthermore, this pipelining technique does not improve the latency of the proposed FPTD implementation, which has the value $\tau_2 = \tau_1 + 30/f_{mem}$ on average and $\tau_2^{max} = \tau_1^{max} + 30/f_{mem}$ in the worst case. Note that Figure 17 illustrates a time delay $\tau_3$, which may be incurred between completing the loading of a frame and beginning its iterative decoding. However, our characterization of the proposed FPTD implementation's latency does not include $\tau_3$, since it may vary from frame to frame and since its value depends on the timing of the delivery of frames by the demodulator, which is outside the scope of this work.
FIGURE 17. Timeline for pipelining the turbo decoding of three consecutive frames, where $\tau_2$ is the time required for completely decoding a frame, in which the iterative turbo decoding process occupies $\tau_1$ time. In addition, $\tau_3$ is the time delay which may be incurred between completing the loading of a frame and beginning its iterative decoding.
VOLUME 4, 2016
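The effect of pipelining on throughput can be sketched as follows, assuming that the throughput equals N divided by the time between successive frame completions. The function names are hypothetical, and the value of 0.47 µs used for the average core delay is an illustrative figure inferred from the reported average latency of 0.56 µs for N = 720, less the 30 memory clock cycles of loading and output.

```python
F_MEM_HZ = 333e6   # memory clock frequency from Section V-C
IO_CYCLES = 30     # 18 loading cycles plus 12 output cycles


def frame_latency(tau1_s, f_mem_hz=F_MEM_HZ):
    """Overall delay for loading, decoding and outputting one frame."""
    return tau1_s + IO_CYCLES / f_mem_hz


def throughput_bps(n_bits, tau1_s, pipelined=True):
    """Decoded bits per second, with or without pipelining of the I/O."""
    period = tau1_s if pipelined else frame_latency(tau1_s)
    return n_bits / period


if __name__ == "__main__":
    tau1 = 0.47e-6  # illustrative average core delay for N = 720
    print(throughput_bps(720, tau1, pipelined=True) / 1e9, "Gbit/s")
    print(throughput_bps(720, tau1, pipelined=False) / 1e9, "Gbit/s")
```

Under these assumptions, the pipelined figure is consistent with the 1.53 Gbit/s reported for N = 720, while the latency is unchanged by pipelining.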
VI. RESULTS
In this section, we characterize the proposed FPGA implementation of our fixed-point LTE FPTD, using the Altera FPGA EP4SE820F43C3, which comprises 813k Logic Elements (LEs), 650k registers, 1.6k M9K memory blocks and 12 PLLs. These results are compared to the state-of-the-art FPGA implementation of the LTE turbo decoder of [14], which applies the conventional Log-BCJR algorithm to the same FPGA. More specifically, we compare our proposed FPGA implementation to the benchmarker of [14] in terms of its BER, throughput, latency and resource usage, in Sections VI-A, VI-B, VI-C and VI-D, respectively. Following this, we characterize the energy consumption of our proposed FPGA implementation in Section VI-E, although we are unable to compare this to that of the benchmarker. This is because the energy consumption of the benchmarker implementation is not discussed in [14], nor is it discussed for other existing FPGA implementations of the Log-BCJR based LTE turbo decoder. Finally, we perform an overall comparison between the proposed LTE FPTD FPGA implementation and other state-of-the-art FPGA implementations of the Log-BCJR turbo decoder and various LDPC decoders in Section VI-F.
[14]. The FPTD employs the approximate max* operation of (7) and performs I ≤ 28 iterations for frame lengths of N ∈ {256, 512, 1024}. The benchmarker decoder of [14] employs the exact max* operation of (6) and performs I = 5 iterations for frame lengths of N ∈ {256, 512, 1024}.
A. BER
The BER performance of the proposed FPGA implementation of the FPTD is compared to that of the benchmarker FPGA implementation of the Log-BCJR turbo decoder of [14] in Figure 18. Here, the benchmarker employs the exact max* operation of (6), while performing I = 5 iterations, for the frame lengths of N ∈ {512, 1024}. The same BER performance can be achieved by our proposed fixed-point FPTD, when employing the approximate max* operation of (7) and performing I ≤ 28 iterations. More specifically, the iterative decoding is curtailed once a sufficient number of iterations have been performed to achieve successful decoding, or when a limit of $I_{max} = 28$ iterations is reached. Furthermore, Figure 19 characterizes the BER performance of the proposed fixed-point FPTD performing I ≤ 28 iterations for decoding frames having different lengths of N ∈ {40, 64, 128, 256, 512, 720}, which can all be accommodated within the hardware of the target FPGA. Note that the BER performance of the benchmarker is not provided in [14] for the frame lengths of N ∈ {40, 64, 128, 720}, hence we are unable to compare it to that of our proposed FPTD. Note also that Figure 19 shows the average number $I_{av}$ of iterations used by the proposed implementation for each frame length at the specific $E_b/N_0$ value where a BER of $10^{-5}$ is reached.
Figure 20 characterizes the critical path delay of the proposed FPTD core for the frame lengths of N ∈ {40, 64, 128, 256, 512, 720}, as well as the resultant maximum clock frequency, as reported by the Quartus design tool. Here, the critical path delay comprises two parts, namely the cell delay and the interconnect delay. The cell delay is the sum of the time occupied by all the combinational components residing on the critical path. By contrast, the interconnect delay is the sum of the time occupied by the interconnections between those combinational components.
As shown in Figure 20, when implementing the FPTD for the shortest LTE frame length of N = 40 bits, the critical path delay is 10.6 ns, in which the cell delay and the interconnect delay are evenly distributed, achieving a maximum clock frequency of $f_{core} = 93$ MHz. Note that this maximum clock frequency also depends on the delay associated with the clock tree, although this is negligible compared to the cell delay and interconnect delay shown in Figure 20. When the frame length is increased from N = 40 bits to N = 720 bits, the critical path delay increases gradually to 15.3 ns and the maximum clock frequency decreases accordingly to $f_{core} = 65$ MHz. As shown in Figure 20, this increase in the critical path delay with N is mainly imposed by the interconnects, while the cell delay is reduced slightly. This may be because a greater fraction of the FPGA's logic elements is employed for implementing the FPTD when the frame length N is increased, which increases the difficulty of optimizing the placing and routing. In particular, the interleaver may be required to route information between processing elements that are implemented near opposite corners of the FPGA. By contrast, the reduced cell delay may be attributed to the deeper optimization performed by the Quartus design tool when the resources become limited.
B. THROUGHPUT
As described in Section V-C, the average throughput of the proposed FPTD is given by $N/\tau_1$, where $\tau_1 = \frac{2 I_{av} + OCyc}{f_{core}}$.
FIGURE 22. Comparison of average and maximum latency for the proposed fixed-point FPTD FPGA implementation and for the benchmarker FPGA decoder of [14], where I ≤ 28 iterations are performed by the proposed FPTD, while the benchmarker decoder employs I = 5 iterations.
Note that the throughput is a function of the frame length N, as well as of the clock frequency $f_{core}$ and the average number of iterations performed $I_{av}$, both of which also depend on N, as characterized in Figures 19 and 20. Owing to this, Figure 21 compares the throughput of the proposed FPTD with that of the benchmarker FPGA implementation of [14], as a function of N. More specifically, the resultant throughput of the proposed FPTD ranges from 442 Mbit/s for N = 40 to 1.53 Gbit/s for N = 720. By contrast, the throughput of the benchmarker of [14] is given by
, where the clock frequency is $f_{clk} = 102$ MHz, the number of iterations performed is I = 5 and the overhead is OCyc = 14 clock cycles per half-iteration. Furthermore, the benchmarker decoder of [14] comprises 64 sub-decoders, each of which processes one or zero partitions of the frame, depending on the frame length N. More specifically, a frame having the length N is decomposed into P partitions, where P depends on N; for example, as noted in Section VI-D, P = 8 for frame lengths in the range 40 to 512 and P = 16 for those in the range 528 to 1024. Considering these configurations, the resultant throughput of the benchmarker FPGA decoder of [14] is in the range from 21 Mbit/s for N = 40 to 524 Mbit/s for N = 6144, as shown in Figure 21. Note that the throughputs of the benchmarker are 21 and 13.2 times lower than those of the proposed FPTD decoder for the cases of N = 40 and N = 720, respectively. Furthermore, the maximum throughput gain is achieved by the proposed FPTD decoder for N = 512, where it has a throughput of 1.4 Gbit/s, which is 22.6 times higher than the 62 Mbit/s achieved by the benchmarker decoder of [14].
[14]. The percentages shown in brackets indicate the corresponding fraction of the capacity of the EP4SE820F43C3 FPGA.
C. LATENCY
As described in Section V, the average latency imposed by loading, processing and outputting a frame is given by $\tau_2 = \tau_1 + 30/f_{mem}$, while the worst-case latency is given by $\tau_2^{max} = \tau_1^{max} + 30/f_{mem}$, which is incurred when decoding is unsuccessful. By contrast, the latency of the benchmarker decoder is not quantified in [14], but may be optimistically estimated as latency = N/throughput, which ignores the latency for loading and outputting the data. As shown in Figure 22, the latency of the benchmarker decoder ranges between 1.9 µs when N = 40 and 6.2 µs when N = 720. By contrast, our proposed FPTD FPGA implementation achieves a worst-case latency of 0.72 µs when N = 40 and 0.98 µs when N = 720, which are 2.6 times and 6.3 times lower than those of the benchmarker decoder, respectively. Meanwhile, the average latency of our proposed FPTD FPGA implementation reduces to 0.18 µs when N = 40 and 0.56 µs when N = 720, which are 10.6 times and 11.1 times lower than those of the benchmarker decoder, respectively. Here, the maximum latency improvement is obtained when N = 512, where the latency of 0.46 µs for the proposed FPTD is 18 times lower than the 8.3 µs obtained by the benchmarker of [14].
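The optimistic latency estimate for the benchmarker and the resulting latency gains can be reproduced from the figures quoted in this section. The sketch below is only a consistency check: the function names are hypothetical, and the inputs are the rounded throughputs and latencies reported in the text, so the outputs match the quoted gains only to within rounding.

```python
def optimistic_latency_s(n_bits, throughput_bps):
    """Benchmarker latency estimate N/throughput, ignoring loading and output."""
    return n_bits / throughput_bps


def latency_gain(benchmark_s, proposed_s):
    """How many times lower the proposed latency is than the benchmarker's."""
    return benchmark_s / proposed_s


if __name__ == "__main__":
    bench_40 = optimistic_latency_s(40, 21e6)     # benchmarker, N = 40
    bench_512 = optimistic_latency_s(512, 62e6)   # benchmarker, N = 512
    print(bench_40 * 1e6, latency_gain(bench_40, 0.18e-6))
    print(bench_512 * 1e6, latency_gain(bench_512, 0.46e-6))
```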
D. RESOURCE USAGE
The resource usage of the proposed N = 720 FPTD FPGA implementation is compared with that of the benchmarker decoder of [14] in Table 1, in terms of combinational Adaptive Look-Up-Tables (ALUTs), as well as memory ALUTs, dedicated logic registers and total block memory bits. Note that the EP4SE820F43C3 FPGA has a capacity of 650,440 ALUTs, half of which can be configured to implement combinational logic, while the other half can be configured as either combinational logic or as memory. Here, we compare three versions of the benchmarker decoder of [14], namely the P = 8, P = 16 and P = 64 versions. The P = 64 version is the original implementation presented in [14], which comprises 64 sub-decoders and is capable of supporting all LTE frame lengths at run time. However, our proposed FPTD implementation supports only a single frame length of up to N = 720 bits at run time. In order to facilitate fairer resource usage comparisons with our FPTD, the P = 8 and P = 16 versions of the benchmarker decoder comprise only 8 and 16 sub-decoders, respectively.
As described in Section VI-B, this is motivated by the fact that the benchmarker of [14] only uses P = 8 and P = 16 sub-decoders for frame lengths N in the ranges 40 to 512 and 528 to 1024, respectively. Owing to this, the P = 8 and P = 16 versions offer the same throughputs as the P = 64 version of the benchmarker for frame lengths in the ranges 40 to 512 and 528 to 1024 respectively, but with lower hardware usage. Note that the resource usage of the P = 8 and P = 16 versions reported in Table 1 was estimated by linearly scaling that of the P = 64 version. As shown in Table 1, the N = 720 FPTD occupies 99% of the EP4SE820F43C3 FPGA's ALUTs as combinational logic, while it uses 11% of the FPGA's dedicated registers and 0.04% of its memory bits. Furthermore, the N = 720 FPTD employs 650 general I/O pins, in accordance with Figure 13. By contrast, the P = 16 version of the benchmarker decoder occupies 8.8% of the ALUTs as combinational logic and 1% of the ALUTs as memory, while it uses 8% of the dedicated registers and 0.2% of the total memory bits. The resource usage may be normalized as $\frac{ALUTs}{throughput}$, since both the proposed FPTD implementation and the benchmarker are limited by ALUT resources, rather than by memory. Figure 23 depicts the normalized resource usage of the proposed FPTD FPGA implementation as a function of the frame length N, compared with those of the P = 8, P = 16 and P = 64 versions of the benchmarker decoder. Note that for frame lengths N above 512 and 1024 bits respectively, the throughputs of the P = 8 and P = 16 versions of the benchmarker are estimated by linearly scaling those of the P = 64 version, as shown in Figure 21. As shown in Figure 23 
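The normalization of resource usage by throughput can be sketched as below. The function name `normalized_usage` is hypothetical, and the inputs are approximations taken from the text: 99% of the FPGA's 650,440 ALUTs and the 1.53 Gbit/s throughput reported for N = 720.

```python
def normalized_usage(aluts, throughput_bps):
    """Resource usage in kALUTs per Mbit/s of throughput."""
    return (aluts / 1e3) / (throughput_bps / 1e6)


if __name__ == "__main__":
    aluts_n720 = 0.99 * 650_440   # about 99% of the EP4SE820F43C3 ALUTs
    print(normalized_usage(aluts_n720, 1.53e9))
```

With these inputs the result rounds to the 0.42 kALUTs/(Mbit/s) quoted in the abstract for N = 720.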
E. ENERGY
The power consumption of the proposed FPTD FPGA implementation was estimated using the power analysis tool in the design tool kit of Quartus II [57], based on the Value Change Dump (VCD) results obtained from a post-fit dynamic simulation of 100 frames. These frames were recorded during transmission over an AWGN channel using BPSK at the specific Figure 13 . The core dynamic power consumption is dominated by the switching activity of all in-use hardware resources, which increases gradually with the frame length N, in correspondence with the associated increase in hardware resource usage. By contrast, the core static power consumption is relatively consistent for all frame lengths, since this depends more upon the FPGA's technology and size than upon its application. Furthermore, when implementing the N = 40-bit FPTD, the static power consumption comprises approximately 50% of the total power consumption. By contrast, the static power consumption represents only 8.4% of the total power consumption when implementing the N = 720-bit FPTD, which occupies all of the FPGA's computational resources. Compared to the core dynamic and core static power consumption, the power consumption of the I/O pins is negligible, as shown at the bottom of each bar in Figure 24.
The average energy consumption per bit may be obtained as
, where $\tau_2 = \tau_1 + 30/f_{mem}$ is the average latency for decoding a frame, as described in Section VI-C. As shown in Figure 24, the average energy consumption per bit ranges from 9.9 nJ to 14.1 nJ, when the frame length is increased from N = 40 to N = 720. This increased energy consumption per bit is dominated by the increased energy consumption associated with routing, which is incurred by the more complex interconnections and clock trees that are associated with longer frames. Note that the energy consumption is not characterized for the benchmarker decoder in [14], or for any other state-of-the-art FPGA implementations of the Log-BCJR turbo decoder, hence preventing a comparison with our proposed FPTD implementation.
F. OVERALL COMPARISON
Table 2 compares the overall characteristics of the proposed LTE FPTD FPGA implementation with several state-of-the-art LTE turbo decoder FPGA implementations based on the Log-BCJR algorithm. In order to facilitate fair comparisons with the other implementations, their characteristics have been scaled to become equivalent to using a 40 nm FPGA technology and using I = 5 decoding iterations, as shown in the brackets of Table 2. Note that [27] and [35] characterize the throughput and resource usage for only a single MAP decoder, without considering the overhead of implementing the interleaver and CRC circuit. Note also that the FPGAs from different vendors have widely differing architectures, which prevents a precise comparison in terms of resource usage. Nonetheless, we adopt the concept of Equivalent Logic Blocks (ELBs) defined in [13] to offer a fair comparison between the resource usage of implementations using FPGAs manufactured by different vendors. More specifically, an ELB corresponds to a pair of 4-input Look-Up Tables (LUTs) and a register, where an ALUT in Altera FPGAs is equivalent to a single ELB, while a 6-input LUT in Xilinx FPGAs is approximately equivalent to two ELBs.
As shown in Table 2, the proposed LTE FPTD FPGA implementation achieves the highest peak processing throughput, compared to all other implementations considered. Note that the peak throughput of the proposed FPTD is achieved for a frame length of N = 720 bits, while the peak throughput of the other LTE turbo decoder implementations is achieved for the case of N = 6144-bit frames. Similarly, the normalized resource usage of the proposed LTE FPTD FPGA implementation is better than those of the other implementations, as shown in Table 2.
Furthermore, Figure 25 compares the turbo decoder implementations of Table 2 with many FPGA implementations of LDPC decoders, which were characterized in [13]. Here, the comparison considers processing throughput, hardware resource usage, BER performance and flexibility to support different frame lengths and coding rates at run time. More specifically, Figure 25 plots the resource usage and the throughput of the various FPGA implementations on its x-axis and y-axis, respectively. Here, the resource usage is quantified using the above-mentioned ELB metric, which facilitates a fairer comparison between implementations that employ different FPGAs. Furthermore, in order to be consistent with the comparisons of [13], the throughputs presented in Figure 25 are the unscaled ones of Table 2. In addition to throughput and resource usage, the flexibility of each implementation is identified by the shape of the data points, while the BER performance is indicated by their color. Here, the BER performance is characterized by the minimal $E_b/N_0$ value where a BER of $10^{-4}$ is achieved, which is related to the code design, coding rate, number of iterations and frame length. As shown in Figure 25, the flexible turbo decoders offer similar normalized resource usage ($\frac{kELBs}{Mbit/s}$) to the flexible LDPC decoders, despite the LDPC decoders typically having lower computational complexity. This may be attributed to the significantly higher interconnection complexity of LDPC decoders, as well as to the significant challenges associated with implementing high-throughput flexible LDPC decoders. More specifically, all turbo decoder algorithmic blocks are identical and each of them is connected only to its neighbors and to a single algorithmic block through the interleaver. By contrast, the variable and check nodes of an LDPC decoder have various degrees, often much greater than one, where the degree quantifies the number of nodes connected through the interleaver.
Owing to these complications, the flexible WiFi LDPC decoders support only 12 combinations of frame length and coding rate, while the LTE turbo decoders support 643 million combinations. Furthermore, the flexible turbo decoders of Figure 25 can be seen to offer superior BER performance to the flexible LDPC decoders. 
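The ELB-based normalization used for the cross-vendor comparison can be sketched as follows. The conversion factors are those stated above (one ELB per Altera ALUT and approximately two ELBs per Xilinx 6-input LUT); the dictionary keys and function names are hypothetical labels introduced for this illustration.

```python
# Approximate conversion factors quoted in the text (after [13]).
ELB_PER_UNIT = {
    "altera_alut": 1.0,   # one ALUT is equivalent to a single ELB
    "xilinx_lut6": 2.0,   # one 6-input LUT is roughly two ELBs
}


def to_elbs(count, kind):
    """Convert a vendor-specific logic count into Equivalent Logic Blocks."""
    return count * ELB_PER_UNIT[kind]


def kelbs_per_mbps(elbs, throughput_bps):
    """Normalized resource usage in kELBs per Mbit/s."""
    return (elbs / 1e3) / (throughput_bps / 1e6)


if __name__ == "__main__":
    # Hypothetical example: 1000 Xilinx 6-input LUTs delivering 100 Mbit/s.
    elbs = to_elbs(1000, "xilinx_lut6")
    print(elbs, kelbs_per_mbps(elbs, 100e6))
```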
VII. CONCLUSIONS
In this paper, we have proposed a novel area-efficient fixed-point LTE FPTD, which achieves a 50% hardware resource reduction compared with the FPTD architecture of [37]. We have also proposed a holistic FPGA implementation of this resource-efficient FPTD, which includes schemes for loading each frame, processing it and outputting the results. The proposed FPGA implementation offers a processing throughput gain of up to 22.6 times and a processing latency gain of up to 18 times, compared to those of the state-of-the-art FPGA implementation based on the conventional Log-BCJR LTE turbo decoder. The peak processing throughput of 1.53 Gbit/s and the worst-case latency of 0.98 µs for the proposed FPTD implementation meet the throughput and latency requirements of state-of-the-art telephony communication standards, such as LTE cat.12 [64]. In particular, its processing latency represents only an insignificant fraction of the 1 ms end-to-end transmission latency budget for MCMTC applications [9]. Furthermore, the normalized resource usage of the proposed FPTD FPGA implementation is up to 19 times better than that of the P = 8 version of the benchmarker. Our future work will be motivated by the improvements desired for 5G communications, such as increased flexibility to support a wider range of frame lengths, as well as improved hardware efficiency and energy efficiency. More specifically, we will consider techniques that can further reduce the resource usage and facilitate support for all LTE turbo code frame lengths, as well as any that are defined for 5G. We will also consider the employment of a Beneš network [65] in order to implement the LTE interleaver and to support different frame lengths at run time.
