ABSTRACT Turbo codes comprising a parallel concatenation of upper and lower convolutional codes are widely employed in state-of-the-art wireless communication standards, since they facilitate transmission throughputs that closely approach the channel capacity. However, this necessitates high processing throughputs in order for the turbo code to support real-time communications. In state-of-the-art turbo code implementations, the processing throughput is typically limited by the data dependencies that occur within the forward and backward recursions of the Log-BCJR algorithm, which is employed during turbo decoding. In contrast to the highly serial Log-BCJR turbo decoder, we have recently proposed a novel fully parallel turbo decoder (FPTD) algorithm, which can eliminate the data dependencies and perform fully parallel processing. In this paper, we propose an optimized FPTD algorithm, which reformulates the operation of the FPTD algorithm so that the upper and lower decoders have identical operation, in order to support single instruction multiple data operation. This allows us to develop a novel general purpose graphics processing unit (GPGPU) implementation of the FPTD, which has application in software-defined radios and virtualized cloud-radio access networks. As a benefit of its higher degree of parallelism, we show that our FPTD improves the processing throughput of the Log-BCJR turbo decoder by between 2.3 and 9.2 times, when employing a high-specification GPGPU. However, this is achieved at the cost of a moderate increase of the overall complexity by between 1.7 and 3.3 times.
I. INTRODUCTION
Channel coding has become an essential component in wireless communications, since it is capable of correcting the transmission errors that occur when communicating over noisy channels. In particular, turbo coding [1]-[3] is a channel coding technique that facilitates near-theoretical-limit transmission throughputs, which approach the capacity of a wireless channel. Owing to this, turbo codes comprising a concatenation of upper and lower convolutional codes are widely employed in state-of-the-art mobile telephony standards, such as WiMAX [4] and LTE [5]. However, the processing throughput of the turbo decoding can impose a bottleneck on the transmission throughput in real-time or very throughput-demanding applications, such as flawless, high-quality video conferencing. In dedicated receiver hardware, state-of-the-art turbo decoder Application-Specific Integrated Circuits (ASICs) may be used for eliminating the bottleneck of the turbo decoding. However, this bottleneck is a particular problem in the flexible receiver architectures of Software-Defined Radio (SDR) [6] and virtualized Cloud-Radio Access Network (C-RAN) [7], [8] systems that employ only programmable devices, such as a Central Processing Unit (CPU) or Field-Programmable Gate Array (FPGA), which typically exhibit a limited processing performance capability or a high cost. Although CPUs are capable of carrying out most of the LTE and WiMAX baseband operations, they are not well-suited to the most processor-intensive aspect, namely turbo decoding [9], [10]. Likewise, while high-performance and large-size FPGAs are well-suited to the parallel processing demands of state-of-the-art turbo decoding algorithms, they are relatively expensive. By contrast, General-Purpose Graphics Processing Units (GPGPUs) offer the advantages of high performance parallel processing at a low cost.
Owing to this, GPGPUs have been favoured over CPUs and FPGAs as the basis of SDRs, where a high processing throughput at a low cost is required [11], [12]. This motivates the implementation of the turbo decoding algorithm on a GPGPU, as was first demonstrated in [13] and [14].
However, turbo decoder implementations typically operate on the basis of the serially-oriented Logarithmic Bahl-Cocke-Jelinek-Raviv (Log-BCJR) algorithm [15]. More specifically, this algorithm processes the bits of the message frame using both forward and backward recursions [16], which impose strict data dependencies and hence require processing that is spread over numerous consecutive clock cycles. In order to mitigate the inherent bottleneck that the serial nature of the Log-BCJR algorithm imposes on the achievable processing throughput, the above-mentioned GPGPU implementation of [13] invokes a variety of methods for increasing the parallelism of the algorithm. For example, the windowing technique of [17], [18] decomposes each frame of N bits into P equal-length windows, giving a window length of W = N/P, as shown in Figure 1. The processing throughput may be increased by a factor of P upon processing the windows concurrently, each using separate forward and backward recursions. Here, the Previous Iteration Value Initialization (PIVI) technique of [17] and [18] may be employed for allowing the adjacent windows to assist each other's operation. However, the error correction capability of the PIVI Log-BCJR turbo decoder is degraded as the number P of windows is increased. For this reason, the maximum number of windows employed in previous GPGPU implementations of the LTE turbo decoder associated with N = 6144 was P = 192 [13], [17], [18], which avoids any significant error correction performance degradation and facilitates a 192-fold increase in the degree of parallelism [19]. Furthermore, the concept of trellis state-level parallelism may be employed [12]. More specifically, the forward and backward recursions of the Log-BCJR algorithm operate on the basis of trellises comprising M states per bit [15]. Since there are no data dependencies amongst the calculations performed for each of the M states, these can be performed concurrently.
Since the LTE turbo decoder relies on M = 8 states, the combination of the trellis state-level parallelism and windowing facilitates a degree of parallelism up to P × M = 1536, occupying 1536 concurrent threads on a GPGPU. However, GPGPUs are typically capable of exploiting much higher degrees of parallelism than this [20], implying that the existing GPGPU-based turbo decoder implementations do not exploit the full potential for achieving a high processing throughput. Although a higher degree of parallelism may be achieved by processing several frames in parallel [21], this would only be useful when several frames were available for simultaneous decoding. Furthermore, the act of processing frames in parallel does not improve the processing latency of the turbo decoder, which hence exceeds the tolerable transmission latency of many applications.
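The parallelism figures quoted above can be reproduced with a few lines of arithmetic. The following is a minimal sketch rather than part of any decoder; the function names are hypothetical, and the final ratio merely assumes that a fully parallel decoder would use one thread per bit of the frame.

```python
# Degree of parallelism of a windowed Log-BCJR decoder with state-level
# parallelism, using the LTE figures quoted in the text.

def window_length(n_bits: int, n_windows: int) -> int:
    """Return W = N / P for P equal-length windows."""
    if n_bits % n_windows != 0:
        raise ValueError("N must be divisible by P for equal-length windows")
    return n_bits // n_windows


def concurrent_threads(n_windows: int, n_states: int) -> int:
    """Return P x M, one thread per (window, trellis state) pair."""
    return n_windows * n_states


N, P, M = 6144, 192, 8               # frame length, windows, trellis states
W = window_length(N, P)              # bits per window
THREADS = concurrent_threads(P, M)   # threads used by windowing + states
RATIO = N / THREADS                  # assumption: one thread per bit instead

if __name__ == "__main__":
    print(f"W = {W}, P x M = {THREADS}, N / (P x M) = {RATIO}")
```

With N = 6144 and P = 192 this gives a window length of 32 bits and 1536 threads, so a one-thread-per-bit decoder would employ four times as many threads for the same frame.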
Motivated by these issues, we previously proposed a Fully-Parallel Turbo Decoder (FPTD) algorithm [22], which dispenses with the serial data dependencies of the conventional Log-BCJR turbo decoder algorithm. This enables every bit in a frame to be processed concurrently, hence achieving a much higher degree of parallelism than that previously demonstrated in the literature. Thus, the FPTD is well suited for multi-core processors [23], potentially facilitating a significant processing throughput gain, relative to the state-of-the-art. However, our previous work of [22] considered the FPTD at a purely algorithmic level, without addressing its hardware implementation. Against this background, the contributions of this paper are as follows.
1) We propose a beneficial enhancement of the FPTD algorithm of [22] so that it supports Single Instruction Multiple Data (SIMD) operation and therefore becomes better suited for implementation on a GPGPU. More specifically, we reformulate the FPTD algorithm so that the operations performed for the upper decoder are identical to those carried out by the lower decoder, despite the differences between the treatment of the systematic bits in the upper and lower encoders. The proposed SIMD FPTD algorithm also requires less high-speed memory and has a lower computational complexity compared to the FPTD algorithm of [22], which are desirable characteristics for GPGPU implementations.
2) We propose a beneficial GPGPU implementation of our SIMD FPTD for the LTE turbo code, achieving a throughput of up to 18.7 Mbps. Furthermore, our design overcomes a range of significant challenges related to topological mapping, data rearrangement and memory allocation.
3) We implement a PIVI Log-BCJR LTE turbo decoder on a GPGPU as a benchmarker, achieving a throughput of up to 8.2 Mbps, while facilitating the same BER as our SIMD FPTD having a window size of N/P = 32. This throughput is comparable to the throughputs of 6.8 Mbps and 4 Mbps demonstrated in the pair of state-of-the-art benchmarkers of [13] and [17], respectively.
4) We show that when used for implementing the LTE turbo decoder, the proposed SIMD FPTD achieves a degree of parallelism that is between 4 and 24 times higher, representing a processing throughput improvement of between 2.3 and 9.2 times, as well as a latency reduction of between 2 and 8.2 times. However, this is achieved at the cost of increasing the overall complexity by a factor of between 1.7 and 3.3.
The rest of the paper is organized as follows. Section II provides an overview of GPGPU computing and its employment for the Log-BCJR turbo decoder. Section III introduces our SIMD FPTD algorithm proposed for the implementation of the LTE turbo decoder. Section IV discusses the implementation of the proposed SIMD FPTD using a GPGPU, considering topological mapping, data rearrangement and memory allocation. Section V presents our simulation results, including error correction performance, degree of parallelism, processing latency, processing throughput and complexity. Finally, Section VI offers our conclusions.
II. GPU COMPUTING AND IMPLEMENTATIONS
GPUs offer a flexible throughput-oriented processing architecture, which was originally designed for facilitating massively parallel numerical computations, such as 3D image graphics [9] and physics simulations [24]. Additionally, the GPGPU technology provides an opportunity to utilize the GPU's capability to perform several trillion Floating-point Operations Per Second (FLOPS) for general-purpose applications, such as the computations performed in an SDR platform. In particular, the Compute Unified Device Architecture (CUDA) [20] platform offers a software programming model, which enables programmers to efficiently exploit a GPGPU's computational units for general-purpose computations. As shown in Figure 2, a programmer may specify GPGPU instructions using CUDA kernels, which are software subroutines that may be called by the host CPU and then executed on the GPU's computational units. CUDA manages these computational units at three levels, corresponding to the grids, thread blocks and threads. Each call of a kernel invokes a grid, which typically comprises many thread blocks, each of which typically comprises many threads, as shown in Figure 2. However, during the kernel's execution, all threads are grouped into warps, each of which comprises 32 threads. Each warp is operated in a Single Instruction Multiple Data (SIMD) [25] fashion, with all of the 32 constituent threads executing identical instructions at the same time, but on different data elements.
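To make the grid, thread block and warp hierarchy described above concrete, the following minimal sketch maps a flat thread index onto a thread block, a warp within that block and a lane within that warp. It is plain Python rather than CUDA, the function names are hypothetical, and the thread block size of 128 used in the example is an assumption chosen purely for illustration; only the warp size of 32 is taken from the text.

```python
WARP_SIZE = 32  # threads per warp, as stated in the text


def locate_thread(global_id: int, threads_per_block: int):
    """Map a flat thread index to (block, warp in block, lane in warp)."""
    if threads_per_block % WARP_SIZE != 0:
        raise ValueError("block size is assumed to be a multiple of the warp size")
    block, local = divmod(global_id, threads_per_block)
    warp, lane = divmod(local, WARP_SIZE)
    return block, warp, lane


def warps_needed(n_threads: int) -> int:
    """Number of warps required to hold n_threads (ceiling division)."""
    return -(-n_threads // WARP_SIZE)


if __name__ == "__main__":
    # Hypothetical launch: 128 threads per block.
    print(locate_thread(200, 128))
    print(warps_needed(6144))
```

Threads sharing the same block and warp indices execute the same instruction together, which is why any branch that splits the lanes of a warp is costly.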
There are several different types of memory in a GPU, including global memory, shared memory and registers, as shown in Figure 2 . Each different type of memory has different properties, which may be best exploited in different circumstances in order to optimize the performance of the application. More specifically, global memory is an off-chip memory that typically has a large capacity accessible from the host CPU, as well as from any thread on the GPU. Global memory is typically used for exchanging data between the CPU and the GPU, although it has the highest access latency and most limited bandwidth, compared to the other types of GPU memory. By contrast, shared memory is user-controlled on-chip cache, which has very high bandwidth (bytes/second) and extremely low access latency. However, shared memory has a limited capacity and an access scope that is limited to a single thread block. Owing to this, a thread in a particular thread block cannot access any shared memory allocated to any other thread block. Furthermore, all data stored in shared memory will be released automatically once the execution of the corresponding thread block is completed. In comparison to global and shared memory, registers have the largest bandwidth and the smallest access latency. However, registers have very limited capacity and their access scope is limited to a single thread.
Considering these features, many previous research projects have explored the employment of GPGPUs in SDR applications, as shown in Figure 3 . Note that the GPGPU-based virtualized C-RAN implementation has not been exploited, although a C-RAN system has been implemented on the Amazon Elastic Compute Cloud (Amazon EC2) [48] using only CPUs. More specifically, [37] compared several different SDR implementation approaches in terms of programmability, flexibility, energy consumption and computing power. In particular, [37] recommended the employment of GPGPU as a co-processor to complement an ASIC, FPGA or Digital Signal Processor (DSP). Additionally, [33] characterized the performance of GPGPUs, when employed for three different operations, namely Fast Fourier Transform (FFT), Quadrature Phase Shift Keying (QPSK) demapper and Infinite Impulse Response (IIR) filter. Similarly, [41] compared the processing throughput and energy efficiency of a particular FPGA and a particular GPGPU, when implementing both the FFT and a Finite Impulse Response (FIR) filter.
As shown in Figure 3 , [12] , [26] , [31] , [32] , [42] implemented an entire transmitter, receiver or transceiver for the LTE or WiMAX standard on a SDR platform that employs GPGPUs. Additionally, [11] and [35] implemented a soft-output Multiple-Input Multiple-Output (MIMO) detector, while [34] implemented the Digital Video Broadcasting (DVB) physical layer on a GPGPU. All of these previous research efforts demonstrated that GPGPUs offer an improved processing throughput, compared to the family of implementations using only a CPU. Furthermore, [12] showed that an LTE base station supporting a peak data rate of 150 Mbps can be implemented using four NVIDIA GTX 680 GPUs, achieving a similar energy efficiency to a particular dedicated LTE baseband hardware. However, [12] and [42] demonstrated that turbo decoding is the most processor-intensive operation of basestation processing, requiring at least 64% of the processing resources used for receiving a message frame, where the remaining 36% includes the FFT, demapping, demodulation and other operations. Motivated by this, a number of previous research efforts [13] , [14] , [17] , [18] , [21] , [38] , [39] , [43] - [45] have proposed GPGPU implementations dedicated to turbo decoding, as shown in Figure 3 . Additionally, the authors of [27] - [30] , [36] , [40] , [46] , and [47] have proposed GPGPU implementations of LDPC decoders. 
III. SINGLE-INSTRUCTION-MULTIPLE-DATA FULLY-PARALLEL TURBO DECODER ALGORITHM
In this section, the operation of the proposed SIMD FPTD algorithm is detailed in Section III-A and it is compared with the FPTD algorithm of [22] in Section III-B.
A. OPERATION OF THE PROPOSED SIMD FPTD ALGORITHM
In this section, we detail our proposed SIMD FPTD algorithm for the LTE turbo decoder, using the schematic of Figure 4 (a). The corresponding turbo encoder is not illustrated in this paper, since it is identical to the conventional LTE turbo encoder [5] . As in the PIVI Log-BCJR turbo decoder, the proposed SIMD FPTD employs an upper decoder and a lower decoder, which are separated by an interleaver. Accordingly, Figure 4 Throughout the remainder of this paper, the superscripts 'u' and 'l' are used only when necessary to explicitly distinguish between the upper and lower components of the turbo code and are omitted when the discussion applies equally to both.
As in the PIVI Log-BCJR turbo decoder, the proposed SIMD FPTD algorithm employs two half-iterations per decoder iteration. However, the two half-iterations do not correspond to the separate operation of the upper and lower decoders, as they do in the PIVI Log-BCJR turbo decoder. Furthermore, during each half-iteration, the proposed SIMD FPTD algorithm does not operate the algorithmic blocks of Figure 4(a) in a serial manner, using forward and backward recursions. Instead, the first half-iteration performs the fully-parallel operation of the lightly-shaded algorithmic blocks of Figure 4(a) concurrently, namely the odd-indexed blocks of the upper decoder and the even-indexed blocks of the lower decoder. Furthermore, the second half-iteration performs the concurrent operation of the remaining darkly-shaded algorithmic blocks of Figure 4(a), in a fully-parallel manner. This decomposition of the algorithmic blocks into odd-even algorithmic blocks is motivated by the odd-even nature of the Quadratic Permutation Polynomial (QPP) interleaver [19] used by the LTE turbo code and the Almost Regular Permutation (ARP) interleaver used by the WiMAX turbo code [4]. More explicitly, QPP and ARP interleavers only connect algorithmic blocks in the upper decoder that have an odd index k to specific blocks that also have an odd index in the lower decoder. Similarly, even-indexed blocks in the upper decoder are only connected to even-indexed blocks in the lower decoder. It is this fully-parallel operation of algorithmic blocks that yields a significantly higher degree of parallelism than the PIVI Log-BCJR turbo decoder algorithm, as well as a significantly higher decoding throughput.
More specifically, rather than requiring tens or hundreds of consecutive time periods to complete the forward and backward recursions in each window of the PIVI Log-BCJR turbo decoder, the proposed SIMD FPTD algorithm completes each half-iteration using only a single time period, during which all algorithmic blocks in the corresponding set are operated concurrently. Note also that this odd-even concurrent operation of algorithmic blocks in the upper and lower decoder represents a significant difference between the FPTD algorithm and a PIVI Log-BCJR decoder employing a window length of W = 1, as considered in [38]. More specifically, a PIVI Log-BCJR decoder having a window length of W = 1 may require as many as I = 65 iterations to maintain a similar BER performance as a PIVI Log-BCJR decoder having a window length of W = 32 and I = 7 iterations [38]. By contrast, by taking advantage of the odd-even feature, our FPTD algorithm requires only I = 36 iterations to achieve this, as will be detailed in Section V-A.
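The odd-even property of the QPP interleaver can be checked numerically. The sketch below assumes the standard QPP form pi(i) = (f1·i + f2·i²) mod N with 0-based index i, and uses the parameters f1 = 3 and f2 = 10 that the LTE specification lists for the shortest frame length of N = 40. The function names are hypothetical. Because f1 is odd, while f2 and N are even, pi(i) always has the same parity as i, so odd-indexed blocks are only ever connected to odd-indexed blocks.

```python
def qpp_interleave(i: int, n: int, f1: int, f2: int) -> int:
    """QPP interleaver: pi(i) = (f1*i + f2*i^2) mod n, with 0-based i."""
    return (f1 * i + f2 * i * i) % n


def preserves_parity(n: int, f1: int, f2: int) -> bool:
    """True if every 1-based index k maps to an index of the same parity."""
    for i in range(n):
        k_in = i + 1                              # 1-based index, as in the text
        k_out = qpp_interleave(i, n, f1, f2) + 1
        if k_in % 2 != k_out % 2:
            return False
    return True


N, F1, F2 = 40, 3, 10   # shortest LTE frame length and its QPP parameters
PERMUTATION = [qpp_interleave(i, N, F1, F2) for i in range(N)]

if __name__ == "__main__":
    print(PERMUTATION[:8])
    print("odd-even preserved:", preserves_parity(N, F1, F2))
```

This property is what allows the odd-indexed upper blocks and even-indexed lower blocks to be operated in one half-iteration without either set waiting for an LLR produced by the other.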
In the $t$-th time period of the proposed SIMD FPTD, each algorithmic block of the corresponding odd or even shading having an index $k \in \{1, 2, 3, \ldots, N\}$ accepts five inputs and generates three outputs, as shown in Figure 4(a). In addition to the LLRs $\bar{b}^{a,t-1}_{1,k}$, $\bar{b}^{a}_{2,k}$ and $\bar{b}^{a}_{3,k}$, the $k$-th algorithmic block requires the vectors $\bar{\alpha}^{t-1}_{k-1}$ and $\bar{\beta}^{t-1}_{k}$ of forward and backward state metrics. The $k$-th algorithmic block combines these inputs using four steps, which correspond to Equations (1), (2), (3) and (4), respectively. As in the conventional Log-BCJR turbo decoder, (1) obtains an a priori metric $\bar{\gamma}^{t}_{k}(S_{k-1}, S_k)$ for the transition between a particular pair of states $S_{k-1}$ and $S_k$. As shown in Figure 5, for the case of the LTE turbo code, this transition implies a particular binary value for the corresponding message bit $b_{1,k}$, parity bit $b_{2,k}$ and systematic bit $b_{3,k}$. Since the message and systematic bits implied by each transition are identical in the LTE turbo code, $\bar{\gamma}^{t}_{k}(S_{k-1}, S_k)$ can adopt only four possible values, namely $\bar{b}^{a}_{2,k}$, $(\bar{b}^{a,t-1}_{1,k} + \bar{b}^{a}_{3,k})$, $(\bar{b}^{a,t-1}_{1,k} + \bar{b}^{a}_{2,k} + \bar{b}^{a}_{3,k})$ and zero. All four of these possible values can be calculated using as few as two additions, as shown in Figure 6, which provides an optimized datapath for the $k$-th algorithmic block of the proposed SIMD FPTD. Following this, (2) and (3) may be employed to obtain the state metrics $\bar{\alpha}^{t}_{k}$ and $\bar{\beta}^{t}_{k-1}$, respectively. Here, $c(S_{k-1}, S_k)$ adopts a binary value of 1, if there is a transition between the states $S_{k-1}$ and $S_k$ in the state transition diagram of Figure 5, while

$$\max{}^{*}(A, B) = \max(A, B) + \ln\left(1 + e^{-|A - B|}\right) \qquad (5)$$

is the Jacobian logarithm [16], as is employed by the Log-BCJR decoder. Note that the Jacobian logarithm may be approximated as

$$\max{}^{*}(A, B) \approx \max(A, B) \qquad (6)$$

in order to reduce the computational complexity, as in the Max-Log-BCJR. Note that for those transitions having a metric $\bar{\gamma}^{t}_{k}(S_{k-1}, S_k)$ of zero, the corresponding terms in (2) and (3) can be ignored, hence reducing the number of additions required. This is shown in the optimized datapath of Figure 6. Finally, (4) may be employed for obtaining the extrinsic LLR $\bar{b}^{e,t}_{1,k}$, as shown in Figure 6. This LLR may then be output by the algorithmic block, as shown in Figure 4(a).
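The four steps of a single algorithmic block can be sketched in software as follows. This is a reconstruction from the textual description only, not the authors' Algorithm 1: the forms of Equations (1) to (4) are written out in the docstring as they are understood here, the state numbering of the trellis is an assumption that may differ from Figure 5, any normalization of the state metrics is omitted, and all function names are hypothetical. The trellis uses the LTE constituent code with feedback polynomial 1 + D² + D³ and feedforward polynomial 1 + D + D³.

```python
import math

NEG_INF = float("-inf")


def max_star(a: float, b: float, exact: bool = True) -> float:
    """Jacobian logarithm, or the Max-Log approximation if exact is False."""
    m = max(a, b)
    if not exact or m == NEG_INF:
        return m
    return m + math.log1p(math.exp(-abs(a - b)))


def lte_transitions():
    """Return (prev_state, next_state, b1, b2) for the 16 trellis transitions.
    Assumed labelling: state = 4*s1 + 2*s2 + s3, with s1 the newest register."""
    out = []
    for s in range(8):
        s1, s2, s3 = (s >> 2) & 1, (s >> 1) & 1, s & 1
        for b1 in (0, 1):
            a = b1 ^ s2 ^ s3                 # feedback 1 + D^2 + D^3
            b2 = a ^ s1 ^ s3                 # parity   1 + D + D^3
            out.append((s, (a << 2) | (s1 << 1) | s2, b1, b2))
    return out


def fptd_block(b1a, b2a, b3a, alpha_prev, beta_next, exact=True):
    """One algorithmic block, as reconstructed from the text:
    (1) gamma = b1*b1a + b2*b2a + b3*b3a, with b3 == b1 for LTE
    (2) alpha_k(S_k)      = max* over S_{k-1} of [gamma + alpha_{k-1}(S_{k-1})]
    (3) beta_{k-1}(S_{k-1}) = max* over S_k   of [gamma + beta_k(S_k)]
    (4) extrinsic = max*_{b1=1}[b2*b2a + alpha + beta] - max*_{b1=0}[same]
    """
    b13 = b1a + b3a                          # first of the two additions
    b123 = b13 + b2a                         # second of the two additions
    gamma_of = {(0, 0): 0.0, (0, 1): b2a, (1, 0): b13, (1, 1): b123}
    alpha, beta = [NEG_INF] * 8, [NEG_INF] * 8
    e1 = e0 = NEG_INF
    for s, n, b1, b2 in lte_transitions():
        gamma = gamma_of[(b1, b2)]
        alpha[n] = max_star(alpha[n], gamma + alpha_prev[s], exact)
        beta[s] = max_star(beta[s], gamma + beta_next[n], exact)
        ext = (b2a if b2 else 0.0) + alpha_prev[s] + beta_next[n]
        if b1:
            e1 = max_star(e1, ext, exact)
        else:
            e0 = max_star(e0, ext, exact)
    return e1 - e0, alpha, beta
```

For example, calling `fptd_block(0.0, 2.0, 0.0, [0.0] + [NEG_INF] * 7, [0.0] * 8, exact=False)` models the first block of a frame, where the trellis is known to start in state zero, and returns an extrinsic LLR equal to the parity LLR of 2.0, since only the two transitions leaving state zero remain possible.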
When operating the $k$-th algorithmic block in the first half-iteration of the iterative decoding process, the a priori message LLR provided by the other row is unavailable, hence it is initialized as $\bar{b}^{a,t-1}_{1,k} = 0$, accordingly. Likewise, the forward state metrics from the neighboring algorithmic blocks are unavailable, hence these are initialized as $\bar{\alpha}^{t-1}_{k-1} = [0, 0, \ldots, 0]$. However, in the case of the $k = 1$st algorithmic block, we employ $\bar{\alpha}_{0} = [0, -\infty, -\infty, \ldots, -\infty]$ in all decoding iterations, since the LTE trellis is guaranteed to start from an initial state of $S_0 = 0$. Similarly, before operating the $k$-th algorithmic block in the first half-iteration, we employ $\bar{\beta}^{t-1}_{k} = [0, 0, \ldots, 0]$, while the final backward state metrics are fixed as $\bar{\beta}_{N+3} = [0, -\infty, -\infty, \ldots, -\infty]$, since the LTE turbo coding employs three termination bits to guarantee $S_{N+3} = 0$. Note that (1), (2), (3) and (4) reveal that $\bar{\beta}_{N}$ is independent of $\bar{\alpha}_{N}$. Therefore, the algorithmic blocks with indices $k \in [N + 1, N + 3]$, shown as unshaded blocks in Figure 4(a), can be processed before and independently of the iterative decoding process. This may be achieved by employing only Equations (1) and (3), where the term $b_3(S_{k-1}, S_k) \cdot \bar{b}^{a}_{3,k}$ is omitted from (1). More specifically, these equations are employed in a backward recursion, in order to successively calculate $\bar{\beta}_{N+2}$, $\bar{\beta}_{N+1}$ and $\bar{\beta}_{N}$, the latter of which is employed throughout the iterative decoding process by the $N$-th algorithmic block.
B. COMPARISON WITH THE FPTD ALGORITHM OF [22]
In this section, we compare the proposed SIMD FPTD algorithm with the original FPTD algorithm of [22]. In particular, we compare the operation, temporary storage requirements and computational complexity of these decoders. Note that in analogy to (1), the FPTD algorithm of [22] employs a summation of three a priori LLRs, when operating the algorithmic blocks of the upper row having an index $k \in \{1, 2, 3, \ldots, N\}$. However, a summation of just two a priori LLRs is employed for the corresponding blocks in the lower row of the FPTD algorithm of [22], since in this case the term $b_3(S_{k-1}, S_k) \cdot \bar{b}^{a}_{3,k}$ is omitted from the equivalent of (1). By contrast, the proposed SIMD FPTD algorithm employs (1) in all algorithmic blocks, ensuring that all of them operate in an identical manner, hence facilitating SIMD operation, which is desirable for GPGPU implementations. This is achieved by including the a priori systematic LLR $\bar{b}^{a}_{3,k}$ in the calculation of (1), regardless of whether the algorithmic block appears in the upper or the lower row. Furthermore, in contrast to the FPTD algorithm of [22], $\bar{b}^{a}_{3,k}$ is omitted from the calculation of (4), regardless of which row the algorithmic block appears in.
A further difference between the proposed SIMD FPTD algorithm and the original FPTD algorithm of [22] is motivated by reductions in memory usage and computational complexity. More specifically, the algorithmic blocks of the proposed SIMD FPTD algorithm are redesigned to use fewer intermediate variables and computations. In particular, the transition metric $\bar{\gamma}_{k}(S_{k-1}, S_k)$ of (1) can only adopt three non-zero values, as described above. By contrast, the original FPTD algorithm of [22] needs to calculate and store a different transition metric $\bar{\delta}_{k}(S_{k-1}, S_k)$ for each of the sixteen transitions. The proposed approach allows a greater proportion of the intermediate variables to be stored in the GPGPU's limited number of low-latency registers, with less reliance on its high-latency memory. Since the GPGPU's low-latency registers are shared among all N algorithmic blocks, the benefit of reducing the reliance of each block on intermediate variables is magnified N times. Owing to this, a slight reduction in the memory usage of each algorithmic block results in a large reduction in the total memory usage, especially when N is large.
Furthermore, while the proposed SIMD FPTD algorithm, the original FPTD algorithm of [22] and the PIVI Log-BCJR decoder all require the same number of max* operations per algorithmic block, the proposed SIMD FPTD algorithm requires the fewest additions and subtractions. More specifically, as shown in the optimized datapath for the LTE turbo code of Figure 6, the proposed SIMD FPTD algorithm requires only 45 additions and subtractions per algorithmic block. This is approximately 5% lower than the 47.5 additions and subtractions required by the original FPTD algorithm of [22], as well as approximately 19% lower than the 55.5 required by the PIVI Log-BCJR decoder. Note that this computational complexity reduction is achieved by exploiting the relationship $\max{}^{*}(A + C, B + C) = \max{}^{*}(A, B) + C$ [50]. This relationship holds for both the exact max* of (5) and the approximate max* of (6). More specifically, (4) requires sixteen additions for obtaining $\bar{\alpha}_{k-1} + \bar{\beta}_{k}$ for the sixteen transitions in the LTE trellis, eight of which also require an extra addition for obtaining $\bar{\alpha}_{k-1} + \bar{\beta}_{k} + \bar{b}^{a}_{2,k}$, before the max* operation. By grouping the transitions carefully, the additions of $\bar{b}^{a}_{2,k}$ can be moved to after the max* operation. Owing to this, only two additions are required, rather than eight, as shown in Figure 6. Note that the datapath of Figure 6 has been specifically optimized for the LTE turbo code. By contrast, the FPTD algorithm of [22] is optimized for general turbo code applicability, yielding a more desirable design in the case of the duo-binary WiMAX turbo code [4], for example.
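The relationship max*(A + C, B + C) = max*(A, B) + C and the quoted percentage savings can both be verified numerically. The following minimal sketch uses hypothetical function names and arbitrary test values for A, B and C; the operation counts of 45, 47.5 and 55.5 are taken directly from the text.

```python
import math


def max_star_exact(a: float, b: float) -> float:
    """Jacobian logarithm: max(a, b) + ln(1 + exp(-|a - b|))."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))


def max_star_approx(a: float, b: float) -> float:
    """Max-Log approximation of the Jacobian logarithm."""
    return max(a, b)


def saving_percent(proposed: float, reference: float) -> float:
    """Relative reduction in additions and subtractions, in percent."""
    return 100.0 * (reference - proposed) / reference


A, B, C = 0.7, -1.3, 2.5   # arbitrary illustrative values

# Adding the common term C before or after max* gives the same result,
# so a shared LLR can be added once per group instead of once per transition.
SHIFT_OK_EXACT = math.isclose(max_star_exact(A + C, B + C),
                              max_star_exact(A, B) + C)
SHIFT_OK_APPROX = math.isclose(max_star_approx(A + C, B + C),
                               max_star_approx(A, B) + C)

VS_FPTD = saving_percent(45.0, 47.5)      # versus the FPTD of [22]
VS_LOG_BCJR = saving_percent(45.0, 55.5)  # versus the PIVI Log-BCJR

if __name__ == "__main__":
    print(SHIFT_OK_EXACT, SHIFT_OK_APPROX)
    print(f"{VS_FPTD:.1f}% and {VS_LOG_BCJR:.1f}%")
```

The two savings evaluate to roughly 5.3% and 18.9%, in agreement with the approximate figures of 5% and 19% given above.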
IV. IMPLEMENTATION OF THE SIMD FPTD ALGORITHM ON A GPGPU
This section describes the implementation of the proposed SIMD FPTD algorithm using an NVIDIA GPGPU platform, adopting the Compute Unified Device Architecture (CUDA) [20] . The mapping of the SIMD FPTD algorithm onto the GPGPU and its memory allocation are discussed in Sections IV-A and IV-B, respectively. The pseudo code of the proposed GPGPU kernel designed for implementing the SIMD FPTD algorithm is described in Section IV-C.
A. MAPPING THE SIMD FPTD ALGORITHM ONTO A GPGPU
The proposed SIMD FPTD algorithm of Figure 4(a) may be mapped onto a CUDA GPGPU using a single kernel. Here, two approaches are compared. In the first approach, each execution of the kernel performs one half iteration of the proposed algorithm, requiring 2I kernel repetitions in order to complete I decoding iterations. For this approach, the GPU kernel repetitions are scheduled serially by the CPU, which also achieves the synchronization between each pair of consecutive half iterations. This synchronization ensures that all parts of a particular half iteration are completed, before any parts of the next half iteration begin. However, according to our experimental results, this synchronization occupies an average of 31.3% of the total processing time, owing to the communication overhead between the CPU and the GPU. For this reason, our second approach performs all 2I half iterations within a single GPU kernel run, eliminating the requirement for any communication between the CPU and the GPU during the iterative decoding process. However, the inter-block synchronization has to be carried out by the GPU in order to maintain the odd-even nature of the operation. Since CUDA GPGPUs do not have any native support for inter-block synchronization, here we include the lock-free inter-block synchronization technique of [51]. We perform this synchronization at the end of every half iteration, which reduces the time dedicated to the synchronization from 31.3% to 15.5%, according to our experimental results. Owing to this superior performance compared to CPU synchronization, inter-block synchronization on the GPU is used for our proposed FPTD implementation and its performance is characterized in Section V.
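The effect of the reduced synchronization share on the total processing time can be estimated with a short calculation. This is only a sketch under a strong assumption that is not stated in the text, namely that the time spent on everything other than synchronization is identical in the two approaches. Under that assumption, a total time T satisfies T = T_work / (1 - s) for a synchronization share s, so the ratio of the two total times follows directly. The function name is hypothetical.

```python
def relative_speedup(sync_share_before: float, sync_share_after: float) -> float:
    """Ratio of total times, assuming the non-synchronization time is unchanged.

    T_before = T_work / (1 - s_before) and T_after = T_work / (1 - s_after),
    hence T_before / T_after = (1 - s_after) / (1 - s_before).
    """
    for share in (sync_share_before, sync_share_after):
        if not 0.0 <= share < 1.0:
            raise ValueError("a share must lie in the range [0, 1)")
    return (1.0 - sync_share_after) / (1.0 - sync_share_before)


SPEEDUP = relative_speedup(0.313, 0.155)   # shares quoted in the text

if __name__ == "__main__":
    print(f"estimated speedup from GPU-side synchronization: {SPEEDUP:.2f}x")
```

With the quoted shares of 31.3% and 15.5%, this estimate suggests a total processing time that is shorter by a factor of about 1.23, although the measured gain depends on how the remaining work is actually affected.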
Our kernel employs N threads, with one for each of the N algorithmic blocks that are operated within each half iteration of Figure 4(a). Here, the $k$-th thread processes the $k$-th algorithmic block in the upper or lower row according to the odd-even arrangement of Figure 4(a), where $k \in [1, N]$. Note that it would be possible to achieve further parallelism by employing eight threads per algorithmic block, rather than just one. This would facilitate state-level parallelism as described in Section I for the conventional GPGPU implementation of the PIVI Log-BCJR turbo decoder. However, our experiments reveal that state-level parallelism offers no advantage for the proposed SIMD FPTD algorithm. More specifically, according to the Nsight profiler of [52], the processing throughput of the proposed FPTD implementation is bounded by the memory bandwidth rather than memory access latency, which implies that the parallelism of N is already large enough to make the most of the GPGPU's computing resources. Furthermore, employing state-level parallelism would result in a requirement for more accesses of the global memory, in order to load the a priori LLRs $\bar{b}^{a}_{1,k}$, $\bar{b}^{a}_{2,k}$ and $\bar{b}^{a}_{3,k}$, which would actually degrade the throughput. The algorithmic blocks of the proposed SIMD FPTD algorithm are arranged in groups of 32, in order for the corresponding threads to form warps, which are particularly suited to SIMD operation. In order to maximize the computation throughput, special care must be taken to avoid thread divergence. This arises when 'if' and 'else' statements cause the different threads of a warp to operate differently, resulting in the serial processing of each possible outcome. However, the schematic of Figure 4(a) is prone to thread divergence, since each half iteration comprises the operation of algorithmic blocks in both the upper and the lower row, as indicated using light and dark shading.
More specifically, 'if' and 'else' statements are required to determine whether each algorithmic block resides in the top or bottom row of Figure 4(a), when deciding which inputs and outputs to consider. This motivates the alternative design of Figure 4(b), in which all algorithmic blocks within the same half iteration have been relocated to the same row in order to avoid these 'if' and 'else' statements. More specifically, the algorithmic blocks that have an even index in the upper row have been swapped with those from the lower row. As a result, the upper row comprises the lightly-shaded blocks labeled $u_k$ for odd $k$ and $l_k$ for even $k$, whilst the lower row comprises the darkly-shaded blocks labeled $u_k$ for even $k$ and $l_k$ for odd $k$. Consequently, the operation of alternate half iterations corresponds to the alternate operation of the upper and lower rows of Figure 4(b). Note that this rearrangement of algorithmic blocks requires a corresponding rearrangement of inputs, outputs and memory, as will be discussed in Section IV-B.
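The rearrangement described above can be expressed as a simple index rule. The following minimal sketch, with hypothetical function names and 1-based indices as in the text, lists which algorithmic block occupies each position of the rearranged rows. Since each half iteration then operates a single complete row, every thread follows the same code path and no per-thread branch on the row is needed.

```python
def rearranged_row(decoder: str, k: int) -> str:
    """Row of the rearranged schematic that holds block u_k or l_k.

    decoder is 'u' (upper decoder) or 'l' (lower decoder); k is 1-based.
    Even-indexed blocks are swapped between the rows, odd-indexed stay.
    """
    if decoder not in ("u", "l"):
        raise ValueError("decoder must be 'u' or 'l'")
    stays = k % 2 == 1
    if decoder == "u":
        return "upper" if stays else "lower"
    return "lower" if stays else "upper"


def blocks_in_half_iteration(half: int, n: int):
    """Blocks operated in half iteration 0 (upper row) or 1 (lower row)."""
    if half not in (0, 1):
        raise ValueError("half must be 0 or 1")
    return [("u" if (k % 2 == 1) == (half == 0) else "l", k)
            for k in range(1, n + 1)]


if __name__ == "__main__":
    print(blocks_in_half_iteration(0, 6))   # upper row of the rearrangement
    print(blocks_in_half_iteration(1, 6))   # lower row of the rearrangement
```

For n = 4, the first half iteration operates u1, l2, u3, l4 and the second operates l1, u2, l3, u4, matching the light and dark shading described in the text.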
As described in Section III, the consideration of the termination bits by the three algorithmic blocks at the end of the upper and lower rows can be isolated from the operation of the iterative processes. Therefore, we recommend the processing of all termination bits using the CPU, before beginning the iterative decoding process on the GPGPU. This aids the mapping of algorithmic blocks to warps and also avoids thread divergence, since the processing of the termination bits is not identical to that of the other bits, as shown in Figure 4(b) .
B. DATA ARRANGEMENT AND MEMORY ALLOCATION
Note that because the proposed SIMD FPTD employs the rearranged schematic of Figure 4(b) rather than that of Figure 4(a), the corresponding datasets must also be rearranged, using swaps and mergers. More specifically, for the a priori parity LLRs $\bar{b}^{a}_{2}$ and the systematic LLRs $\bar{b}^{a}_{3}$, the rearrangement can be achieved by swapping the corresponding elements in the upper and lower datasets, following the same rule that was applied to the algorithmic blocks of Figure 4(b). For the forward and backward metrics $\bar{\alpha}$ and $\bar{\beta}$, as well as for the a priori message LLRs $\bar{b}^{a}_{1}$, the rearrangement can be achieved by merging the two separate datasets for the upper and lower rows together. Furthermore, there is no need to store both the a priori and the extrinsic LLRs, since interleaving can be achieved by writing the latter into the memory used for storing the former, but in an interleaved order.
Note that this arrangement also offers the benefit of minimizing memory usage, which is achieved without causing any overwriting, as shown in Figure 7. More explicitly, the $k$th memory slot $M_k$ of Figure 4(b) may be used for passing the $k$th forward state metrics $\bar{\alpha}^{u/l}_k$ between the algorithmic blocks $u_k/l_k$ and $u_{k+1}/l_{k+1}$, for example. During the first half iteration, the upper algorithmic block $u_k$ is operated to obtain $\bar{\alpha}^u_k$, which is stored in $M_k$. Then during the second half iteration, this data stored in $M_k$ will be provided to the algorithmic block $u_{k+1}$, before it is overwritten by the new data $\bar{\alpha}^l_k$, which is provided by the algorithmic block $l_k$. As illustrated in Figure 4(b), there are a total of seven datasets that must be stored throughout the decoding process, requiring an overall memory resource of $21N$ floating-point numbers. As shown in Figure 4(b), these datasets are stored in the global memory, since it has a large capacity and is accessible from the host CPU, as well as from any thread in the GPGPU device. However, the global memory has a relatively high access latency and a limited bandwidth. In order to minimize the impact of this, each algorithmic block employs local low-latency registers to store all intermediate variables that are required multiple times within a half iteration. More specifically, the $k$th algorithmic block uses registers to store $\bar{b}^a_{2,k}$, $(\bar{b}^a_{1,k} + \bar{b}^a_{3,k})$, $(\bar{b}^a_{1,k} + \bar{b}^a_{2,k} + \bar{b}^a_{3,k})$, $\bar{\alpha}_{k-1}$ and $\bar{\beta}_k$, comprising a total of 19 floating-point numbers, as shown in Figure 6.
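The memory figures quoted above can be reproduced with a short Python sketch. The per-dataset breakdown is an assumption inferred from Section IV-B (separate upper and lower datasets for the parity and systematic LLRs, merged datasets for the rest), chosen because it yields seven datasets and the stated total of 21N. The function names are hypothetical.

```python
M = 8  # number of trellis states in the LTE turbo code


def global_memory_floats(n):
    """Floating-point numbers held in global memory for an n-bit frame.

    Assumed breakdown (inferred, not listed explicitly in the text):
    two parity LLR datasets, two systematic LLR datasets, one merged
    a priori message LLR dataset, and one merged dataset each for the
    forward and backward state metrics (M values per bit).
    """
    parity = 2 * n
    systematic = 2 * n
    a_priori_message = n
    forward_metrics = M * n
    backward_metrics = M * n
    return parity + systematic + a_priori_message + forward_metrics + backward_metrics


def registers_per_thread():
    """Three cached LLR terms plus M forward and M backward state metrics."""
    return 3 + 2 * M


n = 6144
print(global_memory_floats(n) // n)  # 21
print(registers_per_thread())        # 19
print(global_memory_floats(n) * 4)   # bytes at single precision
```

For the longest LTE frame this amounts to about half a megabyte of single-precision storage, which is small compared with the global memory of a typical GPGPU.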
C. PSEUDO CODE
Algorithm 1 describes the operation of the $k$th thread dedicated to the computation of the $k$th algorithmic block, in analogy to the datapath of Figure 6. Note that the labels of Register (R) and Global memory (G) shown in Algorithm 1 indicate the type of the memory used for storing the corresponding data. Each thread is grouped into four steps as follows. The first step caches the a priori LLR $\bar{b}^a_{2,k}$ and the a priori state metrics $\bar{\alpha}_{k-1}$ and $\bar{\beta}_k$ from the global memory to the local registers. Furthermore, the first step computes $\bar{b}^a_{13} = \bar{b}^a_{1,k} + \bar{b}^a_{3,k}$ and $\bar{b}^a_{123} = \bar{b}^a_{1,k} + \bar{b}^a_{2,k} + \bar{b}^a_{3,k}$, before storing the results in the local registers. Following this, the second and third steps compute the extrinsic forward state metrics $\bar{\alpha}^t_k$ and the extrinsic backward state metrics $\bar{\beta}^t_{k-1}$, in analogy to the datapath of Figure 6. The results of these computations are written directly into the corresponding memory slot $M_k$ in the global memory, as shown in Figure 4(b). In the fourth step, the extrinsic LLR $\bar{b}^{e,t}_{1,k}$ is computed and stored in the global memory. Here, interleaving or deinterleaving is achieved by storing the extrinsic LLRs into particular global memory slots selected according to the design of the LTE interleaver. Note that the intermediate values of $\bar{\delta}_0$ and $\bar{\delta}_1$ require the storage of two floating-point numbers in registers, as shown in Algorithm 1. However, instead of using two new registers, they can be stored respectively in the registers that were previously used for storing the values of $\bar{b}^a_{13}$ and $\bar{b}^a_{123}$, since these are not required in the calculations of the fourth step. As a result, a total of 19 registers are required per thread, as discussed above.
Algorithm 1 A Kernel for Computing a Half-Iteration of the Proposed SIMD FPTD Algorithm
Step 1: Loading data
Step 2: Computing forward state metrics
Step 3: Computing backward state metrics
Step 4: Computing the extrinsic LLR, which is written to the global memory slot indexed by $\pi(k)$ as $\bar{\delta}_1 - \bar{\delta}_0$
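The interleaved write of the fourth step can be sketched in Python as follows. The sketch assumes the quadratic permutation polynomial form of the LTE interleaver with 0-based indices. The coefficients f1 = 263 and f2 = 480 for N = 6144 are quoted from memory and should be checked against the LTE specification. The function names are hypothetical, and the loop serializes what the GPGPU performs with one thread per bit.

```python
def qpp_interleaver(n, f1, f2):
    """Return pi with pi[k] = (f1*k + f2*k*k) mod n for k = 0..n-1."""
    return [(f1 * k + f2 * k * k) % n for k in range(n)]


def write_extrinsic(delta0, delta1, pi):
    """Serialized step 4: slot pi[k] receives delta1[k] - delta0[k]."""
    memory = [0.0] * len(pi)
    for k, slot in enumerate(pi):  # one pass stands in for one GPU thread
        memory[slot] = delta1[k] - delta0[k]
    return memory


# Assumed coefficients for N = 6144 (verify against the LTE specification).
pi = qpp_interleaver(6144, 263, 480)

# The mapping is a permutation, so no two threads write to the same slot.
print(sorted(pi) == list(range(6144)))  # True

# Toy example with a hand-picked 3-element permutation.
print(write_extrinsic([0.0, 0.0, 0.0], [1.0, 2.0, 3.0], [2, 0, 1]))  # [2.0, 3.0, 1.0]
```

Because the mapping is one-to-one, the scattered writes are free of collisions and the extrinsic LLRs can overwrite the a priori LLR storage directly, which is what allows a single dataset to serve both purposes.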
V. RESULTS
In the following sub-sections, we compare the performance of the proposed GPGPU implementation of our SIMD FPTD algorithm with that of the state-of-the-art GPGPU turbo decoder implementation in terms of error correction performance, degree of parallelism, processing throughput and complexity. Both turbo decoders were implemented using single-precision floating-point arithmetic and both were characterized using the Windows 8 64-bit operating system, an Intel i7-2600 CPU operating at 3.4 GHz, 16 GB of RAM and an NVIDIA GTX680 GPGPU. This GPGPU has eight Multiprocessors (MPs) and 192 CUDA cores per MP, with a GPU clock rate of 1.06 GHz and a memory clock rate of 3 GHz. The state-of-the-art benchmarker employs the Log-BCJR algorithm, with PIVI windowing, state-level parallelism and Radix-2 operation [17], [18]. This specific combination was selected, since it offers a high throughput and a low complexity, at a negligible cost in terms of BER degradation. This algorithm was mapped onto the GPGPU according to the approach described in [13]. Furthermore, as recommended in [13] and [18], the longest LTE frames comprising N = 6144 bits were decomposed into P ∈ {192, 128, 96, 64, 32} partitions. This is equivalent to having PIVI window lengths of W = N/P ∈ {32, 48, 64, 96, 192}, respectively.
A. BER PERFORMANCE
Figure 8 compares the BER performance of both the PIVI Log-BCJR turbo decoder and the proposed SIMD FPTD algorithm, when employing the approximate max* operation of (6). Here, BPSK modulation was employed for transmission over an AWGN channel. For both decoders, the BER performance is provided for a relatively short frame length of N = 768 bits, as well as for the longest frame length that is supported by the LTE standard, namely N = 6144 bits. We have not included the BER performance of the two decoders when employing the exact max* operation of (5), but we found that they obey the same trends as Figure 8.
Figure 8 characterizes the BER performance of the PIVI Log-BCJR turbo decoder when employing I = 7 iterations and the window lengths of W ∈ {32, 48, 64, 96, 192}. In addition to this, Figure 8 provides the BER performance of the SIMD FPTD algorithm when performing I ∈ {36, 39, 42, 46, 49} iterations. As may be expected, the BER performance of the PIVI Log-BCJR turbo decoder improves when employing longer window lengths W. Therefore, more iterations I of the SIMD FPTD algorithm are required in order to achieve the same BER performance as the PIVI Log-BCJR turbo decoder, when W is increased. More specifically, Figure 8 shows that when employing N = 6144-bit frames, the SIMD FPTD algorithm requires I ∈ {36, 39, 42, 46, 49} decoding iterations in order to achieve the same BER performance as the PIVI Log-BCJR turbo decoder performing I = 7 iterations with the window lengths of W ∈ {32, 48, 64, 96, 192}, respectively. Note that in all cases, the proposed SIMD FPTD algorithm is capable of achieving the same BER performance as the PIVI Log-BCJR turbo decoder, albeit at the cost of requiring a greater number of decoding iterations I. Note that the necessity for the FPTD to perform several times more iterations than the Log-BCJR turbo decoder was discussed extensively in [22].

B. DEGREE OF PARALLELISM
The degree of parallelism for the PIVI Log-BCJR turbo decoder may be considered to be given by $D^{\text{Log-BCJR}}_p = M \cdot N / W$, where N is the frame length, W is the window length and M = 8 is the number of states in the LTE turbo code trellis. Here, M = 8 threads can be employed for achieving state parallelism, while decoding each of the N/W windows simultaneously. By contrast, the degree of parallelism for the FPTD can be simply defined as $D^{\text{FPTD}}_p = N$, since all algorithmic blocks can be processed in parallel threads and because we do not exploit state parallelism in this case. Table 1 compares the parallelism $D_p$ of the proposed SIMD FPTD with that of the PIVI Log-BCJR turbo decoder, when decomposing N = 6144-bit frames into windows comprising various numbers of bits W. Depending on the window length W chosen for the PIVI Log-BCJR turbo decoder, the degree of parallelism achieved by the proposed SIMD FPTD can be seen to be between 4 and 24 times higher.
TABLE 1. Comparison between the PIVI Log-BCJR turbo decoder and the proposed SIMD FPTD in terms of degree of parallelism, overall latency, pipelined throughput and complexity (IPBPHI and IPB), where N = 6144 for both decoders, I = 7 and W ∈ {32, 48, 64, 96, 192} for the PIVI Log-BCJR turbo decoder, whilst I ∈ {36, 39, 42, 46, 49} for the FPTD. Results are presented using the format x/y, where x corresponds to the case where the approximate max* operation of (6) is employed, while y corresponds to the exact max* operation of (5).
C. PROCESSING LATENCY
Figure 9 compares the processing latency of both the proposed SIMD FPTD and of the PIVI Log-BCJR decoder, when decoding frames comprising N = 6144 bits using both the approximate max* operation of (6) and the exact max* operation of (5). Note that different numbers of iterations I ∈ {36, 39, 42, 46, 49} are used for the SIMD FPTD, while I = 7 iterations and different window lengths W ∈ {32, 48, 64, 96, 192} are employed for the PIVI Log-BCJR turbo decoder, as discussed in Section V-A. Here, the overall latency includes two parts, namely the time used for memory transfer between the CPU and the GPU, as well as the time used for the iterative decoding process. More specifically, the memory transfer includes transferring the channel LLRs from the CPU to the GPU at the beginning of the iterative decoding process and transferring the decoded results from the GPU to the CPU at the end of that process. Therefore, the time used for memory transfer depends only on the frame length N and it is almost independent of the type of decoder and the values of I and W, as shown in Figure 9. Note that the latency was quantified by averaging over the decoding of 5000 frames for each configuration.
FIGURE 9. Latency for the proposed SIMD FPTD with I ∈ {36, 39, 42, 46, 49}, as compared with that of the PIVI Log-BCJR turbo decoder with I = 7 and W ∈ {32, 48, 64, 96, 192}.
By contrast, the time used for the iterative decoding process differs significantly between the proposed SIMD FPTD and the Log-BCJR turbo decoder. More specifically, Table 1 shows that the overall latency of the SIMD FPTD is in the range from 402.5 µs to 513.4 µs, when the number of iterations is increased from I = 36 to I = 49, provided that the approximate max* operation of (6) is employed, hence meeting the sub-1 ms requirement of the LTE physical layer [53]. By contrast, the overall latency of the PIVI Log-BCJR decoder ranges from 816.9 µs to 3694.6 µs, when the window length increases from W = 32 to W = 192, and when I = 7 iterations are performed, assuming that the approximate max* operation of (6) is employed. These extremities of the range are 2 times and 7.2 times worse than those of the proposed SIMD FPTD, respectively. Additionally, when the exact max* operation of (5) is employed, the overall latency of the SIMD FPTD increases by 12.3% and 14.8% for the case of I = 36 and I = 49, compared to those obtained when employing the approximate max* operation of (6).
By contrast, the overall latency increases in this case by 27.5% and 31.1% for the PIVI Log-BCJR decoder associated with W = 32 and W = 192, hence further widening the gap to the latency of the proposed SIMD FPTD.
D. PROCESSING THROUGHPUT
Table 1 presents the processing throughputs that were measured on the GPGPU, when employing the proposed SIMD FPTD and the PIVI Log-BCJR turbo decoder to decode frames comprising N = 6144 bits. Here, throughputs are presented for the case where the approximate max* operation of (6) is employed, as well as for the case of employing the exact max* operation of (5). Note that when the iterative decoding of a particular frame has been completed, its memory transfer from the GPU to CPU can be pipelined with the iterative decoding of the next frame and with the memory transfer from the CPU to GPU of the frame after that. Since Figure 9 shows that the iterative decoding is the slowest of these three processes, it imposes a bottleneck on the overall processing throughput. Owing to this, the throughput presented in Table 1 was obtained by considering only the iterative decoding process, based on the assumption that throughput = N / (latency of iterative decoding).
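The latency ratios of Section V-C and the relation throughput = N / (latency of iterative decoding) can be checked with a minimal Python sketch. All numerical inputs are the values quoted in the text, including the peak throughput of 18.7 Mbps reported in this section. The dictionary layout and function names are hypothetical.

```python
N = 6144  # frame length in bits

# Overall latencies in microseconds for the approximate max*, from Section V-C.
FPTD_US = {36: 402.5, 49: 513.4}        # keyed by the number of iterations I
LOG_BCJR_US = {32: 816.9, 192: 3694.6}  # keyed by the window length W

# Relative latency increase when using the exact max*, also from Section V-C.
FPTD_EXACT = {36: 0.123, 49: 0.148}
LOG_BCJR_EXACT = {32: 0.275, 192: 0.311}


def latency_ratio(w, i, exact=False):
    """How many times longer the PIVI Log-BCJR decoder takes than the SIMD FPTD."""
    bcjr_scale = (1.0 + LOG_BCJR_EXACT[w]) if exact else 1.0
    fptd_scale = (1.0 + FPTD_EXACT[i]) if exact else 1.0
    return (LOG_BCJR_US[w] * bcjr_scale) / (FPTD_US[i] * fptd_scale)


def decode_time_us(n_bits, throughput_mbps):
    """Invert throughput = N / latency; bits per microsecond equals Mbit/s."""
    return n_bits / throughput_mbps


print(round(latency_ratio(32, 36), 1))               # 2.0
print(round(latency_ratio(192, 49), 1))              # 7.2
print(round(latency_ratio(32, 36, exact=True), 1))   # 2.3
print(round(latency_ratio(192, 49, exact=True), 1))  # 8.2
print(round(decode_time_us(N, 18.7), 1))             # 328.6
```

The last line suggests that a pipelined throughput of 18.7 Mbps corresponds to roughly 329 µs of iterative decoding per frame, which is below the overall latencies quoted above because those also include the memory transfers.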
As shown in Table 1, the proposed GPGPU implementation of the SIMD FPTD achieves throughputs of up to 18.7 Mbps. This exceeds the average throughput of 7.6 Mbps, which is typical in 100 MHz LTE uplink channels [54], demonstrating the suitability of the proposed implementation for C-RAN applications. Furthermore, higher throughputs can be achieved either by using a more advanced GPU or by using multiple GPUs in parallel.
Recall from Figure 8 that the proposed SIMD FPTD performing I = 36 iterations achieves the same BER performance as the PIVI Log-BCJR turbo decoder performing I = 7 iterations and having the window length of W = 32. Here, W = 32 corresponds to the maximum degree of parallelism of P = 192 that can be achieved for the PIVI Log-BCJR turbo decoder, without imposing a significant BER performance degradation [19]. In the case of this comparison, Table 1 reveals that the processing throughput of the proposed SIMD FPTD is 2.3 times and 2.5 times higher than that of the PIVI Log-BCJR turbo decoder, when the approximate max* operation and the exact max* operation are employed, respectively. An even higher processing throughput improvement is offered by the proposed SIMD FPTD, when the parallelism of the PIVI Log-BCJR turbo decoder is reduced, for the sake of improving its BER performance. For example, the proposed SIMD FPTD performing I = 49 iterations has a processing throughput that is 8.2 times (approximate max*) and 9.2 times (exact max*) higher than the PIVI Log-BCJR turbo decoder having a window length of W = 192, while offering the same BER performance. Note that owing to its lower computational complexity, the approximate max* operation of (6) facilitates a higher processing throughput than the exact max* operation of (5), in the case of both decoders.
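The complexity difference between the two max* variants can be seen in a minimal Python sketch. It assumes that (5) and (6) take their usual forms, namely the Jacobian logarithm and its max-only approximation; the function names are hypothetical.

```python
import math


def max_star_exact(a, b):
    """Jacobian logarithm: ln(exp(a) + exp(b)), computed in a stable form."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))


def max_star_approx(a, b):
    """Approximation that drops the correction term."""
    return max(a, b)


a, b = 1.0, 2.0
print(max_star_exact(a, b))
print(max_star_approx(a, b))  # 2.0
```

The exact form needs an absolute value, an exponential and a logarithm on top of the comparison, whereas the approximation needs the comparison alone. The dropped correction term is never larger than ln 2, which is reached when the two arguments are equal.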
Furthermore, Table 2 compares the processing throughput of the proposed SIMD FPTD GPGPU implementation to that of other GPGPU implementations of the LTE turbo decoder found in the literature [13], [17], [18], [38], [43]. Here, the throughputs of all implementations are quantified for the case of decoding N = 6144-bit frames, when using the approximate max* operation of (6). Note that the throughputs shown in Table 2 for the benchmarkers employing the PIVI Log-BCJR algorithm have been linearly scaled to the case of performing I = 7 iterations, in order to perform a fair comparison. However, different GPUs are used for the different implementations, which complicates their precise performance comparison. In order to make fairer comparisons, we consider two different methods for normalizing the throughputs of the different implementations, namely $\frac{\text{throughput} \times 10^6}{\text{clock freq.} \times \text{mem. BW}}$ and $\frac{\text{throughput} \times 10^6}{\text{clock freq.} \times \text{mem. freq.}}$. More specifically, the authors of [38] proposed a loosely synchronized parallel turbo decoding algorithm, in which the iterative operation of the partitions is not guaranteed to operate synchronously. In their contribution, the normalized throughput was obtained as $\frac{\text{throughput} \times 10^6}{\text{clock freq.} \times \text{mem. BW}}$, since the GPGPU's global memory bandwidth imposes the main bottleneck upon the corresponding implementation. Similarly, we suggest using the same normalization method for our proposed FPTD, since its performance is also bounded by the global memory bandwidth, according to the experimental results from the Nsight profiler, as discussed in Section IV-A. As shown in Table 2, the benchmarker of [38] achieves a normalized throughput of 100.6, when performing I = 12 iterations for decoding N = 6144-bit frames, divided into P = 768 partitions. However, this approach results in an $E_b/N_0$ degradation of 0.2 dB, compared to that of the conventional Log-BCJR turbo decoding algorithm employing P = 64 partitions and performing I = 6 iterations.
When tolerating this 0.2 dB degradation, our proposed SIMD FPTD algorithm requires only I = 27 iterations, rather than I = 36, as shown in Table 2 . In this case, the normalized throughput of our proposed SIMD FPTD algorithm is 128.8, which is 28% higher than that of the loosely synchronized parallel turbo decoding algorithm of [38] . Furthermore, our approach has the advantage of being able to maintain a constant BER performance, while the loosely synchronized parallel turbo decoding algorithm of [38] suffers from a BER performance that varies from frame to frame, owing to its asynchronous decoding process.
By contrast, using the normalization of $\frac{\text{throughput} \times 10^6}{\text{clock freq.} \times \text{mem. freq.}}$ is more appropriate for the other implementations listed in Table 2, since according to our experimental results, the computational latency and the global memory access latency constitute the main bottlenecks of these implementations of the Log-BCJR algorithm. This may be attributed to the low degree of parallelism of the decoder compared to the capability of the GPGPU, particularly when only a single frame is decoded at a time. Note that the normalized throughputs obtained using the different normalization methods are not comparable to each other. As shown in Table 2, the normalized throughput of 5.4 achieved by our PIVI Log-BCJR benchmarker is significantly better than those of [17], [18], and [43]. Although the benchmarker of [13] achieves a better normalized throughput of 6.6, this is achieved by decoding a batch of 100 frames at a time, which can readily achieve a higher degree of parallelism than decoding only a single frame at a time, like all of the other schemes, as discussed in [18]. Owing to this, the computing latency and memory latency may no longer limit the throughput performance, implying that the normalized throughput for [13] may be inappropriate. Additionally, this throughput can only be achieved when there are 100 frames available for simultaneous decoding, which may not occur frequently in practice, hence resulting in an unfair comparison with the other benchmarkers. Furthermore, while decoding several frames in parallel improves the overall processing throughput, it does not improve the processing latency of each frame.
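The two adjustments applied to the throughputs of Table 2 can be sketched in Python as follows. The units of the clock frequency and of the memory metric are not stated in this section, so the sketch assumes only that the same units are used for every implementation being compared. The function names and the example inputs are hypothetical and are not measurements from the paper.

```python
def scale_to_iterations(throughput, iterations_used, target_iterations=7):
    """Linear scaling: throughput is taken to be inversely proportional to I."""
    return throughput * iterations_used / target_iterations


def normalized_throughput(throughput, clock_freq, mem_metric):
    """throughput x 10^6 / (clock freq. x mem_metric).

    mem_metric is the memory bandwidth for bandwidth-bound decoders, or the
    memory frequency for latency-bound decoders.
    """
    return throughput * 1e6 / (clock_freq * mem_metric)


# Hypothetical inputs, chosen only to exercise the formulas.
print(scale_to_iterations(10.0, 14))              # 20.0
print(normalized_throughput(2.0, 1000.0, 100.0))  # 20.0

# Relative gain implied by the normalized values 128.8 and 100.6 quoted above.
print(round((128.8 / 100.6 - 1) * 100))           # 28
```

Since the two normalizations divide by different quantities, a value obtained with the memory bandwidth cannot be ranked against one obtained with the memory frequency.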
E. COMPLEXITY
The complexity of the proposed GPGPU implementation of our SIMD FPTD algorithm may be compared with that of the PIVI Log-BCJR turbo decoder by considering the number of GPGPU instructions that are issued per bit of the message frame. This is motivated by the fact that all GPGPU thread operations are commanded by instructions. More specifically, while performing one half iteration and one interleaving operation for each turbo decoder, the average number of Instructions Per Warp (IPW) was measured using the NVIDIA analysis tool, Nsight [52]. Using this, the average number of Instructions Per Bit (IPB) may be obtained as $\text{IPB} = 2I \cdot \text{IPBPHI} = 2I \cdot \frac{\text{IPW} \cdot D_p}{32N}$, where IPBPHI is the average number of Instructions Per Bit Per Half Iteration, N is the frame length and $D_p$ is the corresponding degree of parallelism. Here, $D_p/32$ represents the total number of warps, since each warp includes 32 of the $D_p$ threads employed by the decoder. Table 1 quantifies IPBPHI and IPB for both the proposed SIMD FPTD and the PIVI Log-BCJR turbo decoder, when employing both the approximate and exact max* operations of (6) and (5), respectively. As shown in Table 1, the IPBPHI of the proposed SIMD FPTD is around one third that of the PIVI Log-BCJR turbo decoder, when employing the approximate max* operation, although this ratio grows to one half, when employing the exact max* operation. Note however that the proposed SIMD FPTD algorithm requires more decoding iterations than the PIVI Log-BCJR turbo decoder for achieving a particular BER performance, as quantified in Section V-A. Therefore, the overall IPB complexity of the proposed SIMD FPTD is 1.7 to 3.3 times higher than that of the PIVI Log-BCJR turbo decoder, depending on the number of iterations I, window length W and type of max* operation performed, as shown in Table 1.
Note that this trend broadly agrees with that of our previous work [22], which showed that the FPTD algorithm has a complexity that is 2.9 times higher than that of the state-of-the-art LTE turbo decoder employing the Log-BCJR algorithm, which was obtained by comparing the number of additions/subtractions and max* operations employed by the different algorithms. Note that the increased complexity of the FPTD represents the price that must be paid for increasing the decoding throughput by a factor of up to 9.2.
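The relationship between the degree of parallelism of Section V-B and the IPB metric can be sketched in Python as follows. The sketch assumes that the degree of parallelism of the PIVI Log-BCJR decoder is M · N / W, which is consistent with the stated ratios of 4 and 24. The IPW value in the last line is a placeholder rather than a measurement, and the function names are hypothetical.

```python
M = 8      # trellis states, one thread per state in the PIVI Log-BCJR decoder
WARP = 32  # threads per warp


def dp_log_bcjr(n, w):
    """Degree of parallelism with N/W windows and M threads per window."""
    return M * n // w


def dp_fptd(n):
    """One thread per algorithmic block, with no state parallelism."""
    return n


def ipb(iterations, ipw, dp, n):
    """IPB = 2I * IPBPHI, where IPBPHI = IPW * (Dp / 32) / N."""
    ipbphi = ipw * dp / (WARP * n)
    return 2 * iterations * ipbphi


N = 6144
for w in (32, 48, 64, 96, 192):
    print(w, dp_log_bcjr(N, w), dp_fptd(N) / dp_log_bcjr(N, w))

# Placeholder IPW of 100 instructions per warp, for illustration only.
print(ipb(36, 100.0, dp_fptd(N), N))  # 225.0
```

The loop prints ratios of 4, 6, 8, 12 and 24 as W grows from 32 to 192. For the FPTD the degree of parallelism equals N, so its IPBPHI reduces to IPW / 32 and its IPB grows in direct proportion to the number of iterations.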
VI. CONCLUSIONS
In this paper, we have proposed a SIMD FPTD algorithm and demonstrated its implementation on a GPGPU. We have also characterized its performance in terms of BER performance, degree of parallelism, GPGPU processing throughput and complexity. Furthermore, these characteristics have been compared with those of the state-of-the-art PIVI Log-BCJR turbo decoder. This comparison shows that owing to its increased degree of parallelism, the proposed SIMD FPTD offers a processing throughput that is between 2.3 and 9.2 times higher and a processing latency that is between 2 and 8.2 times better than that of the benchmarker. However, this is achieved at the cost of requiring a greater number of iterations than the benchmarker in order to achieve a particular BER performance, which may result in a 1.7 to 3.3 times increase in overall complexity. In our future work we will conceive techniques for disabling particular algorithmic blocks in the FPTD, once they have confidently decoded their corresponding bits. With this approach, we expect to significantly reduce the complexity of the FPTD, such that it approaches that of the Log-BCJR turbo decoder, without compromising the BER performance.
