ABSTRACT Iterative turbo equalization is capable of achieving impressive performance gains over conventional non-iterative equalization having the same complexity, when communicating over channels that suffer from intersymbol interference (ISI). The state-of-the-art turbo equalizers employ the logarithmic Bahl-Cocke-Jelinek-Raviv (Log-BCJR) algorithm. However, due to its inherently serial data processing, the Log-BCJR algorithm introduces significant processing delays at the receiver. Therefore, in low-latency applications having a high throughput, the turbo equalizer might be deemed less attractive than its conventional counterparts. In order to circumvent this problem, in this paper we conceive a novel fully parallel turbo equalization algorithm, which is capable of significantly reducing the data processing delay and, hence, improving both the processing latency and the attainable throughput at the receiver. The fully parallel equalizer is then combined with the fully parallel turbo decoder for improving the system performance achieved in terms of the bit error ratio. Furthermore, we propose a novel odd-even interleaver design for employment between the fully parallel equalizer and the fully parallel turbo decoder in order to reduce complexity by 50% in fully parallel turbo equalization arrangements, while retaining a comparable performance. Finally, we compare the computational complexity, latency, throughput, hardware resource requirements, and the bit error ratio of the proposed fully parallel scheme to those of a Log-BCJR-based turbo equalizer benchmarker.
NOMENCLATURE
ARP: Almost Regular Permutation
$\mathbf{b}$: The multiplexed bit vector at the transmitter
$\mathbf{b}_1$: The message bit vector at the transmitter
$\mathbf{b}_2$: The parity bit vector at the transmitter
The a posteriori systematic LLR vector at the turbo decoder
$\mathbf{c}$: The interleaved bit vector at the transmitter
I. INTRODUCTION
Berrou and his team [1] proposed the first turbo equalization scheme, where the equalizer and the channel decoder exchange their soft-decision based information by performing iterative detection in order to gradually eliminate the channel-induced Inter-Symbol Interference (ISI). Inspired by this contribution, this problem was further investigated by a large number of researchers [2], [3]. As shown in [4] and [5], turbo equalizers offer a substantially improved performance over the family of non-iterative linear equalizers [6], [7]. The closely-related family of turbo codes [8], [9] has been adopted for providing error correction in a number of advanced communication systems, such as the 3rd-Generation Wideband Code Division Multiple Access (3G WCDMA) [10], [11] and the 4th-Generation Long Term Evolution (4G LTE) systems [12]. A turbo detection scheme [13], [14] may comprise a serial concatenation of an equalizer with a turbo decoder, which in turn comprises a parallel concatenation of two component convolutional decoders.
By iteratively exchanging soft information in the form of Logarithmic Likelihood Ratios (LLRs) [8] between the equalizer and the pair of constituent convolutional decoders of the turbo code, the resultant turbo detection scheme is capable of facilitating reliable communications at transmission throughputs that approach the channel capacity [3], [15]. Classic turbo detection schemes typically employ the Logarithmic Bahl-Cocke-Jelinek-Raviv (Log-BCJR) algorithm [16]. This is successively applied to the equalizer and to the two convolutional decoders, until an error-free decoded frame is obtained or until the maximum number of decoding iterations is reached. However, the Log-BCJR algorithm has an inherently serial processing nature, owing to the data dependencies within its forward and backward recursions, as detailed in [3]. This limits both the achievable processing throughput and the latency of conventional turbo detection schemes, which imposes a bottleneck both on the transmission throughput and on the end-to-end latency in real-time communication systems. A number of techniques have been proposed for increasing the degree of parallelism and hence for improving both the processing throughput and latency of Log-BCJR turbo decoders, although these techniques have only found limited application to turbo equalizers. These solutions include shuffled iterative decoding [17], sub-block parallelism [18], [19], the Radix-4 transform [20] and the Non-Sliding Window (NSW) technique [20]. These techniques allow both recursions of both convolutional decoders to be performed simultaneously, as well as allowing the recursions to consider several turbo-encoded bits per time period. However, in each case, the data dependencies of the forward and backward recursions require the turbo-encoded bits of each convolutional decoder to be processed serially, spread over numerous consecutive time periods.
As a result, each turbo decoding iteration requires hundreds or even thousands of processing time periods, hence limiting the attainable processing throughput of the state-of-the-art turbo decoder [20] to 2.15 Gbit/s, which is far below the 10 Gbit/s target of the emerging 5G systems [21].
Against this background, we previously proposed the Fully-Parallel Turbo Decoder (FPTD) algorithm [22] , where all turbo-encoded bits in the frame may be decoded in parallel, allowing each turbo decoder iteration to be completed using just one or two time periods. This offers a more than six-fold processing throughput and latency improvement over the state-of-the-art Log-BCJR turbo decoder, when employed for the LTE turbo code [22] . As a result, the FPTD facilitates both processing throughputs exceeding 10 Gbit/s and ultralow processing latencies, hence satisfying the challenging requirements of 5G for the first time. The milestones of the development of the iterative turbo decoding and turbo equalization are shown in Table 1 .
Building on this work, in this paper we propose a novel Fully-Parallel Turbo Detection Scheme (FPTDS) for high-throughput and low-latency applications. Our novel contributions are detailed as follows:
1) We propose a novel Fully-Parallel Equalizer (FPE), as well as the FPTDS, where the FPE is operated in parallel with the FPTD conceived in [22] and [26].
2) We propose a novel odd-even interleaver for the proposed FPTDS in order to reduce the complexity of the system by 50%, while maintaining a comparable Bit Error Ratio (BER).
3) We quantify the computational complexity, latency, throughput, hardware resource requirements as well as BER of the proposed FPTDS and compare them to those of the conventional Log-BCJR turbo detection benchmarkers.
The outline of the paper is as follows. Section II describes our novel FPTDS, where the novel FPE and the FPTD are operated in parallel. Our novel odd-even interleaver conceived for reducing the computational complexity of the FPTDS is proposed in Section III. Section IV investigates the computational complexity, throughput and hardware resource requirements of the FPTDS and compares them to those of the Log-BCJR benchmarkers. The BER performance of the proposed FPTDS is quantified and compared to the benchmarkers in Section V. Finally, our concluding remarks are offered in Section VI.
II. SYSTEM ARCHITECTURE
The architecture of the proposed FPTDS is shown in Fig. 1. In the turbo encoder [8] [...] $\mathbf{b} = [b_k]_{k=1}^{3N}$ comprising 3N bits, which is then interleaved by the block $\Pi_E$ of Fig. 1 [...]
In the following sections, we will describe the conventional Log-BCJR turbo detection scheme and the novel FPTDS. In each section, we will detail the equalizer, the turbo decoder and the iterative turbo equalization and decoding operations exchanging soft-information between them.
A. CONVENTIONAL LOG-BCJR TURBO DETECTION
Conventional equalization and decoding may rely on $I_I$ equalizer-to-turbo-decoder iterations and $I_O$ turbo-decoder iterations. The equalizer-to-turbo-decoder iterations are carried out between the equalizer and the turbo decoder, while the turbo-decoder iterations are performed between the two component decoders of the turbo decoder. The equalizer-to-turbo-decoder iterations continue until no more errors are detected or until the maximum affordable number of such iterations is reached. The presence of errors may be detected using classic error detection codes, such as Cyclic Redundancy Check (CRC) codes.
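The nested iteration schedule described above can be sketched as a short control loop. This is only an illustrative sketch: `equalize`, `turbo_decode` and `crc_ok` are hypothetical callables standing in for the Log-BCJR equalizer, the Log-BCJR turbo decoder and the CRC check, and their signatures are assumptions rather than anything defined in this paper.

```python
def log_bcjr_detect(received, equalize, turbo_decode, crc_ok,
                    max_outer_iters, inner_iters):
    """Conventional schedule: up to I_I outer iterations, each holding I_O inner ones.

    All three callables are hypothetical placeholders with assumed signatures:
      equalize(received, apriori)      -> extrinsic LLRs for the decoder
      turbo_decode(llrs, inner_iters)  -> (hard_bits, a priori LLRs for the equalizer)
      crc_ok(hard_bits)                -> True when no errors are detected
    """
    apriori = None   # no a priori information before the first pass
    bits = None
    for outer in range(1, max_outer_iters + 1):
        extrinsic = equalize(received, apriori)
        bits, apriori = turbo_decode(extrinsic, inner_iters)
        if crc_ok(bits):             # early stop once the frame passes the check
            return bits, outer
    return bits, max_outer_iters
```

The function returns the hard decisions together with the number of equalizer-to-turbo-decoder iterations actually used, so a caller can observe how often the early stop is triggered.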
1) LOG-BCJR EQUALIZER
The received signal vector $\bar{\mathbf{c}}^c$ of 3N symbols is first equalized by the equalizer, where the Log-BCJR equalization algorithm [4] is employed. As seen in Fig. 1 [...] Fig. 1 to obtain $\bar{\mathbf{b}}$ [...]
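Equations (3) and (4) employ the Jacobian logarithm, which is defined for two operands as [8]
$$\max{}^*(\delta_1, \delta_2) = \max(\delta_1, \delta_2) + \ln\left(1 + e^{-|\delta_1 - \delta_2|}\right) \qquad (7)$$
and may be extended to more operands by exploiting its associative property. Alternatively, the exact max* of (7) may be approximated by [27]
$$\max{}^*(\delta_1, \delta_2) \approx \max(\delta_1, \delta_2). \qquad (8)$$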
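As a small illustration of the Jacobian logarithm and its approximation, the following sketch implements both variants and the multi-operand extension. The function names are hypothetical and chosen only for this example.

```python
import math

def max_star(a, b):
    """Exact Jacobian logarithm: ln(e^a + e^b) = max(a, b) + ln(1 + e^-|a - b|)."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))

def max_star_approx(a, b):
    """Approximation that drops the correction term and keeps only max(a, b)."""
    return max(a, b)

def max_star_many(values, op=max_star):
    """Extend a two-operand max* to several operands via its associative property."""
    result = values[0]
    for value in values[1:]:
        result = op(result, value)
    return result
```

For operands that are far apart the correction term vanishes and the two variants agree closely, while for equal operands the exact version exceeds the approximation by ln 2.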
Thereafter, an a posteriori transition metric $\bar{\delta}^E_k(S_{k-1}, S_k)$ is produced by (12) for each transition between the states $S_{k-1}$ and $S_k$ in the trellis. Finally, (13) is employed to generate the vector of 3N extrinsic LLRs $\bar{\mathbf{c}}^e = [\bar{c}^e_k]_{k=1}^{3N}$, which is forwarded to the channel decoder.
2) LOG-BCJR TURBO DECODER
Again, the classic turbo decoder includes a pair of convolutional component decoders, both of which rely on the Log-BCJR decoding algorithm [8], [22]. As illustrated in Fig. 1 [...] for the equalizer. For convenience, the superscripts u and l are omitted in this section and thereafter, wherever our discussions are equivalent for the upper and lower convolutional decoders. Similar to the equalizer, the decoding operations of the component decoders employ the Log-BCJR algorithm based on (9)-(14). More specifically, each component decoder uses (9) to combine the a priori LLRs $\bar{b}^a_{1,k}$, $\bar{b}^a_{2,k}$ and $\bar{b}^a_{3,k}$ to produce an a priori transition metric $\bar{\gamma}_k(S_{k-1}, S_k)$ for each pair of states $S_{k-1}$ and $S_k$ between which the convolutional encoder can transition, as indicated using the notation $c(S_{k-1}, S_k) = 1$. Here, $b_j(S_{k-1}, S_k)$ is the value that is implied for the bit $b_{j,k}$ by the transition between the states $S_{k-1}$ and $S_k$, according to the state transition diagram [22]. These transition metrics are then combined according to (10)-(11), in order to produce the vector of N extrinsic forward state metric vectors $\bar{\boldsymbol{\alpha}}$ and the vector of N extrinsic backward state metric vectors $\bar{\boldsymbol{\beta}}$, respectively. Like the Log-BCJR equalizer, the forward and backward recursions in the Log-BCJR turbo decoder are also spread over N time periods, hence resulting in slow serial processing.
Thereafter, an a posteriori transition metric $\bar{\delta}(S_{k-1}, S_k)$ is computed by (12) for each transition between the states $S_{k-1}$ and $S_k$ in the trellis, which is then substituted into (13) and (14), respectively. Again, these equations rely on the Jacobian logarithm of (7). Following the final turbo-decoder iteration between the two component decoders, an a posteriori LLR pertaining to the k-th message bit $b^u_{1,k}$ may be obtained as [...]
B. FULLY-PARALLEL TURBO DETECTION
In contrast to the conventional turbo detection scheme discussed in Section II-A, all of the symbols in the received symbol vector $\bar{\mathbf{c}}^c$ may be simultaneously equalized by an FPE and all of the corresponding LLRs may be simultaneously decoded by an FPTD [22] in the FPTDS of Fig. 2, eliminating the requirement for separate equalizer-to-turbo-decoder and turbo-decoder iterations in the system. Instead, in each iteration of the proposed FPTDS, the extrinsic LLR vector of the FPE is passed to the FPTD through the deinterleaver $\Pi_E^{-1}$ and the demultiplexer of Fig. 1, while that of the FPTD is forwarded to the FPE through the multiplexer and the interleaver $\Pi_E$ of Fig. 1 and Fig. 2a.
1) FPE
The FPE comprises 3N algorithmic decoding blocks, as detailed in Fig. 2b. Observe in Fig. 2b that [...]
In each time period, some or all of the 3N algorithmic blocks of the FPE compute their outputs based on (15)-(18) at the same time. More specifically, the algorithmic block having the index k uses (15) to combine the received symbol $\bar{c}^c_k$ and the a priori LLR $\bar{c}^a_k$ gleaned from the channel and the FPTD, respectively, as well as the a priori state metric vectors $\bar{\boldsymbol{\alpha}}^{E,a}_{k-1}$ and $\bar{\boldsymbol{\beta}}^{E,a}_{k}$, in order to produce an a posteriori transition metric $\bar{\delta}^E(S_{k-1}, S_k)$ for each transition in the state transition diagram [22], namely for each pair of states $S_{k-1}$ and $S_k$ between which it is possible for the convolutional encoder to transition, as indicated using the notation $c(S_{k-1}, S_k) = 1$. Here, $c_{k-l}(S_{k-1}, S_k)$ is the value that is implied for the bit $c_{k-l}$ by the transition between the states $S_{k-1}$ and $S_k$. These a posteriori transition metrics are then combined with the aid of (16)-(18), in order to produce the extrinsic forward state metric vector $\bar{\boldsymbol{\alpha}}^{E,e}_{k}$, the extrinsic backward state metric vector $\bar{\boldsymbol{\beta}}^{E,e}_{k-1}$ and the extrinsic LLR $\bar{c}^e_k$, respectively. Again, (16) and (17) employ the Jacobian logarithm of (7).
In contrast to the classic Log-BCJR equalizer, the forward and backward state metrics $\bar{\boldsymbol{\alpha}}^{E,e}_{k}$ and $\bar{\boldsymbol{\beta}}^{E,e}_{k-1}$ at a given time period only depend on the forward and backward state metrics fed back from the previous time period. Therefore, the data dependencies of the forward and backward recursions are broken, allowing fully-parallel operation. Hence, the processing is sped up by a factor of up to 3N, compared to the classic serial Log-BCJR equalizer.
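The difference between the serial and the fully-parallel schedules can be illustrated with a deliberately simplified sketch. It assumes a forward recursion only, the approximate max* (a plain maximum), and fixed transition metrics `gamma[k][s_prev][s_next]`; it is not an implementation of (15)-(18), where the metrics are additionally refreshed by the exchanged extrinsic information. All names are hypothetical.

```python
def step(prev_alpha, gamma_k):
    """One forward update: alpha_k(S) = max over S' of alpha_{k-1}(S') + gamma_k(S', S)."""
    m = len(prev_alpha)
    return [max(prev_alpha[sp] + gamma_k[sp][s] for sp in range(m))
            for s in range(m)]

def serial_forward(gamma, alpha0):
    """Serial recursion: stage k must wait for stage k-1 of the same pass."""
    alpha = [list(alpha0)]
    for gamma_k in gamma:
        alpha.append(step(alpha[-1], gamma_k))
    return alpha

def parallel_forward(gamma, alpha0, periods):
    """Fully-parallel schedule: in every time period all blocks update at once,
    each reading only the metrics stored at the end of the previous period."""
    m = len(alpha0)
    alpha = [list(alpha0)] + [[0.0] * m for _ in gamma]
    for _ in range(periods):
        stored = [row[:] for row in alpha]        # snapshot of the previous period
        for k, gamma_k in enumerate(gamma):       # conceptually simultaneous
            alpha[k + 1] = step(stored[k], gamma_k)
    return alpha
```

In this toy setting the parallel schedule reproduces the serial metrics once the number of time periods reaches the number of trellis stages, while every individual time period costs only a single block update in depth.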
2) FPTD
The FPTD is described and analysed in great detail in [22] and [26]. Briefly, an FPTD includes two convolutional component decoders, each of which has N algorithmic blocks. As illustrated in Fig. 2c, [...] and M is the number of states in the corresponding state transition diagram [22]. [...] Again for convenience, the superscripts u and l are omitted in this section and thereafter, wherever our discussions are equivalent for the upper and lower convolutional decoders.
In contrast to the FPTD of [22] and [26], the FPTD here also outputs the extrinsic encoded LLR vector $\bar{\mathbf{b}}^e_2$.
Simultaneously with the FPE, some or possibly all of the N algorithmic blocks in each component decoder of the FPTD are operated in parallel. Each of these blocks performs the operations of (19)-(23). More specifically, the algorithmic block having the index k uses (19) [...], respectively. Similar to the FPE, the forward and backward state metrics $\bar{\boldsymbol{\alpha}}^{e}_{k}$ and $\bar{\boldsymbol{\beta}}^{e}_{k-1}$ at a given time period only depend on the forward and backward state metrics fed back from the previous time period. Therefore, the data dependencies of the forward and backward recursions are broken, allowing fully-parallel operation. Hence, the processing is sped up by a factor of up to 2N, compared to the classic serial Log-BCJR turbo decoder.
Furthermore, the a posteriori transition metrics are also employed in (22) for computing the uncoded extrinsic LLR $\bar{b}^e_{1,k}$, while the encoded extrinsic LLR $\bar{b}^e_{2,k}$ is obtained using (23). Again, these equations employ the Jacobian logarithm of (7).
III. INTERLEAVER DESIGN FOR THE FPTDS
By employing an odd-even interleaver [28] like that of the LTE turbo code, an odd-even operation of the algorithmic blocks may be employed in the FPTD of [22], hence reducing its complexity by 50%. More explicitly, an odd-even interleaver only connects algorithmic blocks from the upper row having an odd index to blocks from the lower row that also have an odd index. Similarly, blocks from the upper row of the FPTD having an even index are only connected to those from the lower row also having an even index. This arrangement allows the 2N decoding blocks of the FPTD to be grouped into two sets. The first set includes the odd-indexed blocks in the upper row and the even-indexed blocks in the lower row, which are indicated by the light grey shading in Fig. 2c. Meanwhile, the second set comprises the even-indexed blocks in the upper row and the odd-indexed blocks in the lower row, which are highlighted by the dark grey shading in Fig. 2c. Given this arrangement, the FPTD may operate only the first set in odd-indexed time periods and only the second set in even-indexed time periods. This reduces the computational complexity of the FPTD by 50% without increasing the number of time periods required for completing the decoding process [22]. This is because in the odd-even arrangement, operating both sets in all time periods leads to redundancy, which can be eliminated without impairing the attainable performance.
Inspired by this idea, in this section we propose a novel odd-even design of the multiplexer and interleaver $\Pi_E$ of Fig. 1 between the equalizer and the channel decoder. The design is illustrated in Fig. 2a. [...] Thereafter, an odd-even interleaver is employed for connecting the vector $\bar{\mathbf{b}}^e$ of the multiplexer with the vector $\bar{\mathbf{c}}^a$ of the equalizer in the same manner as between the upper and lower decoders of the FPTD [22].
The odd-even connections may employ either random or structured designs, such as the Almost Regular Permutation (ARP) and Quadratic Polynomial Permutation (QPP) interleavers [28] .
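To make the odd-even property concrete, the sketch below builds a QPP interleaver and checks that it preserves index parity, and that every connection of a two-row FPTD then joins blocks from different sets. The parameters (K = 40, f1 = 3, f2 = 10) are used here as an assumed example of an LTE-style QPP parameter pair and should be checked against the LTE specification before being relied upon; the helper names are hypothetical, and indices are 0-based, so the set labels 0 and 1 do not map directly onto the paper's odd and even naming.

```python
def qpp_interleaver(length, f1, f2):
    """QPP permutation: pi(i) = (f1*i + f2*i^2) mod length, with 0-based indices."""
    return [(f1 * i + f2 * i * i) % length for i in range(length)]

def is_odd_even(perm):
    """True when every index is mapped to an index of the same parity."""
    return all(i % 2 == p % 2 for i, p in enumerate(perm))

def block_set(row, index):
    """Set label (0 or 1) of a block; the lower row uses the opposite parity."""
    return (index + (0 if row == "upper" else 1)) % 2

def crosses_sets(perm):
    """True when every connection (interleaver or same-row neighbour) joins
    two blocks that belong to different sets."""
    n = len(perm)
    edges = [(("upper", i), ("lower", perm[i])) for i in range(n)]
    for row in ("upper", "lower"):
        edges += [((row, i), (row, i + 1)) for i in range(n - 1)]
    return all(block_set(*a) != block_set(*b) for a, b in edges)
```

When `crosses_sets` holds, the blocks of one set only ever consume values produced by the other set, which is what allows the two sets to be operated in alternate time periods.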
By contrast, the vector $\bar{\mathbf{c}}^e$ of the equalizer is connected to the vector $\bar{\mathbf{b}}^a$ of the multiplexer using the same ordering of the odd-even interleaver. The vector $\bar{\mathbf{b}}^a$ is further demultiplexed into three LLR vectors. As illustrated in Fig. 2a, the odd blocks of the vector $\bar{\mathbf{c}}$ marked by the light grey colour are connected to the odd blocks of the vector $\bar{\mathbf{b}}$ in the dark grey zones, which are further connected to the dark grey blocks of the vectors $\bar{\mathbf{b}}^u_3$, $\bar{\mathbf{b}}^u_2$ and $\bar{\mathbf{b}}^l_2$. Meanwhile, the even blocks of the vector $\bar{\mathbf{c}}$ in the dark grey zones are connected to the even blocks of the vector $\bar{\mathbf{b}}$ in the light grey zones, which are further connected to the light grey zones of the vectors $\bar{\mathbf{b}}^u_3$, $\bar{\mathbf{b}}^u_2$ and $\bar{\mathbf{b}}^l_2$. Consequently, the FPTDS is divided into a pair of sets: the dark grey set and the light grey set. In this way, the iterative exchange of extrinsic information within the FPTDS can instead be thought of as an iterative exchange of extrinsic information between the two sets. When fully parallel equalization and decoding is employed with all blocks active, the operation of the FPTDS relying on the odd-even interleaver corresponds to two independent processes, which have no influence on each other. Therefore, one of the two iterative processes is redundant. This redundancy can be eliminated by activating the algorithmic blocks of only one set in each time period, with consecutive time periods alternating between the two sets. By doing this, each detection iteration is spread over T = 2 time periods. However, in order to achieve the same BER performance, the number of iterations required can be halved. Therefore, compared to the FPTDS where all blocks are activated in T = 1 time period, the FPTDS associated with the odd-even interleaver is capable of reducing the complexity by 50%, while retaining the same processing throughput.
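The 50% saving can be checked with simple counting. The sketch below assumes that complexity is proportional to the number of algorithmic-block activations and that the two sets have equal size (N even); the function name is hypothetical.

```python
def block_activations(n, time_periods, odd_even):
    """Total algorithmic-block activations over a number of time periods.

    The FPTDS has 3N FPE blocks plus 2N FPTD blocks. With the odd-even
    arrangement only one of the two sets, i.e. half of the blocks, is
    active in each time period (assumes N is even).
    """
    blocks = 3 * n + 2 * n
    active_per_period = blocks // 2 if odd_even else blocks
    return active_per_period * time_periods
```

For the same number of time periods, for example `block_activations(1024, 64, True)` against `block_activations(1024, 64, False)`, the odd-even arrangement performs half as many activations.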
IV. SYSTEM CHARACTERISTICS
In [22] , the characteristics of the FPTD, of the Log-BCJR turbo decoder as well as of the state-of-the-art turbo decoder [20] were compared in the context of the LTE and WiMAX turbo codes. However, the NSW, radix-4 and pipelining techniques of the state-of-the-art turbo decoder have not been proposed and investigated for the equalizer. Therefore, in this section, we will compare the characteristics of the FPTDS and of the classic Log-BCJR detection scheme described in Section II. These characteristics include the computational complexity, the throughput and latency, as well as the hardware resource requirements of the iterative equalization and decoding operation.
In the FPTDS employing an odd-even interleaver, each iteration requires two time periods, as described in Section III. However, it is not straightforward to define the iterations of the Log-BCJR turbo detection scheme, since it contains $I_I$ equalizer-to-turbo-decoder iterations and each equalizer-to-turbo-decoder iteration comprises a further $I_O$ turbo-decoder iterations. For convenience, the number of iterations of the classic Log-BCJR detection system is taken to be its number of equalizer-to-turbo-decoder iterations $I_I$. More specifically, each iteration of the Log-BCJR system contains one equalization and $I_O$ turbo decoding operations.
The characteristics of both the FPTDS and of the Log-BCJR system are summarized in Table 3. Note that the FPTDS of Table 2 is assumed to employ the odd-even interleaver of Section III.
A. COMPUTATIONAL COMPLEXITY
The computational complexity of each trellis stage of the conventional Log-BCJR scheme and of each algorithmic block (which processes one trellis stage) of the FPTDS is quantified in Table 2. The computational complexity is quantified in terms of the number of addition, subtraction and max* evaluation operations. The number of operations of the Log-BCJR equalizer is based on evaluating (2)-(6), while that of the FPE is based on (15)-(18).
In the Log-BCJR equalizer, (2) [...] Since (15) is identical to (6), they have the same complexity.
Likewise, the number of operations of the Log-BCJR turbo decoder is based on (9)-(14), while that of the FPTD is based on (19)-(23). Note that (9) and (19) require one additional addition for adding a systematic LLR $\bar{b}^a_{3,k}$, while (13) and (22) require one additional subtraction for removing a systematic LLR $\bar{b}^a_{3,k}$. As shown in [29], the complexity of the approximate max* operation of (7) equals that of an addition. Therefore, in Table 2 the overall complexities of the classic Log-BCJR equalizer and of the FPE are denoted by $C_{BE}$ and $C_{FE}$, respectively. Similarly, the overall complexities of the Log-BCJR turbo decoder and of the FPTD are denoted by $C_{BD}$ and $C_{FD}$ in Table 2.
Furthermore, the complexity $C_F$ of each iteration of the FPTDS is equal to the sum of the complexities of the FPTD and the FPE ($C_F = C_{FE} + C_{FD}$). By contrast, the complexity of each iteration of the classic Log-BCJR scheme equals that of the Log-BCJR equalizer plus $I_O$ times the complexity of the Log-BCJR turbo decoder ($C_B = C_{BE} + I_O \cdot C_{BD}$). Clearly, the complexity of the Log-BCJR turbo decoder in each iteration depends on the number of turbo-decoder iterations $I_O$ that is configured. Therefore, a carefully considered configuration of the turbo detection is required in order to have a fair comparison between the classic Log-BCJR turbo detection and the FPTDS, which will be detailed in Section V.
B. TIME PERIODS PER DECODING
As described in Section III, an FPTDS using an odd-even interleaver requires two time periods for the dark grey and light grey groups to complete one iteration. By contrast, each component decoder of the Log-BCJR turbo decoder requires N time periods for the computation of the forward recursion and N time periods for the backward recursion. Therefore, the Log-BCJR turbo decoder requires 4N time periods per turbo-decoder iteration. Similar to each component of the Log-BCJR turbo decoder, the Log-BCJR equalizer requires 3N time periods for each of its forward and backward recursions, giving 6N time periods in total. With $I_O$ turbo-decoder iterations of the Log-BCJR turbo decoder, each iteration of the Log-BCJR turbo detection scheme requires a total of $T = 4N \cdot I_O + 6N = (2I_O + 3) \cdot 2N$ time periods. Hence, in order to complete one detection iteration, the classic Log-BCJR turbo detection scheme needs $(2I_O + 3)N$ times as many time periods as the FPTDS.
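As a quick numerical check of these counts, the following sketch evaluates them for N = 1024 and $I_O$ = 8, the values used later in Section V. The function names are hypothetical, and the three trellis termination bits are ignored for simplicity.

```python
def log_bcjr_periods(n, inner_iters):
    """Time periods per Log-BCJR detection iteration: T = 4N * I_O + 6N."""
    return 4 * n * inner_iters + 6 * n

def fptds_periods():
    """Time periods per FPTDS iteration with the odd-even interleaver."""
    return 2

if __name__ == "__main__":
    n, inner_iters = 1024, 8
    t_b = log_bcjr_periods(n, inner_iters)
    print(t_b, t_b // fptds_periods())
```

With these values the Log-BCJR scheme needs 38912 time periods per iteration, which is $(2 \cdot 8 + 3) \cdot 1024 = 19456$ times the two time periods of the FPTDS.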
C. TIME PERIOD DURATION
The time period duration is defined here as the longest time taken by an algorithmic block to complete all of its computations. It depends on the dependencies between the additions, subtractions and max* operations and it is quantified by the length of the critical path, namely the path containing the most operations. In practical hardware implementations, this dictates the highest clock frequency that can be used. For the FPTDS, the time period duration is the longer of the duration in which one block of the FPTD completes (19)-(23) and the duration in which one block of the FPE completes (15)-(18). As analysed in [22], the computation of (19)-(23) has a critical path comprising five additions plus $\log_2(M)$ max* evaluation operations, where M is the number of states in the turbo code trellis. As described in [29], the times required to compute an addition and the approximation of the max* are equal, giving a time period duration of $D_{FD} = 5 + \log_2(M)$ operations for the FPTD. Similarly, it may be inferred from (15)-(18) that each algorithmic block of the FPE has a critical path comprising five additions and $\log_2(L)$ max* operations, where L is the number of states in the equalizer trellis. Therefore, the time period duration of the FPE is $D_{FE} = 5 + \log_2(L)$ operations.
Meanwhile, the time period duration of the Log-BCJR system is the longer of the duration in which one trellis stage of the Log-BCJR decoder completes (9)-(14) and the duration in which one trellis stage of the Log-BCJR equalizer completes (2)-(6). In contrast to the FPTD [22], the Log-BCJR turbo decoder requires one max* evaluation of (10) and (11) to be completed before (12)-(14). As a result, the time period duration of the Log-BCJR turbo decoder is $D_{BD} = 6 + \log_2(M)$ operations. Similarly, the time period duration of the Log-BCJR equalizer is $D_{BE} = 6 + \log_2(L)$ operations. Hence, the duration $D_B$ of the Log-BCJR scheme is the longer of the pair of durations $D_{BD}$ and $D_{BE}$.
Again, all of the time period durations of the FPTDS and of the classic Log-BCJR turbo detection scheme are provided in Table 3. It is noted that the time period duration of the FPTDS given by $D_F = 5 + \log_2[\max(M, L)]$ is lower than that of the conventional Log-BCJR turbo detection scheme formulated as $D_B = 6 + \log_2[\max(M, L)]$, albeit only by the time of one operation.
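The two durations can be evaluated directly from the expressions for $D_F$ and $D_B$. In the sketch below M = 8 follows the LTE turbo code, whereas L = 4 is an assumption made for illustration (a 3-tap channel with BPSK having $2^{3-1}$ trellis states); the function names are hypothetical.

```python
import math

def fptds_duration(m_states, l_states):
    """D_F = 5 + log2(max(M, L)), in units of one addition time."""
    return 5 + math.log2(max(m_states, l_states))

def log_bcjr_duration(m_states, l_states):
    """D_B = 6 + log2(max(M, L)), in units of one addition time."""
    return 6 + math.log2(max(m_states, l_states))

if __name__ == "__main__":
    # M = 8 from the LTE turbo code; L = 4 is an assumed equalizer trellis size.
    print(fptds_duration(8, 4), log_bcjr_duration(8, 4))
```

Under these assumptions the FPTDS time period lasts 8 operations and the Log-BCJR time period lasts 9, so the per-period gain is small and the overall improvement stems mainly from the reduced number of time periods.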
D. THROUGHPUT AND LATENCY
The detection latency is defined as the time that a turbo detection scheme requires to complete its iterative equalization and decoding operations. Hence, it is given by the product of the time period duration D, the number of time periods T per decoding iteration and the required number of decoding iterations I, where the latter will be determined in Section V. The latency and throughput of the schemes are detailed in Table 3.
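The latency product can be written out as a small helper. The sketch below measures latency in units of one operation time; the function names are hypothetical, and any iteration counts passed in are placeholders chosen by the caller rather than the values reported in the tables of this paper.

```python
def latency(duration, periods_per_iteration, iterations):
    """Latency in units of one operation time: D * T * I."""
    return duration * periods_per_iteration * iterations

def throughput(frame_bits, latency_ops):
    """Decoded message bits per operation time for one frame."""
    return frame_bits / latency_ops

if __name__ == "__main__":
    # Placeholder configuration: D = 9, T = 38912, I = 2 for a serial scheme.
    print(latency(9, 38912, 2))
```

Because the throughput is the frame length divided by the latency, any factor gained in latency for a fixed frame length translates into the same factor in throughput.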
E. RESOURCE REQUIREMENTS
In practical hardware implementations, the chip area or hardware resource requirement depends both on the computational requirement X as well as on the memory requirement, which can be separated into the register and Random Access Memory (RAM) resources. The register resource requirement Y quantifies the amount of memory that is arranged into registers, which store values that can be accessed all at once, in every time period. By contrast, the RAM resource requirement Z quantifies the amount of storage that is arranged into RAM, which stores different values that can be accessed in different time periods.
As analysed in [22], the FPTDS having an odd-even interleaver can share hardware in alternate time periods. Thus, the computational resource required by the FPTD having an odd-even interleaver equals half of the complexity plus N additional resources for adding the systematic a priori LLRs $\bar{\mathbf{b}}^a_3$, hence resulting in a total computational resource of $X_{FD} = C_{FD}/2 + N$. Since the FPE does not require the addition of the systematic a priori information, its computational resource is reduced to $X_{FE} = C_{FE}/2$. The total computational resource required by the FPTDS is given by the sum of those of the FPE and the FPTD, namely by $X_F = C_F/2 + N = C_{FD}/2 + C_{FE}/2 + N$. By contrast, the Log-BCJR system is capable of reusing the same hardware for processing successive trellis stages in successive time periods, both within the equalizer and within the decoder, as well as between the equalizer and the decoder. Therefore, the computational resource required by the classic Log-BCJR system is reduced to the higher of the resources required by the conventional Log-BCJR equalizer and the Log-BCJR decoder, which is formulated as $X_B = \max(C_{BD}/2N + 1,\ C_{BE}/3N)$.
In the FPTD, memory resources are required for storing the forward state metrics, backward state metrics and the extrinsic LLRs of (20), (21), (22) and (23), respectively. These outputs are produced whenever an algorithmic block is operated and they must be stored for the next time period, where they are employed by the connected blocks. However, by using the odd-even interleaver described in Section III, only half of the blocks are operated in each time period, while the other half remain idle. This allows the memory resources to be shared and physically positioned between the two groups of algorithmic blocks. Therefore, the memory resources have to store MN forward state metrics, MN backward state metrics and 4N extrinsic LLRs, resulting in a total requirement of $Y_{FD} = (2M + 4)N$ memory resources. Similarly, the FPE requires $Y_{FE} = (2L + 3)3N/2$ memory resources. Finally, the total memory resources required by the FPTDS may be expressed as $Y_F = Y_{FD} + Y_{FE} = (2M + 4)N + (2L + 3)3N/2$.
As quantified in [22], the classic Log-BCJR decoder only requires M memory resources due to the reuse of the same hardware for processing successive trellis stages in successive time periods. Additionally, it requires $(3M + 4)N$ RAM resources for storing the state metrics and the extrinsic LLRs. Similarly, the Log-BCJR equalizer requires L memory resources and $(3L + 3)3N$ RAM resources. Consequently, the memory resources required by the Log-BCJR system obey $Y_B = \max(M, L)$, while the RAM requirement is [...]
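The memory expressions stated above can be evaluated for a concrete configuration. The sketch below uses M = 8 and N = 1024 from Section V, while L = 4 is an assumption made for illustration (a 3-tap channel with BPSK). The function names are hypothetical, and the RAM figures of the decoder and equalizer are returned separately, since their combination is not restated here.

```python
def fptds_registers(m_states, l_states, n):
    """Register requirements of the FPTD and the FPE with the odd-even interleaver."""
    y_fd = (2 * m_states + 4) * n             # Y_FD = (2M + 4)N
    y_fe = (2 * l_states + 3) * 3 * n // 2    # Y_FE = (2L + 3)3N/2, N assumed even
    return y_fd, y_fe

def log_bcjr_memory(m_states, l_states, n):
    """Registers and per-component RAM of the Log-BCJR system."""
    registers = max(m_states, l_states)       # Y_B = max(M, L)
    ram_decoder = (3 * m_states + 4) * n      # (3M + 4)N
    ram_equalizer = (3 * l_states + 3) * 3 * n  # (3L + 3)3N
    return registers, ram_decoder, ram_equalizer

if __name__ == "__main__":
    print(fptds_registers(8, 4, 1024))
    print(log_bcjr_memory(8, 4, 1024))
```

The comparison shows the trade-off qualitatively: the fully-parallel scheme keeps every state metric in registers, whereas the serial scheme holds only one trellis stage in registers and relies on RAM for the rest.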
The final resource requirements depend on the specific system configuration, namely on the number of states L and M in the trellis as well as on the frame length N . Therefore, our comparison between the classic Log-BCJR turbo detection and the FPTDS will be detailed in Section V, where the specific system configurations will be defined. 
V. PERFORMANCE STUDY
In the simulations of this section, we employ an M = 8-state LTE turbo code [12] having a coding rate of 1/3 and a frame length of N = 1024 bits, relying on a 3-bit trellis termination, as described in [22]. Furthermore, BPSK modulation is used. The channel imposes 3-tap multipath fading plus AWGN. The fading between transmission frames is assumed to be independent.
Fig. 3 shows the performance of the systems, where both the Log-BCJR and the fully-parallel algorithms are characterized. The classic Log-BCJR system is used both for iterative equalization and decoding. In each of the $I_{BCJR}$ equalizer-to-turbo-decoder iterations, the Log-BCJR system performs Log-BCJR equalization followed by $I_O = 8$ iterations of Log-BCJR turbo decoding. By contrast, the FPE system carries out fully-parallel equalization and decoding simultaneously. In Fig. 3, the performance of the BCJR system is represented by the dashed curves, while that of the FPE system is shown by the continuous ones. Observe that the FPE system exhibits a high BER when the number of iterations is below 16. By contrast, when the number of iterations is increased to 32 and 64, the FPE system achieves a comparable performance to that of the Log-BCJR system employing $I_{BCJR} = 2$ equalizer-to-turbo-decoder iterations.
Fig. 4 shows the performance of the FPE systems, where the fully-parallel and odd-even mechanisms are employed. Time period counts of T = {1, 2, 4, 8, 16, 32, 64} are characterized in this figure. Recall that the fully-parallel arrangement employs one time period for each equalization and decoding iteration, while the odd-even arrangement employs two time periods for each equalization and decoding iteration. In Fig. 4, the performance of the fully-parallel system is shown by the square-marked curves, while that of the odd-even system is represented by the diamond-marked curves. The results of Fig. 4 show that the performance of both systems is comparable, regardless of the number of time periods observed. Hence, the FPTDS employing an odd-even interleaver achieves the same performance in conjunction with the same number of time periods, while reducing the complexity by 50% compared to the FPTDS using fully-parallel detection.
Table 4 summarizes the characteristics of both the Log-BCJR turbo detection and of the FPTDS when communicating over a 3-tap multipath fading channel. The complexity of both schemes is quantified in the operating region, where a BER below $10^{-6}$ is achieved. The results show that upon aiming for such a low BER, the FPTDS is capable of improving the latency and throughput by a factor of 600 over the conventional Log-BCJR scheme, which is achieved at the cost of increasing the computational complexity by a factor of 3.5 as well as the computational and memory resource requirements by a factor of 26.
VI. CONCLUSIONS
In this paper we proposed a novel FPTDS, where all the algorithmic decoding blocks of both the equalizer and of the turbo decoders are operated in parallel. The odd-even interleaver between the equalizer and the channel decoder was designed for reducing the computational complexity. Our simulations demonstrated that when the LTE turbo code is employed for communication over a 3-tap fading channel, at the same near-error-free performance the FPTDS increases the complexity by a factor of 3.5 and the hardware resources by a factor of 26, while improving the processing latency and throughput by a factor of 600, making it an attractive candidate for high-throughput and low-latency applications. In our future research, the hardware implementation will be considered and the scope for potential complexity reduction will be further investigated. Finally, comparison with benchmarkers employing radix-4, Non-Sliding Window and pipelining techniques will also be studied.
