The effect of parallelism on the Bit Error Rate (BER) performance of Turbo Codes (TC) and Self-Concatenated Convolutional Codes (SECCC) is investigated for different levels of parallelism and frame sizes. The Next Iteration Initialization (NII) method is employed to mitigate the BER degradation resulting from increased parallelism. In order to analyze and compare the architectural performance of both schemes, this paper presents the Very High Speed Integrated Circuit Hardware Description Language (VHDL) design of a Maximum A Posteriori Probability (MAP) decoder for TC and SECCC, both employing the same constituent code. The simulation results show that at a BER of $10^{-4}$ without parallelism, TC is 0.4 dB superior to SECCC, whereas with a parallelism of 64 the difference in performance between the two schemes reduces to 0.25 dB. It is found that SECCC outperforms TC for frame sizes less than or equal to 2048 bits when invoking a parallelism of 16, 32 and 64. SECCC also outperforms TC by 0.3 dB at a BER of $10^{-4}$ at a parallelism of 256. Hence, for high-throughput architectures employing higher parallelism (beyond 64 and 128) without significant degradation in BER performance, SECCC performs better than TC. The synthesis results of the VHDL design of the MAP decoder, obtained using Xilinx ISE, verify that both schemes have equal clock frequency and resource consumption. It is demonstrated that the MAP decoder achieves a clock frequency of 86.3 MHz, which is capable of producing a throughput of 691 Mbps using a parallelism of 64.
I. INTRODUCTION
Turbo Codes (TCs) were introduced in 1993 by Berrou [1]. They are Parallel Concatenated Convolutional Codes (PCCC) [2], which belong to the family of Forward Error Correcting (FEC) codes. They are able to operate near Shannon's capacity limit [3] and hence are employed in a variety of communication standards such as 3rd Generation Partnership Project Long Term Evolution (3GPP LTE), IEEE Standard P802.16, also known as Worldwide interoperability for Microwave Access (WiMAX), Global System for Mobile communications (GSM), Universal Mobile Telecommunication System (UMTS), and Digital Video Broadcasting-Satellite Services to Handheld (DVB-SH) [4].
The associate editor coordinating the review of this manuscript and approving it for publication was Jafar A. Alzubi.

To achieve near-capacity performance, complex decoding algorithms, e.g., the Bahl-Cocke-Jelinek-Raviv (BCJR) algorithm, were adopted [5]. The algorithmic complexity and the iterative nature of the turbo decoder pose a great challenge to hardware designers in achieving their design goals, e.g., high throughput, minimum latency, low complexity, low Bit Error Rate (BER) and reduced power and energy consumption. However, there is always a trade-off between these design goals. Hence, in a real-time communication system, if a suitable trade-off between these parameters is not found, the decoder will exhibit undesirable performance.
Future communication standards demand data rates of several Gbps [6], for which multiple channel decoders will be operated in parallel in order to achieve high throughput targets. In addition, multiple code blocks in each transport block will be processed in parallel to meet these demands. The high throughput achieved by utilising parallel decoder architectures depends on several factors, i.e., the clock frequency, the number of iterations, the number of decoding units based on the Maximum A Posteriori Probability (MAP) algorithm and the technology used. Additionally, throughput increases linearly with the number of MAP decoding units, at the cost of increased resource utilization and chip area [7]. Several researchers have exploited parallelism to achieve high throughput, e.g., [8]-[11], while others have focused on resource optimization [12], [13] and power reduction [14], [15] rather than throughput. A 3GPP-LTE advanced turbo decoder presented in [16] uses a state-metric-initialization technique to reduce the latency of the SISO decoder for achieving high throughput. A fully parallel decoding algorithm for TC was proposed in [17] as a novel alternative to the Log-BCJR algorithm. This algorithm is compatible with all TCs and tends to increase throughput and reduce latency. Based on this algorithm, the implementation of a fully parallel turbo decoder was presented in [18]. High performance iterative algorithms have also been developed in [19] and [20] for Multiple Input Multiple Output Orthogonal Frequency Division Multiplexing (MIMO-OFDM) systems and 5G receivers, respectively. Recently, an arbitrary turbo decoder was presented in [21] to achieve higher processing throughput and low latency; it uses rescheduling to avoid contention and enable parallelism of 128 and higher.

VOLUME 7, 2019. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/
A turbo decoder achieving a throughput of 100 Gbps for higher code rates is presented in [22]. A parallel turbo decoder architecture, which covers the full range of code rates and provides higher throughput gains and better hardware efficiency, is presented in [23]. On the other hand, the BER performance of an error correction code is affected by higher parallelism. This degradation is an important consideration for design implementation in high throughput scenarios. At higher parallelism levels, the input data block is subdivided into smaller sub-blocks, each processed by one of the parallel MAP decoders. As a result, the sub-trellises become very short, which degrades the BER performance. This BER degradation can be mitigated by performing more iterations. Alternatively, in order to reduce the performance loss with higher parallelism, two well-known techniques are adopted for parallel decoder design. One is based on Acquisition (ACQ) and the other is the Next Iteration Initialization (NII) technique [24]. In ACQ, the state metrics are initialized at the window boundaries [25], whereas the NII method implicitly initializes the state metrics over several decoding iterations. The NII method is generally preferred because it is less sensitive to high code rates. However, in [26], the strengths of both methods are combined to obtain a high throughput and hardware efficient turbo decoder architecture. In this work, the aim is to observe the performance of SECCC with short frame sizes and parallelism and compare it with TC. It is found that SECCC performs better than TC for short frame sizes and higher parallelism.
The better performance of SECCC for short frame lengths and higher parallelism follows from the fact that the single trellis of SECCC is longer than each of the two trellises of TC. This work demonstrates the importance of code design and implementation for systems demanding high throughput. Besides BER performance, it is equally important to consider the architectural performance of both schemes. Hence, for the sake of enabling a complete comparison, the VHDL design of the MAP decoder is also presented in this paper. The design is configured and synthesized for TC and SECCC to obtain the resource utilization and throughput of both schemes. Moreover, since parallelism is important for producing high throughput, the architectural performance of the parallel SECCC and TC decoders is examined alongside their BER performance with parallelism. The architecture presented in this paper can be further optimized in future for specific applications.
SECCC belongs to the PCCC family and, like some irregular TCs [27], it is constructed from a single constituent code and employs a single MAP decoder [28]. The schematic diagrams of the encoding and decoding processes of the SECCC scheme are shown in Fig. 3. Unlike the TC, the single component decoder in the SECCC scheme exchanges extrinsic information iteratively with itself to achieve a desired performance. The SECCC scheme employing BPSK modulation was presented in [29], [30]. The SECCC scheme was further investigated for non-binary higher-order modulation in [31] and [32] to achieve bandwidth efficiency, while iterative decoding was invoked to achieve power efficiency. The SECCC was further analyzed in [33] for its applications in power efficient cooperative communication schemes. However, a thorough literature survey showed that the implementation of SECCC and its BER performance characteristics with parallelism have not yet been reported. Moreover, a performance comparison between TC and SECCC employing the same convolutional code has not been carried out in the literature. The novel findings from the results obtained through Matlab simulations and FPGA synthesis of the MAP decoder developed in VHDL for TC and SECCC are summarised below:
• We present the BER characteristics of a parallel SECCC decoder, which are reported for the first time in the literature.
• On the basis of this parallel SECCC decoding scheme, we analyze the performance of a larger frame (6144 bits) as well as smaller frames (2048 and 512 bits) with different levels of parallelism.
• Since the SECCC scheme with parallelism does not already exist in the literature, for comparison we develop the TC scheme with the same RSC code and code rate (as used for the SECCC scheme) and show that the SECCC scheme has a significant improvement in BER performance compared to the TC scheme for shorter frames (≤ 2048 bits) with higher parallelism (≥ 16).
• We also analyze the performance of the larger frame of 6144 bits at a higher parallelism of 256, where SECCC shows a significant improvement in BER performance.
• For the sake of enabling a complete comparison of both schemes, we also introduce the VHDL design of the MAP decoder and configure it for SECCC and TC. Additionally, through synthesis of this configurable MAP decoder, it is shown that the TC and SECCC schemes exhibit the same architectural performance.
The paper is organized as follows. The operating structures for encoding and decoding of the TC and SECCC schemes are presented in Sections II-A and II-B, respectively. Section II-C elaborates the mathematical model of the Max-log-MAP algorithm. The design of the MAP decoder and its parallel architecture are presented in Sections III-A and III-B, respectively. Section IV-A presents the EXtrinsic Information Transfer (EXIT) chart for the SECCC scheme and the expressions for its union bound analysis, whereas the simulation results of TC and SECCC with different frame sizes and parallelism are discussed in Section IV-B. The synthesis results are provided in Section IV-C. Finally, the conclusions and future work are presented in Section V.
II. ENCODING AND DECODING OF TC AND SECCC
The process of encoding is presented in Section II-A, where the construction of TC and SECCC is discussed employing the same generator polynomial and code rate. The decoding process is explained in Section II-B and the Max-Log-MAP algorithm which has been used to implement the decoder is presented in Section II-C.
A. ENCODING
The construction of the TC and SECCC encoders is discussed in this section. Fig. 1(a) shows a turbo encoder, constructed by the parallel concatenation of at least two identical RSC codes connected through an interleaver [34]. The interleaver counteracts the effects of burst errors and thus enhances the error correcting capability of the FEC code. The interleaver scrambles or re-arranges the encoded symbols over multiple code blocks with no repetition and spreads out long noise burst sequences. The scrambled information is provided to the second component decoder, which facilitates the uncorrelated information exchange between the two component decoders. Interleavers may be periodic or pseudo-random. Periodic interleavers are classified into block and convolutional interleavers. Block interleavers perform well for short unpunctured codes, but for long codes random interleavers perform better for both punctured and unpunctured codes [35]. Since we use puncturing in our example component codes, we employ a pseudo-random interleaver. The more "scrambled" the interleaver is, the more "uncorrelated" the information exchange is. The main role of the interleaver is to eliminate low weight input patterns, which contribute significantly to the error probability. The Code Matched Interleaver (CMI) [36] is an optimum interleaver, which breaks several low weight input sequences depending on the component codes. The CMI can effectively eliminate several spectral lines of the original distance spectrum and increase the overall Hamming distance of the code. Consequently, the code performance at high Signal to Noise Ratio (SNR) is improved and the error floor is lowered [37]. However, this paper does not focus on interleaver design, and we employ a pseudo-random interleaver, which gives an average performance.
Recursive Systematic Convolutional (RSC) codes are mostly used as the constituent component codes. At any time $k$, the input to the encoder is a bit $u_k$, which is converted to the corresponding code bit $c_k$ based on the generator polynomials. The structure and function of the encoder are determined by the generator polynomials and the constraint length, which affect the distance properties and the error correcting capability of a convolutional code. Hence, the combination of generator polynomials should be chosen to maximize the minimum free distance of the code in order to achieve good error correction performance [38], [39]. In this paper, an RSC component encoder having constraint length $K = 4$, memory $\nu = 3$, with $(13)_8 \rightarrow (1011)_2$ as the feedback generator polynomial and $(15, 17)_8 \rightarrow (1101, 1111)_2$ as the feedforward generator polynomials is considered as an example, as shown in Fig. 2(a). However, an RSC code with $\nu = 4$ is also a good option to obtain better performance [40]. The state-transition diagram and 8-state trellis with a minimum free distance for the RSC component encoder are shown in Fig. 2(b) and Fig. 2(c), respectively. As given in [41], the codeword can be generated from:
where $g_{li}$ represents the $i$-th bit of the binary representation of the $l$-th feedforward generator polynomial, $K$ is the constraint length and $u_k$ is the information bit at time $k$, which may be 0 or 1. In some examples, the encoded bits are punctured at rate $R_2 = 1/2$ to produce a rate $R = \frac{R_1}{2 \times R_2} = \frac{1}{3}$ [42]. The coding rate of the RSC encoder is $R_1 = 1/3$; hence the coding rates of both the TC and SECCC encoders are $1/6$ before puncturing. To achieve a final coding rate of $R = 1/3$, a puncturing rate of $R_2 = 1/2$ was invoked, where half of the coded bits are punctured. The SECCC encoder is shown in Fig. 3(a); we used the same code rate and puncturing for SECCC. The information bits $u_k$ and their interleaved version are converted to a serial stream, which is then fed to the RSC encoder. The encoded bits are then punctured at rate $1/2$ to produce a code rate of $1/3$.
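The encoding steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' Matlab or VHDL code. It assumes the octal generators are read most-significant-bit first (the leftmost tap acts on the current register input), that the parallel-to-serial conversion alternates $u_k$ with its interleaved counterpart, and that puncturing simply keeps every second coded bit; the function names `rsc_encode` and `seccc_encode` and the puncturing pattern are hypothetical, since the paper's actual pattern is not given in this section.

```python
def rsc_encode(bits, fb=0o13, ff=(0o15, 0o17), K=4):
    """Rate-1/3 RSC encoder sketch: returns (systematic, parity1, parity2) per input bit."""
    def taps(g):
        # MSB-first tap list: taps[0] acts on the register input, taps[K-1] on the oldest cell
        return [(g >> (K - 1 - i)) & 1 for i in range(K)]

    fb_t = taps(fb)
    ff_t = [taps(g) for g in ff]
    state = [0] * (K - 1)          # shift register, state[0] is the newest cell
    out = []
    for u in bits:
        a = u                      # recursive (feedback) bit entering the register
        for i in range(1, K):
            a ^= fb_t[i] & state[i - 1]
        reg = [a] + state
        parities = tuple(sum(t & r for t, r in zip(tp, reg)) % 2 for tp in ff_t)
        out.append((u,) + parities)
        state = reg[:-1]
    return out


def seccc_encode(bits, perm):
    """SECCC encoder sketch: serialize u and interleaved u, RSC-encode, puncture by half."""
    interleaved = [bits[p] for p in perm]
    # Assumed parallel-to-serial conversion: alternate original and interleaved bits
    serial = [b for pair in zip(bits, interleaved) for b in pair]
    coded = [c for triple in rsc_encode(serial) for c in triple]   # rate 1/6 overall
    return coded[::2]              # hypothetical puncturing pattern, giving rate 1/3


info = [1, 0, 1, 1]
print(rsc_encode([1, 0, 0, 0]))              # impulse response of the component code
print(len(seccc_encode(info, [2, 0, 3, 1])))  # 3 coded bits per information bit
```

For $N$ information bits the serial stream has $2N$ bits, the RSC encoder emits $6N$ coded bits, and puncturing half of them leaves $3N$, which matches the overall rate of $1/3$ stated above.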
B. DECODING
The decoder for TC is shown in Fig. 1(b). It comprises two SISO decoding units connected in parallel to each other through an interleaver and a deinterleaver. The un-interleaved version of the channel Log Likelihood Ratios (LLRs), produced by the first encoder in Fig. 1(a), is received by the first MAP decoder. Since the bits are punctured at $R_2 = 1/2$ to achieve a code rate of $R = 1/3$, the depuncturer inserts zeros in the places of the punctured bits. The extrinsic LLRs produced in the first half iteration by the first MAP decoder are appropriately interleaved and, along with the interleaved version of the channel LLRs produced by the second encoder in Fig. 1(a), are processed by the second MAP decoder to produce a posteriori information as a result of the iteration. For the second iteration, the extrinsic information produced by the second MAP decoder is de-interleaved and, along with the channel LLRs, fed to the first MAP decoder. This iterative process continues for a certain number of iterations to achieve the desired BER performance. The SECCC scheme comprises a single RSC encoder and a single MAP decoder, as shown in Fig. 4. Unlike TC, in the SECCC scheme the component decoder produces the extrinsic information and exchanges it with itself for a specific number of iterations to achieve a desired BER performance. SECCC is close in performance to TC. Fig. 3(b) elaborates the concept of SECCC decoding, in which the output of the MAP decoder is converted to two parallel streams which need to be appropriately interleaved/de-interleaved. These two parallel streams are again merged into one serial sequence and fed back to the MAP decoder, where they become the a priori information for the next iteration. For understanding the decoding process, it is helpful to define the following important terms:
• a priori information: This is also called intrinsic information and is denoted by $L(u_k)$. It is the known information about a bit before commencing the decoding process.
• a posteriori LLR: This is the output of the component decoder, which it generates with all available information about the concerned bit $u_k$. It is denoted by $L(u_k|y)$.
• extrinsic information: This information is denoted by $L_e(u_k)$ and is obtained by excluding $L(u_k)$ (a priori information) from $L(u_k|y)$ (a posteriori output), as depicted in Fig. 1(b) and Fig. 3(b). After being interleaved or de-interleaved, $L_e(u_k)$ is sent as a priori information to the other component decoder to generate a more refined LLR for bit $u_k$ in the next half iteration.
• Iteration: One iteration of the turbo decoder completes when the first component decoder produces extrinsic information (for the original information sequence, from the channel LLRs alone) and provides it to the second component decoder, where it is utilized as a priori information along with the channel LLRs to produce a posteriori information. In this paper, an iteration of the MAP decoder is defined as the minimum number of MAP algorithm operations that are repeated. According to this definition, an iteration means invoking the MAP algorithmic unit only once, and this is equal to one SECCC iteration, as presented in Fig. 4(a). The iteration for TC is depicted in Fig. 4(b) and is equivalent to two SECCC iterations. Throughout this paper, an iteration is defined as one SECCC iteration.
C. MAX-LOG-MAP ALGORITHM
The Soft Output Viterbi Algorithm (SOVA) [43] and the MAP algorithm [5] are the two well-known algorithms used for decoding TCs. SOVA was considered to be the preferred algorithm for decoding block and convolutional codes. Unlike SOVA, the MAP algorithm is considered to be optimal but more complex. In the Log-MAP algorithm, the multiplications and additions of the MAP algorithm are replaced with additions and max* operations (or with max operations in the Max-log-MAP algorithm). In comparison to SOVA, the Log-MAP algorithm is three times more complex, whereas the complexity of the Max-log-MAP algorithm is twice that of SOVA [37]. Both these algorithms, at any time step $k$ and for all the trellis paths which enter each state at that time step, calculate a measure of similarity or distance between the transmitted and received symbols along the transmission length. The SOVA selects only the maximum likelihood paths at any time step, also known as the surviving paths, and discards the least likely paths.
On the other hand, the MAP algorithm inspects every possible path at any time step along the whole trellis, divides these paths into two groups (one for bit '0' and the other for bit '1') and then calculates the log-likelihood values for both of these sets. The MAP algorithm minimizes the probability of an incorrect path through the trellis and provides the estimated bit sequence as well as the probabilities for each correctly decoded bit [44]. Although the original MAP algorithm is optimal, it is computationally more intensive as it involves multiplications, additions and exponentials (non-linear functions). The flow of operations followed in the MAP algorithm is shown in Fig. 5. For the purpose of reducing computational complexity, the original MAP algorithm was first simplified to the Log-MAP algorithm and then to the Max-log-MAP algorithm [45], with a small sacrifice in BER performance. The Log-MAP algorithm is as optimal as the MAP algorithm but less complex, owing to its operation in the log domain. The Max-log-MAP algorithm is a sub-optimal version of the MAP algorithm that further reduces the complexity of the Log-MAP algorithm by using the maximization approximation. Because of its lower complexity, the Max-log-MAP algorithm is widely used in turbo decoder implementations, at the cost of a degradation in BER performance. However, this performance degradation can be compensated by using an extrinsic scaling factor [46]. We have used a scaling factor of 0.75 in our BER simulations, which improves the BER performance by 0.3 dB over the standard Max-log-MAP algorithm at a BER of $10^{-4}$ [47], [48]. This algorithm calculates the a posteriori LLR $L(u_k|y)$ of each bit $u_k$ by considering only the two best transitions. No trellis termination is used here.
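The state transition metric is calculated as given in [44]:

$$\gamma_k(s', s) = \frac{1}{2} u_k L(u_k) + \frac{L_c}{2} y_{ks} u_k + \chi_k(s', s) \tag{2}$$

The first term in (2) is the a priori probability, whereas $y_{ks}$ in the second term represents the received systematic channel LLRs. The third term $\chi_k(s', s)$ is equal to $\frac{L_c}{2} \sum_{l=2}^{n} y_{kl} x_{kl}$ and is used to calculate the state transition metrics for the two parity bits. Similarly, the recursive calculation of the forward state metric, denoted by $\alpha_k(s)$, and the backward state metric, denoted by $\beta_{k-1}(s')$, is performed by (3) and (4), respectively [44], where $\gamma_k(s', s)$ is the corresponding state transition metric.

$$\alpha_k(s) = \max_{s'} \left[ \alpha_{k-1}(s') + \gamma_k(s', s) \right] \tag{3}$$

$$\beta_{k-1}(s') = \max_{s} \left[ \beta_k(s) + \gamma_k(s', s) \right] \tag{4}$$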
The initial conditions for (3) and (4) are given below:
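The a posteriori LLR $L(u_k|y)$ for each bit $u_k$ is calculated by using (7):

$$L(u_k|y) = \max_{(s', s) \Rightarrow u_k = +1} \left[ \alpha_{k-1}(s') + \gamma_k(s', s) + \beta_k(s) \right] - \max_{(s', s) \Rightarrow u_k = -1} \left[ \alpha_{k-1}(s') + \gamma_k(s', s) + \beta_k(s) \right] \tag{7}$$

The extrinsic information is calculated by (8):

$$L_e(u_k) = L(u_k|y) - L_c y_{ks} - L(u_k) \tag{8}$$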
where $L(u_k|y)$ (a posteriori LLR) is produced by the component decoder and $L(u_k)$ is the a priori LLR, which is initially zero in the logarithmic domain, whereas in the iterative decoding process an estimate of $L(u_k)$ is provided by each component decoder to the other component decoder. $L_c y_{ks}$ represents the received soft LLR for the systematic bit $u_k$ from the channel demodulator. The calculation of the extrinsic LLR $L_e(u_k)$ depends on the constraints imposed by the component code used. $L_c y_{ks}$ and $L(u_k)$ are subtracted from $L(u_k|y)$ to produce the extrinsic LLR $L_e(u_k)$ for the systematic bit $u_k$ and not for the parity bits. $L_e(u_k)$ is then appropriately interleaved/deinterleaved and becomes the a priori information for the next iteration.
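The Max-log-MAP recursions described in this section can be illustrated with a short Python sketch over the 8-state trellis of the example RSC code. This is only a sketch under stated assumptions, not the authors' implementation: bits are mapped as 1 to +1 and 0 to -1, the encoder is assumed to start in state 0, all end states are taken as equally likely because no trellis termination is used, and the extrinsic output is multiplied by the scaling factor of 0.75 mentioned above. The names `build_trellis`, `encode` and `max_log_map` are hypothetical.

```python
NEG = float("-inf")


def build_trellis(fb=0o13, ff=(0o15, 0o17), K=4):
    """Map (state, u) -> (next_state, bipolar symbols [x_s, x_p1, x_p2])."""
    tap = lambda g, i: (g >> (K - 1 - i)) & 1
    trellis = {}
    for s in range(2 ** (K - 1)):
        cells = [(s >> (K - 2 - i)) & 1 for i in range(K - 1)]  # cells[0] is newest
        for u in (0, 1):
            a = u
            for i in range(1, K):
                a ^= tap(fb, i) & cells[i - 1]
            reg = [a] + cells
            par = [sum(tap(g, i) & reg[i] for i in range(K)) % 2 for g in ff]
            nxt = int("".join(map(str, reg[:-1])), 2)
            trellis[(s, u)] = (nxt, [2 * b - 1 for b in [u] + par])
    return trellis


def encode(bits, trellis):
    """Walk the trellis from state 0 and collect the bipolar code symbols."""
    s, out = 0, []
    for u in bits:
        s, x = trellis[(s, u)]
        out.append(x)
    return out


def max_log_map(Lc_y, La, trellis, n_states=8, scale=0.75):
    """Lc_y[k] = [Lc*y_s, Lc*y_p1, Lc*y_p2]; La[k] = a priori LLR. Returns (Lapp, Le)."""
    N = len(La)

    def gamma(k, s, u):
        nxt, x = trellis[(s, u)]
        # Branch metric: a priori term plus systematic and parity channel terms
        return nxt, 0.5 * x[0] * La[k] + 0.5 * sum(xi * yi for xi, yi in zip(x, Lc_y[k]))

    alpha = [[NEG] * n_states for _ in range(N + 1)]
    alpha[0][0] = 0.0                    # assumption: encoder starts in state 0
    for k in range(N):
        for s in range(n_states):
            for u in (0, 1):
                nxt, g = gamma(k, s, u)
                alpha[k + 1][nxt] = max(alpha[k + 1][nxt], alpha[k][s] + g)

    beta = [[NEG] * n_states for _ in range(N + 1)]
    beta[N] = [0.0] * n_states           # assumption: no termination, uniform end states
    for k in range(N - 1, -1, -1):
        for s in range(n_states):
            for u in (0, 1):
                nxt, g = gamma(k, s, u)
                beta[k][s] = max(beta[k][s], beta[k + 1][nxt] + g)

    Lapp, Le = [], []
    for k in range(N):
        best = {0: NEG, 1: NEG}
        for s in range(n_states):
            for u in (0, 1):
                nxt, g = gamma(k, s, u)
                best[u] = max(best[u], alpha[k][s] + g + beta[k + 1][nxt])
        L = best[1] - best[0]            # a posteriori LLR from the two best transitions
        Lapp.append(L)
        Le.append(scale * (L - Lc_y[k][0] - La[k]))   # scaled extrinsic LLR
    return Lapp, Le


trellis = build_trellis()
msg = [1, 0, 1, 1, 0, 0, 1, 0]
llr_in = [[4.0 * x for x in sym] for sym in encode(msg, trellis)]   # noise-free channel LLRs
app, ext = max_log_map(llr_in, [0.0] * len(msg), trellis)
print([1 if L > 0 else 0 for L in app])
```

Punctured positions would be represented by a zero entry in `Lc_y`, in line with the depuncturer described in Section II-B. In an iterative decoder the returned extrinsic values would be interleaved or de-interleaved and supplied as `La` in the next iteration.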
III. DECODER ARCHITECTURE
Fig. 6(a) depicts the block diagram of a general turbo decoder architecture, with some additional units added to that shown in [49]. The architecture contains the program interface, control unit, input buffer, address generator, MAP Decoding Unit (MDU) and output buffer. The MDU comprises computational units for $\alpha$, $\beta$, $\gamma$ and the LLRs, as shown in Fig. 6(b). As already explained above, a turbo decoder has at least two Soft-In-Soft-Out (SISO) algorithmic units, which are connected to each other in parallel through an interleaver and a deinterleaver. However, for hardware re-usability, the same MDU can be configured through a control mechanism to be used alternately as the inner and outer decoding component.
In the first half iteration, the encoded bits (which are now channel LLRs) produced by the upper encoder of Fig. 1(a) are processed by the MDU, and the extrinsic LLRs produced from these bits are stored in intermediate memory. In the second half iteration, the MDU processes the second set of LLRs produced by the lower encoder, using the deinterleaved version of the previously generated extrinsic LLRs as a priori information. The Max-log-MAP algorithm has been used as the SISO decoding algorithm in most of the implementations in the literature. The max operation in this algorithm approximates the logarithm of a sum of exponential terms, i.e., $\ln(e^{\lambda_1} + e^{\lambda_2}) \approx \max(\lambda_1, \lambda_2)$. The max operation is performed using a compare-and-select sequence of steps. The interleaver/deinterleaver uses the same permutation pattern as used at the encoder side. The system parameters, e.g., frame size, number of iterations, code rate and puncture rate, are passed to the control unit through the program interface. The control mechanism is implemented as a Finite State Machine (FSM) that coordinates the computational and storage units based on the received parameters. The control signals are represented by dashed lines in Fig. 6(a). The LLRs of the encoded bits from the channel are stored in the input buffer and are then fed to the MDU. The control unit sends a start signal to the MDU to initiate the decoding process. After the first half iteration is completed, a finish signal is generated by the control FSM for the address generator, which reads the interleaver/de-interleaver memory block to send the correct extrinsic information to the MDU for starting the next half iteration. Meanwhile, while the decoding unit is processing one block, the second input block is received by the input buffer. The MDU performs the decoding process as already stated above.
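The approximation $\ln(e^{\lambda_1} + e^{\lambda_2}) \approx \max(\lambda_1, \lambda_2)$ can be checked numerically. The sketch below compares the exact Jacobian logarithm (the max* operation of the Log-MAP algorithm) with the plain max used in Max-log-MAP; the function names `max_star` and `max_approx` are illustrative only.

```python
import math


def max_star(a, b):
    """Exact Jacobian logarithm ln(e^a + e^b), written as max plus a correction term."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))


def max_approx(a, b):
    """Max-log-MAP approximation: the correction term is dropped (compare and select)."""
    return max(a, b)


for a, b in [(0.0, 0.0), (1.0, 2.0), (3.0, -1.0)]:
    err = max_star(a, b) - max_approx(a, b)
    print(f"lambda1={a:+.1f} lambda2={b:+.1f} exact={max_star(a, b):.4f} error={err:.4f}")
```

The dropped correction term is largest (ln 2) when the two arguments are equal and decays quickly as they separate, which is why the approximation costs only a small amount of BER performance.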
The extrinsic LLRs produced in the first half iteration are written to memory in an interleaved manner and read by the decoding unit in a de-interleaved manner. We have used a pseudo-random interleaver in our design. The random permutation indexes are generated in Matlab, and a ROM holding these permutation indexes is defined in VHDL [50]. Finally, after a fixed number of iterations, the control FSM generates a signal to load the decoded bits into the output buffer. The architecture shown in Fig. 6 can be configured to build TC and SECCC decoders. The main difference between TC and SECCC is that, in the case of TC, the single physical MAP decoder can be separated into two virtual decoders, whereas in SECCC the single decoder is not separable. However, the architecture presented in Fig. 6 offers the same algorithmic and architectural complexity for both the TC and SECCC schemes for the same number of iterations.
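As a rough illustration of the permutation-ROM idea, the following Python sketch generates a pseudo-random permutation, applies it and its inverse, and formats the indexes as fixed-width binary words of the kind that could be used to initialise a ROM. The paper generates its indexes in Matlab; the seed, the helper names and the binary word format here are assumptions made for the example.

```python
import random


def make_interleaver(n, seed=0):
    """Pseudo-random permutation of 0..n-1 (seeded so the decoder can reuse it)."""
    rng = random.Random(seed)
    perm = list(range(n))
    rng.shuffle(perm)
    return perm


def interleave(x, perm):
    """Read x at the permuted addresses."""
    return [x[p] for p in perm]


def deinterleave(y, perm):
    """Write y back to the permuted addresses, undoing interleave()."""
    x = [None] * len(y)
    for i, p in enumerate(perm):
        x[p] = y[i]
    return x


def rom_image(perm):
    """Permutation indexes as fixed-width binary strings (hypothetical ROM contents)."""
    width = max(1, (len(perm) - 1).bit_length())
    return [format(p, "0{}b".format(width)) for p in perm]


perm = make_interleaver(8, seed=1)
extrinsic = [0.5, -1.2, 3.0, -0.7, 2.2, -2.9, 0.1, 1.4]
assert deinterleave(interleave(extrinsic, perm), perm) == extrinsic
print(rom_image(perm))
```

Storing the indexes rather than computing them at run time means the same address sequence serves both directions: reading at the stored addresses interleaves, and writing at them de-interleaves.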
B. ARCHITECTURE OF PARALLEL MAP DECODER
The parallel MAP decoder architecture is presented in this section and its block diagram is shown in Fig. 7. A number of parallel MAP decoder architectures with different parallelism levels for TCs have been presented in the literature [8], [51]-[59]. The parallel MAP decoder shown in Fig. 7 can be configured for any level of parallelism $p \in \{2, 4, 8, 16, 32, 64\}$ for decoding frames of sizes divisible by the level of parallelism. A sub-block of $N/p$ bits is received and decoded by each MAP decoder, where $N$ is the frame size and $p$ is the level of parallelism, thus reducing the decoding delay of each iteration [60]. The parallel decoder architecture shown in Fig. 7 contains a stack of $p$ MAP decoders and three storage blocks or memories, where each memory contains $p$ sub-memories. The first memory contains sub-memories $M_1$ to $M_p$, which store the input a priori LLRs; the second has sub-memories $M_{1a}$ to $M_{1p}$, which store the extrinsic information $L_e(u_k)$; and the third has sub-memories $M_{2a}$ to $M_{2p}$, which store the interleaved and de-interleaved information. For synchronizing and routing data between the MAP units and the memories, there are two controllers, named Controller 1 and Controller 2.

FIGURE 7. Parallel MAP decoder architecture for TC and SECCC.
The random permutation pattern of addresses is saved in ROM in order to address the memory collision problem [50]. To start the decoding process, the $p$ start$_1$ signals generated by Controller 1 are received by the corresponding MAP decoders. When all the MAP decoders have produced extrinsic information $L_e$ after the first iteration, the $p$-th MAP decoder sends a finish-iteration signal to Controller 2. A rd-addr signal is generated by Controller 2 for reading the randomly permuted interleaved and de-interleaved addresses from the ROM. These addresses are then given to sub-memories $M_{1a}$ to $M_{1p}$ for fetching the relevant extrinsic information from them. The fetched information is stored in sub-memories $M_{2a}$ to $M_{2p}$ and is then utilized as a priori information for the next iteration. Controller 2 generates the start$_2$ signal for all the MAP units to initiate the next iteration. A similar procedure is adopted to run each iteration, and after the desired number of iterations, Controller 2 sets the finish-flag high.
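To make the sub-block partitioning and the memory collision problem concrete, the sketch below splits a frame of $N$ bits into $p$ sub-blocks of $N/p$ bits and counts the clock cycles in which two MAP units would address the same sub-memory when reading through the interleaver. The access schedule (unit $j$ processes index $j \cdot N/p + t$ in cycle $t$, one sub-memory per sub-block) is an assumption made for illustration and is not taken from the paper's VHDL design; `sub_blocks` and `count_collisions` are hypothetical names.

```python
def sub_blocks(N, p):
    """Index ranges of the p sub-blocks of an N-bit frame (N must be divisible by p)."""
    if N % p:
        raise ValueError("frame size must be divisible by the level of parallelism")
    size = N // p
    return [range(j * size, (j + 1) * size) for j in range(p)]


def count_collisions(perm, p):
    """Cycles in which at least two MAP units target the same sub-memory (assumed schedule)."""
    size = len(perm) // p
    collisions = 0
    for t in range(size):
        # Sub-memory (bank) addressed by each unit in cycle t after interleaving
        banks = [perm[j * size + t] // size for j in range(p)]
        if len(set(banks)) < p:
            collisions += 1
    return collisions


blocks = sub_blocks(6144, 64)
print(len(blocks), len(blocks[0]))           # 64 sub-blocks of 96 bits each
print(count_collisions(list(range(8)), 2))   # identity permutation: no collisions
print(count_collisions([0, 2, 1, 3], 2))     # both cycles collide in this toy example
```

A pseudo-random permutation will generally produce such collisions, which is why the access pattern has to be resolved in advance and stored rather than generated freely at run time.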
IV. RESULTS AND DISCUSSIONS
The error correction performance of channel codes can be analyzed by EXIT charts, by union bounds, or by plotting the BER against $E_b/N_0$ to find the minimum value of $E_b/N_0$ suitable for reliable communication, where $E_b$ is the energy per information bit and $N_0$ is the noise power spectral density. The EXIT charts and the expression for the calculation of the union bounds of the SECCC scheme with BPSK modulation over an AWGN channel are given in Section IV-A. The simulations are carried out in Matlab to evaluate the BER performance of the TC and SECCC decoders. The simulation results with different frame sizes and parallelism are presented in Section IV-B. In order to measure the architectural performance of the TC and SECCC decoders, the VHDL design of the MAP decoder discussed in Section III is configured for TC and SECCC and synthesized using Xilinx ISE for a Virtex-6 FPGA (XC6VLX240T). Finally, Section IV-C presents the synthesis results.
A. EXTRINSIC INFORMATION TRANSFER (EXIT) CHARACTERISTIC AND UNION BOUNDS FOR SECCC SCHEME
EXIT charts proposed by ten Brink [61] are helpful to predict the convergence behavior of iterative decoder.
EXIT charts plot the extrinsic information transfer characteristics of the constituent decoders in a single diagram, where the two curves are mirror images of each other. Convergence is only possible if the transfer characteristics do not intersect. The chart also provides an estimate of the average number of decoding steps or iterations required for convergence.
EXIT chart analysis for different code rates and modulation schemes of High Speed Downlink Packet Access (HSDPA) turbo decoder is presented in [62] , whereas for various SECCC schemes, the EXIT chart analysis is presented in [33] and [63] .
The EXIT charts for the SECCC scheme based on the Log-MAP decoder are shown in Figs. 8 and 9 at $E_b/N_0$ = 1 dB and 1.7 dB, respectively, with code rate $R = 1/3$, $\nu = 3$ and an AWGN channel for an interleaver size of 20000 bits. Fig. 8 also shows the trajectories (snapshot numbers 2 and 18) of SECCC iterative decoding at $E_b/N_0 = 1$ dB. The trajectories at this value of $E_b/N_0$ can pass through the tunnel and reach the (1,1) convergence point when the number of iterations is increased to 40. Since a high number of iterations increases the hardware complexity and decreases the decoding speed, the system is configured to operate with 8 decoding iterations. Here, $I_A$ and $I_E$ represent the a priori and extrinsic mutual information of the bit stream, respectively. Furthermore, Fig. 9 shows that the trajectories (snapshot numbers 2 and 18) in the EXIT chart reach the (1,1) convergence point with 8 decoding iterations at a relatively higher $E_b/N_0$ value. In order to reduce the decoding complexity, we have considered the Max-log-MAP algorithm (with a scaling factor of 0.75) in our simulation results shown in Fig. 10. More specifically, the BER of the SECCC starts to converge to a low value at $E_b/N_0 = 1.8$ dB (instead of 1.7 dB, as predicted by its EXIT chart in Fig. 9) when the Max-log-MAP algorithm is employed in the decoder. This small performance loss of 0.1 dB is due to the use of the Max-log-MAP instead of the Log-MAP algorithm in the BER simulation [47].
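The mutual information values $I_A$ and $I_E$ plotted on an EXIT chart are typically measured from simulated LLRs. The sketch below uses one common time-average estimator, $I \approx 1 - \frac{1}{N}\sum_k \log_2(1 + e^{-x_k L_k})$ with $x_k = \pm 1$, which assumes that the LLRs are consistent (i.e., behave like true log-likelihood ratios). Whether the authors used this estimator or a histogram-based one is not stated in this section, so the function `mutual_information` is an illustrative assumption.

```python
import math


def mutual_information(llrs, bits):
    """Time-average estimate of I(u; L) in bits, assuming consistent LLRs."""
    total = 0.0
    for L, b in zip(llrs, bits):
        x = 1 if b else -1
        z = -x * L
        # Numerically stable evaluation of log2(1 + e^z)
        total += (max(z, 0.0) + math.log1p(math.exp(-abs(z)))) / math.log(2.0)
    return 1.0 - total / len(llrs)


print(mutual_information([0.0, 0.0, 0.0], [1, 0, 1]))    # uninformative LLRs: 0 bits
print(mutual_information([50.0, -50.0], [1, 0]))         # confident, correct LLRs: close to 1 bit
```

Evaluating this quantity on the a priori input and on the extrinsic output of the decoder after each iteration gives the $(I_A, I_E)$ points that form a decoding trajectory.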
EXIT charts can estimate the BER floors for considerably large interleaver sizes. Another technique to determine the BER floor is the truncated union bound analysis, which can be employed for arbitrary interleaver sizes. The truncated union bound analysis for SECCCs with a uniform interleaver has been presented in [28]. It facilitates the design of various SECCCs for a desired BER floor. The following relation expresses the union bound for the average BER of a channel code, as given in [64]:
$$P_b \le \sum_{H} B_H \, P(x \rightarrow \hat{x}),$$
where $P(x \rightarrow \hat{x})$ denotes the Pair-Wise Error Probability (PWEP), which for BPSK transmission over the AWGN channel is:
$$P(x \rightarrow \hat{x}) = Q\left(\sqrt{2 R H \frac{E_b}{N_0}}\right).$$
Here, $x$ is the encoded sequence without errors, whereas $\hat{x}$ is the erroneous encoded sequence. $B_H$ is the distance spectrum of the code, $B_H = \sum_{w} \frac{w}{N} A_{w,\delta}$, where $A_{w,\delta}$ is the Weight Enumerating Function (WEF), which represents the average number of error events in the sequence, with $w$ and $\delta$ denoting the number of erroneous systematic and erroneous parity bits, respectively. $H$ is the effective Hamming distance. In the SECCC scheme, one MAP decoder has to iterate with itself twice to complete one turbo decoding step, so we consider two hypothetical MAP decoders. Hence, the WEF for SECCC is defined as follows [28]:
$$A_{w,\delta} = A^{(1)}_{2w,\delta^{(1)}} \cdot A^{(2)}_{2w,\delta^{(2)}} \cdot P^{N,w}_{\pi}.$$
The third term in the above equation specifies the probability of occurrence of all erroneous events for an interleaver size of $N$ bits. Eqs. (9) and (10) can be combined to give a union bound for the SECCC scheme with BPSK modulation for transmission over an AWGN channel. The detailed derivation of the union bound for the SECCC scheme can be found in [28].
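As an illustration of how a truncated union bound of this kind can be evaluated numerically, the following minimal sketch assumes the common form in which each distance-spectrum term $B_H$ is weighted by the BPSK/AWGN pair-wise error probability $Q(\sqrt{2 R H E_b/N_0})$. The spectrum values are hypothetical placeholders chosen only for illustration, not the spectrum of any code analyzed in this paper, and the function names are illustrative.

```python
import math


def q_function(x: float) -> float:
    """Gaussian tail probability Q(x), expressed through the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def truncated_union_bound(spectrum: dict, rate: float, ebn0_db: float) -> float:
    """Sum B_H * Q(sqrt(2 * R * H * Eb/N0)) over the supplied (truncated) spectrum.

    spectrum maps an effective Hamming distance H to its multiplicity B_H.
    The BPSK/AWGN pair-wise error probability is an assumption of this sketch.
    """
    ebn0 = 10.0 ** (ebn0_db / 10.0)  # convert dB to a linear ratio
    return sum(
        b_h * q_function(math.sqrt(2.0 * rate * h * ebn0))
        for h, b_h in spectrum.items()
    )


# Hypothetical distance spectrum {H: B_H}; placeholder values only.
toy_spectrum = {10: 1e-4, 12: 5e-4, 14: 2e-3}

for snr_db in (1.0, 2.0, 3.0):
    bound = truncated_union_bound(toy_spectrum, rate=1.0 / 3.0, ebn0_db=snr_db)
    print(f"Eb/N0 = {snr_db:.1f} dB -> truncated bound = {bound:.3e}")
```

Because only the lowest-distance terms are retained, such a bound is mainly informative in the error-floor region, where those terms dominate.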
B. ERROR RATE PERFORMANCE
In this section, based on the schematics of Fig. 1(a) and Fig. 1(b), the BER performance of TC and SECCC without parallelism and with different parallelism levels ($p$) is presented. The Frame Error Rate (FER) is also calculated at the higher parallelism levels of 128 and 256 for TC and SECCC and compared with the literature. We have used a pseudo-random interleaver in our MATLAB simulations. An S-random interleaver would give better performance, while the CMI would be optimum; however, interleaver design is not the focus of this paper. The performance of these schemes without parallelism is shown in Fig. 11 for different frame sizes. Like other channel codes [65], TCs perform better for longer frame sizes. However, the purpose of the research presented in this paper is to observe the behavior of the SECCC scheme for different frame sizes and with different levels of parallelism. Since no work has been reported in the literature regarding the parallelism of SECCC, we compared this behavior of SECCC with that of TC employing the same code rate, frame sizes and parallelism. The results obtained from the analysis of both schemes show that SECCC performs better than TC for frame sizes less than or equal to 2048 bits with parallelism of 16, 32 and 64.
Fig. 11 shows the minimum attainable BER for both the TC and SECCC schemes with different frame sizes when performing 8 iterations and using a scaling factor of 0.75 with the Max-log-MAP algorithm [46]. TC and SECCC show the same performance for a frame size of 40 bits; however, TC outperforms SECCC for larger frames. The floating-point and quantized performance for a 20,000-bit frame is shown in Fig. 10, where the (5,7) quantization uses 5 bits for the integer part and 7 bits for the fractional part. The BER performance of TC at $E_b/N_0$ = 1.4 dB and of SECCC at $E_b/N_0$ = 1.8 dB, both with the same number of decoding iterations, is compared with that of the uncoded BPSK scheme at $E_b/N_0$ = 8.4 dB.
A coding gain of 7 dB and 6.6 dB is achieved by TC and SECCC, respectively, over uncoded BPSK at a BER of $10^{-4}$ when transmitting over an AWGN channel.
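The (5,7) quantization mentioned above can be illustrated with a small fixed-point model. This is a sketch only: it assumes a signed two's-complement representation in which the sign bit is counted among the 5 integer bits, rounding to the nearest level, and saturation on overflow. The paper does not state these details, and the function name is hypothetical.

```python
def quantize_fixed_point(value: float, int_bits: int = 5, frac_bits: int = 7) -> float:
    """Quantize a real value to a signed fixed-point grid and return it as a float.

    Assumptions of this sketch: two's complement, sign bit included in int_bits,
    rounding to the nearest level (Python's round resolves exact ties to even),
    and saturation at the ends of the representable range.
    """
    scale = 1 << frac_bits                       # 2**7 = 128 levels per unit
    total_bits = int_bits + frac_bits            # 12-bit word
    max_code = (1 << (total_bits - 1)) - 1       # largest positive integer code
    min_code = -(1 << (total_bits - 1))          # most negative integer code
    code = round(value * scale)                  # nearest integer code
    code = max(min_code, min(max_code, code))    # saturate instead of wrapping
    return code / scale


# Example log-likelihood-ratio-like inputs, including values outside the range.
for llr in (0.3, -2.71828, 15.999, 40.0, -40.0):
    print(f"{llr:>10.5f} -> {quantize_fixed_point(llr):>12.7f}")
```

Under these assumptions the resolution is 1/128 and the representable range is from -16 to just below +16, so large metrics saturate rather than wrap around.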
The effect of parallelism on the TC scheme has been presented in the literature [8], [56], [57]. In this paper, the effect of parallelism on the SECCC scheme is investigated and compared with TC. Fig. 12 shows the BER plot for rate-1/3 TC and SECCC with 100 frames of 6144 bits and different parallelism levels $p \in \{2, 4, 8, 16, 32, 64\}$. This is based on the non-initialized method with the standard Max-log-MAP algorithm, where the αs for all windows in every iteration are initialized with 0 and the βs for all windows in every iteration are initialized with 1. In contrast, the BER plot shown in Fig. 13 is obtained by combining the NII method [24] with parallelism and also using the scaling factor [46] for performance improvement. The NII method improved the BER performance by 0.2 dB, whereas a further improvement of 0.3 dB at a BER of $10^{-4}$ is achieved by multiplying the extrinsic information by the scaling factor of 0.75 in each iteration [47].
The NII method initializes the αs at the left end of one window with the αs obtained at the right end of the left-neighboring window in the previous iteration. Likewise, it initializes the βs at the right end of one window with the βs obtained at the left end of the right-neighboring window in the previous iteration. To elaborate, let the decoder operate at a parallelism level $p \in \{2, 4, 8, 16, 32, 64, 128, 256\}$, let $k_p$ index the windows at that level, and let $I$ denote the iteration number. According to the NII method, at iteration $I$ the backward state metrics of window $k_p$ are initialized with the ones computed in window $k_p + 1$ in the previous iteration $(I - 1)$, i.e., $\beta(I, k_p) = \beta(I - 1, k_p + 1)$. Similarly, the forward state metrics of window $k_p$ are initialized with the ones computed in window $k_p - 1$ in the previous iteration $(I - 1)$, i.e., $\alpha(I, k_p) = \alpha(I - 1, k_p - 1)$. However, the first iteration is performed with the uninitialized forward and backward state metrics, using Eqs. (5) and (6), respectively.
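The boundary exchange performed by the NII method can be sketched as follows. This is an illustrative model rather than the decoder implemented in this work: state metrics are plain Python lists, the function and variable names are hypothetical, and the metrics used at the two frame boundaries (first window for α, last window for β) are passed in as assumed defaults.

```python
def nii_initial_metrics(prev_alpha_right, prev_beta_left, frame_alpha, frame_beta):
    """Return per-window initial alpha and beta metrics for the next iteration.

    prev_alpha_right[k]: forward metrics at the right end of window k (previous iteration)
    prev_beta_left[k]:   backward metrics at the left end of window k (previous iteration)
    frame_alpha / frame_beta: assumed metrics at the start / end of the whole frame
    """
    p = len(prev_alpha_right)
    # Window k starts its forward recursion from what window k-1 ended with.
    alpha_init = [frame_alpha] + [prev_alpha_right[k - 1] for k in range(1, p)]
    # Window k starts its backward recursion from what window k+1 ended with.
    beta_init = [prev_beta_left[k + 1] for k in range(p - 1)] + [frame_beta]
    return alpha_init, beta_init


# Toy example: p = 3 windows, 2 trellis states, made-up metric values.
prev_alpha_right = [[0.0, -1.0], [-2.0, 0.0], [0.0, -3.0]]
prev_beta_left = [[0.0, -4.0], [-5.0, 0.0], [0.0, -6.0]]
alpha_init, beta_init = nii_initial_metrics(
    prev_alpha_right, prev_beta_left, frame_alpha=[0.0, 0.0], frame_beta=[0.0, 0.0]
)
print("alpha init per window:", alpha_init)
print("beta init per window: ", beta_init)
```

Only one set of boundary metrics per window needs to be stored between iterations, which is why this kind of initialization adds little memory compared with the non-initialized method.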
It can be observed in Fig. 12 and Fig. 13 that the difference in performance at a BER of $10^{-4}$ between $p = 2$ and $p = 64$ is 0.3 dB for TC, whereas for SECCC it is 0.15 dB. Moreover, the BER performance of both schemes for the higher parallelism levels of $p = 128$ and $p = 256$ is also analyzed with both the non-initialized method and the NII method, as shown in Fig. 14 and Fig. 15, respectively. The BER curve obtained with the non-initialized method at higher parallelism shows an unusual behavior, because the bits near the two ends of each window do not benefit from both αs and βs; they benefit from only one of them. Hence, some bits have better error correction than others, and the floor in the BER plot occurs when the SNR is high enough to recover the bits with better error correction but not high enough to recover the bits near the ends of the windows. By employing the NII method, our simulation results in Fig. 15 show that the error floor disappears completely for SECCC, whereas a small error floor still exists for TC at BER lower than $10^{-5}$. Fig. 15 shows that TC with the NII method performs better than SECCC for a BER of $10^{-4}$ at $p = 128$ and for a BER of $10^{-2}$ at $p = 256$. However, for achieving BER lower than $10^{-4}$ and $10^{-2}$ at $p = 128$ and 256, respectively, SECCC shows better performance than TC.
Fig. 16 shows the effect of parallelism on TC and SECCC with varying frame sizes, in terms of the minimum $E_b/N_0$ required to achieve a BER of $10^{-4}$ with 8 decoding iterations when transmitting over an AWGN channel. Fig. 16 shows that small frames with higher parallelism require higher $E_b/N_0$ values to achieve the desired BER than large frames. SECCC outperforms TC for frame sizes less than or equal to 2048 bits at parallelism of 16, 32 and 64, as shown in Fig. 16. For example, for a frame size of 2048 bits, the performance of SECCC is superior to that of TC by 0.2 dB, 0.55 dB and 1 dB at parallelism of 16, 32 and 64, respectively.
This is because at higher parallelism levels a 2048-bit frame is subdivided into smaller sub-frames, and the length of the single trellis of SECCC is twice the length of each of the two trellises of TC. Moreover, the degradation in BER performance between parallelism levels for different frame sizes is lower for SECCC than for TC. Hence, there are certain frame sizes and parallelism levels where SECCC performs better than TC. Fig. 17 expresses the error performance of TC and SECCC in terms of FER. It can be observed from Fig. 17 that both schemes have an equal FER of $10^{-2}$ at $E_b/N_0$ of 2.5 dB for $p = 128$. However, SECCC provides an FER much lower than $10^{-3}$ beyond 2.5 dB. At a parallelism level of 256, the SECCC curve decays faster than the TC curve at an FER of $10^{-3}$. The FER results shown in Fig. 17 are comparable to those reported for the rate-1/3 LTE turbo decoder [22], where the Fully Parallel Turbo Decoder (FPTD) and the Iteration Unrolled XMAP (UXMAP) provide an FER of $10^{-3}$ at $E_b/N_0$ of 2.5 dB while performing 17 and 4 iterations, respectively. The SECCC at parallelism of 128 and 256 shows the same FER performance with 8 decoding iterations (which are equivalent to 4 turbo iterations).
C. SYNTHESIS RESULTS
The VHDL design of the MAP decoder architecture shown in Fig. 6(b) is configured for the SECCC and TC decoders and synthesized using Xilinx ISE on a Virtex-6 FPGA (XC6VLX240T) for different frame sizes and parallelism levels. Synthesis results are expressed as clock frequency in MHz and resource consumption in Look-Up Tables (LUTs). The clock frequency is the same for the SECCC and TC decoders, namely 86.3 MHz. The resource consumption in terms of LUTs for different frame sizes without parallelism is given in Table 1. The throughput achieved without parallelism is 10.7 Mbps with 8 decoding iterations, as shown in Table 1. The throughput increases linearly with parallelism; the estimated throughput for different parallelism levels with a frame size of 6144 bits is given in Table 2. The sub-frame length decreases with increasing parallelism. The resource consumption depends on the length of the sub-frames and the number of decoders operating in parallel. The parallel architecture shown in Fig. 7 is configured for parallelism levels of 2, 4, 8, 16, 32, 64, 128 and 256, and the resource consumption using a frame size of 6144 bits for each parallelism level is presented in Table 2. The throughput is calculated as $f_{clk} \times p / I$ [7], where $f_{clk}$ is the clock frequency, $p$ is the level of parallelism and $I$ represents the number of iterations.
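The throughput estimate can be reproduced with a few lines of arithmetic. The sketch below simply evaluates $f_{clk} \times p / I$ with the 86.3 MHz clock and 8 iterations quoted above; the function name is illustrative. With these rounded inputs it gives roughly 10.8 Mbps for $p = 1$ and roughly 690 Mbps for $p = 64$, close to the tabulated figures; the small differences presumably stem from rounding of the reported clock frequency.

```python
def throughput_mbps(f_clk_mhz: float, parallelism: int, iterations: int) -> float:
    """Estimated decoder throughput in Mbps: clock frequency times parallelism over iterations."""
    return f_clk_mhz * parallelism / iterations


F_CLK_MHZ = 86.3   # clock frequency reported for both decoders
ITERATIONS = 8     # decoding iterations used throughout the paper

for p in (1, 2, 4, 8, 16, 32, 64, 128, 256):
    print(f"p = {p:>3}: {throughput_mbps(F_CLK_MHZ, p, ITERATIONS):8.1f} Mbps")
```

The linear dependence on $p$ is visible directly: doubling the parallelism doubles the estimate, provided the clock frequency stays constant.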
Another important aspect is complexity, which refers to the number of operations per bit. For a MAP decoder, the complexity can be quantified in terms of the number of trellis states per bit [44]. The complexity of TCs can be expressed as $C(\mathrm{TC}) = 2 \times 2^{\nu} \times I$, where $2^{\nu}$ is the number of states of the encoder with memory $\nu$ and $I$ is the number of activations of the decoding algorithm [66]. For SECCC, $C(\mathrm{SECCC}) = 2^{\nu} \times I = 0.5\,C(\mathrm{TC})$. Therefore, the complexity of SECCC with 8 iterations is equal to the complexity of TC with 4 iterations [33].
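As a worked instance of this equivalence, the two expressions can be evaluated for a memory-3 constituent code, taking $\nu = 3$ as in the EXIT chart setup; the choice of $\nu$ here is an assumption made for illustration, and the equality holds for any $\nu$.

```latex
\begin{align*}
C(\mathrm{TC})\big|_{\nu = 3,\; I = 4}    &= 2 \times 2^{3} \times 4 = 64, \\
C(\mathrm{SECCC})\big|_{\nu = 3,\; I = 8} &= 2^{3} \times 8 = 64 .
\end{align*}
```

Both schemes therefore process the same number of trellis states per bit when SECCC uses twice as many iterations as TC.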
In the case of a parallel decoder architecture with parallelism level $p$, the frame of $N$ bits is divided among $p$ MAP decoders such that each MAP decoder processes an $N/p$-bit sub-frame. Hence, a parallel MAP decoder with parallelism $p$ is $p$ times faster in decoding speed, but it is also $p$ times higher in hardware complexity. However, according to the above definition of complexity, the complexity in terms of total processed trellis states stays the same for the sequential and the parallel MAP decoder.
V. CONCLUSION AND FUTURE WORK
In this paper, the BER performance of SECCC is investigated with and without parallelism and with different frame sizes, and compared with that of TC, where both schemes employ the same RSC code and code rate. To provide a complete comparison, the VHDL design of the MAP decoder for both schemes is synthesized using Xilinx ISE. The synthesis results show that both schemes produce equal throughput and exhibit equal resource utilization for the same number of iterations, frame sizes and parallelism. Based on the simulation results presented in Section IV-B, it can be concluded that for a BER of $10^{-4}$, SECCC outperforms TC for frame sizes less than or equal to 2048 bits with parallelism of 16, 32 and 64, as well as for frame sizes greater than or equal to 6144 bits with parallelism of 256. Moreover, Fig. 13 and Fig. 15 show that when targeting a BER of $10^{-6}$ at parallelism of 32, 64, 128 and 256 with the NII method, TC still exhibits a small error floor. At higher parallelism the frame is divided into smaller sub-frames, and since the single trellis of SECCC is longer than each of the two trellises of TC, SECCC performs better for smaller frames at higher parallelism. From the synthesis results tabulated in Table 2, the parallel MAP decoder can achieve a throughput of 691.4 Mbps, 1.38 Gbps and 2.76 Gbps at parallelism of 64, 128 and 256, respectively. The FER performance of SECCC at the higher parallelism levels of 128 and 256 is comparable to that of the FPTD and UXMAP of [22].
Future communication standards will demand high throughput and low latency, which can be achieved by employing high parallelism, but at the cost of a degradation in BER performance. Until 2018, most of the implementations in the literature achieved a maximum parallelism of 64, while the recent work proposed in [21] employs a parallelism of 128 to serve the throughput and latency demands of 5G. However, the increasing throughput demands of future communication standards may require parallelism greater than 128. It is shown in this paper that SECCC employing the same constituent code and code rate as TC gives better performance at $p = 128$ for achieving BER lower than $10^{-4}$ and outperforms it significantly for parallelism greater than 128. Hence, for certain frame sizes and parallelism levels, either the TC or the SECCC decoder architecture can independently provide the desired performance. This analysis is therefore beneficial for proposing, as future work, a reconfigurable architecture capable of operating in either TC or SECCC mode and of supporting any frame size and parallelism without significant degradation in BER performance. This paper is focused on the successful demonstration of the concept of SECCC with parallelism. However, for improved throughput and BER performance, the state-of-the-art ACQ technique combined with NII [26] will be considered in our future work for the design of an LTE-SECCC scheme.
