This paper describes the optimization, parallelization, and simulated execution performance of a software double-binary turbo decoder implementation supporting the WiMAX standard suitable for software-defined radio (SDR). Turbo codes offer excellent error-correcting performance, but they introduce significant computational demands in a communication system. In order to enhance execution performance for SDR, software for a turbo decoder based on the maximum a posteriori (MAP) algorithm was first adapted from the open-source Coded Modulation Library. Optimization and parallelization of the adapted software were then pursued and assessed with a multiprocessor version of the SimpleScalar simulator. Simulation results show that serial optimizations of the original adapted stand-alone C decoder software improve performance by more than 200%. The use of special instructions to accelerate important functions provides a further benefit of nearly 40% relative to the new baseline for performance. Exploiting the parallelism available in the MAP algorithm then yields a speedup of 10.8 on 12 processors. Simulation also shows that cache effects do not have a significant impact on parallel execution times.
I Introduction
The number of current and evolving wireless standards, coupled with the need to react quickly to changing market requirements, has led to the emergence of communication-processing architectures that offer more flexibility than previous application-specific integrated circuit (ASIC) designs [1]. More flexible wireless solutions are being actively pursued with greater reliance on general-purpose, processor-centred designs [2]. This trend has led to the emergence of the software-defined radio (SDR) paradigm, where a significant portion of the physical-layer functionality for a wireless standard is implemented in software.
Turbo codes have gained increasing popularity since their inception in 1993 because they offer superior bit error rate (BER) performance in comparison to other channel codes for forward error correction [3]. WiMAX, an emerging wireless radio standard, relies on enhanced double-binary turbo codes [4], which offer better BER performance compared to earlier single-binary turbo codes [5]. Turbo decoding is one of the most computation-intensive functions in a wireless communications system. Thus, to support next-generation wireless standards such as WiMAX in an SDR environment, an efficient software implementation of a double-binary turbo decoder is desirable, particularly as microelectronics technology enables the integration of multiple processors in a single chip to support parallel execution. This paper describes the adaptation, optimization, and parallelization of software for a WiMAX-capable double-binary turbo decoder suitable for SDR. The decoder software was adapted from the open-source Coded Modulation Library (CML) [6]. CML is implemented in the MATLAB software environment with C extensions, and it offers an extensive communications simulation environment for verification of decoder functionality and assessment of communication performance. In order to enhance execution performance for SDR, a stand-alone implementation entirely in C was adapted from the base CML software turbo decoder based on the maximum a posteriori (MAP) algorithm. This adaptation was first optimized for a substantial reduction in serial execution time with software enhancements as well as special instructions to accelerate frequently used functions in the decoding algorithm. The enhanced decoder software was then parallelized for significant speedup on an arbitrary number of processors.
The execution performance for the serial and parallel enhancements was assessed with a multiprocessor version of the SimpleScalar simulator [7] . Functional and cache simulations were used to compare ideal parallel execution times with predicted execution times that reflect cache effects. The serial enhancements improve ideal execution performance by more than 200% over the original adapted stand-alone C decoder implementation. Relative to this new baseline, special instructions provide a further serial performance improvement of up to 40%, and finally ideal parallel execution provides a substantial speedup of up to 10.8 on 12 processors. Data cache miss rates are shown to be low for limited impact on parallel execution times.
The remainder of this paper is organized as follows. Section II provides an overview of the WiMAX turbo code. Decoding of turbo codes using the log-MAP algorithm is described in Section III. Section IV discusses serial and parallel design of the decoding algorithm. The implementation of various optimizations and the parallelization of the enhanced decoder software are described in Section V. Simulation results to assess serial and parallel execution performance are presented in Section VI. Finally, the paper is concluded in Section VII.
II The WiMAX turbo code
The turbo code defined by the WiMAX standard is a double-binary circular recursive systematic convolutional (CRSC) code [4]. Fig. 1 shows the WiMAX convolutional turbo code (CTC) encoder and its constituent recursive encoder, with a constraint length of four [8]. Consecutive data bits $a_k$ and $b_k$ are fed to the encoder in blocks of 2N bits or N couples, where $(a_k, b_k)$ is the data-bit pair at time k. In symbolic notation, the polynomials defining the connections are $1 + D + D^3$ for the feedback branch, $1 + D^2 + D^3$ for the $y_k$ parity bit, and $1 + D^3$ for the $w_k$ parity bit, where D is a delay or storage element. The parity bit pair $\{y_{1,k}, w_{1,k}\}$ is generated from the data sequence in natural order, whereas $\{y_{2,k}, w_{2,k}\}$ is the parity bit pair generated from the interleaved sequence.
Two consecutive bits are used as simultaneous inputs. Hence the encoder has four possible state transitions, as opposed to two possible state transitions for a single-binary turbo encoder. There are eight possible encoder states, where the current state is defined by S = 4s1 + 2s2 + s3. The trellis diagram for the WiMAX turbo code, with all possible state transitions for a given input data-bit pair (ab) along with its corresponding parity bit pair (yw), is illustrated in Fig. 2 . Because the WiMAX encoder standard utilizes a circular code, the start and end states of the encoded data sequence are forced to be the same.
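The trellis described above can be tabulated directly from the generator polynomials. The following is a minimal sketch in C, not the CML code: the names `branch_t`, `trellis`, and `build_trellis` are hypothetical, and the taps through which b enters the second and third storage elements are an assumption based on the usual duo-binary CRSC encoder diagram, since the text gives only the polynomials.

```c
#include <stdint.h>

#define NUM_STATES 8   /* three storage elements s1, s2, s3 */
#define NUM_INPUTS 4   /* one data-bit pair (a, b) per trellis step */

/* One trellis branch: the state it ends in and the parity pair it emits. */
typedef struct {
    uint8_t next_state;  /* S = 4*s1 + 2*s2 + s3 after the step */
    uint8_t parity;      /* packed as 2*y + w */
} branch_t;

static branch_t trellis[NUM_STATES][NUM_INPUTS];

/* Fill the 8 x 4 = 32 branch entries once.
 * ASSUMPTION: b is also added into the inputs of s2 and s3; this
 * wiring is not stated in the text and is illustrative only. */
void build_trellis(void)
{
    for (unsigned s = 0; s < NUM_STATES; s++) {
        unsigned s1 = (s >> 2) & 1u, s2 = (s >> 1) & 1u, s3 = s & 1u;
        for (unsigned in = 0; in < NUM_INPUTS; in++) {
            unsigned a = (in >> 1) & 1u, b = in & 1u;
            unsigned fb = a ^ b ^ s1 ^ s3;  /* feedback: 1 + D + D^3 */
            unsigned y  = fb ^ s2 ^ s3;     /* parity y: 1 + D^2 + D^3 */
            unsigned w  = fb ^ s3;          /* parity w: 1 + D^3 */
            unsigned n1 = fb, n2 = s1 ^ b, n3 = s2 ^ b;
            trellis[s][in].next_state = (uint8_t)(4u * n1 + 2u * n2 + n3);
            trellis[s][in].parity     = (uint8_t)(2u * y + w);
        }
    }
}
```

Under these assumptions, each input pair maps the eight states onto eight distinct next states, so every state has four incoming and four outgoing branches.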
Once a frame with N pairs of bits is encoded, the data output from the encoder consists of three sets of bit pairs:
• systematic bit pairs, $\{a_k, b_k\}$,
• parity bit pairs generated from the data sequence in natural order, $\{y_{1,k}, w_{1,k}\}$, and
• parity bit pairs generated from the interleaved sequence, $\{y_{2,k}, w_{2,k}\}$.
Because each set contains the same number of bits, the natural code rate of the encoder is 1/3. In other words, three bit pairs are generated by the encoder for each data-bit pair provided to it. Following encoding, each set is interleaved separately, followed by puncturing of the sets containing the parity bits using a specific puncturing pattern to achieve the target code rate, such as rate-1/2 [4].
III Decoding double-binary turbo codes
A typical turbo decoder consists of two soft-input soft-output (SISO) component decoders that are serially concatenated via an interleaver [3] , [9] , as shown in Fig. 3 . The input to the turbo decoder is assumed to be in bit log-likelihood ratio (LLR) form. Following depuncturing and sub-block de-interleaving, the bit LLRs are converted to double-binary symbol LLRs, which are defined as
for the k-th symbol $U_k$, where z belongs to φ = {01, 10, 11}. Three distinct sets of LLRs are provided to the decoder:
• systematic channel LLRs, L(X),
• parity channel LLRs for CRSC decoder 1, L(Z1), and
• parity channel LLRs for CRSC decoder 2, L(Z2).
During decoding, each CRSC decoder receives extrinsic systematic LLRs, Λe(X), obtained from the other CRSC decoder and computed as
Thus, each CRSC decoder receives systematic symbol LLRs, V (X), computed as
along with parity channel LLRs, L(Z). The output of each CRSC decoder is the a posteriori probability of each systematic symbol in LLR form, Λ(X). The decoder operates for a specified number of iterations before making hard bit decisions. Each additional iteration improves the decoder's BER performance because of its feedback nature, but with diminishing returns as more iterations are used. Typically, turbo decoders perform between four and 10 iterations. Each decoder iteration consists of two half-iterations, during which only one SISO unit is performing CRSC decoding.
The WiMAX CTC interleaver patterns are generated in a two-step process and are dependent on the frame size, modulation, and code rate [4] . This work focuses on the design and optimization of the WiMAX turbo decoder for SDR, while an efficient SDR implementation to generate the WiMAX turbo code interleaver/de-interleaver patterns and perform sub-block de-interleaving and depuncturing is left for future consideration.
III.A MAP algorithm
Hardware implementations of turbo decoders typically use a simplified version of the maximum a posteriori algorithm [10] . The log-MAP algorithm performs MAP decoding in the logarithmic domain to avoid expensive multiplication and division operations. After each half-iteration, the LLR output of each CRSC decoder for symbol X k is expressed as
where z belongs to φ, and $s_k$ and $s_{k+1}$ are the start and end states, respectively, of a particular trellis branch associated with input symbol z. A, B, and Γ are the forward, backward, and transition branch metrics, respectively. The details of the max* operator are discussed in Section III.C. The A and B metrics are defined recursively as
and
where $s_A$ is the set of states at time k − 1 connected to state $s_k$, and $s_B$ is the set of states at time k + 1 connected to state $s_k$. Γ is defined for each trellis branch connecting states $s_k$ and $s_{k+1}$ by the equation
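For orientation, the log-MAP recursions can be written in the following commonly used form. This is a sketch consistent with the definitions of $s_A$, $s_B$, and the branch states given here; the time index attached to Γ and the use of the symbol 00 as the LLR reference are assumptions, and the exact notation of the numbered equations may differ.

```latex
\begin{align*}
A_k(s_k) &= \max^*_{s_{k-1} \in s_A}
  \bigl[ A_{k-1}(s_{k-1}) + \Gamma_{k-1}(s_{k-1}, s_k) \bigr], \\
B_k(s_k) &= \max^*_{s_{k+1} \in s_B}
  \bigl[ B_{k+1}(s_{k+1}) + \Gamma_k(s_k, s_{k+1}) \bigr], \\
\Lambda(X_k = z) &= \max^*_{(s_k \to s_{k+1}):\, z}
  \bigl[ A_k(s_k) + \Gamma_k(s_k, s_{k+1}) + B_{k+1}(s_{k+1}) \bigr] \\
  &\quad - \max^*_{(s_k \to s_{k+1}):\, 00}
  \bigl[ A_k(s_k) + \Gamma_k(s_k, s_{k+1}) + B_{k+1}(s_{k+1}) \bigr].
\end{align*}
```

The forward metric at a state combines all branches entering it, the backward metric combines all branches leaving it, and the output LLR compares the best-supported branches labelled z against those labelled 00.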
III.B Tail-biting turbo codes
The WiMAX turbo code encoder relies on the tail-biting technique to terminate frames, which forces the encoder start and end states to be the same. Tail-biting is more bandwidth-efficient than the use of flush bits, which pads a frame with zeros to force the encoder start and end states to zero. Padding is not necessary with tail-biting. As a consequence of tail-biting, the exact start and end states of a frame are unknown to the decoder, but they are known to be the same.
The $A_k$ and $B_k$ metrics are defined recursively in (5) and (6); therefore $A_0$ and $B_{N-1}$ must be initialized for all states at the start of each half-iteration. The preferred method uses the metrics computed during the previous iteration and has been shown to offer BER performance comparable to other, more complex, methods [11]. Thus, $A_k$ and $B_k$ are initialized as follows:
In the above formulation, s denotes the encoder state, and $A_N(s)$ and $B_0(s)$ are the metrics from the previous iteration.
III.C The max* operator
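The max* operator is employed extensively throughout the log-MAP algorithm, and its particular implementation affects both decoder complexity and BER performance. The max* operator is defined as
$$\operatorname{max}^*_i(\lambda_i) = \ln\Bigl(\sum_i e^{\lambda_i}\Bigr), \tag{10}$$
where $\lambda_i$ is a real number. This operation with multiple arguments can be decomposed into a recursive form using a max* operator with only two arguments [12], for example
$$\operatorname{max}^*(\lambda_1, \lambda_2, \lambda_3, \lambda_4) = \operatorname{max}^*\bigl(\operatorname{max}^*(\lambda_1, \lambda_2),\ \operatorname{max}^*(\lambda_3, \lambda_4)\bigr). \tag{11}$$
Applying the Jacobian logarithm, one can express a two-input max* operator in the form [12]
$$\operatorname{max}^*(\lambda_1, \lambda_2) = \max(\lambda_1, \lambda_2) + \ln\bigl(1 + e^{-|\lambda_2 - \lambda_1|}\bigr); \tag{12}$$
that is, the max* operator achieves the equivalent of finding the maximum of the two input arguments and then adding a correction term, $\ln(1 + e^{-|\lambda_2 - \lambda_1|})$, which depends only on the magnitude of the difference between the two arguments.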
The most accurate max* operator implementation is simply referred to as log-MAP, because the correction function computes the Jacobian logarithm exactly as it is defined in (12) . In software, the correction function can be implemented by using either the log and exp function calls in C (or the equivalent in other languages), or by using a large look-up table [10] . This is the most computationally expensive max* operator to implement in software, and other alternatives exist that have reduced complexity at the expense of accuracy.
Although it is the least accurate, the max-log-MAP max* algorithm is popular for hardware implementations because it is the least complex of the max* variants [10] . This algorithm simply sets the correction function to zero, resulting in max*(λ1, λ2) ≈ max(λ1, λ2).
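A slightly more complex version is the constant-log-MAP algorithm, which uses a constant correction term [13]. It approximates the Jacobian logarithm as
$$\operatorname{max}^*(\lambda_1, \lambda_2) \approx \max(\lambda_1, \lambda_2) + \begin{cases} 0, & |\lambda_2 - \lambda_1| > T \\ a, & |\lambda_2 - \lambda_1| \le T, \end{cases} \tag{14}$$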
where the constants a = 0.5 and T = 1.5 are optimal [9] .
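A further enhancement is the more complex linear-log-MAP algorithm [9], [14]. It achieves an approximation very close to that of the log-MAP max* implementation by using a linear correction function,
$$\operatorname{max}^*(\lambda_1, \lambda_2) \approx \max(\lambda_1, \lambda_2) + \begin{cases} 0, & |\lambda_2 - \lambda_1| > T \\ a\,(|\lambda_2 - \lambda_1| - T), & |\lambda_2 - \lambda_1| \le T. \end{cases} \tag{15}$$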
In other research, parameters a = −0.24904 and T = 2.5068 were found to minimize the total squared error between the exact correction function and its linear approximation [9] .
III.D Effect of max* operator on BER performance
To characterize the effect that different max* operators have on decoder BER performance, simulations were performed using the Coded Modulation Library [6]. The log-MAP, linear-log-MAP, constant-log-MAP, and max-log-MAP max* variants were compared, each with eight iterations. Simulation results were obtained for decoding a rate-1/2, 36-byte frame using QPSK modulation, annotated as QPSK (36,72). For brevity, only a summary is provided here. As expected, the max-log-MAP decoder consistently performed the worst, as the correction function is set to zero. The other, more complex max* variants performed similarly, yielding results within 0.1 dB of each other for a given BER. The constant-log-MAP did perform slightly worse than the linear-log-MAP and log-MAP decoders. The max-log-MAP decoder variant performed approximately 0.25 dB worse than the others for a given BER.
A more detailed comparison using two of the max* variants was subsequently performed. More accurate and complex max* variants can achieve the same BER performance as the least complex max-log-MAP max* variant with fewer decoder iterations, which could potentially offer reduced decoding latency. The linear-log-MAP decoder was chosen for comparison with the max-log-MAP decoder because it offers comparable BER performance to that of the log-MAP decoder, while being less complex. Decoder BER simulations comparing the max-log-MAP algorithm using eight iterations and the linear-log-MAP algorithm using a varying number of iterations were performed, with results displayed in Fig. 4. The simulations revealed that decoding using the max-log-MAP algorithm with eight iterations performs worse than the linear-log-MAP algorithm using both four and six iterations. Decoding using the max-log-MAP algorithm with eight iterations and the linear-log-MAP algorithm with four iterations is comparable for large values of $E_b/N_0$.
IV Serial and parallel decoder design
This section covers critical design aspects of a turbo decoder for serial and parallel processing. The discussions refer primarily to a software decoder, but the techniques are applicable to hardware realizations as well.
IV.A Decoding using the sliding window technique
The majority of turbo decoder implementations use the sliding window technique to operate on a portion of a frame at a time so as to minimize memory requirements [15] . A large frame is divided into a number of sub-blocks, and the MAP algorithm is applied to each sub-block independently. Memory requirements are reduced because branch metrics need to be stored only for the sub-block currently being decoded, rather than for the entire frame. Decoding each sub-block is typically done by calculating either the A or the B metric for the entire sub-block, and then calculating the other metric (B or A) and the LLR values, Λ, for each trellis index. The second metric calculated (B or A) does not need to be stored for the entire sub-block; only the metrics for two trellis indexes need to be stored. Fig. 5 illustrates how decoding proceeds using the sliding window technique with the original frame divided into four sub-blocks. The details of a single sub-block are shown separately to highlight the difference in calculations for the first and second metrics. For each individual sub-block shown, the B metrics are calculated first for the entire sub-block, followed by the A and LLR calculations. In this case, the A metrics need to be stored for only the current and previous trellis index, as the A metrics from earlier indexes are not required. Because the sub-blocks are decoded serially, the horizontal time scale in Fig. 5 indicates that a total of 48 time units are required.
As in the tail-biting technique, one of the metrics at the border of each sub-block must be initialized when a sliding window is used. For the frame shown in Fig. 5 , the B metrics must be initialized for each sub-block because they are defined recursively in the reverse direction, while the A metrics can be initialized from the previous sub-block. Initialization can be accomplished either by performing dummy calculations, which involves performing calculations on the frame outside the current sub-block [16] , or by using metrics from the previous iteration. Initialization using dummy calculations increases decoder latency, while using metrics from the previous iteration requires a border memory to store the border values. Therefore a trade-off must be made when using either technique. A double-binary turbo decoder implementation has shown that the two techniques provide almost identical BER performance [17] . Because one of the goals of this work is to maximize decoder throughput, using metrics from the previous iteration is the selected technique for the decoder in this paper.
IV.B Parallel MAP decoding algorithm
Maximizing the throughput of a turbo decoder can be accomplished by exploiting the parallelism of the MAP algorithm. This is accomplished with an approach that is similar to the sliding window technique described above. Parallelism is exploited by processing each sub-block simultaneously rather than sequentially. As each CRSC decoder performs MAP decoding to generate output LLRs, the time spent during each half-iteration by a CRSC decoder constitutes the parallel execution portion of the algorithm described below. The sequential portion of the algorithm is composed of less complex tasks performed between half-iterations, including the generation of extrinsic information, interleaving and de-interleaving, and making hard bit decisions.
To maximize the flexibility of the software turbo decoder, it is desirable to support an arbitrary degree of parallelism, which translates into decoding a frame in parallel using an arbitrary number of sub-blocks. This corresponds to assigning a particular software thread to perform MAP decoding on a particular sub-block. There are a number of ways to apply the MAP algorithm to each sub-block by varying the order and timing of the A, B, and Λ calculations. A number of parallel turbo decoding structures have been proposed in previous work, and an analysis of their design trade-offs has been done [16]. The parallel turbo decoding algorithm in this paper was implemented with a design similar to the double-windowing scheme, but adapted to support an arbitrary number of sub-blocks, as described below.
When A and B calculations begin somewhere in the middle of the frame, they must be initialized as in the sliding window technique. To obtain reliable metrics at the borders of a sub-block, the decoder uses metrics from the previous iteration, which have been shown to perform well [11] . In Fig. 6 , the SB2 B metrics and SB3 A metrics must be initialized at the start of each half-iteration. At time t4 in Fig. 6 , the A and B metrics must be initialized before decoding can proceed.
These border values can be obtained from adjacent sub-blocks, where metrics have already been calculated up to the border. This approach results in threads exchanging information, requiring synchronization to ensure that the threads have completed their respective calculations up to this point. SB1 obtains B initialization values from SB2, while SB2 obtains A initialization values from SB1. In a similar manner, SB3 and SB4 also exchange metrics at time t4. The initialization and exchange of metrics between sub-blocks has been generalized to an arbitrary number of sub-blocks for the parallel software turbo decoder described in this work.
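Supporting an arbitrary number of sub-blocks requires a rule for dividing the N trellis indices among the threads. The text does not state how the sub-block sizes are chosen, so the following C sketch simply assumes contiguous sub-blocks whose sizes differ by at most one; the names `subblock_t` and `subblock_range` are hypothetical.

```c
#include <stddef.h>

/* Half-open range [start, end) of trellis indices owned by one thread. */
typedef struct {
    size_t start;
    size_t end;
} subblock_t;

/* Split n trellis indices into p contiguous sub-blocks and return the
 * range for thread t (0 <= t < p, p >= 1). The first n % p sub-blocks
 * receive one extra index.
 * ASSUMPTION: even contiguous partitioning; illustrative only. */
subblock_t subblock_range(size_t n, size_t p, size_t t)
{
    size_t base = n / p;
    size_t extra = n % p;
    subblock_t sb;
    sb.start = t * base + (t < extra ? t : extra);
    sb.end = sb.start + base + (t < extra ? 1u : 0u);
    return sb;
}
```

With this rule, the border between sub-blocks t and t + 1 is the index `subblock_range(n, p, t).end`, which is where neighbouring threads would exchange A and B metrics.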
A parallel decoder's BER performance may exhibit some degradation as the degree of parallelism is increased. This behaviour is due to more unknown border metrics requiring initial estimates as the number of sub-blocks is increased. Results for BER simulations will be presented in Section VI.B to assess this phenomenon.
V Decoder optimization and parallelization
The WiMAX double-binary turbo decoder was implemented in C, where each CRSC decoder utilizes the log-MAP algorithm. The software was adapted from the open-source Coded Modulation Library, which provides sequential encoder and decoder implementations in MATLAB with C extensions [6] . A stand-alone implementation in C was obtained. Numerous optimizations and enhancements were subsequently pursued, including a multithreaded implementation making use of parallel MAP decoding.
V.A Initial adaptation and optimizations
For the original CML WiMAX decoder, the operations performed by each CRSC decoder were implemented as C extensions, while the rest of the decoder functionality was implemented in MATLAB. The initial effort for this paper involved implementing a stand-alone decoder in C by adapting the WiMAX decoder functionality written in MATLAB outside the CRSC decoder, including interleaver/de-interleaver operations, the generation of V1(X), V2(X), and Λe2(X), and the ability to make hard bit decisions. The adapted stand-alone C implementation continued to leverage CML's wireless simulation environment; the original binary data was still generated by CML, followed by encoding, modulation, and simulated transmission over an additive white Gaussian noise (AWGN) channel. The data provided from CML to the adapted decoder software included the appropriate interleaver/de-interleaver pattern and channel log-likelihood ratio values. An input file was used for this purpose. The LLRs provided to the decoder included the systematic channel LLRs and parity channel LLRs in double-binary form. Log-MAP decoding then proceeded exactly as described in Section III.
The CML decoder software used an ideal halting condition, in which the decoder either stops as early as possible when all frame errors are corrected or is forced to stop after a maximum of ten iterations is reached. Ideal early halting requires a priori knowledge of the original data to determine whether errors exist in the decoded binary data. Hence it is not suitable for a real-world decoder implementation. The C decoder software was therefore modified to perform a fixed count of iterations that was dependent on the selected max* implementation.
In the original software implementation, various global arrays were allocated dynamically during each CRSC decoder half-iteration, including arrays for A, B, Γ, and Λ metrics, as well as other temporary storage. The size of the arrays allocated depended on the size of the frame currently being decoded, as well as the number of states and state transitions. For a particular coding scheme, such as the WiMAX turbo code, the number of states and state transitions from a particular state remained constant. To eliminate the execution-time overhead required for dynamic memory allocation, arrays were simply globally declared to accommodate the largest WiMAX turbo code frame size of 60 bytes. Global variables were allocated on program startup and continued to exist for the duration of the main program, entirely eliminating any overhead associated with dynamic memory allocation.
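The static-allocation change can be pictured with a few declarations. This is a sketch under stated assumptions: a 60-byte frame corresponds to 480 data bits, or 240 couples, but the array names, element type, and dimension ordering below are hypothetical and are not taken from the decoder source.

```c
#define MAX_FRAME_BYTES 60
#define MAX_COUPLES     (MAX_FRAME_BYTES * 8 / 2)  /* 240 data-bit pairs */
#define NUM_STATES      8
#define NUM_INPUTS      4   /* branches leaving each state */

/* Metric storage sized once for the largest frame, so that no
 * allocation is needed inside a half-iteration.
 * ASSUMPTION: layouts are illustrative only. */
static float alpha_m[MAX_COUPLES + 1][NUM_STATES];            /* A metrics */
static float beta_m[MAX_COUPLES + 1][NUM_STATES];             /* B metrics */
static float gamma_m[MAX_COUPLES][NUM_STATES][NUM_INPUTS];    /* branch metrics */
static float llr_out[MAX_COUPLES][NUM_INPUTS - 1];            /* LLRs for z in {01, 10, 11} */
```

Smaller frames simply use a prefix of each array, which trades a fixed, modest memory footprint for the removal of allocation calls from the decoding loop.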
In the original implementation, the trellis information associated with each trellis branch (including a start state, end state, systematic bit pair, and parity bit pairs) was generated in every decoder half-iteration by means of a function call with the encoder generator polynomials as inputs. However, only the trellis end state and parity bit pair were stored in arrays. The trellis start state and systematic bit pair were generated on demand, in particular when A, B, Γ, and bit LLR calculations were performed. The trellis information for a particular code did not change from iteration to iteration, as it is characteristic of the code constraint length and encoder generator polynomials. Therefore, the generation of the WiMAX turbo code information was changed so that the information was generated and stored in an array only on program startup, rather than every half-iteration.
The original decoder ran for some number of cycles within each half-iteration when performing A, B, and Γ calculations. For the first cycle, border A and B metrics were initialized to zero, while for the second and subsequent cycles, the border metrics were set to the values of the previous cycle, as in tail-biting. In every cycle, a comparison was made between the new and previous metrics, and if the difference between any pair exceeded some threshold, another cycle was executed. As a result, at least two cycles were executed every half-iteration, with the total number of cycles executed varying in each half-iteration, up to a maximum of four cycles. Modifications were made to the decoder to avoid redundant calculations and to eliminate the check between new and previous metrics, so that effectively only one cycle was performed in each half-iteration. Also, the initialization of A and B border metrics was modified to use the metrics of the previous iteration, properly implementing the tail-biting initialization scheme. The calculation of Γ values was modified so that they were generated only once per half-iteration, rather than twice.
The decoder software profile statistics suggested that optimizing the max* function would increase performance significantly, as this function contributes the most to execution time. A flexible software decoder can call the desired max* variant from an array of function pointers, using an argument supplied at run-time. If a particular max* operator is chosen prior to run-time, its corresponding function can be called directly rather than through a function pointer. Each time the max* operator is used, there is run-time overhead associated with the array look-up, as well as the pointer indirection to resolve the actual location of the function in memory. The software was changed to permit compile-time selection of the max* operator, which allows the corresponding function to be called directly during execution, eliminating this extra overhead.
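The difference between the two selection mechanisms can be illustrated with a minimal C sketch. All names here (`maxstar_table`, `maxstar_runtime`, `maxstar_static`, and the `MAXSTAR` macro) are hypothetical and are not taken from the decoder source; only two variants are shown to keep the example short.

```c
typedef float (*maxstar_fn)(float, float);

static float maxstar_maxlog(float x, float y)
{
    return x > y ? x : y;
}

static float maxstar_const(float x, float y)
{
    float d = x > y ? x - y : y - x;   /* |x - y| */
    float m = x > y ? x : y;
    return d <= 1.5f ? m + 0.5f : m;
}

/* Run-time selection: every call pays for an array look-up and an
 * indirect call through the function pointer. */
static const maxstar_fn maxstar_table[] = { maxstar_maxlog, maxstar_const };

float maxstar_runtime(int variant, float x, float y)
{
    return maxstar_table[variant](x, y);
}

/* Compile-time selection: the variant is fixed when the file is built,
 * e.g. with -DMAXSTAR=maxstar_const, so the call is direct and the
 * compiler is free to inline it. */
#ifndef MAXSTAR
#define MAXSTAR maxstar_maxlog
#endif

float maxstar_static(float x, float y)
{
    return MAXSTAR(x, y);
}
```

Both paths compute the same values; the compile-time form gives up the ability to switch variants at run time in exchange for removing per-call indirection from the most frequently executed function.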
Other less significant optimizations that contributed collective improvements were elimination of a posteriori parity LLR calculations, reductions in the frequency of hard bit decision LLR calculations, and removal of unused historical values for border A and B metrics.
A final optimization that can be considered for the turbo decoding software is to use integer arithmetic instead of floating-point arithmetic for a simpler, faster single-chip implementation. Although the authors have experimentally obtained reductions in real execution times for the decoder software with integer instructions on actual processor hardware, the simulation results in Section VI are reported with floating-point instructions because the difference in simulated cycle count is not significant with instructions assumed to take one cycle.
V.B Special instruction for trellis look-up
During MAP decoding, a particular code's trellis information is used extensively for the calculation of A, B, Γ, and Λ. In the original software, trellis information was generated on program startup, requiring array look-ups during decoding to obtain the required information. For example, the calculation of $A_k(s_k)$ for a particular state $s_k$ at trellis index k requires eight trellis information array look-ups. Thus, A calculations require 8 × 8 = 64 array look-ups for each trellis index, and a total of 64N in each half-iteration for a frame with N pairs of bits. Similarly, 64N trellis array look-ups are required for the calculation of each of Γ and B, while Λ requires 96N, in each half-iteration. A total of 2(64N + 64N + 64N + 96N)I = 576NI trellis array look-ups are required to decode a frame of size N using I decoder iterations.
Providing a method to speed up the acquisition of trellis information would improve decoder performance, especially given the extensive use of trellis information. A new instruction can be added to any reduced-instruction-set-computer (RISC)-style instruction set architecture to allow trellis information to be obtained in a single cycle. In hardware, a fixed look-up table or special memory can store the trellis information, which can then be accessed by this special instruction. For the WiMAX turbo code, because there are 32 branches associated with each trellis stage and four pieces of information associated with each branch, 128 values need to be stored. The values are all integers in the range 0-7, so 3 bits would be needed for each value, resulting in a total storage requirement of only 48 bytes.
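The 48-byte figure follows from 128 values × 3 bits = 384 bits. A software model of such a packed table is sketched below; the names `table_set` and `table_get` and the little-endian bit ordering are assumptions made for illustration, and a hardware look-up table would not need the bit manipulation.

```c
#include <stdint.h>

#define TRELLIS_VALUES 128
#define BITS_PER_VALUE 3
#define TABLE_BYTES    ((TRELLIS_VALUES * BITS_PER_VALUE + 7) / 8)  /* 48 */

static uint8_t table[TABLE_BYTES];

/* Store a 3-bit value (0-7) at position idx, one bit at a time so that
 * values straddling a byte boundary are handled uniformly. */
void table_set(unsigned idx, unsigned value)
{
    unsigned bit = idx * BITS_PER_VALUE;
    for (unsigned i = 0; i < BITS_PER_VALUE; i++, bit++) {
        uint8_t mask = (uint8_t)(1u << (bit % 8));
        if ((value >> i) & 1u)
            table[bit / 8] |= mask;
        else
            table[bit / 8] &= (uint8_t)~mask;
    }
}

/* Read back the 3-bit value stored at position idx. */
unsigned table_get(unsigned idx)
{
    unsigned bit = idx * BITS_PER_VALUE;
    unsigned value = 0;
    for (unsigned i = 0; i < BITS_PER_VALUE; i++, bit++)
        value |= ((unsigned)(table[bit / 8] >> (bit % 8)) & 1u) << i;
    return value;
}
```

In software the unpacking cost would outweigh the memory saved, which is consistent with the text's proposal to hold the table in dedicated hardware behind a single-cycle instruction.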
V.C Accelerating the max* calculation
The max* operation is used extensively throughout the MAP algorithm, as outlined in Section III.A. The calculation of all A metrics for each frame index requires 32 max* operations per half-iteration. For a frame with N pairs of bits requiring I decoder iterations (2I half-iterations), this calculation results in a total of 2 × I × 32N = 64NI max* operations. Similarly, 64NI max* operations are used in the calculation of each of B and Λ, along with 4N calls made during the calculation of bit LLRs in the last decoder iteration, giving a total of (192I + 4)N max* operations to decode a frame. As a result, there is an opportunity to increase decoder throughput by reducing the number of cycles required to perform all of these max* operations.
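To give a sense of scale, the operation counts from Sections V.B and V.C can be evaluated for a concrete case. The C helpers below are a sketch with hypothetical names; the example values N = 72 and I = 8 are chosen only for illustration.

```c
/* Trellis array look-ups per frame:
 * 2(64N + 64N + 64N + 96N)I = 576NI. */
unsigned long trellis_lookups(unsigned long n, unsigned long iters)
{
    return 2UL * (64UL + 64UL + 64UL + 96UL) * n * iters;
}

/* max* operations per frame: 64NI each for A, B, and Lambda, plus 4N
 * for the bit LLRs in the last iteration, i.e. (192I + 4)N. */
unsigned long maxstar_ops(unsigned long n, unsigned long iters)
{
    return (192UL * iters + 4UL) * n;
}
```

For N = 72 and I = 8, these give 331,776 trellis look-ups and 110,880 max* operations per frame, so even a saving of a few cycles per operation accumulates quickly.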
To accelerate the max* calculation, a special instruction can be introduced to eliminate the use of more general C code to implement the required functionality. Analysis of the optimized assembly code for a typical RISC architecture produced by a compiler indicated that each max-log-MAP max* function call requires six cycles to execute, with call/return statements. Because the max-log-MAP algorithm's max* operation performs a single floating-point maximum operation, common among other architectures, it was assumed that the operation could be performed in a single cycle by one inlined instruction.
As the linear-log-MAP algorithm offers a good trade-off between BER performance and complexity, it was desirable to investigate what performance benefit could be achieved by providing a special instruction to implement it as well. Referring to (15), the linear-log-MAP computation requires a single multiplication and two additions/subtractions. Computations may be performed speculatively in parallel, and after comparisons are performed, the required intermediate values can be used to calculate the final result. A special instruction can be introduced into a typical RISC instruction set for the linear-log-MAP max* operation. It may be conservatively assumed to require a constant three cycles for execution. This enhancement is a significant reduction from the requirement of up to 20 cycles for the original C function implementation.
V.D Thread creation and synchronization
Section IV.B explained the parallelization of the sliding window algorithm to enhance performance by allowing multiple sub-blocks to be decoded simultaneously with an exchange of metrics for pairs of sub-blocks. Consequently, the implementation of the parallel sliding window algorithm requires appropriate thread creation and synchronization for multiprocessor execution. Because the algorithm is performed in the same way over multiple iterations, all of the necessary software threads can be created at the beginning, and they can remain active throughout the execution. For any serial phases of activity, one thread can be responsible for that computation, and the other threads can busy-wait on a barrier. When pairs of threads exchange metrics during the parallel phases, collective synchronization with barriers can be used again for the most straightforward implementation because the parallel computation is distributed evenly across the threads.
VI Simulated execution performance results
Execution performance of the optimized decoder software was evaluated with a multiprocessor version of the SimpleScalar simulator based on a functional model of single-issue processors executing one instruction per cycle [7] . The decoder software was compiled using the gcc compiler version 2.6.3, with optimization level -O2, into the MIPS-derived portable instruction set architecture (PISA) for the SimpleScalar simulator. The special instructions described in Sections V.B and V.C were implemented in SimpleScalar in order to assess their benefit. This section summarizes the results of simulations to evaluate the serial optimizations and parallelization, and it also considers cache effects on performance based on additional results from multiprocessor cache simulations in SimpleScalar.
VI.A Speedup from serial optimizations
The benefit of the serial optimizations described in Section V.A was evaluated with functional simulations of a single thread executing versions of the original and enhanced decoder software. Fig. 7 summarizes the cumulative reductions in simulated execution time from progressive inclusion of the various optimizations. The results are for decoding of rate-1/2, 18-byte frames with QPSK modulation using the max-log-MAP decoder with eight iterations and the linear-log-MAP decoder with four iterations. These two implementations have comparable BER performance for large SNR, as described in Section III.D.
With all software-only optimizations combined, simulated execution performance with idealized single-cycle instruction processing improved by 223% and 248% for the max-log-MAP and linear-log-MAP decoders respectively, relative to the original software implementation. A significant source of improvement was the reduction in half-iteration complexity, as is evident in Fig. 7. These performance benefits are essential for establishing an appropriate point of reference for SDR consideration. From this new baseline, inclusion of the special instruction for trellis look-up improved performance by a further 11% for both implementations, as shown in Fig. 7. The max-log-MAP and linear-log-MAP max* special instructions resulted in further relative reductions of 17% and 28% in serial execution time, respectively. The combined improvement resulting from both trellis and max* special instructions was 27% for the max-log-MAP decoder and 39% for the linear-log-MAP decoder.
VI.B Speedup from parallel execution
With all of the serial optimizations included, the decoder software was parallelized in the manner described in Sections IV.B and V.D. For rate-1/2, 18-byte frames with QPSK modulation, Fig. 8 shows the execution times from functional multiprocessor simulation as more threads are used for the max-log-MAP and linear-log-MAP decoders. To highlight the impact of special instructions for trellis look-up and max* operations, simulated execution times are shown with and without their use. Once again, max-log-MAP uses eight iterations, and the more accurate linear-log-MAP uses four iterations for comparable BER performance, as discussed in Section III.D. The difference in the number of iterations consequently allows linear-log-MAP to have a throughput that is 90% higher. In all cases, the simulated execution times decrease monotonically from 1 to 12 threads, confirming the effectiveness of exploiting the available parallelism in the algorithm. The speedup with 12 threads is as high as 10.8, and the parallel efficiency with 12 threads is as high as 90%.
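The speedup and parallel efficiency figures quoted above follow the usual definitions, which can be written as a short C sketch. The function names are hypothetical, and the cycle counts passed in are assumed to come from the functional simulations.

```c
#include <assert.h>

/* Speedup: serial execution time divided by parallel execution time. */
double speedup(double serial_cycles, double parallel_cycles)
{
    assert(parallel_cycles > 0.0);
    return serial_cycles / parallel_cycles;
}

/* Parallel efficiency: speedup divided by the number of threads. */
double parallel_efficiency(double speedup_value, int nthreads)
{
    assert(nthreads > 0);
    return speedup_value / nthreads;
}
```

With the reported speedup of 10.8 on 12 threads, parallel_efficiency(10.8, 12) evaluates to approximately 0.90, which matches the stated 90%.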
BER simulations were also performed to determine to what extent BER degradation occurs as decoder parallelism is increased. This degradation is an expected characteristic explained in Section IV.B. Fig. 9 illustrates BER performance for parallel decoding of an 18-byte frame with QPSK modulation using the max-log-MAP max* variant with eight iterations. The BER degradation as parallelism is increased is not significant, with the maximum degradation using 12 sub-blocks being less than 0.1 dB for a particular BER.
VI.C Data cache effects on parallel execution time
Consideration of data cache effects on execution performance relied on a multiprocessor cache simulator implemented in SimpleScalar [7] , where functional simulation is augmented with modelling of data cache tags and collection of statistics on data cache misses and actions related to maintaining multiprocessor data cache coherence. These actions depend on cache size and the number of threads, as well as the nature of the data sharing. For the parallel implementation of the sliding window algorithm, there are two primary forms of data sharing. Some data is shared between one thread and all other threads (e.g., V (X) values generated by (3) in serial phases), and other data is shared between pairs of threads (e.g., exchange of A and B metrics for adjacent sub-blocks in parallel phases). In either case, writes to shared data by any thread cause copies of that data in other caches to be invalidated. Cache misses stem from rereading invalidated data as well as from initial data accesses. Fig. 10 shows the simulated data cache miss rates for different data cache sizes and different numbers of threads when decoding 18-byte frames with max-log-MAP. The results are averages weighted by proportion of memory accesses for a given number of threads. Larger frames have slightly higher miss rates, but the results are otherwise similar. For 4, 8, or 12 threads, a modest data cache of 8 kbytes is sufficient to achieve a data cache miss rate that is well below 5%.
Cache misses introduce additional latency for parallel execution time over the ideal case. Results from multiprocessor cache simulation can therefore be combined with best-case and worst-case bus/memory delays to predict bounds on expected parallel execution time for N processors in the presence of data caches. For a contention-free memory access latency of M cycles, the best case for each cache miss is M cycles. The worst case assumes that for each cache miss, the other N − 1 processors are always contending for the bus with higher priority; hence the latency is (N − 1)M + M = NM cycles. The average expected latency is between these two extremes. Fig. 11 uses the results of multiprocessor simulations with 8-kbyte data caches to predict best-case and worst-case execution times for a relatively long on-chip memory latency of M = 20 cycles. Fig. 11 also shows results from an enhancement of the aforementioned SimpleScalar multiprocessor cache simulator that was done independently [18] , where bus/memory delays are modelled in addition to the collection of statistics for cache coherence. For N ≥ 4 processors, the more detailed multiprocessor simulations using 8-kbyte data caches and a memory latency of M = 20 cycles predict execution times that are much closer to the best-case bound because of the low data cache miss rates that reduce the occurrence of bus contention.
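The bounding argument above can be expressed as a small C sketch. It assumes that the miss count is the number of data cache misses incurred on the critical path of the parallel execution and that miss latencies are not overlapped with computation. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t best_case;   /* every miss costs M cycles */
    uint64_t worst_case;  /* every miss costs N*M cycles */
} cycle_bounds_t;

/* Adds best-case and worst-case miss penalties to the ideal cycle count. */
cycle_bounds_t execution_time_bounds(uint64_t ideal_cycles,
                                     uint64_t cache_misses,
                                     unsigned n_processors,
                                     unsigned mem_latency)
{
    cycle_bounds_t b;

    assert(n_processors >= 1);
    b.best_case = ideal_cycles + cache_misses * mem_latency;
    b.worst_case = ideal_cycles
                 + cache_misses * (uint64_t)n_processors * mem_latency;
    return b;
}
```

As an arbitrary example that is not taken from the simulations, 1,000,000 ideal cycles with 1,000 misses on N = 12 processors and M = 20 gives bounds of 1,020,000 and 1,240,000 cycles.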
VII Conclusion
This paper has described the adaptation, optimization, and parallelization of software for WiMAX-capable turbo decoding based on the MAP algorithm. The software has been adapted from the open-source Coded Modulation Library, and execution performance following optimization and parallelization has been assessed through multiprocessor simulation using SimpleScalar. Serial optimizations have been shown to improve performance by more than 200% over the original adapted software. From this new baseline for performance, using special instructions to accelerate important functions additionally improved performance by up to 40%. Finally, parallel execution of the optimized decoder software provided a speedup of up to 10.8 on 12 processors, with low data cache miss rates that limited the impact on parallel execution times. The combination of modest application-specific enhancements to general-purpose RISC processors using special instructions and the high parallel execution efficiency for the turbo decoder software adaptation described in this paper offers considerable promise for pursuing software-defined radio based on single-chip multiprocessor architectures.
