Abstract: Shuffled versions of iterative decoding of low-density parity-check codes and turbo codes are presented. The proposed schemes have about the same computational complexity as the standard versions, and converge faster. Simulations show that the new schedules offer better performance/complexity tradeoffs, especially when the maximum number of iterations has to remain small.
I. INTRODUCTION

Iterative decoding based on belief propagation (BP) [1] has received significant attention recently, mostly due to its near-Shannon-limit error performance for the decoding of low-density parity-check (LDPC) codes [2] and turbo codes [3]. Like the maximum a posteriori (MAP) probability decoding scheme [4], it is a symbol-by-symbol soft-in/soft-out decoding algorithm. It processes the received symbols recursively to improve the reliability of each symbol based on constraints that specify the code. In the first iteration, the decoder only uses the channel output as input, and generates soft output for each symbol. Subsequently, the output reliability measures of the decoded symbols at the end of each decoding iteration are used as inputs for the next iteration. The decoding-iteration process continues until a certain stopping condition is satisfied. Then hard decisions are made, based on the output reliability measures of decoded symbols from the last decoding iteration. The aim of this letter is to develop shuffled versions of the standard iterative decoding algorithms for both LDPC and turbo codes.
A similar approach for low-latency decoding of turbo product codes was proposed in [5]. In [6] and [7], a horizontal partitioning of the parity-check matrix was proposed to serialize the decoding of LDPC codes, and in the process, convergence was also accelerated. In this letter, we consider a vertical partitioning of the parity-check matrix to speed up the decoding. Both approaches are introduced in [8]. It is interesting to note that the vertical and horizontal schedulings start from two different angles and eventually achieve similar gains. The vertical scheduling proposed in this letter is an algorithmic approach intended to speed up BP decoding at no cost in complexity. For very-large-scale integration (VLSI) considerations, groups are introduced to preserve some of the parallelism of BP decoding. The horizontal scheduling of [6] and [7] is a hardware approach intended to serialize the totally parallel BP decoding. In the serialization, new updates become available within the same iteration, and a speedup is also achieved by using them.
II. ITERATIVE DECODING OF LDPC CODES
A regular binary $(N, K)(d_v, d_c)$ LDPC code of length $N$ and dimension $K$ has a parity-check matrix $H = [H_{mn}]$, with $d_v$ ones in each column and $d_c$ ones in each row. We denote the set of bits that participate in check $m$ by $\mathcal{N}(m) = \{n : H_{mn} = 1\}$, and the set of checks in which bit $n$ participates as $\mathcal{M}(n) = \{m : H_{mn} = 1\}$. Assume a codeword $w = (w_1, w_2, \ldots, w_N)$ is transmitted over an additive white Gaussian noise (AWGN) channel with zero mean and variance $N_0/2$ using binary phase-shift keying (BPSK) signaling, and let $y = (y_1, y_2, \ldots, y_N)$ be the corresponding received sequence.
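As a small illustration of the two index sets defined above, the following sketch builds them from a parity-check matrix with NumPy. The matrix `H` and the helper name `neighbor_sets` are hypothetical and chosen only for illustration (the toy matrix is not regular and is not taken from this letter); indices are 0-based.

```python
import numpy as np

# Hypothetical toy parity-check matrix (3 checks x 6 bits), for illustration only.
H = np.array([
    [1, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 1],
])

def neighbor_sets(H):
    """Return (N_of_m, M_of_n) as lists of 0-based index lists.

    N_of_m[m] lists the bits that participate in check m.
    M_of_n[n] lists the checks in which bit n participates.
    """
    num_checks, num_bits = H.shape
    N_of_m = [np.flatnonzero(H[m]).tolist() for m in range(num_checks)]
    M_of_n = [np.flatnonzero(H[:, n]).tolist() for n in range(num_bits)]
    return N_of_m, M_of_n

N_of_m, M_of_n = neighbor_sets(H)
print("bits per check:", N_of_m)
print("checks per bit:", M_of_n)
```

Storing the sets as adjacency lists keeps the later message updates proportional to the number of ones in the matrix rather than to its full size.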
A. Standard BP for Iterative Decoding of LDPC Codes
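Based on [1], let $F_n$ be the log-likelihood ratio (LLR) of bit $n$, initially set to $F_n = \frac{4}{N_0} y_n$. Let $\varepsilon_{mn}^{(i)}$ and $z_{mn}^{(i)}$ be the LLRs of bit $n$ sent at iteration $i$ from check node $m$ to bit node $n$, and from bit node $n$ to check node $m$, respectively. Let $z_n^{(i)}$ denote the a posteriori LLR of bit $n$. The standard BP algorithm [1] is carried out as follows.

Initialization: Set $i = 1$ and the maximum number of iterations to $I_{\max}$. For each $m$, $n$, set $z_{mn}^{(0)} = F_n$.

Step 1: a) Horizontal step: for $1 \le n \le N$ and each $m \in \mathcal{M}(n)$, process

$$\tau_{mn}^{(i)} = \prod_{n' \in \mathcal{N}(m) \setminus n} \tanh\left(\frac{z_{mn'}^{(i-1)}}{2}\right) \tag{1}$$

$$\varepsilon_{mn}^{(i)} = \log\frac{1 + \tau_{mn}^{(i)}}{1 - \tau_{mn}^{(i)}} \tag{2}$$

b) Vertical step: for $1 \le n \le N$ and each $m \in \mathcal{M}(n)$, process

$$z_{mn}^{(i)} = F_n + \sum_{m' \in \mathcal{M}(n) \setminus m} \varepsilon_{m'n}^{(i)} \tag{3}$$

$$z_n^{(i)} = F_n + \sum_{m \in \mathcal{M}(n)} \varepsilon_{mn}^{(i)} \tag{4}$$

Step 2: Hard decision and stopping criterion test: a) form the hard-decision vector $\hat{w}^{(i)}$ from the signs of the a posteriori LLRs $z_n^{(i)}$; b) if $H \hat{w}^{(i)} = 0$ or the maximum iteration number $I_{\max}$ is reached, stop the decoding iteration and go to Step 3. Otherwise, set $i := i + 1$ and go to Step 1.

Step 3: Output $\hat{w}^{(i)}$ as the decoded codeword.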
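The steps above can be sketched in a few lines of Python. This is a minimal, unoptimized illustration and not the authors' implementation. The function name `standard_bp`, the toy matrix, and the LLR sign convention (a positive LLR favors bit 0) are assumptions made for this example, and the convention may differ from the one used in this letter.

```python
import numpy as np

def standard_bp(H, F, max_iter=20):
    """Flooding-schedule BP on parity-check matrix H with channel LLRs F.

    Assumed convention: F[n] = log P(bit n = 0) / P(bit n = 1), so a
    negative a posteriori LLR is decided as bit 1.
    Returns (hard_decision, iterations_used).
    """
    num_checks, num_bits = H.shape
    N_of_m = [np.flatnonzero(H[m]).tolist() for m in range(num_checks)]
    M_of_n = [np.flatnonzero(H[:, n]).tolist() for n in range(num_bits)]
    # Bit-to-check messages z[(m, n)], initialized with the channel LLRs.
    z = {(m, n): float(F[n]) for n in range(num_bits) for m in M_of_n[n]}
    # Check-to-bit messages eps[(m, n)].
    eps = dict.fromkeys(z, 0.0)
    post = np.array(F, dtype=float)
    hard = (post < 0).astype(int)
    for it in range(1, max_iter + 1):
        # Horizontal step: every check-to-bit message uses only z values
        # from the previous iteration.
        for (m, n) in eps:
            tau = np.prod([np.tanh(z[(m, k)] / 2) for k in N_of_m[m] if k != n])
            tau = np.clip(tau, -0.999999, 0.999999)  # keep the log finite
            eps[(m, n)] = np.log((1 + tau) / (1 - tau))
        # Vertical step: a posteriori LLRs and new bit-to-check messages.
        for n in range(num_bits):
            post[n] = F[n] + sum(eps[(m, n)] for m in M_of_n[n])
            for m in M_of_n[n]:
                # Same as F[n] plus all incoming messages except the one from m.
                z[(m, n)] = post[n] - eps[(m, n)]
        # Hard decision and stopping test on the syndrome.
        hard = (post < 0).astype(int)
        if not np.any((H @ hard) % 2):
            return hard, it
    return hard, max_iter

# Hypothetical (2,3)-regular toy code of length 6, for illustration only.
H = np.array([
    [1, 1, 1, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 1, 0, 1, 0, 1],
    [0, 0, 1, 0, 1, 1],
])
# All-zero codeword sent; the last bit is received weakly in error.
F = np.array([2.0, 2.0, 2.0, 2.0, 2.0, -0.5])
decoded, iters = standard_bp(H, F)
print(decoded.tolist(), iters)
```

Subtracting the incoming message from the a posteriori LLR is a common shortcut for the sum that excludes one check; it gives the same bit-to-check value without recomputing the sum for every edge.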
B. Shuffled BP for Iterative Decoding of LDPC Codes
At the $i$th iteration of the standard BP algorithm, first all values of the check-to-bit messages $\varepsilon_{mn}^{(i)}$ are updated by using the values of the bit-to-check messages obtained at the $(i-1)$th iteration, i.e., each $\varepsilon_{mn}^{(i)}$ is updated by using the values $z_{mn'}^{(i-1)}$. Then, all values of the bit-to-check messages are updated by using the values of the check-to-bit messages newly obtained at the $i$th iteration, i.e., each $z_{mn}^{(i)}$ is updated from the values $\varepsilon_{m'n}^{(i)}$. In general, for both the check-to-bit messages and bit-to-check messages, the more independent information is used to update the messages, the more reliable they become. Iteration $i$ of the standard two-step implementation of the BP algorithm uses all values $z_{mn'}^{(i-1)}$ computed at the previous iteration in (1). However, certain values $z_{mn'}^{(i)}$ could already be computed in (3) based on a partial computation of the values $\varepsilon_{mn'}^{(i)}$ obtained from (2), and then be used instead of $z_{mn'}^{(i-1)}$ in (1) to compute the remaining values $\varepsilon_{mn}^{(i)}$. This suggests a shuffling of the horizontal and vertical steps of the standard BP decoding. Hence, we refer to this new version as shuffled BP decoding. Note that the updating procedure in shuffled BP is bit-based.
In the shuffled BP algorithm, the initialization, stopping criterion test, and output steps remain the same as in the standard BP algorithm. The only difference between the two algorithms lies in the updating procedure.
Step 1 of the shuffled BP algorithm is modified as follows: for $1 \le n \le N$ and each $m \in \mathcal{M}(n)$, process the horizontal step and vertical step jointly, with (1) modified as

$$\tau_{mn}^{(i)} = \prod_{\substack{n' \in \mathcal{N}(m) \setminus n \\ n' < n}} \tanh\left(\frac{z_{mn'}^{(i)}}{2}\right) \prod_{\substack{n' \in \mathcal{N}(m) \setminus n \\ n' > n}} \tanh\left(\frac{z_{mn'}^{(i-1)}}{2}\right) \tag{5}$$

We observe, however, that while one iteration of the standard BP algorithm can be fully processed in parallel, that of the shuffled BP algorithm becomes totally serial. To decrease decoding delay and preserve the parallelism advantages of the standard BP algorithm, a parallel shuffled decoding scheme named "group shuffled BP" is developed next. In the group shuffled BP algorithm, the bits of the codeword are divided into a number of groups. Within each group, the updating of messages is processed in parallel, but the groups are processed sequentially. More precisely, assume the $N$ bits of a codeword are divided into $G$ groups, and each group contains $N_G = N/G$ bits (assuming, for simplicity, that $G$ divides $N$).
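Step 1 of the group shuffled BP algorithm is carried out as follows.

Step 1) For $0 \le g \le G - 1$, process jointly the following two steps.

a) Horizontal step: for $g N_G + 1 \le n \le (g+1) N_G$ and each $m \in \mathcal{M}(n)$, process

$$\tau_{mn}^{(i)} = \prod_{\substack{n' \in \mathcal{N}(m) \setminus n \\ n' \le g N_G}} \tanh\left(\frac{z_{mn'}^{(i)}}{2}\right) \prod_{\substack{n' \in \mathcal{N}(m) \setminus n \\ n' \ge g N_G + 1}} \tanh\left(\frac{z_{mn'}^{(i-1)}}{2}\right) \tag{6}$$

$$\varepsilon_{mn}^{(i)} = \log\frac{1 + \tau_{mn}^{(i)}}{1 - \tau_{mn}^{(i)}} \tag{7}$$

b) Vertical step: for $g N_G + 1 \le n \le (g+1) N_G$ and each $m \in \mathcal{M}(n)$, process

$$z_{mn}^{(i)} = F_n + \sum_{m' \in \mathcal{M}(n) \setminus m} \varepsilon_{m'n}^{(i)} \tag{8}$$

$$z_n^{(i)} = F_n + \sum_{m \in \mathcal{M}(n)} \varepsilon_{mn}^{(i)} \tag{9}$$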
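The group-wise schedule can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `group_shuffled_bp`, the toy matrix, and the LLR sign convention (a positive LLR favors bit 0) are all hypothetical. Groups are formed from consecutive bit indices (0-based here), and the number of groups is assumed to divide the code length.

```python
import numpy as np

def group_shuffled_bp(H, F, num_groups, max_iter=20):
    """Group shuffled BP: groups are processed one after another, and
    the bits inside a group are updated from the same message snapshot.

    Assumed convention: F[n] = log P(bit n = 0) / P(bit n = 1).
    Returns (hard_decision, iterations_used).
    """
    num_checks, num_bits = H.shape
    assert num_bits % num_groups == 0, "number of groups must divide the length"
    group_size = num_bits // num_groups
    N_of_m = [np.flatnonzero(H[m]).tolist() for m in range(num_checks)]
    M_of_n = [np.flatnonzero(H[:, n]).tolist() for n in range(num_bits)]
    z = {(m, n): float(F[n]) for n in range(num_bits) for m in M_of_n[n]}
    eps = dict.fromkeys(z, 0.0)
    post = np.array(F, dtype=float)
    hard = (post < 0).astype(int)
    for it in range(1, max_iter + 1):
        for g in range(num_groups):
            group = range(g * group_size, (g + 1) * group_size)
            # a) Horizontal step for the bits of group g. Messages z from
            # earlier groups already hold this iteration's values; messages
            # from this group and later groups still hold the previous ones.
            for n in group:
                for m in M_of_n[n]:
                    tau = np.prod([np.tanh(z[(m, k)] / 2)
                                   for k in N_of_m[m] if k != n])
                    tau = np.clip(tau, -0.999999, 0.999999)
                    eps[(m, n)] = np.log((1 + tau) / (1 - tau))
            # b) Vertical step for the bits of group g.
            for n in group:
                post[n] = F[n] + sum(eps[(m, n)] for m in M_of_n[n])
                for m in M_of_n[n]:
                    z[(m, n)] = post[n] - eps[(m, n)]
        hard = (post < 0).astype(int)
        if not np.any((H @ hard) % 2):
            return hard, it
    return hard, max_iter

# Hypothetical (2,3)-regular toy code of length 6, for illustration only.
H = np.array([
    [1, 1, 1, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 1, 0, 1, 0, 1],
    [0, 0, 1, 0, 1, 1],
])
F = np.array([2.0, 2.0, 2.0, 2.0, 2.0, -0.5])  # last bit weakly in error
results = {G: group_shuffled_bp(H, F, G) for G in (1, 2, 6)}
for G, (hard, it) in results.items():
    print(G, hard.tolist(), it)
```

With a single group the inner loops reduce to one full horizontal step followed by one full vertical step, and with one bit per group each bit is updated using every message already refreshed in the current iteration. A toy code this small cannot show the convergence gain; it only checks that the three schedules agree on an easy input.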
For $G = 1$, the group shuffled BP becomes the standard BP, while the group shuffled BP with $G = N$ is the previously proposed shuffled BP. As an example, consider the code with parity-check matrix (10). The decoding process for one iteration of the group shuffled BP is illustrated in Fig. 1 with $G = 1$ (standard BP), $G = 2$, and $G = 6$ (original shuffled BP).

The shuffled BP algorithm for the decoding of LDPC codes keeps the computational advantages of the forward-backward implementations of standard BP decoding, and requires the same computational complexity [10]. Furthermore, when the Tanner graph of the LDPC code is acyclic and connected, the proposed method is optimal in the sense of MAP decoding and converges faster (or at least, no more slowly) than the standard BP algorithm [10] (the proofs follow from the fact that shuffled BP is simply a new scheduling on the same graph). It is also straightforward to generalize the shuffled approach to various suboptimum versions of BP decoding.

C. Simulation Results

Fig. 2 depicts the word-error rate (WER) of iterative decoding of a (8000,4000) (3,6) LDPC code with the group shuffled BP algorithm, for $G = 1$ (standard BP), 2, 8, 100, and 8000 (original shuffled BP). We observe that at the WER and a maximum of 20 iterations, the original shuffled BP algorithm performs about 0.2 dB better than the standard BP algorithm, and the larger the value of $G$, the better the error performance. However, for , no significant difference from is observed. Fig. 3 depicts the corresponding average number of iterations. We observe that the average number of iterations of the original shuffled BP algorithm is about half that of the standard BP algorithm, and that the same type of differences as in Fig. 2 appear with respect to the values of $G$. Both standard and shuffled BP decoding achieve the same error performance with a maximum of 2000 iterations, which indicates that the speedup is not achieved at the expense of a poorer achievable error performance.

Similar gains were also achieved when the shuffled decoding approach was applied to suboptimum versions of BP decoding.
III. ITERATIVE DECODING OF TURBO CODES
A turbo code [3] encoder comprises the concatenation of two (or more) convolutional encoders, and its decoder consists of two (or more) soft-in/soft-out convolutional decoders which feed reliability information back and forth to each other. For simplicity, we consider a turbo code that consists of two 
A. Standard Serial and Parallel Turbo Decoding
The decoding approach proposed in [3] operates in serial mode, i.e., the component decoders take turns generating the extrinsic values of the estimated information symbols, and each component decoder uses the extrinsic messages delivered by the last component decoder as the a priori values of the information symbols. The disadvantage of this scheme is high decoding delay. In the parallel turbo decoding algorithm [9] , all component decoders operate in parallel at any given time. After each iteration, each component decoder delivers extrinsic messages to other decoder(s) which use these messages as a priori values at the next iteration.
B. Shuffled Turbo Decoding
Although the parallel turbo decoding overcomes the drawback of high decoding delay of serial decoding, the extrinsic messages are not taken advantage of as soon as they are available, because the extrinsic messages are delivered to component decoders only after each iteration is completed. The aim of the shuffled turbo decoding is to use the more reliable extrinsic messages at each time. Let be the permuted sequence by the interleaver corresponding to the original information sequence , according to the mapping , for . We assume that . There is a unique corresponding reverse mapping , for and . In shuffled turbo decoding, the two component decoders operate simultaneously as in the parallel turbo decoding scheme, but the scheme of updating and delivering messages is different. We further assume that the two component decoders deliver extrinsic messages synchronously, i.e., , where and denote the times at which decoders 1 and 2 deliver the extrinsic values of the th estimated symbol of the original information sequence , and of the interleaved sequence , respectively. The shuffled turbo decoding scheme processes the backward recursion followed by the forward recursion. Let us first consider the forward recursion stage at the th iteration of component decoder 1. After time , the values of should be updated, and the values of are needed. There are two possible cases. The first case is , which means the extrinsic value of the information bit is not available yet. Then the values , which are stored in the backward-recursion stage of the current iteration, are used to update the values and . The second case is , which means the extrinsic value of the information bit has already been delivered by decoder 2. Then this newly available is used to compute the values (then stored), , and . The backward recursion in decoder 1, as well as both recursions in decoder 2, are realized based on the same principle. 
After iterations, the shuffled turbo decoding algorithm outputs as the decoded codeword, where , which is different from that in the standard serial turbo decoding [3] . The decoding processes of the standard serial, parallel, and shuffled turbo decoding are illustrated in Fig. 4 . It is straightforward to generalize the shuffled turbo decoding to multiple turbo codes which consist of more than two component codes. Based on the above descriptions, the total computational complexity of the shuffled turbo decoding for multiple turbo codes at each decoding iteration is the same as that of the parallel turbo decoding, and each of them has a decoding delay which is about of the decoding delay of serial turbo decoding, where is the number of component codes.
C. Simulation Results
We observed that shuffled turbo decoding does not present an advantage over standard decoding for turbo codes with two component codes. A possible reason is that with an increasing number of component codes, the proportion of new updated extrinsic messages taken advantage of by each component decoder also increases. It is also known that parallel decoding outperforms serial decoding for turbo codes with more than two component codes [9] . Fig. 5 depicts the bit-error performance of a turbo code with three component codes (rate-1/4) and interleaver size 16384, with standard parallel decoding and shuffled decoding. After five iterations, the shuffled turbo decoder outperforms its parallel counterpart by several tenths of a decibel.
