Abstract-Replica shuffled versions of iterative decoders of turbo codes, low-density parity-check codes and turbo product codes are presented. The proposed schemes converge faster than standard and previously proposed "shuffled" approaches. Simulations show that the new schedules offer good performance versus complexity/latency trade-offs.
I. INTRODUCTION
Iterative decoding based on belief propagation (BP) [1] has received significant attention recently, mostly due to its near-Shannon-limit error performance for the decoding of low-density parity-check (LDPC) codes [2] and turbo codes [3]. Like maximum a posteriori probability (MAP) decoding [4], it is a symbol-by-symbol soft-in/soft-out decoding algorithm. It processes the received symbols recursively to improve the reliability of each symbol based on the constraints that specify the code. In the first iteration, the decoder uses only the channel output and generates a soft output for each symbol. Subsequently, the output reliability measures of the decoded symbols at the end of each decoding iteration are used as inputs for the next iteration. The decoding iterations continue until a certain stopping condition is satisfied. Then hard decisions are made based on the output reliability measures of the decoded symbols at the last decoding iteration. The standard BP decoder of LDPC codes often needs several tens or hundreds of iterations to converge, which is undesirable because of the high decoding delay. Furthermore, LDPC codes of interest can have a large codeword length, which makes a fully parallel hardware implementation of the decoder difficult. Because of the serial nature of the BCJR algorithm, decoders of turbo codes incur a high delay as well. In [5], [6] and [7], "shuffled" methods were presented to reduce the required number of iterations for decoding LDPC and turbo codes, respectively. A speedup factor of 2 for LDPC codes and the saving of one iteration for turbo codes were reported. The aim of this paper is to introduce a "replica shuffled" scheme which further accelerates the decoding process for turbo codes, LDPC codes, turbo product codes and other iteratively decodable codes.
II. ITERATIVE DECODING OF TURBO CODE
A turbo code [3] encoder is constructed using a concatenation of two (or more) convolutional encoders, and its decoder consists of two (or more) soft-in/soft-out convolutional decoders which feed reliability information to each other. For simplicity, we consider a turbo code that consists of two rate-1/n systematic convolutional codes with encoders in feedback form. Let $u = (u_1, u_2, \ldots, u_K)$ be an information block of length $K$ and $c = (c_1, c_2, \ldots, c_K)$ be the corresponding coded sequence, where each $c_k$ is the block of coded bits output at time $k$.
Let $L^{(i)}_{em}(\hat{u}_k)$ denote the extrinsic value of the estimated information bit $\hat{u}_k$ delivered by component decoder $m$ at the $i$-th iteration [8].
A. Standard serial and parallel turbo decoding
The decoding approach proposed in [3] operates in serial mode, i.e., the component decoders take turns generating the extrinsic values of the estimated information symbols, and each component decoder uses the extrinsic messages delivered by the last component decoder as the a priori values of the information symbols. The disadvantage of this scheme is high decoding delay. In the parallel turbo decoding algorithm [9] , all component decoders operate in parallel at any given time. After each iteration, each component decoder delivers extrinsic messages to other decoder(s) which use these messages as a priori values at the next iteration.
B. Plain shuffled turbo decoding
Although parallel turbo decoding reduces the decoding delay of serial decoding by half, the extrinsic messages are not exploited as soon as they become available, because they are delivered to the component decoders only after each iteration is completed. The aim of shuffled turbo decoding is to use the most recently updated, and hence more reliable, extrinsic messages at each time. Let $\tilde{u} = (\tilde{u}_1, \tilde{u}_2, \ldots, \tilde{u}_K)$ be the sequence permuted by the interleaver corresponding to the original information sequence $u = (u_1, u_2, \ldots, u_K)$, according to the mapping $\pi(k)$.
There is a unique corresponding reverse mapping $\pi^{-}(k)$.
In shuffled turbo decoding, the two component decoders operate simultaneously as in the parallel turbo decoding scheme, but the messages are updated during each iteration based on $\pi(k)$ and $\pi^{-}(k)$ [7]. Consequently, it provides faster decoding convergence.
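The role of the interleaver mapping and its reverse mapping can be illustrated with a minimal sketch. The function names, the pseudo-random permutation, the 0-based indexing and the direction of the mapping (position k of the permuted sequence holds u[pi[k]]) are assumptions made for illustration; the source does not specify the interleaver.

```python
import random

def make_interleaver(K, seed=0):
    """Return a pseudo-random permutation pi of range(K) and its inverse.

    The permutation itself is hypothetical; any fixed interleaver would do.
    """
    rng = random.Random(seed)
    pi = list(range(K))
    rng.shuffle(pi)
    pi_inv = [0] * K
    for k, j in enumerate(pi):
        pi_inv[j] = k  # pi[k] == j  implies  pi_inv[j] == k
    return pi, pi_inv

def permute(seq, mapping):
    # Assumed convention: output position k holds seq[mapping[k]].
    return [seq[mapping[k]] for k in range(len(seq))]

u = [1, 0, 1, 1, 0, 0, 1, 0]
pi, pi_inv = make_interleaver(len(u))
u_tilde = permute(u, pi)           # interleaved sequence
u_back = permute(u_tilde, pi_inv)  # reverse mapping restores the order
```

Because both mappings are available, an extrinsic value produced by one component decoder at some position can be routed at once to the matching position of the other decoder, which is what a shuffled schedule relies on.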
C. Replica shuffled turbo decoding
In the plain shuffled turbo decoding summarized in Section II-B, we assume that all component decoders process the backward recursion followed by the forward recursion. Let us refer to the two component decoders as $\overrightarrow{D}_1$ and $\overrightarrow{D}_2$. Another natural scheme is to operate in the reverse order, i.e., all component decoders process the forward recursion followed by the backward recursion; we refer to them as $\overleftarrow{D}_1$ and $\overleftarrow{D}_2$. In terms of error performance, there is no difference between these two approaches. However, the reliabilities of the extrinsic messages concerning a given information bit delivered by these two shuffled turbo decoders are not the same. In general, the more independent information is used, the more reliable the delivered messages become. Therefore, for an extrinsic message about bit $k$ delivered by component decoder $\overrightarrow{D}_1$ or $\overrightarrow{D}_2$, the larger $k$ is, the more reliable this message is. Following a similar analysis, for an extrinsic message about bit $k$ delivered by $\overleftarrow{D}_1$ or $\overleftarrow{D}_2$, the smaller $k$ is, the more reliable this message is. It is natural to expect faster decoding convergence if these two shuffled turbo decoders operate cooperatively instead of independently. Because in this approach two sets of shuffled component decoders are used to decode the same sequence of information bits, we refer to it as replica shuffled turbo decoding. In replica shuffled turbo decoding, two plain shuffled turbo decoders (processing recursions in opposite directions)
operate simultaneously and exchange more reliable extrinsic messages. We assume that the component decoders deliver extrinsic messages synchronously, i.e., $\overrightarrow{T}$
Then in plain shuffled turbo decoding, the values $\alpha$
based on the extrinsic messages delivered at the last iteration. In replica shuffled turbo decoding, however, there are two further subcases. The first subcase is
e2 ($\hat{u}_k$) are not available yet. In this subcase, the values of $\overrightarrow{\alpha}$
, which is different from that in the standard turbo decoding [3] and plain shuffled turbo decoding. It is straightforward to generalize replica shuffled turbo decoding to multiple turbo codes, which consist of more than two component codes. Also, groups of bits can be updated only periodically to reduce the information exchange between replicas. With two replicas as described above, the total computational complexity of replica shuffled turbo decoding for multiple turbo codes at each decoding iteration is about twice that of parallel turbo decoding.
D. Simulation results
Fig. 1 depicts the bit error performance of a turbo code with two component codes (rate-1/3) and interleaver size 16384, with standard parallel decoding, plain shuffled decoding and replica shuffled decoding. We observe that, to obtain the same error performance, replica shuffled turbo decoding requires about half the number of iterations of standard parallel turbo decoding. Hence latency is greatly reduced (by half if information exchanges are ignored).
III. ITERATIVE DECODING OF LDPC CODES
Let $H = [H_{mn}]$ be the parity check matrix which defines an LDPC code. We denote the set of bits that participate in check $m$ by $\mathcal{N}(m) = \{n : H_{mn} = 1\}$ and the set of checks in which bit $n$ participates by $\mathcal{M}(n) = \{m : H_{mn} = 1\}$. Assume a codeword $w = (w_1, w_2, \ldots, w_N)$ is transmitted using BPSK signaling over an AWGN channel whose noise has zero mean and variance $N_0/2$, and let $y = (y_1, y_2, \ldots, y_N)$ be the corresponding received sequence.
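As a small illustration of the two index sets defined above, the following sketch builds them from a parity check matrix given as a list of rows. The function name, the 0-based indexing and the example matrix (a parity check matrix of the (7, 4) Hamming code) are illustrative assumptions and are not taken from the paper.

```python
def neighbor_sets(H):
    """Return (N_of, M_of) for a binary parity check matrix H.

    N_of[m] lists the bits taking part in check m,
    M_of[n] lists the checks in which bit n takes part.
    """
    M, N = len(H), len(H[0])
    N_of = [[n for n in range(N) if H[m][n] == 1] for m in range(M)]
    M_of = [[m for m in range(M) if H[m][n] == 1] for n in range(N)]
    return N_of, M_of

# Hypothetical example: a parity check matrix of the (7, 4) Hamming code.
H = [
    [1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1],
]
N_of, M_of = neighbor_sets(H)
```

For a sparse LDPC matrix these lists are short, which is why message passing along them is much cheaper than working with the full matrix.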
A. Standard BP for iterative decoding of LDPC codes
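Based on [1], let $F_n$ be the log-likelihood ratio (LLR) of bit $n$ derived from the channel output $y_n$, and initially set $F_n = \frac{4}{N_0} y_n$. Let $\varepsilon^{(i)}_{mn}$ and $z^{(i)}_{mn}$ be the LLRs of bit $n$ at the $i$-th iteration which are sent from check node $m$ to bit node $n$ and from bit node $n$ to check node $m$, respectively. Let $z^{(i)}_n$ denote the a posteriori LLR of bit $n$. The standard BP algorithm [1] is carried out as follows:
Initialization: Set $i = 1$ and $z^{(0)}_{mn} = F_n$ for each $m$, $n$.
Step 1 (i) Horizontal Step, for $1 \le n \le N$ and each $m \in \mathcal{M}(n)$, process:
$$\varepsilon^{(i)}_{mn} = 2 \tanh^{-1}\Big( \prod_{n' \in \mathcal{N}(m) \setminus n} \tanh\big( z^{(i-1)}_{mn'} / 2 \big) \Big)$$
(ii) Vertical Step, for $1 \le n \le N$ and each $m \in \mathcal{M}(n)$, process:
$$z^{(i)}_{mn} = F_n + \sum_{m' \in \mathcal{M}(n) \setminus m} \varepsilon^{(i)}_{m'n}, \qquad z^{(i)}_n = F_n + \sum_{m \in \mathcal{M}(n)} \varepsilon^{(i)}_{mn}$$
Step 2 Hard decision and stopping criterion test: Form the hard decision $\hat{w}^{(i)}$ from the signs of the a posteriori LLRs $z^{(i)}_n$. If $H\hat{w}^{(i)} = 0$ or $I_{Max}$ is reached, stop the decoding and go to Step 3. Otherwise set $i := i + 1$ and go to Step 1.
Step 3 Output $\hat{w}^{(i)}$ as the decoded codeword.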
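The flooding schedule of standard BP can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: it assumes the textbook LLR-domain (tanh rule) updates, unit-amplitude BPSK with bit 0 mapped to +1 (so the channel LLR is 4y/N0 and a non-negative LLR decides 0), and a hypothetical (7, 4) Hamming parity check matrix as test input. All function and variable names are invented for the example.

```python
import math

def check_to_bit(z_others):
    """Tanh rule: 2 * atanh(prod tanh(z / 2)) over the other bits of a check."""
    t = 1.0
    for z in z_others:
        t *= math.tanh(z / 2.0)
    t = max(-1.0 + 1e-12, min(1.0 - 1e-12, t))  # keep atanh finite
    return 2.0 * math.atanh(t)

def bp_decode(H, y, N0, max_iter=50):
    """Standard (flooding) BP; returns (hard decision, iterations used)."""
    M, N = len(H), len(H[0])
    N_of = [[n for n in range(N) if H[m][n]] for m in range(M)]
    M_of = [[m for m in range(M) if H[m][n]] for n in range(N)]
    F = [4.0 * yn / N0 for yn in y]              # channel LLRs (assumed mapping)
    z = {(m, n): F[n] for m in range(M) for n in N_of[m]}  # bit-to-check
    eps = {}                                     # check-to-bit
    w = [0] * N
    for it in range(1, max_iter + 1):
        # Horizontal step: every check-to-bit message uses only the
        # bit-to-check messages of the previous iteration.
        for m in range(M):
            for n in N_of[m]:
                eps[m, n] = check_to_bit([z[m, k] for k in N_of[m] if k != n])
        # Vertical step: a posteriori LLRs, then extrinsic bit-to-check LLRs.
        post = [F[n] + sum(eps[m, n] for m in M_of[n]) for n in range(N)]
        for (m, n) in list(z):
            z[m, n] = post[n] - eps[m, n]
        # Hard decision and syndrome test.
        w = [0 if p >= 0 else 1 for p in post]
        if all(sum(w[n] for n in N_of[m]) % 2 == 0 for m in range(M)):
            return w, it
    return w, max_iter

H = [
    [1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1],
]
# All-zero codeword sent as +1; the last sample was flipped by noise.
y = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, -0.2]
decoded, iters = bp_decode(H, y, N0=1.0)
```

In this toy case the single weakly wrong sample is outweighed by the check-to-bit message from its parity check, so the syndrome test passes and the decoder stops early.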
B. Plain shuffled BP for iterative decoding of LDPC codes
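At the $i$-th iteration of the standard BP algorithm, first all values of the check-to-bit messages are updated by using the values of the bit-to-check messages obtained at the $(i-1)$-th iteration, i.e., each $\varepsilon^{(i)}_{mn}$ is computed from messages of the $(i-1)$-th iteration only. In plain shuffled BP, the bits are instead processed one after another, and the check-to-bit update for a given bit uses the bit-to-check messages of the bits that have already been processed at the current iteration.
Since this algorithm becomes totally serial, bits can be grouped to maintain a sufficient level of parallelism with the same error performance [7].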
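The difference from the flooding schedule can be made concrete with a short sketch of a bit-serial (shuffled) schedule. It is written under the same assumptions as a textbook LLR-domain BP decoder (tanh rule, bit 0 mapped to +1, channel LLR 4y/N0); the function names, the optional `order` argument and the small Hamming test matrix are hypothetical and only serve the example.

```python
import math

def check_to_bit(z_others):
    """Tanh rule: 2 * atanh(prod tanh(z / 2)) over the other bits of a check."""
    t = 1.0
    for z in z_others:
        t *= math.tanh(z / 2.0)
    t = max(-1.0 + 1e-12, min(1.0 - 1e-12, t))  # keep atanh finite
    return 2.0 * math.atanh(t)

def shuffled_bp_decode(H, y, N0, max_iter=50, order=None):
    """Bit-serial BP: each bit is updated with the freshest messages."""
    M, N = len(H), len(H[0])
    N_of = [[n for n in range(N) if H[m][n]] for m in range(M)]
    M_of = [[m for m in range(M) if H[m][n]] for n in range(N)]
    order = list(range(N)) if order is None else list(order)
    F = [4.0 * yn / N0 for yn in y]
    z = {(m, n): F[n] for m in range(M) for n in N_of[m]}  # bit-to-check
    post = list(F)
    w = [0] * N
    for it in range(1, max_iter + 1):
        for n in order:
            # z already holds current-iteration values for bits processed
            # earlier in `order` and previous-iteration values for the rest.
            eps = {m: check_to_bit([z[m, k] for k in N_of[m] if k != n])
                   for m in M_of[n]}
            post[n] = F[n] + sum(eps.values())
            for m in M_of[n]:
                z[m, n] = post[n] - eps[m]  # available to later bits at once
        w = [0 if p >= 0 else 1 for p in post]
        if all(sum(w[n] for n in N_of[m]) % 2 == 0 for m in range(M)):
            return w, it
    return w, max_iter

H = [
    [1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1],
]
y = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, -0.2]
decoded_up, iters_up = shuffled_bp_decode(H, y, N0=1.0)
decoded_down, iters_down = shuffled_bp_decode(
    H, y, N0=1.0, order=range(len(y) - 1, -1, -1))
```

Passing a different `order` is the only change needed to obtain a decreasing-order schedule; a grouped variant would process several bits of `order` per step with flooding-style updates inside each group.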
C. Replica shuffled BP decoding for LDPC codes
Plain shuffled BP decoding is a bit-based sequential approach, and the method described in Section III-B is based on a natural increasing order, i.e., belief messages concerning bit nodes are updated in the order i = 1, 2, . . . , N.
The larger the value of i, the more independent information is used to update the beliefs of bit node i and the more reliable these belief messages are. Therefore, after each iteration, the reliability of the bit nodes increases and the error rate decreases as i increases. Following a similar analysis, in plain shuffled BP decoding based on a natural decreasing order, after each iteration, the reliability of bit nodes decreases as i increases. As in replica shuffled turbo decoding, in replica shuffled BP decoding, two replica subdecoders based on different updating orders operate simultaneously and cooperatively. After each iteration, each subdecoder receives more reliable messages from, and sends more reliable messages to, the other subdecoder. Based on these more reliable messages, both replica subdecoders begin the next iteration of decoding. Let $\overrightarrow{D}$ and $\overleftarrow{D}$ denote the replica subdecoders with natural increasing and decreasing updating order, respectively, and mark their messages with the corresponding arrow, e.g., $\overrightarrow{\varepsilon}^{(i)}_{mn}$ and $\overrightarrow{z}^{(i)}_{mn}$ for $\overrightarrow{D}$. Replica shuffled BP decoding is carried out as follows:
Step 1 Each subdecoder performs one iteration of plain shuffled BP in its own updating order.
Step 2 Exchange of the more reliable messages. Each subdecoder replaces its less reliable messages with the corresponding more reliable messages of the other subdecoder.
Step 3 Hard decision and stopping criterion test: If $H\hat{w}^{(i)} = 0$ or $I_{Max}$ is reached, stop the decoding and go to Step 4. Otherwise set $i := i + 1$ and go to Step 1.
Step 4 Output $\hat{w}^{(i)}$ as the decoded codeword.
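A possible reading of the per-iteration exchange is sketched below. It runs one increasing-order and one decreasing-order shuffled pass from the same starting messages and then keeps, for every bit, the messages of the subdecoder that processed that bit later. The split point at N/2, the function names and the test matrix are assumptions made for this sketch, since the exact exchange rule is not recoverable from the text; the updates use the textbook tanh rule with bit 0 mapped to +1.

```python
import math

def check_to_bit(z_others):
    """Tanh rule: 2 * atanh(prod tanh(z / 2)) over the other bits of a check."""
    t = 1.0
    for z in z_others:
        t *= math.tanh(z / 2.0)
    t = max(-1.0 + 1e-12, min(1.0 - 1e-12, t))  # keep atanh finite
    return 2.0 * math.atanh(t)

def shuffled_pass(N_of, M_of, F, z, order):
    """One shuffled BP iteration over `order`; returns new (z, post)."""
    z = dict(z)            # each replica works on its own copy
    post = list(F)
    for n in order:
        eps = {m: check_to_bit([z[m, k] for k in N_of[m] if k != n])
               for m in M_of[n]}
        post[n] = F[n] + sum(eps.values())
        for m in M_of[n]:
            z[m, n] = post[n] - eps[m]
    return z, post

def replica_shuffled_decode(H, y, N0, max_iter=10):
    M, N = len(H), len(H[0])
    N_of = [[n for n in range(N) if H[m][n]] for m in range(M)]
    M_of = [[m for m in range(M) if H[m][n]] for n in range(N)]
    F = [4.0 * yn / N0 for yn in y]
    z = {(m, n): F[n] for m in range(M) for n in N_of[m]}
    half = N // 2          # assumed split between the two replicas
    w = [0] * N
    for it in range(1, max_iter + 1):
        z_up, post_up = shuffled_pass(N_of, M_of, F, z, range(N))
        z_dn, post_dn = shuffled_pass(N_of, M_of, F, z, range(N - 1, -1, -1))
        # Exchange: low-index bits are updated last by the decreasing-order
        # replica, high-index bits last by the increasing-order replica.
        z = {(m, n): (z_dn[m, n] if n < half else z_up[m, n]) for (m, n) in z}
        post = [post_dn[n] if n < half else post_up[n] for n in range(N)]
        w = [0 if p >= 0 else 1 for p in post]
        if all(sum(w[n] for n in N_of[m]) % 2 == 0 for m in range(M)):
            return w, it
    return w, max_iter

H = [
    [1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1],
]
y = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, -0.2]
decoded, iters = replica_shuffled_decode(H, y, N0=1.0)
```

The two passes are independent within an iteration, so in hardware they could run concurrently; this sequential sketch only shows the data flow, at roughly twice the per-iteration work of a single shuffled decoder.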
An alternative approach is to make the two subdecoders exchange the more reliable messages after updating the reliability message of each bit (or group of bits). We observe that this "simultaneous replica" approach provides faster convergence, especially for Gallager-type LDPC codes. Thus all simulation results for LDPC codes are based on simultaneous updating of each bit (or group of bits). It is also straightforward to extend replica shuffled BP decoding to the cases in which more than two replica subdecoders are used. In order to decrease the decoding delay of plain shuffled BP decoding, a parallel version of shuffled BP named group shuffled BP was developed in [7]. In a similar way, group replica shuffled BP can also preserve the parallelism advantage of the standard BP algorithm. This idea can clearly be extended to other grouping schemes (e.g., [11]) and other iterative decoding algorithms, such as bit flipping and weighted bit flipping decoding.
D. Simulation results
The maximum number of iterations for plain and group replica shuffled BP was set to 10. We observe that the WER performance of replica shuffled BP decoding with four subdecoders, $I_{Max} = 10$, and a group number larger than or equal to four is approximately the same as that of standard BP with $I_{Max} = 60$. Fig. 3 depicts the WER of the standard and replica shuffled BP decoding of the (16200, 7200) irregular LDPC code, which is constructed in a semi-random manner [12]. The variable node degree distribution is $\lambda(x) = 0.00006x + 0.57772x^2 + 0.3111x^3 + 0.11111x^8$. The number of replica subdecoders was four. We observe that replica shuffled BP with $I_{Max} = 10$ and $G = 32$ provides performance similar to that of standard BP with $I_{Max} = 70$.

IV. ITERATIVE DECODING OF TURBO PRODUCT CODES
A two-dimensional turbo product code (TPC) can be denoted as $C_1 \otimes C_2$, where $C_1$ and $C_2$ are two linear block codes. Place $k_1 \times k_2$ information symbols in an array of $k_1$ rows and $k_2$ columns, and then encode the $k_1$ rows using code $C_2$. Afterwards, the resulting $n_2$ columns are encoded using code $C_1$. Usually, $C_1$ is chosen to be the same as $C_2$.
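The row-then-column construction can be illustrated with a toy example. The sketch below uses single-parity-check codes for both component codes purely for brevity; this choice, the function names and the example array are assumptions and do not reflect the component codes used in the paper.

```python
def spc_encode(bits):
    """Hypothetical component encoder: append one even-parity bit."""
    return list(bits) + [sum(bits) % 2]

def product_encode(info, row_enc=spc_encode, col_enc=spc_encode):
    """Encode a k1 x k2 array: rows with C2 first, then all n2 columns with C1."""
    rows = [row_enc(r) for r in info]                      # k1 x n2
    n2 = len(rows[0])
    cols = [col_enc([r[j] for r in rows]) for j in range(n2)]
    n1 = len(cols[0])
    # Transpose back so that the result is an n1 x n2 array.
    return [[cols[j][i] for j in range(n2)] for i in range(n1)]

info = [
    [1, 0, 1],
    [0, 1, 1],
]
codeword = product_encode(info)
```

Every row of the result is a codeword of the row code and every column a codeword of the column code, which is the property that row and column decoders exploit when they exchange extrinsic information.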
A. Conventional decoding methods of TPC
The conventional TPC decoder performs row and column decoding in a serial fashion. A soft-input soft-output (SISO) decoder, such as MAP, is used to decode each row or column. A low-complexity decoding approach is provided in [13]. It applies the Chase algorithm iteratively to the row and column decoding, but still in a serial fashion. In order to halve the decoding latency, a parallel TPC decoder has been proposed in [10]. As opposed to the conventional serial TPC decoder, the row and column decoders in this method operate in parallel and send each other the updated extrinsic information immediately after a row or column has been decoded. The simulation results reveal that this parallel decoder halves the decoding latency of the original decoder.
B. Replica decoding of TPC
Based on the parallel decoder, we propose a replica parallel decoder, shown in Fig. 4, which can further reduce the decoding latency. Both row and column decoders are duplicated but work from opposite extremes: the two row decoders process rows from the top and the bottom, respectively, and the two column decoders process columns from the left and the right, respectively. Row (column) decoders immediately send the latest extrinsic matrices to both column (row) decoders. The procedure of replica parallel decoding is shown in Fig. 5. The circles denote bit positions that were already updated by other decoders. Their number is much larger than in the parallel decoder, which greatly benefits the decoding because most bits have more accurate a priori information. The arrows with letters a T , b T , etc., represent the processing order of the different decoders. After both row (column) decoders finish decoding all the rows (columns), their most reliable parts are combined and the resulting extrinsic matrix for the next iteration is transmitted to both column (row) decoders.
C. Simulation results
As shown in Fig. 6, the performance at the third iteration of the replica decoder is better than that at the fourth iteration of the parallel decoder.
Fig. 3. Error performance for iterative decoding of the (16200, 7200) irregular LDPC code.
