Abstract-Thanks to the message passing principle, turbo decoding is able to provide strong error correction near the theoretical (Shannon) limit. However, the minimum Hamming distance (MHD) of a Turbo Code may not be sufficient to prevent a detrimental change in the error rate vs. signal to noise ratio curve, the so-called flattening. Increasing the MHD of a Turbo
I. INTRODUCTION
Turbo Codes (TCs) are today mainly used in Automatic Repeat reQuest (ARQ) systems, which do not usually require very low error rates. Target Frame Error Rates (FERs) from $10^{-2}$ to $10^{-5}$ are typical for this kind of communication system. Most current commercial applications of TCs, such as the Digital Video Broadcasting standard [1] and the 802.16a WiMAX standard for local and metropolitan area networks [2], are based on 8-state component encoders. While such codes offer performance close to the Shannon limit in the so-called waterfall region, they suffer from a flattening around FER 1U due to a poor minimum Hamming distance (MHD). In future system generations, lower error rates, down to $10^{-8}$, will be required to open the way to real-time and more demanding applications, such as TV broadcasting or videoconferencing. State-of-the-art 8-state TCs are therefore not suitable for this kind of application, and more powerful coding schemes are required. At the same time, a reasonable complexity should be preserved.
Improving performance at very low error rates by raising the MHD may involve using component encoders with a larger number of states, devising more appropriate internal permutations, or increasing the dimension of the TC, i.e. the number of component encoders. More complex 16-state TCs have already been adopted in some standards. A very powerful TC based on 16-state components, achieving FERs down to $10^{-7}$ for a wide range of code rates, has recently been proposed in [3]. The price is paid in terms of complexity, which is doubled with respect to 8-state component TCs. Designing better permutations is another way to improve the minimum distance of TCs. The recently proposed DRP [4] and ARP [5] interleavers are highly structured interleavers capable of achieving reasonably high distances. However, devising better interleavers is a very difficult and time-consuming task.
This paper addresses the third alternative for improving TC performance and proposes a three-dimensional Turbo Code (3D-TC), simply derived from the classical TC by concatenating a post-encoder at its output. Thanks to the message passing (turbo) principle, it has become simple today to imagine various coding structures by concatenating simple component codes, provided that their corresponding decoders are of the Soft-In/Soft-Out (SISO) type. Basically, there are two kinds of concatenation: serial and parallel. Mixed structures are also possible, like the ones proposed in [6] or [7]. The 3D-TC we describe here is inspired by these contributions and calls for both parallel and serial concatenation.
A. Graell i Amat is supported by a Marie Curie Intra-European Fellowship within the 6th European Community Framework Programme.
The proposed code is very versatile and provides very low error rates for a wide range of block lengths and code rates. It significantly improves performance in the error floor region with respect to the 8-state double-binary (DB) TC of the DVB-RCS standard, at the expense of a very small increase in complexity (less than 10%). It also compares favorably with more complex 16-state TCs and with the LDPC code of the DVB-S2 standard.
II. THE ENCODING STRUCTURE
A block diagram of the proposed 3D Turbo Code is depicted in Fig. 1 
The fraction $1 - \lambda$ of the parity bits that is not re-encoded ($\lambda$ being the permeability rate) is sent directly to the channel or punctured to achieve the desired code rate.
The material added to the classical turbo encoder, which we call the patch because it is placed just behind a pre-existing turbo encoder, is composed of:
• a parallel-to-serial (P/S) multiplexer, which alternately takes the parity bits $Y_1$ and $Y_2$ to be re-encoded and groups them into a single block of $P$ bits,
• a permutation, denoted $\Pi'$, which permutes the parity bits before feeding them to the post-encoder,
• a rate-1 post-encoder, whose output is denoted $w$.
The value of the permeability rate $\lambda$ to be chosen is a trade-off between the convergence loss and the required MHD. Convergence designates the zone of the error rate vs. signal-to-noise ratio (SNR) curve where the error rate begins to decrease noticeably. Choosing a large value of $\lambda$ penalizes the decoder from the convergence point of view. This is because the decoder associated with the post-encoder does not benefit from any redundant information at the first iteration and therefore multiplies the errors during the first processing. Let us assume for instance that the post-encoder is the well-known accumulator (i.e. a convolutional code with memory 1), depicted in Fig. 3(a). The associated decoder (the pre-decoder), without any extra information, doubles the errors at its input. From (1), the fraction $\theta$ of the codeword bits that are post-encoded bits is:
$\theta = \lambda R$ (2)
where $R$ is the coding rate. Then, if $p$ is the probability of error at the channel output, the average probability of error $p'$ at each decoder intrinsic input is:
$p' = (1 - \theta)\,p + 2\theta\,p$ (3)
that is,
$p' = (1 + \theta)\,p$ (4)
Using (2),
$p' = (1 + \lambda R)\,p$ (5)
i.e. the probability of error at each decoder intrinsic input is raised by a factor $(1 + \lambda R)$, inducing a loss in convergence.
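The error doubling attributed to the accumulator's pre-decoder at the first iteration can be illustrated with a small sketch. It replaces the actual soft-input pre-decoder by a hard-decision inverse of the accumulator, which is a simplification assumed here for illustration only; the function names are hypothetical.

```python
import random
from typing import List


def accumulate(bits: List[int]) -> List[int]:
    """Rate-1 accumulator: w_i = w_{i-1} XOR y_i, starting from state 0."""
    out, state = [], 0
    for b in bits:
        state ^= b
        out.append(state)
    return out


def differentiate(bits: List[int]) -> List[int]:
    """Hard-decision inverse of the accumulator: y_i = w_i XOR w_{i-1}."""
    out, prev = [], 0
    for b in bits:
        out.append(b ^ prev)
        prev = b
    return out


def count_errors(a: List[int], b: List[int]) -> int:
    """Number of positions where the two sequences differ."""
    return sum(x != y for x, y in zip(a, b))


rng = random.Random(0)
parity = [rng.randint(0, 1) for _ in range(32)]  # bits fed to the post-encoder
received = accumulate(parity)
received[10] ^= 1                                # a single channel error
decoded = differentiate(received)
errors_after = count_errors(parity, decoded)     # positions 10 and 11 are hit
```

With independent channel errors of probability $p$, each recovered bit is wrong when exactly one of two adjacent received bits is wrong, i.e. with probability $2p(1-p)$, which is close to $2p$ for small $p$.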
On the other hand, a large value of the permeability rate $\lambda$ translates into a higher MHD.
The strategy for choosing the permeability rate $\lambda$ ensues directly from (5):
1) Choose the acceptable convergence loss, i.e. the value of $p'$ for a given $p$.
2) For a given coding rate, deduce the value of $\lambda$.
3) If the resulting MHD is not sufficient, increase $p'$ and go to 1).
Note that the rate of the 3D Turbo Code is necessarily upper bounded by $1/(1+\lambda)$.
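The selection strategy can be sketched numerically. The sketch below assumes the accumulator model discussed above, i.e. a first-iteration error penalty of the form $1 + \lambda R$ for permeability rate $\lambda$ and coding rate $R$, and it assumes that the systematic bits and the post-encoder output are never punctured, which gives the rate bound $1/(1+\lambda)$. All function names are hypothetical.

```python
from fractions import Fraction


def convergence_factor(lam: Fraction, rate: Fraction) -> Fraction:
    """Assumed penalty p'/p = 1 + lam * R when the pre-decoder doubles errors."""
    return 1 + lam * rate


def permeability_for_factor(factor: Fraction, rate: Fraction) -> Fraction:
    """Step 2 of the strategy: solve 1 + lam * R = factor for lam."""
    if factor < 1:
        raise ValueError("the penalty factor p'/p cannot be below 1")
    lam = (factor - 1) / rate
    if lam > 1:
        raise ValueError("no permeability rate in [0, 1] gives this penalty")
    return lam


def max_rate(lam: Fraction) -> Fraction:
    """Assumed rate bound: k systematic bits plus lam * k post-encoded bits."""
    return 1 / (1 + lam)


lam = permeability_for_factor(Fraction(9, 8), Fraction(1, 2))  # tolerate 12.5%
bound = max_rate(lam)
```

Under these assumptions, $\lambda = 1/4$ and $\lambda = 1/8$ give rate bounds of 4/5 and 8/9, which are the highest rates used with those values in the simulation section.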
III. THE CHOICE OF THE POST-ENCODER
The choice of the post-encoder is crucial for code performance. It has to meet the following requirements:
1) Its decoder must be simple, adding little complexity to the classical turbo decoder, while being able to handle soft-in and soft-out information.
2) To prevent the decoder from suffering any side effects, because very low error rates are sought, the post-code has to be a homogeneous block code.
3) At the first iteration (i.e. without any redundant input information), the pre-decoder associated with the rate-1 post-encoder must not exhibit too much error amplification, to avoid a high loss in convergence.
Possible candidates, low-memory linear RSC codes that satisfy requirement (1), are given in Fig. 2.
Requirement (2) [...] (6)
where $G$ is the state matrix of the linear feedback register (LFR). For instance, considering the LFR in Fig. 3 [...]
where $I$ is the $\nu \times \nu$ identity matrix.
Note that the circulation state $s_c$ exists if and only if $I + G^P$ is invertible. This condition is never satisfied for some matrices $G$, whatever $P$. This is the case of the encoders of Fig. 2(a) and Fig. 2(c). [...] Fig. 2(b) and, from (9), $s_c$ can be related to $s_0$ by: [...] (11)
Finally, with the encoder initialized to the circulation state, the encoding process proper can start and provide the redundancy sequence.
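The invertibility condition on $I + G^P$ can be checked directly over GF(2). The sketch below assumes the usual circular (tail-biting) encoding relation $(I + G^P)\,s_c = s_P^0$, where $s_P^0$ denotes the final state reached from the all-zero state, and assumes companion-form state matrices for the feedback polynomials $1+D$, $1+D+D^2$ and $1+D^2$. These matrices, the state update convention and all function names are illustrative choices, not taken from the figures.

```python
from typing import List, Optional

Matrix = List[List[int]]
Vector = List[int]


def mat_mul(a: Matrix, b: Matrix) -> Matrix:
    """Matrix product over GF(2)."""
    return [[sum(a[i][k] & b[k][j] for k in range(len(b))) % 2
             for j in range(len(b[0]))] for i in range(len(a))]


def mat_pow(g: Matrix, exponent: int) -> Matrix:
    """G raised to a non-negative integer power over GF(2)."""
    n = len(g)
    result = [[int(i == j) for j in range(n)] for i in range(n)]
    base = g
    while exponent:
        if exponent & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        exponent >>= 1
    return result


def gf2_inverse(m: Matrix) -> Optional[Matrix]:
    """Gauss-Jordan inverse over GF(2); returns None if m is singular."""
    n = len(m)
    aug = [row[:] + [int(i == j) for j in range(n)] for i, row in enumerate(m)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col]), None)
        if pivot is None:
            return None
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col and aug[r][col]:
                aug[r] = [x ^ y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def run_register(g: Matrix, data: Vector, state: Vector) -> Vector:
    """Apply s_i = G s_{i-1} + d_i e_1 over GF(2) for each input bit d_i."""
    for d in data:
        nxt = [sum(x & s for x, s in zip(row, state)) % 2 for row in g]
        nxt[0] ^= d
        state = nxt
    return state


def circulation_state(g: Matrix, data: Vector) -> Optional[Vector]:
    """Solve (I + G^P) s_c = s_P^0; returns None if I + G^P is singular."""
    n, p = len(g), len(data)
    gp = mat_pow(g, p)
    inv = gf2_inverse([[gp[i][j] ^ int(i == j) for j in range(n)]
                       for i in range(n)])
    if inv is None:
        return None
    final_from_zero = run_register(g, data, [0] * n)
    return [sum(a & b for a, b in zip(row, final_from_zero)) % 2 for row in inv]


G_ACC = [[1]]                 # assumed feedback 1 + D (accumulator)
G_7 = [[1, 1], [1, 0]]        # assumed feedback 1 + D + D^2 (octal 7)
G_5 = [[0, 1], [1, 0]]        # assumed feedback 1 + D^2 (octal 5)

data = [1, 0, 1, 1, 0, 0, 1]  # arbitrary block, P = 7
s_c = circulation_state(G_7, data)
```

In this model, the registers with feedback $1+D$ and $1+D^2$ never admit a circulation state, whereas the one with feedback $1+D+D^2$ does whenever $P$ is not a multiple of 3, since its state matrix has period 3.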
B. The post-encoder with generator polynomial 5
The encoder of Fig. 2 [...] The decoding process has to take transformation (12) into account. This is done by an exchange of metrics after address $i = P$ has been processed during the forward recursion, and after address $i = 1$ has been processed during the backward recursion, when the MAP algorithm or a simplified version of it is employed. Table I provides the corresponding values of $s_P$ and $s'$, obtained through (12), and the corresponding values of $s_0$ and $s'$, obtained through (15). We can observe that only 2 (if $P$ is odd) or 3 (if $P$ is even) metrics need to be swapped during the decoding process, at the extremities of the block, which represents a very small additional complexity for the 4-state decoder.
For short block sizes ($k < 1000$ bits) and for medium sizes ($1000 < k < 5000$) with low rates, the linear post-encoder of Fig. 2(c) was chosen. However, for large blocks ($k > 5000$) and for medium sizes with high rates, the 3D-TC with post-encoder (c) shows a flattening around FER $= 10^{-6}$, due to a poor minimum distance. In these cases, the post-encoder was replaced by the non-linear encoder depicted in Fig. 4. In Fig. 5 we report the EXIT curves for the linear post-encoder of Fig. 2(c) and the non-linear encoder of Fig. 4. When no a priori information is available at the input of the pre-decoder (i.e. at the first iteration), the Mutual Information (MI) at its output is higher for the linear post-encoder. In fact, the linear post-encoder doubles the number of errors at the first iteration, while the non-linear post-encoder multiplies the number of errors by a factor of 3 to 4. The properties of the non-linear code therefore do not seem very appropriate with respect to requirement (3) in the choice of the post-encoder. However, the two curves in Fig. 5 cross around input MI 0.3. For high input MI the non-linear code behaves better, indicating better behavior in the floor region. The non-linear encoder helps to increase the MHD of the 3D-TC, with favorable consequences in the cases cited above.
IV. PERMUTATIONS $\Pi$ AND $\Pi'$
The proposed 3D-TC is characterized by two permutations, denoted $\Pi$ and $\Pi'$. $\Pi$ is the internal permutation of the DB-TC. Here, we consider $\Pi$ to be of the Almost Regular Permutation (ARP) type, as described in [5]. The ARP permutation is a generalization of the permutation model used in the DVB-RCS turbo code. A second permutation, $\Pi'$, is used to spread the parity bits before post-encoding. We take $\Pi'$ to be of the simplest kind. Let $i$ ($1 \le i \le P$) and $j$ ($1 \le j \le P$) denote the address in the natural order and in the permuted order, respectively. Then $\Pi'$ is defined by the following congruence relation:
$i = \Pi'(j) = P_0\, j + i_0 \bmod P$,
where $i_0$ is the starting index and $\gcd(P_0, P) = 1$. Note that the two permutations assumed in this paper are based on very simple models, enabling large degrees of parallelism.
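The congruence relation that defines the second permutation can be sketched in a few lines. The sketch assumes 0-based addresses (0 to $P-1$) rather than the 1-based addresses of the text, so that the modulo operation maps directly onto list indices; the function names and the example parameters are hypothetical.

```python
from math import gcd
from typing import List, Sequence


def regular_permutation(p: int, p0: int, i0: int = 0) -> List[int]:
    """Return perm with perm[j] = (p0 * j + i0) mod p for j = 0 .. p-1.

    The map is a bijection on {0, ..., p-1} only when gcd(p0, p) == 1.
    """
    if p <= 0:
        raise ValueError("block length p must be positive")
    if gcd(p0, p) != 1:
        raise ValueError("p0 and p must be coprime")
    return [(p0 * j + i0) % p for j in range(p)]


def spread(bits: Sequence[int], perm: Sequence[int]) -> List[int]:
    """Read the block at the permuted addresses: out[j] = bits[perm[j]]."""
    return [bits[i] for i in perm]


perm = regular_permutation(12, 5, 3)      # illustrative values, not from the paper
shuffled = spread(list(range(12)), perm)  # equals perm for an identity block
```

Because each address is computed independently of the others, several addresses can be generated at once, which is consistent with the remark on parallelism.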
V. DECODING THE 3D TURBO CODE
The decoding of the 3D-TC calls for the classical turbo procedure, as depicted in Fig. 6, in the logarithmic domain. As for standard TCs, the two 8-state SISO decoders (DEC1 and DEC2) exchange extrinsic information on the systematic bits (A, B in Fig. 1) of the received codeword. They must also provide the 4-state SISO pre-decoder (PRE-DEC) with extrinsic information on the post-encoded parity bits. In turn, the pre-decoder feeds the 8-state decoders with extrinsic information on these parity bits.
Because DEC1 and DEC2 are quaternary 8-state decoders processing $N = k/2$ couples of bits, and the pre-decoder is a binary 4-state decoder processing $P = \lambda k$ bits, the relative computational complexity added by the latter is very small.
For instance, with $\lambda = 1/4$, the additional complexity is only 6%. To this, however, some extra functions must be added to the classical turbo decoder, the main one being the calculation of the extrinsic information on the parity bits to be fed to the pre-decoder. Overall, the additional complexity, compared to the classical turbo decoder, is less than 10% for $\lambda = 1/4$.
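The 6% figure can be reproduced with a simple count, assuming that the number of trellis branches processed per block is an adequate proxy for computational complexity. This proxy and the function names are assumptions of this sketch; the state counts, alphabet sizes and block lengths are those given in the text.

```python
from fractions import Fraction


def trellis_branches(sections, states: int, branches_per_state: int):
    """Branches processed in one pass over a trellis with the given shape."""
    return sections * states * branches_per_state


def predecoder_overhead(lam: Fraction, k: int = 1) -> Fraction:
    """Pre-decoder branch count relative to the two 8-state decoders."""
    # DEC1 and DEC2: quaternary (4 branches per state), 8 states, k/2 sections.
    main = 2 * trellis_branches(Fraction(k, 2), 8, 4)
    # PRE-DEC: binary (2 branches per state), 4 states, lam * k sections.
    pre = trellis_branches(lam * k, 4, 2)
    return pre / main


overhead = predecoder_overhead(Fraction(1, 4))  # permeability rate 1/4
```

Under this proxy the overhead is a quarter of the permeability rate, i.e. 1/16 = 6.25% for a permeability rate of 1/4, independently of the block size $k$.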
VI. SIMULATION RESULTS
The performance of the 3D-TC was assessed by means of simulation. In Figs. 8 and 9 we report frame error rate results for two typical block sizes, 188 and 57 bytes, respectively, and coding rates 1/4, 1/2 and 4/5. In all simulations, $\lambda = 1/4$ and a maximum of 8 iterations were assumed. The component decoding algorithm is the simple Max-Log-MAP algorithm (also called the dual Viterbi algorithm). Note that for global coding rate 1/4, the component double-binary encoder in Fig. 1 has to output three parity bits. Fig. 7 depicts the block diagram of the best 8-state encoder that was found to provide three outputs.
The proposed code shows excellent performance for both short and medium block sizes. In particular, for an information block size of 188 bytes (see Fig. 8), only a 0.8 dB loss is observed with respect to Gallager's random coding bound at $10^{-7}$ for all rates. For comparison, the performance of the original DVB-RCS TC is also reported for rates 1/2 and 4/5. As expected, in most cases (except for $R = 4/5$) the 3D-TC shows a small convergence loss at high error rates with respect to the DVB-RCS TC. On the other hand, the error floor is significantly improved. The largest gain is obtained for 188 bytes and $R = 1/2$ (about 1.4 dB at FER $= 10^{-7}$).
The performance of the 3D 8-state TC is comparable to that of much more complex 16-state TCs, such as the one described in [3], which is also reported in Figs. 8 and 9 for rates 1/2 and 4/5. For a block length of 188 bytes, the 3D-TC loses 0.1 dB in convergence with respect to the 16-state double-binary Turbo Code in [3]. However, the proposed code improves on the 16-state TC in the error floor. A similar behavior is observed in Fig. 9 for a block length of 57 bytes. The proposed code also shows very good performance for large block lengths. In Fig. 10 the bit error rate (BER) performance of the proposed 3D-TC is compared with that of the DVB-S2 standard LDPC code [9], for coding rates 1/2 and 8/9 and a coded block length of 8000 bytes. The performance of the LDPC code was obtained from an FPGA and is very close to the simulated performance.
Here, $\lambda = 1/8$ and 12 iterations are assumed for the 3D-TC.
Similar performance is observed for the two codes. To the best of our knowledge, this is the first time that a (modified) turbo code achieves such performance for a long block and a high coding rate.
Finally, in 
[Figure legend: k = 12288 bits, R = ]
