A full-parallel architecture for turbo decoding, which achieves ultrahigh data rates when using product codes as error correcting codes, is proposed. This architecture is able to decode product codes using binary BCH or m-ary Reed-Solomon component codes. The major advantage of our architecture is that it enables the memory blocks between all half-iterations to be removed. Moreover, the latency of the turbo decoder is strongly reduced. The proposed architecture opens the way to numerous applications such as optical transmission and data storage. In particular, the block turbo decoding architecture can support optical transmission at data rates above 10 Gbit/s.
Introduction: In recent years, turbo codes [1] have been adopted by several digital communication applications. They are particularly attractive for increasing transmission rates and/or guaranteeing Quality of Service (QoS). Research is currently under way to use turbo codes to protect data stored on hard drives or DVDs and in fibre-optic transmission. The earliest forward error correction (FEC) schemes for optical communication [2] employed the well-known Reed-Solomon (RS) codes to compensate for the degradation in bit error rate (BER) caused by fibre nonlinearity and polarisation-dependent phenomena. The RS(255, 239) code provides a net coding gain of around 6 dB. The very high-speed data transmission developed for fibre-optic networking systems requires ultra-high-speed FEC architectures to meet continuing demands for ever higher data rates. Currently the RS(255, 239) code can be used in ultra-high-speed (40 Gbit/s [3] and 80 Gbit/s [4]) fibre-optic systems. More powerful FECs, such as block turbo codes (BTCs), have a theoretical potential net coding gain of around 10 dB with a redundancy overhead of less than 25% [5]. Typically, realistic block turbo codes can operate at less than 1 dB from the Shannon limit for a binary symmetric channel. In 2005 Mitsubishi Electric announced the development of the first block turbo decoder for 10 Gbit/s optical transmission [6].
Previous work: Many block turbo decoder architectures have previously been designed. The classical approach involves decoding all the rows or all the columns of a matrix before the next half-iteration. When an application requires high-speed decoders, one architectural solution is to cascade soft-input soft-output (SISO) elementary decoders, one for each half-iteration. In this case, memory blocks are necessary between successive half-iterations to store channel data and extrinsic information. Each memory block is composed of four memories of $qn^2$ symbols, where $q$ is the number of bits used to quantise the matrix symbols and $n$ is the codeword length. Thus, duplicating a SISO elementary decoder (e_dec) also duplicates the memory block, which is very costly in terms of silicon area. In 2002, a new architecture for turbo decoding product codes was proposed [7]. The idea is to store several data at the same address and to perform parallel decoding to increase the data rate. However, these data must be processed both by row and by column. Consider $m$ adjacent rows and $m$ adjacent columns of the initial matrix. The $m^2$ data at their intersection constitute a word of the new matrix, which has $m^2$ times fewer addresses. This data organisation does not require any particular memory architecture. The results obtained show that the turbo decoding throughput is increased by a factor of $m^2$ when $m$ elementary decoders, each processing $m$ data simultaneously, are used, and that the latency is divided by $m$. The area of the $m$ elementary decoders ($m$ e_dec) is increased by a factor of $m^2/2$, while the memory size remains constant.
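The memory cost and the word organisation described above can be illustrated with a minimal Python sketch. The function names, the row-major word addressing and the example values of $n$, $q$ and $m$ are assumptions made for illustration only; the source specifies neither an address layout nor particular parameter values.

```python
def memory_block_size(n, q):
    """Size of one memory block between half-iterations.

    Follows the expression given in the text: four memories of q * n^2.
    """
    return 4 * q * n * n


def word_address(i, j, m, n):
    """Address of the m x m word that holds symbol (i, j).

    Assumption: m divides n and words are numbered in row-major order.
    """
    words_per_row = n // m
    return (i // m) * words_per_row + (j // m)


def pack_words(n, m):
    """Group the coordinates of an n x n matrix into words of m * m symbols."""
    words = {}
    for i in range(n):
        for j in range(n):
            words.setdefault(word_address(i, j, m, n), []).append((i, j))
    return words


if __name__ == "__main__":
    n, q, m = 32, 5, 4  # hypothetical values, not taken from the source
    words = pack_words(n, m)
    print("memory block size:", memory_block_size(n, q))
    print("addresses:", len(words), "instead of", n * n)
```

With this grouping, every word holds $m^2$ symbols drawn from $m$ adjacent rows and $m$ adjacent columns, so the number of addresses drops from $n^2$ to $n^2/m^2$.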
Full-parallel turbo decoding principle: The codewords of all rows (or all columns) of a matrix can be decoded in parallel. If the architecture is composed of $n$ elementary decoders, an appropriate treatment of the matrix eliminates the need to reconstruct the matrix between successive decoding operations. Let $i$ and $j$ be the indices of a row and a column of the $n \times n$ matrix. In full-parallel processing, row decoder $i$ begins decoding its codeword at the symbol in the $i$th position, and each row decoder then processes the codeword symbols by increasing the index by one modulo $n$. Similarly, column decoder $j$ begins decoding its codeword at the symbol in the $j$th position, and each column decoder then processes the codeword symbols by decreasing the index by one modulo $n$. Therefore only one time cycle is necessary between two successive matrix decoding operations. The full-parallel decoding of an $n \times n$ product code matrix is detailed in Fig. 1. A similar strategy was previously presented in [8]. In this case, the conflicts between $n$ independent RAM memories are eliminated by the appropriate treatment of the matrix. The elementary decoder latency, $L$, can be defined as the number of symbols processed by the decoder during the decoding of one symbol. This latency depends on the structure of the elementary decoder and on the codeword length $n$. As the matrix reconstruction is removed, the latency between row and column decoding is zero.
Fig. 1 Full-parallel decoding of product code matrix ($n$ rows of $n$ symbols, $n$ columns of $n$ symbols; row index update $(i + 1) \bmod n$, column index update $(j - 1) \bmod n$)
Full-parallel turbo decoder for product codes: The major advantage of our full-parallel architecture is that it enables the memory block of $4qn^2$ symbols between each half-iteration to be removed. However, the codeword symbols exchanged between the row and column decoders have to be switched. One solution is to use a connection network for this task.
In our case we have chosen an Omega network.
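The index schedule of the full-parallel principle can be checked with a short Python sketch, which also shows why the exchanged symbols have to be switched. It assumes an idealised timing in which row and column decoders are aligned on the same step $t$ (the elementary decoder latency $L$ is ignored), and all function names are hypothetical.

```python
def row_schedule(i, t, n):
    """Column index of the symbol handled by row decoder i at step t."""
    return (i + t) % n


def column_schedule(j, t, n):
    """Row index of the symbol handled by column decoder j at step t."""
    return (j - t) % n


def check_schedule(n):
    """Check that, at every step, the row decoders touch exactly the
    symbols that the column decoders need, one symbol per column."""
    for t in range(n):
        produced = {(i, row_schedule(i, t, n)) for i in range(n)}
        consumed = {(column_schedule(j, t, n), j) for j in range(n)}
        if produced != consumed:
            return False
        # One symbol per column at each step: no access conflict.
        if len({col for _, col in produced}) != n:
            return False
    return True


def routing(t, n):
    """Column decoder that receives the output of each row decoder at step t."""
    return [row_schedule(i, t, n) for i in range(n)]


if __name__ == "__main__":
    print(check_schedule(8))
    print(routing(1, 4))
```

Under these assumptions, the output of row decoder $i$ at step $t$ is needed by column decoder $(i + t) \bmod n$, so the switching required at each step is a cyclic shift that changes from one step to the next. For $n = 4$ and $t = 1$, `routing` returns `[1, 2, 3, 0]`.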
The Omega network is one of several connection networks used in parallel machines [9]. It is composed of $\log_2 n$ stages, each having $n/2$ exchange elements. The complexity of the Omega network, in terms of the number of connections and of $2 \times 2$ switch transfer blocks, is $n \log_2 n$ and $(n \log_2 n)/2$, respectively. For example, the equivalent gate complexity of a $32 \times 32$ network can be estimated at 200 per exchange bit. The proposed full-parallel architecture for product codes is presented in Fig. 2. It is composed of cascaded modules for the block turbo decoder. Each module is dedicated to one iteration. However, it is possible to process several iterations with the same module. In our approach, $2n$ elementary decoders and two connection networks are necessary for one module. In fact, the full-parallel turbo decoder complexity essentially depends on the complexity of the elementary decoder.
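The Omega network figures quoted above can be reproduced with a minimal Python sketch, together with a simple routing check. The source does not say how the network is controlled, so the destination-tag routing used here (a perfect shuffle before each stage, then each $2 \times 2$ element selects its output from one destination bit, most significant bit first) is an assumption, as are the function names.

```python
def omega_complexity(n):
    """Stage, connection and switch counts for an n x n Omega network.

    Assumption: n is a power of two, n >= 2.
    """
    stages = n.bit_length() - 1  # log2(n)
    return {
        "stages": stages,
        "switches_per_stage": n // 2,
        "connections": n * stages,
        "switches": n * stages // 2,
    }


def route_omega(perm):
    """Return True if perm (perm[src] = destination) passes through the
    network without two symbols requesting the same switch output."""
    n = len(perm)
    stages = n.bit_length() - 1
    positions = list(range(n))  # current line of the symbol from each input
    for s in range(stages):
        new_positions = []
        for src, pos in enumerate(positions):
            # Perfect shuffle: rotate the line address left by one bit.
            shuffled = ((pos << 1) | (pos >> (stages - 1))) & (n - 1)
            # The exchange element output is chosen by one destination bit.
            bit = (perm[src] >> (stages - 1 - s)) & 1
            new_positions.append((shuffled & ~1) | bit)
        if len(set(new_positions)) != n:
            return False  # conflict inside a 2 x 2 exchange element
        positions = new_positions
    return positions == list(perm)


if __name__ == "__main__":
    print(omega_complexity(32))
    shifts_ok = all(
        route_omega([(i + t) % 8 for i in range(8)]) for t in range(8)
    )
    print("all cyclic shifts routable for n = 8:", shifts_ok)
```

For $n = 32$ the sketch gives 5 stages of 16 exchange elements, 160 connections and 80 switch blocks, in line with the expressions $n \log_2 n$ and $(n \log_2 n)/2$. Under the stated routing assumption, every cyclic shift of the inputs passes without conflict, whereas some other permutations, such as bit reversal, do not.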
Conclusion: A full-parallel turbo decoding architecture for product codes is proposed. This architecture enables the memory blocks between all half-iterations to be removed. Moreover, the latency of the turbo decoder is strongly reduced. The resulting ultra-high-speed FEC architectures meet demands for ever-higher data rates. In particular, our architectural solution can support optical transmission at data rates above 10 Gbit/s. In this context, using more powerful FECs such as block turbo codes opens up new opportunities for the next generation of optical communication systems.
