This article presents an innovative Turbo Product Code (TPC) decoder architecture that requires no interleaving resource. The architecture includes a full-parallel SISO decoder able to process n symbols in one clock period. Synthesis results show that this architecture is more efficient than previous solutions. Considering a 6-iteration turbo decoder of a (32,26)
INTRODUCTION
Nowadays, high-throughput telecommunication systems such as optical fiber transmission systems or passive optical networks require powerful error correcting codes in order to increase their optical budget. Iterative decoding provides effective solutions for next-generation optical systems. Recently, an ASIC implementation of a (660,480) LDPC code decoder was proposed. Its information throughput is 2.4Gb/s, and it could be raised to 10Gb/s with a (2048,1723) LDPC code [1]. Turbo product codes [2] are also good candidates for emerging optical systems [3]. In [4], a BTC decoder is included in a 12.4Gb/s optical experimental setup. Since only a part of the transmitted data is actually coded, the information throughput of the BTC turbo decoder is 156Mb/s. The inherently parallel structure of the product code matrix makes TPC well suited to parallel decoding. Nevertheless, increasing the parallelism rate rapidly requires a prohibitive amount of memory. Some solutions were proposed in [5] [6] [7] to use the interleaving resources (IR) efficiently and reach high parallelism rates. However, a particular scheduling of the product code matrix makes it possible to design a turbo decoder architecture without any interleaving resource.
After a brief introduction to the TPC coding and decoding concepts in section 2, section 3 reviews previous work and proposes an innovative TPC decoder architecture without any interleaving resource. This new TPC decoder includes a new full-parallel SISO decoder architecture, which is described in section 4. Section 5 gives synthesis results and demonstrates the higher efficiency of the proposed BTC decoder. Finally, a solution is proposed to enhance the parallelism rate while increasing the architecture efficiency. The interconnection issue is assessed and compared with an equivalent LDPC code decoder implementation.
TPC CODING AND DECODING PRINCIPLES

Product codes
The concept of product codes is a simple and efficient method to construct powerful codes with a large minimum Hamming distance d using cyclic linear block codes [8]. Let us consider two systematic cyclic linear block codes, C_1 with parameters (n_1, k_1, d_1) and C_2 with parameters (n_2, k_2, d_2), where n_i, k_i and d_i (i = 1, 2) stand for the code length, the number of information symbols and the minimum Hamming distance respectively. The product code P = C_1 × C_2 is obtained by placing (k_1 × k_2) information bits in a matrix of k_1 rows and k_2 columns, coding the k_1 rows using code C_2 and coding the n_2 columns using code C_1.
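The construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: both component codes are taken to be the same systematic cyclic (7,4) code with generator polynomial g(x) = x^3 + x + 1, and the function names (`encode`, `product_encode`, `is_codeword`) are hypothetical. The sketch encodes the rows, then the columns, and allows checking that every row and every column of the result is a codeword.

```python
# Assumed component code: systematic cyclic (7,4) code, g(x) = x^3 + x + 1.
N, K = 7, 4
G_POLY = 0b1011


def poly_mod(value):
    """Remainder of an N-bit polynomial divided by g(x) over GF(2)."""
    for shift in range(N - 1, N - K - 1, -1):
        if (value >> shift) & 1:
            value ^= G_POLY << (shift - (N - K))
    return value


def to_int(bits):
    """Pack a list of bits (most significant first) into an integer."""
    out = 0
    for b in bits:
        out = (out << 1) | b
    return out


def encode(info_bits):
    """Systematic encoding: K information bits followed by N-K parity bits."""
    msg = to_int(info_bits) << (N - K)
    word = msg | poly_mod(msg)
    return [(word >> i) & 1 for i in range(N - 1, -1, -1)]


def is_codeword(bits):
    """A word is a codeword when it is divisible by g(x)."""
    return poly_mod(to_int(bits)) == 0


def product_encode(info):
    """Encode a K x K information matrix into an N x N product codeword."""
    rows = [encode(r) for r in info]  # code the K rows
    cols = [encode([rows[i][j] for i in range(K)]) for j in range(N)]  # code the N columns
    return [[cols[j][i] for j in range(N)] for i in range(N)]
```

Because the component code is linear, the parity rows produced by the column encoding are themselves codewords of the row code, which is the property stated in the next paragraph.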
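Since C_1 and C_2 are linear codes, it can be shown that all n_1 rows are codewords of C_2, exactly as all n_2 columns are codewords of C_1 by construction. Furthermore, the parameters of the resulting product code P are given by the products of the component code parameters: its length is n_1 × n_2, its number of information symbols is k_1 × k_2 and its minimum Hamming distance is d_1 × d_2.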
Thus, it is possible to construct powerful product codes using linear block codes. In the following sections, we will consider a square product code, meaning that n_1 = n_2 = n. The most commonly used component codes are Bose-Chaudhuri-Hocquenghem (BCH) codes. These codes form an infinite class of linear cyclic block codes that are capable of multiple error detection and correction. Product codes were adopted in 2001 as an optional error correcting code system for both the uplink and downlink of the IEEE 802.16 standard (WiMAX) [9]. Despite the existence of other decoding algorithms [10], the Chase-Pyndiah algorithm is known to give the best tradeoff between performance and decoding complexity [11].
The Chase-Pyndiah SISO algorithm for a t = 1 BCH code [12]
7. For each symbol of the DW,
• Add extrinsic information (multiplied by α_it) to the channel received word,
The α_it coefficient damps the decoding decisions during the first iterations. As detailed in [13], the decoding parameters p, T and Cw (the number of least reliable bits, of test patterns and of competitor words, respectively) have a notable effect on decoding performance. The number of quantization bits q of the soft information also affects performance.
PARALLEL DECODING OF PRODUCT CODES
Previous work
Many TPC decoder architectures have been designed previously. The classical high-speed approach uses a structure that is pipelined at the iteration level: separate decoding resources are assigned to each half-iteration. In existing architectures, the matrix has to be reconstructed between iterations: memory blocks are used between half-iterations to store [R_it] and [R]. Each interleaving memory block is then composed of four memories of q × n^2 symbols. This solution has several drawbacks. First, a large amount of memory is required, which increases the global latency and the complexity of the design. In addition, increasing the parallelism degree of each half-iteration produces memory conflicts when several data have to be addressed at the same time.
In 2002, a new architecture was proposed [5] in order to increase the parallelism degree without any extra storage between half-iterations. The idea was to store several product code matrix symbols at the same address and to use elementary decoders able to process m symbols during the same clock period (denoted as m-decoders). A half-iteration structure includes m decoders, each decoding m symbols in one clock period, and an interleaving memory of size 4qn^2. The resulting throughput is O(m^2) while the overhead factor of the decoder complexity is ∼ m^2/2. In [6], the authors suggested using a barrel shifter between the decoding resources and the interleaving memory (IM) in order to avoid memory conflicts. This solution reaches the parallelism rate P = n by rotating the data to be stored. The extra complexity only consists of a barrel shifter with a complexity in O(n log(n)). However, the IM requirement is still prohibitive.
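The idea of rotating the data before storage can be illustrated with a generic skewed-storage sketch. It is not claimed to be the exact scheme of [6]: the bank mapping (element (r, c) stored in bank (c + r) mod n at address r) and the function names are assumptions chosen for illustration. With this mapping, both a full row and a full column touch n distinct banks, so n symbols can be accessed in the same clock period without conflict.

```python
def store_skewed(matrix):
    """Store element (r, c) in bank (c + r) % n at address r (row r rotated by r)."""
    n = len(matrix)
    banks = [[None] * n for _ in range(n)]  # banks[bank][address]
    for r in range(n):
        for c in range(n):
            banks[(c + r) % n][r] = matrix[r][c]
    return banks


def column_banks(n, c):
    """Banks accessed when reading column c: one per row."""
    return [(c + r) % n for r in range(n)]


def row_banks(n, r):
    """Banks accessed when reading row r: one per column."""
    return [(c + r) % n for c in range(n)]


def read_column(banks, c):
    """Read column c in one parallel access (each symbol comes from a different bank)."""
    n = len(banks)
    return [banks[(c + r) % n][r] for r in range(n)]


def read_row(banks, r):
    """Read row r in one parallel access, undoing the rotation."""
    n = len(banks)
    return [banks[(c + r) % n][r] for c in range(n)]
```

The rotation by r positions is the operation a barrel shifter performs, which is why its cost grows as O(n log(n)) while the memory itself still holds the whole matrix.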
In [7] [13], an IM-less architecture is detailed and prototyped on an FPGA device. In this architecture, a particular scheduling of the product code matrix decoding enables the interleaving memory to be replaced by an interconnection network (omega network). The complexity of the interleaving resources is then drastically reduced and highly parallel structures can be implemented on low-cost targets.
Table 1. Throughput and complexity order of previous high-speed architectures
These three architectures can reach high parallelism degrees (i.e. high throughput). However, as illustrated in table 1, the internal memory complexity remains an important issue in the design of high-throughput TPC decoders (s is the number of pipeline stages inside the SISO decoder). Moreover, in these architectures, the decoding resources consist of duplicated sequential decoders. Increasing the parallelism rate by duplicating computation resources is inefficient since the reuse of available resources is not optimized. In the next section, we propose to merge the duplicated sequential SISO decoders into one full-parallel SISO decoder.
Proposed IM-free architecture using full-parallel SISO decoder
Assuming that a SISO decoder able to process n symbols in parallel in one clock period can be designed, a product code matrix can be decoded without any interleaving resource, as shown in figure 1. At t = 0, the full-parallel SISO decoder processes column 1. During the next clock period, n sequential SISO decoders start decoding the first symbol of each row while the parallel decoder processes column 2. During the n-th clock period, the sequential decoders complete the decoding of the matrix while the parallel decoder is already decoding the next matrix.
Data generated by the parallel decoder are immediately used by a sequential decoder. Consequently, no IM or data routing resources are required between the full-parallel decoder and the sequential decoders. The resulting proposed architecture and the typical previous architecture for one iteration are depicted in figure 2.
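The scheduling can be checked with a small simulation. The sketch below is an idealized model, assuming that the full-parallel decoder has a latency of one clock period and that the decoders simply pass symbols through; `simulate_schedule` is a hypothetical name. It shows that row decoder i receives the symbols of row i in order, one per clock period, with only the current column in flight.

```python
def simulate_schedule(matrix):
    """Model the column-parallel / row-sequential schedule on an n x n matrix.

    At clock t the parallel decoder outputs column t; at clock t + 1 each of
    the n sequential decoders consumes its own symbol of that column.
    Returns the symbol stream seen by each sequential (row) decoder.
    """
    n = len(matrix)
    received = [[] for _ in range(n)]  # stream consumed by each row decoder
    in_flight = None                   # column produced during the previous clock
    for t in range(n + 1):
        if in_flight is not None:
            for i in range(n):
                received[i].append(in_flight[i])
        in_flight = [matrix[i][t] for i in range(n)] if t < n else None
    return received
```

Since the only storage between the two decoding stages is the column currently in flight (n symbols), no matrix-sized interleaving memory appears in this model.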
Algorithmic parameter reduction
As explained in section 2, the Chase-Pyndiah algorithm includes parameters (p, T, Cw, q) which affect both the performance and the complexity of the turbo decoding. With 8 iterations, the parameter set p_0 = {p = 5, T = 16, Cw = 3, q = 5} gives the best performance for a reasonable complexity [11]. However, algorithmic simulations showed that the reduced parameter set p_1 = {p = 3, T = 8, Cw = 0, q = 5} only induces a performance loss of 0.25dB at BER = 10^−6, and this loss becomes null below BER = 10^−9. Consequently, using p_1 simplifies the architecture for a limited performance loss. Figure 3 depicts the architecture of the full-parallel SISO decoder. It was first designed as a fully combinatorial circuit; a critical path study then guided the insertion of pipeline stages within the structure.
Full-parallel SISO decoder architecture
Reception stage
The reception stage corresponds to steps (1-3) of the Chase-Pyndiah algorithm detailed in section 2.
The syndrome of the incoming vector R_it can be derived as S(R_it) = H × sign(R_it), where H is the parity check matrix of the BCH code. A straightforward implementation of such a matrix multiplication is depicted in figure 4. The H matrix, the corresponding parity check equations and the implementation of the syndrome S(t_0) = [s_2, s_1, s_0] of a (7,4) BCH code are detailed.
Fig. 4. BCH(7,4) code: (a) Parity check matrix (b) Parity check equations (c) Syndrome parallel computation implementation
It can be noticed that some parity check equations have common terms. For instance, the term (x_1 ⊕ x_0) is used in the computation of both s_1 and s_2. This enables a reuse of computation resources for an even more efficient implementation. The parity of the incoming vector R_it is computed with a similar structure by "xoring" (n − 1) incoming bits.
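A minimal sketch of the parallel syndrome computation for a (7,4) code follows. The parity check matrix is an assumption: its column j is the 3-bit representation of α^j for g(x) = x^3 + x + 1, which may differ from the form drawn in figure 4 (so the shared term here is x_4 ⊕ x_5 rather than x_1 ⊕ x_0). The mapping of a negative soft value to bit 1 is also an assumption.

```python
# Assumed parity check matrix of a cyclic (7,4) code: H_COLS[j] is the column
# associated with bit x_j, written as the integer [s2 s1 s0].
H_COLS = [0b001, 0b010, 0b100, 0b011, 0b110, 0b111, 0b101]


def hard_bits(r):
    """Hard decision on soft values (assumed mapping: negative value -> bit 1)."""
    return [1 if v < 0 else 0 for v in r]


def syndrome(bits):
    """S = H x bits over GF(2): XOR of the columns selected by the set bits."""
    s = 0
    for j, b in enumerate(bits):
        if b:
            s ^= H_COLS[j]
    return s


def syndrome_equations(x):
    """Same syndrome written as explicit parity check equations with a shared term."""
    shared = x[4] ^ x[5]              # computed once, used by s1 and s2
    s2 = x[2] ^ shared ^ x[6]
    s1 = x[1] ^ x[3] ^ shared
    s0 = x[0] ^ x[3] ^ x[5] ^ x[6]
    return (s2 << 2) | (s1 << 1) | s0
```

In hardware, each equation is an XOR tree evaluated in a single clock period; sharing a sub-term between two trees removes one XOR gate per reuse.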
Selecting the least reliable bits of the incoming vector in parallel requires a sorting network. Such structures are composed of interconnected Compare and Select (CS) operators. The interconnection scheme depends on the sorting algorithm considered. Many parallel sorting algorithms are conceivable [14]. However, most of them are optimized for a complete sorting, while the Chase-Pyndiah algorithm only requires a partial sorting (i.e. extracting p minima). Consequently, we devised a network optimized, in terms of area and critical path, for the partial sorting of p = 3 values among n = 32, as depicted in figure 5. The structure is based on shuffle networks coupled with local minima computation blocks. After the first shuffle stage, min_1 is in the lower section while the upper section can contain either min_2 or min_3 or no minimum. The same reasoning is applied recursively. After 5 shuffle stages, the minimum is determined while 5 values can still be min_2 and min_3. A local sorting of these 5 values determines min_2 and min_3. This partial sorting network requires 35 CS and 29 minimum elements. The critical path consists of 9 comparison stages.
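The notion of partial sorting can be illustrated with a simple functional model. The sketch below is not the shuffle-based network of figure 5: it is a plain binary tree, chosen for clarity, in which every node merges the candidate lists of its two children and keeps only the p smallest. The names `merge_keep` and `partial_sort` are hypothetical, and n is assumed to be a power of 2.

```python
def merge_keep(a, b, p):
    """Merge two ascending lists and keep only the p smallest entries."""
    out = []
    i = j = 0
    while len(out) < p and (i < len(a) or j < len(b)):
        if j >= len(b) or (i < len(a) and a[i] <= b[j]):
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    return out


def partial_sort(reliabilities, p=3):
    """Return the p smallest (reliability, position) pairs and the tree depth."""
    nodes = [[(v, pos)] for pos, v in enumerate(reliabilities)]
    stages = 0
    while len(nodes) > 1:
        # One tree level: all pairs of neighbours are merged in parallel.
        nodes = [merge_keep(nodes[k], nodes[k + 1], p)
                 for k in range(0, len(nodes), 2)]
        stages += 1
    return nodes[0], stages
```

For n = 32 the tree has 5 levels, the same number as the shuffle stages mentioned above; the network of figure 5 obtains a smaller operator count by discarding values that can no longer be among the three minima.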
Test pattern processing stage
The test pattern processing stage corresponds to steps (4-5) of the Chase-Pyndiah algorithm detailed in section 2. Instead of being processed sequentially, the test patterns are processed in parallel. The syndrome of each test pattern is computed by adding to S(t_0) the positions of the inverted least reliable bits. The parity management block computes the parity of R_it+1 from the parity of R_it and the detection of an error, which is the case when S(t_i) ≠ 0. The metric of each test pattern is then computed by adding the contribution of each inverted bit in the current test pattern (least reliable bits, syndrome-corrected bit and the new parity bit). The minimum metric is determined in the DW selection block, whose structure is a simple minimum selection tree. The multiplexer selects R_it(S(t_i)) in order to compute the test pattern metrics.
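The test pattern stage can be sketched as follows for a non-extended (7,4) code, so the parity bit handling is left out. The parity check columns, the bit mapping and the name `decode_chase` are assumptions for illustration. Each test pattern syndrome is derived from S(t_0) by XORing the columns of the inverted positions, a non-zero syndrome designates one more bit to correct, and the metric is the sum of the reliabilities of all inverted bits.

```python
from itertools import combinations

# Assumed parity check columns of a cyclic (7,4) code (column j for bit x_j).
H_COLS = [0b001, 0b010, 0b100, 0b011, 0b110, 0b111, 0b101]


def decode_chase(r, p=3):
    """Return (decided word, metric) over the 2**p test patterns."""
    bits = [1 if v < 0 else 0 for v in r]   # hard decision (assumed mapping)
    rel = [abs(v) for v in r]               # reliabilities
    s0 = 0
    for j, b in enumerate(bits):
        if b:
            s0 ^= H_COLS[j]
    lrb = sorted(range(len(r)), key=lambda j: rel[j])[:p]  # least reliable bits
    best = None
    for size in range(p + 1):
        for flips in combinations(lrb, size):
            s = s0
            for j in flips:
                s ^= H_COLS[j]              # incremental syndrome update
            flipped = set(flips)
            if s != 0:                      # error detected: correct one more bit
                flipped ^= {H_COLS.index(s)}
            metric = sum(rel[j] for j in flipped)
            if best is None or metric < best[0]:
                best = (metric, flipped)
    metric, flipped = best
    word = [b ^ (1 if j in flipped else 0) for j, b in enumerate(bits)]
    return word, metric
```

The loop over test patterns is sequential here only because it is software; in the architecture described above, all patterns are evaluated at the same time and the minimum is picked by a selection tree.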
Soft output computation stage
The last stage is a duplication of n soft output computation blocks. As shown in figure 6, this block first computes the new reliability F_it of each symbol. Since no competitor is considered, the β value is systematically assigned. The β value is based on an estimation of the competitor word metric value. It is calculated from the reliability of the corrected bit and the least reliable bits. Then, the extrinsic information is computed and damped by the coefficient α_it, which is chosen to be a power of 2 so that the multiplication becomes a simple bit shift. Finally, the channel information is added to generate the soft output R_it+1. Within this block, all computations are performed in sign and magnitude format. Other arithmetic formats were explored, but the chosen one requires fewer computation resources than the others.
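The damping by a power of 2 can be sketched as below. Several points are assumptions not stated in this section: the extrinsic information is taken in the usual Pyndiah form W = F × D − R_it (D being the ±1 decision), α_it = 2^−shift, and the result is saturated to a q-bit sign and magnitude range. The function name and argument names are hypothetical.

```python
def soft_output(r_channel, r_in, decided_bit, reliability, shift, q=5):
    """Compute one soft output symbol with integer arithmetic.

    r_channel   : channel value R of the symbol
    r_in        : decoder input R_it of the symbol
    decided_bit : bit of the decided word (assumed mapping: 1 -> negative)
    reliability : new reliability F_it (non-negative)
    shift       : alpha_it = 2 ** -shift
    """
    sign = -1 if decided_bit else 1
    extrinsic = sign * reliability - r_in   # assumed form W = F * D - R_it
    # Sign and magnitude damping: shift the magnitude, keep the sign.
    magnitude = abs(extrinsic) >> shift
    damped = magnitude if extrinsic >= 0 else -magnitude
    out = r_channel + damped                # add the channel information
    limit = (1 << (q - 1)) - 1              # q-bit sign and magnitude range
    return max(-limit, min(limit, out))
```

Shifting the magnitude rather than a two's complement value makes the damping symmetric for positive and negative extrinsic values, which is one practical advantage of the sign and magnitude format.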
AREA VS THROUGHPUT TRADEOFF
In this section, we compare the parallel and the sequential SISO decoders in terms of throughput, complexity and efficiency. Then a pipelined version of the parallel SISO decoder is proposed. Logic syntheses were performed using Synopsys Design Compiler with an STMicroelectronics 90nm CMOS process. The area is expressed as an equivalent logic gate count, where one equivalent logic gate corresponds to the area of a 2-input NAND gate. This gives a more technology-independent measure of the complexity.
Fig. 6. Soft output computation stage
Logic synthesis results
We designed 5 versions of the (32,26) BCH parallel SISO decoder, with 1 to 5 pipeline stages. The 1-pipeline-stage version is a fully combinatorial architecture with register banks only at the input and output. Table 2 summarizes the synthesis results of the 5 full-parallel SISO decoders and compares them with n duplicated sequential SISO decoders.
s is the number of pipeline stages, f_max is the maximum working frequency reached during synthesis, A is the area of the design in equivalent gate count, and the parallelism rate P is the number of symbols processed per clock period. T_in is the input throughput, T_in = P × f, and T_out is the information throughput, T_out = R_p × P × f. Increasing the throughput regardless of the turbo decoder complexity is not relevant. In order to compare the throughput and complexity of SISO decoders, the efficiency of each decoder is defined as: η =
Table 2 shows that the parallel SISO decoder can reach a high throughput with a low complexity. This clearly shows the higher efficiency of the parallel SISO decoder compared with a duplication of n sequential decoders. A fully parallel structure enables a better reuse of computation and memory resources. For instance, in a parallel SISO decoder, the memory requirement for R and R_it is O(2sqn), while in a sequential-based parallel decoder it is O(2sqn^2). Since the resulting throughput is limited by the slowest stage, the maximum throughput of a full-iteration module is 14.8Gb/s. In this case, a full iteration of the proposed architecture (see figure 2b) takes 236Kgates while the previous architecture (see figure 2a) takes 400Kgates.
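The throughput definitions and the memory orders quoted above can be evaluated numerically. In the sketch below, R_p is assumed to be the rate of the square product code, (k/n)^2, and the clock frequency used in the example is a hypothetical value chosen for illustration, not a figure taken from Table 2.

```python
def throughput(n, k, parallelism, f_hz):
    """Return (T_in, T_out) in bit/s, with T_in = P * f and T_out = R_p * P * f."""
    rate = (k / n) ** 2                 # assumed R_p of the square product code
    t_in = parallelism * f_hz
    t_out = rate * parallelism * f_hz
    return t_in, t_out


def memory_orders(s, q, n):
    """Memory orders for R and R_it: (full-parallel decoder, n sequential decoders)."""
    return 2 * s * q * n, 2 * s * q * n * n


# Example with a hypothetical 700 MHz clock for the (32,26) component code.
t_in, t_out = throughput(n=32, k=26, parallelism=32, f_hz=700e6)
parallel_mem, duplicated_mem = memory_orders(s=5, q=5, n=32)
```

With s = 5, q = 5 and n = 32, the two memory orders are 1600 and 51200 bits respectively, a ratio of n = 32 in favour of the full-parallel decoder.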
Towards a maximal parallelism rate
The previous synthesis results showed the higher efficiency of parallel architectures compared with duplicated architectures.
Alternative turbo decoding scheduling for enhanced parallelism rate
Figure 5.2 proposes an alternative parallel decoding scheme for the product code matrix, where m parallel decoders are used for column decoding while row decoding is performed by n m-decoders. An m-decoder can decode m symbols in one clock period, with 1 < m < n [5]. In such an architecture, the maximum reachable parallelism rate P = n^2 can be achieved by using n full-parallel SISO decoders for column decoding and n full-parallel SISO decoders for row decoding. Considering that the TPC decoder includes 2n parallel SISO decoders (with s = 5 pipeline stages) for it decoding iterations, the turbo decoder output throughput is
Placing and routing 2n parallel SISO decoders would reduce the maximum working frequency of the parallel SISO decoder. Nevertheless, using the previous equation, an information throughput T_out = 10Gb/s is reached when f > 88MHz, which is most probably an achievable frequency. Furthermore, a reasonable working frequency of f = 300MHz leads to a (32,26)^2 BCH product code turbo decoder with an information throughput T_out = 33.7Gb/s. The total area of such a parallel turbo decoder is A = 10μm^2. The achieved parallelism rate (P = n^2 = 1024) is similar to that of a fully parallel n = 1024 LDPC decoder. The number of interconnections in the product code turbo decoder is I_n(TPCD) = 2 × n_BCH^2 × q, while an equivalent fully parallel 1024-LDPC decoder would have I_n(1024-LDPC) = n_LDPC × q × d_v, where d_v represents the variable node degree. Consequently, as long as d_v > 2, the following inequality holds: I_n(TPCD) < I_n(1024-LDPC).
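The interconnection comparison can be checked with a direct evaluation of the two expressions. The function names are hypothetical, and q = 5 is taken from the reduced parameter set used earlier in the article.

```python
def interconnect_tpc(n_bch, q):
    """I_n(TPCD) = 2 * n_BCH^2 * q."""
    return 2 * n_bch ** 2 * q


def interconnect_ldpc(n_ldpc, q, d_v):
    """I_n(LDPC) = n_LDPC * q * d_v, with d_v the variable node degree."""
    return n_ldpc * q * d_v


# (32,26)^2 product code versus a 1024-bit LDPC code, q = 5 quantization bits.
tpc_wires = interconnect_tpc(32, 5)
ldpc_wires = {d_v: interconnect_ldpc(1024, 5, d_v) for d_v in (2, 3, 4, 6)}
```

Because 2 × 32^2 = 2 × 1024, the two counts are equal for d_v = 2 and the LDPC count grows linearly with d_v beyond that point.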
CONCLUSION
The complexity of high-throughput TPC decoder architectures is made prohibitive by the amount of memory usually required for data interleaving and pipelining. In this article, we proposed an innovative scheduling of the product code matrix decoding which removes the need for any interleaving resource. The resulting architecture requires a full-parallel SISO decoder able to decode n symbols in one clock period. Such a SISO decoder architecture is described and includes a new optimized parallel sorting network. ASIC-based logic synthesis showed the higher efficiency of the proposed architecture: compared with a previous architecture, the area is reduced while the throughput is the same. Finally, in order to enhance the parallelism rate and the throughput, a slightly different scheduling is proposed. With such a scheduling, the maximum parallelism rate O(n^2) can be achieved. An area estimation showed that a (32,26)^2 BCH code can be decoded at 33.7Gb/s with an estimated silicon area of 10μm^2.
