Full-parallel architecture for turbo decoding of product codes by Jego, Christophe et al.
Full-parallel architecture for turbo decoding of product
codes
Christophe Jego, Patrick Adde, Camille Leroux
To cite this version:
Christophe Jego, Patrick Adde, Camille Leroux. Full-parallel architecture for turbo decoding of
product codes. Electronics Letters, IET, 2006, 42 (18), pp.1052 -1053. <10.1049/el:20062168>.
<hal-00538604>
HAL Id: hal-00538604
https://hal.archives-ouvertes.fr/hal-00538604
Submitted on 23 Nov 2010
Full-parallel architecture for turbo decoding  
of product codes  
 
 
C. Jégo, P. Adde and C. Leroux 
 
 
 
A full-parallel architecture for turbo decoding, which achieves ultra-high data
rates when using product codes as error-correcting codes, is proposed. This
architecture is able to decode product codes using binary BCH or m-ary
Reed-Solomon component codes. The major advantage of the architecture is that it
enables the memory blocks between all half-iterations to be removed.
Moreover, the latency of the turbo decoder is strongly reduced. The
proposed architecture opens the way to numerous applications such as
optical transmission and data storage. In particular, the block turbo decoding
architecture can support optical transmission at data rates above 10 Gb/s.
 
Introduction: In recent years, turbo codes [1] have been adopted by several
digital communication applications. They are particularly attractive for
increasing transmission rates and/or guaranteeing Quality of Service (QoS).
Research is under way to use turbo codes to protect data stored on
hard drives or DVDs and in fiber-optic transmission. The earliest forward
error correction (FEC) schemes for optical communication [2] employed the
well-known Reed-Solomon (RS) codes to recover the degradation in bit error
rate (BER) caused by fiber nonlinearity and polarization-dependent
phenomena. The RS(255,239) code provides a net coding gain of around 6 dB.
The very high-speed data transmission developed for fiber-optic networking
systems necessitates the implementation of ultra-high-speed FEC
architectures to meet the continuing demand for ever higher data rates.
Currently, the RS(255,239) code can be used in ultra-high-speed
(40 Gb/s [3] and 80 Gb/s [4]) fiber-optic systems. More powerful FEC schemes
such as Block Turbo Codes (BTC) have a theoretical potential net coding gain
of around 10 dB with a redundancy overhead of less than 25 % [5]. Typically,
realistic block turbo codes can operate at less than 1 dB from
the Shannon limit for a binary symmetric channel. In 2005, Mitsubishi Electric
announced the development of the first block turbo decoder for 10 Gb/s
optical transmission [6].
 
Previous work: Many block turbo decoder architectures have been designed
previously. The classical approach involves decoding all the rows or all the
columns of a matrix before the next half-iteration. When an application
requires high-speed decoders, one architectural solution is to cascade SISO
elementary decoders, one for each half-iteration. In this case, memory blocks
are necessary between successive half-iterations to store channel data and
extrinsic information. Each memory block is composed of four memories of
q*n^2 symbols, where q is the number of bits used to quantize the matrix
symbols. Thus, duplicating a SISO elementary decoder (e_dec) means
duplicating the memory block, which is very costly in terms of area. In 2002,
a new architecture for turbo decoding of product codes was proposed [7]. The
idea is to store several data at the same address and to perform parallel
decoding to increase the data rate. However, these data must be processed
both by row and by column. Consider m adjacent rows and m adjacent columns of
the initial matrix. The m^2 data at their intersection constitute one word of
a new matrix that has m^2 times fewer addresses. This data organization does
not require any particular memory architecture. The results obtained show
that the turbo decoding throughput is increased by a factor of m^2 when m
elementary decoders, each processing m data simultaneously, are used, and
that the latency is divided by m. The area of the m elementary decoders
(m-e_dec) is increased by a factor of m^2/2 while the memory size remains
constant.
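
The data organization of [7] can be illustrated with a short Python sketch. It is only an illustration under assumptions: the function name pack_words, the zero-based indices and the dictionary used as a memory are hypothetical choices, not details taken from [7]. It groups an n x n matrix into (n/m)^2 words of m^2 symbols and shows that one word supplies m symbols to each of m rows, or equally to each of m columns.

```python
def pack_words(matrix, m):
    """Group an n x n matrix into words of m x m symbols (hypothetical layout).

    Word (bi, bj) holds the symbols at the intersection of rows
    bi*m .. bi*m+m-1 and columns bj*m .. bj*m+m-1.
    """
    n = len(matrix)
    assert n % m == 0, "n is assumed to be a multiple of m"
    words = {}
    for bi in range(n // m):
        for bj in range(n // m):
            words[(bi, bj)] = [
                [matrix[bi * m + r][bj * m + c] for c in range(m)]
                for r in range(m)
            ]
    return words


def row_slices(word):
    """The m row fragments of a word: m symbols for each of m row decoders."""
    return [list(r) for r in word]


def column_slices(word):
    """The m column fragments of a word: m symbols for each of m column decoders."""
    m = len(word)
    return [[word[r][c] for r in range(m)] for c in range(m)]


n, m = 8, 2
matrix = [[(i, j) for j in range(n)] for i in range(n)]  # symbol tagged by its position
words = pack_words(matrix, m)

print(len(words))                    # number of addresses: n^2 / m^2
print(row_slices(words[(1, 2)]))     # fragments of rows 2 and 3
print(column_slices(words[(1, 2)]))  # fragments of columns 4 and 5
```

Because the same word serves both the row and the column view, a single read per address is enough in either decoding direction, which is why no particular memory architecture is needed.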
 
Full-parallel turbo decoding principle: The codewords of all rows (or all
columns) of a matrix can be decoded in parallel. If the architecture is
composed of n elementary decoders, an appropriate treatment of the matrix
eliminates the need to reconstruct the matrix between successive
decodings. Let i and j be the indices of a row and a column of the n x n
matrix. In full-parallel processing, row decoder i begins decoding its
codeword at the symbol in the ith position and then processes the
codeword symbols by increasing the index by one modulo n. Similarly,
column decoder j begins decoding its codeword at the symbol in the jth
position and then processes the codeword symbols by
decreasing the index by one modulo n. Thus only one time cycle is necessary
between two successive decodings of the matrix. The full-parallel decoding of
an n x n product code matrix is detailed in Figure 1. A similar strategy was
previously presented in [8]. In that case, the access conflicts of n
independent RAM memories are eliminated by the appropriate treatment of the
matrix. The elementary decoder latency can be defined as the number of
symbols processed by the decoder during the decoding of one symbol. This
latency L depends on the structure of the elementary decoder and on the
codeword length n. As the matrix reconstruction is removed, the latency
between row and column decoding is null.
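
The index rules above can be checked with a small Python sketch. It assumes zero-based indices and one symbol per decoder per time cycle; the function names are hypothetical. At every cycle t the n row decoders touch n different columns, and the symbol produced by row decoder i is exactly the one that column decoder (i + t) mod n expects at the same step of its own sequence.

```python
def row_symbol(i, t, n):
    """Column index of the symbol handled by row decoder i at cycle t."""
    return (i + t) % n


def col_symbol(j, t, n):
    """Row index of the symbol handled by column decoder j at cycle t."""
    return (j - t) % n


def schedule_is_consistent(n):
    """Check the diagonal schedule for an n x n matrix."""
    for t in range(n):
        # Each cycle, the row decoders must hit n distinct columns.
        columns = [row_symbol(i, t, n) for i in range(n)]
        if sorted(columns) != list(range(n)):
            return False
        # The symbol (i, j) leaving row decoder i must be the one
        # column decoder j asks for at the same cycle index.
        for i in range(n):
            j = row_symbol(i, t, n)
            if col_symbol(j, t, n) != i:
                return False
    return True


print(schedule_is_consistent(32))  # True
print([row_symbol(i, 1, 4) for i in range(4)])  # [1, 2, 3, 0]
```

Under these assumptions, the exchange needed at cycle t is a cyclic shift by t positions between row decoder outputs and column decoder inputs.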
 
Full-parallel turbo decoder for product codes: The major advantage of the
full-parallel architecture is that it enables the memory block of 4q*n^2
symbols between successive half-iterations to be removed. However, the
codeword symbols exchanged between the row and column decoders have to be
switched. One solution is to use a connection network for this task. In our
case, we have chosen an Omega network. The Omega network is one of several
connection networks that are used in parallel machines [9]. It is composed of
log2(n) stages, each having n/2 exchange elements. The Omega network
complexity in terms of the number of connections and of 2x2 switch transfer
blocks is n*log2(n) and (n*log2(n))/2 respectively. For example, the
equivalent gate complexity of a 32x32 network can be estimated to be 200 per
exchange bit. The proposed full-parallel architecture for product codes is
presented in Figure 2. It is composed of cascaded modules for the block turbo
decoder. Each module is dedicated to one iteration. However, it is also
possible to process several iterations with the same module. In our approach,
2n elementary decoders and 2 connection networks are necessary for one
module. The full-parallel turbo decoder complexity therefore depends
essentially on the complexity of the elementary decoder. In order to compare
our architectural solution with the previous solutions, Table 1 gives the
features of these architectures. The features depend on the following
parameters: codeword length in symbols n, number of decoding iterations it,
elementary decoder throughput Dref, elementary decoder latency L, number of
symbol quantization bits q and number of grouped adjacent symbols m.
The e_dec and m-e_dec architecture types correspond to the classical
solution and the solution in [7] respectively.
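
The Omega network figures can be illustrated with a Python sketch. It rests on two assumptions that the text does not state explicitly: that the network uses the textbook structure (a perfect shuffle followed by 2x2 exchange elements at each stage, with destination-tag routing), and that the pattern to be routed at cycle t is a cyclic shift by t, as the index rules of the decoding principle suggest. All function names are hypothetical.

```python
def omega_size(n):
    """Stages, 2x2 exchange elements and connections of an n x n Omega network."""
    stages = n.bit_length() - 1          # log2(n) for n a power of two
    return stages, (n // 2) * stages, n * stages


def omega_route(src, dst, n):
    """Positions occupied after each stage when routing src -> dst."""
    k = n.bit_length() - 1
    pos, path = src, []
    for s in range(k):
        # Perfect shuffle: rotate the k-bit position left by one bit.
        pos = ((pos << 1) | (pos >> (k - 1))) & (n - 1)
        # Exchange element: output chosen by the destination bit, MSB first.
        pos = (pos & ~1) | ((dst >> (k - 1 - s)) & 1)
        path.append(pos)
    return path


def shift_is_conflict_free(n, t):
    """True if the cyclic shift i -> (i + t) mod n uses no link twice."""
    paths = [omega_route(i, (i + t) % n, n) for i in range(n)]
    stages = n.bit_length() - 1
    return all(len({p[s] for p in paths}) == n for s in range(stages))


print(omega_size(32))                                         # (5, 80, 160)
print(all(shift_is_conflict_free(32, t) for t in range(32)))  # True
```

For n = 32 this gives 5 stages, 80 exchange elements and 160 connections, matching n*log2(n) and (n*log2(n))/2, and every cyclic shift passes through the network in a single pass without conflict.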
 
Towards the implementation of architectures for ultra-high rates: Using the
full-parallel decoding principle, block turbo decoders with BCH component
codes have been implemented. An architecture for the BCH(32,26)^2 product
code, with single-error correction capability, was synthesized. The decoding
algorithm uses q=4 quantization bits, 8 test vectors, 1 competitor and it=4
iterations. The elementary decoding of a codeword is split into three
pipelined phases. Each phase requires 32/m clock periods and the elementary
decoder latency is equal to 64/m clock periods. Syntheses were performed
using the Synopsys tool with an STMicroelectronics 0.09-µm CMOS process
target library. Two architecture types were chosen: e_dec as the reference
and 4-e_dec, where m=4 symbols are processed simultaneously by an elementary
decoder. The elementary decoders have a clock period of 2 ns, which
corresponds to a frequency of 500 MHz. The estimated area complexity in
terms of equivalent gates is 4400 for the BCH(32,26) e_dec and 5700 for the
BCH(32,26) 4-e_dec. This complexity includes all the elementary decoder
elements (processing and memorization). The processing unit gate counts of
the block turbo decoders are the same for the previous and proposed
architectures: 1.13 and 1.45 million equivalent gates for e_dec and 4-e_dec
respectively. The latency is strongly reduced for the proposed architecture.
It decreases from 270336 to 512 for e_dec and from 5120 to 128 for 4-e_dec.
The memory complexity of the previous architecture in terms of equivalent
gates is 126400, which corresponds to 10 percent of the BCH(32,26)^2 block
turbo decoder complexity. By contrast, the equivalent gate complexity of the
connection networks is only 5600 for the proposed architecture of the
BCH(32,26)^2 block turbo decoder.
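
As a rough cross-check, several of the figures above can be reproduced from the Table 1 expressions for the proposed architecture. The Python sketch below is an interpretation, not the authors' calculation. In particular, it assumes that the 5600-gate network figure is (2it-1) networks times 200 gates per exchange bit times q bits, and that the 10 percent memory share is taken relative to processing plus memory for the e_dec version. The function name is hypothetical.

```python
n, it, q = 32, 4, 4   # codeword length, iterations, quantization bits (from the text)
L = 64                # elementary decoder latency for m = 1 (64/m clock periods)


def proposed_figures(gates_per_decoder, m):
    """Decoder count, processing gates and latency from the Table 1 expressions."""
    decoders = n * 2 * it                    # n*2it elementary decoders
    processing_gates = decoders * gates_per_decoder
    latency = 2 * it * (L // m)              # 2it*(L/m)
    return decoders, processing_gates, latency


e_dec = proposed_figures(4400, 1)
four_e_dec = proposed_figures(5700, 4)

# ASSUMPTION: 200 gates per exchange bit, q bits per symbol, 2it-1 networks.
network_gates = (2 * it - 1) * 200 * q

# ASSUMPTION: share of memory in (processing + memory) for the previous e_dec design.
memory_share = 126400 / (e_dec[1] + 126400)

print(e_dec)                   # (256, 1126400, 512)
print(four_e_dec)              # (256, 1459200, 128)
print(network_gates)           # 5600
print(round(memory_share, 3))  # 0.101
```

Under these assumptions the sketch recovers the quoted 1.13 million gates, the latencies of 512 and 128, the 5600 gates of the connection networks and a memory share close to 10 percent.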
 
Conclusion: A full-parallel turbo decoding architecture for product codes has
been proposed. This architecture enables the memory blocks between all
half-iterations to be removed. Moreover, the latency of the turbo decoder is
strongly reduced. The resulting ultra-high-speed FEC architectures meet the
demand for ever higher data rates. In particular, our architectural solution
can support optical transmission at data rates above 10 Gb/s. In this context,
using more powerful FEC schemes such as block turbo codes opens up new
opportunities for the next generation of optical communication systems.
 
References 
 
1 Berrou C., Glavieux A., Thitimajshima P.: ‘Near Shannon limit error
correcting coding and decoding: Turbo Codes’, IEEE International
Conference on Communication ICC93, vol. 2/3, May 1993.
 
2 Azadet K., Haratsch E.F., Kim H., Saibi F., Saunders J.H., Shaffer M.S.,
Song L., Yu M.-L.: ‘Equalization and FEC techniques for optical
transceivers’, IEEE Journal of Solid-State Circuits, vol. 37, no. 3,
March 2002, pp. 317-327.
 
3 Song L., Yu M.-L., Shaffer M.S.: ‘10- and 40-Gb/s forward error
correction devices for optical communications’, IEEE Journal of
Solid-State Circuits, vol. 37, no. 11, Nov. 2002, pp. 1565-1573.
 
4 Lee H.: ‘A high-speed low-complexity Reed-Solomon decoder for optical
communications’, IEEE Transactions on Circuits and Systems II: Express
Briefs, vol. 52, no. 8, Aug. 2005, pp. 461-465.
 
5 Mizuochi T.: ‘Recent Progress in Forward Error Correction for Optical
Communication Systems’, IEICE Transactions on Communications, vol.
E88-B, no. 5, May 2005.
 
6 Tagami H., Kobayashi T., Miyata Y., Ouchi K., Sawada K., Kubo K., Kuno
K., Yoshida H., Shimizu K., Mizuochi T., Motoshima K.: ‘A 3-bit
soft-decision IC for powerful forward error correction in 10-Gb/s optical
communication systems’, IEEE Journal of Solid-State Circuits, vol. 40,
no. 8, Aug. 2005, pp. 1695-1705.
 
7 Cuevas J., Adde P., Kerouedan S., Pyndiah R.: ‘New architecture for high 
data rate turbo decoding of product codes’, GLOBECOM 2002, November 
2002, pp. 139-143. 
 
8 Chi Z., Parhi K.K.: ‘High speed VLSI architecture design for block turbo
decoder’, IEEE International Symposium on Circuits and Systems (ISCAS
2002), vol. 1, May 2002, pp. 901-904.
 
9 Lawrie D.H.: ‘Access and alignment of data in an array processor’, IEEE
Trans. Computers, vol. C-24, no. 12, December 1975, pp. 1145-1155.
 
 
Authors’ affiliations: 
C. Jégo, P. Adde  and C. Leroux (GET/ENST Bretagne, CNRS TAMCIC UMR 
2872, Brest, France) 
 
 
Figure captions: 
 
 
Fig. 1 Full-parallel decoding of a product code matrix 
 
 
Fig. 2 Full-parallel architecture for product codes 
 
 
Table 1: Features of different architectures for block turbo decoding
 
 
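Figure 1

[Diagram: n x n product code matrix, origin at (0,0), with n rows of n
symbols and n columns of n symbols; each cell is one symbol. Row decoders
advance along index i with index(i+1) = (i + 1) mod n; column decoders
advance along index j with index(j+1) = (j - 1) mod n.]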
 
 
Figure 2

[Block diagram: stages of elementary decoders for rows 1, 2, ..., n alternate
with stages of elementary decoders for columns 1, 2, ..., n, each stage linked
to the next by a connection network. One row stage, one column stage and
their connection networks are labelled "a module for one iteration".]
 
Table 1

                            Previous architectures                                 Proposed architectures
                            e_dec                 m-e_dec                          e_dec         m-e_dec
latency (symbol number)     n*(2it*n^2+2it*L)     n/m*{2it*(n^2/m^2)+2it*(L/m)}    2it*L         2it*(L/m)
throughput (Gb/s)           n*(2it*Dref)          n/m*(m^2*2it*Dref)               n*(2it*Dref)  n/m*(m^2*2it*Dref)
e_dec number                n*2it                 n/m*(m*2it)                      n*2it         n/m*(m*2it)
memory size (Kb)            2it*4qn^2             2it*4qn^2                        0             0
connection network number   0                     0                                2it-1         2it-1
 
