A highly parallel Turbo Product Code decoder without interleaving resource by Leroux, Cédric et al.
To cite this version:
Cédric Leroux, Christophe Jego, Patrick Adde, Michel Jezequel, D. Gupta. A highly parallel Turbo Product Code decoder without interleaving resource. Signal Processing Systems, 2008. SiPS 2008. IEEE Workshop on, Oct 2008, Washington DC, United States. pp. 1-6, 2008, <10.1109/SIPS.2008.4671728>. <hal-00538606>
HAL Id: hal-00538606
https://hal.archives-ouvertes.fr/hal-00538606
Submitted on 22 Nov 2010
A HIGHLY PARALLEL TURBO PRODUCT CODE DECODER
WITHOUT INTERLEAVING RESOURCE
Camille Leroux, Christophe Jego, Patrick Adde, Michel Jezequel, Deepak Gupta
Institut TELECOM, TELECOM Bretagne, CNRS Lab-STICC
firstname.lastname@telecom-bretagne.eu
ABSTRACT
This article presents an innovative Turbo Product Code (TPC) decoder architecture that requires no interleaving resource. The architecture includes a full-parallel SISO decoder able to process n symbols in one clock period. Synthesis results show that it is more efficient than existing solutions. For a 6-iteration turbo decoder of a $(32,26)^2$ BCH product code, synthesized in a 90 nm CMOS technology, the resulting information throughput is 2.5 Gb/s with an area of 233 Kgates. Finally, a second architecture with a higher parallelism rate is described. Its information throughput is 33.7 Gb/s, while an area estimation gives $A = 10\,\mu\mathrm{m}^2$.
Index Terms— Product codes, Parallel architectures,
Ultra-high-speed integrated circuits
1. INTRODUCTION
Nowadays, high throughput telecommunication systems such as optical fiber transmission systems or passive optical networks require powerful error correcting codes in order to increase their optical budget. Iterative decoding provides effective solutions for next generation optical systems. Recently, a (660,480) LDPC code decoder ASIC implementation was proposed. Its information throughput is 2.4 Gb/s and could be raised to 10 Gb/s with a (2048,1723) LDPC code [1]. Turbo product codes [2] are also good candidates for emerging optical systems [3]. In [4], a block turbo code (BTC) decoder is included in a 12.4 Gb/s optical experimental setup. Since only a part of the transmitted data is actually coded, the information throughput of the BTC turbo decoder is 156 Mb/s. The inherently parallel structure of the product code matrix makes TPC well suited to parallel decoding. Nevertheless, increasing the parallelism rate rapidly leads to a prohibitive amount of memory. Some solutions were proposed in [5][6][7] to use the interleaving resources (IR) efficiently and reach high parallelism rates. However, a particular scheduling of the product code matrix makes it possible to build a turbo decoder architecture without any interleaving resource.
After a brief introduction to TPC coding and decoding concepts in section 2, section 3 reviews previous work and proposes an innovative TPC decoder architecture without any interleaving resource. This TPC decoder includes a new full-parallel SISO decoder architecture, which is described in section 4. Section 5 gives synthesis results and demonstrates the better efficiency of the proposed BTC decoder. Finally, a solution is proposed to increase the parallelism rate while improving the efficiency of the architecture. The interconnection issue is assessed and compared with an equivalent LDPC code decoder implementation.
2. TPC CODING AND DECODING PRINCIPLES
2.1. Product codes
The concept of product codes is a simple and efficient method to construct powerful codes with a large minimum Hamming distance d using cyclic linear block codes [8]. Let us consider two systematic cyclic linear block codes: $C_1$ with parameters $(n_1, k_1, d_1)$ and $C_2$ with parameters $(n_2, k_2, d_2)$, where $n_i$, $k_i$ and $d_i$ ($i = 1, 2$) stand for the code length, the number of information symbols and the minimum Hamming distance respectively. The product code $P = C_1 \times C_2$ is obtained by placing $(k_1 \times k_2)$ information bits in a matrix of $k_1$ rows and $k_2$ columns, coding the $k_1$ rows using code $C_2$ and coding the $n_2$ columns using code $C_1$.
Since $C_1$ and $C_2$ are linear codes, it can be shown that all $n_1$ rows are codewords of $C_2$, exactly as all $n_2$ columns are codewords of $C_1$ by construction. Furthermore, the parameters of the resulting product code P are given by $n_p = n_1 \times n_2$, $k_p = k_1 \times k_2$ and $d_p = d_1 \times d_2$, and the code rate $R_p$ is given by $R_p = R_1 \times R_2$. Thus, it is possible to construct powerful product codes using linear block codes. In the following sections, we consider a square product code, meaning that $n_1 = n_2 = n$. The most commonly used component codes are Bose Chaudhuri Hocquenghem (BCH) codes. These codes form an infinite class of linear cyclic block codes with multiple error detection and correction capabilities. Product codes were adopted in 2001 as an optional correcting code system for both the uplink and downlink of the IEEE 802.16 standard (WiMAX) [9].
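The construction above can be illustrated with a minimal Python sketch. The helper names (`product_code_params`, `spc_encode`, `product_encode`) are hypothetical, the toy component code is a single parity check code rather than a BCH code, and the minimum distance d = 4 used for the (32,26) component code is an assumption (the usual value for an extended code of this size) that the text does not state.

```python
def product_code_params(n1, k1, d1, n2, k2, d2):
    """Return (n_p, k_p, d_p, R_p) of the product code C1 x C2."""
    return n1 * n2, k1 * k2, d1 * d2, (k1 / n1) * (k2 / n2)

def spc_encode(bits):
    """Toy systematic component code: append one even-parity bit (not BCH)."""
    return bits + [sum(bits) % 2]

def product_encode(info, encode):
    """Encode the k1 rows, then every column of the row-encoded matrix."""
    rows = [encode(list(r)) for r in info]          # k1 rows of length n2
    cols = [encode(list(c)) for c in zip(*rows)]    # n2 columns of length n1
    return [list(r) for r in zip(*cols)]            # back to n1 rows

# Parameters of the (32,26)^2 product code; d = 4 is an assumed value.
n_p, k_p, d_p, rate = product_code_params(32, 26, 4, 32, 26, 4)

# Toy 2 x 3 information matrix encoded into a 3 x 4 product codeword.
matrix = product_encode([[1, 0, 1], [0, 1, 1]], spc_encode)
```

With the toy code, every row and every column of `matrix` has even parity, which is the "all rows and all columns are codewords" property stated above.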
2.2. Iterative decoding of product codes
TPC decoding involves successively decoding rows and columns using SISO decoders. Repeating this soft decoding over several iterations decreases the Bit Error Rate (BER). This is known as the TPC decoding process. Each decoder has to compute the soft information $[R']_{it+1}$ from the channel received information $[R]$ and the information $[R']_{it}$ computed during the previous half-iteration.
Despite the existence of other decoding algorithms [10],
the Chase-Pyndiah algorithm is known to give the best trade-
off between performance and decoding complexity [11]. The
Chase-Pyndiah SISO algorithm for a t = 1 BCH code [12][2]
is summarized below. t represents the maximum number of
correctable errors.
1. Search for the p least reliable binary symbols of the previous half-iteration output vector $[R']_{it}$,
2. Compute the syndrome $S(t_0)$ of $[R']_{it}$,
3. Compute the parity of $[R']_{it}$,
4. Generate T test patterns $t_i$ obtained by inverting some of the p least reliable binary symbols ($T \le 2^p$).
5. For each test pattern $t_i$ ($1 \le i \le T-1$)
• Compute the syndrome $S(t_i)$,
• Correct the potential error by inverting the bit at position $S(t_i)$,
• Recompute the parity considering the detection of an error and the parity of $[R']_{it}$,
• Compute the squared Euclidean distance (metric) $M_i$ between $[R']_{it}$ and the considered test pattern $t_i$.
6. Select the Decided Word (DW) among the test patterns having the minimal metric ($M_{DW}$) and choose $C_w$ competitor codewords $c_i$ ($1 < i < C_w$) having the second minimum metric.
7. For each symbol of the DW,
• Compute the new reliability $F_{it} = \min_2(M_i) - \min(M_i)$, or $F_{it} = \beta$ when no competitor exists,
• Compute the extrinsic information $W_{it} = F_{it} - R'_{it}$,
• Add the extrinsic information (multiplied by $\alpha_{it}$) to the channel received word: $R'_{it+1} = R + \alpha_{it} W_{it}$.
The $\alpha_{it}$ coefficient damps the decoding decisions during the first iterations. As detailed in [13], the decoding parameters p, T and $C_w$ have a notable effect on decoding performance. The number of quantization bits q of the soft information also affects performance.
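The seven steps above can be sketched in Python. This is an illustration under stated assumptions, not the authors' hardware: it uses a toy Hamming(7,4) code in which the H column of position j (counted from 1) is the binary form of j, so the syndrome directly gives the error position; the parity step (step 3) is omitted because the toy code has no overall parity bit; every distinct codeword reached from a test pattern is used as a potential competitor; and the reliability is scaled by 1/4 as in the usual formulation of the soft output. The function names and default parameter values are hypothetical.

```python
from itertools import product

N = 7  # toy Hamming(7,4) code (assumption, stands in for the BCH code)

def syndrome(bits):
    """XOR of the 1-based positions of the set bits (0 for a codeword)."""
    s = 0
    for pos, b in enumerate(bits, start=1):
        if b:
            s ^= pos
    return s

def metric(r, bits):
    """Squared Euclidean distance between r and the +/-1 image of bits."""
    return sum((x - (2 * b - 1)) ** 2 for x, b in zip(r, bits))

def chase_pyndiah(r_prev, r_chan, p=3, alpha=0.5, beta=1.0):
    """One SISO half-iteration: returns (decided word, next soft values)."""
    hard = [1 if x >= 0 else 0 for x in r_prev]
    # Step 1: positions of the p least reliable symbols.
    lrb = sorted(range(N), key=lambda j: abs(r_prev[j]))[:p]
    # Steps 4-5: build test patterns, correct them, compute their metrics.
    candidates = {}
    for flips in product((0, 1), repeat=p):
        word = list(hard)
        for j, f in zip(lrb, flips):
            word[j] ^= f
        s = syndrome(word)
        if s:                       # non-zero syndrome: invert position s
            word[s - 1] ^= 1
        candidates[tuple(word)] = metric(r_prev, word)
    # Step 6: decided word = candidate with the minimal metric.
    dw = min(candidates, key=candidates.get)
    m_dw = candidates[dw]
    # Step 7: reliability, extrinsic information and damped update.
    r_next = []
    for j in range(N):
        d = 2 * dw[j] - 1
        rivals = [m for c, m in candidates.items() if c[j] != dw[j]]
        soft = d * ((min(rivals) - m_dw) / 4 if rivals else beta)
        w = soft - r_prev[j]
        r_next.append(r_chan[j] + alpha * w)
    return list(dw), r_next

# All-zero codeword sent as -1 values; position 2 is received with a wrong sign.
received = [-1.0, -0.9, 0.2, -1.1, -0.8, -1.2, -0.7]
decided, soft_out = chase_pyndiah(received, received)
```

In this example the unreliable symbol at index 2 is pulled back to a negative value, and the decided word is the all-zero codeword.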
3. PARALLEL DECODING OF PRODUCT CODES
3.1. Previous work
Many TPC decoder architectures have been designed previously. The classical high-speed approach uses a structure pipelined at the iteration level, with separate decoding resources assigned to each half-iteration. In existing architectures, the matrix has to be reconstructed between iterations: memory blocks are used between each half-iteration to store $[R']_{it}$ and $[R]$. Each interleaving memory block is then composed of four memories of $q \times n^2$ symbols. This solution has several drawbacks. First, a large amount of memory is required, which increases the global latency and the complexity of the design. In addition, increasing the parallelism degree of each half-iteration produces memory conflicts when several data have to be addressed at the same time.
In 2002, a new architecture was proposed [5] in order to increase the parallelism degree without any extra storage between half-iterations. The idea was to store several product code matrix symbols at the same address and to use elementary decoders able to process m symbols during the same clock period (denoted as m-decoders). A half-iteration structure includes m decoders, each decoding m symbols in one clock period, and an interleaving memory of size $4qn^2$. The resulting throughput is $O(m^2)$ while the overhead factor of the decoder complexity is $\sim m^2/2$.
In [6], the authors suggested using a barrel shifter between the decoding resources and the interleaving memory (IM) in order to avoid memory conflicts. This solution reaches the parallelism rate P = n by rotating the data to be stored. The extra complexity only consists of a barrel shifter with a complexity in $O(n \log n)$. However, the IM requirement is still prohibitive.
In [7][13], an IM-less architecture is detailed and prototyped on an FPGA device. In this architecture, a particular scheduling of the product code matrix decoding enables the interleaving memory to be replaced by an interconnection network (omega network). The complexity of the interleaving resources is then drastically reduced and highly parallel structures can be implemented on low-cost targets.
                                             Memory (1/2 iteration)
Architecture   Throughput                    IM                         Internal
[5]            O(m^2), m < n, m_max = 8      O(4qn^2)                   O(2sqnm)
[6]            O(n)                          O(4qn^2) + O(n log n)      O(2sqn^2)
[7][13]        O(n)                          O(n log n)                 O(2sqn^2)
Table 1. Throughput and complexity order of previous high-speed architectures
These three architectures can reach high parallelism degrees (i.e. high throughput). However, as illustrated in table 1, the internal memory complexity remains an important issue in the design of high-throughput TPC decoders. Here, s is the number of pipeline stages inside the SISO decoder. Moreover, in these architectures, the decoding resources consist of duplicated sequential decoders. Increasing the parallelism rate by duplicating computation resources is inefficient since the reuse of available resources is not optimized. In the next section, we propose to merge the duplicated sequential SISO decoders into one full-parallel SISO decoder.
3.2. Proposed IM-free architecture using full-parallel
SISO decoder
Considering that one can design a SISO decoder able to pro-
cess n symbols in parallel in one clock period, a product code
matrix can be decoded without any interleaving resource as
shown in figure 1.
Fig. 1. Proposed parallel decoding scheduling of a product
code matrix
At t = 0, the full-parallel SISO decoder processes column 1. During the next clock period, n sequential SISO decoders start decoding the first symbol of each row while the parallel decoder processes column 2. During the nth clock period, the sequential decoders complete the decoding of the matrix while the parallel decoder is already decoding the next matrix.
Data generated by the parallel decoder are immediately used by a sequential decoder. Consequently, no IM or data routing resources are required between the full-parallel decoder and the sequential decoders. The proposed architecture and the typical previous architecture for one iteration are depicted in figure 2.
Fig. 2. Previous TPC decoder architecture (a) and proposed
full-parallel SISO based TPC decoder architecture (b)
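The scheduling of figure 1 can be replayed with a short Python sketch. It is only a data-flow model under simplifying assumptions: no decoding is performed, the pipeline latency of the parallel decoder is taken as one clock period, and the function name `simulate_schedule` is hypothetical. It shows that each sequential row decoder receives its symbols in row order, one per clock period, without any storage or reordering in between.

```python
def simulate_schedule(matrix):
    """Replay the column-then-row scheduling on an n x n matrix.

    At clock t the parallel decoder handles column t. One clock later,
    sequential row decoder i consumes element (i, t) directly.
    Returns the stream seen by each row decoder and the clock count.
    """
    n = len(matrix)
    row_streams = [[] for _ in range(n)]
    in_flight = None                  # column produced during the previous clock
    clocks = 0
    for t in range(n + 1):
        if in_flight is not None:
            for i, symbol in enumerate(in_flight):
                row_streams[i].append(symbol)   # consumed at once, no memory
        in_flight = [matrix[i][t] for i in range(n)] if t < n else None
        clocks += 1
    return row_streams, clocks

# 4 x 4 example: element (i, j) is labelled 10*i + j.
matrix = [[10 * i + j for j in range(4)] for i in range(4)]
streams, clocks = simulate_schedule(matrix)
```

In this model the stream of row decoder i is exactly row i of the matrix, which is why no interleaving memory is needed between the two kinds of decoders.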
4. ARCHITECTURE OF A FULL-PARALLEL
COMBINATORY SISO DECODER
4.1. Algorithmic parameter reduction
As explained in section 2, the Chase-Pyndiah algorithm includes parameters $(p, T, C_w, q)$ which affect both the performance and the complexity of the turbo decoding. With 8 iterations, the parameter set $p_0 = \{p = 5, T = 16, C_w = 3, q = 5\}$ gives the best performance for a reasonable complexity [11]. However, algorithmic simulations showed that the reduced parameter set $p_1 = \{p = 3, T = 8, C_w = 0, q = 5\}$ only induces a performance loss of 0.25 dB at BER $= 10^{-6}$, and that this loss becomes null below BER $= 10^{-9}$. Consequently, using $p_1$ simplifies the architecture for a limited performance loss.
4.2. Full-parallel SISO decoder architecture
Figure 3 depicts the architecture of the full-parallel SISO decoder. It was first designed as a fully combinatorial circuit; a critical path study then enabled the insertion of pipeline stages within the structure.
Fig. 3. Combinatorial version of the full-parallel SISO de-
coder
4.2.1. Reception stage
The reception stage corresponds to steps (1-3) of the Chase-Pyndiah algorithm detailed in section 2.
The syndrome of the incoming vector $R'_{it}$ can be derived as $S(R'_{it}) = H \times \mathrm{sign}(R'_{it})$, where H is the parity check matrix of the BCH code. A straightforward implementation of such a matrix multiplication is depicted in figure 4. The H matrix, the corresponding parity check equations and the implementation of the syndrome $S(t_0) = [s_2, s_1, s_0]$ of a (7,4) BCH code are detailed.
Fig. 4. BCH(7,4) code: (a) Parity check matrix (b) Parity
check equations (c) Syndrome parallel computation imple-
mentation
It can be noticed that some parity check equations have common terms. For instance, the term $(x_1 \oplus x_0)$ is used in the computation of both $s_1$ and $s_2$. This enables computation resources to be reused for an even more efficient implementation. The parity of the incoming vector $R'_{it}$ is computed with a similar structure by XORing $(n - 1)$ incoming bits.
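The reuse of a common XOR term can be checked with a small Python sketch. The parity check matrix below is an assumption: it is one usual matrix for the (7,4) code generated by $g(x) = x^3 + x + 1$, with columns ordered $x_0$ to $x_6$. Figure 4 may use a different but equivalent matrix, so the shared term here ($x_4 \oplus x_5$) is not the one named in the text. The function names are hypothetical.

```python
from itertools import product

# Assumed parity check matrix; column j is alpha^j with alpha^3 = alpha + 1.
H = [
    # x0 x1 x2 x3 x4 x5 x6
    [1, 0, 0, 1, 0, 1, 1],  # s0
    [0, 1, 0, 1, 1, 1, 0],  # s1
    [0, 0, 1, 0, 1, 1, 1],  # s2
]

def syndrome_matrix(x):
    """Reference computation: S = H * x over GF(2), returned as (s2, s1, s0)."""
    s0, s1, s2 = (sum(h * b for h, b in zip(row, x)) % 2 for row in H)
    return (s2, s1, s0)

def syndrome_shared(x):
    """Same syndrome, with the common term x4 ^ x5 computed only once."""
    x0, x1, x2, x3, x4, x5, x6 = x
    shared = x4 ^ x5                 # used by both s1 and s2
    s0 = x0 ^ x3 ^ x5 ^ x6
    s1 = x1 ^ x3 ^ shared
    s2 = x2 ^ x6 ^ shared
    return (s2, s1, s0)

def parity(x):
    """Overall parity of the hard decisions (XOR of all bits)."""
    p = 0
    for b in x:
        p ^= b
    return p

# Both implementations agree on every possible 7-bit input.
agree = all(syndrome_matrix(x) == syndrome_shared(x)
            for x in product((0, 1), repeat=7))
```

The bits passed to these functions stand for the hard decisions $\mathrm{sign}(R'_{it})$; in hardware each XOR above is a gate, so sharing one term saves one gate per reuse.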
Selecting the least reliable bits of the incoming vector in parallel requires a sorting network. Such structures are composed of interconnected Compare and Select operators (CS). The interconnection scheme depends on the considered sorting algorithm. Many parallel sorting algorithms are conceivable [14]. However, most of them are optimized for a complete sorting, while the Chase-Pyndiah algorithm only requires a partial sorting (i.e. extracting p minima). Consequently, we devised a network optimized, in terms of area and critical path, for the partial sorting of p = 3 values among n = 32, as depicted in figure 5. The structure is based on shuffle networks coupled with local minima computation blocks. After the first shuffle stage, $min_1$ is in the lower section while the upper section can contain either $min_2$, or $min_3$, or no minimum. The same reasoning is applied recursively. After 5 shuffle stages, the minimum is determined while 5 values can still be $min_2$ and $min_3$. A local sorting of these 5 values determines $min_2$ and $min_3$. This partial sorting network requires 35 CS and 29 minimum elements. The critical path consists of 9 comparison stages.
Fig. 5. Optimized sorting network for least reliable bits selec-
tion
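To make the idea of partial sorting concrete, here is a minimal Python sketch of a different and simpler technique: a binary reduction tree in which every node merges the candidates of its two children and keeps only the p smallest. It is not the shuffle-based network of figure 5 and does not reproduce its CS count or critical path; the names `merge_keep` and `p_minima`, and the 4-bit reliability range, are assumptions made for illustration.

```python
import random

def merge_keep(a, b, p):
    """Local block: merge two short sorted lists and keep the p smallest."""
    return sorted(a + b)[:p]

def p_minima(values, p=3):
    """Tree reduction returning the p smallest (value, index) pairs."""
    level = [[(v, i)] for i, v in enumerate(values)]
    stages = 0
    while len(level) > 1:
        nxt = [merge_keep(level[k], level[k + 1], p)
               for k in range(0, len(level) - 1, 2)]
        if len(level) % 2:           # odd element is forwarded unchanged
            nxt.append(level[-1])
        level = nxt
        stages += 1
    return level[0], stages

# 32 reliabilities on 4 magnitude bits (assumed range for q = 5).
random.seed(0)
reliabilities = [random.randint(0, 15) for _ in range(32)]
minima, stages = p_minima(reliabilities)
```

The result is correct because the p global minima of a set always lie among the p minima of each half. For n = 32 the tree has 5 levels, and the indices returned with the values give the positions of the least reliable bits.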
4.2.2. Test pattern processing stage
The test pattern processing stage corresponds to steps (4-5) of the Chase-Pyndiah algorithm detailed in section 2. Instead of being processed sequentially, the test patterns are processed in parallel. The syndrome of each test pattern is computed by adding (over GF(2)) the positions of the inverted least reliable bits to $S(t_0)$. The parity management block computes the parity of $R'_{it+1}$ from the parity of $R'_{it}$ and the detection of an error, which occurs when $S(t_i) \neq 0$. The metric of each test pattern is then computed by adding the contribution of each inverted bit in the current test pattern (least reliable bits, syndrome corrected bits and the new parity bit). The minimum metric is determined in the DW selection block, whose structure is a simple minimum selection tree. The multiplexer selects $R'_{it}(S(t_i))$ in order to compute the test pattern metrics.
4.2.3. Soft output computation stage
The last stage is a duplication of n soft output computation blocks. As shown in figure 6, this block first computes the new reliability $F_{it}$ of each symbol. Since no competitor is considered, the β value is automatically assigned. The β value is based on an estimation of the competitor word metric. It is calculated from the reliability of the corrected bit and of the least reliable bits. Then, the extrinsic information is computed and damped by the coefficient $\alpha_{it}$, which is chosen to be a power of 2 so that the multiplication becomes a simple bit shift. Finally, the channel information is added to generate the soft output $R'_{it+1}$. Within this block, all computations are performed in sign and magnitude format. Other arithmetic formats were explored, but the chosen one requires fewer computation resources than the others.
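A minimal integer sketch of one soft output block is given below. Several points are assumptions: β is passed in as an already computed value, the reliability is given the sign of the decided bit before the subtraction, $\alpha_{it} = 2^{-shift}$ is applied to the magnitude (as a sign and magnitude datapath would do), and the result is saturated on q bits. The function name `soft_output` is hypothetical.

```python
def soft_output(r_chan, r_prev, decided_bit, beta, shift, q=5):
    """Compute R'_{it+1} = R + alpha * W with alpha = 2**-shift.

    All values are integers; the result is saturated to q bits in sign and
    magnitude format (1 sign bit, q - 1 magnitude bits).
    """
    max_mag = (1 << (q - 1)) - 1          # 15 for q = 5
    d = 1 if decided_bit else -1
    w = d * beta - r_prev                 # extrinsic information W = F - R'
    sign = -1 if w < 0 else 1
    damped = sign * (abs(w) >> shift)     # shift applied to the magnitude
    out = r_chan + damped
    return max(-max_mag, min(max_mag, out))

example = soft_output(r_chan=3, r_prev=-2, decided_bit=1, beta=6, shift=1)
```

Shifting the magnitude rather than the two's complement value makes the damping symmetric for positive and negative extrinsic values.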
5. AREA VS THROUGHPUT TRADEOFF
Fig. 6. Soft output computation stage
In this section, we compare the parallel and the sequential SISO decoders in terms of throughput, complexity and efficiency. Then a pipelined version of the parallel SISO decoder is proposed. Logic syntheses were performed using Synopsys Design Compiler with an STMicroelectronics 90 nm CMOS process. The area is expressed as a logic gate count. One equivalent logic gate corresponds to the area of a 2-input NAND gate. This gives a more technology-independent measure of the complexity.
5.1. Logic synthesis results
We designed 5 versions of the (32,26) BCH parallel SISO decoder having from 1 to 5 pipeline stages. The 1-pipeline stage version is a fully combinatorial architecture with register banks only at the input and output. Table 2 summarizes the synthesis results of the 5 full-parallel SISO decoders and compares them with n duplicated sequential SISO decoders.
Here, s is the number of pipeline stages, $f_{max}$ is the maximum working frequency reached during synthesis, A is the area of the design in equivalent gate count, and the parallelism rate P is the number of symbols processed per clock period. $T_{in}$ is the input throughput, $T_{in} = P \times f$, and $T_{out}$ is the information throughput, $T_{out} = R_p \times P \times f$. Increasing throughput regardless of the turbo decoder complexity is not relevant. In order to compare the throughput and complexity of SISO decoders, the efficiency of each decoder is defined as $\eta = T_{out} / A$.
                   Parallel SISO decoder                     Seq-based
s                  1       2       3       4       5         3
fmax (MHz)         222     385     526     690     741       700
A (Kgates)         47      32      33.8    39      36.2      200
Tin (Gb/s)         7.1     12.3    16.8    22.1    23.7      22.4
Tout (Gb/s)        4.7     8.1     11.1    14.6    15.6      14.8
η (Mb/s/gate)      0.1     0.25    0.33    0.37    0.43      0.07
Table 2. Comparison of sequential and parallel (32,26) BCH SISO decoder performances
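The derived rows of Table 2 can be recomputed from the frequency and area rows with the definitions given above. The short Python sketch below assumes P = n = 32 for every column and uses the product code rate $R_p = (26/32)^2$; the function name `figures` is hypothetical.

```python
N, K = 32, 26
R_P = (K / N) ** 2          # product code rate, about 0.66
P = N                       # symbols processed per clock period

def figures(f_max_mhz, area_kgates):
    """Return (T_in in Gb/s, T_out in Gb/s, eta in Mb/s per gate)."""
    t_in = P * f_max_mhz * 1e6 / 1e9
    t_out = R_P * t_in
    eta = (t_out * 1e3) / (area_kgates * 1e3)
    return t_in, t_out, eta

# (fmax in MHz, area in Kgates) for the parallel decoder with s = 1..5 stages.
table2 = {1: (222, 47), 2: (385, 32), 3: (526, 33.8), 4: (690, 39), 5: (741, 36.2)}
for s, (f_mhz, area) in table2.items():
    print(s, [round(v, 2) for v in figures(f_mhz, area)])

seq_based = figures(700, 200)   # n duplicated sequential decoders
```

The values obtained this way match the table to within its rounding.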
Table 2 shows that the parallel SISO decoder can reach high throughput with low complexity. It is clearly more efficient than a duplication of n sequential decoders, because a fully parallel structure enables a better reuse of computation and memory resources. For instance, in a parallel SISO decoder, the memory requirement for R and $R'_{it}$ is $O(2sqn)$ while in a sequential-based parallel decoder, it is $O(2sqn^2)$.
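As a rough numerical illustration, the two orders of magnitude can be evaluated for n = 32, q = 5 and s = 3. Treating the $O(\cdot)$ expressions as exact bit counts is an assumption made only for this example, and the function name is hypothetical.

```python
def internal_memory_bits(n, q, s, parallel):
    """Order-of-magnitude storage for R and R'_it in one half-iteration."""
    return 2 * s * q * n if parallel else 2 * s * q * n * n

par = internal_memory_bits(32, 5, 3, parallel=True)    # one full-parallel decoder
seq = internal_memory_bits(32, 5, 3, parallel=False)   # n sequential decoders
ratio = seq // par
```

Under this assumption the ratio between the two requirements is n, whatever the values of s and q.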
Since the resulting throughput is limited by the slowest stage, the maximum throughput of a full-iteration module is 14.8 Gb/s. In this case, a full iteration of the proposed architecture (see figure 2b) takes 236 Kgates while the previous architecture (see figure 2a) takes 400 Kgates.
5.2. Towards a maximal parallelism rate
Previous synthesis results showed that the parallel architecture is more efficient than duplicated architectures.
Alternative turbo decoding scheduling for enhanced parallelism rate
Figure 5.2 proposes an alternative parallel decoding scheme for the product code matrix, where m parallel decoders are used for column decoding while row decoding is performed by n m-decoders. An m-decoder can decode m symbols in one clock period, with 1 < m < n [5]. In such an architecture, the maximum reachable parallelism rate $P = n^2$ can be achieved by using n full-parallel SISO decoders for column decoding and n full-parallel SISO decoders for row decoding. Considering that the TPC decoder includes 2n parallel SISO decoders (with s = 5 pipeline stages) for it decoding iterations, the turbo decoder output throughput is
$$T_{out} = \frac{P \times f \times R_p}{it}$$
Placing and routing 2n parallel SISO decoders would reduce the maximum working frequency of the parallel SISO decoder. Nevertheless, using the previous equation, an information throughput $T_{out} = 10$ Gb/s is reached when f > 88 MHz, which is most probably an achievable frequency. Furthermore, a reasonable working frequency of f = 300 MHz leads to a $(32,26)^2$ BCH product code turbo decoder with an information throughput $T_{out} = 33.7$ Gb/s. The total area of such a parallel turbo decoder is $A = 10\,\mu\mathrm{m}^2$. The achieved parallelism rate ($P = n^2 = 1024$) is similar to that of a full-parallel n = 1024 LDPC decoder. The number of interconnections in the product code turbo decoder is $In(TPCD) = 2 \times n_{BCH}^2 \times q$, while an equivalent full-parallel 1024-LDPC decoder would have $In(1024\text{-}LDPC) = n_{LDPC} \times q \times d_v$, where $d_v$ is the variable node degree. Consequently, as long as $d_v > 2$, the following inequality holds: $In(TPCD) < In(1024\text{-}LDPC)$.
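The figures quoted in this paragraph can be checked with a few lines of Python. The sketch assumes it = 6 iterations (the value given in the abstract) and q = 5 quantization bits (the value of section 4.1); the function names and the example value $d_v = 3$ are hypothetical.

```python
N_BCH, K_BCH, Q, ITERATIONS = 32, 26, 5, 6   # it = 6 and q = 5 are assumed
R_P = (K_BCH / N_BCH) ** 2                   # product code rate
P = N_BCH ** 2                               # maximum parallelism rate

def t_out_gbps(f_mhz):
    """Information throughput T_out = P * f * R_p / it, in Gb/s."""
    return P * f_mhz * 1e6 * R_P / ITERATIONS / 1e9

def f_min_mhz(target_gbps):
    """Smallest clock frequency (MHz) reaching the target throughput."""
    return target_gbps * 1e9 * ITERATIONS / (P * R_P) / 1e6

def interconnections_tpc():
    """In(TPCD) = 2 * n_BCH^2 * q."""
    return 2 * N_BCH ** 2 * Q

def interconnections_ldpc(n_ldpc, d_v):
    """In(LDPC) = n_LDPC * q * d_v."""
    return n_ldpc * Q * d_v

throughput_300 = t_out_gbps(300)
frequency_10g = f_min_mhz(10)
tpc_wires = interconnections_tpc()
ldpc_wires = interconnections_ldpc(1024, 3)
```

With these assumptions the computed throughput at 300 MHz is about 33.8 Gb/s, slightly above the 33.7 Gb/s quoted in the text, and the frequency needed for 10 Gb/s is just under 89 MHz.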
6. CONCLUSION
The complexity of high throughput TPC decoder architectures is made prohibitive by the amount of memory usually required for data interleaving and pipelining. In this article, we proposed an innovative scheduling of the product code matrix decoding which removes the need for any interleaving resource. The resulting architecture requires a full-parallel SISO decoder able to decode n symbols in one clock period. Such a SISO decoder architecture is described and includes a new optimized parallel sorting network. ASIC-based logic synthesis showed the better efficiency of the proposed architecture: compared to a previous architecture, the area is reduced while the throughput is the same. Finally, in order to increase the parallelism rate and the throughput, a slightly different scheduling is proposed. With such a scheduling, the maximum parallelism rate $O(n^2)$ can be achieved. An area estimation showed that a $(32,26)^2$ BCH code can be decoded at 33.7 Gb/s with an estimated silicon area of $10\,\mu\mathrm{m}^2$.
7. REFERENCES
[1] A. Darabiha, A. C. Carusone, and F. R. Kschischang, “A 3.3-Gbps bit-serial block-interlaced min-sum LDPC decoder in 0.13-um CMOS,” in Custom Integrated Circuits Conference, 2007. CICC ’07. IEEE, 16-19 Sept. 2007, pp. 459–462.
[2] R. Pyndiah, A. Glavieux, A. Picart, and S. Jacq, “Near optimum decoding of product codes,” in Global Telecommunications Conference, 1994. GLOBECOM ’94. ’Communications: The Global Bridge’., IEEE, 28 Nov.-2 Dec. 1994, vol. 1, pp. 339–343.
[3] T. Mizuochi, K. Kubo, H. Yoshida, H. Fujita, H. Tagami, M. Akita, and K. Motoshima, “Next generation FEC for optical transmission systems,” in Optical Fiber Communications Conference, 2003. OFC 2003, 23-28 March 2003, vol. 2, pp. 527–528.
[4] T. Mizuochi, K. Ouchi, T. Kobayashi, Y. Miyata, K. Kuno, H. Tagami, K. Kubo, H. Yoshida, M. Akita, and K. Motoshima, “Experimental demonstration of net coding gain of 10.1 dB using 12.4 Gb/s block turbo code with 3-bit soft decision,” in Optical Fiber Communications Conference, 2003. OFC 2003, 23-28 March 2003, vol. 3, pp. PD21–P1–3.
[5] J. Cuevas, P. Adde, S. Kerouedan, and R. Pyndiah, “New architecture for high data rate turbo decoding of product codes,” in Global Telecommunications Conference, 2002. GLOBECOM ’02. IEEE, 17-21 Nov. 2002, vol. 2, pp. 1363–1367.
[6] Zhipei Chi and K.K. Parhi, “High speed VLSI architecture design for block turbo decoder,” in Circuits and Systems, 2002. ISCAS 2002. IEEE International Symposium on, 26-29 May 2002, vol. 1, pp. I–901–I–904.
[7] C. Jego, P. Adde, and C. Leroux, “Full-parallel architecture for turbo decoding of product codes,” in Electronics Letters, 31 August 2006, vol. 42, pp. 55–56.
[8] P. Elias, “Error-free coding,” Information Theory, IEEE Transactions on, vol. 4, no. 4, pp. 29–37, Sep 1954.
[9] IEEE Std 802.16-2001, “IEEE standard for local and metropolitan area networks part 16: Air interface for fixed broadband wireless access systems,” December 2001.
[10] G. Forney, Jr., “Generalized minimum distance decoding,” Information Theory, IEEE Transactions on, vol. IT-12, pp. 125–131, April 1966.
[11] P. Adde, R. Pyndiah, and O. Raoul, “Performance and complexity of block turbo decoder circuits,” in Electronics, Circuits, and Systems, 1996. ICECS ’96., Proceedings of the Third IEEE International Conference on, 13-16 Oct. 1996, vol. 1, pp. 172–175.
[12] D. Chase, “A class of algorithms for decoding block codes with channel measurement information,” IEEE Trans. Inform. Theory, vol. IT, pp. 170–182, January 1972.
[13] C. Leroux, C. Jego, P. Adde, and M. Jezequel, “Towards Gb/s turbo decoding of product code onto an FPGA device,” in Circuits and Systems, 2007. ISCAS 2007. IEEE International Symposium on, 27-30 May 2007, pp. 909–912.
[14] S. G. Akl, Parallel sorting algorithms, Academic Press, 1985.
