Efficient MAP-algorithm implementation on programmable architectures by F. Kienle et al.
Advances in Radio Science (2003) 1: 259–263
© Copernicus GmbH 2003
Efficient MAP-algorithm implementation on programmable architectures
F. Kienle, H. Michel, F. Gilbert, and N. Wehn
University of Kaiserslautern, Germany
Abstract. Maximum-A-Posteriori (MAP) decoding algorithms are important HW/SW building blocks in advanced communication systems because they provide soft-output information that can be efficiently exploited in iterative channel decoding schemes like Turbo-Codes. Multi-standard operation demands flexible implementations on programmable platforms.
In this paper we analyze a quantized turbo-decoder based on a Max-Log-MAP algorithm with Extrinsic Scaling Factor (ESF). Its communication performance approximates that of a Turbo-Decoder with a Log-MAP algorithm, and it is less sensitive to quantization effects. We present Turbo-Decoder implementations on state-of-the-art DSPs and show that only a Max-Log-MAP implementation fulfills a throughput requirement of ∼2 Mbit/s. The negligible overhead of the ESF strengthens the case for using Max-Log-MAP with ESF on programmable platforms.
1 Introduction
Third-generation wireless communication systems comprise advanced signal processing algorithms that increase the computational requirements more than ten-fold over 2G systems (Third Generation Partnership Project). Numerous existing and emerging standards require flexible implementations (“software radio”) (Greifendorf et al., 2002). It is argued in Bickerstaff et al. (2000) that “the trends in decoding algorithms are moving from standard Viterbi towards more computationally-expansive algorithms like soft-output Viterbi algorithm (SOVA) and maximum a posteriori (MAP) algorithm. The implementation efficiency of these algorithms will become a differentiating factor for next generation wireless communications – particularly for those employing programmable DSP-devices.”
Turbo-Codes, introduced by Berrou et al. (1993), have near-Shannon-limit error correction capability, are among the most advanced channel-coding schemes, and are therefore used in many communication standards such as the 3GPP standard.
Correspondence to: F. Kienle
(kienle@eit.uni-kl.de)
The important innovation was the reintroduction of iterative decoding schemes for convolutional codes by means of soft-output information exchange. The basic building block of a turbo-decoder is the component decoder, which provides soft-output information. This information is a measure of the decoder's confidence in its decoding decision. These component decoders are typically based on the MAP algorithm.
To reduce the MAP implementation complexity, the algorithm is transformed into the logarithmic domain (Log-MAP) (Robertson et al., 1997), which reduces the operation strength but introduces the so-called max∗ operation. This operation is composed of a maximum search and a correction term. Discarding the correction term results in the so-called Max-Log-MAP algorithm. Its implementation is less complex and thus faster, but implies a loss in communication performance of up to 0.3 dB.
The implementation of the max∗ operation is very time consuming on standard DSPs. The additional assembler instructions needed to add the correction term decrease the overall throughput of a Log-MAP implementation by a factor of 2–3 compared to a Max-Log-MAP implementation (the throughput degradation in a dedicated VLSI implementation is less severe). This trade-off between communication performance and implementation complexity is a typical problem in the design of communication systems. Recently, some very advanced DSPs like the TigerSharc from Analog Devices have implemented a special max∗ instruction to support an efficient Log-MAP implementation.
In this paper we investigate different MAP algorithms with respect to their communication performance and implementation complexity on state-of-the-art DSPs. In Vogt et al. (1999) an Extrinsic Scaling Factor (ESF) was proposed to improve the turbo-decoder performance with Max-Log-MAP component decoders. We examine the use of the ESF in a quantized turbo-decoder model and show that the Max-Log-MAP in combination with ESF is less sensitive to quantization effects than the Log-MAP algorithm. We present different turbo-decoder implementations on modern DSPs and quantify the large throughput difference between Log-MAP and Max-Log-MAP implementations. The low implementation complexity of ESF and Max-Log-MAP and the good communication performance strengthen the case for this combination in turbo-decoder implementations on state-of-the-art DSPs.
The remainder of this paper is structured as follows: Sect. 2 explains the system model and the MAP algorithm. In Sect. 2.2 we present turbo-decoder performance under quantization and SNR mismatch. Section 3 is devoted to Turbo-Decoder implementation on state-of-the-art DSP architectures: the Starcore SC140 from Motorola/Lucent, the ST120 from ST-Microelectronics and the TigerSharc from Analog Devices. Section 4 concludes this paper.
2 Turbo-System
Forward error correction is enabled by introducing parity bits. In Turbo-Codes, the original information ($x^s$), denoted as systematic information, is transmitted together with the parity information ($x^{1p}$, $x^{2p}$). In the Third Generation Partnership Project (3GPP) standard, the encoder consists of two recursive systematic convolutional (RSC) encoders with constraint length K = 4. One RSC encoder works on the block of information in its original order, the other on an interleaved sequence, see Fig. 1. On the receiver side there is a corresponding component decoder for each of them. The MAP decoder has been recognized as the component decoder of choice, as it is superior to the Soft-Output Viterbi Algorithm (SOVA) in terms of communication performance and implementation scalability, see Vogt et al. (1999).
The soft-output of each component decoder ($\Lambda$) is modified to reflect only its own confidence ($z$) in the received information bit having been sent as “0” or “1”. These confidences are exchanged between the decoders to bias their next estimations iteratively. During this exchange, the produced information is interleaved following the same scheme as in the encoder. The exchange continues until a stop criterion, see Worm et al. (2000b), is fulfilled. The last soft-output is not modified and becomes the soft-output of the Turbo-Decoder ($\Lambda_2$). Its sign represents the 0/1 decision and its magnitude the confidence of the Turbo-Decoder in it.
2.1 The MAP algorithm
Given the received samples of systematic and parity bits (channel values) for the whole block ($y_0^N$, where $N$ is the block length), the MAP algorithm computes the probability for each bit to have been sent as $d_k = 0$ or $d_k = 1$. The logarithmic likelihood ratio (LLR) of these probabilities is the soft-output, denoted as:

$$\Lambda_k = \log \frac{\Pr\{d_k = 1 \mid y_0^N\}}{\Pr\{d_k = 0 \mid y_0^N\}}. \tag{1}$$
Equation 1 can be expressed using three probabilities, which refer to the encoder states $S_k^m$, where $k \in \{0 \ldots N\}$ and $m, m' \in \{1 \ldots 8\}$:
The branch metric $\gamma_{m,m'}^{k,k+1}(d_k)$ is the probability that a transition between $S_k^m$ and $S_{k+1}^{m'}$ has taken place. It is derived from the received signals, the a-priori information given by the previous decoder, the code structure and the assumption of $d_k = 0$ or $d_k = 1$; for details see Robertson et al. (1997).
From these branch metrics the probability $\alpha_m^k$ that the encoder reached state $S_k^m$, given the initial state and the received sequence $y_0^k$, is computed through a forward recursion:

$$\alpha_{m'}^{k} = \sum_{m} \alpha_{m}^{k-1} \cdot \gamma_{m,m'}^{k-1,k}.$$
Performing a backward recursion yields the probability $\beta_{m'}^{k+1}$ that the encoder reaches the (known) final state given the state $S_{k+1}^{m'}$ and the remainder of the received sequence $y_{k+1}^N$:

$$\beta_{m}^{k} = \sum_{m'} \beta_{m'}^{k+1} \cdot \gamma_{m,m'}^{k,k+1}$$
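The $\alpha$ and $\beta$ values are both called state metrics. Equation 1 can be rewritten as:

$$\Lambda_k = \log \frac{\sum_{m} \sum_{m'} \alpha_{m}^{k} \cdot \beta_{m'}^{k+1} \cdot \gamma_{m,m'}^{k,k+1}(d_k = 1)}{\sum_{m} \sum_{m'} \alpha_{m}^{k} \cdot \beta_{m'}^{k+1} \cdot \gamma_{m,m'}^{k,k+1}(d_k = 0)}. \tag{2}$$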
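The forward recursion, the backward recursion and Eq. (2) can be sketched in a few lines of Python. This is a minimal probability-domain illustration, not the authors' implementation: the dictionary layout of the branch probabilities, the function names `map_llr` and `toy_gamma`, and the 2-state toy trellis (next state = state XOR bit) are assumptions made for the example. The 3GPP code discussed in the text has 8 states.

```python
import math

def map_llr(gamma, n_states, start_state=0, end_state=None):
    """Soft-output LLRs from branch probabilities via forward/backward recursions.

    gamma[k] maps a transition (m, m_next, d) to its branch probability for
    trellis step k -> k+1 (an illustrative data layout, not taken from the paper).
    """
    n = len(gamma)
    # Forward recursion: alpha[k+1][m'] = sum_m alpha[k][m] * gamma[k][(m, m', d)]
    alpha = [[0.0] * n_states for _ in range(n + 1)]
    alpha[0][start_state] = 1.0
    for k in range(n):
        for (m, m_next, _d), g in gamma[k].items():
            alpha[k + 1][m_next] += alpha[k][m] * g
    # Backward recursion: beta[k][m] = sum_m' beta[k+1][m'] * gamma[k][(m, m', d)]
    beta = [[0.0] * n_states for _ in range(n + 1)]
    if end_state is None:
        beta[n] = [1.0] * n_states   # final state unknown: every state allowed
    else:
        beta[n][end_state] = 1.0     # known final state, as assumed in the text
    for k in range(n - 1, -1, -1):
        for (m, m_next, _d), g in gamma[k].items():
            beta[k][m] += beta[k + 1][m_next] * g
    # Eq. (2): log-ratio of the summed transition probabilities for d_k = 1 and 0
    llr = []
    for k in range(n):
        num = sum(alpha[k][m] * beta[k + 1][mn] * g
                  for (m, mn, d), g in gamma[k].items() if d == 1)
        den = sum(alpha[k][m] * beta[k + 1][mn] * g
                  for (m, mn, d), g in gamma[k].items() if d == 0)
        llr.append(math.log(num / den))
    return llr

def toy_gamma(p_one):
    """Branch probabilities of a hypothetical 2-state trellis (next = state XOR d)."""
    return [{(s, s ^ d, d): (p if d == 1 else 1.0 - p)
             for s in (0, 1) for d in (0, 1)} for p in p_one]

if __name__ == "__main__":
    print(map_llr(toy_gamma([0.9, 0.2, 0.6]), n_states=2, end_state=0))
```

With an unknown final state and branch probabilities that depend only on the bit value, the result collapses to log(p/(1−p)) per bit, which is a convenient sanity check. Fixing the final state couples the bits through the trellis and changes the LLRs.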
The original probability-based formulation implies many multiplications and has thus been ported to the logarithmic domain, resulting in the Log-MAP algorithm (Robertson et al., 1997). Multiplications turn into additions, and additions into the already mentioned max∗ operation, which is defined as:

$$\max{}^{*}(\delta_1, \delta_2) = \max(\delta_1, \delta_2) + \ln\left(1 + e^{-|\delta_2 - \delta_1|}\right). \tag{3}$$

This transformation does not decrease the communication performance. Arithmetic complexity can be further reduced by discarding the correction term $\ln(1 + e^{-|\delta_2 - \delta_1|})$ (Max-Log-MAP), which results in a 0.3 dB communication performance loss.
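As a small numerical illustration of Eq. (3), the following Python sketch compares the exact max∗ operation with the plain maximum used by Max-Log-MAP. The function names are chosen for this example only.

```python
import math

def max_star(d1, d2):
    """Eq. (3): equals ln(exp(d1) + exp(d2)), computed in a numerically stable way."""
    return max(d1, d2) + math.log1p(math.exp(-abs(d2 - d1)))

def max_log(d1, d2):
    """Max-Log-MAP approximation: the correction term is dropped."""
    return max(d1, d2)

if __name__ == "__main__":
    for a, b in [(0.0, 0.0), (1.0, 2.5), (-3.0, 4.0)]:
        print(a, b, max_star(a, b), max_log(a, b))
```

The correction term is largest (ln 2) when both arguments are equal and vanishes quickly as their difference grows, which is why it can be tabulated with a few entries or dropped altogether.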
If we multiply the extrinsic information that is passed between the constituent decoders by an appropriate extrinsic scaling factor (ESF), the communication performance approaches that of a Log-MAP decoder (Vogt and Finger, 2000). Simulations show that the optimal scaling factor is 0.7. For fixed-point implementation (which is a must on DSPs) an ESF of 0.75 is used, which can easily be implemented with a shift operation (for example, 0.75 · z = z − z/4).
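One possible way to realize the scaling by 0.75 on integer (fixed-point) extrinsic values is a right shift by two followed by a subtraction. The sketch below is an assumption about how this could be done; the text does not specify the exact instruction sequence or the rounding behaviour used by the authors.

```python
def scale_esf_075(z):
    """Approximate 0.75 * z for an integer extrinsic value z.

    Uses z - z/4 with an arithmetic right shift; the shift rounds towards
    minus infinity, so positive and negative inputs round slightly differently.
    """
    return z - (z >> 2)

if __name__ == "__main__":
    print([scale_esf_075(z) for z in (8, -8, 5, -5)])
```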
2.2 Turbo-Decoder performance under quantization and
SNR mismatch effects
SNR estimation is a difficult task in wireless communication systems. An imprecise SNR estimate can have a strong influence on the communication performance of the decoding process. Worm et al. (2000a) have proven that Turbo-decoding based on a Max-Log-MAP algorithm is SNR independent, whereas the decoding performance with the Log-MAP algorithm depends on the accuracy of the SNR estimation. For the latter case the authors propose to work with SNR operating points $L_{op}$.
The estimated SNR values are used to scale the received channel input values. These values are interpreted as log-likelihood values and are calculated as follows:

$$\Lambda_k = 4 \frac{E_s}{N_0} a_k y_k \tag{4}$$
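[Figure 1: block diagram. Encoder: two RSC encoders (RSC1, RSC2), an interleaver (INT) and a puncturer, with outputs $x^s$, $x^{1p}$, $x^{2p}$. Decoder: two component decoders (MAP1, MAP2) connected through interleavers (INT) and deinterleavers (DEINT), with inputs $y^s$, $y^{1p}$, $y^{2p}$, soft-outputs $\Lambda_1$, $\Lambda_2$ and exchanged confidences $z_1$, $z_2$.]
Fig. 1. Turbo-Encoder and Decoder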
The factor $4 \frac{E_s}{N_0} a_k$, denoted as $L_c$ in the following, is commonly referred to as the channel reliability factor (Hagenauer et al., 1996). If the channel characteristics do not change over time, it is sufficient to use a single, constant SNR operating point, i.e. $L_{op} = L_c = \text{const}$.
The investigations in Worm et al. (2000a) were based on floating-point models. However, VLSI or DSP implementations require fixed-point representations with limited accuracy. The move from floating-point to fixed-point models (quantization) with minimized word width requires a thorough exploration: the bit width has to be wide enough to cover the dynamic range, and the number of fractional bits has to be large enough to ensure an appropriate processing accuracy (Michel and Wehn, 2001). To avoid implementing a multiplier for the channel reliability factor in Eq. (4), we approximate the SNR operating points by $2^x$, $x \in \mathbb{Z}$. Thus the multiplication by $L_{op}$ in Eq. (4) can be replaced by simple shift operations, with a resulting SNR operating point granularity of 3 dB.
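The power-of-two operating points can be illustrated as follows. This is a sketch under assumptions: the helper names are hypothetical, the channel values are taken to be plain integers, and choosing the exponent by rounding log2 of the reliability factor is one plausible selection rule that the text does not prescribe.

```python
import math

def nearest_pow2_exponent(l_c):
    """Exponent x such that 2**x is closest to l_c on a logarithmic scale."""
    return round(math.log2(l_c))

def scale_by_operating_point(y_q, x):
    """Multiply an integer channel value y_q by L_op = 2**x using shifts only."""
    return y_q << x if x >= 0 else y_q >> -x

# Neighbouring operating points differ by a factor of two in Es/N0,
# i.e. by 10*log10(2) dB, which is the 3 dB granularity mentioned in the text.
STEP_DB = 10.0 * math.log10(2.0)

if __name__ == "__main__":
    x = nearest_pow2_exponent(2.0)            # L_op = 2 corresponds to x = 1
    print(x, scale_by_operating_point(3, x), round(STEP_DB, 2))
```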
Figure 2 shows the communication performance of differ-
ent MAP decoding algorithms. Bit Error Rates (BER) are
simulated with an Additive White Gaussian Noise (AWGN)
channel, the blocksize is 5114 bits (3GPP standard). Eight
iterations are carried out.
In Fig. 2a we compare the performance of a floating-point Log-MAP algorithm with a fixed-point Log-MAP, a fixed-point Max-Log-MAP and a fixed-point Max-Log-MAP combined with ESF = 0.75 scaling. A time-invariant, stable operating point $L_{op} = 2$ is used, which is an optimum SNR operating point for an AWGN channel. The quantization (bit width 6, fractional part 2 bits) is accurate enough to prevent an error floor or any other major quantization effects. Thus the fixed-point Log-MAP algorithm has only a very small performance loss compared to the floating-point reference Log-MAP algorithm. The fixed-point Max-Log-MAP algorithm has a degradation of ∼0.3 dB due to its algorithmic simplification. Remarkably, the Max-Log-MAP algorithm with ESF = 0.75 scaling degrades only minimally compared to the fixed-point Log-MAP algorithm. Thus, in the case of stable operating points, we can use the fixed-point Max-Log-MAP algorithm with ESF scaling, which has a communication performance similar to that of the Log-MAP algorithm.
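To make the quantization parameters concrete, the sketch below quantizes a real value to a 6-bit word with 2 fractional bits. It assumes a signed two's-complement format with saturation and round-to-nearest, which gives a step size of 0.25 and a range of −8.0 to 7.75. These format details are assumptions for illustration; the text only states the bit width and the number of fractional bits.

```python
def quantize(value, total_bits=6, frac_bits=2):
    """Map a real value to a saturated signed integer with frac_bits fractional bits."""
    scale = 1 << frac_bits                 # 4 quantization steps per unit
    q_min = -(1 << (total_bits - 1))       # -32 for a 6-bit word
    q_max = (1 << (total_bits - 1)) - 1    # +31 for a 6-bit word
    q = round(value * scale)               # round to nearest (ties go to even)
    return max(q_min, min(q_max, q))       # saturate instead of wrapping around

def dequantize(q, frac_bits=2):
    """Real value represented by the integer q."""
    return q / (1 << frac_bits)

if __name__ == "__main__":
    for v in (1.3, -0.3, 100.0, -100.0):
        q = quantize(v)
        print(v, q, dequantize(q))
```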
In Fig. 2b we investigate the influence of SNR mismatches on different fixed-point algorithms. The performance with an optimal SNR operating point ($L_{op} = 2$) is compared with the operating points $L_{op} = 1$ and $L_{op} = 4$, which correspond to a −3 dB and +3 dB SNR mismatch. The fixed-point Log-MAP algorithm is clearly very sensitive to the SNR operating point: the performance degradation is 0.2 dB for $L_{op} = 4$ and exceeds ∼0.4 dB for $L_{op} = 1$. The Max-Log-MAP algorithm with ESF scaling is far less sensitive to $L_{op}$ variations: the degradation is 0.05 dB for $L_{op} = 4$ and 0.1 dB for $L_{op} = 1$. In a floating-point model all the Max-Log-MAP graphs for the different operating points would coincide, but due to quantization the graphs differ.
Taking quantization effects and SNR mismatches into account, we recommend the use of the Max-Log-MAP algorithm with ESF scaling in turbo-decoder implementations.
3 Turbo-Decoder Implementation on modern DSP architectures
Modern DSP architectures attempt to increase the signal processing performance by exploiting the inherent parallelism of many signal processing algorithms. This class of DSP architectures provides several independent ALUs along with wide and fast buses to the internal memories. To enable this increased degree of instruction-level parallelism, the instructions to be executed in parallel on the active units are grouped together into so-called very long instruction words (VLIW). Further, the processing units usually support the single-instruction/multiple-data (SIMD) approach. This exploits sub-word parallelism (SWP), where several sub-words of a data word can be processed with the same operation. In the following we present state-of-the-art DSP architectures and implementation results.
[Figure 2: two plots of BER (logarithmic axis, 10^0 down to 10^-7) versus SNR. Panel (a), SNR axis from −2 to −0.8: float Log-MAP; fixed Log-MAP; fixed Max-Log-MAP; fixed Max-Log-MAP and ESF. Panel (b), SNR axis from −2 to −0.6: fixed Log-MAP and fixed Max+ESF, each for L_op = 2, L_op = 1 and L_op = 4.]
Fig. 2. BER of UMTS Turbo-Decoder: (a) Different MAP implementations, 8 iterations, (b) Performance of different algorithms dependent on the SNR operating points.
3.1 Motorola/Lucent SC140
The Starcore SC140 is jointly designed by Motorola and Lucent. A significant design feature is the variable length execution set (VLES). Most instructions are 16 bits wide, but can be grouped into VLES packets of 128 bits. We use the SC140 as an example of a multiple-ALU VLIW DSP supporting sub-word parallelism. Its architecture employs 4 ALUs and 2 address generation units (AGUs). In one clock cycle the SC140 is thus able to perform 4 ALU operations, each using 32-bit wide operands, and a 128-bit data transfer. A (Max-)Log-MAP implementation for decoding an 8-state Turbo-Code on this DSP should exploit the benefits of sub-word parallelism by using 16-bit packed data types.
In Chass and Gubeskys (2000) the update of the path metrics of a Max-Log-MAP, comprising four butterflies, needs 3 clock cycles. This fully utilizes the architectural capabilities of the DSP. One Max-Log-MAP decoder can be realized in 16 cycles per bit. Assuming 5 iterations with two component decoder runs each (16 · 2 · 5 = 160 cycles per decoded bit), the resulting Turbo-decoding throughput is 1875 kbit/s at 300 MHz. The component decoder with a Log-MAP implementation has a cycle count of approximately 50 cycles per bit, leading to a Turbo-decoding throughput of 600 kbit/s at 300 MHz (see Table 1).
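The throughput figures follow from the clock frequency and the cycle count per bit and MAP. The sketch below assumes two component decoder (MAP) runs per turbo iteration, which is not stated explicitly in the text but reproduces the figures quoted for all three processors; the function name is chosen for this example.

```python
def turbo_throughput_kbps(clock_mhz, cycles_per_bit_map, iterations=5,
                          maps_per_iteration=2):
    """Decoded bits per second (in kbit/s) for a given cycle budget per bit and MAP."""
    cycles_per_bit = cycles_per_bit_map * maps_per_iteration * iterations
    return clock_mhz * 1000.0 / cycles_per_bit

if __name__ == "__main__":
    print(turbo_throughput_kbps(300, 16))   # SC140, Max-Log-MAP
    print(turbo_throughput_kbps(300, 50))   # SC140, Log-MAP
    print(turbo_throughput_kbps(200, 37))   # ST120, Max-Log-MAP
    print(turbo_throughput_kbps(180, 27))   # TigerSharc, Log-MAP
```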
This decrease in decoding throughput results from the complexity of the max∗ operation (see Eq. 3), which includes the following steps: computing the difference of the parameters, taking the absolute value, accessing a lookup table (LUT), and adding the LUT result to the maximum of the parameters. This sequence of steps needs 10 clock cycles, in contrast to just one cycle for the plain maximum operation (Chass and Gubeskys, 2000).
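The four steps of a table-based max∗ can be sketched as follows for integer metrics. The LUT size (16 entries) and the resolution of 2 fractional bits are assumptions for illustration; the table actually used by Chass and Gubeskys (2000) is not given in the text.

```python
import math

FRAC_BITS = 2                 # assumed fixed-point resolution (step 0.25)
SCALE = 1 << FRAC_BITS

# Correction term ln(1 + exp(-|d|)) tabulated for integer differences 0..15.
LUT = [round(math.log1p(math.exp(-d / SCALE)) * SCALE) for d in range(16)]

def max_star_lut(a, b):
    """Fixed-point max* following the four steps named in the text."""
    diff = a - b                                # 1. difference of the parameters
    mag = abs(diff)                             # 2. absolute value
    corr = LUT[mag] if mag < len(LUT) else 0    # 3. lookup-table access
    return max(a, b) + corr                     # 4. add LUT result to the maximum

if __name__ == "__main__":
    print(LUT)
    print(max_star_lut(4, 4), max_star_lut(10, 6), max_star_lut(0, 40))
```

Each step maps to at least one instruction on a DSP without dedicated support, which is consistent with the roughly tenfold cycle count compared to a plain maximum.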
3.2 STM ST120
The ST120 is provided by ST-Microelectronics. It features
two ALU units and supports three different instruction sets:
a 16-bit instruction set (GP16) for compact microcontroller
code, a 32-bit instruction set (GP32) for higher performance
and more complex instructions, and a third one for an in-
creased level of instruction parallelism. In this 4 × 32-bit
Score-boarded Long Instruction Word (SLIW) mode the pro-
cessor is able to execute four GP32 instructions in one clock
cycle. Following the SIMD approach, the processor supports
2 × 16-bit data packed into one 32-bit data word.
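The idea of packing two 16-bit values into one 32-bit word and processing both with a single operation can be emulated in Python. This is only an illustration of sub-word parallelism: the helper names are hypothetical, the addition wraps around (many DSPs offer saturating variants instead), and a real DSP executes each packed operation as one instruction rather than as the bit manipulation shown here.

```python
MASK16 = 0xFFFF

def to_signed16(v):
    """Interpret a 16-bit pattern as a two's-complement value."""
    return v - 0x10000 if v & 0x8000 else v

def pack2x16(hi, lo):
    """Pack two signed 16-bit values into one 32-bit word."""
    return ((hi & MASK16) << 16) | (lo & MASK16)

def unpack2x16(word):
    """Return the (high, low) signed 16-bit sub-words of a 32-bit word."""
    return to_signed16((word >> 16) & MASK16), to_signed16(word & MASK16)

def add2x16(a, b):
    """Sub-word parallel addition: both halves are added independently (wrap-around)."""
    hi = ((a >> 16) + (b >> 16)) & MASK16
    lo = ((a & MASK16) + (b & MASK16)) & MASK16
    return (hi << 16) | lo

def max2x16(a, b):
    """Sub-word parallel maximum, the core of a Max-Log-MAP metric update."""
    ah, al = unpack2x16(a)
    bh, bl = unpack2x16(b)
    return pack2x16(max(ah, bh), max(al, bl))

if __name__ == "__main__":
    s = add2x16(pack2x16(3, -4), pack2x16(10, 1))
    print(unpack2x16(s), unpack2x16(max2x16(pack2x16(3, -4), pack2x16(-7, 5))))
```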
An optimized hand-coded assembly language implementation of an 8-state Turbo-Decoder requires 37 cycles per bit and MAP, using packed data types and the Max-Log-MAP algorithm. Assuming 5 iterations, the Turbo-decoding throughput is 540 kbit/s at 200 MHz. For a Log-MAP implementation the throughput degrades to about 200 kbit/s, with about 100 cycles per bit and MAP decoder.
3.3 TigerSharc
The TigerSharc DSP from Analog Devices is a high-performance architecture targeted at, e.g., wireless infrastructure applications such as cellular base stations. With its VLIW architecture, the TigerSharc is capable of executing up to four instructions in a single cycle and hierarchically combines both types of data-level parallelism: SIMD and SWP.
The TigerSharc is the only processor of the three considered here with dedicated instruction support for the Log-MAP algorithm. A max∗ instruction is provided, which operates only on a set of enhanced communication registers. Transferring values between the ALU register file and these special registers causes significant data transfer overhead. The max∗ instruction performs a (sub-word parallel) maximum selection of the input parameters and adds the respective correction term value from an integrated lookup table to the maximum. Thus, full Log-MAP support is achieved without performance penalty.
Table 1. Turbo-Decoder throughput with (Max-)Log-MAP decoder on DSPs

Processor  Architecture  Clock freq.  Cycles/(bit · MAP)  Throughput @ 5 iter.
Max-Log-MAP
ST120      VLIW, 2 ALU   200 MHz      37                  540 kbit/s
SC140      VLIW, 4 ALU   300 MHz      16                  1875 kbit/s
Log-MAP
ST120      VLIW, 2 ALU   200 MHz      ≈100                ≈200 kbit/s
SC140      VLIW, 4 ALU   300 MHz      50                  600 kbit/s
ADI TS     VLIW, 2 ALU   180 MHz      27                  666 kbit/s
Unfortunately, the integrated LUT is not symmetric about zero. Thus, the max∗ operation is no longer commutative in its two parameters, which complicates validating the implementation's bit-true behavior against a bit-true model written in a high-level language.
A single Log-MAP decoder requires 27 cycles per bit. Assuming 5 iterations, the overall throughput of this Turbo-Decoder implementation on the TigerSharc is 666 kbit/s at 180 MHz.
3.4 Summary
Table 1 summarizes performance results of 3GPP-compliant Turbo-Decoder implementations. On the ST120 and the SC140, the number of cycles per bit and MAP is about three times higher for a Log-MAP than for a Max-Log-MAP implementation. The required Turbo-Decoder throughput for UMTS (up to 2 Mbit/s) can only be reached, approximately, with the SC140 processor using the Max-Log-MAP algorithm (1875 kbit/s). The fastest Turbo-Decoder with a Log-MAP implementation achieves a throughput of only 666 kbit/s. For higher throughput requirements a multiprocessor architecture is mandatory. By using the Max-Log-MAP with ESF = 0.75, the degradation in communication performance can be almost entirely avoided with a negligible implementation overhead (1 cycle/(bit · MAP)).
4 Conclusions
Turbo-Codes are part of the 3G cellular wireless standard. The complexity of the decoding algorithm and the throughput requirements place great demands on the computational power of the signal processing devices. Current DSPs support the kernel operations of the Viterbi algorithm; for the MAP algorithm, however, this support is generally lacking. Using VLIW DSPs, one 3G data channel can be processed using the sub-optimal Max-Log-MAP algorithm, as the throughput of 1875 kbit/s on the Starcore SC140 shows. We therefore propose a Max-Log-MAP with Extrinsic Scaling Factor. Its communication performance is close to that of a Log-MAP implementation, and it is less sensitive to SNR mismatch. The implementation complexity of the ESF Max-Log-MAP is nearly equal to that of the Max-Log-MAP, and it can be implemented on standard DSPs without specific instruction extensions.
Acknowledgements. This work has been supported by the Deutsche Forschungsgemeinschaft (DFG) under grant We 2442/1-1 within the Schwerpunktprogramm “Grundlagen und Verfahren verlustarmer Informationsverarbeitung (VIVA)”.
References
Berrou, C., Glavieux, A., and Thitimajshima, P.: Near Shannon Limit Error-Correcting Coding and Decoding: Turbo-Codes, in: Proc. 1993 International Conference on Communications (ICC ’93), pp. 1064–1070, Geneva, Switzerland, 1993.
Bickerstaff, M., Hughes, G., Nicol, C., Xu, B., and Yan, R.-H.: DSP Systems for Next-Generation Mobile Wireless Infrastructure, in: Proc. ICASSP 2000, pp. 3710–3713, 2000.
Chass, A. and Gubeskys, A.: On Performance/Complexity Analysis and SW Implementation of Turbo Decoding, in: Proc. 2nd International Symposium on Turbo Codes & Related Topics, pp. 531–534, Brest, France, 2000.
Greifendorf, D., Stammen, J., and Jung, P.: The Evolution of Hardware Platforms for Mobile ‘Software Defined Radio’ Terminals, in: Proc. 2002 International Symposium on Personal, Indoor, and Mobile Radio Communications (PIMRC ’02), 2002.
Hagenauer, J., Offer, E., and Papke, L.: Iterative Decoding of Binary Block and Convolutional Codes, IEEE Transactions on Information Theory, 42, 429–445, 1996.
Michel, H. and Wehn, N.: Turbo-Decoder Quantization for UMTS, IEEE Communications Letters, 5, 55–57, 2001.
Robertson, P., Hoeher, P., and Villebrun, E.: Optimal and Sub-Optimal Maximum a Posteriori Algorithms Suitable for Turbo Decoding, European Transactions on Telecommunications (ETT), 8, 119–125, 1997.
Third Generation Partnership Project, 3GPP home page, www.3gpp.org.
Vogt, J. and Finger, A.: Improving the Max-Log-MAP Turbo Decoder, Electronics Letters, 36, 2000.
Vogt, J., Koora, K., Finger, A., and Fettweis, G.: Comparison of Different Turbo Decoder Realizations for IMT-2000, in: Proc. 1999 Global Telecommunications Conference (Globecom ’99), vol. 5, pp. 2704–2708, Rio de Janeiro, Brazil, 1999.
Worm, A., Hoeher, P., and Wehn, N.: Turbo-Decoding without SNR Estimation, IEEE Communications Letters, 4, 193–195, 2000a.
Worm, A., Michel, H., Gilbert, F., Kreiselmaier, G., Thul, M. J., and Wehn, N.: Advanced Implementation Issues of Turbo-Decoders, in: Proc. 2nd International Symposium on Turbo Codes & Related Topics, pp. 351–354, Brest, France, 2000b.