














1 Thailand IC Design Incubator (TIDI), National Electronics and Computer
Technology Center
2 School of Informatics, University of Westminster
3 Applied DSP and VLSI Research Centre, Eastern Mediterranean University
 
 
Copyright © [2006] IEEE. Reprinted from the proceedings of the 2006 IEEE 
International Symposium on Circuits and Systems, ISCAS 2006, pp. 2801-
2804. 
 
This material is posted here with permission of the IEEE. Such permission of 
the IEEE does not in any way imply IEEE endorsement of any of the 
University of Westminster's products or services. Internal or personal use of 
this material is permitted. However, permission to reprint/republish this 
material for advertising or promotional purposes or for creating new collective 
works for resale or redistribution must be obtained from the IEEE by writing to 
pubs-permissions@ieee.org. By choosing to view this document, you agree to 
all provisions of the copyright laws protecting it. 
 
 
The WestminsterResearch online digital archive at the University of Westminster 
aims to make the research output of the University available to a wider audience.  
Copyright and Moral Rights remain with the authors and/or copyright owners. 
Users are permitted to download and/or print one copy for non-commercial private 
study or research.  Further distribution and any use of material from within this 
archive for profit-making enterprises or for commercial gain is strictly forbidden.    
 
 
Whilst further distribution of specific materials from within this archive is forbidden, 
you may freely distribute the URL of the University of Westminster Eprints 
(http://www.wmin.ac.uk/westminsterresearch). 
 
In case of abuse or copyright appearing without permission e-mail wattsn@wmin.ac.uk. 
A High-Speed, Low-Power Interleaved Trace-Back
Memory for Viterbi Decoder
Pasin Israsena#, Izzet Kale*+
#Thailand IC Design Incubator (TIDI), National Electronics and Computer
Technology Center, 112 Thailand Science Park, Paholyothin Rd, Pathumtani,
Thailand 12120
Email: pasin.israsena@nectec.or.th
*Applied DSP and VLSI Research Group, Department of Electronic Systems,
University of Westminster, London, United Kingdom
+Applied DSP and VLSI Research Centre, Eastern Mediterranean University,
Gazimagusa, Mersin 10, KKTC
Email: kalei@westminster.ac.uk
Abstract- This paper presents a high-speed, low-power trace-back memory
structure for a Viterbi decoder. The new memory is based on an array of
registers connected with trace-back signals that decode the output bits on the
fly. The trace-back memory is internally interleaved so that high speed is
achieved while low power consumption is maintained. The structure is used
together with appropriate clock and power-aware control signals. The design is
100% portable and is suitable for a soft-IP approach. Based on an AMS 0.35 µm
CMOS implementation, the trace-back memory is estimated to consume 232 pJ of
energy, which is 53.6% less than a conventional RAM-based design, with a
maximum throughput of 1.1 Gbps.

I. INTRODUCTION

Convolutional coding is widely used in modern digital communication systems,
such as mobile or satellite communications, to achieve low-error-rate data
transmission. The Viterbi algorithm [1], in particular, is known to be an
efficient method for realising maximum-likelihood (ML) decoding of
convolutional codes. Today the Viterbi decoder is widely used in established
systems such as GSM mobile or the IEEE 802.11a wireless LAN standard. With
emerging applications such as DAB, DVB or wearable personal entertainment
devices, wireless communication is also becoming increasingly pervasive. These
applications require devices with ultra-low power consumption. It has already
been shown that the Viterbi decoder can account for more than one third of the
power consumption during baseband processing in second-generation cellular
telephones [2]. Power consumption is therefore the critical design criterion
to be tackled.

A conventional Viterbi decoder consists of three major parts:
* Branch metric computation unit (BMCU)
* Add-compare-select unit (ACSU) containing ACS cell(s) and PMU
* Survivor memory unit (SMU)

[Figure 1. Viterbi Decoder. Blocks: Control Unit, BMU (Branch Metric Unit),
ACS (Add-Compare-Select), PMU (Path Metric Unit), Survivor memory; output:
decoded bits]

For an encoder with constraint length L = K, the BMCU finds the likelihood of
each of the $2^{K-1}$ states of the decoder transferring to a next state under
a particular set of input symbols. The ACSU compares the results to find the
maximum likelihood for each state, updates its path metrics, and generates a
decision bit that uniquely identifies the previous, or surviving, state. These
decision bits are stored in the SMU and used to reconstruct the most likely
state sequence. The transitions of the states at different points in time can
be visualized using a trellis diagram [1].

Viterbi decoders have been the subject of ongoing research, the most recent
work being [3-5], which deals with high-speed implementations. Approaches for
power reduction have also been proposed [6-11]. In this paper we introduce
high-speed, low-power design techniques targeting specifically the key
power-hungry block within the Viterbi decoder, which is the SMU. An SMU usually
employs either register exchange or trace-back techniques. In the register
exchange designs [12]
the decoded bits for each state, given the maximum likelihood path, are stored
directly in the state registers. The approach consumes significant power and
is suitable only for designs with a short decoder length, as the amount of
decoded data stored, which needs constant read/write updates, increases with
the length of the decoding process. For systems that require a longer decoding
length for a better bit error rate, such as those in GSM or CDMA, the
trace-back technique [13] is usually employed. With the trace-back algorithm,
only the information that uniquely identifies the previous state is kept in
the memory, meaning only one bit per state per stage is required. After a
certain decoder length the information is read out and decoded to find the
most likely incoming bit sequence received during that period. The trace-back
SMU is conventionally realized using RAM, and has been shown to contribute
more than half of the power consumption in a conventional Viterbi decoder due
to the expensive memory accesses [8].

In our recently reported design [11], a new memory structure was proposed in
which decoded bits are presented without having to be read out. The structure
was based on an array of registers connected with trace-back signals that
decode the output bits on the fly. Two structures, parallel and systolic, were
employed together with appropriate clock and power-aware control signals to
achieve the low-power goal. The results showed that a power reduction of as
much as 60% was possible with the parallel structure. One problem with the
design, however, was that the speed was relatively slow due to long critical
paths. This paper proposes an improved architecture based on a memory
interleaving technique. The next section reviews the low-power design in [11].
The proposed high-speed improvement is then discussed in Sec. III, followed by
results and discussion in Sec. IV. Sec. V gives the conclusions.

II. LOW POWER TRACE-BACK MEMORY ARCHITECTURE

The low-power memory structure in [11] was based on connections of new cell
elements, each with the structure shown in Figure 2. The cell consists of a
register with additional logic that allows the stored trace-back information
to be read. The logic is also connected to other cells such that, based on the
bit information stored, it can uniquely identify the previous, or surviving,
state so that the decoded bit can be found. An example of the memory structure
for a 64-state Viterbi decoder with a decoding length of 35 is shown in
Figure 3. Power consumption is minimized because, once the data is stored in
the register bits, the result is already presented without having to be read
out again, as is the case when using RAM. The memory structure is also
designed such that the signals propagate through the array to decode the
output bits only once the winner has been decided. Outside that period, no
switching of the decoding logic occurs.

[Figure 2. Cell structure (signals: A, B, data, sel)]

[Figure 3: register array with state rows 0-63 (State 0-31 and State 32-63)
and stage columns 1-35; critical paths H (horizontal) and V (vertical);
outputs decode[0] to decode[34]]

For writing the data to the registers, two approaches were considered. In the
first approach, termed systolic, the data propagates through the stage (or
column) registers, all under the same clock rate (Fig 4). In the second
approach, termed parallel (Fig 5), the data is connected to all the column
registers in parallel, each clocked at a clk/35 rate and appropriately skewed.
The results showed that the parallel structure performs better, with a power
reduction of as much as 60% compared to a conventional RAM-based design.

[Figure 4. Systolic architecture (stages 1 to 35, common clk, decode bits)]
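The trace-back principle that the register array implements in hardware can be illustrated in software. The following is a minimal Python sketch, not the authors' Verilog design: it assumes a small K = 3, rate-1/2 code with generators (7, 5) octal and a hard-decision Hamming branch metric, all chosen only to keep the example short (the decoder in this paper has K = 7 and 64 states). The function names `encode`, `acs_stage`, `trace_back` and `viterbi_decode` are hypothetical. Each stage stores exactly one decision bit per state, and the decoded bits are recovered by walking the stored bits backwards from the winning state.

```python
from typing import List, Sequence, Tuple

K = 3                       # constraint length (assumed for illustration)
G = (0b111, 0b101)          # generator polynomials (assumed, (7,5) octal)
N_STATES = 1 << (K - 1)     # 2^(K-1) states
MASK = N_STATES - 1
INF = float("inf")

Symbol = Tuple[int, ...]


def parity(x: int) -> int:
    """Return 1 if x has an odd number of set bits, else 0."""
    return bin(x).count("1") & 1


def encode(bits: Sequence[int]) -> List[Symbol]:
    """Convolutionally encode bits, starting from the all-zero state."""
    state, out = 0, []
    for u in bits:
        reg = (u << (K - 1)) | state      # newest bit enters at the MSB
        out.append(tuple(parity(reg & g) for g in G))
        state = reg >> 1
    return out


def acs_stage(pm: List[float], rx: Symbol) -> Tuple[List[float], List[int]]:
    """One add-compare-select stage: new path metrics and one decision bit per state."""
    new_pm = [INF] * N_STATES
    decisions = [0] * N_STATES
    for ns in range(N_STATES):
        for b in (0, 1):                  # b selects one of the two predecessors
            reg = (ns << 1) | b
            ps = reg & MASK               # predecessor state
            bm = sum(parity(reg & g) != r for g, r in zip(G, rx))
            cand = pm[ps] + bm
            if cand < new_pm[ns]:
                new_pm[ns], decisions[ns] = cand, b
    return new_pm, decisions


def trace_back(memory: List[List[int]], start_state: int) -> List[int]:
    """Walk the stored decision bits backwards from the winning state."""
    state, decoded = start_state, []
    for decisions in reversed(memory):
        decoded.append(state >> (K - 2))  # input bit is the MSB of the state
        state = ((state << 1) | decisions[state]) & MASK
    decoded.reverse()
    return decoded


def viterbi_decode(symbols: Sequence[Symbol]) -> List[int]:
    """Decode one block: fill the survivor memory, then trace back once."""
    pm = [0.0] + [INF] * (N_STATES - 1)   # encoder is assumed to start in state 0
    memory = []                           # one row of decision bits per stage
    for rx in symbols:
        pm, decisions = acs_stage(pm, rx)
        memory.append(decisions)
    winner = min(range(N_STATES), key=pm.__getitem__)
    return trace_back(memory, winner)
```

Calling `viterbi_decode(encode(bits))` on a block of bits exercises the full write-then-trace-back cycle. In the register-based memory described above, the loop in `trace_back` corresponds to signals rippling through the cell array once the winner is known, rather than to sequential RAM reads.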
[Figure 5. Parallel architecture (stages 1 to 35; skewed clocks clk1 to clk35,
rate clk/35; decode bits)]

III. HIGH-SPEED, LOW POWER SMU

It can be seen from Figure 3 that the major critical paths that dictate the
maximum clock speed are the paths H and V, which run horizontally and
vertically respectively. After the memory has registered all the information
needed to decode a 35-bit block of data, the signals need to propagate
completely through the paths H and V before the first bit of a new set of data
can be registered. This bottleneck severely limits the clock speed of the
memory. The delay of path H depends on the number of stages (memory depth),
whereas the delay of path V depends on the number of states. Although the
design was shown to have adequate speed for a number of applications, its
maximum speed is still quite limited by these critical paths.

In this paper we propose an improvement for a high-speed, low-power solution
based on a simple interleaved trace-back architecture. The proposed structure
is shown in Figure 6. To overcome the problem of propagation in the horizontal
direction, we use two memories (Fig 6a) that alternate complementarily between
read (trace-back read-out) and write states, each lasting for the duration of
the decoding length (35 bits in this case) (Fig 6b). The operation is
controlled by an R/W signal that appropriately gates the clocking signals used
when writing data to the memories. In doing so, the traced-back result of the
previous 35 bits can be read out while the new data is written to the other
memory. Even though the critical path H remains the same, it no longer affects
the clocking speed of the stages during writing. Because the critical path H
is the time taken to decode all 35 bits, the effective critical delay of the
memory in the reading state is only $(T_{AND} + T_{OR})$ per stage, where
$T_{AND}$ and $T_{OR}$ are respectively the delays of the AND and OR gates
within the cell structure. The memory can therefore operate at a much higher
speed, effectively D times faster, where D is the decoder length.

The same argument applies to the vertical path V, where the sel signals from
all states in a column need to be logically ORed together to produce the
decoded bit for that column (Fig 3). Under the same scenario as in the
horizontal case, the effective critical delay of the memory when reading the
state in the V direction is only $(T_{OR1} + T_{OR2})$ per stage, where
$T_{OR1}$ is the delay of the OR gate within the cell structure and $T_{OR2}$
is the delay of the OR gate used for each pair of sel states.

[Figure 6. New trace-back memory connection: a) High-speed, low-power memory
structure (memories M1 and M2, R/W control, clocks clk1 to clk35); b) Write
sequences for memory M1 and M2]

IV. RESULTS AND DISCUSSIONS

To test the effectiveness of the new memory structure, a Viterbi decoder that
complies with the DAB specifications was constructed. The DAB specifications
have a constraint length L of 7 (hence 64 states) and a rate R of 1/4.
Although the memories considered can easily be adapted for both single and
multiple ACS cell approaches, for simplicity the Viterbi structure implemented
here has a single-state ACS unit, requiring 64 clocks per stage operation.
Also, as it is generally considered that the decoder length should be about
five times the constraint length, the trace-back memory has a depth of 35.
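The alternating read/write operation of the two memories in Sec. III, with the depth of 35 and the 64 states given above, can be modelled behaviourally as a double buffer. The sketch below is an illustration under assumptions, not the authors' hardware: the class name `InterleavedTraceBackMemory`, its methods, and the use of Python lists in place of gated-clock register columns are all hypothetical. The attribute `write_bank` plays the role of the R/W signal: it selects which bank (0 for M1, 1 for M2) receives the decision bits of the current stage, while the other bank holds the previous block for trace-back read-out.

```python
from typing import List, Optional, Sequence, Tuple

D = 35          # decoding length (memory depth) from the text
N_STATES = 64   # 2^(K-1) states for constraint length 7


class InterleavedTraceBackMemory:
    """Two banks (M1, M2) that swap read and write roles every `depth` stages."""

    def __init__(self, depth: int = D, n_states: int = N_STATES) -> None:
        self.depth = depth
        self.n_states = n_states
        self.banks: Tuple[List[Tuple[int, ...]], List[Tuple[int, ...]]] = ([], [])
        self.write_bank = 0  # models the R/W signal: 0 -> M1 is written, 1 -> M2

    @property
    def read_bank(self) -> int:
        """Index of the bank currently available for trace-back read-out."""
        return self.write_bank ^ 1

    def write_stage(self, decisions: Sequence[int]) -> Optional[List[Tuple[int, ...]]]:
        """Store one stage of decision bits in the bank being written.

        Returns the full bank when it has just been completed (it then
        becomes the read bank), otherwise None.
        """
        if len(decisions) != self.n_states:
            raise ValueError("one decision bit per state is required")
        bank = self.banks[self.write_bank]
        bank.append(tuple(decisions))
        if len(bank) < self.depth:
            return None
        # Bank is full: toggle R/W. The other bank is reused for writing, so
        # its previous block must already have been read out by this point.
        self.write_bank ^= 1
        self.banks[self.write_bank].clear()
        return self.banks[self.read_bank]
```

A caller would pass each stage's decision bits to `write_stage` and, whenever a full bank is returned, trace it back during the following `depth` write cycles. This is the property the interleaving relies on: the read-out has a whole block period to settle, so the write clock no longer has to wait for the full horizontal path.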
Viterbi decoders based on the two SMU architectures, namely the low-power
version reported in [11] and the new high-speed low-power version, were
implemented in Verilog HDL and functionally verified using Mentor Graphics'
ModelSim. The designs were synthesised for ASIC implementation of the SMU
using Mentor Graphics' Leonardo Spectrum, based on 0.35 µm CMOS technology.
Power dissipation was estimated by observing switching activities and using
the information provided by [14]. The results are given in Table I.

TABLE I. ASIC IMPLEMENTATIONS

ASIC 0.35 µm                          Low power    High-speed low power
Trace-back memory
  Total energy/sample, pJ             182          232
    - Registers                       23.6         26
    - Gates                           30.5         33
    - Nets                            61           98
    - Clock                           67           75
  % Energy/sample compared to RAM     36.4         46.4
  Area, mm²                           1.20         2.5
  Longest delay, ns                   20.45        21
  Delay per stage, ns                 20.45        0.9
  Max bit rate (memory part), Mbps    50           1111
System with one ACS
  Max bit rate, Mbps                  1.1          1.1
  Area, mm²                           1.8          3.02

It can be seen that the new memory design is 22.7 times faster than the
original design. The critical delay per stage of the memory path is 0.9 ns,
which corresponds to a maximum bit rate of 1111 Mbps. Compared to the
requirement for DAB, which can be as high as 5 Mbps [15], the new memory
design can easily handle the DAB requirement without the need to employ deep
sub-micron technology. The bottleneck is now the ACS unit, especially where a
single unit is used as in this example, for which immediate improvements can
be achieved by using techniques such as multiple ACS designs. The power
dissipated in the new memory design is slightly increased due to the
additional logic and routing, but it is still significantly less than that of
the RAM-based equivalent. Given that the SMU contributes more than 50% of the
total power consumed by a Viterbi decoder, the total power reduction obtained
by using the new memory structure is potentially more than 25%. All these
advantages are achieved at the expense of a sizable increase in area, which is
acceptable in most cases as silicon is now plentiful given sub-micron
implementations. The register-based design is also more appropriate for a soft
IP approach, as the code can be 100% portable. This compares favorably with
designs using RAM, where a macro RAM block usually needs to be provided by the
vendor. The benefit of this flexibility is even more apparent when compared to
other approaches such as analog implementation. Deeper sub-micron technology
will allow the memory to be used in even more demanding, ultra high-speed
low-power applications.

V. CONCLUSIONS

A new high-speed, low-power trace-back memory structure for a Viterbi decoder
is proposed. The new memory is based on two interleaved arrays of registers
connected with trace-back signals that decode the output bits on the fly. The
structure is used together with appropriate clock gating and power-aware
control signals. Based on a 0.35 µm CMOS implementation, the trace-back memory
consumes an energy of 232 pJ with a maximum bit rate of 1.1 Gbps.

REFERENCES

[1] A. J. Viterbi, "Error bounds for convolutional codes and an asymptotically
optimum decoding algorithm," IEEE Trans. Inform. Theory, vol. IT-13, pp.
260-269, 1967.
[2] I. Kang and A. Willson Jr., "Low-power Viterbi decoder for CDMA mobile
terminals," IEEE J. Solid-State Circuits, vol. 33, pp. 473-482, Mar. 1998.
[3] K. K. Parhi, "An improved pipelined MSB-first add-compare select unit
structure for Viterbi decoders," IEEE Transactions on Circuits and Systems I:
Fundamental Theory and Applications, vol. 51, pp. 504-511, March 2004.
[4] Liu Xun and M. C. Papaefthymiou, "Design of a 20-mb/s 256-state Viterbi
decoder," IEEE Transactions on Very Large Scale Integration (VLSI) Systems,
vol. 11, no. 6, pp. 965-975, Dec. 2003.
[5] Liu Xun and M. C. Papaefthymiou, "Design of a high-throughput low-power
IS95 Viterbi decoder," Proc. 39th Design Automation Conference, pp. 263-268,
June 2002.
[6] D. Garrett and M. Stan, "Low power architecture of the soft-output Viterbi
algorithm," Proc. ISLPED '98, pp. 262-267, 1998.
[7] I. Kang and A. N. Willson Jr., "Low-power Viterbi decoder for CDMA mobile
terminals," IEEE Journal of Solid-State Circuits, vol. 33, no. 3, pp. 473-482,
March 1998.
[8] D. Oh and S. Hwang, "Design of a Viterbi decoder with low power using
minimum transition traceback scheme," Electronics Letters, vol. 32, no. 22,
pp. 2198-2199, Oct. 1996.
[9] K. Seki, S. Kubota, M. Mizoguchi, and S. Kato, "Very low power consumption
Viterbi decoder LSIC employing the SST (scarce state transition) scheme for
multimedia mobile communications," Electronics Letters, vol. 30, no. 8, pp.
637-639, April 1994.
[10] F. Ghanipour and A. R. Nabavi, "Design of a low-power Viterbi decoder for
wireless communications," Proc. IEEE ICECS 2003, vol. 1, pp. 304-307, Dec.
2003.
[11] P. Israsena and I. Kale, "A Viterbi Decoder with Low-Power Track-Back
Memory Structure for Wireless Pervasive Communications," Proc. International
Symposium on Pervasive Computing, Jan. 2006.
[12] Gennady Feygin and P. G. Gulak, "Architectural Tradeoff for Survivor
Sequence Memory in Viterbi Decoder," IEEE Transactions on Communications, vol.
41, pp. 425-429, 1998.
[13] R. Cypher and C. B. Shung, "Generalized trace-back techniques for survivor
memory management in the Viterbi algorithm," IEEE J. VLSI Signal Processing,
vol. 5, pp. 85-94, Jan. 1993.
[14] AustriaMicroSystems 0.35 µm CMOS Digital Standard Cell Databook, 2003.
[15] ETSI EN 300 401: Radio Broadcasting Systems; Digital Audio Broadcasting
(DAB) to mobile, portable and fixed receiver
