Designing of precomputational-based low-power Viterbi decoder by Yang, JL & Wong, AKK
Title Designing of precomputational-based low-power Viterbi decoder
Author(s) Yang, JL; Wong, AKK
Citation
The 6th IEEE Circuits and Systems Symposium on Emerging
Technologies: Frontiers of Mobile and Wireless Communication
Proceedings, Shanghai, China, 31 May - 2 June 2004, v. 2, p. 603-
606
Issued Date 2004
URL http://hdl.handle.net/10722/46505
Rights Creative Commons: Attribution 3.0 Hong Kong License
IEEE 6th CAS Symp. on Emerging Technologies: Mobile and Wireless Comm.
Shanghai, China, May 31-June 2, 2004
Designing of Precomputational-based Low-Power Viterbi Decoder 
Jing-ling Yang, Alfred K.K. Wong
Department of Electrical and Electronic Engineering 
The University of Hong Kong
jlyang, awong@eee.hku.hk
Abstract- This work addresses the low-power VLSI
implementation of the Viterbi decoder (VD). A new
precomputational scheme applied to the trellis
butterfly calculation is presented. The proposed
scheme is implemented in a 16-state, rate 1/3 VD.
Gate-level power verification indicates that the proposed
design reduces the power dissipated by the original
trellis butterfly calculation by 42%.
I. INTRODUCTION 
The Viterbi algorithm [1], which has been
extensively applied to several decoding and
estimation applications in communication and signal
processing, was introduced in 1967 as a method for
decoding convolutional codes [2].
Convolutional encoding with Viterbi decoding is
one of the most popular forward-error-correction
methods in communication systems. In decoding, the
VD attempts to reconstruct the action of the encoder
from its outputs as received over a noisy channel.
The VD comprises three basic units: a
branch metric unit (BMU), an add-compare-select
unit (ACSU) and a trace back unit (TBU). The VD
dissipates most of its power in the ACSU and TBU.
The feedback loop of the ACSU cannot be parallelized
using a standard method, so the ACSU is normally
considered to be the most critical part of the
implementation of high-speed Viterbi algorithms.
References [3] and [4] showed that, despite the
feedback loop, high speed can be achieved using a
purely feed-forward parallel implementation that
operates on the M-step trellis. A high-speed VD that
must calculate more trellis butterflies in parallel
usually consumes more power.
Numerous techniques for reducing power
dissipation in the TBU, or survivor memory unit
(SMU), have been proposed. The scarce state
transition architecture [5] has been used to reduce the
switching rate of the SMU. In [6], both system and
circuit techniques have been proposed to reduce the
power consumed by the SMU. In ACSU design for
low power consumption, an adaptive VD [7]
dynamically discards some states in the trellis that
have high path metrics during decoding; however,
the extra controls required are too complex.
References [8] and [9] introduce the low-power design
of large-state VDs.
This work presents a precomputation-based
low-power design scheme, which can also be applied to
high-speed VD design, to perform low-power
trellis butterfly calculation. The proposed
architecture is validated by implementing a
16-state, rate 1/3 VD. Designed with a 0.13 µm
standard cell library at a supply voltage of 1.2 V,
the described decoder dissipates 400 µW at a
throughput of 2 MHz, with only minor
design modifications.
II. CALCULATION OF TRELLIS BUTTERFLIES
Fig. 1 shows a basic diagram of a VD. The
function of each block is described briefly below.
Fig. 1: Basic diagram of a Viterbi decoder (input: received channel symbols; output: decoded bits; a Register/Latch block is labelled)
The BMU generates the branch metrics, which
measure the difference between the received symbol
and the symbol that causes each transition between
states in the trellis.
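As an illustration of the branch metric described above, the following Python sketch assumes hard-decision decoding, so that the "difference" is the Hamming distance between the received symbol and the expected encoder output. The function name `branch_metric` and the bit-tuple representation are illustrative assumptions and are not taken from the paper.

```python
def branch_metric(received, expected):
    """Hamming distance between two equal-length bit tuples.

    Assumption: hard-decision inputs, one bit per code symbol position.
    """
    if len(received) != len(expected):
        raise ValueError("symbols must have the same length")
    return sum(r != e for r, e in zip(received, expected))


# For a rate 1/3 code each symbol has three bits, so the metric lies in 0..3.
print(branch_metric((1, 0, 1), (1, 1, 1)))  # 1
print(branch_metric((0, 0, 0), (1, 1, 1)))  # 3
```

Under this assumption the largest possible branch metric for a rate 1/3 code is three, which is the bound the precomputation scheme relies on.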
The ACSU is a collection of butterfly calculations.
Each butterfly calculates the path metrics of the two
paths that connect two old states at time t to a new
state at time t+1, and selects the smaller one as the new
path metric of the new state.
The TBU traces the decision vector to generate
the corrected output sequence. This can be done
either by forward-processing the decisions using the
so-called register-exchange algorithm or by
backward-tracing the previously stored decisions.
The quality of a VD design is primarily measured
by four criteria: coding gain, throughput,
area and power consumption. This work focuses on
the power-efficient implementation of the trellis
butterfly calculation.
0-7803-7938-1/04/$20.00 © 2004 IEEE
A. Trellis Butterfly
A trellis diagram is a simple means of visualizing
the input and output sequences of a convolutional
code [1], [2]. The trellis diagram for a
convolutional code can be subdivided into a number
of basic modules, as shown in Fig. 2, in which BM is
a branch metric, S is the number of states, and
$0 \le j \le S/2-1$. For efficient processing, the VD is
implemented to compute two trellis butterflies in
parallel.
Fig. 2: Trellis Butterfly (state labels recovered from the figure: State j, State 2j)
These trellis butterflies depict the transitions
between two old states at trellis stage i and two new
states at trellis stage i+1.
Each stage of a trellis diagram has $2^{K-1}$ trellis
butterflies, where K is the constraint length of the
convolutional code. To visualize this feature of a
trellis diagram, consider one stage of the trellis
diagram for a convolutional code with a constraint
length of K=3. See Fig. 3.
Fig. 3: One Stage of a Trellis Diagram for K=3 and R=1/n
The VD computes the trellis butterflies by using
basic function units called ACSj, $0 \le j \le 2^{K-1}-1$.
See Fig. 3.
Computing a trellis butterfly includes the
following steps:
1. Reading the path metrics and survivor paths of
states j and j+S/2 at stage i.
2. Computing the path metrics of states 2j and 2j+1
at stage i+1.
3. Comparing the two candidate path metrics of each
new state and selecting the path with the smaller one.
4. Storing the updated path metrics and survivor
paths.
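The four steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' hardware: the function name `butterfly`, the use of a single decision bit per new state in place of a stored survivor path, and the tie-breaking rule (prefer state j) are all assumptions made for the example.

```python
def butterfly(old_pm, j, bm1, bm2, bm3, bm4):
    """One trellis butterfly for old states j and j+S/2.

    old_pm: list of old path metrics, one per state (length S).
    Returns ((pm_2j, dec_2j), (pm_2j1, dec_2j1)); a decision bit of 0 means
    the survivor comes from state j, 1 means it comes from state j+S/2.
    Ties are resolved in favour of state j (an assumption of this sketch).
    """
    s = len(old_pm)
    # Step 1: read the old path metrics.
    pm_j, pm_js = old_pm[j], old_pm[j + s // 2]
    # Step 2: add the branch metrics to form the candidate metrics.
    cand_2j = (pm_j + bm1, pm_js + bm3)
    cand_2j1 = (pm_j + bm2, pm_js + bm4)
    # Step 3: compare and select the smaller candidate for each new state.
    dec_2j = 0 if cand_2j[0] <= cand_2j[1] else 1
    dec_2j1 = 0 if cand_2j1[0] <= cand_2j1[1] else 1
    return (cand_2j[dec_2j], dec_2j), (cand_2j1[dec_2j1], dec_2j1)


# Step 4: store the updated metrics and decisions (4-state toy example).
old_pm = [5, 9, 2, 7]
new_pm = [0] * len(old_pm)
decisions = [0] * len(old_pm)
(new_pm[0], decisions[0]), (new_pm[1], decisions[1]) = butterfly(old_pm, 0, 1, 2, 3, 0)
print(new_pm[:2], decisions[:2])  # [5, 2] [1, 1]
```

In the toy example both new states take their survivor from state j+S/2, because its old metric (2) is smaller than that of state j (5) by enough to outweigh the branch metrics.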
B. ACSU Implementation 
Fig. 4 shows a conventional VLSI architecture for
implementing the ACSU butterfly unit of a VD (see
Fig. 2). The core structure, called ACSj, is confined
within the dotted rectangle. The ACSU runs
recursively: the new path metrics of the current
recursion become the old path metrics of the
subsequent recursion.
Fig. 4: Architecture of Conventional ACSj
In the ACSU implementation (see Fig. 4), two
competing paths arrive at each state in each cycle.
$PM_{i+1}^{2j}$ and $PM_{i+1}^{2j+1}$ represent the survivor path
metrics of states 2j and 2j+1 at time i+1, respectively, and
$PM_i^{j}$ and $PM_i^{j+S/2}$ are the old path metrics of states j
and j+S/2 at time i, respectively. The new path metrics of
states 2j and 2j+1 at time i+1 are calculated in two
butterflies, which are defined by the following equations.
$$PM_{i+1}^{2j} = \min\left(PM_i^{j} + BM_1,\; PM_i^{j+S/2} + BM_3\right) \tag{1}$$
$$PM_{i+1}^{2j+1} = \min\left(PM_i^{j} + BM_2,\; PM_i^{j+S/2} + BM_4\right) \tag{2}$$
For a 16-state, rate 1/3 VD, 32 additions and 16
compare-and-select operations have to be done for
every decoded bit. This number of operations is
significant in relation to the number of operations
associated with the other units, such as the BMU and
SMU. For the high-speed VD with a parallel structure,
the number is even larger. The power consumed by
the ACSU must be minimized to reduce the power
consumed by the VD.
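The operation count quoted above follows from Eqs. (1) and (2): each new state needs two additions and one compare-and-select per trellis stage. A small Python sketch of this arithmetic is given below; the function name `acs_operations` is a hypothetical helper, and the sketch assumes one trellis stage is processed per decoded bit.

```python
def acs_operations(num_states):
    """Return (additions, compare_selects) per trellis stage.

    Assumption: every new state has exactly two incoming branches,
    as in a rate 1/n convolutional code.
    """
    additions = 2 * num_states
    compare_selects = num_states
    return additions, compare_selects


print(acs_operations(16))  # (32, 16)
```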
III. PRECOMPUTATION SCHEME
Precomputation logic, first proposed in [10],
trades area for power in a synchronous
digital circuit. The principle of precomputation logic
is to identify logical conditions at some inputs to
combinational logic under which the remaining inputs
do not affect the output. Since those input values do
not affect the output, their transitions can be disabled
to reduce switching activity.
A. Proposed Precomputation Logic 
Fig. 5 shows the precomputation architecture of a
16-state, rate 1/3 VD. Due to the nature of ACSj,
there are some conditions under which the output is
independent of some of the values of the input
registers and latches. For a rate 1/3 VD, the
maximum branch metric is three. When
the difference between the old path metrics of the two
competing paths exceeds three, the selected path can
be determined independently of the input data of
the other path. Under such a precomputation condition,
referred to as G(X) in Fig. 5, the register
load of these registers and latches can be disabled to
prevent unnecessary switching, thereby conserving power.
The ACSU output is still computed correctly because it
receives all required values from the remaining registers.
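The claim that the selection becomes independent of the other path can be checked exhaustively. The Python sketch below compares a full compare-select against a gated version that ignores the losing path whenever the old path metrics differ by more than three. It assumes 6-bit path metrics (the width used later in the paper) and branch metrics in the range 0 to 3; the function names are illustrative only.

```python
from itertools import product

BM_MAX = 3  # maximum branch metric of a rate 1/3 code (hard decision assumed)


def select_full(pm_a, pm_b, bm_a, bm_b):
    """Conventional compare-select using all four inputs."""
    return min(pm_a + bm_a, pm_b + bm_b)


def select_gated(pm_a, pm_b, bm_a, bm_b):
    """Compare-select that skips the losing path when the gap exceeds BM_MAX."""
    if pm_a - pm_b > BM_MAX:
        return pm_b + bm_b          # pm_a and bm_a are never read
    if pm_b - pm_a > BM_MAX:
        return pm_a + bm_a          # pm_b and bm_b are never read
    return min(pm_a + bm_a, pm_b + bm_b)


# Exhaustive comparison over all 6-bit metrics and all branch metrics 0..3.
mismatches = sum(
    1
    for pm_a, pm_b, bm_a, bm_b in product(range(64), range(64), range(4), range(4))
    if select_full(pm_a, pm_b, bm_a, bm_b) != select_gated(pm_a, pm_b, bm_a, bm_b)
)
print(mismatches)  # 0
```

The two versions always agree because a gap of at least four cannot be closed by a branch metric of at most three.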
Comparing Fig. 4 and Fig. 5 shows that the extra
logic added for the precomputation is the latches and G(X).
Since branch metric values are usually small,
the latches do not add much hardware cost. For
example, an R=1/3 code has a maximum branch
metric of three, so two-bit latches suffice to hold a
branch metric. To be efficient, the selection logic
G(X) must also be simple. The following section
introduces a simple G(X) design.
B. Precomputation Conditions 
To generate the load-disable signal for the
unused registers and latches, a precomputation
function G(X) is required to detect the condition
under which the ACSU output is independent of those
registers and latches. For the ACSU, an obvious G(X) is:
$$\left|PM_i^{j} - PM_i^{j+S/2}\right| > 3$$
When $G_1(X) = PM_i^{j} - PM_i^{j+S/2} > 3$, the selected
path is from state j+S/2, and the new path metrics are
calculated using Eqs. (3) and (4) without the inputs
$PM_i^{j}$, $BM_1$ and $BM_2$.
$$PM_{i+1}^{2j} = PM_i^{j+S/2} + BM_3 \tag{3}$$
$$PM_{i+1}^{2j+1} = PM_i^{j+S/2} + BM_4 \tag{4}$$
When $G_2(X) = PM_i^{j+S/2} - PM_i^{j} > 3$, the selected
path is from state j, and the new path metrics are
calculated using Eqs. (5) and (6) without the inputs
$PM_i^{j+S/2}$, $BM_3$ and $BM_4$.
$$PM_{i+1}^{2j} = PM_i^{j} + BM_1 \tag{5}$$
$$PM_{i+1}^{2j+1} = PM_i^{j} + BM_2 \tag{6}$$
When $\left|PM_i^{j} - PM_i^{j+S/2}\right| \le 3$, no data can be
disabled because all signals are required to compute
the output of the ACSU.
If $PM_i^{j}$ and $PM_i^{j+S/2}$ are 6-bit path metrics
denoted A(a5, a4, a3, a2, a1, a0) and B(b5, b4, b3,
b2, b1, b0), respectively, then the precomputation
conditions $G_1(X)$ and $G_2(X)$ can be calculated
using the following logic expressions.
$$G_1(X) = a_5\,\overline{b_5}\cdot\overline{b_4 b_3 b_2 b_1}$$
$$G_2(X) = b_5\,\overline{a_5}\cdot\overline{a_4 a_3 a_2 a_1}$$
According to G(X), if the probabilities of
obtaining a zero and a one are equal for each bit, then
the probability that G(X) detects
$\left|PM_i^{j} - PM_i^{j+S/2}\right| > 3$ is about 47%
(30/64), under which condition almost half of the
path metric calculation is disabled. Also, when the
load-disable signal is asserted, the comparator
performs fewer switching activities because some of its
input data are not switched.
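The 30/64 figure can be reproduced by enumerating all pairs of 6-bit path metrics. The Python sketch below assumes that the two gating expressions read as a5 AND NOT b5 AND NOT (b4 b3 b2 b1) and its mirror image with A and B swapped; this reading of the printed logic, and the helper names `bit`, `g1` and `g2`, are assumptions of the example.

```python
from fractions import Fraction


def bit(x, k):
    """Return bit k of the integer x."""
    return (x >> k) & 1


def g1(a, b):
    """Assumed gating condition: a5 AND NOT b5 AND NOT (b4 AND b3 AND b2 AND b1)."""
    all_ones = bit(b, 4) and bit(b, 3) and bit(b, 2) and bit(b, 1)
    return bool(bit(a, 5) and not bit(b, 5) and not all_ones)


def g2(a, b):
    """Mirror condition with the roles of A and B exchanged."""
    return g1(b, a)


# Enumerate every pair of 6-bit metrics, each bit equally likely 0 or 1.
pairs = [(a, b) for a in range(64) for b in range(64)]
asserted = sum(1 for a, b in pairs if g1(a, b) or g2(a, b))
both = sum(1 for a, b in pairs if g1(a, b) and g2(a, b))

print(asserted, len(pairs))             # 1920 4096
print(Fraction(asserted, len(pairs)))   # 15/32
print(both)                             # 0
```

Under this reading each condition is asserted with probability 15/64, the two conditions are mutually exclusive, and together they give 30/64 (15/32), or about 47%.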
IV. IMPLEMENTATION AND RESULTS
A 16-state, rate 1/3 VD is designed using the
proposed precomputation logic. The conventional
and the proposed VDs are coded in VHDL and
synthesized with Synopsys using the TSMC 0.13 µm
technology library. Fig. 6 shows the results of a
gate-level simulation. Here INP refers to the original
data input to the convolutional encoder, EOP is the
output of the encoder, and SOUT is the output of the VD.
The power consumptions of the two architectures
are estimated using a gate-level power simulator
based on a real delay model. A 1.2 V supply and a
25 MHz operating frequency are assumed. The
results indicate that the proposed ACSU design has
a 3% larger area, a 1% lower speed and
a 42% lower power than the conventional ACSU 
design. 
V. CONCLUSIONS 
Low-power architectures for the ACSU of a VD
were presented. A 3% increase in area, a 1% decrease
in speed and a 42% reduction in power consumption
were obtained using the proposed architecture.
REFERENCES 
[1] G.D. Forney, Jr., "The Viterbi Algorithm," Proc.
IEEE, vol. 61, pp. 268-278, Mar. 1973.
[2] A.J. Viterbi, "Error Bounds for Convolutional
Codes and Asymptotically Optimum Decoding
Algorithms," IEEE Trans. Inform. Theory, vol.
IT-13, pp. 260-269, Apr. 1967.
[3] G. Fettweis, L. Thiele, H. Meyr, "Algorithm
Transformation for Unlimited Parallelism,"
1990.
[4] Herbert Dawid, Gerhard Fettweis, and Heinrich
Meyr, "A CMOS IC for Gb/s Viterbi Decoding:
System Design and VLSI Implementation,"
IEEE Trans. on Very Large Scale Integration
(VLSI) Systems, vol. 4, no. 1, Mar. 1996.
[5] K. Seki et al., "Very Low Power Consumption
Viterbi Decoder LSIC Employing the SST
(Scarce State Transition) Scheme for Multimedia
Mobile Communications," in IEE Electronics
Letters, vol. 30, no. 8, pp. 637-639, April 1991.
[6] Kang and A. N. Wilson Jr., "Low Power Viterbi
Decoder for CDMA Mobile Terminals," in IEEE
Journal of Solid-State Circuits, vol. 33, no. 3,
pp. 473-482, Mar. 1998.
[7] M-H Chan, W-T Lee, M-C Lin and L-G Chen,
"IC Design of an Adaptive Viterbi Decoder,"
IEEE Trans. on Consumer Electronics, vol. 42,
pp. 52-61, Feb. 1996.
[8] Xun Liu, Marios C. Papaefthymiou, "Design of a
High-Throughput Low Power IS95 Viterbi
Decoder," DAC02.
[9] Tobias Gemmeke, Michael Gansen, and Tobias
G. Noll, "Implementation of Scalable Power and
Area Efficient High-Throughput Viterbi
Decoders," in IEEE Journal of Solid-State
Circuits, vol. 37, no. 7, pp. 941-948, July 2002.
[10] M. Alidina, J. Monteiro, S. Devadas, A. Ghosh,
M. Papaefthymiou, "Precomputation-based
sequential logic optimization for low power,"
IEEE Transactions on Very Large Scale
Integration (VLSI) Systems, vol. 2, no. 4,
pp. 426-436, Dec. 1994.
Fig. 5: Structure of proposed Precomputation Scheme (blocks labelled in the figure: branch metric inputs with load-disable signals, ACS0, ACS1, G(X))
Fig. 6: Gate-Level Simulation Results (waveforms shown: /TB_TOP/RES, /TB_TOP/SOUT, /TB_TOP/INP, /TB_TOP/EOP(2:0); time axis up to 2000)
