The design of an asynchronous BCJR/MAP convolutional channel decoder. by Perta, Kristofer Patrick
University of Windsor 
Scholarship at UWindsor 
Electronic Theses and Dissertations Theses, Dissertations, and Major Papers 
2005 
Recommended Citation 
Perta, Kristofer Patrick, "The design of an asynchronous BCJR/MAP convolutional channel decoder." 
(2005). Electronic Theses and Dissertations. 3820. 
https://scholar.uwindsor.ca/etd/3820 
This online database contains the full-text of PhD dissertations and Masters’ theses of University of Windsor 
students from 1954 forward. These documents are made available for personal study and research purposes only, 
in accordance with the Canadian Copyright Act and the Creative Commons license—CC BY-NC-ND (Attribution, 
Non-Commercial, No Derivative Works). Under this license, works must always be attributed to the copyright holder 
(original author), cannot be used for any commercial purposes, and may not be altered. Any other use would 
require the permission of the copyright holder. Students may inquire about withdrawing their dissertation and/or 
thesis from this database. For additional inquiries, please contact the repository administrator via email 
(scholarship@uwindsor.ca) or by telephone at 519-253-3000ext. 3208. 
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
The Design of an Asynchronous BCJR/MAP Convolutional Channel Decoder

Submitted to the Faculty of Graduate Studies and Research through the
Department of Electrical and Computer Engineering in Partial Fulfillment
of the Requirements for the Degree of Master of Applied Science at the
University of Windsor
Windsor, Ontario, Canada 
2004
ISBN: 0-494-00498-3 
© 2004 Kristofer Patrick Perta
All Rights Reserved. No part of this document may be reproduced, stored or otherwise 
retained in a retrieval system or transmitted in any form, on any medium by any means 
without prior written permission of the author.




Dr. R. Lashkari, External Examiner 
Department of Industrial and Manufacturing Systems Engineering
Dr. H. Wu, Internal Reader 
Department of Electrical and Computer Engineering
Dr. Tepe, Supervisor 
Department of Electrical and Computer Engineering
Dr. B. Shahrrava, Chair 
Department of Electrical and Computer Engineering
December 17, 2004
Abstract
The asynchronous model is the digital design alternative to the everyday synchronous 
circuit design paradigm. Asynchronous circuits are also known as handshaking circuits, 
and they may prove to be a feasible design alternative in the modern digital Very Large 
Scale Integration (VLSI) design environment. Asynchronous circuits and systems offer the 
possibility of lower system power requirements, reduced noise, elimination of clock skew 
and many other benefits.
Channel coding is a useful means of reducing transmission errors caused by the 
communication channel's physical limits. Convolutional coding has come to the forefront 
of channel coding discussions due to the usefulness of turbo codes.
The niche market for turbo codes has typically been satellite communication. The 
usefulness of turbo codes is now expanding into the next generation of handheld 
communication products. It is probable that the turbo coding scheme will reside in the 
next cellular phone one purchases [1].
Turbo coding uses two BCJR decoders in its implementation. The BCJR decoding 
algorithm was named after its creators Bahl, Cocke, Jelinek, and Raviv (BCJR). The BCJR 
algorithm is also known as the Maximum A Posteriori (MAP) algorithm. This 
means a very large part of turbo coding research will encompass the BCJR/MAP 
decoder and its optimization for size, power and performance.
An investigation into the design of a BCJR/MAP convolutional channel decoder is 
presented. It encompasses the use and synthesis of an asynchronous Hardware 
Description Language (HDL) called Balsa. The design is carried through to the gate 
implementation level. Gate-level analysis identifies the key metrics that 
determine the feasibility of an asynchronous design relative to that of the everyday 
clocked paradigm.
Acknowledgments
There are many people to whom I would like to offer thanks and appreciation. The help of 
the many people who have supported me along the way has proven to be titanic and priceless.
First and foremost, I would like to thank my advisor for truly caring and making these 2 
years fly by. Dr. Tepe's expert guidance and colossal technical expertise paved the way for 
a truly enriched and pleasant graduate program.
I would like to extend my thanks to the RCIM group, especially Till Kuendiger for his 
endless patience and tremendous expertise. I would also like to offer thanks to my committee 
members, Dr. Huapeng Wu and Dr. Reza Lashkari, for their patience and guidance.
To my friends, especially my fellow University of Windsor alumni, Pedram Mokrian, Mike 
Howard, Alan Soltis, Collin Hayes, Marianne Dent, Colleen Middaugh and Rita Turchi, I 
would like to say thanks for helping in every conceivable way and giving me great advice.
I'd like to give warm thanks to my family for putting up with me for the past 2 years 
and motivating me to get things done.
Contents
Abstract
Acknowledgments
List of Figures
List of Tables
1 Introduction
1.1 Asynchronous Circuits And Systems
1.1.1 Fundamental Asynchronous Concepts
1.1.2 Benefits Of Asynchronous Systems
1.1.3 Recent Developments In Asynchronous Applications
1.2 Channel Coding
1.2.1 Basic Concepts
1.2.2 Channel Coding Techniques
2 Asynchronous Circuits And Systems
2.1 Introduction
2.2 Bundled Data (BD) Or Single Rail (SR) Protocols
2.2.1 4 Phase Bundled Data Protocol (4PBDP)
2.2.2 2 Phase Bundled Data Protocol (2PBDP)
2.3 Dual Rail Protocols (DRP)
2.3.1 4 Phase Dual Rail Protocol (4PDRP) Or 1-of-2 RTZ Protocol
2.3.2 2 Phase Dual Rail Protocol Or 1-of-2 NRTZ Protocol
2.4 Discussion On Protocol Choice
2.5 Muller Basics
2.5.1 The Muller C-Element
2.5.2 The Muller Pipeline (MP)
3 Convolutional Codes (CC)
3.1 Introduction
3.2 Encoding
3.3 Decoding
3.4 Channel Model
3.5 Sample Values
3.6 Viterbi Algorithm
3.7 BCJR/MAP
3.7.1 Forward Recursion
3.7.2 Backward Recursion
3.7.3 State Transition Matrix
3.7.4 APPs Of The Symbols
4 Viterbi Decoder Using Asynchronous Techniques
4.1 Introduction
4.2 Basic Building Block Concepts
4.3 System Overview
4.4 System Parameters
4.5 Conclusions
5 Designing The BCJR/MAP Decoder
5.1 Introduction
5.2 Design Flow
5.2.1 High Level Languages And Tools
5.2.2 CSP Vs. HDL
5.2.3 Balsa: Asynchronous Hardware Language And Synthesis Tool
5.2.4 Feasibility Design Flow
5.3 The Asynchronous BCJR/MAP Decoder
5.3.1 System Constraints
5.3.2 The Log-MAP Algorithm And Max-Log-MAP Algorithm
5.3.3 Gamma Architecture
5.3.4 Alpha Architecture
5.3.5 Beta Architecture
5.3.6 LLR Architecture
5.3.7 Normalization And The Positive Domain
6 Simulation Architecture and Simulation Results
6.1 Software Tools - Verilog, Synopsys and MatLab
6.2 MatLab Simulation Architecture
6.3 Synopsys Synthesis Simulation Results
6.3.1 Problems Encountered And Possible Remedies
6.3.2 Future Work
7 Summary Of Contributions and Conclusion
7.1 Asynchronous VLSI
7.2 Simulation Architecture
7.3 BCJR/MAP Channel Decoding
7.4 Conclusion
References
A List of Abbreviations
B Matlab Code - see enclosed CD
C Balsa Code - see enclosed CD
D Balsa To Verilog Netlist Mapping Files: For TSMC 0.18 micron, Single Poly, Six Metal, Salicide CMOS Process - see enclosed CD
E Verilog Code - see enclosed CD
F Synopsys Area, Power And Timing Report Files - see enclosed CD
VITA AUCTORIS
List of Figures
1.1 Synchronous Circuit [2]
1.2 Asynchronous Circuit [3]
1.3 Basic Digital Communication System
2.1 Bundled Data Channel
2.2 4-Phase Bundled Data Protocol (4PBDP)
2.3 Transition Signaling Paradigm [13]
2.4 2-Phase Bundled Data Protocol (2PBDP)
2.5 4-Phase Dual Rail Protocol
2.6 2-Phase Dual Rail Protocol For A 2-Bit Wide Channel
2.7 2 Phase, 4 Phase (Push) And 4 Phase (Pull) Protocols
2.8 Binput Handshake Channel
2.9 The C-Element (Denoted By A 'C') And The 'OR' Element Schematic
2.10 Behavior Of C-Element With Inverter [13]
2.11 The Muller Pipeline
3.1 (7,5) Convolutional Encoder With R_c = 1/2 And L = 3 [9]
3.2 Finite State Machine (FSM) For (7,5) Encoder [9]
3.3 Trellis Diagram For The (7,5) Encoder [9]
3.4 Recursive Systematic Convolutional (RSC) Encoder [17]
3.5 Binary Symmetric Channel (BSC) And Binary Erasure Channel (BEC) [9]
3.6 Additive White Gaussian Noise (AWGN) Channel [9]
3.7 Sample Encoder Output Values With Corresponding Channel Error
3.8 Viterbi Algorithm Decoding The Sample Encoder Output Values
3.9 System Diagram Of The BCJR/MAP Decoding Algorithm
3.10 Transitions For NRC codes [9]
4.1 Viterbi Decoder Using Asynchronous Techniques - System Level [12, 8]
5.1 Balsa Asynchronous Design Flow [19]
5.2 Gate-Level and Handshake Component Level Of A Modulo-10 Counter [19]
5.3 The Feasibility Asynchronous Design Flow (For The Area And Power Metrics)
5.4 Gamma Architecture
5.5 Gamma Architecture - Data Type Structures
5.6 Gamma Sub-System Architecture - LUT
5.7 Gamma Sub-System Architecture - Buffer
5.8 1000 Blocks Transmitted Per dB Level - BER Vs. SNR
5.9 Alpha Architecture
5.10 Alpha Architecture - Data Type Structures
5.11 Alpha Sub-System Architecture - Adder
5.12 Alpha Sub-System Architecture - Subtractor
5.13 Beta Architecture
5.14 Alpha Architecture - Data Type Structures
5.15 LLR Architecture
5.16 LLR Architecture - Data Type Structures
5.17 LLR Sub-System Architecture - Minimer
5.18 Alpha Sub-System Architecture - Decision Block
6.1 1000 Blocks Transmitted Per dB Level - BER Vs. SNR - Invalid Balsa Design
6.2 MatLab System Simulation Architecture
6.3 10000 Blocks Transmitted Per dB Level - BER Vs. SNR - trunc_5bit Design
6.4 10000 Blocks Transmitted Per dB Level - BER Vs. SNR - trunc_4bit Design
List of Tables
2.1 1-bit Channel - Encoding Chart
2.2 Comparison Of Protocols [14]
2.3 C-Element Truth Table
2.4 'OR' Truth Table
5.1 Asynchronous BCJR/MAP System Constraints
6.1 breeze-cost Estimates For Area (Units Are Normalized)
6.2 Area, Power And Timing Values
6.3 Comparison To Synchronous MAP Decoder Designs
Chapter 1
Introduction
1.1 Asynchronous Circuits And Systems
Digital VLSI designers are faced with many challenges. Some of the biggest challenges 
they face include [2]:
• Lowering power consumption
• Addressing clock skew issues
• Decreasing noise
• Increasing performance
These challenges have been in existence for some time. They will become increasingly 
prevalent in future digital VLSI designs due to scaling issues with technology. There are 
additional concerns that need to be addressed; however, the issues stated above are staples 
in the designer's healthy diet of problems.
To overcome these obstacles, temporary fixes are currently employed that simply patch 
over the need for a comprehensive solution. The advantages of asynchronous circuits and 
systems, e.g., lower power and lower noise than in synchronous systems, make them an 
attractive alternative and a key to the reduction of VLSI design problems.
Figure 1.1: Synchronous Circuit [2]
1.1.1 Fundamental Asynchronous Concepts
To understand the fundamental concepts of asynchronous circuits and systems, one must 
first examine the synchronous circuit shown in Figure 1.1.
The synchronous circuit works in the following way [2]:
1. The current state is stored in an array of registers.
2. The next state is computed by the combinational logic circuit from the current state 
and the input.
3. When the clock makes a transition from low to high, the registers are enabled and the 
next state is copied into them, thus becoming the new current state.
An important point is that the clock period depends on the length of time that the 
combinational logic circuit takes to process its inputs and deliver a stable output; the 
period must be at least as long as the worst-case delay.
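The three steps above can be sketched in a few lines of Python. This is only an illustrative model and is not taken from [2]: the class name `SyncCircuit`, the method `clock_edge`, and the 2-bit counter used as the combinational logic are assumptions made for the example.

```python
class SyncCircuit:
    """Toy model of a clocked circuit: a state register plus combinational logic."""

    def __init__(self, next_state_logic, initial_state=0):
        self.next_state_logic = next_state_logic  # combinational logic (a pure function)
        self.state = initial_state                # contents of the state register

    def clock_edge(self, circuit_input):
        # Steps 1-2: the next state is computed from the stored state and the input.
        next_state = self.next_state_logic(self.state, circuit_input)
        # Step 3: on the rising clock edge the register captures the next state.
        self.state = next_state
        return self.state


def counter_logic(state, enable):
    """Hypothetical combinational logic: a 2-bit counter with an enable input."""
    return (state + 1) % 4 if enable else state


circuit = SyncCircuit(counter_logic)
trace = [circuit.clock_edge(enable) for enable in (1, 1, 0, 1, 1)]
print(trace)
```

In this model every call to `clock_edge` stands for one full clock period, regardless of how quickly the logic function would actually settle; that fixed cost per step is what the clock period remark above refers to.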
Figure 1.2: Asynchronous Circuit [3]
In an asynchronous system, instead of having a clock as the main synchronizer, 
coordination is accomplished through the use of request (req) and acknowledge (ack) signals. 
This is also known as handshake signaling.
In the asynchronous system in Figure 1.2, each stage of the system contains a register and 
control circuitry. The control circuit communicates with the preceding and succeeding stages 
by handshake signaling. This in turn controls the state of the registers, i.e., transparent 
(open) or opaque (closed), which allows the input to pass towards the next register and so 
forth [2]. The different handshake protocols are discussed in the next chapter.
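As a rough behavioural sketch of this idea, the following Python model moves data through a chain of registers using only local req/ack conditions and no global clock. It abstracts away the wire-level protocol entirely; the function name `step`, the use of `None` for an empty register, and the round-based scheduling are assumptions of the example, not part of the design described in [2].

```python
def step(pipeline, new_item=None):
    """Advance a handshaking pipeline by one round of local req/ack exchanges.

    `pipeline` is a list of register contents; None marks an empty register.
    Returns the item delivered to the environment in this round (or None).
    """
    # The environment accepts whatever the last stage is offering.
    output = pipeline[-1]
    pipeline[-1] = None
    # Work from the output end so each item moves at most one stage per round.
    for i in range(len(pipeline) - 1, 0, -1):
        req = pipeline[i - 1] is not None  # predecessor holds valid data
        ack = pipeline[i] is None          # this stage is free to accept it
        if req and ack:
            pipeline[i] = pipeline[i - 1]  # register opens, captures, closes
            pipeline[i - 1] = None
    # The sender may push a new item only if the first stage is empty;
    # otherwise the item is not accepted and the sender would have to retry.
    if new_item is not None and pipeline[0] is None:
        pipeline[0] = new_item
    return output


pipeline = [None, None, None]
outputs = [step(pipeline, item) for item in ["a", "b", "c", None, None, None]]
received = [item for item in outputs if item is not None]
print(received)
```

Each transfer depends only on the state of two neighbouring stages, which is the essential difference from the clocked model: no stage needs to know about a global timing reference.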
1.1.2 Benefits Of Asynchronous Systems
Asynchronous Systems (AS) offer many advantages over the common clocked Synchronous 
System (SS) design. According to [4], these advantages include, but are not limited to:
1. Elimination of Clock Skew - With the need for larger, more complex integrated 
systems, synchronizing the clock's arrival time at different areas of the circuit 
is increasingly difficult to achieve. AS utilize handshaking protocols to achieve 
the synchronization needed within their circuitry, thus eliminating the clock.
2. Average Case Performance - SS performance is dictated by worst-case conditions. 
AS performance is dictated by average-case conditions, which may lead to increases 
in performance.
3. Adaptivity to Processing and Environmental Variations - SS have their clock 
rate set to allow for correct operation under some allowed variation. AS operate under 
all variations and simply speed up or slow down as necessary.
4. Component Modularity and Reuse - AS have no need to synchronize with a 
global clock; therefore system integration with other sub-systems and reuse 
are simpler than with SS.
5. Lower System Power Requirements - Clock networks are not needed in AS, whereas 
15-45% of the electrical power consumed by an SS chip must be devoted to the 
clock network [5]. No energy is wasted on spurious transitions, reducing 
global dynamic power consumption.
6. Reduced Noise - Activity is uncorrelated in AS, resulting in a more distributed 
noise spectrum and lower peak noise values.
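The average-case argument in item 2 can be made concrete with a small calculation. The sketch below is a simplification under stated assumptions: the per-operation delays are invented for illustration, and handshake overhead and clock margins are ignored, so the numbers only show the shape of the argument and not a measured result.

```python
def synchronous_time(delays):
    # The clock period must cover the slowest operation, so every operation
    # is charged the worst-case delay.
    return max(delays) * len(delays)


def asynchronous_time(delays):
    # A self-timed stage signals completion as soon as it is done, so each
    # operation costs only its actual delay (handshake overhead is ignored).
    return sum(delays)


delays_ns = [2, 3, 2, 9, 2, 3]  # hypothetical per-operation delays in ns
print(synchronous_time(delays_ns), asynchronous_time(delays_ns))
```

With these assumed delays, the one slow operation sets the clock period for all six, which is why the gap between the two totals grows as the delay distribution becomes more uneven.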
1.1.3 Recent Developments In Asynchronous Applications
There are many facets to asynchronous circuit design. However, the idea that 
asynchronous circuits alone are going to revolutionize VLSI design is a false presumption.
The amalgamation of the asynchronous and synchronous conceptual frameworks 
offers promising solutions, such as Central Processing Units (CPUs) that are synchronous 
internally yet communicate asynchronously with memory [6].
An important advantage of AS is their modularity, meaning that they 
are easily incorporated into other synchronous or asynchronous systems. Commercially, this 
has been shown with the use of asynchronous circuits in the UltraSPARC IIIi synchronous 
processor at Sun Microsystems [7].
In the field of communications, Brackenbury et al. [8] have proposed that asynchronous 
techniques can be used in VLSI implementations of communication systems, specifically 
the Viterbi decoder, to yield lower power consumption. Their design yields 23-29% less area 
than a selection of synchronous implementations with the same design parameters 
which use the same fabrication process and cell library.
1.2 Channel Coding
In a communication system, the aim is to transmit information from a source to a target. 
A communication engineer faces many challenges. According to [9], these challenges 
include, but are not limited to:
• Thermal noise
• Changes in signal power
• Short losses of signal power (Erasure)
Channel coding is a useful tool for tackling these challenges. When performing channel 
coding in a communication system, redundant information is introduced into the original 
signal. At the receiving end, this redundancy can aid in recovering an original signal that 
may have been corrupted during transmission.
1.2.1 Basic Concepts
C.E. Shannon [9] demonstrated that channel coding helps realize "reliable" communication 
over noisy channels. Many landmark techniques have since been proposed and used, for 
instance Convolutional Coding (CC). CC is an error correction channel coding technique that 
differs from the block coding technique. In block coding, the data stream is divided into 
a number of blocks and each block is encoded independently into a code word. In CC, the 
entire data stream is encoded into a single code word.
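To illustrate how a convolutional encoder treats the data as one continuous stream, the sketch below implements a rate-1/2 feedforward encoder with the (7,5) octal generator pair, the code that this thesis uses in its later encoder figures. The function name `conv_encode_75`, the output bit ordering, and the absence of a termination tail are assumptions of this example rather than details taken from the thesis.

```python
def conv_encode_75(bits):
    """Rate-1/2 feedforward encoder with generators 7 (111) and 5 (101) octal."""
    s1 = s2 = 0  # two memory elements, i.e. constraint length 3
    out = []
    for u in bits:
        out.append(u ^ s1 ^ s2)  # generator 7: taps on u, s1 and s2
        out.append(u ^ s2)       # generator 5: taps on u and s2
        s1, s2 = u, s1           # shift the register by one position
    return out


codeword = conv_encode_75([1, 0, 1, 1])
print(codeword)
```

Each pair of output bits depends on the current input bit and the two preceding ones, so there are no block boundaries: changing one input bit affects the output pairs for that bit and the two that follow.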
Figure 1.3: Basic Digital Communication System
Coding is a two-part process, which involves encoding the original information at the 
transmitter followed by decoding at the receiving end. Figure 1.3 shows the system diagram 
of a basic digital communication system [9].
First, the source produces a message, a digital signal, which will be transmitted to a 
particular target. If the source produces an analog signal, that signal needs to be 
digitized.
The source encoder removes most redundancies from the message. This is done so that 
the efficiency of the overall system increases.
The channel encoder inserts redundancies in a controlled manner so that, at the receiver, 
the system can detect or correct any errors that may have been introduced by the channel 
or the receiver's front end.
The modulator gives the coded information a new form so that it has the best chance 
of passing through the channel with as little error as possible.
After the signal passes through the channel, the demodulator, channel decoder and source 
decoder each reverse the operation of their counterpart to reproduce the original message.
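The chain described above can be mimicked end to end in a short Python sketch. Every concrete choice here is an assumption made for illustration and not the system studied in this thesis: `zlib` stands in for the source coder, a 3-fold repetition code for the channel coder, BPSK with hard decisions for the modulator and demodulator, and a toy channel that inverts symbols at chosen positions.

```python
import zlib


def source_encode(text):
    # Remove redundancy from the message (zlib used as a stand-in).
    return zlib.compress(text.encode("utf-8"))


def source_decode(data):
    return zlib.decompress(data).decode("utf-8")


def to_bits(data):
    # Most significant bit first.
    return [(byte >> k) & 1 for byte in data for k in range(7, -1, -1)]


def from_bits(bits):
    return bytes(
        sum(bit << (7 - k) for k, bit in enumerate(bits[i:i + 8]))
        for i in range(0, len(bits), 8)
    )


def channel_encode(bits, n=3):
    # Add controlled redundancy: repeat every bit n times.
    return [b for b in bits for _ in range(n)]


def channel_decode(bits, n=3):
    # Majority vote over each group of n received bits.
    return [int(sum(bits[i:i + n]) > n // 2) for i in range(0, len(bits), n)]


def modulate(bits):
    # BPSK mapping: 0 -> +1.0, 1 -> -1.0.
    return [1.0 - 2.0 * b for b in bits]


def demodulate(symbols):
    # Hard decision on the sign of each received symbol.
    return [0 if s >= 0 else 1 for s in symbols]


def channel(symbols, error_positions):
    # Toy channel: invert the symbols at the given positions.
    return [-s if i in error_positions else s for i, s in enumerate(symbols)]


message = "aaaaaaaaaabbbbbbbbbb"
sent = modulate(channel_encode(to_bits(source_encode(message))))
received = channel(sent, error_positions={0, 10, 50})
recovered = source_decode(from_bits(channel_decode(demodulate(received))))
print(recovered == message)
```

The three channel errors fall in different repetition groups, so the majority vote corrects all of them; two errors within one group would defeat this particular code, which is one motivation for the stronger convolutional codes discussed in this thesis.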
1.2.2 Channel Coding Techniques
With channel coding, the main focus lies with the decoder. This is because 
the decoder is more complex than the encoder. Research has therefore largely 
concentrated on the decoder.
There are many different types of decoding algorithms that are used for convolutional 
codes. The most popular are the Viterbi Algorithm (VA) and BCJR/MAP. The Viterbi 
algorithm is a maximum-likelihood decoding algorithm [9]. The BCJR algorithm is known as 
a symbol-by-symbol 'Maximum A Posteriori' (MAP) algorithm. The main reason that we use 
the BCJR/MAP algorithm is that it can calculate soft information for the output symbols, 
which is used in iterative decoding schemes like Turbo Codes [9], [1].
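The difference between soft information and a hard decision can be shown for a single received symbol. The sketch below is not the BCJR algorithm, which combines such values over the whole code trellis; it only computes the channel log-likelihood ratio (LLR) under the assumption of equiprobable BPSK symbols (+1 or -1) in additive white Gaussian noise with standard deviation sigma, for which the LLR is 2y/sigma^2. The function names and the sample values are hypothetical.

```python
import math


def llr_bpsk_awgn(y, sigma):
    """LLR ln[P(x=+1 | y) / P(x=-1 | y)] for equiprobable BPSK in AWGN."""
    return 2.0 * y / sigma ** 2


def hard_decision(llr):
    # Keeps only the sign and discards the reliability.
    return +1 if llr >= 0 else -1


def probability_plus_one(llr):
    # Converts an LLR back into P(x=+1 | y).
    return 1.0 / (1.0 + math.exp(-llr))


samples = [0.9, 0.1, -1.2]  # hypothetical received values
llrs = [llr_bpsk_awgn(y, sigma=1.0) for y in samples]
decisions = [hard_decision(l) for l in llrs]
confidences = [probability_plus_one(l) for l in llrs]
print(decisions)
```

The first two samples give the same hard decision, but their LLR magnitudes differ considerably; an iterative decoder can exploit that reliability information, which a hard-output decoder throws away.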
As mentioned earlier, Brackenbury et al. [8] have proposed th a t asynchronous tech­
niques can be used with application to VLSI implementations of communication systems, 
specifically the Viterbi decoder.
Applying asynchronous techniques, as previously mentioned, can conceivably advance 
or offer a feasible design alternative where synchronous techniques cannot. Asynchronous 
techniques, as seen in [8], can offer the potential for lower power systems.
The natural extension of the work done at the University of Manchester with the asyn­
chronous decoder is to apply asynchronous techniques to other decoding algorithms such 
as the BCJR/M AP. This thesis investigates the design of an asynchronous convolutional 
decoder using the B CJR /M A P algorithm.
Chapter 2
Asynchronous Circuits And Systems
2.1 Introduction
To fully understand asynchronous circuitry, the different types of protocols and building blocks used must be examined.
This chapter examines the various types of protocols used and the building blocks required to implement them. It provides the background needed to understand asynchronous circuits and systems.
2.2 Bundled Data (BD) Or Single Rail (SR) Protocols
The two main protocols in asynchronous circuits and systems are the Bundled Data Protocol (BDP) and the Dual Rail Protocol (DRP) [3].
In some of the literature, the BDPs are known as the Single Rail Protocols (SRP). These terms refer to the separate single req and ack wires that are bundled together with the data signals, see Figure 2.1.
The dot in Figure 2.1 denotes the active side of communication. The receiver does not have a dot because it is passive, i.e., it only responds (ack) when communicated to (req). This type of communication channel is called a push channel. The 'n' denotes the number of data bits.
Figure 2.1: Bundled Data Channel
In push channels, data is always valid before the request signal is put high (for 4-phase) or put into a different Boolean state (for 2-phase). The 4-phase and 2-phase protocols together make up the BD (SR) protocols.
2.2.1 4 Phase Bundled Data Protocol (4PBDP)
The 4-Phase Bundled Data Protocol (4PBDP), known in other texts as the Return To Zero Bundled Data Protocol (RTZBDP), works using the following steps [3], see Figure 2.2:
1. The sender supplies data on the data line and sets the request line high. While the request line is high, constant valid data must be provided on the data line.
2. The receiver then absorbs the valid data and sets the acknowledge signal high.
3. The sender then responds to the receiver by putting the request line low. When the request line is low, the data on the data line is no longer valid.
4. The receiver acknowledges the sender by putting the acknowledge line low. After the acknowledge line is low, the sender can start the process again.
Figure 2.2: 4-Phase Bundled Data Protocol (4PBDP)
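The four handshake steps can be traced in software. The following Python fragment is only an illustrative model of one transfer, not a circuit description; the function name `four_phase_transfer` and the trace format are assumptions made for this sketch.

```python
def four_phase_transfer(value):
    """Walk through one 4-phase (return-to-zero) bundled-data transfer.

    Returns the value latched by the receiver and a trace of
    (event, req, ack, data_valid) tuples, one per handshake phase.
    """
    req, ack = 0, 0
    trace = []

    # Step 1: the sender places data on the bus, then raises req.
    data = value
    req = 1
    trace.append(("req high", req, ack, True))

    # Step 2: the receiver absorbs the data and raises ack.
    latched = data
    ack = 1
    trace.append(("ack high", req, ack, True))

    # Step 3: the sender lowers req; data is no longer guaranteed valid.
    req = 0
    trace.append(("req low", req, ack, False))

    # Step 4: the receiver lowers ack; both wires are back at zero.
    ack = 0
    trace.append(("ack low", req, ack, False))

    return latched, trace


latched, trace = four_phase_transfer(0b1011)
for event in trace:
    print(event)
```

Each transfer costs four signal transitions (two on req, two on ack), which is the return-to-zero overhead of this protocol.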
The advantage of the 4PBDP is its familiarity to most designers. The disadvantage is that the protocol makes use of the Return To Zero (RTZ) behaviour, which wastes time and energy, i.e., the extra transitions needed to return to zero in each communication cycle consume more power and time than the Non Return To Zero (NRTZ) behaviour.
The way a particular asynchronous system 'responds to events' is a complex issue. Stating at the beginning of a design that protocol A is better than protocol B because A is usually quicker may therefore be neither true nor a realistic assumption.
Note: Depending on the convention of the designer, the circuit's handshaking may be initialized on a rising or falling edge.
2.2.2 2 Phase Bundled Data Protocol (2PBDP)
The 2-Phase Bundled Data Protocol (2PBDP) [3], known in other texts as the Non Return To Zero Bundled Data Protocol (NRTZBDP), differs from the 4PBDP in that it uses the NRTZ behaviour.
The 2PBDP also uses a different signaling paradigm called transition signaling or event signaling. Normally, a circuit interprets active or not active from the Boolean level of its enabling signal.
In the transition signaling paradigm, the circuit is enabled through a transition from '0 to 1' or '1 to 0', see Figure 2.3. The Boolean level in transition signaling is not important; the event or transition is the relevant measurement.
Figure 2.3: Transition Signaling Paradigm [13]
The 2PBDP works using the following steps, see Figure 2.4:
1. The receiver sends a transition (in the first case from '0 to 1') on the acknowledge line to signal that it is ready to receive data. The transition on the acknowledge line is also used to signal to the sender that the receiver is done with the previous cycle's data.
2. The sender then puts new valid data on the data line and sends a transition on the request line to indicate that valid data is present. Then the process starts anew.
Figure 2.4: 2-Phase Bundled Data Protocol (2PBDP)
Note: When the communication starts again, the transitions will not be going from '0 to 1' as mentioned. The next cycle will always produce the opposite transition, e.g., from '1 to 0' if the communication started from '0 to 1'. This change of direction is irrelevant in the transition paradigm because a transition from '0 to 1' holds the same meaning as a transition from '1 to 0'.
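The difference between level signaling and transition signaling can be illustrated by counting events on a sampled request wire. This is a minimal sketch; the helper names `count_transitions` and `count_rising_edges` and the sample waveform are hypothetical.

```python
def count_transitions(levels):
    """Count every change of level (rising or falling) on a wire."""
    return sum(1 for a, b in zip(levels, levels[1:]) if a != b)


def count_rising_edges(levels):
    """Count only 0 -> 1 changes, as a level-based 4-phase request would."""
    return sum(1 for a, b in zip(levels, levels[1:]) if (a, b) == (0, 1))


# Sampled request wire: it rises, falls, then rises again.
req = [0, 1, 1, 0, 0, 1]

print(count_transitions(req))   # handshake events under transition signaling
print(count_rising_edges(req))  # requests under 4-phase (RTZ) signaling
```

On this waveform the transition-signaling view sees three request events, whereas the return-to-zero view sees only two, since the falling edge carries no information there.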
2.3 Dual Rail Protocols (DRP)
In the BD [3] convention, there exists a need for delay matching. This delay matching is needed so that the order of the signals' events on the sender's end is preserved on the receiver's end, which ensures that valid data is always present when needed.
There exists another protocol that is insensitive to delays; this protocol is the Dual Rail Protocol (DRP). It has two main variants, the 4-phase and 2-phase DRPs.
The '1-of-2' name refers to the protocol's use of 2 wires to encode 1 bit of data information.
Table 2.1: 1-bit Channel - Encoding Chart
For n = 1    d.t  d.f
Empty         0    0
Valid '0'     0    1
Valid '1'     1    0
Not Used      1    1
2.3.1 4 Phase Dual Rail Protocol (4PDRP) Or 1-of-2 RTZ Protocol
The 4-Phase Dual Rail Protocol (4PDRP) [3], also known as the 1-of-2 RTZ protocol, is very similar to the 4PBDP except that the request signal is encoded into the data signals. The protocol uses two wires per bit of information, d; one wire, d.t, is used to transmit a Boolean logic '1' (true) and one wire, d.f, is used to transmit a Boolean logic '0' (false). Once again, 'n' refers to the number of information bits, see Table 2.1.
This method of communication is robust since the communication between the sender and receiver can be done reliably, regardless of delays. Hence, the protocol is 'delay insensitive'.
The 4PDRP works in the following way, see Figure 2.5:
1. The sender supplies a valid codeword, which also acts as the request going high.
2. The receiver absorbs the valid codeword and sets acknowledge high.
3. The sender then responds to the acknowledge by supplying an empty codeword.
4. The receiver then sets acknowledge low. After the acknowledge line is low, the process can start over again.
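The dual-rail code itself is easy to express in software. The sketch below assumes the usual convention in which a high d.t wire signals a '1' and a high d.f wire signals a '0'; the function names and the least-significant-bit-first ordering are choices made for this example only.

```python
EMPTY = (0, 0)  # (d.t, d.f) with both wires low: no data present


def encode_bit(bit):
    """Return the (d.t, d.f) pair for one data bit."""
    return (1, 0) if bit else (0, 1)


def encode_word(value, n):
    """Encode an n-bit value as n dual-rail pairs, least significant bit first."""
    return [encode_bit((value >> i) & 1) for i in range(n)]


def decode_word(pairs):
    """Return the integer value, or None while any bit is still empty."""
    value = 0
    for i, pair in enumerate(pairs):
        if pair == (1, 1):
            raise ValueError("(1, 1) is not a valid dual-rail codeword")
        if pair == EMPTY:
            return None  # the receiver must keep waiting
        value |= pair[0] << i
    return value


word = encode_word(0b10, 2)
print(word, decode_word(word))
print(decode_word([EMPTY, EMPTY]))
```

Because validity is carried by the code itself, the receiver can detect the arrival of a complete word without a separate request wire, which is what makes the protocol insensitive to wire delays.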
Figure 2.5: 4-Phase Dual Rail Protocol
2.3.2 2 Phase Dual Rail Protocol Or 1-of-2 NRTZ Protocol
The 2-Phase Dual Rail Protocol (2PDRP), otherwise known as the 1-of-2 NRTZ protocol, also encodes the request signal in the data signals. It also uses 2 wires {d.t, d.f} per bit. The information is encoded as transitions (events), as discussed in Section 2.2.2. Unlike the 4-phase system in Section 2.3.1, there is no empty value. Valid information is acknowledged, followed by another valid message, and then the process continues. Figure 2.6 demonstrates a 2-bit wide channel using the 2PDRP.
2.4 Discussion On Protocol Choice
The previous sections have outlined the main protocols used to design asynchronous circuits. However, these 4 protocols can be combined to offer a plethora of protocol possibilities. In addition, the dual rail protocol can be extended to other encodings, e.g., 4 wires to encode 2 bits of data, known as 1-of-4.
To give a general idea of the many protocol possibilities available for asynchronous circuits, the other channel types must be briefly mentioned. Thus far, the 'push' channel is the only channel type discussed.
In fact, there are 3 other channel types. The 'nonput' channel passes no data between sender and receiver. It is simply used as a coordinating link, i.e., a link carrying req and ack but no data.
Figure 2.6: 2-Phase Dual Rail Protocol For A 2-Bit Wide Channel
The 'pull' channel is the opposite of the push channel in that the receiver is the active party in the initiation of communication and the sender is the passive member. Figure 2.7 shows the different data validity schemes for the different types of channels.
The 'biput' channel is where data is passed in both directions along with the acknowledge and request signals, see Figure 2.8.
The choice of protocol offers different features that may or may not suit the needs of the application. The protocol is an important choice and one that will influence the application's behaviour.
In [14], a comparison of asynchronous design styles was presented. The authors compared the 2PBDP, 2PDRP, 4PDRP, 1-of-4 RTZ and 1-of-4 NRTZ.
The comparison study yielded interesting results with respect to the potential of the 1-of-4 protocol. However, a general view of all protocols is outlined, which tends to show that the 2PBDP is a very reliable protocol with respect to area and power, see Table 2.2.
The 2PBDP offers a potential increase in circuit speed and decrease in area (wires/bit) and energy (transitions/bit). This increase in speed and decrease in energy and area costs is directly correlated to the NRTZ behaviour that the 'transition signaling' paradigm offers.
Figure 2.7: 2 Phase, 4 Phase (Push) And 4 Phase (Pull) Protocols
Figure 2.8: Biput Handshake Channel
Table 2.2: Comparison Of Protocols [14]
Protocol        Area (wires/bit)   Energy (transitions/bit)
2PBDP           1                  1/2 (avg.)
2PDRP           2                  1
4PDRP           2                  1
1-of-4 (RTZ)    2                  1
1-of-4 (NRTZ)   2                  1/2 (avg.)
The downfall of the 2PBDP is the overhead required to implement the delay matching circuitry and the extra level of complexity of the transition signaling circuitry. However, the speed gained in using the 2PBDP would be beneficial due to the computationally heavy nature of the BCJR/MAP algorithm. This will be investigated in later chapters.
2.5 Muller Basics
Muller [15] invented the Muller C-Element in 1959. It is the most often used primitive in asynchronous circuit implementations. The Muller pipeline, which is used in most asynchronous protocol implementations, is the backbone for handshaking circuitry [3].
2.5.1 The Muller C-Element
In asynchronous circuits, signals are required to be valid at all times. Indication and acknowledgment play an important role in the design of these circuits. This means that every signal transition has a meaning and that hazards and races should be avoided [3].
The C-Element primitive is better suited to these types of constraints. The C-Element is a state-holding element that acts as an 'OR' element for events. Unlike the 'OR' element, the C-Element can represent an acknowledgment when both inputs are '1' or '0', see Figure 2.9, Table 2.3 and Table 2.4.
Table 2.3: C-Element Truth Table
X  Y  Z
0  0  0
0  1  Retain previous Z value
1  0  Retain previous Z value
1  1  1
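The state-holding behaviour of the C-Element truth table can be captured by a very small behavioural model. This is a sketch for illustration only; the class name `CElement` and its `update` method are assumptions, and no timing or hazard behaviour is modelled.

```python
class CElement:
    """Behavioural model of a two-input Muller C-Element."""

    def __init__(self, z=0):
        self.z = z  # stored output value

    def update(self, x, y):
        # The output follows the inputs only when they agree;
        # otherwise the previous output value is retained.
        if x == y:
            self.z = x
        return self.z


c = CElement()
for x, y in [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]:
    print(x, y, c.update(x, y))
```

The output rises only after both inputs have risen and falls only after both have fallen, so a transition on the output indicates that both inputs have made their transition.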
Figure 2.9: The C-Element (Denoted By A 'C') And The 'OR' Element Schematic
IF X and Y differ in state
THEN copy X for Z
ELSE hold previous state
Figure 2.10: Behavior Of C-Element With Inverter [13]
2.5.2 The Muller Pipeline (MP)
In most asynchronous protocols, the backbone of the circuit implementation is the MP. The MP is built from C-Elements and inverters, and it is the structure that controls the handshaking in asynchronous circuits.
To understand the concept of the MP [3], the MP must first be put in its initial state. The initial state of all C-Elements is X = '0' and Y = '1', therefore Z = '0'.
When an inverter (denoted by a circle) is placed on an input of the C-Element, the C-Element works as shown in Figure 2.10.
Network A, in Figure 2.11, is ready to send data to the MP. It sends a constant '1' on the req(in) line. This starts a ripple effect through the MP.
Examining 'node 1', we see that after A has sent a '1', the value at 'node 1' will become a '1'. This changes the Req signal heading to the 2nd C-Element. This also sends an ack signal back to A.
'Node 2' then changes to a '1', bringing the value at 'node 3' to '1'. This has no effect on the first C-Element. The first C-Element is now in a ready state. The '1' from the req(in) line propagates through all the C-Elements and puts each C-Element in a ready state (X = '1', Y = '1' and Z = '1' (previous state)).
Once all C-Elements are in the ready state, the system can send either a constant '0' or '1' on the req(in) line. This '0' or '1' will propagate through the MP as did the first request of '1'. The MP acts as a flow-through First-In First-Out (FIFO) circuit.
In an abstract way, the MP can be thought of as carrying a wave, where the wave carries either a '1' or a '0' through the MP.
There are a few other points that should be mentioned about the MP. Data does not necessarily have to flow from A to B. The symmetry of the MP makes it bi-directional: the complexity from A to B is the same as from B to A. Therefore, communication can start from either 'A to B' or 'B to A'.
The MP can also be used for 2-phase or 4-phase protocols. The only difference between the two protocols is how the designer interprets the signals.
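The wave-like behaviour of the MP can be reproduced with a small relaxation model. The sketch below assumes the common arrangement in which each C-Element takes the output of the stage on its left and the inverted output of the stage on its right; the function names are hypothetical, and updating the stages one after another until nothing changes is a modelling shortcut rather than a timing-accurate simulation.

```python
def c_element(x, y, z_prev):
    """Output follows the inputs when they agree, else holds."""
    return x if x == y else z_prev


def settle(stages, req_in, ack_in):
    """Relax a Muller pipeline until no stage output changes.

    stages  : current C-Element outputs, left to right
    req_in  : request level driven by the left environment
    ack_in  : acknowledge level driven by the right environment
    """
    stages = list(stages)
    n = len(stages)
    changed = True
    while changed:
        changed = False
        for i in range(n):
            left = req_in if i == 0 else stages[i - 1]
            right = ack_in if i == n - 1 else stages[i + 1]
            new = c_element(left, 1 - right, stages[i])  # inverter on right input
            if new != stages[i]:
                stages[i] = new
                changed = True
    return stages


s = settle([0, 0, 0], req_in=1, ack_in=0)   # a '1' ripples through
print(s)
s = settle(s, req_in=0, ack_in=0)           # a '0' follows behind it
print(s)
s = settle(s, req_in=1, ack_in=0)           # next '1' queues up
print(s)
```

With the right-hand side never acknowledging, the successive waves stack up behind one another instead of overwriting each other, which is the FIFO behaviour described above.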
Figure 2.11: The Muller Pipeline
Chapter 3
Convolutional Codes (CC)
3.1 Introduction
P. Elias [9] proposed CC in 1955. CC is an error correction channel coding scheme that is an alternative to block codes.
As previously mentioned, in block coding, a data stream is broken up into separate blocks of information and then each block is encoded into an n-bit codeword. CC differs from block coding in that the complete data stream is encoded into a single codeword [10].
There are several types of decoding algorithms used for CC. This chapter will focus on the Viterbi Algorithm and the BCJR/MAP algorithm.
3.2 Encoding
Before the decoding algorithms are discussed, the encoding process must be examined. In CC, encoding is accomplished through shift registers and adders ('XOR' elements), see Figure 3.1.
The ratio (R_c) of source bits (k) to code bits (n) is known as the code rate, i.e., R_c = k/n.
Figure 3.1: (7,5) Convolutional Encoder With R_c = 1/2 And L = 3 [9]
Figure 3.2: Finite State Machine (FSM) For (7,5) Encoder [9]
Figure 3.3: Trellis Diagram For The (7,5) Encoder [9]
The constraint length (L) is defined as M + 1, where M is the number of memory elements needed. The memory describes how many bits in the input sequence affect the output sequence at a time. The generator sequence can be thought of as a description of the actual wiring of the encoder, i.e., a binary '1' means that a connection exists and a binary '0' represents the absence of a connection. The sequence is usually written in the octal number system, e.g., (7,5)_8 = (111, 101)_2. Figure 3.1 shows a (7,5) convolutional encoder with R_c = 1/2 and L = 3.
Before the source bits of information enter it, the encoder is filled with zeros, i.e., it is in the all-zero state. Bringing the encoder to this state is sometimes known as 'padding the system with zeros', 'flushing' or 'trellis termination'. 'Flushing' makes the decoding process easier but causes rate loss. The loss is acceptable for long block lengths, but for relatively short block lengths it is unacceptable and degrades the system's performance.
'Flushing' works in the following way: the information bits are entered k bits at a time, and the corresponding code bits are transmitted over the channel. This process continues and, when all sets of k bits have been sent, the encoder is fed zeros to put it in the all-zero state. The system is then ready for the next transmission. The number of zeros needed for padding is k(L - 1). For the (7,5) encoder example, the system needs k(L - 1) = 1(3 - 1) = 2 'zeros'.
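The (7,5) encoder and its flushing step can be sketched in a few lines of Python. This is an illustrative software model rather than the hardware of Figure 3.1; the function name `conv_encode_75` and the `flush` parameter are assumptions made for this example.

```python
def conv_encode_75(bits, flush=True):
    """Rate-1/2 (7,5) non-recursive convolutional encoder, L = 3.

    Returns one (x1, x2) output pair per input bit. With flush=True,
    L - 1 = 2 zeros are appended to return the encoder to the
    all-zero state.
    """
    L = 3
    s1 = s2 = 0  # shift register contents, starting in the all-zero state
    stream = list(bits) + ([0] * (L - 1) if flush else [])
    out = []
    for u in stream:
        x1 = u ^ s1 ^ s2  # generator 7 = (111)_2: taps on u, s1 and s2
        x2 = u ^ s2       # generator 5 = (101)_2: taps on u and s2
        out.append((x1, x2))
        s1, s2 = u, s1    # shift the register by one position
    return out


print(conv_encode_75([1, 0, 1, 1]))
```

For four information bits the encoder emits six output pairs, so the effective rate drops from 1/2 to 4/12 = 1/3, which illustrates why the rate loss matters for short blocks.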
For convolutional encoding, there are two main types of encoders: the Recursive Systematic Convolutional (RSC) encoder, see Figure 3.4, and the Non-Recursive Convolutional (NRC) encoder, see Figure 3.1. In a turbo encoder, we would typically find 2 RSC encoders. The RSC encoders are used due to their performance advantage at low Signal To Noise Ratio (SNR) values in the communication channel [16]. However, for a standalone BCJR/MAP decoder, using one type of encoder over another does not matter in terms of performance.
RSC encoders differ from NRC encoders in the feedback present in their shift register implementations, see Figure 3.4. NRC encoders are implemented as Finite Impulse Response (FIR) filters, whereas RSC encoders are implemented as Infinite Impulse Response (IIR) filters.
Figure 3.4: Recursive Systematic Convolutional (RSC) Encoder [17]
For the sake of simplicity in this project, an NRC encoder was utilized. This has ramifications when the Log Likelihood Ratio (LLR) calculation is executed in the decoder, i.e., more storage and additions are needed with the RSC implementation because all three metrics (alpha, beta and gamma) are needed for the LLR calculation, unlike the NRC implementation where only two metrics are needed (alpha and beta). Therefore, an NRC code implementation will more than likely have smaller area and lower power consumption than an RSC implementation, without sacrificing performance.
The convolutional encoder can be presented in various forms, e.g., functional diagram 
shown in Figure 3.1, finite state diagram shown in Figure 3.2 and the trellis diagram shown 
in Figure 3.3.
3.3 Decoding
Decoding is the reverse of encoding. Compared to encoding, decoding is more complex in terms of the amount of computation and memory required. There are two types of decoding frameworks, known as hard-decision and soft-decision decoding.
Hard-decision decoding uses two quantized levels from the demodulator as the symbols for comparison; these levels are the binary values '0' and '1'. Soft-decision decoding uses more than two quantized levels, which yields better performance. Hard-decision decoding incurs a performance loss of 2 to 3 dB relative to soft-decision decoding [11].
3.4 Channel Model
Before the decoding algorithms are demonstrated, the channel model used must first be discussed. A channel is the medium that "transfers or stores information" in a communication cycle [9]. The physical medium can be wire (e.g., twisted pair), air (wireless), glass (fibre-optic), etc.
In this study, the modulator, channel and demodulator make up the channel model. There are three important channel models. Two of them, the Binary Symmetric Channel (BSC) and the Binary Erasure Channel (BEC), are forms of the Discrete Memoryless Channel (DMC), see Figure 3.5. The channel model used in this thesis is the Additive White Gaussian Noise (AWGN) channel, which is a type of discrete-time channel characterized by a set of probability density functions, see Figure 3.6.
The input, $X = \{0, 1\}$, to a channel in a digital communication system is a sequence of binary numbers. The output, $Y = \{0, 1\}$, is also a sequence of binary numbers. Suppose the channel introduces "statistically independent errors" in the transmitted binary sequence with average probability $p$.
Therefore,
$$\Pr(Y = 0 \mid X = 1) = \Pr(Y = 1 \mid X = 0) = p \tag{3.1}$$
$$\Pr(Y = 1 \mid X = 1) = \Pr(Y = 0 \mid X = 0) = 1 - p \tag{3.2}$$
Thus, the BSC is obtained, see Figure 3.5.
Another DMC is the BEC, with inputs $X = \{0, 1\}$ and outputs $Y = \{0, e, 1\}$, where $e$ is an 'erasure', meaning the channel cannot decide if the output is a '1' or a '0', see Figure 3.5.
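The BSC of Equations 3.1 and 3.2 can be simulated directly: each bit is flipped independently with probability p. The following is a minimal sketch; the function name `bsc` and the use of a seeded `random.Random` generator are choices made for this example.

```python
import random


def bsc(bits, p, rng):
    """Pass bits through a binary symmetric channel with crossover probability p."""
    # Each bit is flipped independently with probability p.
    return [b ^ (1 if rng.random() < p else 0) for b in bits]


rng = random.Random(1)       # seeded so that runs are repeatable
sent = [0] * 20000
received = bsc(sent, 0.1, rng)
print(sum(received) / len(received))  # empirical error rate, close to p
```

Since every flip is drawn independently of all others, the model has no memory, which is the defining property of a DMC.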
Figure 3.5: Binary Symmetric Channel (BSC) And Binary Erasure Channel (BEC) [9]
Figure 3.6: Additive White Gaussian Noise (AWGN) Channel [9]
As mentioned, the AWGN channel is the channel model used in this thesis. It is shown in Figure 3.6. This channel model includes the "output alphabet equal to the entire real line", $Y = (-\infty, \infty)$, and the "input alphabet has a finite number of symbols", $X = \{x_1, x_2, \ldots, x_{I-1}\}$. The AWGN channel is characterized by the following set of probability density functions,
$$p(y \mid X = x_i), \quad \text{where } i = 1, 2, \ldots, I - 1 \tag{3.3}$$
The channel equation is as follows,
$$Y = X + N \tag{3.4}$$
where $N$ is a zero-mean Gaussian random variable with variance $\sigma^2$ and $X = x_i$, where $i = 1, 2, \ldots, I - 1$. The probability density function of $Y$ for a given $X$ is Gaussian with mean $x_i$ and variance $\sigma^2$. Then,
$$p(y \mid X = x_i) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{-\frac{(y - x_i)^2}{2\sigma^2}\right\} \tag{3.5}$$
Therefore, for a given input sequence of $n$ symbols, $x_{t,i}$, where $t = 1, 2, \ldots, n$ and $x_{t,i} \in X$, the output at time $t$ is given by the following equation,
$$y_t = x_{t,i} + n_t \tag{3.6}$$
The memoryless condition of the channel can be expressed as,
$$\Pr(y_1, y_2, \ldots, y_n \mid X = x_{1,i}, \ldots, x_{n,i}) = \prod_{t=1}^{n} \Pr(y_t \mid X = x_{t,i}) = \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{n} \exp\left\{-\frac{1}{2\sigma^2} \sum_{t=1}^{n} (y_t - x_{t,i})^2\right\} \tag{3.7}$$
The channel also, in our case, depends on the modulation used. This thesis uses the Binary Phase Shift Keying (BPSK) modulation technique. This means that any level value of '0' gets mapped to '-1' and any level value of '1' gets mapped to '+1'.
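The BPSK mapping and the Gaussian densities of Equations 3.5 and 3.7 translate directly into code. The sketch below is illustrative only; the function names and the sample received values are hypothetical.

```python
import math


def bpsk(bit):
    """Map '0' to -1 and '1' to +1."""
    return 1.0 if bit else -1.0


def awgn_pdf(y, x, sigma):
    """Gaussian density of the received value y given transmitted symbol x (Eq. 3.5)."""
    return math.exp(-((y - x) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)


def sequence_likelihood(ys, bits, sigma):
    """Likelihood of a received sequence on a memoryless channel (Eq. 3.7)."""
    result = 1.0
    for y, bit in zip(ys, bits):
        result *= awgn_pdf(y, bpsk(bit), sigma)  # product of per-symbol densities
    return result


ys = [0.8, -1.1, 0.3]  # hypothetical received samples
print(sequence_likelihood(ys, [1, 0, 1], sigma=0.5))
print(sequence_likelihood(ys, [0, 1, 0], sigma=0.5))
```

The first candidate sequence lies much closer to the received samples than the second, so its likelihood is far larger; a soft-decision decoder exploits exactly this graded information.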
3.5 Sample Values
Section 3.6 uses the values in Figure 3.7, which gives the encoder output together with the corresponding channel errors, i.e., the input to the decoder after demodulation.
Note: Channel errors in Figure 3.7 are highlighted in red.
Figure 3.7: Sample Encoder Output Values With Corresponding Channel Error
(For t = 1 to 12, the figure lists the information bits and the two encoder outputs, including the padding bits, followed by the two received sequences at the decoder with the errors introduced during transmission marked.)
3.6 Viterbi Algorithm
The Viterbi algorithm is a maximum-likelihood decoding algorithm [9]. Figure 3.8 shows an example of how the Viterbi Algorithm works. The bold line along the top is the final most likely decoded sequence, all '0' transitions, which is the correct decoded sequence for the sample. The lightly coloured lines are the transitions that are eliminated during the algorithm as being unlikely. The circled numbers signify the local winners. The final circled number at the top right signifies the global winner.
In its simplest form, a specific sequence is chosen as the most likely sequence if the likelihood of that sequence is larger than the likelihood of all other possible sequences.
The algorithm uses the following computational properties to find the most likely sequence:
• Branch Word or Label: The encoder output due to the transition of the encoder from one state to another, see Figure 3.8.
• Branch Metric: The distance between the received sequence and the possible branch word. For hard-decision decoding it is usually the Hamming distance, and for soft-decision decoding the Euclidean distance. The Hamming distance between two code words, $c_i$ and $c_j$, is the number of components in which the two words differ, and is denoted by $d(c_i, c_j) = d^H_{ij}$. For antipodal (BPSK) signaling, the Euclidean distance, $d^E_{ij}$, can be expressed in terms of the Hamming distance,
$$\left(d^E_{ij}\right)^2 = 4 E \, d^H_{ij} \tag{3.8}$$
where $E$ is the signal energy [17].
• Path Metric: The summation of branch metrics.
• Trellis: The basis for a visual and structural representation of the calculations that need to be carried out for the algorithm. The trellis shows the states or nodes of the encoder at different moments in time.
Figure 3.8: Viterbi Algorithm Decoding The Sample Encoder Output Values
Now that the structural metrics have been defined, the following steps make up the Viterbi algorithm:
1. The path metrics must be found for every path at each node. This is done by adding the appropriate branch metric to the related surviving path metric.
2. The paths that enter each node or state (there will be two paths for the $K = 3$ case) must be compared. The path with the smallest path metric is selected. This path is the survivor path. This step needs to be performed in parallel for all $2^{K-1}$ states.
3. This procedure needs to be repeated until the end of the trellis is reached.
When the end of the trellis is reached, all paths merge into a common state (if the system is padded or flushed). This means that only one path survives. Tracing back along that path yields the most likely original signal.
In practice, the depth of the trellis is five times the constraint length; deepening the trellis further gives no performance increase. However, the depth is dependent on the rate of the encoder.
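The add-compare-select steps of the Viterbi algorithm can be sketched for the flushed (7,5) code with hard decisions. This is a software illustration only: the function name `viterbi_75` is an assumption, complete survivor paths are stored instead of performing a trace-back, and the error positions in the example are hypothetical rather than taken from Figure 3.7.

```python
def viterbi_75(received):
    """Hard-decision Viterbi decoder for the flushed (7,5) code.

    received : list of (r1, r2) hard-decision pairs, one per trellis step
    Returns the decoded information bits (padding removed) and the
    path metric of the surviving path.
    """
    INF = float("inf")
    n_states = 4                           # 2^(K-1) states for K = 3
    metric = [0] + [INF] * (n_states - 1)  # encoder starts in the all-zero state
    paths = [[] for _ in range(n_states)]

    for r1, r2 in received:
        new_metric = [INF] * n_states
        new_paths = [None] * n_states
        for state in range(n_states):
            if metric[state] == INF:
                continue                   # state not yet reachable
            s1, s2 = state >> 1, state & 1
            for u in (0, 1):
                x1, x2 = u ^ s1 ^ s2, u ^ s2          # branch word
                branch = (x1 != r1) + (x2 != r2)      # Hamming branch metric
                nxt = (u << 1) | s1                   # next encoder state
                candidate = metric[state] + branch    # add
                if candidate < new_metric[nxt]:       # compare and select
                    new_metric[nxt] = candidate
                    new_paths[nxt] = paths[state] + [u]
        metric, paths = new_metric, new_paths

    # A flushed code ends in the all-zero state; drop the 2 padding bits.
    return paths[0][:-2], metric[0]


rx = [(0, 0)] * 12
rx[2] = (1, 0)   # hypothetical channel error
rx[4] = (0, 1)   # hypothetical channel error
print(viterbi_75(rx))
```

Both errors are corrected because the all-zero path accumulates a path metric of only 2, while every competing path through the trellis accumulates more.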
3.7 BCJR/MAP
Bahl et al. [24] first published the BCJR algorithm, otherwise known as the 'symbol by symbol' Maximum A Posteriori (MAP) algorithm, in 1974. This algorithm was put aside for decoding convolutional codes due to its inherent complexity. However, the BCJR/MAP algorithm gained a second wind with the introduction of Turbo Codes. The main reason to use this algorithm is that it produces soft likelihood values of the symbols on the output.
"The BCJR/MAP is an optimum decoding algorithm used to minimize the symbol error rate for the terminated convolutional codes. The BCJR/MAP estimates the A Posterior Probability (APP) of the states and the state transitions of a convolutional code and then the APPs of the symbols. [9]"
The BCJR/MAP algorithm, though complex, is very similar to the Viterbi algorithm. As mentioned above, the Viterbi algorithm, in its simplest form, chooses the most likely sequence as the final decoded sequence. This is done through several computational steps in a forward motion (left to right) through the trellis.
Like the Viterbi algorithm, the BCJR/MAP algorithm searches through the trellis, but it does so both Forward and Backward (FB), for a 'received channel sequence' defined by $Y_1^L = \{Y_1, Y_2, \ldots, Y_L\}$ created by a 'channel input sequence' (encoder) $X_1^L = \{X_1, X_2, \ldots, X_L\}$. The BCJR/MAP is otherwise known as the FB algorithm. Depending on the channel model, $Y_t = (y_{t,1}, y_{t,2}, \ldots, y_{t,n})$ and $X_t = (x_{t,1}, x_{t,2}, \ldots, x_{t,n})$ for a $1/n$ code, with elements $x_{t,i}$ and $y_{t,i}$, $i = 1, 2, \ldots, n$. In this thesis, we use AWGN with BPSK, therefore $x_{t,i} \in \{-1, 1\}$ and $y_{t,i} \in (-\infty, \infty)$. This FB recursion will produce state probabilities defined as $\lambda_t(\cdot)$ or state transition probabilities defined as $\sigma_t(\cdot)$.
To produce the state probabilities or the state transition probabilities, 3 different values are needed:
• The forward recursion probability function, $\alpha(\cdot)$.
• The backward recursion probability function, $\beta(\cdot)$.
• The state transition matrix, $\Gamma(m', m)$.
The forward recursion is defined as the joint probability th a t the state of the system is m  
at time t, S t =  to, and the received channel sequence from time 1 to t is T /,
a t (to) =  P r  {S) =  m; Y *} (3.9)
The backward recursion is defined as the probability tha t the received channel sequence 
from time t  +  1 to L  is Y ^ ,  given th a t the state m  a t time t is S t =  to. Then,
Pt (m)  =  P r  {Y tL+1 | St =  to} (3.10)
The state transition probability m atrix is defined as the conditional probability tha t the 
state to  at time t  is St =  m  and the received channel sequence a t time t is Yj, while being 
at the state m'  a t time t — 1, S t~i =  m! . Then,
T (m! , m) =  Pr  {5) =  m ;Yt | S t-i =  m/} (3-H)
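The next step involves estimating the APPs of the states and the state transitions for the received channel sequence. They are defined as follows,
$$\Pr\{S_t = m \mid Y_1^L\} = \frac{\Pr\{S_t = m; Y_1^L\}}{\Pr\{Y_1^L\}} \tag{3.12}$$
and
$$\Pr\{S_t = m; S_{t-1} = m' \mid Y_1^L\} = \frac{\Pr\{S_t = m; S_{t-1} = m'; Y_1^L\}}{\Pr\{Y_1^L\}} \tag{3.13}$$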
Since the received sequence probability, $\Pr\{Y_1^L\}$, is constant during the calculation, it is sufficient to calculate the joint probability of the states and the received channel sequence, denoted $\lambda_t(m)$, and the joint probability of the state transitions and the received channel sequence, denoted $\sigma_t(m', m)$. They are defined as follows,
$$\lambda_t(m) = \Pr\{S_t = m; Y_1^L\} \tag{3.14}$$
and
$$\sigma_t(m', m) = \Pr\{S_{t-1} = m'; S_t = m; Y_1^L\} \tag{3.15}$$
Mathematically, $\lambda_t(m)$ and $\sigma_t(m', m)$ can be expressed in terms of $\alpha(m)$, $\beta(m)$ and $\Gamma(m', m)$. By using Markov and DMC properties [9] for the derivations,
$$\lambda_t(m) = \Pr\{S_t = m; Y_1^t\} \cdot \Pr\{Y_{t+1}^L \mid S_t = m; Y_1^t\}$$
$$\lambda_t(m) = \alpha_t(m) \cdot \Pr\{Y_{t+1}^L \mid S_t = m\}$$
$$\lambda_t(m) = \alpha_t(m) \cdot \beta_t(m) \tag{3.16}$$
and
$$\sigma_t(m', m) = \Pr\{S_{t-1} = m'; Y_1^{t-1}\} \cdot \Pr\{S_t = m; Y_t \mid S_{t-1} = m'\} \cdot \Pr\{Y_{t+1}^L \mid S_t = m\}$$
$$\sigma_t(m', m) = \alpha_{t-1}(m') \cdot \Gamma_t(m', m) \cdot \beta_t(m) \tag{3.17}$$
When $\lambda_t(m)$ or $\sigma_t(m', m)$ are found, the APPs of the data symbols can be calculated.
Figure 3.9 shows the system diagram and the basic building blocks of the BCJR/MAP decoding system.
Figure 3.9: System Diagram Of The BCJR/MAP Decoding Algorithm
3.7.1 Forward Recursion
The forward recursion probability function, $\alpha(\cdot)$, is the sum of all probabilities of the state m for the received channel sequence from time 1 to t. It is described as follows,
$$\alpha_t(m) = \Pr\{S_t = m; Y_1^t\} = \sum_{m'} \Pr\{S_{t-1} = m'; S_t = m; Y_1^t\} \tag{3.18}$$
Rewriting equation 3.18 using the Markov chain and the DMC properties, the equation becomes,
$$\alpha_t(m) = \sum_{m'} \Pr\{S_{t-1} = m'; Y_1^{t-1}\} \cdot \Pr\{S_t = m; Y_t \mid S_{t-1} = m'\}$$
$$\alpha_t(m) = \sum_{m'} \alpha_{t-1}(m') \cdot \Gamma_t(m', m) \quad \text{for } t = 1, 2, \ldots, L \tag{3.19}$$
Now using vector notation, we can take $\boldsymbol{\alpha}_t$ as an N-tuple row vector with the m-th entry being the probability of being in the m-th state at time t, where N is the number of states. For a memory of M, N = 2^M. Therefore, equation 3.19 is a row vector multiplied with a matrix. Note that a bold character refers to a vector or a matrix. Therefore,
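$$\boldsymbol{\alpha}_t = \boldsymbol{\alpha}_{t-1} \cdot \boldsymbol{\Gamma}_t \tag{3.20}$$
$\alpha_t(m)$ requires normalization during each iteration to prevent underflow, since $\Pr\{Y_1^L\}$ is not used in the calculation. Then, the normalized forward recursion, $\bar{\alpha}_t(m)$, is,
$$\bar{\alpha}_t(m) = \frac{\alpha_t(m)}{\sum_{m'=0}^{N-1} \alpha_t(m')} \tag{3.21}$$
The forward recursion expressed with normalization is as follows,
$$\alpha_t(m) = \sum_{m'=0}^{N-1} \bar{\alpha}_{t-1}(m') \cdot \Gamma_t(m', m) \tag{3.22}$$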
$\boldsymbol{\alpha}_0$, the initial conditions, can be determined by analyzing the starting state of the encoder. When the encoder starts at state m, then $\boldsymbol{\alpha}_0 = (0, \ldots, 1, \ldots, 0)$ where the m-th entry is 1.
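The normalized forward recursion can be sketched in a few lines of Python. This is only an illustrative sketch, not the decoder implemented in this thesis: the function name, the list-of-lists layout of the transition matrices, and the toy two-state values are assumptions made for the example, and it assumes the per-step sum is never zero.

```python
def forward_recursion(gammas, start_state):
    """Normalized forward recursion over a block.

    gammas[t][mp][m] holds the transition value from state mp to state m
    for the (t+1)-th received symbol (hypothetical data layout).
    Returns L + 1 normalized alpha vectors; element 0 is alpha_0.
    """
    n_states = len(gammas[0])
    alpha = [0.0] * n_states
    alpha[start_state] = 1.0  # alpha_0 = (0, ..., 1, ..., 0)
    alphas = [alpha]
    for gamma in gammas:
        # Row vector times matrix: sum over the previous states mp.
        nxt = [sum(alpha[mp] * gamma[mp][m] for mp in range(n_states))
               for m in range(n_states)]
        total = sum(nxt)
        alpha = [a / total for a in nxt]  # normalize to prevent underflow
        alphas.append(alpha)
    return alphas


# Toy two-state example with made-up transition values.
toy_gammas = [
    [[0.6, 0.2], [0.1, 0.3]],
    [[0.05, 0.4], [0.3, 0.02]],
]
alphas = forward_recursion(toy_gammas, start_state=0)
print(alphas[1])
```

Because every vector is rescaled to sum to one, the values stay in a numerically safe range no matter how long the block is.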
3.7.2 Backward Recursion
Using the same procedure as in the forward recursion, the backward probability function, $\beta_t(m)$, can be obtained. Therefore,
$$\beta_t(m) = \Pr\{Y_{t+1}^L \mid S_t = m\} = \sum_{m'=0}^{N-1} \Pr\{S_{t+1} = m'; Y_{t+1}^L \mid S_t = m\} \tag{3.23}$$
Rewriting equation 3.23 using the Markov chain and the DMC properties, the equation becomes,
$$\beta_t(m) = \sum_{m'=0}^{N-1} \Pr\{Y_{t+2}^L \mid S_{t+1} = m'\} \cdot \Pr\{S_{t+1} = m'; Y_{t+1} \mid S_t = m\}$$
$$\beta_t(m) = \sum_{m'=0}^{N-1} \beta_{t+1}(m') \cdot \Gamma_{t+1}(m, m') \quad \text{for } t = 1, 2, \ldots, L-1 \tag{3.24}$$
Using vector notation, equation 3.24 becomes,
$$\boldsymbol{\beta}_{t-1} = \boldsymbol{\Gamma}_t \cdot \boldsymbol{\beta}_t \tag{3.25}$$
For the reasons given in the previous subsection, the backward recursion needs normalization, which is as follows,
$$\bar{\beta}_t(m) = \frac{\beta_t(m)}{\sum_{m'=0}^{N-1} \beta_t(m')} \tag{3.26}$$
The backward recursion in terms of the normalized values is then,
$$\beta_t(m) = \sum_{m'=0}^{N-1} \bar{\beta}_{t+1}(m') \cdot \Gamma_{t+1}(m, m') \tag{3.27}$$
If the final state is known to be m, or the encoder is 'terminated', then $\boldsymbol{\beta}_L$, the initial conditions, are $(0, \ldots, 1, \ldots, 0)$ where 1 is the m-th entry.
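The backward recursion mirrors the forward one but runs from the end of the block towards the start. The sketch below is illustrative only: the function name, the data layout and the toy two-state values are assumptions for the example, the final state is assumed to be known, and the per-step sum is assumed to be non-zero.

```python
def backward_recursion(gammas, final_state):
    """Normalized backward recursion over a block.

    gammas[t][m][mp] holds the transition value from state m to state mp
    for the (t+1)-th received symbol (hypothetical data layout).
    Returns L + 1 normalized beta vectors; the last element is beta_L.
    """
    n_states = len(gammas[0])
    beta = [0.0] * n_states
    beta[final_state] = 1.0  # beta_L = (0, ..., 1, ..., 0)
    betas = [beta]
    for gamma in reversed(gammas):
        # Matrix times column vector: sum over the next states mp.
        prev = [sum(gamma[m][mp] * beta[mp] for mp in range(n_states))
                for m in range(n_states)]
        total = sum(prev)
        beta = [b / total for b in prev]  # normalize to prevent underflow
        betas.append(beta)
    betas.reverse()  # betas[t] now corresponds to time t
    return betas


# Toy two-state example with made-up transition values.
toy_gammas = [
    [[0.6, 0.2], [0.1, 0.3]],
    [[0.05, 0.4], [0.3, 0.02]],
]
betas = backward_recursion(toy_gammas, final_state=1)
print(betas[1])
```

Note that the whole block must be received before this loop can start, which is why the backward values are usually computed and stored before the APPs are formed.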
3.7.3 State Transition Matrix
An N x N matrix, where N is the number of states in the code, is known as the transition matrix, $\Gamma(m', m)$, and is defined as follows,
$$\Gamma_t(m', m) = \Pr\{S_t = m; Y_t \mid S_{t-1} = m'\} \tag{3.28}$$
Applying Bayes' rule we can show the following,
$$\Gamma_t(m', m) = \sum_{u} \Pr\{U_t \mid S_t = m; S_{t-1} = m'\} \cdot \Pr\{S_t = m \mid S_{t-1} = m'\} \cdot \Pr\{Y_t \mid X_t\} \tag{3.29}$$
To describe equation 3.29, its individual pieces can be broken up and analyzed:
• $\Pr\{U_t \mid S_t = m; S_{t-1} = m'\}$: This value is either 0 or 1. A 1/n rate convolutional code contains two transitions from each state for a binary input $U_t$, e.g., transition 1 and transition 0.
• $\Pr\{S_t = m \mid S_{t-1} = m'\}$: These are the APPs of the symbols at time t. The encoder input symbol probabilities are the same as the state transition probabilities. With turbo coding, i.e., iterative decoding, the soft symbol information given by another decoder will be used in place of this term for the later iterations. The first iteration uses 0.5 for all symbols, given that the decoder has no a priori probabilities of the symbols. The BCJR/MAP decoder that has been designed is a standalone decoder. Therefore, 0.5 should be used for all symbols. However, it has been concluded through experimentation that we can neglect these APPs without any loss in performance, see Section 5.3.1.
• $\Pr\{Y_t \mid X_t\}$: This value depends on the channel property. Since the AWGN channel with BPSK will be used, for a rate 1/n code, the AWGN channel with a noise power spectral density value of $N_0/2$ gives,
$$\Pr\{Y_t \mid X_t\} = \left(\frac{1}{\sqrt{\pi N_0}}\right)^{n} \exp\left(-\frac{1}{N_0} \sum_{i=1}^{n} (y_{t,i} - x_{t,i})^2\right) \tag{3.30}$$
where $x_{t,i} \in \{-1, 1\}$. Recall that in BPSK, bit level 0 is mapped to -1 and 1 is mapped to 1.
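The channel term can be illustrated with a short Python sketch. It assumes that each received sample carries independent Gaussian noise of variance N0/2, so the likelihood of a received word is a product of Gaussian densities. The function names and the sample values are hypothetical and are not taken from the thesis code.

```python
import math


def bpsk(bits):
    """Map bit level 0 to -1 and bit level 1 to +1."""
    return [2 * b - 1 for b in bits]


def channel_likelihood(y, x, n0):
    """Likelihood of receiving y when x was sent over AWGN.

    Assumes independent Gaussian noise with variance n0 / 2 per sample.
    """
    n = len(y)
    dist = sum((yi - xi) ** 2 for yi, xi in zip(y, x))
    return (1.0 / math.sqrt(math.pi * n0)) ** n * math.exp(-dist / n0)


# Rate 1/2 example: one received pair compared against two branch words.
received = [0.9, -1.2]
likely = channel_likelihood(received, bpsk([1, 0]), n0=1.0)
unlikely = channel_likelihood(received, bpsk([0, 1]), n0=1.0)
print(likely > unlikely)
```

The branch word closest to the received pair in Euclidean distance always gets the largest likelihood, which is what later allows the decoder to work with distances only.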
3.7.4 APPs Of The Symbols
Once $\Gamma_t(m', m)$, $\alpha_t(m)$ and $\beta_t(m)$ are calculated, it is relatively simple to calculate $\lambda_t(m)$, defined as the joint probability of the states and the received channel sequence, and $\sigma_t(m', m)$. Either one of the following equations is used to find the APPs of the symbols,
$$\lambda_t(m) = \alpha_t(m) \cdot \beta_t(m) \tag{3.31}$$
$$\sigma_t(m', m) = \alpha_{t-1}(m') \cdot \Gamma_t(m', m) \cdot \beta_t(m) \tag{3.32}$$
Figure 3.10: Transitions For NRC codes [9] (panels: 0-transition states, 1-transition states)
$\lambda_t(m)$ is typically used for NRC codes because each state can be 'reached' by either a 1 or a 0 transition, but not both. $\sigma_t(m', m)$ is typically used for RSC codes because each state can be 'reached' by both 1 and 0 transitions, see Figure 3.10.
The following equations show the remainder of the calculations needed,
$$\Pr{}_{APP}[u_t = 1] = \frac{\sum_{m \in A} \lambda_t(m)}{\sum_{m \in \text{all states}} \lambda_t(m)} \tag{3.33}$$
where A is all '1' transition states and
$$\Pr{}_{APP}[u_t = 1] = \frac{\sum_{(m', m) \in B} \sigma_t(m', m)}{\sum_{(m', m) \in \text{all branches}} \sigma_t(m', m)} \tag{3.34}$$
where B is all '1' transition branches. The APP of symbol '-1' at time t for a given received channel sequence is
$$\Pr{}_{APP}[u_t = -1] = 1 - \Pr{}_{APP}[u_t = 1] \tag{3.35}$$
By choosing the APP with the largest value, the decoder can give hard outputs which are estimations of the input symbols, denoted $\hat{u}_t \in \{-1, 1\}$. Therefore, the decision rule is,
$$\hat{u}_t = \begin{cases} 1 & \text{if } \Pr{}_{APP}[u_t = 1] \geq \Pr{}_{APP}[u_t = -1] \\ -1 & \text{if } \Pr{}_{APP}[u_t = 1] < \Pr{}_{APP}[u_t = -1] \end{cases} \tag{3.36}$$
Equations 3.33 - 3.36 are based on the AWGN channel with BPSK.
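As a small illustration of the state-based APP calculation and the hard decision, consider the Python sketch below. The joint state values and the choice of which states count as '1' transition states are made up for the example, and resolving a tie in favour of 1 is an assumption of this sketch.

```python
def app_from_states(lam, one_states):
    """Return (APP of symbol 1, APP of symbol -1) from joint state values.

    lam[m] is the joint value for state m at one time step, and one_states
    is the set of states reached by a '1' transition (hypothetical labels).
    """
    total = sum(lam)
    p_one = sum(lam[m] for m in one_states) / total
    return p_one, 1.0 - p_one


def hard_decision(p_one, p_minus_one):
    """Pick the symbol with the larger APP (ties go to 1, an assumption)."""
    return 1 if p_one >= p_minus_one else -1


# Four-state toy example with made-up joint values.
lam = [0.02, 0.10, 0.01, 0.07]
p_one, p_minus_one = app_from_states(lam, one_states={1, 3})
decision = hard_decision(p_one, p_minus_one)
print(decision)
```

Dividing by the sum over all states plays the role of the constant received sequence probability, so the joint values never need to be scaled by it explicitly.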
Chapter 4
Viterbi Decoder Using Asynchronous Techniques
4.1 Introduction
The asynchronous paradigm can conceivably provide viable solutions to the problems caused by synchronous design. Applying asynchronous techniques, as seen in [8, 12], offers the potential for lower power systems. The natural extension of the Amulet Group's (University of Manchester) paper is to apply asynchronous techniques to the BCJR/MAP algorithm. This chapter briefly introduces the ideas behind the asynchronous Viterbi decoder from [8, 12], so as to better describe an asynchronous BCJR/MAP decoder.
4.2 Basic Building Block Concepts
The Amulet Group divided the design into three main sub-systems, shown in Figure 4.1:
1. Branch Metric Unit (BMU)
2. Path Metric Unit (PMU)
3. History Unit (HU)
Figure 4.1: Viterbi Decoder Using Asynchronous Techniques - System Level [12, 8] (blocks: Branch Metric Unit, Path Metric Unit, History Unit; data signals: input from receiver, branch metrics, local winners, global winners, output; control signals: Req, Ack)
The HU is sometimes called the Survival Metric Unit (SMU) in synchronous systems.
In Figure 4.1, the thick lines are the data signals and the thin lines are the handshaking control signals. The system's first request line and last acknowledge line are connected to a clock. This clock is used solely to synchronize with an external synchronous system's clock.
4.3 System Overview
The BMU takes the inputs from the receiver and calculates the distance between the received bits and the four ideal symbols (ideal branch words) that could have been transmitted, i.e., the branch metric calculations. These weights are the inputs to the PMU.
Holding the node weights, the PMU calculates the node plus branch weights, selecting the lower overall metric as the next metric for a node in a specific time period. The calculated node weights are then fed back into the PMU for the next calculation. The PMU also holds a bit of data recording whether the winner was on the upper or lower branch entering the node.
For each time slot, the local winner information is input to the HU. The global winner, the state with the lowest node weight, is also passed on to the HU. This is done so that there is a known starting point when tracing back through the trellis history to find the most likely decoded sequence.
The HU, using the global and local winner information, reconstructs the trellis. It also traces back to find the path (most likely decoded sequence) from the global winner node to the local winner node in the oldest timeslot.
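The add-compare-select step performed by the PMU, and the choice of the global winner, can be sketched as follows. This is a software illustration only and does not reflect the Amulet Group's circuit: the function names, the tuple layout, the tie-break towards the upper branch and the numeric weights are all assumptions.

```python
def add_compare_select(upper, lower):
    """One add-compare-select step for a single node.

    upper and lower are (node_weight, branch_weight) pairs for the two
    branches entering the node. Returns (new_node_weight, decision_bit),
    where the decision bit is 0 for the upper branch and 1 for the lower.
    Ties are resolved towards the upper branch (an assumption).
    """
    upper_total = upper[0] + upper[1]
    lower_total = lower[0] + lower[1]
    if upper_total <= lower_total:
        return upper_total, 0
    return lower_total, 1


def global_winner(node_weights):
    """Index of the state with the lowest node weight."""
    return min(range(len(node_weights)), key=node_weights.__getitem__)


# Made-up weights for a four-state example.
new_weight, local_winner_bit = add_compare_select((3, 2), (1, 5))
winner = global_winner([7, 4, 9, 5])
print(new_weight, local_winner_bit, winner)
```

Only the single decision bit per node and the index of the global winner need to be passed on for trace-back, which keeps the interface to the history storage small.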
4.4 System Parameters
The system described used 4PBDP. The received bits to the decoder were soft coded; three bits were used to represent values. The system constraint length was 7. That means that there were 64 nodes/states and 128 paths or branches.
The system receives and decodes at a rate of 1/2. However, other data rates are obtained by omitting some of the encoded symbols sent. Therefore, higher code rates, such as 3/4, can be obtained. As the code rate increases, less redundancy is included in the transmitted data signal, which results in an increase in the error rate from the decoder. The 1/2 rate encoder has the most error-free output.
The system operates on a 90 MHz clock. This clock is used by all code rates, and the code rate defines two important ideas. First, the code rate defines the ratio of the number of input bits to the number of transmitted encoded characters, which are the inputs to the decoder. Second, it specifies the ratio of the number of clock cycles containing encoded information to the total number of clock cycles. As an example, a 3/4 rate means that three input bits result in four output transmitted characters and that three clock cycles out of four contain encoded information. Therefore, in a 3/4 rate code, every fourth clock cycle contains no encoded information (Block-Valid Low). So a 90 MHz clock rate using a 1/2 rate encoder produces valid encoded information on every other clock cycle and yields a data rate of
1/2 x 90 = 45 Msymbols/sec
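The data rate arithmetic can be checked with a few lines of Python. The function name is hypothetical, and applying the same rule to a 3/4 rate code is an extrapolation from the 1/2 rate example rather than a figure quoted from [8, 12].

```python
from fractions import Fraction


def data_rate_msymbols_per_sec(clock_mhz, code_rate):
    """Millions of clock cycles per second that carry encoded information."""
    return clock_mhz * code_rate


half_rate = data_rate_msymbols_per_sec(90, Fraction(1, 2))
three_quarter_rate = data_rate_msymbols_per_sec(90, Fraction(3, 4))
print(half_rate, float(three_quarter_rate))
```

Using exact fractions avoids rounding in the intermediate product, which matters little here but keeps the check unambiguous.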
4.5 Conclusions
The Viterbi decoder described in [3, 8] was compared in [12]. In many asynchronous designs, the synchronous counterpart cannot be directly translated into the asynchronous conceptual framework. This resulted in a design that was conceived from first principles and caused the designers to completely rethink how to implement the decoder.
The designers compared their design to the standard Viterbi decoder and the other decoders in the Power Reduction for System Technologies (PREST) project. The designers claim that the "power consumption was approximately an order of magnitude less than that of the other novel designs, and twice that again less than the reference design (when decoding an uncorrupted bit stream)".
The authors only mentioned that, since the design is asynchronous, its low Electro Magnetic (EM) emissions would be desirable for a communication-related circuit.
The most promising aspect observed was how the HU exploits the most significant asset of asynchronous design: doing nothing when nothing useful needs to be done. This was visibly shown in the power results. Although the benefit appears when good channel conditions exist, it in no way impeded the system when such conditions did not exist. This aspect is the most difficult to achieve in a synchronous design.
The architecture used both high-speed serial operation (in the PMU) and lower-speed parallel operation (in the HU). Each was used where it was seen to be most appropriate. This variety of architectures would have been difficult to achieve in a synchronous design due to the synchronization of different clocks.
Chapter 5
Designing The BCJR/MAP Decoder
5.1 Introduction
This chapter introduces Balsa, a Hardware Description Language (HDL) for asynchronous circuits and systems, and a quick way to assess the feasibility of an asynchronous design over a synchronous design.
The design constraints, along with the cost-saving design features, will also be discussed.
5.2 Design Flow
5.2.1 High Level Languages And Tools
Asynchronous circuits are highly concurrent and the communication between modules is based on handshake channels. Therefore, a hardware language for the description of asynchronous circuits needs to incorporate these two characteristics.
The Communicating Sequential Processes (CSP) language [3] meets these constraints and contains the following characteristics:
1. Concurrent processes
2. Sequential and concurrent composition of statements within a process
3. Synchronous message passing over point-to-point channels (supported by primitives such as 'send', 'receive' and possibly 'probe')
CSP languages designed specifically for designing asynchronous circuits are:
1. Tangram, developed by Philips and not available in the public domain
2. Balsa, developed by the University of Manchester and available in the public domain
5.2.2 CSP Vs. HDL
The CSP language [3] is not a standardized language, unlike other Hardware Description Languages (HDL) such as Very High Speed Integrated Circuit Hardware Description Language (VHDL) or Verilog.
For instance, VHDL can be used to design asynchronous circuits. However, there are some disadvantages, such as:
1. VHDL lacks primitives to implement the synchronous message passing needed by asynchronous circuits. It is possible to write low-level code that implements the handshaking. However, it is undesirable to mix low-level details into code whose purpose is to capture the high-level behaviour of the circuit.
2. VHDL lacks statement-level concurrency within a process.
VHDL does have advantages, such as:
1. VHDL is well supported by existing Computer Aided Design (CAD) tool frameworks that provide simulators, pre-designed modules, mixed-mode simulation and tools for synthesis, layout and the back annotation of timing information.
2. The same simulator and test bench can be used throughout the entire design process, from the first high-level specification to the final implementation in some target technology, e.g., standard cell layout.
3. It is possible to perform mixed-mode simulation where some entities are modelled using behavioural specifications and others are implemented using the components of the target technology.
4. Many real world systems include both synchronous and asynchronous subsystems, and such hybrid systems can be modelled without any problems in VHDL.
Although VHDL does have some excellent advantages, in terms of the study at hand a CSP language offers immediate results. Balsa, a framework for synthesizing asynchronous hardware systems and a language for describing such systems, will be the tool of choice for this study.
5.2.3 Balsa: Asynchronous Hardware Language And Synthesis Tool
Asynchronous circuits, at a hardware level, are complex systems that have not been addressed by industry standard Electronic Design Automation (EDA) tools and companies. EDA tools are lacking, and this is a sentiment expressed by Ivan Sutherland, vice president, fellow and asynchronous expert at Sun Microsystems Laboratories [18].
VHDL and Verilog, industry standard hardware description languages, lack the needed "concurrency and platform for handshaking channels" [3] that are required by asynchronous systems. The problem is that the majority of tools are manufactured for the clocked design paradigm.
A recent free tool from the University of Manchester, one of the academic leaders in asynchronous design, is Balsa. Balsa is both a "framework for synthesis of asynchronous hardware systems and a language for describing such systems" [3]. Balsa adopts the approach of proprietary languages like Tangram, "syntax-directed compilation into communication handshaking components" [3]. This means that there is a direct 'one-to-one' mapping between the language constructs and the handshaking circuits produced.
Figure 5.1 details the design flow for a digital asynchronous circuit using Balsa version 3.4.1.
The Balsa circuit description is compiled using the balsa-c program to an intermediate file type, called the breeze description. The breeze description is a network arrangement of a finite set of Handshake Components (HC) (45). The HC are connected across channels where the handshake signaling occurs. Figure 5.2 demonstrates a gate-level modulo-10 counter with the underlying handshake circuitry drawn in light grey.
Figure 5.1: Balsa Asynchronous Design Flow [19]
Figure 5.2: Gate-Level and Handshake Component Level Of A Modulo-10 Counter [19]
The majority of the tools in the Balsa package deal with the manipulation of the breeze files. The breeze files can be used by 'back-end' tools for different protocol implementations. However, the breeze files also contain procedures and type definitions passed on from Balsa source files, allowing breeze to be used as the package description format for Balsa. breeze-sim provides the behavioural simulation package for source level debugging and visualization of the channel activity at the handshake circuit level, and it provides conventional waveform traces that can be viewed using GTKWave, a waveform viewer. The eventual target CAD package can be used to perform more accurate simulations and the eventual physical layout.
After the gate-level netlist is obtained, commercial Silicon (Si) or Field Programmable Gate Array (FPGA) packages can be used to continue the design flow. Since the Research Center for Integrated Microsystems (RCIM) at the University of Windsor does not have the rights for the supported technology mappings (from Balsa to Verilog gate-level netlists) that the Amulet Advanced Processor Technologies (APT) Group from the University of Manchester provides, a mapping has to be created for our supported technology. The most recent fabrication technology available at the RCIM Lab is the Virtual Silicon Standard Cell Library for the Taiwan Semiconductor Manufacturing Company (TSMC) 0.18-micron, single poly, six metal, salicide Complementary Metal Oxide Semiconductor (CMOS) process. See Appendix D and [19, 20] for information regarding gate-level mappings.
Figure 5.3: The Feasibility Asynchronous Design Flow (For The Area And Power Metrics)
5.2.4 Feasibility Design Flow
The feasibility design flow is the comparison of different metrics at the asynchronous gate level (see Figure 5.3) to those at the synchronous gate level. The choice of metrics depends on the designer, but area and power are sufficient for this thesis. In other words, this study will consider everything up to the gate-level synthesis section of the design flow.
If the asynchronous metrics (here, area and power dissipation) are less than those of the synchronous design, then the design is feasible. If the metrics are equal, then there is no advantage (area and power wise) in using asynchronous over synchronous. However, one could hypothesize that the noise spectral values may be more uniform compared to a synchronous implementation. This means that the metrics should be chosen wisely, depending on the application and design constraints.
Feasibility may mean continuing with the asynchronous design, or that an asynchronous design may offer benefits greater than those of a synchronous design. This rule can be expanded to other metrics at different levels of abstraction, e.g., the layout level will provide the most accurate results, but it takes longer to reach that level of design.
5.3 The Asynchronous BCJR/MAP Decoder
The asynchronous BCJR/MAP decoder consists of 4 main subsystems:
• The Gamma Block, i.e., The Transition Matrix Block
• The Alpha Block, i.e., The Forward Recursion Block
• The Beta Block, i.e., The Backward Recursion Block
• The LLR Block, i.e., The APP Calculator and Decision Block
The names of these subsystems closely follow the names of the Balsa procedures of the design. See Appendix C for the complete source files of the BCJR/MAP Balsa design.
5.3.1 System Constraints
There are some design constraints to be noted in order to provide an accurate description of the design constructed. Table 5.1 lists these constraints.
The channel model used was an AWGN channel. It is a benchmark for communication channels. The channel was used because decoders that perform well in AWGN channels tend to perform well in other channels. Recalling equation 3.29,
$$\Gamma_t(m', m) = \sum_{u} \Pr\{U_t \mid S_t = m; S_{t-1} = m'\} \cdot \Pr\{S_t = m \mid S_{t-1} = m'\} \cdot \Pr\{Y_t \mid X_t\}$$
The input channel variance can be neglected without any performance loss, see Section 5.3.3.
Table 5.1: Asynchronous BCJR/MAP System Constraints
Constraint               Value
Encoder Type             NRC Encoder
Decoder Type             Max-Log MAP
Channel Model            AWGN
Modulation Used          BPSK
Code Rate                1/2
Constraint Length        3-bit
Block Length             20 bits
Decoder Input            2 6-bit inputs (SXX.XXX)
Decoder Output           Variable [Hard (1-bit) or Soft (6-bit)]
Input APP                Neglected
Input Channel Variance   Neglected
One constraint in Table 5.1 is that the input to the decoder did not use APPs. This was done for several reasons. It decreases the complexity of the decoder, since no extra additions (in the logarithmic domain) are needed. Also, since a standalone decoder was being used, no other decoding source would be feeding this thesis' decoder with APPs. In practice, a standalone decoder would probably not receive APPs. If this decoder were to be used in a turbo system, this constraint would have to be altered because a turbo system requires iterative decoding between two decoders.
The block length input to the system is 20 bits long. Compared to most systems, this is a small number. However, the number of bits was chosen to simplify the design. Since the feasibility of the design is of more importance, 20 bits was a sufficient block length.
The decoder outputs were hard outputs. This was done to check our system quickly against the MatLab simulation architecture, see the description of the simulation architecture in Section 6.2. Matrix Laboratory Software (MatLab) is "a high-level technical computing language and interactive development environment" [26].
5.3.2 The Log-MAP Algorithm And Max-Log-MAP Algorithm
Chapter 3 shows that the BCJR/MAP algorithm is computationally heavy. In terms of VLSI implementation, the number of multiplications and exponentials needed to carry out the computations of the algorithm is quite large. By using the logarithmic domain, multiplications and exponentials can be replaced by small Look Up Tables (LUTs), 'Maximum' operators and additions. These three replacements are more cost effective than large and bulky multiplications and exponentials.
The Log-MAP and Max-Log-MAP algorithms are basically the BCJR/MAP described in Section 3.7, except that the Jacobian Logarithm [21, 23] (equation 5.1) is used to remove the implementation complexities of the BCJR/MAP. The Jacobian Logarithm is as follows,
$$MAX^*(x, y) = \ln(e^x + e^y) = MAX(x, y) + \ln\left(1 + e^{-|x - y|}\right) \tag{5.1}$$
Essentially, MAX* is a MAX operator with a correction factor. This correction factor can simply be stored in a LUT with negligible effect on performance [22, 23].
The basic difference between the Max-Log-MAP and the Log-MAP is that the Max-Log-MAP disregards the correction factor in its computation. This leads to a smaller area and a faster decoder, at the expense of performance [25]. This study uses the Max-Log-MAP algorithm due to its simplicity and its better area and speed results compared to the Log-MAP.
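The relationship between the two operators can be illustrated with a short Python sketch. It computes the correction factor directly with the math library instead of reading it from a LUT, which is a simplification for the example. The function names are hypothetical.

```python
import math


def max_star(x, y):
    """Jacobian logarithm: ln(e^x + e^y) as a max plus a correction term."""
    return max(x, y) + math.log1p(math.exp(-abs(x - y)))


def max_log(x, y):
    """Max-Log-MAP approximation: the correction term is dropped."""
    return max(x, y)


exact = math.log(math.exp(1.0) + math.exp(2.0))
with_correction = max_star(1.0, 2.0)
without_correction = max_log(1.0, 2.0)
print(with_correction - without_correction)
```

The correction term is largest when the two inputs are equal, where it is ln 2, and it shrinks quickly as the inputs move apart. This is why a small LUT indexed by the absolute difference is sufficient for the Log-MAP, and why dropping the term costs only a modest amount of performance.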
5.3.3 Gamma Architecture
The gamma architecture that was implemented is shown in Figure 5.4. The data types used in the gamma system are shown in Figure 5.5. The sub-system architectures for the gamma sub blocks are shown in Figure 5.6 and Figure 5.7. For full coding details see Appendix C.
Recall that equation 3.29 was broken up into three parts:
1. $\Pr\{U_t \mid S_t = m; S_{t-1} = m'\}$
2. $\Pr\{S_t = m \mid S_{t-1} = m'\}$
3. $\Pr\{Y_t \mid X_t\}$
'Part 1' determines the order of the values in the transition matrix. 'Part 2' was neglected for the reasons discussed in Section 3.7.3 and Section 5.3.1. 'Part 3' depends on the input to the system and the channel model. As discussed in Section 5.3.1, the channel variance can be neglected because it is not varying and does not depend on the input.
Therefore, 'part 3' can be written as follows,
$$\Pr\{Y_t \mid X_t\} = \left(\frac{1}{\sqrt{\pi}}\right)^{n} \exp\left(-\sum_{i=1}^{n} (y_{t,i} - x_{t,i})^2\right) \tag{5.2}$$
In the logarithmic domain, the equation becomes,
$$\ln\left(\Pr\{Y_t \mid X_t\}\right) = \ln\left(\left(\frac{1}{\sqrt{\pi}}\right)^{n}\right) + \ln\left(\exp\left(-\sum_{i=1}^{n} (y_{t,i} - x_{t,i})^2\right)\right) \tag{5.3}$$
The first term in the previous equation can be neglected because the Add-Compare-Select property of the Alpha and Beta blocks makes this value ineffective. For example,
if a = constant + 2 and b = constant + 1, then max(a, b) = a
The constant is irrelevant to the max operator. Therefore, the equation becomes,
$$\ln\left(\Pr\{Y_t \mid X_t\}\right) = -\sum_{i=1}^{n} (y_{t,i} - x_{t,i})^2 \tag{5.4}$$
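The claim that a common constant does not influence the max operator can be checked with a short Python sketch. The received values, the candidate branch words and the size of the constant are made up for the example, and the function name is hypothetical.

```python
def log_branch_metric(y, x):
    """Log-domain channel metric: negative squared Euclidean distance."""
    return -sum((yi - xi) ** 2 for yi, xi in zip(y, x))


received = [0.8, -0.3]
branch_words = [(-1, -1), (-1, 1), (1, -1), (1, 1)]

# Winner without any constant term.
best = max(branch_words, key=lambda x: log_branch_metric(received, x))

# Winner when the same arbitrary constant is added to every metric.
constant = -1.1447
best_with_constant = max(
    branch_words, key=lambda x: constant + log_branch_metric(received, x)
)
print(best, best_with_constant)
```

Because the same offset is added to every candidate, the ordering of the metrics is unchanged, so a max-based comparison selects the same branch with or without the constant.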
This was also confirmed in MatLab simulations. A previously designed BCJR/MAP decoder (see Appendix B for all MatLab code) had its constant removed to see how this would affect performance in terms of Bit Error Rate (BER) after decoding versus Signal to Noise Ratio (SNR) of the communication channel. BER is (# of bits with errors after being decoded / # of bits transmitted) and the units of SNR are (dB).
Figure 5.8 shows that removing the constant does not affect performance. The curve labelled 'no sigma' is a BCJR/MAP decoder that has had its constant removed. It achieves the same performance as the 'unquantized' system. Figure 5.8 also shows a small variation of the 'no sigma' values compared to the 'unquantized' values, i.e., they are not exactly the same. This discrepancy is negligible because the input values are randomly generated. Repeating the same simulation may produce very similar results with new variations in the values.
5.3.4 Alpha Architecture
The alpha architecture that was implemented is shown in Figure 5.9. The data types used in the alpha system are shown in Figure 5.10. The sub-system architectures for the alpha sub blocks are shown in Figure 5.11 and Figure 5.12. For full coding details see Appendix C.
Regarding Figure 5.9, there is local logic that controls the sequence of iterative events, i.e., logic that initially feeds in locally stored variables (top left of schematic) and then feeds the iterative output of the system back into the input (counter). X = x + 1, where x is initially 0 and increments to 19.
Recall equation 3.22, excluding the normalization correction. The normalization correction will be discussed in Section 5.3.7.
$$\alpha_t(m) = \sum_{m'=0}^{N-1} \alpha_{t-1}(m') \cdot \Gamma_t(m', m)$$
In the logarithmic domain the equation becomes,
$$\ln \alpha_t(m) = \ln\left(\sum_{m'=0}^{N-1} \alpha_{t-1}(m') \cdot \Gamma_t(m', m)\right)$$
Figure 5.4: Gamma Architecture
Figure 5.5: Gamma Architecture - Data Type Structures
Figure 5.7: Gamma Sub-System Architecture - Buffer
Figure 5.8: 1000 Blocks Transmitted Per dB Level - BER Vs. SNR (curves: uncoded, no sigma, log-map Asynch, unquantized, lower bound, log-map)
Figure 5.9: Alpha Architecture
Figure 5.10: Alpha Architecture - Data Type Structures
Figure 5.11: Alpha Sub-System Architecture - Adder
Figure 5.12: Alpha Sub-System Architecture - Subtractor
$$\ln \alpha_t(m) = \ln\left( \sum_{m'=0}^{N-1} e^{\ln\left(\alpha_{t-1}(m') \cdot \Gamma_t(m', m)\right)} \right)$$
$$\ln \alpha_t(m) = \ln\left( \sum_{m'=0}^{N-1} e^{\ln(\alpha_{t-1}(m')) + \ln(\Gamma_t(m', m))} \right) \tag{5.5}$$
Let,
$$\delta_1 = \ln \alpha_{t-1}(m'), \qquad \delta_2 = \ln \Gamma_t(m', m) \tag{5.6}$$
and by using equation 5.1, the previous equation 5.5 becomes,
$$\ln \alpha_t(m) = \mathrm{MAX}^*_{m'}(\delta_1, \delta_2) \tag{5.7}$$
Rewriting, this equation becomes,
$$\ln \alpha_t(m) = \mathrm{MAX}_{m'}(\delta_1, \delta_2) + \ln\left(1 + e^{-|\delta_1 - \delta_2|}\right) \tag{5.8}$$
Since the Max-Log-MAP algorithm will be used for this design, the equation can be rewritten without the correction factor,
$$\ln \alpha_t(m) = \mathrm{MAX}_{m'}(\delta_1, \delta_2) \tag{5.9}$$
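The forward recursion above can be sketched in a few lines of Python. This is only an illustrative sketch, not the Balsa implementation from Appendix C: it assumes that equations 5.7 to 5.9 are read as combining, over all predecessor states m', the branch sums ln α_{t-1}(m') + ln Γ_t(m', m), and the names `max_star`, `alpha_step`, `ln_alpha_prev` and `ln_gamma` are hypothetical.

```python
import math

NEG_INF = float("-inf")


def max_star(d1, d2):
    """Exact log-domain addition ln(e^d1 + e^d2), i.e. MAX plus the correction term."""
    if d1 == NEG_INF:
        return d2
    if d2 == NEG_INF:
        return d1
    return max(d1, d2) + math.log1p(math.exp(-abs(d1 - d2)))


def alpha_step(ln_alpha_prev, ln_gamma, use_correction=False):
    """One forward step.

    ln_alpha_prev[mp] : ln alpha_{t-1}(m') for each predecessor state m'
    ln_gamma[mp][m]   : ln Gamma_t(m', m) for the branch m' -> m
    With use_correction=False the correction term is dropped (Max-Log-MAP).
    """
    n_states = len(ln_alpha_prev)
    combine = max_star if use_correction else max
    ln_alpha = []
    for m in range(n_states):
        acc = NEG_INF
        for mp in range(n_states):
            acc = combine(acc, ln_alpha_prev[mp] + ln_gamma[mp][m])
        ln_alpha.append(acc)
    return ln_alpha


# Hypothetical 2-state example
prev = [-1.0, -2.0]
gamma = [[-0.5, -3.0], [-1.5, -0.25]]
print(alpha_step(prev, gamma))                       # Max-Log-MAP
print(alpha_step(prev, gamma, use_correction=True))  # Log-MAP
```

The Max-Log-MAP result is never larger than the Log-MAP result, since the dropped correction term ln(1 + e^{-|δ1 - δ2|}) is always positive.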
5.3.5 Beta Architecture
The beta architecture that was implemented is shown in Figure 5.13. The data types used in the beta system are shown in Figure 5.14. The sub-system architectures for the beta sub-blocks are the same as those for the alpha sub-blocks, see Figure 5.11 and Figure 5.12. For full coding details see Appendix C.
Figure 5.13: Beta Architecture
Figure 5.14: Alpha Architecture - Data Type Structures
Regarding Figure 5.13, there is local logic that controls the sequence of iterative events, i.e., logic that initially feeds in locally stored variables (top left of schematic) and then feeds the iterative output of the system back into the input (counter). X = x, where x is initially 19 and decrements to 0.
All sub-blocks in alpha function the same as the blocks in beta.
Recall equation 3.27, excluding the normalization correction. The normalization correction will be discussed in Section 5.3.7.
$$\beta_t(m) = \sum_{m'=0}^{N-1} \beta_{t+1}(m') \cdot \Gamma_{t+1}(m, m')$$
Using the same procedure as in Section 5.3.4, in the logarithmic domain the equation becomes,
$$\ln \beta_t(m) = \ln\left( \sum_{m'=0}^{N-1} e^{\ln(\beta_{t+1}(m')) + \ln(\Gamma_{t+1}(m, m'))} \right) \tag{5.10}$$
Let,
$$\delta_1 = \ln \beta_{t+1}(m'), \qquad \delta_2 = \ln \Gamma_{t+1}(m, m')$$
and by using equation 5.1, the previous equation 5.10 becomes,
$$\ln \beta_t(m) = \mathrm{MAX}^*_{m'}(\delta_1, \delta_2) \tag{5.11}$$
Rewriting, this equation becomes,
$$\ln \beta_t(m) = \mathrm{MAX}_{m'}(\delta_1, \delta_2) + \ln\left(1 + e^{-|\delta_1 - \delta_2|}\right) \tag{5.12}$$
Since the Max-Log-MAP algorithm will be used for this design, the equation can be rewritten without the correction factor,
$$\ln \beta_t(m) = \mathrm{MAX}_{m'}(\delta_1, \delta_2) \tag{5.13}$$
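As a rough illustration of the backward recursion, the sketch below runs the Max-Log-MAP step from the last trellis stage towards the first, mirroring the counter that decrements from 19 to 0. It is a sketch under stated assumptions rather than the implemented datapath: equation 5.13 is read as the maximum over successor states m' of ln β_{t+1}(m') + ln Γ_{t+1}(m, m'), and the names `beta_step`, `beta_sweep` and the 2-state metrics are hypothetical.

```python
def beta_step(ln_beta_next, ln_gamma_next):
    """One backward Max-Log-MAP step.

    ln_beta_next[mp]     : ln beta_{t+1}(m') for each successor state m'
    ln_gamma_next[m][mp] : ln Gamma_{t+1}(m, m') for the branch m -> m'
    """
    n_states = len(ln_beta_next)
    return [
        max(ln_beta_next[mp] + ln_gamma_next[m][mp] for mp in range(n_states))
        for m in range(n_states)
    ]


def beta_sweep(ln_gammas, ln_beta_final):
    """Apply beta_step from the last branch-metric table down to the first.

    Returns the list of beta vectors ordered from the first stage to the last.
    """
    betas = [ln_beta_final]
    for ln_gamma_next in reversed(ln_gammas):
        betas.append(beta_step(betas[-1], ln_gamma_next))
    betas.reverse()
    return betas


# Hypothetical 2-state trellis with two stages and equal final metrics
g = [[-0.5, -3.0], [-1.5, -0.25]]
for stage in beta_sweep([g, g], [0.0, 0.0]):
    print(stage)
```

For a block of 20 symbols, as used in this design, `ln_gammas` would hold 20 branch-metric tables and the sweep would produce 21 beta vectors.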
5.3.6 LLR Architecture
The LLR architecture that was implemented is shown in Figure 5.15. The data types used in the LLR system are shown in Figure 5.16. The sub-system architectures for the LLR sub-blocks are shown in Figure 5.17 and Figure 5.18. For full coding details see Appendix C.
Since an NRC encoder is being used for this study, $\lambda_t(m)$ will be used to find the APPs of the symbols. Recall equation 3.31,
$$\lambda_t(m) = \alpha_t(m) \cdot \beta_t(m)$$
Using the same procedure as in Sections 5.3.4 and 5.3.5, in the logarithmic domain the equation becomes,
$$\ln \lambda_t(m) = \mathrm{MAX}(\delta_1, \delta_2) \tag{5.14}$$
where
$$\delta_1 = \ln \alpha_t(m), \qquad \delta_2 = \ln \beta_t(m)$$
Recalling equation 3.33, we can substitute in the previous equation,
$$\Pr\nolimits_{APP_{\log}}[u_t = 1] = \frac{\sum_{m \in A} \mathrm{MAX}(\delta_1, \delta_2)}{\sum_{m \in \text{All states}} \mathrm{MAX}(\delta_1, \delta_2)} \tag{5.15}$$
Recall also that,
$$\Pr\nolimits_{APP_{\log}}[u_t = -1] = 1 - \Pr\nolimits_{APP_{\log}}[u_t = 1] \tag{5.16}$$
Therefore, the decision rule for the hard output remains the same and is as follows,
$$u_t = \begin{cases} 1 & \text{if } \Pr\nolimits_{APP_{\log}}[u_t = 1] > \Pr\nolimits_{APP_{\log}}[u_t = -1] \\ -1 & \text{if } \Pr\nolimits_{APP_{\log}}[u_t = 1] < \Pr\nolimits_{APP_{\log}}[u_t = -1] \end{cases} \tag{5.17}$$
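One way to picture the hard decision of equation 5.17 in the log domain is sketched below. It departs from the notation of equation 5.14 and should be read as an assumption, not as the thesis's datapath: the log of the product in equation 3.31 is formed as ln α_t(m) + ln β_t(m), the sums over states are approximated by maxima (the Max-Log-MAP simplification), and the set A is taken to be the states associated with u_t = 1. The function name `hard_decision` and the 4-state metrics are hypothetical.

```python
def hard_decision(ln_alpha, ln_beta, states_for_one):
    """Return +1 or -1 for one trellis step.

    ln_alpha, ln_beta : per-state log metrics at time t
    states_for_one    : set A of state indices associated with u_t = +1
    Ties are resolved as -1 here; equation 5.17 leaves the tie case open.
    """
    # Assumed log of the state probability: ln(alpha * beta) = ln alpha + ln beta
    ln_lambda = [a + b for a, b in zip(ln_alpha, ln_beta)]
    best_one = max(ln_lambda[m] for m in states_for_one)
    best_minus = max(
        v for m, v in enumerate(ln_lambda) if m not in states_for_one
    )
    return 1 if best_one > best_minus else -1


# Hypothetical 4-state example
ln_alpha = [-1.0, -2.0, -0.5, -3.0]
ln_beta = [-0.5, -0.25, -2.0, -0.1]
print(hard_decision(ln_alpha, ln_beta, {0, 1}))
print(hard_decision(ln_alpha, ln_beta, {2, 3}))
```

Because only the comparison between the two hypotheses matters, the common denominator of equation 5.15 does not need to be evaluated for the hard output.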
Figure 5.15: LLR Architecture
Figure 5.16: LLR Architecture - Data Type Structures
Figure 5.18: Alpha Sub-System Architecture - Decision Block
5.3.7 Normalization And The Positive Domain
Normalization was discussed previously in Sections 3.7.1 and 3.7.2. One of the best techniques for normalization, for instance to prevent overflow, is as follows.
Let,
$$X = \{x_1, x_2, \ldots, x_{K-1}, x_K\} \tag{5.18}$$
be a set of numbers that need to be normalized.
Then the normalization of each term $x$ in $X$, denoted by $\bar{x}$, is as follows,
$$\bar{x} = \frac{x}{\sum_{k=1}^{K} x_k} \tag{5.20}$$
where k = {1, 2, ..., K-1, K} and K is the number of elements in the set X.
For VLSI implementations of normalization, a division, which is basically a multiplication, is large in terms of area. Different normalization techniques are needed. For numbers between 0, 1, 2, ..., ∞, the following could be used.
Let,
$$X = \{x_1, x_2, \ldots, x_{n-1}, x_n\} \tag{5.21}$$
be a set of numbers that need to be normalized.
Then the normalization of each term, denoted by $\bar{x}$, is as follows,
$$\bar{x}_1 = x_1 - \mathrm{MAX}(X) \tag{5.22}$$
$$\bar{x}_n = x_n - \mathrm{MAX}(X) \tag{5.23}$$
For a set of numbers in the negative domain, i.e., numbers between -∞, ..., -2, -1, 0, the same rule applies. For the normalization of the α and β values, equations 5.22 and 5.23 need to be used.
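The subtractive normalization of equations 5.22 and 5.23 can be illustrated with a short sketch. The function name `normalize_by_max` and the sample metrics are hypothetical; the point is only that subtracting the maximum needs a comparison and a subtraction rather than a division, and that it leaves the differences between metrics unchanged.

```python
def normalize_by_max(metrics):
    """Subtract the largest metric from every element of the set."""
    peak = max(metrics)
    return [x - peak for x in metrics]


# Hypothetical log-domain state metrics
metrics = [-7.0, -3.0, -12.0]
normalized = normalize_by_max(metrics)
print(normalized)

# The pairwise differences, which drive all later comparisons, are preserved
print(metrics[0] - metrics[2], normalized[0] - normalized[2])
```

After normalization the largest metric is always 0, so the metrics cannot drift without bound as the recursion proceeds.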
The values coming from the Gamma (Γ) system block are always negative. Therefore, to save area and improve speed, all subsequent sub-systems, i.e., alpha, beta and the LLR, can use the values converted into the positive domain. However, since the values were originally negative, the normalization must take this into account. Therefore, for normalization, the minimum must be taken instead of the maximum. This also applies to all maximum comparisons in the alpha and beta calculations. The decision rule for hard outputs, i.e., equation 5.17, also has to be altered to accommodate this conversion to positive values.
Regarding area and speed savings, the system no longer needs to employ two's complement arithmetic. The alpha, beta and LLR calculations do require additions, and using an extra sign bit and two's complement arithmetic throughout these calculations would add area, delay and complexity that can be avoided with the simple conversion to the positive domain.
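The equivalence between taking the maximum in the negative domain and the minimum in the positive domain can be checked with a small sketch. The helper names `to_positive` and `normalize_positive` are hypothetical, and floating-point values stand in for the fixed-width unsigned words of the actual design.

```python
def to_positive(metrics):
    """Store each non-positive log metric as its magnitude."""
    return [-x for x in metrics]


def normalize_positive(magnitudes):
    """Positive-domain normalization: subtract the minimum instead of the maximum."""
    floor = min(magnitudes)
    return [x - floor for x in magnitudes]


# Hypothetical negative-domain metrics
negative = [-7.0, -3.0, -12.0]
positive = to_positive(negative)

# The largest negative metric corresponds to the smallest magnitude
print(max(negative), -min(positive))

# All normalized magnitudes are non-negative, so no sign bit is required
print(normalize_positive(positive))
```

Every comparison that selected the maximum before the conversion selects the minimum afterwards, which is why the comparisons in alpha and beta and the decision rule have to be flipped together.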
Chapter 6
Simulation Architecture and Simulation  
Results
6.1 Software Tools - Verilog, Synopsys and MatLab
The simulation step in design is most crucial for determining functionality, performance 
and feasibility. Different software tools were used including MatLab, Balsa, Verilog and 
Synopsys.
MatLab is "a high-level technical computing language and interactive development environment" [26]. It was used for functionality and best case performance evaluation.
Balsa, the asynchronous HDL, was converted to gate-level Verilog code during one of the asynchronous design flow steps. Verilog emerged in 1983 at Gateway Design Automation. Verilog is a "general-purpose" HDL which uses syntax similar to C programming. It utilizes different levels of abstraction in the same model. A designer can code a circuit in terms of switches, gates, Register Transfer Level (RTL) or behavioural code. Only one language needs to be learned to create both a 'hierarchical' design and a stimulus harness [27].
Once the mapping from Balsa to Verilog occurred, the synthesis of Verilog code to a gate-level netlist was needed so that area, power and timing could be analyzed. Synopsys performs synthesis and generates area, power and timing reports. Synopsys is a full set of tools that not only performs HDL simulation but also offers full RTL coding solutions [28].
6.2 MatLab Simulation Architecture
The most important step in the design of the asynchronous BCJR/MAP decoder was functionality testing of the Balsa-designed decoder. A method of automation was needed and MatLab was a good fit due to its ability to execute UNIX commands.
Balsa has a built-in test harness Graphical User Interface (GUI) that allows the user to supply inputs to, and capture outputs from, the Balsa-designed system. The Balsa design suite can also be executed without the GUI. By combining the results of the Balsa BCJR/MAP decoder, which used the test harness program, with those of other systems implemented in MatLab, the design was thoroughly tested for functionality and best case performance. The performance was measured in BER against SNR.
Figure 6.1 demonstrates an early design where either the required performance was not met or the system still had incorrect code that needed to be altered. The axes of the figure are BER vs. SNR.
Figure 6.2 demonstrates the software simulation architecture for functionality and performance testing. Within MatLab, random numbers were generated and then encoded. These encoded values were then put through an AWGN channel using BPSK modulation. These values were then passed on to previously designed MatLab BCJR/MAP decoders. The same values were also written to file. Using UNIX commands within MatLab, the Balsa simulator read the values from file and produced an output that was also written to file. The values produced by Balsa and by the previously designed decoders were then used to calculate the BER values against the SNR values of the channel. The different decoder names are listed in the following figures as:
1. uncoded - This is the only system where the values are not encoded or decoded. This represents the worst possible scenario. Any type of encoding and decoding should improve the performance.
2. unquantized - This BCJR/MAP system does not use quantized values. This system represents the best possible encoding-decoding scheme realizable.
3. log-map - This BCJR/MAP system uses 6-bit quantized values. This system represents a more realistic decoder, the best possible realization of a hardware design.
4. no sigma - This is the same BCJR/MAP system as the 'log-map', however, the system has been simplified. It achieves the same performance as the 'log-map' system.
5. log-map Asynch - These are Balsa simulator test results of the design under test.
6. lower bound - This is a system that contains no errors after decoding. It is the best possible scenario but unrealistic.
Figure 6.1: 1000 Blocks Transmitted Per dB Level - BER Vs. SNR - Invalid Balsa Design
Figure 6.2: MatLab System Simulation Architecture
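To make the 'uncoded' baseline concrete, the following sketch simulates BPSK over an AWGN channel and counts bit errors. It is written in Python rather than MatLab and is not the thesis's simulation code. It assumes that the SNR axis is Eb/N0 in dB and that BPSK symbols have unit energy; the names `simulate_uncoded_ber` and `q_function` are hypothetical.

```python
import math
import random


def q_function(x):
    """Gaussian tail probability Q(x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def simulate_uncoded_ber(ebn0_db, n_bits, seed=0):
    """Estimate the BER of uncoded BPSK over AWGN at the given Eb/N0 (dB)."""
    rng = random.Random(seed)
    ebn0 = 10.0 ** (ebn0_db / 10.0)
    # Noise standard deviation for unit-energy symbols: sigma^2 = N0 / 2
    sigma = math.sqrt(1.0 / (2.0 * ebn0))
    errors = 0
    for _ in range(n_bits):
        bit = rng.getrandbits(1)
        symbol = 1.0 if bit else -1.0
        received = symbol + rng.gauss(0.0, sigma)
        decided = 1 if received > 0.0 else 0
        if decided != bit:
            errors += 1
    return errors / n_bits


for snr_db in (2.5, 3.5, 4.5):
    simulated = simulate_uncoded_ber(snr_db, 100000, seed=1)
    theory = q_function(math.sqrt(2.0 * 10.0 ** (snr_db / 10.0)))
    print(snr_db, simulated, theory)
```

The simulated values scatter around the closed-form curve Q(sqrt(2 Eb/N0)), which is the same kind of run-to-run variation that appears between otherwise equivalent decoders in the BER plots.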
The optimal performance was obtained through the design named trunc_5bit. The 5-bit truncation refers to the number of bits represented after the LUT in the Gamma block, see Appendix C. Figure 6.3 demonstrates that 'log-map Asynch' achieves approximately the same level of performance as the 'log-map' and 'no sigma' systems. The LUT in trunc_5bit is fed a 6-bit word, which is then converted to a 5-bit word.
The design named trunc_4bit is a decoder that uses a 6-bit to 4-bit LUT. At the expense of performance, the trunc_4bit design, see Figure 6.4, would most likely decrease the area and power metrics. This was roughly estimated with the Balsa tool called breeze-cost, which estimates the area cost of the circuit. These are rough estimates; a more thorough gate-level Synopsys area value would be more reliable. The values produced by breeze-cost are shown in Table 6.1. This estimate shows a possible reduction in area of 10.2% relative to trunc_5bit. This reduction is due to the 'new' data type truncation being carried through the remainder of the design.
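The quoted reduction can be reproduced from the breeze-cost values in Table 6.1. The short calculation below is only a worked check of that arithmetic; the dictionary and function names are hypothetical.

```python
# Normalized breeze-cost area estimates from Table 6.1
area = {
    "trunc_6bit": 233597.5,
    "trunc_5bit": 212015.5,
    "trunc_4bit": 190433.5,
}


def reduction_percent(before, after):
    """Relative area saving of 'after' with respect to 'before', in percent."""
    return 100.0 * (before - after) / before


# 5-bit to 4-bit truncation, the comparison quoted in the text
print(round(reduction_percent(area["trunc_5bit"], area["trunc_4bit"]), 1))

# 6-bit to 5-bit truncation, for comparison
print(round(reduction_percent(area["trunc_6bit"], area["trunc_5bit"]), 1))
```

Each removed bit lowers the estimate by the same absolute amount (21582 units), so the relative saving grows slightly as the word gets narrower.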
6.3 Synopsys Synthesis Simulation Results
The step from Balsa to Verilog code was not supported automatically by the creators at the University of Manchester. This was partly because the gate-level netlist contained technology specific cells and they did not support the same technology that the RCIM group uses. Therefore, the TSMC 0.18 micron technology had to be added in order to use the Synopsys tools available to the RCIM group. The files needed for this mapping are included in Appendix D. The process is discussed in further detail in [19].
Table 6.1: breeze-cost Estimates For Area (Units Are Normalized)
                        trunc_6bit    trunc_5bit    trunc_4bit
Full System BCJR/MAP    233597.5      212015.5      190433.5
Figure 6.3: 10000 Blocks Transmitted Per dB Level - BER Vs. SNR - trunc_5bit Design
Figure 6.4: 10000 Blocks Transmitted Per dB Level - BER Vs. SNR - trunc_4bit Design
Table 6.2: Area, Power And Timing Values
           Area (μm²)      Power (mW)    Delay (ns)
Overall    6552259.5000    218.0502      254.35
Gamma      463780.0000     -             -
Alpha      2041387.3750    -             -
Beta       2461549.2500    -             -
LLR        796769.2500     -             -
Other      788773.6250     -             -
The process of obtaining gate-level Verilog code is known as an 'implementation'. The implementation that was created used the 4PBDP. The 4PBDP can also have variations, as previously discussed. The variation of 4PBDP that was used was the broad variation, see Figure 2.7.
Once the design was mapped to a gate-level netlist, the netlist could be used to generate area, power and timing results. The software tool that derived these reports is Synopsys' 'Design Analyzer'. These results are highlighted in Table 6.2. The complete reports for area, power and timing are shown in Appendix F.
The area values reported showed an overall 'gate' size of 6552259.5 μm². Efforts at improvement should be dedicated to the LLR block and to the Balsa tools, which generate all the 'Other' circuitry required. The chip gate dimension was 2.560 mm x 2.560 mm.
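The area figures in Table 6.2 can be cross-checked with a few lines of arithmetic. The sketch below sums the sub-block areas, derives the side length of an equivalent square, and prints each block's share of the total. Variable names are hypothetical and the values are copied from Table 6.2.

```python
import math

# Sub-block gate areas from Table 6.2, in square micrometres
block_area_um2 = {
    "Gamma": 463780.0,
    "Alpha": 2041387.375,
    "Beta": 2461549.25,
    "LLR": 796769.25,
    "Other": 788773.625,
}

total_um2 = sum(block_area_um2.values())

# Side of a square with the same area, converted from micrometres to millimetres
side_mm = math.sqrt(total_um2) / 1000.0

print(total_um2)
print(round(side_mm, 3))

# Share of the total gate area taken by each block, in percent
for name, value in block_area_um2.items():
    print(name, round(100.0 * value / total_um2, 1))
```

The sub-blocks add up to the overall figure, and the square root of the total area reproduces the quoted 2.560 mm side.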
The power value shown is referred to as the 'total dynamic power'. This is the total power consumed by the chip and it is the sum of the 'cell internal power' and the 'net switching power'. The 'cell internal power' was 95.9450 mW (44%) and the 'net switching power' was 122.1051 mW (56%). The 'cell leakage power' was 43.9936 μW and thus is considered negligible. Power values for the submodules of the BCJR/MAP decoder were not applicable since the design is constrained at the top level. If the values of the submodules were to be found, the sum of their parts would exceed the total power of the whole system. This has to do with the way the simulator derives the power value.
Table 6.3: Comparison To Synchronous MAP Decoder Designs
Authors        Rate    Algorithm    Block Length    Tech. (μm)    Area (mm²)    Power (mW)
Vogt           1/2     ML-MAP       668             0.35          2.8           -
Vogt           1/3     SOVA-OU      668             0.35          2.5           -
Bicherstaff    1/3     L-MAP        2568            0.35          9             -
Sabeti         1/2     ML-MAP       512             0.18          1.16          59.66
Perta          1/2     ML-MAP       20              0.18          6.55          218.05
The worst case path delay of the BCJR/MAP decoder is 254.35 ns. The values were obtained with 'conservative' (default) constraints placed on the system. Constraining area and timing may result in better values.
In [29], a comparison of different MAP decoders was presented. These findings are presented alongside the proposed BCJR/MAP decoder in Table 6.3.
These results show that the asynchronous digital design technology and tools are not yet up to par with the standard clocked tools. The tools optimize clock networks and this has serious ramifications on mapped systems that exhibit an asynchronous topology.
The large area value can be explained by the fact that non-optimized Random Access Memory (RAM) is being synthesized. These RAM blocks are implemented using small primitive cells. If asynchronous optimized cells were used in the mapping from Balsa to Verilog, this would improve the area and power consumption. Since the BCJR/MAP decoder required large amounts of RAM for hardware implementation, this was reflected in the larger than normal area values. In [29], optimized RAM blocks were used and this did comparatively reduce area.
6.3.1 Problems Encountered And Possible Remedies
The Balsa to Verilog mapping was troublesome. When the conversion was finally completed, Synopsys detected cells that were placed without loads. These cells were removed, see the fully generated Verilog code in Appendix E. The commented code reflects the cells that needed to be removed to obtain area, power and timing. Since the cells were removed, functionality at the gate level becomes a concern. Balsa simulation is not sufficient in terms of functionality testing until these concerns are addressed. Gate-level testing needs to be employed to fully test the functionality of the system if the mapping behaves erratically.
The Balsa suite offers two other protocol implementations: a delay-insensitive dual-rail encoding and a delay-insensitive one-of-four encoding. These other protocols require serious file manipulation (or 'hacking') in order to map to the technology properly. The sheer amount of work required is daunting due to the 'trial and error' methodology needed to fix the files.
The immense number of small cells in the netlist generated by the Balsa to Verilog mapping created memory problems on the well-equipped Sun workstations, which often ran out of memory. Addressing this with inventive constraint parameters will make the generation of results less volatile.
6.3.2 Future Work
Some future studies could include the already addressed issues, as well as the following. The Balsa language is a computer scientist's solution to asynchronous digital circuit automation for VLSI design. The separate language and system for simulation and synthesis is satisfactory; however, electrical engineers who are already fluent in digital design should not have to learn a brand new language. Verilog can be used to design asynchronous systems; however, it lacks several features that the CSP languages offer. A solution should be somewhere in between.
From a practical standpoint, training engineers to learn a new language requires time and money. However, if a system could be designed that uses existing synchronous RTL code (Verilog or VHDL) which is mapped to asynchronous gate-level code, the solution would seem a little more practical. The biggest problem is the lack of tool support for the asynchronous domain.
Most digital asynchronous designers are physically laying out their systems at the transistor level. This becomes analog design and is only plausible for companies and institutions with many engineers and large sums of money. When the technology changes, i.e., the feature size of the transistor shrinks, these designs need to be redesigned. The automation of VLSI asynchronous digital design is the sector that needs to be researched, not analog 'digital' physical layout, because automation will yield more efficient systems and designs.
An automated way of taking existing synchronous code (Verilog or VHDL) and converting it to gate-level asynchronous code is a needed area of research. This should also include the option of giving the designer different asynchronous design styles and the choice of algorithms to achieve low-power and smaller area systems.
Further exploration is needed past the gate-level step to uncover other advantages that the asynchronous paradigm offers. Since this thesis only examined area and power, a larger study needs to be conducted in the other beneficial areas of asynchronous systems. This can include lower noise systems and speed.
Once asynchronous design is recognized as a viable option, it will slowly amalgamate with synchronous systems, offering solutions to problems that are currently unsolved, e.g., clock skew and interconnect scaling issues. This will then give the required time for academia and industry to address low power and other algorithmic remedies needed in digital automated asynchronous design.
Chapter 7
Summary Of Contributions and 
Conclusion
This study has provided a foundation for the investigation of asynchronous digital design using a BCJR/MAP channel decoder as the design example. The results were encouraging. The following sections summarize the results.
7.1 Asynchronous VLSI
An asynchronous digital design flow has been investigated. A BCJR/MAP channel decoder has been designed in an asynchronous HDL (Balsa). This was converted to Verilog gate-level code for analysis with Synopsys. Most asynchronous designs employ a physical layout of the circuit. This level of design is expensive and not suited to small institutions and businesses. Employing a strategy that automates digital asynchronous design, such as Balsa, yields metrics that will eventually improve designs that are heavily constrained.
7.2 Simulation Architecture
The MatLab / Balsa simulation architecture that was implemented was a quick and efficient method for functionality and best case performance evaluation. The BCJR/MAP decoder was fully tested for functionality and best case performance and met the standard. For future designs, this simulation architecture will shorten the time needed to determine the feasibility of an asynchronous design.
7.3 BCJR/MAP Channel Decoding
The BCJR/MAP channel decoder was implemented and the results were encouraging. Bringing asynchronous tools to the required level of refinement is a daunting and difficult task. These new tools, however, need to be applied to useful technologies, e.g., Turbo Codes, to improve all areas of design.
Although the BCJR/MAP decoder meets reasonable area standards, other benefits including lowered noise could catapult asynchronous decoding into the forefront of space communications. Space systems require data sensitive communication exchange that would highly benefit from the use of a lower noise and power system. Furthering the design flow beyond the gate-level layer will aid in revealing the benefits of asynchronous design.
7.4 Conclusion
The BCJR/MAP decoder has been synthesized to the gate-level layer of the digital design flow. The resulting gate area was 2.560 mm x 2.560 mm. The power consumption of 218.0502 mW was a promising result.
This asynchronous BCJR/MAP decoder will hopefully inspire others to examine the asynchronous paradigm as a viable alternative to clocked systems.
References
[1] Erico Guizzo, "Closing In On The Perfect Code", IEEE Spectrum Magazine, March 2004
[2] C.H. Van Berkel, M.B. Josephs and S.M. Nowick, "Scanning the Technology: Applications of Asynchronous Circuits", Proceedings of the IEEE, Volume 87, Issue 2, pp. 223-233, 1999
[3] J. Sparso and S. Furber, "Principles of Asynchronous Circuit Design: A System Perspective", Kluwer Academic Publishers, 2001
[4] C. J. Myers, "Asynchronous Circuit Design", John Wiley and Sons, Inc., 2001
[5] S. Furber, "The Return of Asynchronous Logic", http://www.cs.man.ac.uk/async/background/return_async.html
[6] "Keynoter Sees Asynchronous Future For Digital Designs", EE Times, http://www.eetimes.com/story/OEG20021204S0018, December 4, 2002
[7] I. Sutherland and J. Ebergen, "Computers Without Clocks", http://www.sciam.com/article.cfm?articleID=00013F47-37CF-1D2A-97CA809EC588EEDF
[8] L. E. M. Brackenbury, M. Cumpstey and S.B. Furber, "Applying Asynchronous Techniques to a Viterbi Decoder Design", IEEE Seminar on Low Power IC Design, 2001
[9] K. E. Tepe, "Iterative Decoding Techniques for Correlated Rayleigh Fading and Diversity Channels", Communication, Information and Voice Processing Report Series, Report TR-2001-1, University of Lund - Information Technology Department and Rensselaer Polytechnic Institute - Electrical, Computer and System Engineering Department, 2001
[10] R. B. Wells, "Applied Coding and Information Theory for Engineers", Prentice Hall, 1999
[11] S. Shahzad Shah, S. Yaqub and F.l Suleman, "Self-Correcting Codes Conquer Noise, Part One: Viterbi Codecs", EDN Magazine, http://www.reed-electronics.com/ednmag/contents/images/75255.pdf, 2001
[12] P.A. Riocreux, L.E.M. Brackenbury, M. Cumpstey and S.B. Furber, "Low-Power Self-Timed Viterbi Decoder", Seventh International Symposium on Asynchronous Circuits and Systems, 2001
[13] I. Sutherland, "Micropipelines: Turing Award Lecture", Communications of the ACM, Volume 32, no. 6, pp. 720-738, 1989
[14] D.W. Lloyd and J.D. Garside, "A Practical Comparison of Asynchronous Design Styles", Seventh International Symposium on Asynchronous Circuits and Systems, pp. 36-45, 2001
[15] D. Muller and W. Bartky, "A Theory of Asynchronous Circuits", Proceedings of an International Symposium on the Theory of Switching, pp. 204-243, 1959
[16] C. Berrou, A. Glavieux and P. Thitimajshima, "Near Shannon Limit Error Correcting Coding and Decoding: Turbo Codes", Proceedings 1993 IEEE International Conference on Communications, pp. 1064-1070, 1993
[17] John G. Proakis and Masoud Salehi, "Communication Systems Engineering, 2nd Edition", Prentice Hall, 2002
[18] R. Goering, "Keynoter Sees Asynchronous Future For Digital Designs", EE Times, December 4, 2002
[19] Doug Edwards, Andrew Bardsley, Lilian Janin and Will Toms, "Balsa: A Tutorial Guide", Version V3.4.1, 2004
[20] "APT Website", http://www.cs.man.ac.uk/apt/index.html
[21] N. G. Kingsbury and P. J. W. Rayner, "Digital Filtering Using Logarithmic Arithmetic", Electronics Letters, Volume 7, Number 2, pp. 56-58, 1971
[22] J. A. Erfanian and S. Pasupathy, "Low-Complexity Parallel-Structure Symbol-by-Symbol Detection for ISI Channels", IEEE Transactions on Information Theory, Volume 41, Number 3, pp. 704-713, 1995
[23] Emmanuel Boutillon, Warren J. Gross and Glenn Gulak, "VLSI Architectures for the Forward-Backward Algorithm", http://lester.univ-ubs.fr:8080/ boutillon/sanchezt/ForBack_article.ps
[24] L. R. Bahl, J. Cocke, F. Jelinek and J. Raviv, "Optimal Decoding of Linear Codes for Minimizing Symbol Error Rate", International Symposium on Information Theory, pp. 284-287, 1972
[25] Patrick Robertson, Emmanuelle Villebrun and Peter Hoeher, "A Comparison of Optimal and Sub-Optimal MAP Decoding Algorithms Operating in the Log Domain", Proceedings of the IEEE ICC, pp. 1009-1013, 1995
[26] "The MathWorks - MATLAB - The Language of Technical Computing", http://www.mathworks.com/products/matlab
[27] Samir Palnitkar, "Verilog HDL - A Guide to Digital Design and Synthesis", Sun Soft Press - Prentice Hall, 1996
[28] "Synopsys", http://www.synopsys.com
[29] Leila Sabeti, "New VLSI Design Of A Max-Log MAP Decoder", Masters Thesis, University of Windsor, 2004
Appendix A
List of Abbreviations
2PBDP 2 Phase Bundled Data Protocol
2PDRP 2 Phase Dual Rail Protocol
4PBDP 4 Phase Bundled Data Protocol
4PDRP 4 Phase Dual Rail Protocol
ack Acknowledge
APP A Posteriori Probabilities
APT Advanced Processor Technologies
AS Asynchronous Systems
AWGN Additive White Gaussian Noise
BCJR Bahl, Cocke, Jelinek, and Raviv
BEC Binary Erasure Channel
BER Bit Error Rate
BD Bundled Data
BDP Bundled Data Protocol
BMU Branch Metric Unit
BPSK Binary Phase Shift Keying
BSC Binary Symmetric Channel
CAD Computer Aided Design
CC Convolutional Coding or Codes
CPU Central Processing Unit
CSP Communicating Sequential Processes
CMOS Complementary Metal Oxide Semiconductor
DMC Discrete Memoryless Channel
DR Dual Rail
DRP Dual Rail Protocol
EDA Electronic Design Automation
EM Electro Magnetic
FB Forward and Backward
FPGA Field Programmable Gate Arrays
FIR Finite Impulse Response
FSM Finite State Machine
GUI Graphical User Interface
HC Handshake Component
HDL Hardware Description Language
HU History Unit
IIR Infinite Impulse Response
LLR Log Likelihood Ratio
LUT Look Up Table
MAP Maximum A Posteriori
MP Muller Pipeline
NRC Non Recursive Convolutional
NRTZ Non Return To Zero
NRTZBDP Non Return To Zero Bundled Data Protocol
PMU Path Metric Unit
PREST Power Reduction for System Technologies
RAM Random Access Memory
RCIM Research Center for Integrated Microsystems
req Request
RSC Recursive Systematic Convolutional
RTL Register Transfer Level
RTZBDP Return To Zero Bundled Data Protocol
RTZ Return To Zero
Si The Chemical Element Silicon
SMU Survival Metric Unit
SNR Signal To Noise Ratio
SS Synchronous Systems
TSCM Taiwan Semiconductor Manufacturing Company
UNIX UNiplexed Information and Computing System
VA Viterbi Algorithm
VHDL Very High Speed Integrated Circuit Hardware Description Language
VLSI Very Large Scale Integration
XOR Exclusive 'OR'
Appendix B
Matlab Code - see enclosed CD
Appendix C
Balsa Code - see enclosed CD
Appendix D
Balsa To Verilog Netlist Mapping Files:
For TSCM 0.18 micron, Single Poly, Six
Metal, Salicide CMOS Process - see
enclosed CD
Appendix E
Verilog Code - see enclosed CD
Appendix F
Synopsys Area, Power And Timing
Report Files - see enclosed CD
VITA AUCTORIS
Kristofer Perta received his B.A.Sc. degree in Electrical Engineering from the University
of Windsor, graduating in 2001. His undergraduate co-op degree gave him the
opportunity to take placements outside of his hometown, in Toronto and Ottawa, with
Nortel Networks and JDS Uniphase.
In the fall of 2002, Kris began his graduate degree under the supervision of Dr. Tepe in 
the fields of digital VLSI design and convolutional channel coding.
Kris plans on pursuing a career in Hardware Design in the Telecommunications sector 
after the completion of his graduate degree.
