Mississippi State University

Scholars Junction
Theses and Dissertations

5-12-2012

Asynchronous Design Investigation for a 16-Bit Microprocessor
William Kalish

Recommended Citation
Kalish, William, "Asynchronous Design Investigation for a 16-Bit Microprocessor" (2012). Theses and
Dissertations. 804.
https://scholarsjunction.msstate.edu/td/804

This Graduate Thesis - Open Access is brought to you for free and open access by the Theses and Dissertations at
Scholars Junction. It has been accepted for inclusion in Theses and Dissertations by an authorized administrator of
Scholars Junction. For more information, please contact scholcomm@msstate.libanswers.com.

ASYNCHRONOUS DESIGN INVESTIGATION FOR A 16-BIT MICROPROCESSOR

By
William Kalish

A Thesis
Submitted to the Faculty of
Mississippi State University
in Partial Fulfillment of the Requirements
for the Degree of Master of Science
in Electrical Engineering
in the Department of Electrical and Computer Engineering

Mississippi State, Mississippi
May 2012

Approved:

Robert B. Reese
Associate Professor of Electrical and
Computer Engineering
(Director of Thesis)

Sherif Abdelwahed
Assistant Professor of Electrical and
Computer Engineering
(Committee Member)

Thomas H. Morris
Assistant Professor of Electrical and
Computer Engineering
(Committee Member)

James E. Fowler
Graduate Coordinator of Electrical and
Computer Engineering
(Graduate Program Director)

Sarah A. Rajala
Dean of the Bagley College of
Engineering

Name: William Kalish
Date of Degree: May 11, 2012
Institution: Mississippi State University
Major Field: Electrical Engineering
Major Professor: Dr. Robert B. Reese
Title of Study: ASYNCHRONOUS DESIGN INVESTIGATION FOR A 16-BIT
MICROPROCESSOR
Pages in Study: 64
Candidate for Degree of Master of Science

Asynchronous design is an alternative to the more widely used synchronous
design. It eliminates the global clock network and associated design
issues such as clock skew. Uncle is a toolflow that provides automated assistance for
transforming a synchronous system specified in Verilog RTL into an asynchronous system.
With assistance from Uncle, an asynchronous delay-insensitive microprocessor is
implemented using NULL Convention Logic (NCL) and verified to function properly.
An advantage of asynchronous design is that it can be data-driven: specific blocks of
logic are active only when they are needed. Data-driven design is implemented to bypass
parts of the asynchronous microprocessor, namely the ALU and the peripheral hardware
multiplier. This resulted in a reduction of total power consumed and an increase in
speed. Overall, it was concluded that asynchronous design with Uncle is a viable
alternative to synchronous design.

DEDICATION

This work is dedicated to all the people who I consider family. I could not have
done it without your love and support. Thank you.

ACKNOWLEDGEMENTS

I thank my major professor, Dr. Bob Reese, not only for his continued support
with this thesis but also for his guidance with my post-graduate education. This work
could not have been completed without his unending willingness to help. From
brainstorming ideas about where to focus this research to helping in the editing of this
thesis, his support is greatly appreciated.

TABLE OF CONTENTS

DEDICATION
ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES
CHAPTER

I. INTRODUCTION
   1.1 Problem Statement
   1.2 Organization of Thesis

II. OVERVIEW OF ASYNCHRONOUS SYSTEMS USING NCL
   2.1 Delay-insensitive Systems
      2.1.1 Four-Phase Signaling
      2.1.2 Completion Detection
      2.1.3 Four-Phase Handshaking
      2.1.4 Dual-Rail Boolean Logic
   2.2 Null Convention Logic
      2.2.1 Threshold Gates
      2.2.2 Dual-Rail Combinational NCL
      2.2.3 Registers
   2.3 Asynchronous systems using NCL

III. ASYNCHRONOUS MICROPROCESSOR IMPLEMENTATION
   3.1 Demonstrating Uncle
      3.1.1 Generation of Asynchronous System
      3.1.2 Asynchronous System Simulation
      3.1.3 Two-bit Counter Example
   3.2 Microprocessor Architecture
      3.2.1 Memory Backbone
      3.2.2 Frontend
      3.2.3 Execution Unit
      3.2.4 Peripherals
   3.3 Asynchronous Implementation
      3.3.1 Implementation
      3.3.2 Testing
      3.3.3 Watchdog Timer Issue

IV. ASYNCHRONOUS DESIGN ADVANTAGES
   4.1 Wavefront Steering
   4.2 Wavefront Steering in the ALU
   4.3 Wavefront Steering in the Hardware Multiplier
   4.4 Results

V. CONCLUSION
   5.1 Conclusion
   5.2 Future Work

REFERENCES

LIST OF TABLES

2.1 NCL Gates [2]
3.1 Instruction Formats [7]
3.2 Single-operand Instructions [7]
3.3 Two-operand Instructions [7]
3.4 Addressing Modes [7]
3.5 Jump Conditions [7]
3.6 Status Register Bits [7]
3.7 Asynchronous Microprocessor Test Results
4.1 Data-driven ALU results
4.2 Data-driven multiplier and ALU results

LIST OF FIGURES

2.1 Dual-rail encoding
2.2 Single-bit completion detection
2.3 3-bit wide completion detection
2.4 Four-phase handshaking
2.5 Dual-rail Boolean (a) AND function and (b) XOR function
2.6 Generic threshold gate [2]
2.7 AND function (a) C-gate implementation (b) NCL implementation
2.8 Threshold gate (a) reset to 0, (b) reset to 1
2.9 Dual-rail NCL 1-bit Latch [2]
2.10 Dual-rail NCL 1-bit DFF
2.11 Single-rail synchronous 2-bit counter
2.12 Dual-rail asynchronous 2-bit counter
3.1 Synchronous to asynchronous conversion
3.2 Asynchronous testbench flow
3.3 Verilog RTL of synchronous two-bit counter
3.4 Simulation output of asynchronous two-bit counter
3.5 Instruction fetch waveform
3.6 Instruction decode waveform
3.7 Execution stage waveform
3.8 Result of XOR waveform
3.9 XOR test program
3.10 XOR Verilog test
4.1 Composition of 1-2 demux [5]
4.2 Generic 1-N demux
4.3 Wavefront steering example
4.4 ALU datapath
4.5 Simulation of ALU
4.6 ALU datapath with wavefront steering
4.7 Simulation of ALU with wavefront steering
4.8 Peripherals datapath
4.9 Simulation of peripheral outputs
4.10 Peripheral datapath with wavefront steering
4.11 Multiplier demux self-ack
4.12 Multiplier gated ack network
4.13 Simulation of peripheral datapath with wavefront steering

CHAPTER I
INTRODUCTION

Asynchronous design has been used since the 1950s and has provided a solution
to some of the problems related to modern microprocessor design [1]. Asynchronous
designs do not depend on a global clock network for synchronizing data movement
between components. As a result, cycle time is based on average-case delay rather than
on the worst-case delay that sets the clock period of a synchronous design. Asynchronous
designs need more transistors to perform a computation, so clock energy is traded for
computation energy. Power is also saved because asynchronous designs only compute
when data is available. Another advantage is that, because signal transitions are not
concentrated near a clock edge, asynchronous designs have better noise performance and
thus can be beneficial in a mixed-signal design [2]. Clock skew is the difference in
arrival times of the clock signal at different parts of the circuit, and it has been shown
that clock skew increases as the size of circuit features decreases. Since no clock
network is required in an asynchronous design, clock skew becomes a non-issue [3]. A
drawback is that asynchronous designs, such as the delay-insensitive design style used in
this paper, generally require more transistors and routing than a synchronous design.
This is due to the extra logic needed for data arrival detection and the generation of
dual-rail outputs, both of which are discussed in further detail later in this document
[2]. These are a few of the reasons why a circuit designer may want to use asynchronous
design; the largest drawback is the increased area requirement.

1.1 Problem Statement
A major limiting factor to the use of asynchronous design from a designer’s point
of view is the lack of supporting CAD tools [4]. As mentioned previously, a
delay-insensitive asynchronous design requires some type of data arrival detection, which
implies that multi-rail logic is required (dual-rail logic is used in this paper). Manual
netlisting of dual-rail logic and creation of the associated data arrival logic networks can
become complex and error-prone in a large system. Design time can be reduced
significantly with the right assistance from automated tools.
Uncle [5] is a toolset that provides automated assistance for generating an
asynchronous design. The initial design is specified in Verilog RTL and then, using a
commercial synthesis tool, is synthesized to a netlist of D-flip-flops, latches,
combinational logic, and a few special gates known by the toolset. Uncle then reads this
netlist and maps it to an implementation technology known as NULL Convention Logic
(NCL). The Uncle toolflow also automatically generates the associated data arrival
acknowledgement networks to ensure that the resulting asynchronous netlist correctly
cycles. This resulting gate level netlist can then be simulated in a Verilog simulator.
A microprocessor is a natural test case for an asynchronous design methodology
given that it is the central element of a computer system. The asynchronous
microprocessor discussed in this work is strongly based upon the Texas Instruments
MSP430, which is a synchronous 16-bit RISC CPU. Furthermore, this work describes
the process of creating an asynchronous microprocessor using Verilog RTL that is
compatible with Uncle while maintaining the same basic functionality as the MSP430.

1.2 Organization of Thesis
The organization of this thesis is as follows. Chapter 2 presents introductory
information on the asynchronous design style (dual-rail, delay-insensitive) and the target
implementation technology (NCL). Chapter 3 presents the MSP430 architecture and the
design approach for the asynchronous version based on this architecture. Chapter 4
presents the concept of data-driven design in asynchronous systems and applies it to
subsystems within the asynchronous version of the MSP430. Chapter 5 presents
conclusions and possible areas of future study.

CHAPTER II
OVERVIEW OF ASYNCHRONOUS SYSTEMS USING NCL

An asynchronous system is a system without a global clock. Asynchronous
systems can be either delay-insensitive (DI) or delay-sensitive; this thesis is concerned
with delay-insensitive design. Delay-insensitive systems are not time dependent and can
handle arbitrary signal arrival times between gates. One circuit implementation technology
for delay-insensitive systems is Null Convention Logic (NCL), which is
discussed in detail in this chapter [2].

2.1 Delay-insensitive Systems
Delay-insensitive systems are defined by having unbounded wire and gate delays
[3]. This means that the delay through a wire or a gate in a DI system can take on any
value and it will not change the functionality of the circuit. Because of the unbounded
delays and absence of a global clock in a DI asynchronous system it is necessary to have
some type of data encoding, completion detection, and a handshaking protocol.

2.1.1 Four-Phase Signaling
DI systems require a data encoding scheme that is used to differentiate between
new and previous data. One possible way to encode data is with dual-rail encoding [4].
The type of dual-rail encoding used in this paper is known as true/false encoding, where
each data signal has a true and a false rail. The four possible values using this encoding
method are shown in Figure 2.1 for the true rail and the false rail of the
signal D.

Figure 2.1 Dual-rail encoding

From Figure 2.1:
• When both the true and false rails are low the signal is NULL.
• When the true rail is high and the false rail is low the signal is a DATA-1.
• When the true rail is low and the false rail is high the signal is a DATA-0.
• When both the true and false rails are high the signal is invalid [2].

Two rules associated with dual-rail encoding are that a DATA-wave must be
followed by a NULL-wave and that the true and false rails can never both be asserted at
the same time [4]. This provides a way to distinguish the next DATA-wave from the
current one because they are always separated by a NULL-wave.
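The encoding in Figure 2.1 and the two rules above can be sketched in a few lines of Python. This is only an illustrative model, not part of the thesis toolflow; the names `Rail`, `encode`, `decode`, and `separated_by_null` are assumptions introduced for this example.

```python
from enum import Enum


class Rail(Enum):
    """The four values a dual-rail signal can take (hypothetical names)."""
    NULL = "NULL"
    DATA0 = "DATA-0"
    DATA1 = "DATA-1"
    INVALID = "invalid"


def encode(bit):
    """Return (true_rail, false_rail) for a bit; None stands for NULL."""
    if bit is None:
        return (0, 0)
    return (1, 0) if bit else (0, 1)


def decode(true_rail, false_rail):
    """Classify one dual-rail signal from the levels of its two rails."""
    if true_rail and false_rail:
        return Rail.INVALID
    if true_rail:
        return Rail.DATA1
    if false_rail:
        return Rail.DATA0
    return Rail.NULL


def separated_by_null(waves):
    """True if no wave is invalid and no DATA wave directly follows another."""
    previous_was_data = False
    for wave in waves:
        if wave is Rail.INVALID:
            return False
        is_data = wave in (Rail.DATA0, Rail.DATA1)
        if is_data and previous_was_data:
            return False
        previous_was_data = is_data
    return True
```

A receiver built on this model can tell two consecutive DATA-1 values apart only because a NULL appears between them, which is the purpose of the first rule.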

2.1.2 Completion Detection
In a DI asynchronous system the ability to detect when computations are
completed on the current data is required [2]. Because the true and false rails are never
asserted simultaneously, single-bit output completion detection is accomplished by
connecting the true and false rails to an OR gate, as seen in Figure 2.2. With Y as the
output, if either the true rail or the false rail is high then there is data on the output
and the acknowledgement out (ackout) signal is high. If the true and false rails are both
low (NULL) then ackout is low.

Figure 2.2 Single-bit completion detection

Figure 2.2 shows sufficient completion detection for a single bit of data but extra
logic is required for multiple-bit completion detection. The basic gate used for
multiple-bit completion detection is called a Muller C-gate [2]. The C-gate is a state-holding
element whose output transitions high once all of its inputs are high, and transitions low
once all of its inputs have returned low. Figure 2.3 shows completion detection for a
3-bit wide output.

Figure 2.3 3-bit wide completion detection

The C-gate is shown with inputs on the left and the output on the right. Now with
the addition of the C-gate, during a DATA-wave ackout will only become high when all
of the separate bits of the output contain valid data. During a NULL-wave the true and
false rails of the output will both be zero, therefore ackout will be low. Even though
Figure 2.3 only shows completion detection for a 3-bit wide output, it can easily be
expanded to any number of bits by using trees of C-elements (the C-elements used in this
paper are limited to four inputs to limit transistor stacking in their CMOS implementation).

2.1.3 Four-Phase Handshaking
Modules in a DI system must have some type of handshaking protocol to signal
when data is ready to be sent to another module and when data has been received from
another module. One way to accomplish this is through four-phase handshaking where
all modules include an acknowledgement in (ackin) signal and an acknowledgement out
(ackout) signal [3]. Figure 2.4 shows a module using four-phase handshaking.

Figure 2.4 Four-phase handshaking

Initially, ackin is asserted representing a request-for-DATA. This allows the
DATA-1 value on the input, D, to be passed to the output, Q. The DATA-1 value on the
output causes ackout to transition low which represents a request-for-NULL to the
module supplying the input. Eventually ackin is driven low showing that the previous
DATA-wave was received by the module connected to the output. A NULL value on the
input, caused by ackout being low, and ackin being driven low allows the NULL value to
be passed from the input to the output and causes ackout to be driven high. Finally, ackin
is driven high showing that the module connected to the output received the previous
NULL-wave [3].
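The four phases described above can be traced with a minimal behavioral model. The sketch below assumes a module that passes DATA only while ackin is high, passes NULL only while ackin is low, and drives ackout low whenever its output holds DATA; the class name `Stage` and the use of `None` for NULL are illustrative choices, not names from the thesis.

```python
NULL = None  # stands for the NULL value on a dual-rail signal


class Stage:
    """Behavioral model of one module using four-phase handshaking."""

    def __init__(self):
        self.q = NULL  # output starts at NULL

    def update(self, d, ackin):
        """Apply input d and ackin; return the (possibly unchanged) output."""
        if ackin == 1 and d is not NULL:
            self.q = d          # request-for-DATA and DATA present
        elif ackin == 0 and d is NULL:
            self.q = NULL       # request-for-NULL and NULL present
        return self.q           # otherwise the output holds its value

    @property
    def ackout(self):
        """Low while the output holds DATA (request-for-NULL), else high."""
        return 0 if self.q is not NULL else 1


stage = Stage()
trace = []
for d, ackin in [(1, 1), (1, 0), (NULL, 0), (NULL, 1)]:
    q = stage.update(d, ackin)
    trace.append((q, stage.ackout))
```

In this trace the output holds the DATA-1 value through the second step, even though ackin has already gone low, because the input has not yet returned to NULL.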

Using four-phase handshaking allows for multiple DI modules to communicate
with one another without the use of a global clock. Also, it allows the transfer of data
while preventing previous data that may be in use from being overwritten. Therefore,
independent of the computation time of the module, data is only sent once the data and a
receiver are ready. A receiver is only ready after it has completed computation on the
previous data.

2.1.4 Dual-Rail Boolean Logic
Since four-phase signaling requires the implementation of both a true and a false
rail, dual-rail logic must also be implemented in a DI system. Dual-rail logic
ensures that the appropriate value is produced on both the true and false rail of an output.
To create a dual-rail Boolean logic function, each possible combination of the input
rails is connected to a separate C-gate. Then, the outputs of those C-gates that satisfy the
Boolean logic function are OR’d together to produce the true rail output and the outputs
of those that do not satisfy the logic function are OR’d together to produce the false rail
output [6]. Figure 2.5 gives an example of a two-input dual-rail AND function, $AB$, and
the XOR function, $A\overline{B} + \overline{A}B$. The dual-rail logic allows the creation
of logic functions that produce NULL, DATA-0, and DATA-1 values.
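The construction described above can be sketched for any two-input function: one C-gate per input combination, with the gate outputs OR'd onto the true or false rail according to the function's truth table. The names `CGate` and `DualRailFunction` are hypothetical, and the model is behavioral rather than a gate-accurate copy of Figure 2.5.

```python
from itertools import product


class CGate:
    """Muller C-gate with hysteresis (holds its output on mixed inputs)."""

    def __init__(self):
        self.out = 0

    def update(self, *inputs):
        if all(inputs):
            self.out = 1
        elif not any(inputs):
            self.out = 0
        return self.out


class DualRailFunction:
    """Dual-rail version of a two-input Boolean function built from C-gates."""

    def __init__(self, func):
        self.func = func
        # One C-gate per input combination (a_value, b_value).
        self.gates = {combo: CGate() for combo in product((0, 1), repeat=2)}

    def update(self, a, b):
        """a and b are (true_rail, false_rail) pairs; returns the output pair."""
        out_true = out_false = 0
        for (va, vb), gate in self.gates.items():
            rail_a = a[0] if va else a[1]  # pick the rail that encodes va
            rail_b = b[0] if vb else b[1]
            fired = gate.update(rail_a, rail_b)
            if self.func(va, vb):
                out_true |= fired   # combination satisfies the function
            else:
                out_false |= fired  # combination does not satisfy it
        return (out_true, out_false)


DATA1, DATA0, NULL = (1, 0), (0, 1), (0, 0)
and_gate = DualRailFunction(lambda x, y: x & y)
xor_gate = DualRailFunction(lambda x, y: x ^ y)
```

With only one input at DATA, no C-gate fires and the output stays NULL, so the output becomes DATA only after both inputs have arrived.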

Figure 2.5 Dual-rail Boolean (a) AND function and (b) XOR function

2.2 Null Convention Logic
Null Convention Logic, or NCL, is an alternative to traditional clocked Boolean
logic and can be used to efficiently implement dual-rail logic for delay-insensitive
asynchronous systems. NCL is different from Boolean logic in that it uses a combination
of four-phase signaling, four-phase handshaking, and threshold gates to create dual-rail
logic functions. While dual-rail logic can be implemented by just using C-gates, NCL can
implement the same logic with fewer transistors.

2.2.1 Threshold Gates
Threshold gates are used in NCL instead of traditional Boolean logic gates because
they are only dependent on gate inputs. Generic threshold gates are described as THmn,
where m is the number of inputs that must be asserted for the output to be asserted and n
is the total number of inputs [4]. The symbol for a generic threshold gate can be seen in
Figure 2.6.

Figure 2.6 Generic threshold gate [2]

As an example, a TH34 gate has a total of 4 inputs and a threshold of 3.
Therefore, at least 3 of the 4 inputs must be asserted for the output to be
asserted. Furthermore, during a DATA-wave the output of a threshold gate
does not become valid until the minimum number of inputs needed to evaluate the gate
become valid, and during a NULL-wave the output maintains its data value until all of the
inputs become NULL [4]. Therefore, a THnn gate is equivalent to an n-input C-gate
and a TH1n gate is equivalent to an n-input OR gate.
In addition to the generic threshold gates there are also weighted threshold gates
[2]. As opposed to the generic threshold gate where all inputs carry the same weight, in a
weighted threshold gate different inputs can have different weights. For example,
TH24w22 describes a weighted threshold gate with 4 inputs and a threshold of 2. The first
and second inputs, A and B, have a weight of 2 and the remaining inputs, C and D, have
the default weight of 1. So, the TH24w22 implements the Boolean function A + B + CD.
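The threshold and weight rules above can be checked with a short behavioral model. In the sketch below, `ThresholdGate` is a hypothetical class whose output is set when the weighted sum of asserted inputs reaches the threshold and is cleared only when all inputs are low; the exhaustive check confirms that a TH24w22 gate starting from a low output matches A + B + CD.

```python
from itertools import product


class ThresholdGate:
    """Behavioral model of an NCL threshold gate with hysteresis."""

    def __init__(self, threshold, weights):
        self.threshold = threshold
        self.weights = weights  # one weight per input, in input order
        self.out = 0

    def update(self, *inputs):
        total = sum(w for w, x in zip(self.weights, inputs) if x)
        if total >= self.threshold:
            self.out = 1        # enough weighted inputs are asserted
        elif not any(inputs):
            self.out = 0        # all inputs have returned to NULL
        return self.out         # otherwise hold the previous value


def set_condition(threshold, weights, inputs):
    """Output of a gate that starts low and is given `inputs` once."""
    return ThresholdGate(threshold, weights).update(*inputs)


# TH24w22: threshold 2, inputs A and B weighted 2, C and D weighted 1.
th24w22_matches = all(
    set_condition(2, (2, 2, 1, 1), (a, b, c, d)) == (a | b | (c & d))
    for a, b, c, d in product((0, 1), repeat=4)
)
```

The same class models a generic THmn gate by giving every input a weight of 1, for example `ThresholdGate(3, (1, 1, 1, 1))` for a TH34.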

2.2.2 Dual-Rail Combinational NCL
With the use of normal and weighted threshold gates, 27 different NCL gates can
be created. From these gates it is possible to create combinational NCL
capable of many different functions. The 27 NCL gates are listed in
Table 2.1.

Table 2.1 NCL Gates [2]

NCL Gate Name    Boolean Equivalent
TH12             A + B
TH13             A + B + C
TH14             A + B + C + D
TH22             AB
TH23             AB + AC + BC
TH23w2           A + BC
TH24             AB + AC + AD + BC + BD + CD
TH24w2           A + BC + BD + CD
TH24w22          A + B + CD
TH24comp         AC + BC + AD + BD
TH33             ABC
TH33w2           AB + AC
TH34             ABC + ABD + ACD + BCD
TH34w2           AB + AC + AD + BCD
TH34w3           A + BCD
TH34w22          AB + AC + AD + BC + BD
TH34w32          A + BC + BD
TH44             ABCD
TH44w2           ABC + ABD + ACD
TH44w3           AB + AC + AD
TH44w22          AB + ACD + BCD
TH44w322         AB + AC + AD + BC
TH54w22          ABC + ABD
TH54w32          AB + ACD
TH54w322         AB + AC + BCD
THxor0           AB + CD
THand0           AB + BC + AD

Dual-rail NCL functions are created very similarly to the dual-rail Boolean logic
functions explained in Section 2.1.4; the difference is that Boolean
logic gates are replaced with threshold logic gates. Figure 2.7(a) shows the same
dual-rail C-gate implementation of the AND function that was shown in Figure
2.5(a). Figure 2.7(b) shows the NCL implementation of the AND function.

Figure 2.7 AND function (a) C-gate implementation (b) NCL implementation

Figure 2.7(a) and Figure 2.7(b) are logically equivalent, but Figure 2.7(b) uses
fewer transistors. Each C-gate uses 12 transistors and an OR gate uses 8 transistors.
Therefore, the C-gate implementation in Figure 2.7(a) uses a total of 56 transistors. The
TH22 used in Figure 2.7(b) uses 12 transistors and the THand0 uses only 19 transistors [2].
Therefore, the NCL implementation of the AND function uses 31 transistors, a reduction
of 25 transistors (about 45%) compared to the C-gate implementation.
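The comparison above can be reproduced with a short sketch that models the NCL AND function and tallies the transistor counts quoted in the text. The gate equations come from Table 2.1; the pin assignment of the THand0 inputs and the breakdown of the 56-transistor total into four C-gates plus one OR gate are assumptions chosen to be consistent with the stated numbers, since Figure 2.7 is not reproduced here.

```python
class HysteresisGate:
    """NCL gate: set by its Boolean set function, cleared when all inputs are 0."""

    def __init__(self, set_fn):
        self.set_fn = set_fn
        self.out = 0

    def update(self, *inputs):
        if self.set_fn(*inputs):
            self.out = 1
        elif not any(inputs):
            self.out = 0
        return self.out


def make_th22():
    return HysteresisGate(lambda a, b: a & b)                        # AB


def make_thand0():
    return HysteresisGate(lambda a, b, c, d: (a & b) | (b & c) | (a & d))  # AB + BC + AD


class NclAnd:
    """Dual-rail AND from one TH22 (true rail) and one THand0 (false rail)."""

    def __init__(self):
        self.true_gate = make_th22()
        self.false_gate = make_thand0()

    def update(self, a, b):
        a_t, a_f = a
        b_t, b_f = b
        z_t = self.true_gate.update(a_t, b_t)
        # Assumed pin order (A, B, C, D) = (b_f, a_f, b_t, a_t).
        z_f = self.false_gate.update(b_f, a_f, b_t, a_t)
        return (z_t, z_f)


# Transistor counts quoted in the text.
C_GATE, OR_GATE, TH22, THAND0 = 12, 8, 12, 19
c_gate_total = 4 * C_GATE + OR_GATE   # assumed: four C-gates and one OR gate
ncl_total = TH22 + THAND0
saving = c_gate_total - ncl_total
```

Driving `NclAnd` with each of the four DATA input pairs, with a NULL pair in between, gives DATA-1 only when both inputs are DATA-1 and DATA-0 otherwise.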

2.2.3 Registers
As with synchronous systems, registers are a necessity in asynchronous systems
for storing data. Also, as in a synchronous system, there must be some
method of resetting a register. Figure 2.8 shows the representation of a threshold gate
with a threshold of 2 that (a) resets to 0 and (b) resets to 1.

Figure 2.8 Threshold gate (a) reset to 0, (b) reset to 1

The first type of register is a latch. To construct a dual-rail NCL latch, three gates
are needed for passing data and completion detection. Also, since a clock signal is absent
in an asynchronous latch, acknowledgment signals are required to dictate when data can
and cannot be passed from D to Q. Figure 2.9 shows an NCL latch that resets to DATA-0.

Figure 2.9 Dual-rail NCL 1-bit Latch [2]

As seen from the figure, the latch can only pass a DATA value from D to Q when
ackin is asserted. Otherwise, when ackin is low, a NULL value is passed from D to Q.
When there is a DATA value on Q then ackout is low and when there is a NULL value
on Q then ackout is high. Along with a latch that resets to DATA-0 it is also possible to
have a latch that resets to DATA-1 or NULL by changing the configuration of the two
threshold gates that pass data from D to Q.
The second type of register is a D flip-flop (DFF). An NCL DFF is constructed of
three NCL latches and can be seen in Figure 2.10. Also, in Figure 2.10, as well as in the
remainder of this paper, ‘DR’ will be used to represent components that handle dual-rail
signals.

Figure 2.10 Dual-rail NCL 1-bit DFF

During the first DATA-wave the DATA value on the D input of the DFF will be
passed to the output of the first latch. Also, the NULL value that was on the output of the
first latch will be passed to the output of the second latch. Finally, the DATA value that
was on the output of the second latch will be passed to the third latch, producing a DATA
value on the output of the DFF. Next, during the following NULL-wave the NULL value on
the D input of the DFF will be passed to the output of the first latch. Then, the DATA value
that was on the output of the first latch will be passed to the output of the second latch.
Finally, the NULL value that was on the output of the second latch will be passed to the
third latch, producing a NULL value on the output of the DFF. Therefore, the NCL DFF can
produce DATA values during DATA-waves and NULL values during NULL-waves.
This cycle will continue for each DATA-wave and NULL-wave.
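The movement of values through the three latches can be followed with a deliberately simplified model. The sketch below ignores the acknowledgement signals entirely and treats each wave as shifting every latch's value one position toward the output; the class name `DffModel`, the use of `None` for NULL, and the assumption that the middle latch is the one holding the reset DATA value are illustrative choices based on the description above.

```python
NULL = None  # stands for a NULL value


class DffModel:
    """Token-level model of an NCL DFF as three latches in series.

    Handshaking is not modeled; each call to `wave` represents one complete
    DATA-wave or NULL-wave passing through the register.
    """

    def __init__(self, reset_value=0):
        # Assumed reset state: NULL, DATA (the reset value), NULL.
        self.latches = [NULL, reset_value, NULL]

    def wave(self, d):
        """Shift the input wave in and return the value on the DFF output."""
        self.latches = [d] + self.latches[:2]
        return self.latches[2]


dff = DffModel(reset_value=0)
outputs = [dff.wave(d) for d in (1, NULL, 0, NULL, 1)]
# outputs == [0, None, 1, None, 0]
```

In this model the value applied during one DATA-wave appears on the output during the next DATA-wave, and the first DATA-wave after reset produces the reset value.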

2.3 Asynchronous systems using NCL
Combinational logic and registers using NCL provide the necessary tools for
designing an asynchronous system using NCL. A generic asynchronous system using
NCL can be described as multiple iterations of some type of combinational NCL between
two NCL registers with the addition of an acknowledgment network [2]. It is possible for
a synchronous system to be converted to an asynchronous system that is implemented in
NCL. First, replace the single-rail Boolean logic and the sequential components
with their dual-rail equivalents. Then, replace the dual-rail Boolean logic and the
sequential elements with their NCL equivalent and add the necessary acknowledgement
network. This results in a DI system that is free of a global clock signal. An example of
this process is shown in the following figures, where the system being created is a 2-bit
counter.

Figure 2.11 Single-rail synchronous 2-bit counter

Figure 2.12 Dual-rail asynchronous 2-bit counter

As shown in Figure 2.12, the clock signal has been removed and an
acknowledgement network has been added. The acknowledgement network is generated
by ensuring that each individual register receives an acknowledgement from all the other
registers that are destinations of that register’s output. This ensures that all the
necessary registers have received the appropriate data. The ackout signal is composed of
the ackouts from the destinations of all primary inputs with the exception of the reset
signal. Also, the ackin signal must be incorporated into the acknowledgement network of
any register whose output traces to a primary output.
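The rule for building the acknowledgement network can be expressed as a small graph computation. The sketch below is a simplified illustration, not the algorithm used by any particular tool: the function name `ack_sources`, the signal naming, and the register connectivity assumed for the 2-bit counter (bit 0 feeding both bits, bit 1 feeding only itself, both bits being primary outputs) are all assumptions made for this example.

```python
def ack_sources(fanout, primary_outputs):
    """List the acknowledgements each register must collect.

    fanout maps a register name to the set of registers its output reaches.
    primary_outputs is the set of registers whose output traces to a
    primary output, so the external ackin must be included for them.
    """
    sources = {}
    for reg, destinations in fanout.items():
        acks = {f"{dest}.ackout" for dest in destinations}
        if reg in primary_outputs:
            acks.add("ackin")
        sources[reg] = sorted(acks)
    return sources


# Assumed connectivity for a 2-bit counter with registers q0 and q1.
counter_fanout = {"q0": {"q0", "q1"}, "q1": {"q1"}}
counter_acks = ack_sources(counter_fanout, primary_outputs={"q0", "q1"})
# counter_acks == {"q0": ["ackin", "q0.ackout", "q1.ackout"],
#                  "q1": ["ackin", "q1.ackout"]}
```

In hardware the listed acknowledgements for each register would be combined with a C-gate so that the register sees a single acknowledgement only after every destination has responded.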
The system shown in Figure 2.12 is delay-insensitive and depends only on the
arrival of data on the inputs. After reset is released, ackin is driven high and the
system begins to cycle, counting through the values 0 to 3. The only functional
difference between the synchronous and asynchronous counters is that the asynchronous
counter has a NULL-wave inserted between every pair of DATA-waves, whereas the
synchronous counter produces only DATA-waves. This is shown later when the process of
creating an NCL netlist using Uncle is discussed.
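The alternation of DATA-waves and NULL-waves can be illustrated with a small behavioral model. The following Python sketch is not taken from the design itself; it assumes the usual dual-rail encoding in which a (true-rail, false-rail) pair of (0, 0) is NULL, (0, 1) is DATA-0, and (1, 0) is DATA-1, and it assumes the counter starts at 0 after reset. All function names are hypothetical.

```python
NULL = (0, 0)  # dual-rail pairs are written as (true_rail, false_rail)

def encode(bit):
    """Encode a Boolean bit as a dual-rail DATA value."""
    return (1, 0) if bit else (0, 1)

def decode(pair):
    """Return 0 or 1 for DATA, or None for NULL."""
    if pair == NULL:
        return None
    if pair == (1, 0):
        return 1
    if pair == (0, 1):
        return 0
    raise ValueError("illegal dual-rail state: both rails asserted")

def counter_waves(n_data_waves, width=2):
    """Yield counter outputs as alternating DATA-waves and NULL-waves."""
    count = 0  # assumption: the counter holds 0 after reset
    for _ in range(n_data_waves):
        # DATA-wave: every bit carries a valid dual-rail value
        yield [encode((count >> i) & 1) for i in range(width)]
        # NULL-wave: every bit returns to NULL before the next DATA-wave
        yield [NULL] * width
        count = (count + 1) % (1 << width)

def word_value(wave):
    """Convert a wave to an integer, or None if the wave is not complete DATA."""
    bits = [decode(pair) for pair in wave]
    if any(b is None for b in bits):
        return None
    return sum(b << i for i, b in enumerate(bits))
```

Printing `[word_value(w) for w in counter_waves(5)]` lists the count values separated by `None` entries, one for each NULL-wave, which is the only visible difference from the synchronous counter.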

CHAPTER III
ASYNCHRONOUS MICROPROCESSOR IMPLEMENTATION

From a designer productivity viewpoint, it is advantageous to have automated
software that assists the user in the creation of asynchronous systems. The Uncle toolset
discussed in this chapter provides that assistance. One of the goals of this thesis is to
determine if the Uncle synthesis system can be used to implement a system as complex as
a microprocessor. A second goal is to determine if the microprocessor design can be
changed to take advantage of the unique capabilities of asynchronous design.
This chapter is organized as follows. First, the Uncle toolflow for asynchronous
system design is explained. The counter system that was used in the previous chapter
will be used to demonstrate Uncle’s capabilities. Then, using Uncle, it will be
determined if a more complex system such as a microprocessor can be implemented.
The asynchronous microprocessor being implemented is based on the Texas Instruments
MSP430 microprocessor.

3.1 Demonstrating Uncle

3.1.1 Generation of Asynchronous System
As stated previously, Uncle is a toolset that assists in transforming a synchronous
system specified in Verilog RTL to an asynchronous system. This process can be seen in
Figure 3.1.

Figure 3.1 Synchronous to asynchronous conversion

The initial design is specified in Verilog RTL that is synthesized by a
commercial synthesis tool to a gate-level netlist that uses AND2, XOR2, OR2,
INVERTER, DFF, and D-LATCH components. In addition to these components, there
are a few special components that are used to exploit some of the advantages that
asynchronous design has to offer. These special components are explained in further
detail later. After the RTL is synthesized, the gate-level netlist becomes the input
netlist for the Uncle toolflow. Uncle begins by expanding the netlist to its dual-rail
equivalent using a process similar to the one described in the previous chapter. Then,
the Boolean gates are replaced with their NCL equivalents. After the dual-rail
expansion, the acknowledgement network is generated. Finally, Uncle performs some
steps that attempt to reduce the total area of the design. The final result is a netlist
of an asynchronous NCL system that can be simulated using a Verilog simulator.
There are a few restrictions on the input RTL that must be followed. Only
buffers and inverters may be placed on the asynchronous reset line. The asynchronous
reset signal cannot be used in the logic of the design. Also, if a design contains a register
then there must not be a path from input to output that does not include a register. This
ensures that the acknowledgement network is generated correctly. Furthermore, there
may only be one clock network specified in the synchronous design and there can be no
gating on the clock other than buffers and inverters.
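The register restriction above amounts to a reachability check on the netlist. The following Python sketch is only an illustration of that idea and is not how Uncle performs the check; the netlist representation (a dictionary mapping each node to the nodes it drives) and all names are assumptions made for this example. It does not test the precondition that the design contains a register.

```python
def has_combinational_feedthrough(fanout, inputs, outputs, registers):
    """Return True if some input reaches some output without crossing a register.

    fanout maps each node name to the list of node names it drives.
    """
    stack = list(inputs)
    seen = set(stack)
    while stack:
        node = stack.pop()
        if node in outputs:
            return True
        for nxt in fanout.get(node, []):
            # A register breaks the combinational path, so do not search past it
            if nxt in registers or nxt in seen:
                continue
            seen.add(nxt)
            stack.append(nxt)
    return False

# Hypothetical purely combinational block: inputs feed straight to the output
comb_only = {"a": ["xor1"], "b": ["xor1"], "xor1": ["y"]}

# The same block with a latch placed on each input
latched = {"a": ["lat_a"], "b": ["lat_b"],
           "lat_a": ["xor1"], "lat_b": ["xor1"], "xor1": ["y"]}
```

For the first netlist the function reports a feedthrough path; for the second, with `lat_a` and `lat_b` listed as registers, it does not.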

3.1.2 Asynchronous System Simulation
In addition to assisting in generating an asynchronous system from a synchronous
system, Uncle also generates a testbench file that can be used in a Verilog simulator. The
testbench takes the single-rail inputs of the testbench and expands them to dual-rail
signals to be connected to the asynchronous system that is being tested. Also, completion
detection is added to the output of the asynchronous system that is being tested. Then the
true-rails of the outputs are connected to a register that is activated when the completion
detection indicates that there is valid data on the output. This captures the appropriate
single-rail outputs at the appropriate time. The testbench also uses the ackin and ackout
signals of the system being tested to identify the appropriate time to send a DATA-wave
or a NULL-wave. The basic flow of the testbench can be seen in Figure 3.2.

Figure 3.2 Asynchronous testbench flow

After the system is reset, a DATA-wave is applied. Next, ackout transitioning low
indicates that the data has been consumed. Since the data has been consumed, a
NULL-wave can be applied. Ackout transitioning high signifies that the NULL-wave has
been consumed and that the next DATA-wave can be applied. This cycle continues for as
long as the testbench provides new data.
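The handshake described above can be sketched as a short loop. The Python model below is an assumption-laden illustration, not the testbench that Uncle generates: `ToyDut` is a hypothetical stand-in for the system under test that drives ackout low once a DATA-wave is consumed and high once a NULL-wave is consumed.

```python
class ToyDut:
    """Hypothetical stand-in for an NCL system that only models ackout."""

    def __init__(self):
        self.ackout = 1       # high after reset: a DATA-wave is requested
        self.consumed = []

    def apply(self, wave):
        if wave is None:      # None represents a NULL-wave
            self.ackout = 1   # NULL consumed, request the next DATA-wave
        else:
            self.consumed.append(wave)
            self.ackout = 0   # DATA consumed, request a NULL-wave

def run_testbench(dut, data_items):
    """Apply each item as a DATA-wave followed by a NULL-wave."""
    log = []
    for item in data_items:
        if dut.ackout != 1:
            raise RuntimeError("system is not requesting DATA")
        dut.apply(item)
        log.append(("DATA", item, dut.ackout))
        if dut.ackout != 0:
            raise RuntimeError("DATA-wave was not consumed")
        dut.apply(None)
        log.append(("NULL", None, dut.ackout))
    return log
```

In a real simulation the testbench would wait on the ackout transitions rather than check them immediately; the immediate checks here only keep the sketch short.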

3.1.3 Two-bit Counter Example
Now that the operation of Uncle has been explained, it is used to create an
asynchronous two-bit counter from its synchronous counterpart. The Verilog RTL that
implements the synchronous two-bit counter shown in Figure 2.11 is shown in the
following figure.

Figure 3.3 Verilog RTL of synchronous two-bit counter

The RTL in Figure 3.3 is converted to a gate level netlist using a commercial
synthesis tool, and this gate level netlist becomes the input netlist for Uncle. The output
netlist generated by Uncle implements an asynchronous system using NCL. The
simulation of the asynchronous two-bit counter, created from the testbench generated by
Uncle, is seen in Figure 3.4.

Figure 3.4 Simulation output of asynchronous two-bit counter

Figure 3.4 shows that after reset is released the system begins to cycle. At
time-marker 1, reset is released and a DATA-wave is applied. Then, at time-marker 2,
when the ackout signal goes low, a NULL-wave is applied. When the ackout signal goes
high, at time-marker 3, another DATA-wave is applied. Between time-markers 1 and 4,
the system counts while cnt_en is asserted and stops counting when cnt_en is low. An
asynchronous system has thus been created using only synchronous RTL that follows the
restrictions set by the Uncle toolset.

3.2 Microprocessor Architecture
As mentioned in the first chapter, the architecture of the asynchronous
microprocessor is based upon the architecture of the Texas Instruments MSP430
synchronous microprocessor. The asynchronous microprocessor is composed of a
memory backbone that sends and receives data to and from program memory, data
memory, and peripherals, a frontend used for decoding instructions, and an execution unit
used to perform instructions. Also, there are different peripherals such as a watchdog
timer, special function registers, and a 16-bit by 16-bit multiplier.

3.2.1 Memory Backbone
The memory backbone is responsible for handling data transfers between program
memory, data memory, and peripherals on one side and the frontend and execution unit
on the other. When the frontend memory bus is enabled, data is read from program
memory. When the execution memory bus is enabled, data is read from or written to data
memory, read from program memory, or read from or written to a peripheral, depending
on the needs of the current instruction.

3.2.2 Frontend
The frontend fetches data from program memory and decodes that data. Data is
fetched with the use of a program counter. When the conditions for a fetch are met the
frontend memory bus is enabled and the data from the memory location specified by the
program counter is fetched. After the instruction is fetched it is decoded to be sent to the
execution unit.
The decoding process is based upon the fact that there are three different
instruction types, each with its own specific format: single-operand instructions,
two-operand instructions, and jump instructions. The instruction formats are shown in
Table 3.1.

Table 3.1 Instruction Formats [7]
Format            Bits     Field
Single-operand    15-10    0 0 0 1 0 0
                  9-7      Opcode
                  6        B/W
                  5-4      Ad
                  3-0      Dest reg
Double-operand    15-12    Opcode
                  11-8     Source reg
                  7        Ad
                  6        B/W
                  5-4      As
                  3-0      Dest reg
Jump              15-13    0 0 1
                  12-10    Condition
                  9-0      PC Offset

In these instruction formats, the opcode represents each instruction’s unique
identifier. In single-operand instructions the opcode is three bits wide and in
double-operand instructions the opcode is four bits wide. The single-operand and
two-operand instructions are shown in Table 3.2 and Table 3.3.

Table 3.2 Single-operand Instructions [7]
Name   Byte-width   Description
CALL   No           A call is made to an address and the return address is
                    stored on the stack
PUSH   Yes          The stack pointer is decremented by two, then the operand
                    is moved to the RAM word addressed by the stack pointer
RETI   No           Return from interrupt
RRA    Yes          The operand is shifted right one position, the MSB is
                    shifted into the MSB and the MSB-1
RRC    Yes          The operand is shifted right one position and the carry
                    bit is shifted into the MSB
SWPB   No           The operand high and low bytes are swapped
SXT    No           The sign of the low byte of the operand is extended into
                    the high byte of the operand

Table 3.3 Two-operand Instructions [7]
Name   Byte-width   Description
ADD    Yes          Source operand is added to destination operand
ADDC   Yes          Source operand is added to destination operand with use
                    of a carry bit
AND    Yes          Source operand is AND’d to destination operand
BIC    Yes          The inverted source operand is AND’d to the destination
                    operand
BIS    Yes          Source operand is OR’d to destination operand
BIT    Yes          Source operand is AND’d to destination operand, only
                    status bits are affected
CMP    Yes          Source operand is compared to destination operand, only
                    status bits are affected
DADD   Yes          Source and destination operands are treated as four
                    binary coded decimals and then added with the use of a
                    carry bit
MOV    Yes          The source operand is moved to the specified destination
SUB    Yes          The source operand is subtracted from the destination
                    operand
SUBC   Yes          The source operand is subtracted from the destination
                    operand with use of a carry bit
XOR    Yes          Source operand is XOR’d to destination operand

The B/W bit in an instruction signifies whether the instruction is a byte-wide or
word-wide instruction. A value of 0 indicates that the instruction is performed on the
default 16-bit wide value and a value of 1 indicates that the instruction is performed
on an 8-bit wide value. In Table 3.2 and Table 3.3 the column labeled “Byte-width”
indicates whether the instruction can be executed in byte-mode as well as word-mode.
The Ad and As bits represent the addressing mode of the destination and source
registers. The Ad and As bits are used with the destination and source registers to
determine the appropriate operand location. The possible addressing modes are shown in
Table 3.4, where Rn is any general purpose register.

Table 3.4 Addressing Modes [7]
Addressing Mode          As   Ad   Memory Location of Operand
Register                 00   0    Rn
Indexed                  01   1    (value stored in Rn) + X, where X is stored
                                   in the following word
Symbolic                 01   1    (value of PC) + X, where X is stored in the
                                   following word
Absolute                 01   1    X, where X is stored in the following word
Indirect                 10   -    value stored in Rn
Indirect autoincrement   11   -    value stored in Rn, value of Rn is then
                                   incremented by 1 for byte-width instructions
                                   and 2 for word-width instructions
Immediate                11   -    X, where X is stored in the following word.
                                   X is then moved to the value of PC and PC is
                                   then incremented by 1 for byte-width
                                   instructions and 2 for word-width
                                   instructions

The source and destination registers are used in single and two-operand
instructions. In a single-operand instruction the instruction is performed on the desired
destination register and the result is written back to that same register. In a two-operand
instruction the instruction is performed using both the source and destination registers
and the result is written back to the destination register.
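As a worked illustration of the three instruction formats, the following Python sketch splits a 16-bit instruction word into its fields. It assumes the standard MSP430 field positions (two-operand: opcode in bits 15-12, source register in bits 11-8, Ad in bit 7, B/W in bit 6, As in bits 5-4, destination register in bits 3-0; single-operand: fixed prefix 000100 in bits 15-10, opcode in bits 9-7; jump: prefix 001 in bits 15-13, condition in bits 12-10, offset in bits 9-0) and that two-operand opcodes occupy the values 0x4 through 0xF. The function and dictionary key names are hypothetical and do not correspond to signals in the RTL.

```python
def decode_instruction(word):
    """Split a 16-bit MSP430-style instruction word into its format fields."""
    word &= 0xFFFF
    if (word >> 13) == 0b001:
        return {"type": "jump",
                "condition": (word >> 10) & 0x7,
                "pc_offset": word & 0x3FF}
    if (word >> 10) == 0b000100:
        return {"type": "single-operand",
                "opcode": (word >> 7) & 0x7,
                "bw": (word >> 6) & 0x1,
                "ad": (word >> 4) & 0x3,
                "dest_reg": word & 0xF}
    if (word >> 12) >= 0x4:
        return {"type": "two-operand",
                "opcode": word >> 12,
                "src_reg": (word >> 8) & 0xF,
                "ad": (word >> 7) & 0x1,
                "bw": (word >> 6) & 0x1,
                "as": (word >> 4) & 0x3,
                "dest_reg": word & 0xF}
    raise ValueError("word does not match any of the three formats")
```

For example, `decode_instruction(0xE405)` yields a two-operand instruction with opcode 0xE, source register 4, destination register 5, word-mode, and register addressing for both operands.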
Jump instructions utilize a condition field and a program counter offset. The
condition field is three bits wide, allowing eight different jump conditions. These
conditions are shown in Table 3.5.

Table 3.5 Jump Conditions [7]
Condition   Description
JEQ/JZ      Jump if zero bit is equal to 1
JNE/JNZ     Jump if zero bit is equal to 0
JC          Jump if carry bit is equal to 1
JNC         Jump if carry bit is equal to 0
JN          Jump if negative bit is equal to 1
JGE         Jump if negative bit XOR’d with overflow bit is equal to 0
JL          Jump if negative bit XOR’d with overflow bit is equal to 1
JMP         Unconditional jump

When a jump condition is satisfied and a jump is taken, the program counter
offset is used and the new value of the program counter is
$PC_{new} = PC_{old} + PC_{offset} \times 2$ [7]. As shown in Table 3.3 and Table 3.5,
some instructions involve certain status bits. The meaning and generation of these
status bits are discussed in the following section.
With the combination of single-operand, two-operand, and jump instructions, the
microprocessor is capable of fetching data from program memory and decoding that data
into 27 different possible instructions.
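The jump behavior can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the RTL: the conditions follow Table 3.5, the new program counter is taken to be the old value plus twice the offset, and the 10-bit offset field is assumed to be a signed two's complement value as in the MSP430 encoding. Whether the old program counter value already includes the fetch increment is not modeled. All names are hypothetical.

```python
def jump_taken(condition, c, z, n, v):
    """Evaluate a jump condition from the C, Z, N, and V status bits."""
    conditions = {
        "JEQ": z == 1,
        "JNE": z == 0,
        "JC": c == 1,
        "JNC": c == 0,
        "JN": n == 1,
        "JGE": (n ^ v) == 0,
        "JL": (n ^ v) == 1,
        "JMP": True,
    }
    return conditions[condition]

def sign_extend_10(offset_field):
    """Interpret a 10-bit field as a signed two's complement value."""
    offset_field &= 0x3FF
    return offset_field - 0x400 if offset_field & 0x200 else offset_field

def next_pc(pc_old, offset_field):
    """Return the 16-bit program counter after a taken jump."""
    return (pc_old + 2 * sign_extend_10(offset_field)) & 0xFFFF
```

Multiplying the offset by two keeps the program counter word-aligned, since every instruction word occupies two bytes.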

3.2.3 Execution Unit
The execution unit is composed of a register file and an arithmetic logic unit
(ALU). The execution unit is responsible for taking the decoded instructions from the
frontend and implementing them. The execution unit also has access to program
memory, data memory, and peripherals through the memory backbone. The execution
unit may need access to program memory, data memory, or peripherals based upon the
instruction being performed as well as the addressing mode.
The register file of the asynchronous microprocessor includes sixteen 16-bit
registers. The register R0 is reserved for the program counter, which keeps track of
the next instruction to fetch. The register R1 is used as the stack pointer, which
addresses the stack where the return addresses of subroutine calls and interrupts are
stored. The register R2 is used as the status register and contains certain information
about the current instruction that is being executed. Table 3.6 gives a brief
description of the status bits that are available in the status register. There are
also other status bits available in the status register, but they manage system clock
generation in the MSP430, which is unnecessary in an asynchronous microprocessor. The
remaining registers, R3-R15, are general purpose registers.

Table 3.6 Status Register Bits [7]
Bit   Name     Description
0     C        Carry bit
               set when result of instruction produces a carry
               reset when result of instruction does not produce a carry
1     Z        Zero bit
               set when result of instruction produces a 0
               reset when result of instruction produces a non-zero
2     N        Negative bit
               set when result of instruction produces a negative number
               reset when result of instruction produces a positive number
3     GIE      General Interrupt Enable
               allows maskable interrupts when enabled
4     CPUOFF   CPU off bit
               when set, turns off CPU
8     V        Overflow bit
               set when result of instruction overflows

Along with the register file, the execution unit also contains the ALU. The ALU
is responsible for calculating the result of single and two-operand instructions, as
well as calculating the value of the stack pointer when needed. The ALU is also
responsible for generating the C, Z, N, and V bits for the status register.
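To make the status bit generation concrete, the following Python sketch computes the C, Z, N, and V bits for an addition. It is a behavioral illustration only and is not derived from the ALU RTL. It assumes two's complement arithmetic, in which overflow occurs when both operands have the same sign and the result has the opposite sign; the `width` parameter (16 for word-mode, 8 for byte-mode) and the function name are hypothetical.

```python
def add_with_flags(src, dst, width=16):
    """Add two operands and return (result, status bits) for the given width."""
    mask = (1 << width) - 1
    sign = 1 << (width - 1)
    src &= mask
    dst &= mask
    total = src + dst
    result = total & mask
    flags = {
        "C": int(total > mask),          # carry out of the most significant bit
        "Z": int(result == 0),           # result is zero
        "N": int(bool(result & sign)),   # most significant bit of the result
        # overflow: operands share a sign that differs from the result's sign
        "V": int(bool(~(src ^ dst) & (src ^ result) & sign)),
    }
    return result, flags
```

For example, adding 0x7FFF and 0x0001 in word-mode gives 0x8000 with N and V set, while adding 0xFFFF and 0x0001 gives 0x0000 with C and Z set.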

3.2.4 Peripherals
In addition to the memory backbone, frontend, and execution unit, the
asynchronous microprocessor also includes several peripherals. These peripherals
include a watchdog timer, a special function register (SFR), and a hardware multiplier.
These peripherals are accessed through the memory backbone where each peripheral is
associated with a unique set of addresses. Then, through the memory backbone, values
can be read and written to a certain peripheral depending upon the address.
The watchdog timer is a 16-bit counter that is used to generate a reset or an
interrupt. The reset or interrupt occurs when the counter reaches a pre-determined
value. The pre-determined value is set by writing the appropriate value to the first
two bits of the WDTCTL register. This determines the timer interval that is to be used.
The SFR peripheral is used in conjunction with the watchdog timer for generating
watchdog timer resets and interrupts. The value of the SFR register determines if the
watchdog timer is enabled and if the watchdog timer should generate a reset or an
interrupt. For example, when the value of the watchdog timer has been reached then a
signal is sent from the watchdog timer to the SFR. If the watchdog timer reset is enabled
in the SFR then a reset is produced.
In addition to the watchdog timer and the SFR, another peripheral is the 16-bit by
16-bit hardware multiplier. The multiplier takes in two operands and produces the result
of their multiplication. The address to which the first operand is written signifies
what type of multiplication to perform. The choices are unsigned multiplication, signed
multiplication, unsigned multiply-and-accumulate, and signed multiply-and-accumulate.
After the second operand is written to the appropriate address, the hardware multiplier
begins to calculate the result. The hardware multiplier produces a 32-bit result that
is stored in two separate registers, RESLO and RESHI. The address of RESLO is accessed
to read the lower 16 bits of the result and the address of RESHI is accessed to read
the upper 16 bits of the result.
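The split of the 32-bit product into RESLO and RESHI can be sketched as follows. This Python model is illustrative only: it covers the plain unsigned and signed multiplications, omits the accumulate modes, and uses a hypothetical `signed` flag in place of the address-based selection of the multiplication type.

```python
def to_signed16(value):
    """Interpret a 16-bit value as a two's complement signed integer."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def multiply(op1, op2, signed=False):
    """Return (RESLO, RESHI) for a 16-bit by 16-bit multiplication."""
    if signed:
        a, b = to_signed16(op1), to_signed16(op2)
    else:
        a, b = op1 & 0xFFFF, op2 & 0xFFFF
    product = (a * b) & 0xFFFFFFFF       # keep the 32-bit result
    return product & 0xFFFF, product >> 16
```

For instance, `multiply(0xFFFF, 0xFFFF)` returns `(0x0001, 0xFFFE)` when unsigned, while the signed interpretation (-1 times -1) returns `(0x0001, 0x0000)`.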

3.3 Asynchronous Implementation
The asynchronous microprocessor was implemented by using an open-source
version of the MSP430. Using the process described in Section 3.1, the RTL of the
open-source MSP430 was used in combination with Uncle to create the asynchronous
microprocessor. It was beneficial to use an open-source version of the MSP430 because
the architecture had already been verified to be functional. Therefore, any problems
with the asynchronous implementation could be traced back to the process of making the
RTL compatible with Uncle and then mapping it to NCL. Furthermore, the open-source
version of the MSP430 contained a set of tests that were ported over to testbenches
similar to the testbench described in Section 3.1.2. This provides a method of
ensuring that the asynchronous microprocessor has the same functionality as the
synchronous MSP430.

3.3.1 Implementation
The system was implemented by first mapping all of the individual modules to
NCL using Uncle. Then, after the individual modules were verified to be functioning
properly the asynchronous microprocessor was created using a top-level netlist that tied
the individual modules together.
Some challenges were encountered when using Uncle to implement an
asynchronous microprocessor. For example, the Verilog RTL had to be modified to be
compatible with Uncle. One of these modifications was eliminating combinational
feedthrough in standalone RTL blocks that were processed separately by the toolflow.
As stated previously, when a design contains a register there must not be a path
from input to output that does not include a register. An example of where this
occurred was in the execution unit. The execution unit contained several registers,
but the ALU in the execution unit was purely combinational. This caused combinational
feedthrough, and therefore Uncle did not allow the design to be mapped to NCL. This
issue was solved by connecting each individual input of the ALU to a latch. This
eliminated the combinational feedthrough while maintaining the original functionality
of the ALU.
Another challenge was that the RTL of the open-source MSP430 used more than
one reset signal, which is not allowed in Uncle. In the original RTL the two reset
signals used were por and puc. The por signal, power-on reset, generated a reset when
the microprocessor was powered on. The puc signal, power-up clear, generated a reset
when the por signal was asserted or when there was a watchdog timer reset. Since puc
included the condition of por being asserted, it was decided to use the puc signal as
the only system reset for the asynchronous microprocessor. Because puc was asserted
either when the system was being powered on or on a watchdog timer reset, this raised
another issue of having logic on the main system reset signal. This issue is further
discussed in Section 3.3.3.
A further challenge was that Uncle is still under development, so there were
some issues associated with using the Uncle software. One issue was that some
conditions caused the synchronous netlist to be mapped incorrectly to NCL. Also, there
was an issue related to generating constant cells using a high-true reset that
prevented the design from cycling after reset. These issues have since been resolved
and are not present in the current release of Uncle.
After these different challenges were overcome, Uncle produced a functional
asynchronous microprocessor. The following figures show the process of the
microprocessor fetching and executing the two-operand XOR instruction, where the two
operands are the value of R4 and R5. Also in the following figures, the ‘Z’ values shown
in the simulations are bus wires that have either been renamed or eliminated during the
synthesis and mapping process of creating the asynchronous system.

Figure 3.5 Instruction fetch waveform

As shown in Figure 3.5 the program counter continues to increment as the system
cycles. When the next instruction is needed from program memory the fetch condition is
satisfied and the fetch signal is asserted. This then enables the frontend memory bus to
access the RAM containing the program memory. The address is the value of the
program counter incremented by two, since this is the address of the next instruction.
The RAM then outputs the value of that address, 0xE405, back to the frontend as
fe_mdb_in to be decoded.

Figure 3.6 Instruction decode waveform

In Figure 3.6, once valid data, 0xE405, is present on fe_mdb_in, the decoded
instruction is output on the following DATA-wave. The signal inst_type is one-hot
encoded, where inst_type[0] represents a single-operand instruction, inst_type[1]
represents a jump instruction, and inst_type[2] represents a two-operand instruction.
Therefore the instruction type has been successfully translated. The inst_bw signal
indicates whether the instruction is to be performed in word-mode or byte-mode. This
instruction is in word-mode, so all 16 bits of the first operand will be XOR’d with all
16 bits of the second operand. The signals inst_dest and inst_src are the instruction
destination and source registers. They are also one-hot encoded, where bits 0-15
represent registers 0-15. Therefore the destination register is R5 and the source
register is R4. The signals inst_ad and inst_as are the addressing modes of the
destination and source registers. For this instruction both registers are addressed
directly, but all addressing modes can be represented using inst_ad and inst_as.
Finally, inst_alu is a series of control signals for the ALU in the execution unit. The
asserted bits in this example signify that it is an XOR operation and that all the
status bits are to be updated at the completion of the operation. The decoded
instruction is then passed to the execution unit where the registers can be accessed
and the appropriate operations can be performed.
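The one-hot encodings mentioned above can be illustrated with a short Python sketch. It only shows the encoding convention (bit i asserted selects register i or instruction type i); the helper names are hypothetical and the values are not read from the simulation waveform.

```python
def one_hot(index, width=16):
    """Return an integer in which only bit `index` is set."""
    if not 0 <= index < width:
        raise ValueError("index out of range")
    return 1 << index

def one_hot_index(value):
    """Return the position of the single asserted bit of a one-hot value."""
    if value <= 0 or value & (value - 1):
        raise ValueError("value is not one-hot")
    return value.bit_length() - 1
```

Under this convention a destination of R5 corresponds to 0x0020, a source of R4 corresponds to 0x0010, and a two-operand instruction type corresponds to 0b100.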

Figure 3.7 Execution stage waveform

As shown in Figure 3.7, during the execution stage the contents of registers R5
and R4 are accessed, based upon the decoded instruction signals from the frontend.
These contents are then used in the ALU as the operands and the XOR operation is
performed. After a result is reached, the result is written back to the destination
register. Also, the status register is updated to reflect the status of the
instruction that was just executed. Figure 3.8 shows that the result has been written
back to the destination register, R5. Furthermore, the source register, R4, remains
unchanged and the status register, R2, has been updated. Therefore, a successful XOR
instruction has been performed.

Figure 3.8 Result of XOR waveform

3.3.2 Testing
Although the previous section shows that an XOR instruction is executed properly,
there are still 26 other instructions in the instruction set, as well as word-mode,
byte-mode, the different addressing modes, and the peripherals, to be tested. The
open-source version of the MSP430 included testbenches for verification of all of this
functionality. These testbenches included a unique program for each instruction written
in MSP430 assembly language. During testbench simulation of these programs, expected
values were compared against actual values to check correct functionality.

Figure 3.9 XOR test program

Figure 3.9 gives an example of the program used to check the XOR instruction
functionality. The program begins by clearing the status register. Next, it moves values
into the R4 and R5 registers and XORs these two registers. Then, new values are moved
into R4 and R6 and those two registers are XOR’d. Finally, the value of 0x1000 is
moved into R15 and the status register is cleared again. So, when R15 is equal to
0x1000, R5 and R6 should contain the correct values of the previous two XOR
instructions. This is tested using a Verilog simulator using the following code.

Figure 3.10 XOR Verilog test

This checks that when R15 is equal to 0x1000, R5 and R6 contain the appropriate
values. If they do not, an error is produced; if they do, the following test is
checked. The XOR program then goes on to check that byte-mode functions properly
and that the status bits for the status register are generated correctly. Programs
similar to this were included in the open-source version of the MSP430 for testing all
instructions, addressing modes, and peripherals. These testbenches were then translated
to be compatible with the asynchronous microprocessor. The simplified pass/fail results
of these tests can be seen in the following table.

Table 3.7 Asynchronous Microprocessor Test Results
Test        Description                                               Pass/Fail
ADD         Test word-mode ADD using different addressing modes,      Pass
            status bit generation
ADD(B)      Test byte-mode ADD using different addressing modes,      Pass
            status bit generation
ADD(ROM)    Test ADD when source is in data memory                    Pass
ADDC        Test ADDC in word and byte-mode, status bit generation    Pass
AND         Test AND in word and byte-mode, status bit generation     Pass
BIC         Test BIC in word and byte-mode, status bit generation     Pass
BIS         Test BIS in word and byte-mode, status bit generation     Pass
BIT         Test BIT in word and byte-mode, status bit generation     Pass
CALL        Test CALL using different addressing modes                Pass
CALL(ROM)   Test CALL using different addressing modes that access    Pass
            data memory
CMP         Test CMP in word and byte-mode, status bit generation     Pass
DADD        Test DADD in word and byte-mode, status bit generation    Pass
JC          Test JC for all possible status bit configurations        Pass
JEQ         Test JEQ for all possible status bit configurations       Pass
JGE         Test JGE for all possible status bit configurations       Pass
JL          Test JL for all possible status bit configurations        Pass
JMP         Test JMP for all possible status bit configurations       Pass
JN          Test JN for all possible status bit configurations        Pass
JNC         Test JNC for all possible status bit configurations       Pass
JNE         Test JNE for all possible status bit configurations       Pass
MOV         Test MOV in word-mode using different addressing modes    Pass
MOV(B)      Test MOV in byte-mode using different addressing modes    Pass
MULT        Test hardware multiplier peripheral                       Pass
PUSH        Test PUSH in word and byte-mode using different           Pass
            addressing modes
PUSH(ROM)   Test PUSH in word-mode using different addressing         Pass
            modes that access data memory
RETI        Test RETI for all general interrupts                      Pass
RRA         Test RRA in word and byte-mode using different            Pass
            addressing modes
RRC         Test RRC in word and byte-mode using different            Pass
            addressing modes
SUB         Test SUB in word and byte-mode, status bit generation     Pass
SUBC        Test SUBC in word and byte-mode, status bit generation    Pass
SWPB        Test SWPB using different addressing modes, status bit    Pass
            generation
SXT         Test SXT using different addressing modes                 Pass
WDT         Test watchdog timer peripheral                            Fail
XOR         Test XOR in word and byte-mode, status bits               Pass

3.3.3 Watchdog Timer Issue
As shown in Table 3.7, the implementation of the watchdog timer in the
asynchronous microprocessor failed. This is because Uncle does not allow any logic on
the system reset signal. Therefore, there is no way to generate a watchdog timer
reset. Furthermore, a watchdog timer is beneficial because it allows a process to be
monitored for a set amount of time based upon the system clock. Since an asynchronous
system does not include a clock signal, no precise time can be determined. Even though
the counter in the watchdog timer functioned properly, it does not increment based upon
a clock signal but based upon the cycle time of the processor. Therefore, the timer is
dependent upon the longest path from input to output which, in an asynchronous design,
may vary. As a result, no exact time interval can be specified. It was decided in
consultation with the thesis director that the watchdog timer functionality would be
omitted from this implementation and would be a subject of future development of this
project.
With the exception of the watchdog timer, a fully functional asynchronous
microprocessor comparable to the MSP430 has been implemented with the use of Uncle.
The next chapter discusses some of the advantages of asynchronous design and how these
advantages are put to use in an asynchronous microprocessor.

CHAPTER IV
ASYNCHRONOUS DESIGN ADVANTAGES

Asynchronous design offers unique capabilities that can be used to improve upon
designs. One of these advantages is data-driven design. Data-driven design allows
specific blocks of logic to be active only when they are needed. This is advantageous
because if a block of logic is only cycling when it is needed then the system will have
fewer transitions and consume less power. One method of accomplishing data-driven
design is wavefront steering. This chapter explains what wavefront steering is and how
it can be implemented in the asynchronous microprocessor.

4.1 Wavefront Steering
As stated before, wavefront steering is a method that allows certain blocks of
logic to be active only when needed. This is accomplished through the use of demux and
merge gates. A demux is a special module that takes in an input and passes it to a
certain output depending on a select line. Figure 4.1 shows a 1-2 demux, meaning the
demux takes in one input and can pass it to one of two outputs.

Figure 4.1 Composition of 1-2 demux [5]

From Figure 4.1, it can be seen that the input, a, is passed to the output y0 or y1
depending on the value of the select signal. If the select signal is low the value of a
is passed to y0 and if the select signal is high the value of a is passed to y1.
Furthermore, the demux must be constructed out of C-gates to prevent a value of DATA-0
from being passed to the output that is not selected. If a value of DATA-0 were passed
to the output that is not selected, this would still cause the following logic block to
cycle. Since the demux is composed of C-gates, the logic block connected to the unused
output will remain NULL until it is used. Figure 4.2 shows the dual-rail symbol for a
generic 1-N demux.

Figure 4.2 Generic 1-N demux
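The role of the C-gates in the demux can be illustrated with a behavioral model. The Python sketch below is an assumption about one plausible gate arrangement and is not taken from Figure 4.1: each rail of each output is driven by a two-input C-gate that combines one rail of the input with one rail of the select signal. Dual-rail values are written as (true-rail, false-rail) pairs, and the class names are hypothetical.

```python
class CGate:
    """Two-input C-gate: set when both inputs are 1, cleared when both are 0."""

    def __init__(self):
        self.out = 0

    def update(self, a, b):
        if a and b:
            self.out = 1
        elif not a and not b:
            self.out = 0
        # otherwise the previous output is held
        return self.out

class Demux1to2:
    """Dual-rail 1-2 demux built from four C-gates (assumed structure)."""

    def __init__(self):
        self.gates = [CGate() for _ in range(4)]

    def update(self, a, select):
        a_t, a_f = a
        s_t, s_f = select
        g = self.gates
        # y0 follows the input only when select is DATA-0 (false rail asserted)
        y0 = (g[0].update(a_t, s_f), g[1].update(a_f, s_f))
        # y1 follows the input only when select is DATA-1 (true rail asserted)
        y1 = (g[2].update(a_t, s_t), g[3].update(a_f, s_t))
        return y0, y1

NULL, DATA0, DATA1 = (0, 0), (0, 1), (1, 0)
```

With this arrangement the unselected output stays at NULL for the whole DATA-wave, so the logic block attached to it does not cycle.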

The other component needed to implement wavefront steering is a merge gate.
The merge gate is used to combine the different outputs of the logic blocks that are being
selected. A merge gate ORs all the true-rail outputs of the different logic blocks together
to produce a single true-rail output and all the false-rail outputs of the different logic
blocks together to produce a single false-rail output. The previous chapter mentioned
special components that can be used in Uncle besides the AND2, XOR2, OR2,
INVERTER, DFF, and D-LATCH. The demux and merge gate are two of these special
components. They are utilized in Uncle by using parameterized modules in the Verilog
RTL that have been previously defined by Uncle. These modules include a 1-2, 1-3, 1-4,
1-8, and 1-16 demux, as well as a 2-1, 3-1, 4-1, 8-1, and 16-1 merge gate.

Figure 4.3 Wavefront steering example

Shown in Figure 4.3 is a system that utilizes wavefront steering. The system
takes in an input and then chooses between two separate logic blocks. Then the outputs
of those logic blocks are merged together and produce the final output. The use of
wavefront steering means that only Logic Block 1 or Logic Block 2 is active at any given
time, depending upon the select signal. Therefore, only one logic block is cycling and the
other logic block does not waste power when it is not needed. Depending on the logic
block sizes, a significant number of transitions may be prevented.
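The effect of wavefront steering on block activity can be sketched at the wavefront level. The following Python example is a simplified illustration, not the circuit of Figure 4.3: it assumes that NULL can be represented by None, that each logic block is an ordinary function, and that activity can be approximated by counting how often each block receives DATA. The function and variable names are hypothetical.

```python
NULL = None  # stand-in for a NULL wavefront in this sketch


def steer(value, select, blocks, activity):
    """Route value to blocks[select] only, then merge the block outputs."""
    outputs = []
    for index, block in enumerate(blocks):
        if index == select:
            activity[index] += 1        # this block cycles
            outputs.append(block(value))
        else:
            outputs.append(NULL)        # this block stays at NULL
    # Merge: exactly one output carries DATA.
    return next(out for out in outputs if out is not NULL)


logic_blocks = [lambda x: x + 1, lambda x: x * 2]
activity = [0, 0]
print(steer(5, 0, logic_blocks, activity))  # Logic Block 1 is active
print(steer(5, 1, logic_blocks, activity))  # Logic Block 2 is active
print(activity)                             # each block cycled once
```

Without steering, both counters would advance on every wavefront; with steering, only the selected block's counter advances.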

4.2 Wavefront Steering in the ALU
Wavefront steering is beneficial in modules that use the same inputs to produce
several different outputs but only actually select one output. This is exactly what is
taking place in the ALU. A simplified portion of the ALU datapath is shown in Figure
4.4.

Figure 4.4 ALU datapath

As shown in Figure 4.4, the ALU takes in the destination operand, op_dest, and
the source operand, op_src, from the appropriate register. Also, it takes in the single
operand instruction, inst_so, and the two-operand instruction, inst_alu. Using op_src and
op_dest, the outputs of the ADD, AND, OR, XOR, DADD, RRC, RRA, SWPB, and SXT
operations are computed. The output of ADD is always passed on and, depending on the
state of inst_so and inst_alu, the output of only one of the other operations is passed on.
The simulation of the ALU performing an XOR operation is shown in Figure 4.5.

Figure 4.5 Simulation of ALU

Figure 4.5 confirms this behavior. The ALU receives the operands
and the decoded instruction. The ALU then calculates all the outputs regardless of the
instruction, but only the appropriate calculation is passed to the output. It can be seen that
while an XOR operation is being performed, the output for a DADD operation is also calculated. Although
Figure 4.5 only shows the output of the DADD operation, the same behavior occurs in
the AND, OR, RRC, RRA, SWPB, and SXT blocks. This causes many unnecessary
transitions, which wastes power.
Out of the nine different possible outputs only two outputs are actually used. It
would be beneficial for only the ADD function and the one other function being used to
be active. This is where wavefront steering is used. The ALU datapath with wavefront
steering is shown in Figure 4.6.

Figure 4.6 ALU datapath with wavefront steering

As shown in Figure 4.6, during every DATA-wave the only active blocks in the
ALU will be the ADD block and one other block depending on the state of inst_so and
inst_alu. The rest of the blocks will remain at NULL until they are needed. This should
reduce the number of transitions in the ALU per cycle and conserve power.
This is shown in Figure 4.7.

Figure 4.7 Simulation of ALU with wavefront steering

The simulation of the ALU during an XOR operation in Figure 4.7 shows exactly
what was expected. The ALU receives the operands and the decoded instruction and
only calculates the appropriate output. Valid DATA appears only on the alu_xor signal,
and the remaining ALU blocks remain NULL. The value of alu_xor is then passed to
alu_out through the use of the merge gate.
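The difference between the two ALU datapaths can be sketched as a behavioral comparison. The Python model below is a simplified illustration and not the thesis RTL: it collapses the inst_so and inst_alu decode into a single operation name, assumes the single-operand operations act on op_src, and omits DADD and RRC because they depend on carry state. The dictionary BLOCKS and the function alu are hypothetical names.

```python
MASK = 0xFFFF  # 16-bit datapath

# Operation blocks other than ADD (DADD and RRC omitted in this sketch).
BLOCKS = {
    "AND": lambda src, dest: src & dest,
    "OR": lambda src, dest: src | dest,
    "XOR": lambda src, dest: src ^ dest,
    "RRA": lambda src, dest: (src >> 1) | (src & 0x8000),
    "SWPB": lambda src, dest: ((src << 8) | (src >> 8)) & MASK,
    "SXT": lambda src, dest: (src & 0xFF) | (0xFF00 if src & 0x80 else 0),
}


def alu(op, op_src, op_dest, steered):
    """Return (add_out, selected_out, active_blocks)."""
    add_out = (op_src + op_dest) & MASK   # the ADD block is always active
    if steered:
        # Wavefront steering: only the selected block receives DATA.
        results = {op: BLOCKS[op](op_src, op_dest)}
    else:
        # Original datapath: every block computes, one result is chosen.
        results = {name: fn(op_src, op_dest) for name, fn in BLOCKS.items()}
    return add_out, results[op], ["ADD"] + list(results)


print(alu("XOR", 0x00FF, 0x0F0F, steered=False)[2])  # all blocks active
print(alu("XOR", 0x00FF, 0x0F0F, steered=True)[2])   # only ADD and XOR
```

Both calls return the same XOR result; only the list of active blocks differs, which is the behavior contrasted in Figures 4.5 and 4.7.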

4.3 Wavefront Steering in the Hardware Multiplier
Wavefront steering is also beneficial in large logic blocks that are not used often,
such as the hardware multiplier peripheral. The peripherals in the asynchronous
microprocessor all share a memory bus from the memory backbone. Depending on
the address produced on that bus by the memory backbone, the appropriate peripheral is
either read from or written to. The datapath of the peripherals without wavefront steering
is shown in Figure 4.8.

Figure 4.8 Peripherals datapath

As shown in Figure 4.8, the peripherals all share per_en, per_wen, per_addr, and
per_din. These signals are generated by the memory backbone and are dependent upon
the instruction being executed. As stated in the previous chapter the peripherals do not
have any common addresses; therefore, only one peripheral will be used at a time. However,
all the peripherals are active during this time. Since only one peripheral is used
at a time, their outputs are OR’d together to form the per_dout signal. The simulation of the
peripherals during an instruction that invokes the hardware multiplier is shown in Figure
4.9.
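The reason the peripheral outputs can simply be OR'd together can be sketched as follows. This Python example is an illustration under stated assumptions: each peripheral drives zero when it is not addressed, the register addresses and contents are placeholder values chosen for the example, and the helper names make_peripheral and per_dout_or are hypothetical.

```python
def make_peripheral(registers):
    """Return a read function that drives 0 unless one of its own
    addresses is presented while per_en is asserted."""
    def read(per_en, per_addr):
        if per_en and per_addr in registers:
            return registers[per_addr]
        return 0
    return read


def per_dout_or(peripherals, per_en, per_addr):
    """OR the outputs of all peripherals to form per_dout."""
    result = 0
    for read in peripherals:
        result |= read(per_en, per_addr)
    return result


# Placeholder register maps with no addresses in common.
sfr = make_peripheral({0x0000: 0x0001})
watchdog = make_peripheral({0x0120: 0x6900})
multiplier = make_peripheral({0x013A: 0x1234})
peripherals = [sfr, watchdog, multiplier]

print(hex(per_dout_or(peripherals, 1, 0x013A)))  # multiplier register value
print(hex(per_dout_or(peripherals, 0, 0x013A)))  # per_en low: zero
```

Every read function is evaluated on every access, which mirrors the observation that all peripherals cycle even though only one contributes a nonzero value.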

Figure 4.9 Simulation of peripheral outputs

The simulation in Figure 4.9 shows the necessary conditions to read from a
register in the hardware multiplier. That register is then read and the contents of that
register are seen on per_dout_mpy. Then that value is OR’d with the outputs of the other
peripherals. As stated previously, the other peripherals, as well as the hardware
multiplier, still cycle when they are not being used.
The amount of calculation that takes place in the hardware multiplier is rather
large compared to that of the watchdog timer and SFR. So, it is advantageous to prevent
the hardware multiplier from being active when it is not in use. This is accomplished
through implementing wavefront steering in the peripheral datapath. The peripheral
datapath with wavefront steering is shown in Figure 4.10.

Figure 4.10 Peripheral datapath with wavefront steering

The implementation shown in Figure 4.10 still allows the watchdog timer and the
SFR to be active every cycle, but the hardware multiplier is inactive when not in use.
The same inputs that were used in the original implementation are still used, in addition
to the per_sel signal. The per_sel signal determines when the multiplier should be active
depending on the per_en signal and the per_addr signal. Also, the demux on the input of
the multiplier is shown as accepting a bus of four different signals, but it is actually
implemented as four different 1-2 demuxes. Furthermore, the demux on the input of the
multiplier has an unconnected output because it is only necessary for the multiplier to be
active when per_sel is asserted. Because no logic block is connected to that output to
acknowledge it, a self-ack is generated in order to
allow the system to cycle even when the multiplier is not selected. This is shown in
Figure 4.11.

Figure 4.11 Multiplier demux self-ack

As shown in Figure 4.11, ackout will continue to toggle, which allows the system
to cycle, regardless of whether the multiplier is selected. When per_sel is low and the
multiplier is not selected, ackout changes values based on the values of t_y0 and f_y0.
During DATA-waves, t_y0 and f_y0 NOR’d together produce a low output because
they are complements of one another. The output of the two signals NOR’d during a
NULL-wave is high. When per_sel is high and the multiplier is selected, ackout changes
values based on the value of the ackout from the multiplier.
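The self-ack behavior can be expressed as a short behavioral sketch. The Python below is a simplification, not the gate-level circuit of Figure 4.11: it treats per_sel as a single Boolean rather than a dual-rail signal and models the choice between the two acknowledge sources as a conditional. The function names and the mult_ackout parameter are hypothetical.

```python
def self_ack(t_y0, f_y0):
    """NOR of the two rails of the unconnected output y0:
    low while y0 holds DATA, high while y0 is NULL."""
    return int(not (t_y0 or f_y0))


def demux_ackout(per_sel, t_y0, f_y0, mult_ackout):
    """Acknowledge returned by the multiplier input demux."""
    if per_sel:
        return mult_ackout          # multiplier selected: use its ackout
    return self_ack(t_y0, f_y0)     # not selected: use the self-ack


# Multiplier not selected: ackout follows the DATA and NULL waves on y0.
print(demux_ackout(0, 0, 1, mult_ackout=1))  # DATA on y0 gives a low ackout
print(demux_ackout(0, 0, 0, mult_ackout=1))  # NULL on y0 gives a high ackout
# Multiplier selected: ackout follows the multiplier.
print(demux_ackout(1, 0, 0, mult_ackout=0))
```

In this model ackout toggles on every wave whether or not the multiplier is selected, which is the property that lets the rest of the system keep cycling.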
The output of the multiplier peripheral is connected to a demux2_half1_noack
gate. The demux2_half1_noack gate is another special gate recognized by Uncle and is a
modification of a standard demux gate. The difference is that there is only one input
and the gate is only active when the select line is asserted. Furthermore, the “noack”
version of the demux2_half1 gate prevents the ack network of the registers in the
multiplier peripheral from requesting new data when the multiplier is not in use. This is
accomplished by Uncle generating a gated ack network. The gated ack network makes
the ack network to the registers in the multiplier dependent on the value of the per_sel signal
rather than only on the values of the register outputs. This is shown in Figure
4.12.

Figure 4.12 Multiplier gated ack network

As shown in Figure 4.12, the true and false per_sel signals as well as the ackout of
the MSP430 datapath are connected to the inputs of two separate C-gates. These C-gates
are labeled ‘Cr’ to signify that they are initially given a value of zero on reset. When
t_per_sel is asserted, the ackin of the multiplier is asserted and the input to the merge
gate is the output of the hardware multiplier. Also connected to the merge gate is the
output of a demux2_half0_noack gate. The demux2_half0_noack gate is similar to the
demux2_half1_noack gate except that it is only active when the select line is not asserted.
The input of the demux2_half0_noack gate is tied to zero to generate data when the multiplier
is not in use. When f_per_sel is asserted, the ackin of Logic0 is asserted and
the input to the merge gate is zero. Therefore, the merge gate produces data every cycle.
That data is then OR’d with the outputs of the other peripherals to produce per_dout.
The simulation of the peripheral datapath with wavefront steering is shown in Figure
4.13.
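The gated ack network can be sketched as two resettable C-gates, each combining one rail of per_sel with the datapath ackout. This Python model is a behavioral illustration based only on the connections described above; the class names ResettableCGate and GatedAck are hypothetical, and the model does not reproduce the ack network that Uncle actually generates.

```python
class ResettableCGate:
    """C-gate whose output is 0 after reset ('Cr' in Figure 4.12)."""

    def __init__(self):
        self.out = 0

    def update(self, a, b):
        if a and b:
            self.out = 1            # both inputs high: set
        elif not a and not b:
            self.out = 0            # both inputs low: reset
        return self.out             # otherwise hold


class GatedAck:
    """Steer the datapath ackout to the multiplier or to Logic0."""

    def __init__(self):
        self.to_multiplier = ResettableCGate()
        self.to_logic0 = ResettableCGate()

    def update(self, t_per_sel, f_per_sel, ackout):
        ackin_mult = self.to_multiplier.update(t_per_sel, ackout)
        ackin_logic0 = self.to_logic0.update(f_per_sel, ackout)
        return ackin_mult, ackin_logic0


gated = GatedAck()
print(gated.update(1, 0, 1))  # multiplier selected: only its ackin is asserted
print(gated.update(0, 0, 0))  # both ackin signals return to zero
print(gated.update(0, 1, 1))  # multiplier not selected: Logic0 ackin asserted
```

Because the multiplier's ackin is asserted only when t_per_sel is asserted, its registers are never asked for new data while the multiplier is not in use.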

Figure 4.13 Simulation of peripheral datapath with wavefront steering

The simulation in Figure 4.13 functions the same as the peripheral datapath
without wavefront steering, but now the multiplier is inactive when it is not in use. Based
on the inputs supplied by the memory backbone, the per_sel signal is driven high when
the multiplier is needed, and while per_sel is low the multiplier remains inactive. This
implementation allows data to be supplied to the input of the OR gate every cycle, as
seen on the per_dout_mult signal. The multiplier_dout signal, which is the input to the
demux2_half1_noack gate, shows that the multiplier is inactive while it is not selected.

4.4 Results
The internal Uncle simulator was used to determine if incorporating the data-driven
design paradigm into the asynchronous microprocessor was beneficial. This
internal simulator provides information on the number of output cycles in a simulation,
the average output cycle time, the number of transitions per cycle, and the capacitance
per cycle. In terms of power consumption, lower numbers for transitions per cycle and
capacitance per cycle indicate that less power is consumed.
First, wavefront steering in the ALU was tested. To observe the benefits of data-driven
design, the same program was run in an implementation of the microprocessor that
used wavefront steering in the ALU and an implementation of the microprocessor that
did not use data-driven design. The program used to test the ALU runs 100 iterations of
the operation being tested. These operations included XOR, SXT, SWPB, RRC, RRA,
DADD, BIS, and AND. The results of these tests are shown in Table 4.1.

Table 4.1 Data-driven ALU results

In Table 4.1, the microprocessor implementation without wavefront steering is
labeled A and the implementation with wavefront steering in the ALU is labeled B. From
the results, it can be seen that implementing wavefront steering did not actually reduce
the number of transitions per cycle or the capacitance per cycle. This is because
the extra logic needed to implement the wavefront steering had more of a cost than a
benefit. This extra logic included the demux, merge gate, and the logic needed to select
the appropriate computation block. So even though this implementation prevents blocks
from being active when they are not needed, the transitions that are prevented do not
outnumber the transitions added by this extra logic. Also, it is interesting to note
that even though the data-driven design in the ALU does not reduce power consumption,
the average cycle time is shorter than in the implementation without wavefront steering. As
mentioned previously, the cycle time is dependent upon the longest path from input to
output. In the implementation with wavefront steering, the path through the ALU is
shortened because only one operation must complete instead of having to
wait for all eight operations to complete. It was decided that the data-driven
implementation of the ALU would be kept because the increase in speed was more
substantial than the minimal increase in power consumption.
Next, wavefront steering through the hardware multiplier was tested. To test
wavefront steering through the multiplier, the same tests that were described in Table 3.7
were run. These tests were run on an implementation of the microprocessor without any
data-driven design and an implementation with wavefront steering through the multiplier
and ALU. The summarized results of these tests can be seen in Table 4.2.

Table 4.2 Data-driven multiplier and ALU results

In Table 4.2, the microprocessor implementation without any wavefront steering
is labeled A and the implementation with wavefront steering through the hardware
multiplier and the ALU is labeled C. As seen previously, the average time and output
cycle average time are both decreased due to the implementation of data-driven design.
In addition, the inclusion of wavefront steering through the hardware multiplier reduced
the average number of transitions per cycle by 14.0% and the average capacitance per
cycle by 16.4%. Therefore, implementing data-driven design to bypass the
hardware multiplier when it is not needed is beneficial in terms of reducing the total
power consumption of the asynchronous microprocessor.
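For reference, a percentage reduction of this kind is computed relative to the baseline implementation. The following Python sketch shows the calculation; the function name percent_reduction is hypothetical, and the input values are placeholders chosen only to illustrate the arithmetic. They are not the measurements from Table 4.2.

```python
def percent_reduction(baseline, improved):
    """Reduction of improved relative to baseline, in percent."""
    return 100.0 * (baseline - improved) / baseline


# Placeholder values (NOT from Table 4.2): a baseline of 200 transitions
# per cycle and a steered implementation with 172 transitions per cycle.
print(percent_reduction(200.0, 172.0))
```

A negative result from this function would indicate an increase rather than a reduction, as was the case for the ALU-only implementation.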

CHAPTER V
CONCLUSION

This work has investigated the concepts of delay-insensitive systems and how
these concepts can be used to create an asynchronous system. Specifically, this work
demonstrated that asynchronous systems can be implemented in place of synchronous
systems and eliminate the global clock network. The asynchronous system was
implemented using NCL with the use of an RTL-based CAD flow (Uncle) that provided
significant productivity advantages over manual netlisting. It was demonstrated that
Uncle was capable of not only creating simple asynchronous systems, such as a counter,
but also more complex asynchronous systems, such as a microprocessor. The
microprocessor created using Uncle was based on publicly-available Verilog RTL that
implemented a version of the Texas Instruments MSP430 microprocessor. The
functionality of the asynchronous microprocessor was verified to match that of the
MSP430, aside from being capable of generating a watchdog timer
reset. This implementation of the asynchronous microprocessor was further improved
upon by taking advantage of a data-driven design technique that is available when
designing asynchronous systems. Wavefront steering was used in the ALU and in the
hardware multiplier peripheral to attempt to have only certain logic blocks active when
they are in use and inactive when they are not in use. This resulted in a speed
improvement in the ALU and a reduction in power consumption by bypassing the
hardware multiplier when it was not in use.

5.1 Conclusion
Asynchronous design is a viable alternative to the more widely used synchronous
design style. It allows a designer to reach the same functionality while eliminating the
issues that are associated with a global clock network that were described in Chapter 1.
The use of automated software such as Uncle is a practical option for creating simple or
complex asynchronous systems, even though some problems arose as a direct result of
using Uncle. Some of these problems included synchronous netlists not mapping
correctly to NCL and Uncle’s incompatibility with systems using a high true reset. Despite
these problems, Uncle and similar tools will continue to improve as their
development continues. As mentioned earlier, the issues caused by Uncle
have already been resolved in the most current release. There is a minimal learning curve
with using Uncle assuming the user already knows how to implement synchronous
systems using Verilog RTL. However, further knowledge of how asynchronous systems
operate is necessary when debugging designs and when taking advantage of data-driven
design.
Furthermore, asynchronous design allows the optimization of power consumption
through data-driven design. That is not to say, however, that data-driven design will
always result in less power consumption. Data-driven logic blocks must be analyzed to
determine whether less power is actually consumed when the overhead of data-driven design
is included. In the asynchronous microprocessor, the power consumption was not reduced
in the case of the data-driven ALU but power consumption was reduced in the case of the
hardware multiplier.

5.2 Future Work
There is some future work that could improve on the current design of the
asynchronous microprocessor. As mentioned earlier, implementation of a watchdog
timer would be included in the future work of this project. The problem of the watchdog
timer has two parts: being able to generate an accurate timer and being able to generate a
watchdog timer reset. One method that will be investigated is having an external
synchronous watchdog timer module controlled by the asynchronous microprocessor.
This may be a viable solution to generating an accurate timer as well as generating an
external signal that could be tied directly to the reset of the microprocessor.
Also, investigating methods of reducing the total area of the asynchronous
microprocessor is to be included in the future work because, as stated in Chapter 1, a
drawback of asynchronous design is that asynchronous systems are generally larger than
their synchronous counterparts. Some possible methods that will be investigated are gate
relaxation and Boolean optimization prior to mapping the synchronous microprocessor to
NCL.

REFERENCES

[1] K. Emerson, "Asynchronous design-an interesting alternative," Proceedings of the
Tenth International Conference on VLSI Design, pp. 318-320, 4-7 Jan. 1997.
[2] S. Smith and J. Di, Designing Asynchronous Circuits using NULL Convention Logic
(NCL), Morgan & Claypool, 96 pp., 2009.
[3] S. Hauck, "Asynchronous design methodologies: an overview," Proceedings of the
IEEE, vol. 83, no. 1, pp. 69-93, Jan. 1995.
[4] G. E. Sobelman and K. Fant, "CMOS circuit design of threshold gates with
hysteresis," Proceedings of the 1998 IEEE International Symposium on Circuits and
Systems (ISCAS '98), vol. 2, pp. 61-64, 31 May-3 Jun. 1998.
[5] Uncle user manual. Available: www.ece.msstate.edu/~reese/uncle/UNCLE.pdf
[6] C. Jeong and S. M. Nowick, "Optimization of Robust Asynchronous Circuits by
Local Input Completeness Relaxation," Asia and South Pacific Design Automation
Conference (ASP-DAC '07), pp. 622-627, 23-26 Jan. 2007.
[7] MSP430x1xx Family User’s Guide. Available: ti.com/lit/ug/slau049f/slau049f.pdf

