







Department of Computer Science
Columbia University




We introduce new architectural optimizations for asynchronous systems. These
optimizations allow application of voltage scaling to reduce power consumption while
maintaining system throughput. In particular, three new asynchronous sequencer
designs are introduced, which increase the concurrent activity of the system. We show
that existing datapaths will not work correctly at the increased level of concurrency.
To ensure correct operation, modified latch and multiplexer designs are presented
for both dual-rail and single-rail implementations. The increased concurrency allows
the opportunity for substantial system-wide power savings through application of
voltage scaling.
Keywords
Asynchronous Design Data Hazards Handshaking Hazards Latches Low
Power Sequencers Voltage Scaling
 
This is an expanded version of a paper by the same name appearing in the International
Symposium on Low Power Electronics and Design.
† This author was supported by grants from CONICIT and UNEXPO, Venezuela.
‡ This author is supported by an NSF CAREER Award MIP




Interest in lowpower and asynchronous systems has grown considerably in recent years
The constant increase in the use of batteryoperated portable devices like cellular phones
notebook computers and even implantable pacemakers has made low power consumption
a high priority Power issues are also becoming critical for nonportable systems Low
power operation can reduce the need for heat dissipation and cooling devices it can also
extend the life of system components by providing a less stressful operating environment
A wide range of techniques is used to reduce circuit power consumption. These
techniques approach low-power operation at different levels of synthesis, including IC
technology optimization, low-power circuit design, architecture or structural optimizations,
algorithm-level optimization and system-wide low-power techniques. Chandrakasan
et al. show that concurrency is a key to architecture-driven optimizations for low-power
operation. The increased throughput obtained through concurrent operation allows the
reduction of the power supply voltage, i.e., voltage scaling.
The focus of this paper is on asynchronous designs for low power. In principle,
asynchronous systems have the potential for low-power operation for two reasons. First, these
systems have no global clock; in contrast, clock distribution is a major source of power
consumption in synchronous design. Second, asynchronous circuits have an inherent
automatic power-down operation: modules are activated only when their operations are needed.
Low-power operation is a major focus of recent asynchronous design, including large-scale
fabricated examples like a low-power infrared communications chip, an asynchronous
implementation of the ARM microprocessor, and an asynchronous error corrector for a
DCC player.





Several methods build asynchronous circuits as networks of communicating
modules. Every module is mapped to a circuit element in a library of self-timed modules.
Such systems are macromodular, since they are constructed by combining modules into a
working system. Macromodular circuits are robust and usually have few timing
assumptions. Macromodules are particularly well-suited for methods that approach circuit design
as a programming activity. For example, van Berkel et al. have developed a
method to automatically design low-power asynchronous circuits from high-level Tangram
programs. The programs are compiled, using syntax-directed translation, into handshake
circuits: an intermediate-level representation of a circuit as a network of macromodules.
The goal of this paper is to reduce power in asynchronous macromodular systems. Our
focus is on non-pipelined (i.e., sequential) systems, which are commonly used in low-power
DSP applications with modest performance requirements. The operation of
these systems is controlled by basic elements called sequencers.
Our strategy is to use architectural optimization to increase the throughput of a
sequential system. This increased throughput must be achieved without increasing the switching
activity required for a computation; otherwise, energy consumption could increase.
Voltage scaling is then applied to reduce both power consumption and throughput.
The resulting system has no net loss of performance, but a significant reduction in power.
In particular we present the following new contributions First we introduce three
new designs for asynchronous sequencers Each design increases the concurrency of the
datapath operations in the entire system Second we show that existing asynchronous
datapaths will not operate correctly at this increased level of concurrency We therefore
modify the datapath to insure correct operation Specically we introduce new designs
for asynchronous latches and multiplexers that handle concurrent operation safely in a
dualrail datapaths and b singlerail datapaths described below
For dualrail datapaths our new components allow roughly twice the throughput of
existing sequential designs In this case after voltage scaling the energy dissipation of the
entire system is reduced by a factor of 
For singlerail datapaths two existing schemes are discussed Our new components
result in twice the throughput of the rst scheme and roughly the same performance as
the second one However our simpler approach has advantages over the latter in i ease
of design and ii glitch avoidance in the datapath
Organization of the paper. The paper is organized as follows. We first review
background on power consumption and asynchronous circuits, and then present a brief
overview of our approach. Next, existing sequencers are examined, their limitations
are pointed out, and three new low-latency sequencer designs are introduced. We then
discuss the operation of dual-rail datapaths, describe a basic problem related to concurrent
operation, and introduce new latch designs that allow correct operation.
Similar modifications for single-rail datapaths follow, and then modified multiplexer
designs are presented. Finally, we present results of analysis
and SPICE simulations, followed by conclusions.
 Background
In this section we review some fundamental aspects of power consumption in CMOS circuits,
our target technology, and basic concepts related to the operation of asynchronous
circuits.
  Power Consumption in CMOS Circuits
There are three major sources of power consumption in CMOS circuits. Switching energy
is associated with transitions on gate outputs. Short-circuit energy consumption is caused
by simultaneous conduction, during a transition, of pull-up and pull-down stacks, allowing
current to flow directly from the power supply to ground. Finally, leakage energy occurs in
standby mode and is caused by substrate currents and by sub-threshold conduction in off
transistors. We only consider transition energy because, in most CMOS circuits, this energy
dominates the other two and contributes up to  of the total energy 
In asynchronous systems there is no global clock, so the metric of interest is the total
energy of a computation. A well-known expression for this energy is:

$$\text{Energy of a computation} = \frac{1}{2} \, n \, C_L \, V_{dd}^2$$

where n is the total number of transitions in the computation, C_L is the load
capacitance being charged/discharged and V_dd is the power supply voltage. Energy dissipation
can be reduced by reducing the capacitance, the number of transitions or the supply
voltage. Since energy depends quadratically on the supply voltage, supply voltage scaling is
an especially attractive scheme for power reduction. Unfortunately, voltage scaling has
the undesirable effect of reducing the speed of the circuit.
Several techniques are used to compensate for this performance penalty. In
particular, architecture-driven voltage scaling combines architectural optimizations, to
increase the concurrent activity and the throughput of the system, with voltage scaling,
to reduce the power consumption without affecting the throughput. Our goal is therefore
to increase the concurrency, and hence the throughput, of a system. If the increase in
performance is achieved without increasing the switching activity required for the
computation, a substantial reduction in power is possible after voltage scaling, with no net loss
in performance.
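The quadratic dependence on supply voltage can be made concrete with a minimal sketch. It assumes the common switching-energy form E = (1/2) n C_L V_dd^2; the function names and all numeric values (transition count, capacitance, voltages) are illustrative assumptions and are not taken from this paper.

```python
def computation_energy(n_transitions, c_load, v_dd):
    """Switching energy of a computation, assuming E = 0.5 * n * C_L * V_dd**2."""
    return 0.5 * n_transitions * c_load * v_dd ** 2


def energy_ratio_after_scaling(v_old, v_new):
    """Ratio E_new / E_old when only the supply voltage changes."""
    return (v_new / v_old) ** 2


# Illustrative (assumed) values: 10,000 transitions, 50 fF switched per transition.
e_full = computation_energy(10_000, 50e-15, 5.0)
e_scaled = computation_energy(10_000, 50e-15, 2.5)
```

In this sketch, halving the supply voltage with the same number of transitions cuts the energy to one quarter, which is why an architectural speedup that does not add switching activity can be traded for a large energy saving.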
   Asynchronous Circuit Operation
In this paper we focus our attention to asynchronous macromodular systems This type of
asynchronous circuit is designed as a network of predened data and control modules 
Instead of a global clock signal communication channels between modules use handshaking













               phase
Figure 
 phase handshaking for module communication
 Control Signaling
Control signaling usually follows a 4-phase handshake protocol, which is implemented
using request and acknowledge signals. The figure above shows two macromodules that
communicate with each other. Initially, processes are idle and both signals are deasserted. Module
P asserts r to request module Q to start the processing phase. When Q finishes processing,
it asserts a to start the acknowledge phase. The wires must return to their idle state, so P
deasserts r, starting the return-to-zero phase. Finally, Q deasserts a to return to the idle
phase. Usually, computation takes place only during the processing phase. The rest of the
phases, especially the return-to-zero phase, represent dead time from the point of view of
computation.
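The handshake described above can be written out as an ordered list of wire events. The following is a minimal behavioral sketch, not a circuit model; the function name and the tuple layout are assumptions made for illustration.

```python
def four_phase_handshake():
    """Return the ordered events (wire, new_value, phase_started) of one handshake."""
    r, a = 0, 0  # idle: request and acknowledge both deasserted
    trace = []

    r = 1  # P asserts r: Q starts processing
    trace.append(("r", r, "processing"))
    a = 1  # Q asserts a: processing is done
    trace.append(("a", a, "acknowledge"))
    r = 0  # P deasserts r: wires start returning to idle
    trace.append(("r", r, "return-to-zero"))
    a = 0  # Q deasserts a: channel is idle again
    trace.append(("a", a, "idle"))
    return trace


events = four_phase_handshake()
```

Only the first of the four events triggers useful computation; the remaining three are protocol overhead, which is the dead time the text refers to.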
 Data Communication
When data communication is involved, an encoding scheme is used to represent and transmit
data. Two encodings are most common: dual-rail and single-rail.




 and  data values respectively and code  represents the spacer or idle
state This encoding eectively combines data and control in the same wires the idle state
indicates invalid data and 
 and 
 indicate valid data and the data value itself This is
a very robust code that guarantees correct operation with arbitrary delays in the circuit
However implementations typically suer from severe power and area penalties due to the
duplication in the number of wires and transitions
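A minimal sketch of a dual-rail code follows. The specific assignment of wire pairs to data values is an assumption chosen for illustration (the pair is written as (true_rail, false_rail)); only the existence of two valid codes and an idle spacer comes from the text. All names are hypothetical.

```python
SPACER = (0, 0)  # idle state: no valid data on the wire pair


def encode_bit(bit):
    """Map a data bit to an assumed dual-rail pair (true_rail, false_rail)."""
    return (1, 0) if bit else (0, 1)


def decode_pair(pair):
    """Return 0 or 1 for valid data, None for the spacer; reject the unused code."""
    if pair == SPACER:
        return None
    if pair == (1, 0):
        return 1
    if pair == (0, 1):
        return 0
    raise ValueError("both rails high is not a valid dual-rail code")


def encode_word(value, width):
    """Encode an unsigned integer as a list of dual-rail pairs, LSB first."""
    return [encode_bit((value >> i) & 1) for i in range(width)]


def transitions_per_transfer(width):
    """Rail transitions per word: one rail per bit rises, then returns to zero."""
    return 2 * width
```

Because every bit always makes one rising and one falling transition per transfer, the transition count is independent of the data, which illustrates the power penalty mentioned above.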
 Singlerail Data This code uses one wire for every data bit as in synchronous designs
and one additional wire called a datavalid signal for control The collection of all data
bits and the datavalid signal is called a data bundle This code has good power and area
costs comparable to synchronous implementations The correct operation of singlerail
circuits relies on a local timing assumption all data wires must be valid and stable before




Two basic computation structures are used in asynchronous systems: pipelined and
sequential. Pipelined computation is usually used in high-performance processors (e.g.,
AMULET). The design of asynchronous pipeline control is a very active research area.
Sequential computation is used in some DSP systems with moderate timing
requirements, usually aimed at low-power operation (e.g., FIR filter bank, DCC error
corrector). Active research is done in sequencer control also.

The focus of this paper is on sequential computation. The optimizations introduced
in this paper improve the throughput of an asynchronous sequential system by increasing
the amount of concurrent activity. A key control element that has a large impact on
concurrency is the sequencer. We first review existing sequencers, point out
their limitations, and introduce three new sequencer designs which result in higher system
throughput.
Next, we show that existing datapaths will not work correctly at this increased level
of concurrency. In particular, data hazards may appear. We therefore introduce new latch
designs to allow correct operation for dual-rail asynchronous datapaths. Two
alternative modifications are made for single-rail datapaths. We then present
modified multiplexer designs to be used with both dual-rail and single-rail datapaths.
Finally, we combine the architectural optimizations with voltage scaling.
The results of SPICE simulations indicate that a significant reduction in power can be
achieved over existing designs, without affecting system throughput.
 Asynchronous Control
A basic control operation in macromodular systems is the sequencing of computations or
data processing actions. Such sequences can be very long. For example, Bailey reports
that the longest sequence in the asynchronous error decoder circuit for a DCC player
 consists of  processes Two common operation protocols are used in asynchronous
sequencers: sequential and concurrent.














 and acknowledge a
i
 signals Figure b shows how the sequencer communicates with
each process using phase handshaking
In this basic sequential protocol the sequencer executes a complete phase handshake
with process P
i
before starting a handshake with P
i 





































P1 R1 P2 R2 P3 R3 P4 R4
PROCESS 1
TOTAL COMPUTATION
                TIME
a b
Figure  Sequencer Block Diagram and Sequential Operation































phases alternate, resulting in a long dead time between computations.
In a sequential protocol, process P_{i+1} cannot be started until the previous return-to-zero
phase has completed.
This dead, or non-computation, time between processing phases (shown in part (b) of the
figure above) is called inter-process latency. Other important parameters to evaluate the
performance of a sequencer (also shown in the same figure) are the initial latency, the time
from the activation of the sequencer to the start of the first computation, and the total
computation time, the time from the activation of the sequencer to the end of the last
computation. Optimization techniques for such sequencers typically focus on reducing the
internal control latency of the acknowledge and idle phases, which contribute to the
inter-process latency.
Concurrent Protocol. A more efficient approach is to introduce concurrent operation.
In a concurrent protocol, the sequencer can start process P_{i+1} without waiting for P_i to
complete its handshake: the processing phase of P_{i+1} is overlapped
with the return-to-zero phase R_i of the previous process, as shown schematically in the
figure below.

Figure: Concurrent Sequencer Operation

As we will show later, a number of concurrent protocols can be implemented, resulting
in different levels of concurrency in the system.
 Previous Sequencers
We now review existing sequencers that implement sequential and concurrent protocols,
and indicate their limitations.
 Sequential Approaches
 Tangram Sequencer In Tangram way sequencing is implemented using the SEQ
operator  shown in Figure a The sequencer is activated on its passive port or
channel S a passive port is indicated by a small white circle The sequencer then com
municates on active ports P
 and P to activate the rst and second processes respectively
an active port is indicated by a small black circle









for channel Pi. A complete 4-phase handshake occurs on port P1,
followed by a complete 4-phase handshake on port P2. The behavior of the SEQ operator

Figure: Tangram Sequencer. (a) SEQ Symbol, (b) SEQ Circuit, (c) Tree Sequencer

An implementation of the SEQ operator is shown in part (b) of the figure. This circuit is
speed-independent, i.e., it operates correctly assuming arbitrary finite gate delays. An
N-way sequencer consists of SEQ operators connected in a tree structure, as shown in
part (c). There are two problems with the Tangram sequencer: (i) it has a long initial
latency and (ii) it has long inter-process latencies (see the Results section).
Martin Sequencers. Martin presents two N-way sequencers. The Tangram
N-way sequencer corresponds exactly to a Q-element-based Martin sequencer, and it has
comparable performance. A D-element-based sequencer provides no overall performance
improvement.
 JosephsBailey CounterDecoder Sequencer In 
 Bailey and Josephs introduced
a centralized sequencer based in a counterdecoder architecture The counter centralizes
the state of the sequencer and the decoder distributes the signals to the processes The
circuit is speedindependent and it is currently used in several designs This sequencer
was designed using a formal procedure based on SIAlgebra see  for details The
implementation has improved initial and interprocess latencies compared to the Tangram
tree sequencer Minor problems are that the circuit is not modular and is designed to work
with an even number of processes
 JosephsBailey Chain Sequencer Bailey and Josephs 
 also introduced a dis
tributed sequencer built as a linear chain of n modules each controlling a process The
modules assume fundamentalmode operation 	 In fundamental mode no new inputs

can arrive until the component has stabilized from a previous input change This design
also has better initial and interprocess latencies than Tangrams
To summarize all of these sequencers implement a sequential protocol and as a result
have long interprocess latencies due to the returnto zero phase In addition some designs
also have a long initial latency
 Concurrent Approaches
There have been a few attempts to implement concurrent sequencers, but each has
limitations.
 Unger Tree Sequencer Unger  presented a step module that implements a
concurrent way sequencer The step assumes fundamentalmode operation and relies
on reasonable timing assumptions. An N-way sequencer is built as a balanced tree of step
modules. There are several problems with this implementation: (i) the sequencer has
a long initial latency; (ii) the inter-process latency is different for every pair of processes,
and can be several gate delays, depending on how far up and down the tree the signals have
to propagate; and (iii) the area and power consumption of this structure are significantly
worse than the previous designs (see the Results section).
 Farnsworth way Sequencer Farnsworth et al  introduced a concurrent way
sequencer as part of a FIFO control unit. No N-way extensions were presented. We
explored three different tree structures to obtain an N-way sequencer: a left-branching
tree, a right-branching tree and a balanced tree. Our attempts resulted either in (i) a long
initial latency or (ii) long inter-process latencies.
Our goal is to implement a concurrent sequencer efficiently, overcoming the problems
and deficiencies pointed out in the previous designs.
  New Concurrent Sequencers




 BurstMode Concurrent Sequencer
Our first sequencer implements a concurrent protocol with tightly-coupled overlap:
processing phase P_i overlaps exactly return-to-zero phase R_{i-1}. The key point is that this

Figure: Burst-Mode Sequencer Operation
We synthesized the circuit using an existing burstmode asynchronous tool UCLOCK

 with extensions to incorporate output feedback The result is a modular design well
suited for distributed control Our Nway sequencer has N modules organized into  types




















Figure  shows two M
i
modules each controlling a process The modules have good
latency area and power In typical computation the interprocess latency the time from
completion of P
i
s processing phase a
i  
 to the start of P
i 
s processing phase r
i

is only  CMOS gate delays an AOIgate followed by an inverter
 
Also each module
contributes only  gate output transitions and 
 transistors to the energy consumption
and area of the system
The correct operation of the sequencer relies on modest timing assumptions, related to
the fact that each process acknowledge signal (a_i) is forked to two different modules. In
 
If a returntozero phase is unusually long the interprocess latency may increase due to synchroniza
tion dependencies	 In particular P
i
























































 in module i and r
i 




must propagate back to the input of the complex gate before a
i
 arrives at




has to propagate through a
short wire, while r_i has to propagate through process P_i, this restriction is quite reasonable
in practice.
One restriction of our burst-mode sequencer is that a long return-to-zero phase (as
illustrated in the Burst-Mode Sequencer Operation figure) may unnecessarily delay the start
of the next processing phase. This observation leads to our second design.
 Optimized Concurrent Sequencer
Our second sequencer allows greater concurrency by using a more relaxed synchronization
requirement. By starting P_{i+1} as soon as P_i is finished, independently of the status of
R_i, a faster sequence of processing phases is allowed. The operation of this optimized
sequencer is shown in the figure below. Processing phase P_{i+1} depends only on the
completion of P_i.

Figure: Optimized Sequencer Operation
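The difference between the tightly-coupled and the relaxed synchronization can be sketched with a small timing model. This is an abstraction under stated assumptions, not the circuit behavior: control latencies are ignored, each return-to-zero phase is assumed to start as soon as its processing phase ends, and in the tightly-coupled case a processing phase is assumed to wait for both the preceding processing phase and the return-to-zero phase before that. The function name, the list-based indexing and the durations are hypothetical.

```python
def processing_start_times(processing, rtz, tightly_coupled):
    """Start time of each processing phase, given phase durations (0-based lists).

    tightly_coupled=True : phase j waits for P_{j-1} and for R_{j-2} to finish.
    tightly_coupled=False: phase j waits only for P_{j-1} (relaxed protocol).
    """
    starts, p_end, r_end = [], [], []
    for j, p in enumerate(processing):
        if j == 0:
            start = 0.0
        else:
            start = p_end[j - 1]
            if tightly_coupled and j >= 2:
                start = max(start, r_end[j - 2])
        starts.append(start)
        p_end.append(start + p)
        r_end.append(p_end[j] + rtz[j])  # R_j begins when P_j ends
    return starts


# Illustrative (assumed) durations: the first return-to-zero phase is unusually long.
P = [2, 2, 2]
R = [10, 1, 1]
tight = processing_start_times(P, R, tightly_coupled=True)
relaxed = processing_start_times(P, R, tightly_coupled=False)
```

With these assumed numbers the third processing phase starts at time 12 under tight coupling but at time 4 under the relaxed requirement, which is the kind of unnecessary stall the second design is meant to remove.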
Our optimized sequencer design is shown in Figure a Although similar to the burst
mode sequencer three improvements are clear i a wire replaces the AND gate that
generates S
a
 ii each module has one fewer input a
i 
 resulting in a reduced fanout
of the processes acknowledge signals and iii the module implementations shown in

	





























1 < i < N
a b
Figure  Optimized Sequencer a Block Diagram b Module Implementations
Figure 
 shows two optimized modules each controlling a process These modules are
more e cient in terms of speed power and area the interprocess latency is guaranteed to
be  CMOS gate delays and the contributions of a module to the energy dissipation and










The correct operation of the sequencer relies on similar timing assumptions as the
burst-mode sequencer. In realistic settings, these assumptions are very reasonable and should
not be a problem. If these timing requirements cannot be met, a more robust sequencer
must be used, as presented in the following subsection.
 Speedindependent Concurrent Sequencer
In a speedindependent circuit an input is not allowed to change until its previous change
has been acknowledged through an output change As pointed out in the previous sections
the burstmode and the optimized sequencers do not meet this requirement due to the




a shows our speedindependent sequencer The fork in a
i
is
eliminated sending the signal only to module i ! 
 which generates r
i 
 A signal from
this module to the previous one b
i
 is added in order to report that the change in a
i
has

































 SpeedIndependent Sequencer a Block Diagram b Module Implementation
The first process (P_1) is controlled by a handshake S-element, shown in the figure
above. Each of the remaining processes is controlled by an M module, shown in part (b)
of the figure. The module has a very efficient implementation: a single C-element.
Also
each module contributes only  gate output transitions and 
 transistors to the energy

A input Muller Celement produces a  output when both inputs are  and a  output when
both inputs are 	 When the inputs are dierent the output remains at its previous value	
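The C-element behavior stated above can be captured in a few lines. This is a purely behavioral sketch with no timing; the class name, method name and reset value are assumptions for illustration.

```python
class CElement:
    """Behavioral model of a 2-input Muller C-element (no delays modeled)."""

    def __init__(self, initial=0):
        self.output = initial  # assumed reset value

    def update(self, a, b):
        """Follow the inputs when they agree; otherwise hold the previous output."""
        if a == b:
            self.output = a
        return self.output


c = CElement()
history = [c.update(1, 0), c.update(1, 1), c.update(0, 1), c.update(0, 0)]
```

The hold behavior is what lets a single C-element wait for two events, such as the completion of one handshake and the readiness of the next module, before changing its output.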


consumption and area of the system

 Modied Dualrail Datapaths
We now briefly review the operation of dual-rail datapaths and examine their interaction
with concurrent sequencers. We point out that existing datapaths will not operate correctly
at the increased concurrency provided by the new sequencers. In particular, data hazards
can arise. We then present modified dual-rail latch and multiplexer designs that allow
correct operation of the datapath. Single-rail datapaths will be covered in the following
section.
 Basic DualRail Datapath Operation
The figure below shows schematically a 2-stage dual-rail datapath. In this example, a sequencer
controls two processing actions. Process 1 implements Z = F(Y) and Process 2 implements
X = G(W). W, X, Y and Z are variables that hold data, and F and G are blocks of
combinational logic that implement the desired functions.

Figure: Sequencer controlling two Dual-Rail Datapath Processes (Process 1: Z = F(Y);
Process 2: X = G(W); each process consists of a latch and a function block)
Functions are implemented using hazard-free combinational logic that operates on
dual-rail input data and generates dual-rail outputs. Hazard-free operation is required because
This sequencer works for N   the circuit will deadlock for N  	 A way sequencer can be
built using an Selement to control the rst process an M module to control the second process and an





any glitch in the data wires can be interpreted as a valid data signal and produce erroneous
operation. Dual-rail variables are usually implemented using a latch with separate read
and write ports. The latches are opaque when inactive.
The operation of Process 1 is as follows. The initial request (r_1 asserted) propagates to
source variable Y as a read request. Y sends data to block F, which computes F(Y). Dual-rail
data is used and there is no separate handshake signal to control this communication. When
F(Y) is computed, the result is sent to the destination variable Z. The dual-rail data
itself serves as the write request signal for Z. When the data is stored, the processing phase
is complete and Z sends its acknowledge signal to the controller by asserting a_1. The
return-to-zero phase is initiated by deasserting r_1, which propagates to Y to finish the read
operation. Data signals return to the idle state to indicate the end of the write request to Z.
Z responds by deasserting a_1.

Figure: Tangram Dual-Rail Latch (write port and read port)
An existing handshake latch that stores 





correspond to the dualrail write data and W
a
is the write acknowledge signal
R
r




are the dualrail data outputs When the latch
is inactive all handshake wires are low A read operation is started by the asserting the
read request R
r
 The latch responds with R

 to indicate that a  is stored or
R
 
 to indicate a 
 To complete the read R
r










request and also as an indication of the value to be stored The new value is stored in
the crossedcoupled NOR latch W
a
 acknowledges that the data is stored Note that if
the new data is the same as the data stored in the latch no signals need to propagate


inside the latch and the write operation will be acknowledged immediately The return




going low and is completed when W
a
is deasserted
Concurrent read and write operations to this latch are not allowed If this occurs the latch
may malfunction as indicated below
  Data Hazards in Overlapped Operation
Unsafe overlapped operation of the datapath is caused by concurrent processing and
return-to-zero phases accessing the same latch. The figure below shows a datapath structure
where two processes access the same latch, X. Of four possible forms of interaction, three are
free of data hazards:
(i) Read after Read (RAR): data does not change and remains stable;
(ii) Read after Write (RAW): the second computation reads data that has already been
written to the latch and is stable;
(iii) Write after Write (WAW): no read operation is performed, so data can be written
without causing problems (actually, the use of multiplexers, covered in a later section,
prevents the occurrence of this type of interaction).

Figure: Scenario for a WAR Data Hazard (Process 1: Z = F(X); Process 2: X = G(W))
The only hazardous interaction, shown in the figure above, is a Write after Read (WAR)
hazard. The second computation may write new data to the latch while its read operation
has not completed. The new data can propagate through the latch and cause incorrect
operation.
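The four interaction types can be summarized in a small classifier. This is a minimal sketch; the function name and the 'R'/'W' encoding of accesses are assumptions for illustration.

```python
def classify_interaction(first_access, second_access):
    """Classify two consecutive accesses ('R' or 'W') to the same latch.

    Returns (name, is_hazard). The name reads 'second After first'.
    """
    names = {
        ("R", "R"): "RAR",  # data does not change
        ("W", "R"): "RAW",  # read sees data that is already stable
        ("W", "W"): "WAW",  # no read is in progress
        ("R", "W"): "WAR",  # write may race with the unfinished read
    }
    kind = names[(first_access, second_access)]
    return kind, kind == "WAR"


# Process 1 reads X (Z = F(X)), then Process 2 writes X (X = G(W)).
result = classify_interaction("R", "W")
```

In the scenario of the figure, the first process reads X and the second writes X, so the classifier reports the one hazardous case.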
A WAR hazard occurs when a read is first initiated (R_r asserted) and a write then arrives
before the read has completed. The new data in the write port can propagate through
the latch to the read port and cause undesired changes in the output data.
A classical example where this hazard arises is a shift register. The shift register
is a special case of the datapath shown in the figure above, in which function blocks F and G









input of the next stage. A read request
to one stage therefore produces a write to the adjacent stage. A
sequencer controls each 1-bit shift operation. The sequencer generates a read request (R_r)
to each stage in turn, and receives the adjacent write acknowledge (W_a). If a concurrent
sequencer is used, a WAR hazard will occur in every latch.
 Solutions for DataHazardfree Overlapped Operation
We propose two different approaches to eliminate WAR data hazards: (i) a hardware
solution: stall the write operation until the read is completed; and (ii) a compiler solution:
avoid overlapped accesses to the same register.
Hardware Solution. Interlock circuitry is used to stall the write operation until the
an enable signal for the write data. Signal R_r remains low while the read port
is transparent, disabling write operations. Note that, in practice, the interlock circuitry
should have minimal impact on performance, for two reasons: (i) the interlock mechanism
will rarely be activated; (ii) in the
event that the interlock is engaged, it will be active only for the duration of the race and
not for the entire phase.
If the data being written to the latch is equal to the data already stored in it, the write
operation is not stalled and is acknowledged immediately, regardless of the state of the
read port. This is a safe optimization: no changes are caused in the latch and no glitches






























Figure: Modified Dual-Rail Latch Implementations. (a) Gate-Level, (b) Transistor-Level
The latter solution requires only  added transistors
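The interlock policy described for the hardware solution can be sketched behaviorally. This is not a model of the gate-level or transistor-level circuits; it only captures the two rules stated in the text (stall a write while a read is in progress, but acknowledge at once when the new value equals the stored one). The class and method names are hypothetical.

```python
class InterlockedLatch:
    """Behavioral sketch of a one-bit latch with a read/write interlock."""

    def __init__(self, value=0):
        self.value = value
        self.read_active = False

    def start_read(self):
        self.read_active = True
        return self.value

    def end_read(self):
        self.read_active = False

    def try_write(self, new_value):
        """Return True if the write is acknowledged, False if it is stalled."""
        if new_value == self.value:
            return True   # nothing changes in the latch: acknowledge immediately
        if self.read_active:
            return False  # interlock: stall until the read completes
        self.value = new_value
        return True


latch = InterlockedLatch(value=0)
latch.start_read()
stalled = not latch.try_write(1)   # different value during a read: stalled
same_ok = latch.try_write(0)       # same value during a read: acknowledged
latch.end_read()
later_ok = latch.try_write(1)      # read finished: write goes through
```

A stalled write would simply be retried by the environment; in hardware the write request stays asserted until the interlock releases it.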
Compiler Solution. At the algorithm level, a compiler can easily identify a WAR hazard
between two consecutive computations. Therefore, the compiler can insert an unrelated
operation between them to eliminate the hazard. In a case in which such reordering is
not possible, the compiler either inserts a special null operation or falls back on the use
of the modified latch. This technique requires the use of our tightly-coupled burst-mode
sequencer, which allows WAR interactions only between two consecutive computations.
 Modied Singlerail Datapaths
Dualrail datapaths are very robust but pay a large penalty in terms of area and power
dissipation We now examine singlerail datapaths as an alternative implementation
 Basic SingleRail Operation
Figure  shows schematically a sequencer controlling a 2-stage single-rail datapath that
implements Z := F(Y); X := G(W). Unlike dual-rail logic, F and G need not be hazard-free:
combinational logic blocks that operate in synchronous systems may be used.
To comply with the bundling constraint, i.e., to guarantee that all data wires are valid
and stable before the data-valid signal is asserted, delays are inserted in the data-valid
wires. Typically, these delays are designed to match the worst-case delay of the
corresponding datapath block; that is, DF must equal the worst-case delay in the
combinational logic block that implements function F, and DZ must equal the worst-case
delay in latch Z. In CMOS implementations, delays depend heavily on the sizes of transistors
and their loading, and also on the final routing and placement of modules, so safety
margins are required for correct operation.
[Figure: Sequencer controlling a Single-Rail Datapath]
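The bundling constraint and the role of the safety margin can be illustrated with a small sketch. The function names and the 20% default margin are assumptions chosen for illustration only; the source does not state how large the margins should be.

```python
def matched_delay(worst_case_delay: float, margin: float = 0.2) -> float:
    """Delay to insert in the data-valid wire: worst-case block delay plus a relative margin.

    The 20% default margin is a hypothetical value, not taken from the paper.
    """
    if worst_case_delay < 0 or margin < 0:
        raise ValueError("delay and margin must be non-negative")
    return worst_case_delay * (1.0 + margin)


def bundling_ok(actual_data_delay: float, data_valid_delay: float) -> bool:
    """Bundling constraint: data must be stable no later than the data-valid signal."""
    return actual_data_delay <= data_valid_delay
```

If a block was characterized at 10 time units but placement and routing stretch it to 11.5, a delay line sized by `matched_delay(10.0)` still satisfies the constraint, whereas a block that ends up at 12.5 would violate it.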
Figure  shows a latch used to implement the variables which appear in Figure 
(see [ ]). Each latch is normally opaque, and stored data is always readable at its output.
Asserting the write request Wr makes the latch transparent for writing; the subsequent
deassertion of Wr makes the latch opaque, latching the result. Complementary signals en
and ne are generated by the latch enable circuit.
[Figure: Tangram Single-Rail Latch and Enable Circuit]
The operation of the datapath depends on the type of sequencer used and on how the
delay matching is done. We now examine two schemes for single-rail datapath operation
recently presented by Peeters et al. [ ].
Conservative Scheme. The conservative scheme uses a sequential controller, such as the
Josephs-Bailey counter-decoder sequencer, with the single-rail latch (Figure  ).
The delays in the control signals are designed to match worst-case delays in the associated
blocks; that is, DF must equal the worst-case delay in F, and DZ must match the response
time of latch Z.
Datapath sections operate as follows. The sequencer generates an initial request by
asserting r1. Data from Y is already present at the inputs of F. The request signal
propagates through the matched delay DF as F(Y) is being computed. When computation is
complete, the data is stable and valid. The output of the delay acts as the data-valid
signal for the result. This data bundle is sent to Z. The arrival of the data-valid signal
makes Z transparent and, after propagating through DZ, it is sent back to the controller
as the acknowledge signal a1. At this point, the processing phase is complete. The
sequencer starts the return-to-zero phase by deasserting r1, which propagates through DF
and arrives at Z, making it opaque again. After propagation through DZ, a1 is deasserted.
In this scheme, the result of the computation is valid at the end of the processing phase.
Once processing is complete, the destination latch Z becomes transparent (see Figure  ).
The key point in this scheme, shown in Figure  (a), is that the latch is transparent only
when data is valid and stable, so no undesired glitches are propagated to the rest of the
circuit. The drawback is that performance is poor, since the sequencer does not allow
overlapped operation. In particular, even though the result of the computation is ready at
the end of the processing phase, the stage still must go through the return-to-zero phase
before the next computation can begin.
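The event order of one conservative handshake can be written out as a simple timeline. The sketch below is an illustration under stated assumptions: the sequencer responds to the acknowledge in zero time, the signal names r1 and a1 stand for the request and acknowledge of the stage computing Z from F(Y), and `conservative_trace` is a hypothetical helper.

```python
from typing import List, Tuple


def conservative_trace(d_f: float, d_z: float) -> List[Tuple[float, str]]:
    """Event times for one four-phase handshake in the conservative scheme.

    d_f: matched delay for block F (full worst-case delay of F).
    d_z: matched delay for the response time of latch Z.
    Sequencer response time is taken as zero (simplifying assumption).
    """
    t_valid = d_f                  # request leaves the matched delay; F(Y) is stable
    t_ack_up = t_valid + d_z       # acknowledge reaches the sequencer
    t_close = t_ack_up + d_f       # deasserted request reaches Z
    t_ack_down = t_close + d_z     # acknowledge returns to zero
    return [
        (0.0, "r1 asserted"),
        (t_valid, "data valid, Z transparent"),
        (t_ack_up, "a1 asserted, processing phase complete"),
        (t_ack_up, "r1 deasserted, return-to-zero starts"),
        (t_close, "Z opaque"),
        (t_ack_down, "a1 deasserted"),
    ]
```

With `d_f = 8` and `d_z = 2`, the result is latched correctly at time 10, yet the stage is not free for the next computation until time 20, which is the overhead the text describes.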
Fast Scheme. A scheme using a sequential controller can achieve higher throughput
through a novel distribution of the computation throughout the phases of the handshake
protocol. In this scheme, called the standard true four-phase protocol by Peeters [ ],
delays are designed to match only half the value of the worst-case delay in the functional
blocks. As in the previous scheme, r1 propagates through DF and becomes the data-valid
signal; however, only half of the computation time has elapsed and data is not ready. The
signal arrives as a write request to Z, making it transparent. The latch acknowledge signal
goes to the controller as an indication of a completed processing phase, even though
computation is still going on. The deassertion of r1 starts the return-to-zero phase and
propagates through the matched delay. At this point, the result of the computation is
stable and available on the data wires that feed the latch. When the control signal
reaches Z, the latch is closed.
[Figure: Single-rail Operation Schemes: (a) Conservative, (b) Fast, (c) Concurrent]
Figure  (b) shows the operation of the fast scheme. In this case, the deassertion of ai
indicates that computation is complete, as opposed to the assertion of ai in the previous
scheme. The advantage of this scheme is that it halves the length of the processing and
return-to-zero phases of the handshake, obtaining roughly twice the throughput of the
conservative scheme. However, the scheme has two key drawbacks: (i) matching delays to
half the value of the delay in the functional blocks is not straightforward, and, more
significantly, (ii) the destination latch is made transparent while data is unstable. In
fact, the outputs of the combinational circuit F can glitch many times during this period,
and these glitches will be propagated to every processing stage connected to the latch
(see discussion in [ ]). This results in unpredictable power consumption that can be
large, especially if the latch is connected to deep combinational circuits.

A New Approach: Overlapped Operation
Our solution is to use one of our concurrent sequencers with the conservative datapath
scheme, where the matched delays match the full worst-case delays of the computation
blocks. The operation of this scheme is shown in Figure  (c). The concurrent sequencer
initiates active phases of adjacent processes in immediate succession, allowing high
throughput (see Figures  and  ). At the same time, each process uses the simple
delay-matching approach of the conservative scheme. This results in essentially the same
performance advantage as the fast scheme, but without the drawbacks: a latch is transparent
only when data is stable, eliminating glitch propagation. In addition, the delays are
matched to the worst-case value of the associated functional block.

Peeters [ ] suggests a low-power true four-phase protocol to reduce the glitch-propagation
problem. Essentially, read ports are added to the latches to block the glitches. However,
Peeters also points out that, although the low-power variant minimizes the number of
transitions, it is likely that the added power consumption in the latches will eliminate
any power savings due to the reduced transitions.
This scheme is a valid solution except for one problem. As in dual-rail datapaths,
overlapped operation introduces the possibility of data hazards if operations interact with
the same latch. An analysis of the operation of the single-rail datapath, equivalent to
the analysis of the dual-rail datapath in the previous section, reveals that three types of
interaction are safe (RAR, RAW and WAW); only WAR interactions are unsafe and require
modifications.
Unsafe overlapped operation of the datapath is caused by concurrent processing and
return-to-zero phases that access the same latch. For example, in Figure  , assume
that latch Y is the same as latch X. The sequencer will start the return-to-zero phase of
Process 1 concurrently with the processing phase of Process 2. There is a race: Process 1
is making Z opaque (deasserting r1) while Process 2 is making X transparent (asserting
r2). If Process 2 wins the race, the new data in X may reach Z before it becomes opaque,
causing it to store the wrong data.
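The race can be expressed as a single timing inequality. The sketch below is an illustrative model, assuming that the sequencer deasserts r1 and asserts r2 at the same instant (time 0) and that `x_to_z_min` is the shortest delay from the output of latch X through F to the input of Z; the function and parameter names are hypothetical.

```python
def z_stores_correct_data(d_f: float, d_g: float, x_to_z_min: float) -> bool:
    """Check the WAR race when Process 2 writes the latch that Process 1 reads.

    d_f: matched delay before the deasserted request makes Z opaque.
    d_g: matched delay before the asserted request makes X transparent.
    x_to_z_min: shortest delay from the output of X, through F, to the input of Z.
    """
    z_opaque_at = d_f
    new_data_at_z = d_g + x_to_z_min
    # Z is safe only if it has closed strictly before new data can arrive.
    return z_opaque_at < new_data_at_z
```

For instance, with `d_f = 10`, `d_g = 3` and `x_to_z_min = 2`, new data can reach Z at time 5 while Z stays transparent until time 10, so the stored value may be corrupted.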
Modifications for Correct Overlapped Operation
The WAR hazard arises because the destination latch Z remains transparent throughout
the return-to-zero phase, while the overlapped processing phase can write to the source
latch X. Latch Z has already stored the information and is only waiting for the
deassertion of r1 to propagate.
[Figure: Latch Enable Circuit: (a) Early-Close Modification, (b) Interlock Modification]

Early close scheme. We can fast-forward the deassertion of r1 to the destination latch, so
it closes early in the return-to-zero phase instead of at the end. Figure  (a) shows this
simple modification to the latch enable circuit. The latch will not open early, so no
glitch propagation will occur. This scheme relies on reasonable timing assumptions for
correct operation.
Interlock scheme. A more robust approach is to stall the writing of the source latch
until Z is opaque again. In this case, the acknowledge signal from the destination latch, a,
is used as an enable to the source latch write request. Figure  (b) shows this modification.
Stalling the write operation guarantees correct operation independently of the delays in
the circuit, but provides less performance improvement than the previous scheme.
Overlapped Multiplexers
The operation of a datapath, either dual-rail or single-rail, often requires multiple accesses
to the same process. In this case, the different requests must be multiplexed together.
An existing handshake multiplexer for control signals [ ], shown in Figure  (a), requires
mutually-exclusive requests on its two channels. A new control multiplexer design, shown
in Figure  (b), allows overlapped requests. In this design, a second request is stalled at
the AND gate until the first operation is completed.
Control multiplexers have been extended to data multiplexing [ ]. Similar extensions

We now show how the increased throughput obtained by the new sequencers combines
effectively with the application of voltage scaling to produce significant energy savings
for an entire asynchronous system. We compare important features of all the sequencers
discussed above, and then show results of several SPICE simulations.
SEQUENCER                          AREA             ENERGY                       TIMING
                                   (transistors)    (gate output transitions)    MODEL
Previous Designs
  Tangram                          N                N                            SI
  Martin                           N                N                            SI
  Josephs-Bailey Counter-Decoder   N                N                            SI
  Josephs-Bailey Chain             N                N                            FM

  Burst-mode                       N                N                            FM
  Optimized                        N                N                            FM
  Speed-independent                N                N                            SI
† In [ ], Farnsworth et al. used a concurrent 2-way sequencer. No N-way extensions were presented. We explored three
different structures to obtain an N-way sequencer: a left-branching tree, a right-branching tree and a balanced tree.
All extensions lead to the same results.
Table  Static Characteristics of N-way Sequencers
Sequencer characteristics are summarized in two tables. Table  compares static
characteristics for the different sequencers. The information is given as a function of N,
the number of processing stages being sequenced. The total number of transistors and gate
output transitions are used as first-order approximations to area and power consumption.
The results show that the new designs are very competitive in both dimensions. In fact,
the optimized and speed-independent sequencers have better features than all the others.
Table  compares the dynamic behavior of the different sequencers, where each sequencer




SEQUENCER LATENCY LATENCY TIME
Previous Designs
Tangram Ng 
g  R NgNPNR
Martin Ng 
g  R NgNPNR
JosephsBailey CounterDecoder g g  R NgNPNR
JosephsBailey Chain g g  R NgNPNR
MINg








Optimized g g NgNP
SpeedIndependent g g NgNP
† In [ ], Farnsworth et al. used a concurrent 2-way sequencer. No N-way extensions were presented. We explored three
different structures to obtain an N-way sequencer: a left-branching tree, a right-branching tree and a balanced tree. The
results shown here correspond to our left-branching-tree extension. All extensions lead to the same total computation time.
‡ If a return-to-zero phase is unusually long, the interprocess latency may increase due to synchronization dependencies.
Table  Dynamic Behavior of N-way Sequencers

CMOS complex gate or an inverter, P represents the length of a processing phase, and
R is the length of the return-to-zero phase. Again, the table shows that the new designs
are very competitive. The substantial improvement in the computation time is due to the
concurrent operation of the new sequencers, which eliminates the N  R term, and to
the efficient implementation, which reduces the number of gate delays both for initial and

[Figure: Simulated Power Consumption for  -stage Dual-Rail System. Panel labels: No Voltage
Scaling; With Voltage Scaling (optimized concurrent sequencer, modified dual-rail latches)]
Finally to analyze the combined impact on system power consumption of the new se

quencers and the application of voltage scaling we have simulated results using SPICE 

We present results on both dualrail and singlerail sequential systems In particular we
simulated several versions of a sequencer controlling a stage datapath with identical
processes
Simulation results for a dualrail system are shown in Figure 
 Figure 
a shows the
power consumption of a sequential implementation of the system using a JosephsBailey
chain sequencer for control and Tangram dualrail latches with a  volt power supply In
comparison Figure 
b shows the power consumption of a new concurrent implementa
tion of the system using our optimized sequencer and modied dualrail latches operating
at  volts Our design obtains an 	 improvement in total computation time compared
to the previous design Finally Figure 
c shows the result of the application of voltage
scaling to our system the power supply can be dropped to 		 volts with the total energy
consumption of the entire system reduced by a factor of 	 compared to the sequential
design of Figure 
a
Figure  shows the simulation results for the single-rail system. Figure  (a) shows
a sequential implementation of the conservative scheme, using a Josephs-Bailey chain
sequencer for control and Tangram single-rail latches, with a  volt power supply.
Figure  (b) shows our optimized design using the early close scheme, also operating at
 volts. In this case, our design obtains a  improvement in total computation time.
Finally, Figure  (c) presents the simulation of our optimized design after voltage scaling
is applied (power supply reduced to  volts). The total energy consumption of the entire
system is reduced by a factor of  .
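The first-order reason that voltage scaling saves energy is that the switching energy of a CMOS gate grows with the square of the supply voltage. The sketch below shows only that relationship, for a fixed capacitance and a fixed number of transitions. It is not a reproduction of the SPICE results, and the voltages in the usage note are illustrative assumptions, not the simulated values.

```python
def dynamic_energy_ratio(v_old: float, v_new: float) -> float:
    """Ratio of switching energy after scaling to before scaling, assuming the same
    switched capacitance and the same number of transitions (energy ~ C * V^2)."""
    if v_old <= 0 or v_new <= 0:
        raise ValueError("supply voltages must be positive")
    return (v_new / v_old) ** 2
```

For example, scaling a hypothetical supply from 5.0 V to 3.3 V gives a ratio of about 0.44, i.e. less than half the switching energy. Lowering the supply also slows the gates, which is why the throughput gained from the concurrent sequencers is what makes the lower voltage affordable.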
The fast scheme was also simulated. As expected, we obtained roughly the same total
computation time as with our optimized design. However, as mentioned earlier, our
simpler approach has two key advantages: (i) ease of design (delay matching to the en


[Figure panel labels: No Voltage Scaling; With Voltage Scaling (optimized concurrent
sequencer, Tangram single-rail latches with early-close scheme)]




This paper has focused on architectural optimizations for low-power asynchronous systems.
These optimizations were targeted to sequential (i.e., non-pipelined) computation. We
presented three new sequencer designs that effectively increase the concurrent activity
of the system. We also showed that existing datapaths will not work correctly at the
increased level of concurrency. New latch and multiplexer designs that safely accommodate
the added concurrency were presented for both dual-rail and single-rail implementations.
In the dual-rail case, our optimizations resulted in improved throughput, providing the
opportunity for significant power savings through voltage scaling. Similar improvements
were demonstrated over one existing single-rail scheme. We also indicated benefits of our
single-rail approach over a second single-rail scheme.
Acknowledgments
The authors would like to thank Prof. Stephen Unger (Columbia University) for critically
reviewing a draft version of this manuscript and making useful suggestions. They would
also like to thank Dr. Craig Farnsworth (Cogency Ltd.) for introducing them to his
work, and Prof. Stephen Furber (University of Manchester) and Dr. Ad Peeters (Philips
Research Laboratories) for their useful comments.
References

 A Bailey and M Josephs Sequencer circuits for VLSI programming In Proc Working
Conf on Asynchronous Design Methodologies pages " IEEE Computer Society
Press May 

 E Brunvand Translating Concurrent Communicating Programs into Asynchronous
Circuits PhD thesis Carnegie Mellon University 


	 A P Chandrakasan M Potkonjak R Mehra J Rabaey and R W Brodersen







 A P Chandrakasan S Sheng and R W Brodersen Lowpower CMOS digital design
IEEE J of SolidState Circuits 	" April 

	
 C Farnsworth D A Edwards J Liu and S S Sikand A hybrid asynchronous system
design environment In Proc Working Conf on Asynchronous Design Methodologies
pages 
" IEEE Computer Society Press May 

 S Furber Computing without clocks Micropipelining the ARM processor In





 S B Furber and P Day Fourphase micropipeline latch control circuits IEEE
Transactions on VLSI Systems "	 June 

 M B Josephs and A M Bailey Design of sequencer circuits a case study in SI
algebra Technical Report SBUCISM
 South Bank University September 

 A Marshall B Coates and P Siegel Designing an asynchronous communications
chip IEEE Design 






 A J Martin Programming in VLSI From communicating processes to delay
insensitive circuits In CAR Hoare editor Developments in Concurrency and Com






 R E Miller Sequential Circuits and Machines volume  of Switching Theory John
Wiley $ Sons New York 


 L W Nagel and D O Pederson Simulation program with integrated circuit empha
sis SPICE Technical Report ERLM		 Electronics Research Lab University of
California Berkeley April 
	

	 L S Nielsen C Niessen J Spars% and K van Berkel Lowpower operation using





 L S Nielsen and J Spars% A lowpower asynchronous datapath for a FIR lter bank
In Proc International Symposium on Advanced Research in Asynchronous Circuits
and Systems pages 
" IEEE Computer Society Press March 


 S M Nowick and B Coates UCLOCK Automated design of highperformance
asychronous state machines In Proc International Conf Computer Design ICCD
pages 	"
 IEEE Computer Society Press October 


 S M Nowick and D L Dill Automatic synthesis of locallyclocked asynchronous
state machines In Proc International Conf ComputerAided Design ICCAD pages
	
"	









 A Peeters and K van Berkel Singlerail handshake circuits In Proc Working Conf
on Asynchronous Design Methodologies pages 	" May 


 L A Plana and S M Nowick Concurrencyoriented optimization for lowpower
asynchronous systems In Proc International Symposium on Low Power Electronics





 C L Seitz System timing In C A Mead and L A Conway editors Introduction
to VLSI Systems chapter  AddisonWesley Reading MA 


 I E Sutherland Micropipelines Communications of the ACM 	"	 June


 S H Unger A building block approach to unclocked systems In Proc Hawaii In
ternational Conf System Sciences volume I pages 		"	 IEEE Computer Society
Press January 
	
	 Stephen H Unger Asynchronous Sequential Switching Circuits WileyInterscience
New York 

 K van Berkel R Burgess J Kessels A Peeters M Roncken and F Schalij Asyn






 K van Berkel and M Rem VLSI programming of asynchronous circuits for low





 Kees van Berkel Handshake Circuits an Asynchronous Architecture for VLSI Pro
gramming Cambridge University Press 
	
 K Y Yun P A Beerel and J Arceo Highperformance asynchronous pipeline
circuits In Proc International Symposium on Advanced Research in Asynchronous
Circuits and Systems pages 
" IEEE Computer Society Press March 

 K Y Yun and D L Dill Automatic synthesis of 	D asynchronous state machines
In Proc International Conf ComputerAided Design ICCAD pages " IEEE
Computer Society Press November 

	
