Pipelined Asynchronous Circuits by Lines, Andrew Matthew
June  revised June 
This thesis presents a design style for implementing communicating sequential processes (CSP) as quasi delay-insensitive asynchronous circuits, based on the compilation method of [?]. Although hand compilation can always yield optimal circuits to a good designer, a restricted approach is suggested which can easily implement circuits with some slack between inputs and outputs. These circuits are fast and versatile building blocks for highly pipelined designs. The first chapter presents the implementation approach for individual cells. The second chapter investigates the time behavior of complex pipelined circuits, with the goal of adding slack where necessary and adjusting transistor sizes to optimize the overall throughput.
 Pipelined Cells
   Pipelines
A pipeline is a linear sequence of buffers where the output of one buffer connects to the input of the next buffer. Tokens are sent into the input end of the pipeline and flow through each buffer to the output end. The tokens remain in first-in-first-out (FIFO) order. For synchronous pipelines, the tokens usually advance through one stage on each clock cycle. For asynchronous pipelines, there is no global clock to synchronize the movement. Instead, each token moves forward down the pipeline when there is an empty cell in front of it; otherwise, it stalls. This is exactly the behavior of cars on a freeway. The buffer capacity, or slack, of an asynchronous pipeline is the maximum number of tokens that can be packed into it without stalling the input end of the pipeline. The throughput is the number of tokens per second which pass a given stage in the pipeline. The forward latency is the time it takes a given token to travel the length of the pipeline.
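The definitions above can be illustrated with a toy step-by-step model. This is only a sketch under stated assumptions, not a circuit model: the name `simulate_pipeline`, the one-token-per-cell limit, and the lock-step scan from the output end are all illustrative choices that do not come from the thesis.

```python
def simulate_pipeline(n_cells, n_tokens, output_blocked=False, max_steps=1000):
    """Toy model: each cell holds at most one token (None means empty).

    A token advances one cell per step when the cell ahead is empty;
    otherwise it stalls. Returns (received, cells), where `received` lists
    the tokens that left the output end, in order of arrival.
    """
    cells = [None] * n_cells
    pending = list(range(n_tokens))   # tokens waiting at the input end
    received = []
    for _ in range(max_steps):
        moved = False
        # Drain the output end unless the environment refuses tokens.
        if cells[-1] is not None and not output_blocked:
            received.append(cells[-1])
            cells[-1] = None
            moved = True
        # Scan from the output end so each token moves at most once per step.
        for i in range(n_cells - 1, 0, -1):
            if cells[i] is None and cells[i - 1] is not None:
                cells[i], cells[i - 1] = cells[i - 1], None
                moved = True
        # Admit a new token if the input cell is empty.
        if cells[0] is None and pending:
            cells[0] = pending.pop(0)
            moved = True
        if not moved:
            break
    return received, cells
```

With a free output end the tokens emerge in FIFO order; with the output end blocked, the number of tokens that fit before the input stalls is the capacity of this toy pipeline.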
  Buer Reshuing
A single rail buer has the CSP speci
cation   LR Using a passive protocol for L and a lazy
active protocol for R the buer will have the handshaking expansion HSE
  L L
a
  L L
a
  R
a
 R  R
a
 R 

The environment will perform    L
a
L  L
a
L and    RR
a
  RR
a
 The wait
for  L is interpreted to be the arrival of an input token and the transition R is the beginning
of the output token An array of these buers preserves the desired FIFO order and properties
of a pipeline
Direct implementation of this HSE will require a state variable to distinguish the first half from the second half, and has too much sequencing per cycle. Instead, it is usually better to reshuffle the waits and events to reduce the amount of sequencing and the number of state variables. We wish to maximize the throughput and minimize the latency of a pipeline.
The first requirement for a valid reshuffling is that the HSE maintain the handshaking protocols on $L$ and $R$. That is, the projection on the $L$ channel is $*[[L];\ L^a\uparrow;\ [\neg L];\ L^a\downarrow]$ and the projection on the $R$ channel is $*[[\neg R^a];\ R\uparrow;\ [R^a];\ R\downarrow]$. In addition, the number of completed $L\uparrow$ minus the number of completed $R\uparrow$ (the slack of the buffer) must be at least zero. This conserves the number of tokens in the pipeline. Also, since this is a buffer and is supposed to introduce some nonzero slack, the $L^a\uparrow$ must not wait for the corresponding $[R^a]$, or the reshuffling will have zero slack. This is the constant response time requirement.
Although these three requirements are sufficient to guarantee a correct implementation, one more is useful. Soon we will expand the $L$ and $R$ channels to encode data. If we were to move the $R\uparrow$ past the corresponding $L^a\uparrow$, that data would need to be saved in internal state variables proportional to the number of bits on $R$ or $L$. It is better to avoid the additional latching.
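The four requirements can be checked mechanically for a fully sequential ordering of one loop iteration. The sketch below is an illustration only: the event spelling (`"La+"`, `"[~Ra]"`, and so on), the function names, and the restriction to a single linear interleaving (parallel branches must be flattened by hand) are assumptions made here, not notation from the thesis.

```python
L_PROTOCOL = ["[L]", "La+", "[~L]", "La-"]
R_PROTOCOL = ["[~Ra]", "R+", "[Ra]", "R-"]


def project(cycle, channel_events):
    """Keep only the waits and transitions that belong to one channel."""
    return [e for e in cycle if e in channel_events]


def is_rotation(seq, ref):
    """True if seq is a cyclic rotation of ref (the loop may start anywhere)."""
    return len(seq) == len(ref) and any(
        seq == ref[k:] + ref[:k] for k in range(len(ref)))


def check_reshuffling(cycle):
    """cycle: one loop iteration, written starting where the token arrives."""
    pos = {e: i for i, e in enumerate(cycle)}
    return {
        # 1. Both channel handshakes are preserved.
        "protocols": is_rotation(project(cycle, L_PROTOCOL), L_PROTOCOL)
        and is_rotation(project(cycle, R_PROTOCOL), R_PROTOCOL),
        # 2. A token is not sent before it has arrived.
        "conserves_tokens": pos["[L]"] < pos["R+"],
        # 3. La+ does not wait for the corresponding [Ra] (nonzero slack).
        "constant_response": pos["La+"] < pos["[Ra]"],
        # 4. R+ is not moved past La+, so no extra data latching is needed.
        "no_extra_latching": pos["R+"] < pos["La+"],
    }
```

For example, the ordering `[~Ra], [L], R+, La+, [Ra], R-, [~L], La-` passes all four checks, while the direct (unreshuffled) HSE fails only the latching check.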
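Given these requirements, there are only nine valid reshufflings:
MSFB $\equiv *[[\neg R^a \wedge L];\ R\uparrow;\ ([R^a];\ R\downarrow),\ (L^a\uparrow;\ [\neg L];\ L^a\downarrow)]$
PCFB $\equiv *[[\neg R^a \wedge L];\ R\uparrow;\ L^a\uparrow;\ ([R^a];\ R\downarrow),\ ([\neg L];\ L^a\downarrow)]$
PCHB $\equiv *[[\neg R^a \wedge L];\ R\uparrow;\ L^a\uparrow;\ [R^a];\ R\downarrow;\ [\neg L];\ L^a\downarrow]$
WCHB $\equiv *[[\neg R^a \wedge L];\ R\uparrow;\ L^a\uparrow;\ [R^a \wedge \neg L];\ R\downarrow;\ L^a\downarrow]$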
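B1 $\equiv *[[\neg R^a \wedge L];\ R\uparrow;\ L^a\uparrow;\ [R^a \wedge \neg L];\ L^a\downarrow;\ R\downarrow]$
B2 $\equiv *[[\neg R^a \wedge L];\ R\uparrow;\ L^a\uparrow;\ [\neg L];\ L^a\downarrow;\ [R^a];\ R\downarrow]$
B3 $\equiv *[[\neg R^a \wedge L];\ R\uparrow;\ L^a\uparrow;\ [\neg L];\ ([R^a];\ R\downarrow),\ L^a\downarrow]$
B4 $\equiv *[[\neg R^a \wedge L];\ R\uparrow;\ L^a\uparrow;\ [R^a];\ R\downarrow,\ ([\neg L];\ L^a\downarrow)]$
B5 $\equiv *[[\neg R^a \wedge L];\ R\uparrow;\ L^a\uparrow;\ [R^a \wedge \neg L];\ R\downarrow,\ L^a\downarrow]$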
It takes two state variables to implement the MSFB reshuffling. The PCFB, B1, B2, B3, B4, and B5 reshufflings all require one state variable $en$ (short for enable), with $en\downarrow$ inserted after $L^a\uparrow$ and $en\uparrow$ inserted before the end.
Are any of these reshufflings obviously inferior? For this thesis, we assume that the goal is fewer transistors and faster operation. By that metric, it can be shown that B3, B4, and B5 are always inferior to PCFB. They all require the same state variable. They produce only a subset of the traces of PCFB, with additional unnecessary waits. These waits add extra transistors and slow the circuit down compared to PCFB.
What about B1 and B2? They are also very similar to PCFB, except they have more sequencing. However, that extra sequencing simplifies the production rule for $en\uparrow$ to $\neg R \to en\uparrow$ instead of $\neg R \wedge \neg L^a \to en\uparrow$ in the case of PCFB. It is therefore not possible to say that these will always be inferior to PCFB. However, due to the extra sequencing and additional transistors elsewhere, we assume that these reshufflings will seldom if ever be better than PCFB, and will not consider them further.
The MSFB has the least possible sequencing of any of these reshufflings. However, it requires two state variables and has more complicated production rules than PCFB. Its only possible advantage in speed is that it allows $R\downarrow$ to happen a little earlier. If one counts transitions, it turns out that the next buffer in the pipeline, if it is reshuffled similarly, will not even raise $R^a$ until after $L^a\uparrow$ occurs, so this is really not an advantage at all. Therefore, the MSFB will not be considered further.
That leaves only three interesting reshufflings: WCHB, PCHB, and PCFB. The names are derived from characteristics of the circuit implementations. WC indicates weak-condition logic, PC indicates precharge logic, HB indicates a half-buffer (slack 1/2), and FB indicates a full-buffer (slack 1). In the half-buffer reshufflings, only every other stage can have a token on its output channel, since a token on that channel blocks the previous stage from producing an output token. In practice, each of these three reshufflings seems to be best for certain applications, so they are all useful. With state variables inserted, the three reshufflings are:
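PCFB $\equiv *[[\neg R^a \wedge L];\ R\uparrow;\ L^a\uparrow;\ en\downarrow;\ ([R^a];\ R\downarrow),\ ([\neg L];\ L^a\downarrow);\ en\uparrow]$
PCHB $\equiv *[[\neg R^a \wedge L];\ R\uparrow;\ L^a\uparrow;\ [R^a];\ R\downarrow;\ [\neg L];\ L^a\downarrow]$
WCHB $\equiv *[[\neg R^a \wedge L];\ R\uparrow;\ L^a\uparrow;\ [R^a \wedge \neg L];\ R\downarrow;\ L^a\downarrow]$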
  Logic with Buering
Suppose we wanted to implement a unit with CSP of the form
P   Aa Bb      X f a b     Y ga b        
On each cycle P receives some inputs then sends out functions computed from these inputs
The channels AB X  and Y must encode some data The usual way to do this is to use sets of
	of	N rails for each channel For instance to send two bits one could use two 	of	 rails with
one acknowledge or one 	of	 rails with one acknowledge
As a notational convention a rail is identi
ed by the channel name with a superscript for the
	of	N wire which is active and a subscript for what group of 	of	N wires it belongs to if there
is more than one group in the channel The corresponding acknowledge will be the channel
name with a a superscript or an e superscript if it is used in the inverted sense
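The two encodings mentioned above can be sketched in a few lines. The helper names below are hypothetical and chosen for illustration; the sketch assumes the usual conventions that the all-low state is neutral and that more than one high rail in a group is illegal.

```python
def encode_1_of_n(value, n):
    """Return n rails with exactly the rail numbered `value` raised."""
    if not 0 <= value < n:
        raise ValueError("value out of range for a 1-of-%d code" % n)
    return [1 if i == value else 0 for i in range(n)]


def decode_1_of_n(rails):
    """Return the encoded value, or None if the group is neutral."""
    high = [i for i, r in enumerate(rails) if r]
    if not high:
        return None                      # neutral: no token present
    if len(high) > 1:
        raise ValueError("illegal 1-of-N state: more than one rail is high")
    return high[0]


def two_bits_as_dual_rail(value):
    """Two bits as two 1-of-2 groups (group 0 holds the low bit)."""
    return [encode_1_of_n(value & 1, 2), encode_1_of_n((value >> 1) & 1, 2)]


def two_bits_as_1_of_4(value):
    """The same two bits as a single 1-of-4 group."""
    return encode_1_of_n(value, 4)
```

Both encodings carry the same two bits: the dual-rail form uses four wires in two groups, while the 1-of-4 form uses four wires in one group and raises only one wire per token.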
As in the single-rail buffer case, we could implement $P$ by expanding each channel communication into a handshaking expansion. Direct implementation of this HSE requires state variables for the $a$, $b$ variables and more. It would produce an enormously big and slow circuit, so some reshuffling is obviously desired. How should the HSE be reshuffled? The same set of arguments presented in the last section applies, and we conclude that the PCFB, PCHB, and WCHB reshufflings will be the most useful ones.
The correspondence between the single-rail templates for PCFB, PCHB, and WCHB and a process like $P$ is as follows. The $L$ and $L^a$ will represent all the input data and acknowledges. The $R$ and $R^a$ will represent all the output data and acknowledges. $[L]$ indicates a wait for the validity of all inputs, and $[\neg L]$ a wait for the neutrality of all inputs. $[\neg R^a]$ indicates a wait for all the output acknowledges to be false, and $[R^a]$ indicates a wait for all the output acknowledges to be true. $L^a\uparrow$ indicates raising all the input acknowledges in parallel, and $L^a\downarrow$ indicates lowering them. $R\uparrow$ means that all the outputs are set to their valid states in parallel. $R\downarrow$ means that all the outputs are set to their neutral states. When $R\uparrow$ occurs, it is meant that particular rails of the outputs are raised depending on which rails of $L$ are true. This expands $R\uparrow$ into a set of exclusive selection statements executing in parallel.
Unfortunately, this simple translation will introduce more sequencing than necessary. Of the various actions which occur in parallel, like setting all the outputs valid ($R\uparrow$), each action might need to wait for only a portion of the preceding guard $[\neg R^a \wedge L]$. For instance, raising $X^0$ or $X^1$ needs to check $[\neg X^a]$ but not $[\neg Y^a]$. Similarly, the semicolons between actions ($R\uparrow;\ L^a\uparrow$) might also over-sequence. However, this cannot be easily fixed while still using the HSE language. For instance, in the sequence $X\uparrow, Y\uparrow;\ A^a\uparrow, B^a\uparrow$, it might be necessary for $A^a\uparrow$ to wait for $[X]$ only, if $Y\uparrow$ did not use the value of $A$, while $B^a\uparrow$ might need to wait for $[X \wedge Y]$. This case could be written as $X\uparrow, Y\uparrow;\ ([X];\ A^a\uparrow),\ ([X \wedge Y];\ B^a\uparrow)$, but this is starting to get a bit illegible, and if the next actions aren't fully sequenced it gets even worse. In the limit, the HSE just mirrors the actual production rule set (PRS). To skirt the issue of weak sequencing, this thesis will just use HSE which might be a bit over-sequenced, with the understanding that the unnecessary sequencing will be optimized out in the compilation to production rules. Ongoing research includes a more formal examination of when sequencing can be eliminated, including an algorithm which automatically produces weakened PRS from a given HSE. However, this is not yet fully developed and is beyond the scope of this thesis.
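The PCFB version of a $P$ with dual-rail channels would therefore be
$$\begin{array}{l}
*[\,[\neg X^a \wedge f^0(A,B,\ldots) \to X^0\uparrow\ \;[]\;\ \neg X^a \wedge f^1(A,B,\ldots) \to X^1\uparrow], \\
\quad [\neg Y^a \wedge g^0(A,B,\ldots) \to Y^0\uparrow\ \;[]\;\ \neg Y^a \wedge g^1(A,B,\ldots) \to Y^1\uparrow],\ \ldots; \\
\quad A^a\uparrow,\ B^a\uparrow,\ \ldots; \\
\quad en\downarrow; \\
\quad [X^a \to X^0\downarrow, X^1\downarrow],\ [Y^a \to Y^0\downarrow, Y^1\downarrow],\ \ldots, \\
\quad [\neg A^0 \wedge \neg A^1 \to A^a\downarrow],\ [\neg B^0 \wedge \neg B^1 \to B^a\downarrow],\ \ldots; \\
\quad en\uparrow\,]
\end{array}$$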
In this HSE the f

f

g

 and g

are boolean expressions in the data rails of the input channels
They are derived from the f and g of the CSP and indicate the conditions for raising the
various data rails of the output channels Note that each output channel waits only for its own
acknowledge which is less sequenced than a direct translation of the PCFB template would be
In P it is seen that A
a
and B
a
tend to switch at about the same time They could actually
be combined into a single AB
a
which would wait for the conjunction of the guards on A
a
and
B
a
 Combining the acknowledges tends to reduce the area of the circuit but might slow it down
The best decision depends on the circumstances
Examples of Logic with Buffering
To put the previous section into practice, several CSP processes with the same form as $P$ will be compiled into pipelined circuits. The simplest CSP buffer that encodes data has a dual-rail input $L$ and a dual-rail output $R$. The CSP is $*[L?x;\ R!x]$. Three HSE reshufflings for this process are:
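WCHB_BUF $\equiv$
$$\begin{array}{l}
*[\,[\neg R^a \wedge L^0 \to R^0\uparrow\ \;[]\;\ \neg R^a \wedge L^1 \to R^1\uparrow];\ L^a\uparrow; \\
\quad [R^a \wedge \neg L^0 \wedge \neg L^1 \to R^0\downarrow, R^1\downarrow];\ L^a\downarrow\,]
\end{array}$$
PCHB_BUF $\equiv$
$$\begin{array}{l}
*[\,[\neg R^a \wedge L^0 \to R^0\uparrow\ \;[]\;\ \neg R^a \wedge L^1 \to R^1\uparrow];\ L^a\uparrow; \\
\quad [R^a \to R^0\downarrow, R^1\downarrow];\ [\neg L^0 \wedge \neg L^1 \to L^a\downarrow]\,]
\end{array}$$
PCFB_BUF $\equiv$
$$\begin{array}{l}
*[\,[\neg R^a \wedge L^0 \to R^0\uparrow\ \;[]\;\ \neg R^a \wedge L^1 \to R^1\uparrow];\ L^a\uparrow;\ en\downarrow; \\
\quad [R^a \to R^0\downarrow, R^1\downarrow],\ [\neg L^0 \wedge \neg L^1 \to L^a\downarrow];\ en\uparrow\,]
\end{array}$$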
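After bubble-reshuffling (which suggests using the inverted acknowledges $L^e$ and $R^e$), the production rules for the WCHB_BUF follow, and the circuit diagram is shown in the WCHB_BUF figure. Internal nodes that carry the inverted sense of a signal are written with an overbar.
$$\begin{array}{ll}
R^e \wedge L^0 \to \overline{R^0}\downarrow & \neg R^e \wedge \neg L^0 \to \overline{R^0}\uparrow \\
R^e \wedge L^1 \to \overline{R^1}\downarrow & \neg R^e \wedge \neg L^1 \to \overline{R^1}\uparrow \\
\neg\overline{R^0} \to R^0\uparrow & \overline{R^0} \to R^0\downarrow \\
\neg\overline{R^1} \to R^1\uparrow & \overline{R^1} \to R^1\downarrow \\
\neg\overline{R^0} \vee \neg\overline{R^1} \to \overline{L^e}\uparrow & \overline{R^0} \wedge \overline{R^1} \to \overline{L^e}\downarrow \\
\overline{L^e} \to L^e\downarrow & \neg\overline{L^e} \to L^e\uparrow
\end{array}$$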
The other HSEs can be implemented similarly but they are both somewhat bigger For this
reshuing the validity and neutrality of the output data R implies the validity and neutrality
of the input data L Logic which has this property is called weak	condition It means that the
L does not need to be checked anywhere else besides in R The WCHB also gets some of its
semicolons implemented for free The semicolon between L
a
  R
a
 L is implemented by the
environment as is the implicit semicolon at the end of the loop So it seems that the WCHB
has some inherent bene
ts However it turns out that although WCHB works well for buers
the weak	condition requirement can cause problems with other circuits
ThisWCHB BUF bubble	reshuing has  transitions forward latency and  transitions back	
ward latency for the path from the right acknowledge to the left acknowledge Combining
these times for the whole handshake yields          transitions per cycle

Figure: WCHB_BUF
The minimum number of transitions per cycle of any 4-phase buffer is 6. There must be at least 1 transition at each end of the channel for each phase of the handshake. Due to the inverse monotonicity of CMOS, an oscillator must have an odd number of transitions on each half cycle. Therefore each half of the handshake requires an odd number of transitions that is no less than 2, so there must be at least 3 transitions for each half, or 6 for the whole cycle. Why do we add extra inverters to WCHB_BUF to get 10 transitions per cycle? Adding the inverters can actually speed up the throughput, despite the increased transition count, because inverters have high gain. Also, the 6-transition-per-cycle buffer would invert the senses of the data and acknowledges after every stage, which is highly inconvenient when composing different pipelined cells. As a standard practice, most pipelined logic cells will be done with 2 transitions of forward latency, but more complicated circuits will have [?], [?], or even [?] transitions backward latency, yielding transitions per cycle from [?] to [?] (even numbers only, of course).
Next we consider a full-adder with the CSP $*[A?a,\ B?b,\ C?c;\ S!\mathrm{XOR}(a,b,c),\ D!\mathrm{MAJ}(a,b,c)]$. The $A$, $B$, $C$, $S$, and $D$ channels are dual-rail. The acknowledges for $A$, $B$, and $C$ will be combined into a single $F^e$. We use inverted acknowledges from the start. The three HSE reshufflings are:
WCHB FA 
  S
e
 XOR

AB C   S

 S
e
 XOR

AB C   S


 D
e
MAJ

AB C   D

 D
e
MAJ

AB C   D


F
e

 S
e
 A

 A

 C

 S

 S


 D
e
 B

 B

 C

 D

D


F
e



PCHB FA 
  S
e
 XOR

AB C   S

 S
e
 XOR

AB C   S


 D
e
MAJ

AB C   D

 D
e
MAJ

AB C   D


F
e

 S
e
 S

 S


 D
e
 D

D


 A

 A

 B

 B

 C

 C

 F
e


PCFB FA 
  S
e
 XOR

AB C   S

 S
e
 XOR

AB C   S


 D
e
MAJ

AB C   D

 D
e
MAJ

AB C   D


F
e

en
 S
e
 S

 S


 D
e
 D

D


 A

 A

 B

 B

 C

 C

 F
e

en

In the WCHB FA the validity of the outputs S and D implies the validity of the inputs
because the S must check all of AB  and C  The test for the neutrality of the inputs is split
between S and D This works as long as both S and D check at least one inputs neutrality
completely and if both rails of S and D wait for the same expression In both PCHB FA and
PCFB FA the expression for the neutrality of the inputs is obviously too large to implement as
a single production rule Instead the neutrality test must be decomposed into several operators
The usual decomposition is nor gates for each dual rail input followed by a 	input c	element
F
e
 must now wait for the validity of the inputs just to acknowledge the internal transitions
However this means the logic for S and D no longer needs to fully check validity of the inputs
it is not required to be weak	condition
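The set-phase guards of the full-adder can be illustrated at the rail level. This is a sketch under assumptions: the helper names are invented here, each channel is modelled as a `(rail0, rail1)` pair, and every guard is written as a plain sum of products over all three inputs (which makes both outputs weak-condition for validity, and is not necessarily the minimal MAJ logic used in the circuits).

```python
from itertools import product


def guard(inputs, predicate):
    """OR, over every choice of one rail per input whose bit pattern
    satisfies `predicate`, of the AND of the chosen rails."""
    return any(
        all(channel[bit] for channel, bit in zip(inputs, bits))
        for bits in product((0, 1), repeat=len(inputs))
        if predicate(bits)
    )


def full_adder_rails(a, b, c):
    """Return ((XOR0, XOR1), (MAJ0, MAJ1)) for dual-rail inputs a, b, c."""
    ins = (a, b, c)
    s = (guard(ins, lambda v: sum(v) % 2 == 0),   # sum bit is 0
         guard(ins, lambda v: sum(v) % 2 == 1))   # sum bit is 1
    d = (guard(ins, lambda v: sum(v) < 2),        # carry is 0
         guard(ins, lambda v: sum(v) >= 2))       # carry is 1
    return s, d
```

Because every product term contains one rail from each input, no guard can become true while any input is still neutral, which is the validity half of the weak-condition property discussed above.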
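The bubble-reshuffled and decomposed production rules for WCHB_FA are
$$\begin{array}{ll}
S^e \wedge \mathrm{XOR}^0(A,B,C) \to \overline{S^0}\downarrow & \neg S^e \wedge \neg A^0 \wedge \neg A^1 \wedge \neg C^0 \to \overline{S^0}\uparrow \\
S^e \wedge \mathrm{XOR}^1(A,B,C) \to \overline{S^1}\downarrow & \neg S^e \wedge \neg A^0 \wedge \neg A^1 \wedge \neg C^0 \to \overline{S^1}\uparrow \\
D^e \wedge \mathrm{MAJ}^0(A,B,C) \to \overline{D^0}\downarrow & \neg D^e \wedge \neg B^0 \wedge \neg B^1 \wedge \neg C^1 \to \overline{D^0}\uparrow \\
D^e \wedge \mathrm{MAJ}^1(A,B,C) \to \overline{D^1}\downarrow & \neg D^e \wedge \neg B^0 \wedge \neg B^1 \wedge \neg C^1 \to \overline{D^1}\uparrow \\
\neg\overline{S^0} \to S^0\uparrow & \overline{S^0} \to S^0\downarrow \\
\neg\overline{S^1} \to S^1\uparrow & \overline{S^1} \to S^1\downarrow \\
\neg\overline{D^0} \to D^0\uparrow & \overline{D^0} \to D^0\downarrow \\
\neg\overline{D^1} \to D^1\uparrow & \overline{D^1} \to D^1\downarrow \\
(\neg\overline{S^0} \vee \neg\overline{S^1}) \wedge (\neg\overline{D^0} \vee \neg\overline{D^1}) \to \overline{F^e}\uparrow & \overline{S^0} \wedge \overline{S^1} \wedge \overline{D^0} \wedge \overline{D^1} \to \overline{F^e}\downarrow \\
\overline{F^e} \to F^e\downarrow & \neg\overline{F^e} \to F^e\uparrow
\end{array}$$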
Figure: WCHB_FA
The circuit diagram is shown in the WCHB_FA figure. The pull-up logic for the $S^0$, $S^1$, $D^0$, and $D^1$ rails has 4 p-transistors in series, which is quite weak due to the lower mobility of holes. Other WCHB circuits can be even worse. Since all the inputs are checked for neutrality before the outputs reset, a process with three inputs and only one output would end up with 7 p-transistors in series to reset that output. The solution is to use the precharge-logic reshufflings, PCHB_FA or PCFB_FA. These test the neutrality of the inputs in a different place, which is much more easily decomposed into manageable gates and does not slow the forward latency. The PCHB_FA reshuffling has the production rules
 
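$$\begin{array}{ll}
A^0 \vee A^1 \to \overline{A^v}\downarrow & \neg A^0 \wedge \neg A^1 \to \overline{A^v}\uparrow \\
B^0 \vee B^1 \to \overline{B^v}\downarrow & \neg B^0 \wedge \neg B^1 \to \overline{B^v}\uparrow \\
C^0 \vee C^1 \to \overline{C^v}\downarrow & \neg C^0 \wedge \neg C^1 \to \overline{C^v}\uparrow \\
F^e \wedge S^e \wedge \mathrm{XOR}^0(A,B,C) \to \overline{S^0}\downarrow & \neg S^e \wedge \neg F^e \to \overline{S^0}\uparrow \\
F^e \wedge S^e \wedge \mathrm{XOR}^1(A,B,C) \to \overline{S^1}\downarrow & \neg S^e \wedge \neg F^e \to \overline{S^1}\uparrow \\
F^e \wedge D^e \wedge \mathrm{MAJ}^0(A,B,C) \to \overline{D^0}\downarrow & \neg D^e \wedge \neg F^e \to \overline{D^0}\uparrow \\
F^e \wedge D^e \wedge \mathrm{MAJ}^1(A,B,C) \to \overline{D^1}\downarrow & \neg D^e \wedge \neg F^e \to \overline{D^1}\uparrow \\
\neg\overline{S^0} \to S^0\uparrow & \overline{S^0} \to S^0\downarrow \\
\neg\overline{S^1} \to S^1\uparrow & \overline{S^1} \to S^1\downarrow \\
\neg\overline{D^0} \to D^0\uparrow & \overline{D^0} \to D^0\downarrow \\
\neg\overline{D^1} \to D^1\uparrow & \overline{D^1} \to D^1\downarrow \\
\neg\overline{A^v} \wedge \neg\overline{B^v} \wedge \neg\overline{C^v} \to ABC^v\uparrow & \overline{A^v} \wedge \overline{B^v} \wedge \overline{C^v} \to ABC^v\downarrow \\
\neg\overline{S^0} \vee \neg\overline{S^1} \to S^v\uparrow & \overline{S^0} \wedge \overline{S^1} \to S^v\downarrow \\
\neg\overline{D^0} \vee \neg\overline{D^1} \to D^v\uparrow & \overline{D^0} \wedge \overline{D^1} \to D^v\downarrow \\
S^v \wedge D^v \wedge ABC^v \to F^e\downarrow & \neg S^v \wedge \neg D^v \wedge \neg ABC^v \to F^e\uparrow
\end{array}$$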
This circuit can be made faster by adding two inverters to $F^e$, and then two more to produce the $F^e$ used internally, which is now called $en$. This circuit is shown in the PCHB_FA figure. A PCFB_FA reshuffling would have only slightly different production rules:

Figure: PCHB_FA
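$$\begin{array}{ll}
A^0 \vee A^1 \to \overline{A^v}\downarrow & \neg A^0 \wedge \neg A^1 \to \overline{A^v}\uparrow \\
B^0 \vee B^1 \to \overline{B^v}\downarrow & \neg B^0 \wedge \neg B^1 \to \overline{B^v}\uparrow \\
C^0 \vee C^1 \to \overline{C^v}\downarrow & \neg C^0 \wedge \neg C^1 \to \overline{C^v}\uparrow \\
en \wedge S^e \wedge \mathrm{XOR}^0(A,B,C) \to \overline{S^0}\downarrow & \neg S^e \wedge \neg en \to \overline{S^0}\uparrow \\
en \wedge S^e \wedge \mathrm{XOR}^1(A,B,C) \to \overline{S^1}\downarrow & \neg S^e \wedge \neg en \to \overline{S^1}\uparrow \\
en \wedge D^e \wedge \mathrm{MAJ}^0(A,B,C) \to \overline{D^0}\downarrow & \neg D^e \wedge \neg en \to \overline{D^0}\uparrow \\
en \wedge D^e \wedge \mathrm{MAJ}^1(A,B,C) \to \overline{D^1}\downarrow & \neg D^e \wedge \neg en \to \overline{D^1}\uparrow \\
\neg\overline{S^0} \to S^0\uparrow & \overline{S^0} \to S^0\downarrow \\
\neg\overline{S^1} \to S^1\uparrow & \overline{S^1} \to S^1\downarrow \\
\neg\overline{D^0} \to D^0\uparrow & \overline{D^0} \to D^0\downarrow \\
\neg\overline{D^1} \to D^1\uparrow & \overline{D^1} \to D^1\downarrow \\
\neg\overline{A^v} \wedge \neg\overline{B^v} \wedge \neg\overline{C^v} \to ABC^v\uparrow & \overline{A^v} \wedge \overline{B^v} \wedge \overline{C^v} \to ABC^v\downarrow \\
\neg\overline{S^0} \vee \neg\overline{S^1} \to S^v\uparrow & \overline{S^0} \wedge \overline{S^1} \to S^v\downarrow \\
\neg\overline{D^0} \vee \neg\overline{D^1} \to D^v\uparrow & \overline{D^0} \wedge \overline{D^1} \to D^v\downarrow \\
en \wedge S^v \wedge D^v \wedge ABC^v \to F^e\downarrow & \neg en \wedge \neg ABC^v \to F^e\uparrow \\
S^v \wedge D^v \to \overline{SD^v}\downarrow & \neg S^v \wedge \neg D^v \to \overline{SD^v}\uparrow \\
\neg F^e \wedge \neg\overline{SD^v} \to \overline{en}\uparrow & F^e \wedge \overline{SD^v} \to \overline{en}\downarrow \\
\overline{en} \to en\downarrow & \neg\overline{en} \to en\uparrow
\end{array}$$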
Of the three full-adder reshufflings, which one is best? The WCHB_FA has only 10 transitions per cycle, while the PCHB_FA has [?] and the PCFB_FA has [?] ([?] on the setting phase, but [?] on the resetting phase, since the $L$ and $R$ handshakes reset in parallel). Although the WCHB_FA has fewer transistors, to make it reasonably fast the 4 p-transistors in series must be made very large. Despite the lower transition count of the WCHB_FA, both PCHB_FA and PCFB_FA are substantially faster in throughput and latency. PCFB_FA is the fastest of all, since it relies heavily on n-transistors and saves [?] transitions on the reset phase. However, PCFB_FA is a bit larger than PCHB_FA, due to the extra state variable $en$ and the extra completion $SD^v$. If the speed of the full-adder is not critical, the PCHB_FA seems to be the best choice.
In general, the WCHB reshuffling tends to be best only for buffers and copies ($*[L?x;\ R!x,\ S!x]$). The PCHB is the workhorse for most applications; it is both small and fast. When exceptional speed is called for, the PCFB dominates. It is also especially good at completing 1-of-N codes where N is very large, since the completion can be done by a circuit which looks like a tied-or pulldown, as opposed to many stages of combinational logic. The reshufflings can actually be mixed together, with each channel in the cell using a different one. This is most commonly useful when a cell computes on some inputs using PCHB, but also copies some inputs directly to outputs using WCHB. In this case, the neutrality detection for the WCHB outputs is only one p-gate, which is no worse than an extra $en$ gate.
Another common class of logic circuits uses shared control inputs to process multi-bit words. This is not really any different from a full-adder. The control is just another input which happens to have a large fanout to many output channels. Since the outputs only sparsely depend on the inputs (usually with a bit-to-bit correspondence), the number of gates in series in the logic doesn't become prohibitive. However, if the number of bits is large (like [?]), the completion of all the inputs and outputs will take many stages in a c-element tree, which adds to the cycle time, as does the load on the broadcast of the control data. To make high-throughput datapath logic, it is better to break the datapath up into manageable chunks (perhaps [?] or [?] bits) and send buffered copies of the control tokens to each chunk. This cuts down the cycle time, but doesn't change the high-level meaning, except to introduce extra slack.
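The cost of completing a wide datapath can be estimated by counting the levels of a c-element tree. The function below is a rough sketch: its name, the fan-in parameter, and the word widths used in the usage note are assumptions for illustration and are not figures taken from the thesis.

```python
def completion_tree_depth(n_signals, fan_in=2):
    """Number of c-element levels needed to combine n_signals validity
    signals into one, when each c-element takes at most fan_in inputs."""
    if n_signals < 1 or fan_in < 2:
        raise ValueError("need at least one signal and a fan-in of 2 or more")
    depth = 0
    while n_signals > 1:
        n_signals = -(-n_signals // fan_in)   # ceiling division
        depth += 1
    return depth
```

For a hypothetical 32-bit word completed with 2-input c-elements this gives 5 levels, while a 4-bit chunk needs only 2, which is the kind of saving that chunking the datapath is meant to capture.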
Conditionally producing outputs
Although the cells discussed in the previous section can be shown to be Turing complete (they can be turned into a Von Neumann state machine with some outputs fed back through buffers to store state), they are clearly inefficient for many applications. A very useful extension is the ability to skip a communication on a channel on a given cycle. This turns out to require only a few minor modifications to the scheme as presented so far.
Suppose the process completes at most one communication per cycle on the outputs, but always receives all its inputs. The CSP would be
$$\begin{array}{l}
P \equiv *[A?a,\ B?b,\ \ldots; \\
\quad [\mathit{do\_x}(a,b,\ldots) \to X!f(a,b,\ldots)\ \;[]\;\ \neg \mathit{do\_x}(a,b,\ldots) \to \mathit{skip}], \\
\quad [\mathit{do\_y}(a,b,\ldots) \to Y!g(a,b,\ldots)\ \;[]\;\ \neg \mathit{do\_y}(a,b,\ldots) \to \mathit{skip}],\ \ldots\,]
\end{array}$$
As usual, we can reshuffle this like WCHB, PCHB, or PCFB. The selection statements for the outputs expand into exclusive selections for setting the output rails, plus a new case for producing no output at all on the channel. A dual-rail version of $P$ with a PCFB reshuffling is
  do x AB       X
a
 f

AB      X


 do x AB       X
a
 f

AB       X


 do x AB       skip
 do yAB       Y
a
 g

AB      Y


 do yAB       Y
a
 g

AB       Y


 do yAB       skip    
A
a
B
a
    
en
 X
a
	 X

 X

 X

X

  Y
a
	 Y

 Y

 Y

Y

    
 A

 A

 A
a
  B

 B

 B
a
    
en

Note that the resetting of the output channels $X$ and $Y$ must accommodate the cases when those channels weren't used. Since they produced no outputs, they must not wait for the acknowledges. Adding in the $\neg X^0 \wedge \neg X^1$ terms will allow the wait to be completed vacuously. It doesn't actually generate any production rules. This HSE can be compiled into production rules, but there are some tricky details.
An interesting choice arises from the use of the skip. A skip causes no visible change in state, so the next statements in sequence ($A^a\uparrow, B^a\uparrow, \ldots$) must actually look directly at the boolean expressions for $\neg \mathit{do\_x}(A,B,\ldots)$ and $\neg \mathit{do\_y}(A,B,\ldots)$, in addition to the output rails $X^0$, $X^1$, $Y^0$, $Y^1$. The completion condition for setting the outputs would be $en \wedge (X^0 \vee X^1 \vee \neg \mathit{do\_x}(A,B,\ldots)) \wedge (Y^0 \vee Y^1 \vee \neg \mathit{do\_y}(A,B,\ldots))$. However, this expression cannot be used directly in the guards for $A^a\uparrow$ and $B^a\uparrow$, since if one fired first it could destabilize the other. This would work if $A^a$ and $B^a$ were combined into one acknowledge.
A better approach is to introduce a new variable to represent the $\neg \mathit{do\_x}$ and $\neg \mathit{do\_y}$ cases. Suppose we replaced the skips with $\mathit{no\_x}\uparrow$ and $\mathit{no\_y}\uparrow$, respectively, and added $\mathit{no\_x}\downarrow$ to $X^0\downarrow, X^1\downarrow$ and $\mathit{no\_y}\downarrow$ to $Y^0\downarrow, Y^1\downarrow$. Now the production rules are simply produced as if $X$ and $Y$ were 1-of-3 channels instead of 1-of-2, except the extra rail doesn't check the right acknowledge, or in fact leave the cell.
Finally, there are many cases where some expression of the outputs is sufficient to produce the output completion expression without reference to the inputs. For instance, if one input is used to decide if a certain output is used, but is also copied to another output, the copied output could be used to check the completion of the optional output. Similarly, if two output channels are used exclusively, such that one or the other will be used each cycle, the completion for both is just the or of each one's completion.
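The completion conditions described above reduce to simple boolean expressions. The following is a minimal sketch with invented helper names; it treats the extra `no_x` / `no_y` variable as a third rail of a 1-of-3 code, as the text suggests, and ignores the acknowledge and `en` terms.

```python
def channel_complete(rails, no_output=False):
    """A conditional output channel has completed its set phase when one
    data rail is high or when its internal 'no output' variable is set."""
    return any(rails) or bool(no_output)


def outputs_complete(channels):
    """channels: iterable of (rails, no_output) pairs, one per output."""
    return all(channel_complete(rails, flag) for rails, flag in channels)


def exclusive_pair_complete(x_rails, y_rails):
    """If exactly one of two channels is used on every cycle, completion
    of the pair is the OR of each one's completion."""
    return any(x_rails) or any(y_rails)
```

In this form the completion of a skipped channel no longer depends on the input rails, so separate input acknowledges cannot destabilize one another through it.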
To put this discussion into practice, we will implement a split, a fundamental routing process which uses one control input to route a data input to one of two output channels. The simple one-bit CSP is $*[S?s,\ L?x;\ [\neg s \to A!x\ \;[]\;\ s \to B!x]]$. The PCHB reshuffling is
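PCHB_SPLIT $\equiv$
$$\begin{array}{l}
*[\,[A^e \wedge S^0 \wedge L^0 \to A^0\uparrow\ \;[]\;\ A^e \wedge S^0 \wedge L^1 \to A^1\uparrow\ \;[]\;\ S^1 \to \mathit{skip}], \\
\quad [B^e \wedge S^1 \wedge L^0 \to B^0\uparrow\ \;[]\;\ B^e \wedge S^1 \wedge L^1 \to B^1\uparrow\ \;[]\;\ S^0 \to \mathit{skip}]; \\
\quad SL^e\downarrow; \\
\quad [\neg A^e \vee (\neg A^0 \wedge \neg A^1) \to A^0\downarrow, A^1\downarrow], \\
\quad [\neg B^e \vee (\neg B^0 \wedge \neg B^1) \to B^0\downarrow, B^1\downarrow]; \\
\quad SL^e\uparrow\,]
\end{array}$$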
To produce the production rules, we first notice that the first two selection statements are known to be finished when $A^0 \vee A^1 \vee B^0 \vee B^1$. Hence this will be used as the guard for $SL^e\downarrow$. The bubble-reshuffled production rules are
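$$\begin{array}{ll}
S^0 \vee S^1 \to \overline{S^v}\downarrow & \neg S^0 \wedge \neg S^1 \to \overline{S^v}\uparrow \\
L^0 \vee L^1 \to \overline{L^v}\downarrow & \neg L^0 \wedge \neg L^1 \to \overline{L^v}\uparrow \\
SL^e \wedge A^e \wedge S^0 \wedge L^0 \to \overline{A^0}\downarrow & \neg SL^e \wedge \neg A^e \to \overline{A^0}\uparrow \\
SL^e \wedge A^e \wedge S^0 \wedge L^1 \to \overline{A^1}\downarrow & \neg SL^e \wedge \neg A^e \to \overline{A^1}\uparrow \\
SL^e \wedge B^e \wedge S^1 \wedge L^0 \to \overline{B^0}\downarrow & \neg SL^e \wedge \neg B^e \to \overline{B^0}\uparrow \\
SL^e \wedge B^e \wedge S^1 \wedge L^1 \to \overline{B^1}\downarrow & \neg SL^e \wedge \neg B^e \to \overline{B^1}\uparrow \\
\neg\overline{A^0} \to A^0\uparrow & \overline{A^0} \to A^0\downarrow \\
\neg\overline{A^1} \to A^1\uparrow & \overline{A^1} \to A^1\downarrow \\
\neg\overline{B^0} \to B^0\uparrow & \overline{B^0} \to B^0\downarrow \\
\neg\overline{B^1} \to B^1\uparrow & \overline{B^1} \to B^1\downarrow \\
\neg\overline{S^v} \wedge \neg\overline{L^v} \to SL^v\uparrow & \overline{S^v} \wedge \overline{L^v} \to SL^v\downarrow \\
\neg\overline{A^0} \vee \neg\overline{A^1} \vee \neg\overline{B^0} \vee \neg\overline{B^1} \to AB^v\uparrow & \overline{A^0} \wedge \overline{A^1} \wedge \overline{B^0} \wedge \overline{B^1} \to AB^v\downarrow \\
AB^v \wedge SL^v \to SL^e\downarrow & \neg AB^v \wedge \neg SL^v \to SL^e\uparrow
\end{array}$$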
The circuit is shown in the PCHB_SPLIT figure.
Figure: PCHB_SPLIT
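The set phase of the split can be modelled at the rail level. This is a behavioural sketch only: the function name is invented, channels are modelled as `(rail0, rail1)` pairs, and the acknowledge and enable terms of the guards are assumed to be satisfied.

```python
def split_set_phase(s, l):
    """Dual-rail split: control value 0 routes L to A, control value 1
    routes L to B. Returns (a_rails, b_rails, ab_valid), where ab_valid is
    the completion A0 or A1 or B0 or B1 used to acknowledge S and L."""
    a = (bool(s[0] and l[0]), bool(s[0] and l[1]))
    b = (bool(s[1] and l[0]), bool(s[1] and l[1]))
    ab_valid = any(a) or any(b)
    return a, b, ab_valid
```

Exactly one of the two output channels becomes valid on each cycle, so a single completion signal over all four output rails is enough; the unused channel simply stays neutral.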
Conditionally reading inputs
It is also highly useful to be able to conditionally read inputs. Normally the condition is read in on a separate unconditional channel, but in general it could be any expression of the rails of the inputs. A CSP template for this type of cell would be
$$\begin{array}{l}
P \equiv *[\,[\mathit{do\_a}(\overline{A},\overline{B}) \to A?a\ \;[]\;\ \mathit{no\_a}(\overline{A},\overline{B}) \to a := \mathit{unused}], \\
\quad [\mathit{do\_b}(\overline{A},\overline{B}) \to B?b\ \;[]\;\ \mathit{no\_b}(\overline{A},\overline{B}) \to b := \mathit{unused}],\ \ldots; \\
\quad X!f(a,b,\ldots),\ Y!g(a,b,\ldots),\ \ldots\,]
\end{array}$$
The $\overline{A}$ in this context refers to a probe of the value of $A$, not just its availability. This is not standard in CSP, but is a useful extension which is easily implemented in HSE. Basically, the booleans for $\mathit{do\_a}$, $\mathit{do\_b}$, $\mathit{no\_a}$, and $\mathit{no\_b}$ may inspect the rails of $A$ and $B$ in order to decide whether to actually receive from the channels. The selection statements will suspend until either $\mathit{do\_a}$ or $\mathit{no\_a}$ is true. These expressions are required to be stable; that is, as additional inputs show up, they may not become false as a result.
For the HSE, instead of assigning "unused" to an internal variable, the $f$ and $g$ expressions will examine the inputs directly. The results of the $\mathit{do\_a}$/$\mathit{no\_a}$ and $\mathit{do\_b}$/$\mathit{no\_b}$ expressions must be latched into internal variables $u$ and $v$, so that $A$ and $B$ may be acknowledged in parallel without destabilizing the guards of $\mathit{do\_a}$ and the like. The PCFB version of the HSE is
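$$\begin{array}{l}
u^0\downarrow,\ u^1\downarrow,\ v^0\downarrow,\ v^1\downarrow,\ \ldots; \\
*[\,[f^0(A,B,\ldots) \to X^0\uparrow\ \;[]\;\ f^1(A,B,\ldots) \to X^1\uparrow], \\
\quad [g^0(A,B,\ldots) \to Y^0\uparrow\ \;[]\;\ g^1(A,B,\ldots) \to Y^1\uparrow],\ \ldots, \\
\quad [\mathit{do\_a}(A,B) \to u^1\uparrow\ \;[]\;\ \mathit{no\_a}(A,B) \to u^0\uparrow], \\
\quad [\mathit{do\_b}(A,B) \to v^1\uparrow\ \;[]\;\ \mathit{no\_b}(A,B) \to v^0\uparrow],\ \ldots; \\
\quad [u^1 \to A^a\uparrow\ \;[]\;\ u^0 \to \mathit{skip}],\ [v^1 \to B^a\uparrow\ \;[]\;\ v^0 \to \mathit{skip}],\ \ldots; \\
\quad en\downarrow; \\
\quad [X^a \to X^0\downarrow, X^1\downarrow],\ [Y^a \to Y^0\downarrow, Y^1\downarrow],\ \ldots, \\
\quad (u^0\downarrow, u^1\downarrow;\ [(\neg A^0 \wedge \neg A^1) \vee \neg A^a \to A^a\downarrow]), \\
\quad (v^0\downarrow, v^1\downarrow;\ [(\neg B^0 \wedge \neg B^1) \vee \neg B^a \to B^a\downarrow]),\ \ldots; \\
\quad en\uparrow\,]
\end{array}$$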
Similarly to the conditional output HSE, the guards for $A^a\downarrow$ and $B^a\downarrow$ are weakened to allow the vacuous case. Also, the skip again poses a problem, since it makes no change in the state. However, with the $u^0$ and $v^0$ variables, it is possible to infer the skip and generate the correct guard for $en\downarrow$. On the reset phase, the $u$ and $v$ must return to the neutral state. There are several places to put this, but the symmetric placement, which sequences them with the $A^a\downarrow$ and $B^a\downarrow$, simplifies the PRS.
In many cases, this general template can be greatly simplified. For instance, if a set of unconditional inputs completely controls the conditions for reading the others, these can be thought of as the control inputs. If raising the acknowledges of the various inputs is sequenced so that the conditional ones precede the control ones, then the variables $u$ and $v$ may be eliminated without causing stability problems. Also, in some cases the $u$ and $v$ may be substituted with an expression of the outputs instead of stored separately.
As a concrete example, we derive the circuit for the merge process, which reverses the split of the last section by conditionally reading one of two data input channels $A$ and $B$ to the single output channel $X$, based on a control input $M$. The CSP is $*[M?m;\ [\neg m \to A?x\ \;[]\;\ m \to B?x];\ X!x]$. Here we use the simplification of acknowledging the data inputs $A$ and $B$ before the control input $M$. The PCHB reshuffling is
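PCHB_MERGE $\equiv$
$$\begin{array}{l}
*[\,[X^e \wedge (M^0 \wedge A^0 \vee M^1 \wedge B^0) \to X^0\uparrow \\
\quad\ \;[]\;\ X^e \wedge (M^0 \wedge A^1 \vee M^1 \wedge B^1) \to X^1\uparrow]; \\
\quad [M^0 \to A^e\downarrow\ \;[]\;\ M^1 \to B^e\downarrow]; \\
\quad M^e\downarrow; \\
\quad [\neg X^e \to X^0\downarrow, X^1\downarrow]; \\
\quad [(\neg A^0 \wedge \neg A^1 \wedge \neg M^0) \vee A^e \to A^e\uparrow], \\
\quad [(\neg B^0 \wedge \neg B^1 \wedge \neg M^1) \vee B^e \to B^e\uparrow]; \\
\quad M^e\uparrow\,]
\end{array}$$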
A subtle simplification used here is to make $A^e\uparrow$ and $B^e\uparrow$ check the corresponding $\neg M^0$ and $\neg M^1$. This reduces the guard condition for $M^e\uparrow$ and makes the reset phase symmetric with the set phase. We do some decomposition to add $A^v$, $B^v$, and $X^v$ to do validity and neutrality checks. After bubble-reshuffling, the PRS is
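$$\begin{array}{ll}
A^0 \vee A^1 \to \overline{A^v}\downarrow & \neg A^0 \wedge \neg A^1 \to \overline{A^v}\uparrow \\
B^0 \vee B^1 \to \overline{B^v}\downarrow & \neg B^0 \wedge \neg B^1 \to \overline{B^v}\uparrow \\
\neg\overline{A^v} \to A^v\uparrow & \overline{A^v} \to A^v\downarrow \\
\neg\overline{B^v} \to B^v\uparrow & \overline{B^v} \to B^v\downarrow \\
M^e \wedge X^e \wedge (M^0 \wedge A^0 \vee M^1 \wedge B^0) \to \overline{X^0}\downarrow & \neg M^e \wedge \neg X^e \to \overline{X^0}\uparrow \\
M^e \wedge X^e \wedge (M^0 \wedge A^1 \vee M^1 \wedge B^1) \to \overline{X^1}\downarrow & \neg M^e \wedge \neg X^e \to \overline{X^1}\uparrow \\
\neg\overline{X^0} \to X^0\uparrow & \overline{X^0} \to X^0\downarrow \\
\neg\overline{X^1} \to X^1\uparrow & \overline{X^1} \to X^1\downarrow \\
\neg\overline{X^0} \vee \neg\overline{X^1} \to X^v\uparrow & \overline{X^0} \wedge \overline{X^1} \to X^v\downarrow \\
A^v \wedge M^0 \wedge X^v \to A^e\downarrow & \neg A^v \wedge \neg M^0 \wedge \neg X^v \to A^e\uparrow \\
B^v \wedge M^1 \wedge X^v \to B^e\downarrow & \neg B^v \wedge \neg M^1 \wedge \neg X^v \to B^e\uparrow \\
\neg A^e \vee \neg B^e \to \overline{M^e}\uparrow & A^e \wedge B^e \to \overline{M^e}\downarrow \\
\overline{M^e} \to M^e\downarrow & \neg\overline{M^e} \to M^e\uparrow
\end{array}$$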

As usual for PCHB reshufflings, most of the work is done in a large network of n-transistors. The circuit is shown in the PCHB_MERGE figure.
Figure: PCHB_MERGE
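At the token level, the merge consumes one control token per cycle and reads exactly one data token from the selected input. The sketch below is an illustrative model with an invented function name; it ignores the handshake and only shows which input is read on each cycle.

```python
def merge(control, a_tokens, b_tokens):
    """Token-level merge: for each control token m, read from A when m is
    0 and from B when m is 1, and send the value on the single output."""
    a_iter, b_iter = iter(a_tokens), iter(b_tokens)
    out = []
    for m in control:
        # Only the selected input is read (and so acknowledged) this cycle.
        out.append(next(b_iter) if m else next(a_iter))
    return out
```

Feeding the merge the same control sequence that a split used, together with the split's two output streams, restores the original data order.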
 
 Internal state
One final extension to this design style is the ability to store internal state from one cycle to the
next. A CSP template for a state-holding process with state variable s is
$P \equiv s := \mathit{initial}_s;$
$*[A?a, B?b, \ldots;\ X!f(s,a,b,\ldots),\ Y!g(s,a,b,\ldots),\ \ldots;\ s := h(s,a,b,\ldots)]$
This can be implemented in a variety of ways. The simplest, which requires no new circuits,
is to feed an output of a normal pipelined cell back around to an input via several buffer
stages. One of these feedback buffers is initialized containing a token with the value of the initial
state. Enough buffers must be used to avoid deadlock, and even more are needed to maximize
the throughput, as discussed in chapter 2. Therefore this solution can be quite large. For
control circuitry, where area is less of an issue, this is often adequate. As an added benefit,
the feed-forward portion of the state machine can be implemented as several sequential stages
of pipelined logic, which correspondingly reduces the number of feedback buffers necessary and
allows far more complicated functions.
Aside from using feedback buffers, there are three main approaches to retaining state, of
increasing generality and complexity. First, pipelining channels by themselves store state. Usually
these values move forward down the pipeline, passing through each stage only once. However, if
a stage uses but does not acknowledge its input, the input value will still be there on the next
cycle. Essentially the token is stopped and sampled many times. In CSP this can be expressed
with the probe of the value of the channel. A conditional-input type of circuit is used, which uses
an input to produce outputs without acknowledging that input. This technique can be used for
certain problems. For example, a loop unroller could take an instruction on the input channel
and produce many copies of it on an output channel, based on a control input. Of course, this
type of state variable can never be set, only read one or more times from an input.
If the state variable is exclusively set or used in a cycle, a simple modification of the standard
pipelined reshufflings will suffice. The state variable s is assigned to a dual-rail value at the
same time the outputs are produced. On the reset phase it remains stable. Unlike the usual
return-to-zero variables, s will only briefly transition through neutrality between valid states.
If s doesn't change, it does not go through a neutral state at all. The CSP for this behavior
is expressed just like P, except the semicolon before the assignment to s is replaced with a
comma. This is made possible by the assumption that s only changes when the outputs X and
Y do not depend on it; this avoids any stability problems.
The only tricky thing about deriving the HSE for this is the assignment statement. Basically,
the assignment is done by lowering the opposite rail first, then raising the desired rail. This
guarantees that the variable passes through neutral when it changes, and also bubble-reshuffles
nicely. The completion detection of this assignment is basically equivalent to checking that
the value of s corresponds to the inputs to s. So $s := x$ becomes
$[x^0 \rightarrow s^1\downarrow;\ s^0\uparrow\ []\ x^1 \rightarrow s^0\downarrow;\ s^1\uparrow];\ [x^0 \wedge s^0 \vee x^1 \wedge s^1]$.
The PCFB version of the HSE for this type of state-holding process is
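\[
\begin{array}{l}
*[\,[\neg X^a \wedge f^0(s,A,B,\ldots) \rightarrow X^0\uparrow\ []\ \neg X^a \wedge f^1(s,A,B,\ldots) \rightarrow X^1\uparrow], \\
\quad [\neg Y^a \wedge g^0(s,A,B,\ldots) \rightarrow Y^0\uparrow\ []\ \neg Y^a \wedge g^1(s,A,B,\ldots) \rightarrow Y^1\uparrow], \\
\quad [h^0(A,B,\ldots) \rightarrow s^1\downarrow;\ s^0\uparrow\ []\ h^1(A,B,\ldots) \rightarrow s^0\downarrow;\ s^1\uparrow],\ \ldots; \\
\quad A^a\uparrow, B^a\uparrow, \ldots;\ en\downarrow; \\
\quad ([X^a];\ X^0\downarrow, X^1\downarrow),\ ([Y^a];\ Y^0\downarrow, Y^1\downarrow),\ \ldots, \\
\quad ([\neg A^0 \wedge \neg A^1];\ A^a\downarrow),\ ([\neg B^0 \wedge \neg B^1];\ B^a\downarrow),\ \ldots;\ en\uparrow \\
]
\end{array}
\]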
It is often desirable to decompose the completion detection of the state variable into a 4 phase
completion variable $s_v$, which detects the completion of the assignment on the set phase and is
cleared on the reset phase. This makes it easier to have multiple state variables. One thing to
note is that the assignment sequence and completion has 3 transitions if it changes state (lower
one rail, raise the other, raise $s_v$), and therefore often takes more transitions than a typical
output channel. However, on the reset phase, or if the state is unchanged, this only takes 1
transition. Another caveat is that the state variable shown here works best for only dual-rail,
1 bit state variables.
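The ordering of the dual-rail assignment can be illustrated with a small sketch. It models only the rail transitions of $s := x$ and the completion check described above; the function names and the tuple encoding of the rails are assumptions made for the example, and the completion variable $s_v$ is not modeled.

```python
def assign_dual_rail(s, x):
    """Return (new_state, transitions) for the dual-rail assignment s := x.

    s is a pair (s0, s1) of rail values and x is the bit to store.
    The opposite rail is lowered first, then the desired rail is raised,
    so a changing variable passes through the neutral state (0, 0).
    """
    s0, s1 = s
    transitions = []
    if x == 0:
        if s1:
            s1 = 0
            transitions.append("s1-")
        if not s0:
            s0 = 1
            transitions.append("s0+")
    else:
        if s0:
            s0 = 0
            transitions.append("s0-")
        if not s1:
            s1 = 1
            transitions.append("s1+")
    return (s0, s1), transitions

def complete(s, x):
    """Completion detection: the stored rails agree with the input bit."""
    return s == ((1, 0) if x == 0 else (0, 1))

state, moves = assign_dual_rail((1, 0), 1)   # stored 0, now writing 1
print(state, moves, complete(state, 1))
```

Writing the value that is already stored produces no rail transitions at all, which is the case where the variable never passes through neutral.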
As an example of this type of state variable, consider the register process
$*[C?c;\ [\neg c \rightarrow R!x\ []\ c \rightarrow L?x]]$, with the state bit x initialized on reset.
This uses a control channel C to decide whether to read or write the state
bit x via the input and output channels L and R. Obviously the state bit is exclusively used or
set on any given cycle. This process also conditionally communicates on L and R, but since that
was covered in the last two sections, we include it here. The PCHB version of the HSE is
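PCHB REG $\equiv$
\[
\begin{array}{l}
x^0\uparrow, x^1\downarrow; \\
*[\,[\,C^0 \wedge R^e \wedge x^0 \rightarrow R^0\uparrow \\
\quad []\ C^0 \wedge R^e \wedge x^1 \rightarrow R^1\uparrow \\
\quad []\ C^1 \wedge L^0 \rightarrow x^1\downarrow;\ x^0\uparrow \\
\quad []\ C^1 \wedge L^1 \rightarrow x^0\downarrow;\ x^1\uparrow\,]; \\
\quad [\,C^1 \rightarrow L^e\downarrow\ []\ C^0 \rightarrow \mathbf{skip}\,];\ C^e\downarrow; \\
\quad [\neg R^e \vee \neg R^0 \wedge \neg R^1];\ R^0\downarrow, R^1\downarrow; \\
\quad [\neg L^0 \wedge \neg L^1];\ L^e\uparrow; \\
\quad [\neg C^0 \wedge \neg C^1];\ C^e\uparrow \\
]
\end{array}
\]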
The PRS has a few tricky features. Due to the exclusive pattern of the communications, the
rules for $C^e$ can be simplified. The decomposed and bubble-reshuffled PRS follows, and the
circuit is shown in the PCHB REG figure.
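\[
\begin{array}{rcl}
C^e \wedge C^0 \wedge R^e \wedge x^0 & \rightarrow & \overline{R^0}\downarrow \\
C^e \wedge C^0 \wedge R^e \wedge x^1 & \rightarrow & \overline{R^1}\downarrow \\
\neg\overline{R^0} & \rightarrow & R^0\uparrow \\
\neg\overline{R^1} & \rightarrow & R^1\uparrow \\
\neg\overline{R^0} \vee \neg\overline{R^1} & \rightarrow & R_v\uparrow \\
R_v & \rightarrow & \overline{R_v}\downarrow \\
C^e \wedge C^1 \wedge L^0 & \rightarrow & x^1\downarrow \\
C^e \wedge C^1 \wedge L^1 & \rightarrow & x^0\downarrow \\
\neg L^1 \wedge \neg x^1 & \rightarrow & x^0\uparrow \\
\neg L^0 \wedge \neg x^0 & \rightarrow & x^1\uparrow \\
C^e \wedge (x^0 \wedge L^0 \vee x^1 \wedge L^1) & \rightarrow & L^e\downarrow \\
\neg L^e \vee \neg\overline{R_v} & \rightarrow & \overline{C^e}\uparrow \\
\overline{C^e} & \rightarrow & C^e\downarrow \\[1ex]
\neg C^e \wedge \neg R^e & \rightarrow & \overline{R^0}\uparrow \\
\neg C^e \wedge \neg R^e & \rightarrow & \overline{R^1}\uparrow \\
\overline{R^0} & \rightarrow & R^0\downarrow \\
\overline{R^1} & \rightarrow & R^1\downarrow \\
\overline{R^0} \wedge \overline{R^1} & \rightarrow & R_v\downarrow \\
\neg R_v & \rightarrow & \overline{R_v}\uparrow \\
\neg C^e \wedge \neg L^0 \wedge \neg L^1 & \rightarrow & L^e\uparrow \\
L^e \wedge \overline{R_v} & \rightarrow & \overline{C^e}\downarrow \\
\neg\overline{C^e} \wedge \neg C^0 \wedge \neg C^1 & \rightarrow & C^e\uparrow
\end{array}
\]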
Figure: PCHB REG (circuit schematic with one C-element; signals C0, C1, Ce, L0, L1, Le, R0, R1, Rv, Re, X0, X1)
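At the CSP level the register process is easy to exercise. The sketch below is a behavioral illustration only; the name `register`, the list-based channels, and the reset value `initial=0` are assumptions made for the example rather than details taken from the circuit.

```python
def register(control, writes, initial=0):
    """Behavioral model of *[C?c; [~c -> R!x [] c -> L?x]].

    control: sequence of bits on C; 0 reads x onto R, 1 writes x from L.
    writes:  values arriving on L, consumed only on write cycles.
    initial: assumed reset value of the state bit x.
    Returns the sequence of values sent on R.
    """
    writes = iter(writes)
    x = initial
    reads = []
    for c in control:
        if c:
            x = next(writes)   # write cycle: the state bit is set, not used
        else:
            reads.append(x)    # read cycle: the state bit is used, not set
    return reads

print(register([0, 1, 0, 0, 1, 0], writes=[1, 0]))
```

In every cycle the state bit is either used or set, never both, which is the property that lets the simpler state-variable scheme apply.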
Finally the most general form of state holding cell is one where the state variable can be used
and set in any cycle In order to do this it is necessary to have separate storage locations for the
new state and the old state This may be done by introducing an extra state variable t which
holds the new state until s is used The CSP for this is
P  s  	 
 Aa Bb      X f s a b     Y gs a b     t  hs a b         s  t
When this is converted into an HSE, there are several choices for where to put the assignment
$s := t$. It works best to do this assignment on the reset phase of the channel handshakes. After
the assignment $s := t$, t returns to neutral just like a channel. The PCFB version of this type of
cell is
s  	 
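\[
\begin{array}{l}
*[\,[\neg X^a \wedge f^0(s,A,B,\ldots) \rightarrow X^0\uparrow\ []\ \neg X^a \wedge f^1(s,A,B,\ldots) \rightarrow X^1\uparrow], \\
\quad [\neg Y^a \wedge g^0(s,A,B,\ldots) \rightarrow Y^0\uparrow\ []\ \neg Y^a \wedge g^1(s,A,B,\ldots) \rightarrow Y^1\uparrow], \\
\quad [h^0(s,A,B,\ldots) \rightarrow t^0\uparrow\ []\ h^1(s,A,B,\ldots) \rightarrow t^1\uparrow],\ \ldots; \\
\quad A^a\uparrow, B^a\uparrow, \ldots;\ en\downarrow; \\
\quad ([X^a];\ X^0\downarrow, X^1\downarrow),\ ([Y^a];\ Y^0\downarrow, Y^1\downarrow),\ \ldots, \\
\quad [t^0 \rightarrow s^1\downarrow;\ s^0\uparrow;\ t^0\downarrow\ []\ t^1 \rightarrow s^0\downarrow;\ s^1\uparrow;\ t^1\downarrow],\ \ldots, \\
\quad ([\neg A^0 \wedge \neg A^1];\ A^a\downarrow),\ ([\neg B^0 \wedge \neg B^1];\ B^a\downarrow),\ \ldots;\ en\uparrow \\
]
\end{array}
\]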
The assignment statements may be compiled into production rules as before. Of special interest
is the compilation of the sequence
$[t^0 \rightarrow s^1\downarrow;\ s^0\uparrow;\ t^0\downarrow\ []\ t^1 \rightarrow s^0\downarrow;\ s^1\uparrow;\ t^1\downarrow]$.
Due to correlations of the data, this compiles into the simple bubble-reshuffled production rules

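\[
\begin{array}{rcl}
\neg en \wedge t^0 & \rightarrow & s^1\downarrow \\
\neg en \wedge t^1 & \rightarrow & s^0\downarrow \\
\neg s^1 \wedge t^0 & \rightarrow & s^0\uparrow \\
\neg s^0 \wedge t^1 & \rightarrow & s^1\uparrow \\
\neg en \wedge s^0 & \rightarrow & t^0\downarrow \\
\neg en \wedge s^1 & \rightarrow & t^1\downarrow
\end{array}
\]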
The $s^0$ and $s^1$ should also be reset to the correct initial value. The completion of this sequence
is just the normal check for $\neg t^0 \wedge \neg t^1$. If the state variable doesn't change, this
sequence takes only 1 transition, since the first 4 rules are vacuous. If the state changes, it takes
3 transitions. This is 2 transitions longer than the reset of a normal output channel, so this should
be considered when optimizing the low-level production rule decomposition. This type of structure
only works well if s and t are dual-rail, although several dual-rail state variables can be used in
parallel to encode more states.
Of the various types of state-holding cells, the more restricted versions generally have simpler
and faster implementations, and should therefore be used if possible. For the most general case,
either a pair of state variables should be used or, if area is not an issue, a feedback loop of buffers.
  Conclusions
This chapter has presented a guide to designing asynchronous cells which combine buffering
and computation functions. Three main types of handshaking reshufflings have proved superior
for different circumstances. The weak-condition half-buffer variety works well for buffers and
copies without logic. The precharge-logic half-buffer is the simplest good way to implement
most logic cells. The precharge-logic full-buffer has an advantage in speed and is good at
decoupling the handshakes of neighboring units. It should be used when necessary to improve
the throughput. In addition, extensions to these cells which allow for conditionally receiving
inputs or conditionally sending outputs were explained. Finally, various approaches to storing
internal state in the cells were presented.
How far can these techniques go? An entire digital filter was designed using only WCHB and
PCHB cells. Even a complete asynchronous microprocessor, the MiniMIPS, uses basically
these types of pipelined cells. The MiniMIPS busses are all varieties of full-buffers, the datapath
logic and control logic are usually PCHB, and much of the control distribution and buffering is
WCHB. Even cells as unusual as the caches were essentially implemented as one giant PCHB cell
which has an exclusively used/assigned state variable structure, plus a few low-level transistor
tricks for the SRAM cells themselves. The register locking is a little trickier, but all its input
and output handshakes follow the standard PCFB approach. Basically, these techniques can
account for almost all of the design of any type of asynchronous circuit, and strongly influence
all design decisions. Other possibilities may be explored as the design constraints require, but
these techniques form a default option for any asynchronous circuit implementation.
The prior state of the art was to use un-pipelined weak-condition logic. Extra buffers or
registers would be added between blocks of logic to add some pipelining. This approach was
smaller but much slower. The extra buffers also increased the forward latency. Essentially, in
the limit of using more and more buffers, they should eventually be merged into the logic, and all
cells should be maximally pipelined. That is, any discrete stage of logic gets its own pipelining,
so that no more slack could be added without just throwing in excess buffers. In practice, the cost
of such fine pipelining amounts to a  % to  % increase in area over a completely un-pipelined
circuit. It reduces the latency, since no separate buffers are added, and of course increases the
throughput. At this natural limit of pipelining, all handshakes between neighboring cells require
a small number of transitions per cycle, typically  to  . The internal cycles usually keep up.
This yields a very high peak throughput, comparable to  transition per cycle hyper-pipelined
synchronous designs like the DEC Alpha, but is more easily composable. However, composing
fast pipelined cells in various patterns can yield much lower system throughputs unless special
care is taken to match the latencies as well as the throughputs of the units. This is discussed in
the next chapter.
 Pipeline Dynamics Slack Matching and Transistor
Sizing
Many techniques can be used to optimize the performance of a circuit, measured as latency,
throughput, energy, or any other metric. The most significant optimizations can be made at
the highest level, in the selection of algorithms and the decomposition of processes. Once the
processes and topology of communication are fixed, further improvements can be made by
improving the transistor implementations, as described in Chapter 1. Final optimization can use
the techniques of slack matching and transistor sizing. Slack matching is the insertion or
removal of slack (buffer capacity) along channels. It is most relevant in a highly pipelined design.
Transistor sizing is the selection of transistor widths to optimize performance. This chapter
investigates the proper use of slack matching and its influence on transistor sizing. The emphasis
is on optimizing throughput.
  Pipeline Dynamics
Removing slack from channels may introduce deadlock if the slack of a channel is not greater
than the number of tokens it must contain. If the processes do not rely on inverted probes of
channels or inherent disjunctions, as is normally the case, then adding slack to channels will not
cause an error, since the process has no means of distinguishing the slack of a channel. With
the deadlock restriction in mind, it is possible to add or remove slack in two ways. One is to
alter the slack inherent in the communicating processes by varying the amount of logic between
latches. As suggested in Chapter 1, the overhead of making all processes introduce slack
$\frac{1}{2}$ or 1 between inputs and outputs is fairly small and can substantially improve the
throughput. The second way of altering the slack is to add buffers to channels.
In either case, to use slack matching as an optimization tool, we must understand the effect
it has on the performance of the system. While the critical path may always be computed and
used to check the effects of adding or removing slack, a simplified model of pipeline behavior can
make the analysis much easier.

  Linear Pipelines
Figure: Timing Diagram of Halfbuffer Pipeline (events L+, L-, [Re & L], Le- R+, Le+ R-, [~Re & ~L], Re+, Re-; delays f+, f-, b+, b-)
Suppose we build a linear pipeline out of the WCHB BUF halfbuffer of chapter 1. The Timing Diagram
of Halfbuffer Pipeline figure is a timing chart for this type of cell. Time progresses downward and
the pipeline flows from left to right. Only four delays are necessary to describe the cells: the
forward latencies $f^+$ and $f^-$, and the backward latencies $b^+$ and $b^-$. Starting at the wait
$[R^e \wedge L]$, we can traverse the directed graph until it gets back to another $[R^e \wedge L]$
in the same cell after N cycles. The cycle time of this path is the sum of all latencies on it divided
by N. There are only three single-cycle ($N = 1$) paths. They have cycle times of
$f^+ + b^+ + f^- + b^-$, $f^+ + f^+ + b^+ + b^-$, and $f^- + f^- + b^+ + b^-$.
There are arbitrarily many multi-cycle paths, but they turn out to be combinations of the three
single-cycle paths. Therefore the critical path has cycle time
$\tau = 2\max(f^+, f^-) + b^+ + b^-$.
We wish to analyze the throughput $t = \Theta(x)$ as a function of the number of tokens in the
pipeline, x. Both t and x are taken to be averages in steady-state operation. Steady state means
the throughput measured at the input and output of the pipeline is equal and remains roughly
constant over time. Let the pipeline consist of N stages.
First of all, we know that if we have no tokens in a pipeline, it will have zero throughput.
Also, we can never have more tokens in the pipeline than it has buffer capacity, henceforth called
static slack s. In the case of the halfbuffers, $s = \frac{N}{2}$. We introduce holes to indicate the
absence of a token, where the number of holes is $s - x$. When the pipeline is full, none of the
tokens has a hole to move into, so the throughput is also zero.
If we introduce tokens at widely spaced intervals, the tokens flow forward through an empty
pipeline and exit after the total forward latency, which is $N\max(f^+, f^-)$. The max is needed
here because if the $f^-$ is slower than the $f^+$, the tail of the token lags behind the head and will
slow down the tokens behind it. The cycle time is the forward latency divided by the number of
tokens, $\frac{N\max(f^+, f^-)}{x}$.
Starting from a full pipeline, we can look at the problem from the point of view of holes moving
backward. Once again, if the holes are separated widely in time, they will not affect each other
and will exit after the total backward latency, which is $\frac{N}{2}(b^+ + b^-)$. Here the cycle
time is $\frac{\frac{N}{2}(b^+ + b^-)}{s - x}$. The intersection of the forward and backward
constraints is found to be at x equal to the forward latency over the cycle time $\tau$, with a peak
throughput of $\frac{1}{\tau}$. The throughput versus tokens graph is a triangle.
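The halfbuffer relations above can be checked numerically with a short sketch. The function name, the dictionary keys, and the delay values (given in arbitrary time units) are illustrative assumptions, not figures from the simulations.

```python
def halfbuffer_pipeline(n_stages, f_up, f_down, b_up, b_down):
    """Steady-state parameters of a linear half-buffer pipeline.

    All delays share one arbitrary time unit.
    """
    f = max(f_up, f_down)                 # the slower forward edge limits each token
    cycle_time = 2 * f + b_up + b_down    # critical single-cycle path
    forward_latency = n_stages * f
    return {
        "cycle_time": cycle_time,
        "peak_throughput": 1 / cycle_time,
        "static_slack": n_stages / 2,     # half a token per stage
        "dynamic_slack": forward_latency / cycle_time,
    }

p = halfbuffer_pipeline(n_stages=10, f_up=2, f_down=2, b_up=3, b_down=3)
print(p)
```

With these assumed delays the forward constraint (N max(f) / x) and the backward constraint ((N/2)(b+ + b-) / (s - x)) both evaluate to the same cycle time at the computed dynamic slack, which is the apex of the triangle.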
The other buffer reshufflings produce similar results, except that they may have internal cycles
which are slower than the handshake ones, which can cut off the top of the triangle and make it
a trapezoid. The fullbuffer has a static slack s of 1 per stage, but otherwise has similar forward
latency and peak cycle time, so only the right leg of the triangle is usually much different. Ted
Williams has done this analysis and calls these regions of the trapezoid data limited,
handshake limited, and bubble limited.
A new term, dynamic slack d, is defined to indicate the number of tokens at which the
throughput peaks. It is a dimensionless unit like static slack. In cases where the throughput
curve is a trapezoid, the left edge of the flat peak is at the minimum dynamic slack $d_{min}$, and
the right edge is at the maximum dynamic slack $d_{max}$. Dynamic slack is inversely proportional
to Ted Williams' dynamic wavelength, but it is a useful concept because it is dimensionless and
analogous to static slack.
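The equations governing the throughput versus tokens trapezoid of a linear pipeline with peak
throughput T, dynamic slack $d_{min}$ to $d_{max}$, and static slack s are
\[
\Theta(x) = \begin{cases}
T\,\dfrac{x}{d_{min}} & \text{if } x \le d_{min} \\
T & \text{if } d_{min} \le x \le d_{max} \\
T\,\dfrac{s - x}{s - d_{max}} & \text{if } d_{max} \le x
\end{cases}
\qquad t \le \Theta(x)
\]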
The number of tokens in the pipeline is equal to the forward latency over the cycle time, and
likewise for the holes and the backward latency. This always holds in steady state, since the
tokens are evenly spaced in time, such that the number in a pipeline will be its latency divided
by the time separation of tokens (the cycle time). Equivalently, the forward latency is equal to the
number of tokens over the throughput. Therefore the forward and backward latency as a function
of tokens in the pipeline can be inferred from the throughput versus tokens curve. The forward
latency remains constant as the throughput increases linearly with the number of tokens. When
and if the internal cycle limit takes effect, the forward latency increases in proportion
to the number of tokens, since the throughput remains constant in this region. Finally, when
the pipeline is bubble limited, the forward latency increases as $\frac{x}{s - x}$, which
asymptotically approaches infinity as the pipeline fills up. A similar graph is made for backward
latency. The equations are
\[ f(x) = \frac{x}{\Theta(x)}, \qquad b(x) = \frac{s - x}{\Theta(x)} \]
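The trapezoid and the latencies derived from it can be written down directly. This is a minimal sketch under stated assumptions: the function names are illustrative, the throughput is taken to be zero outside 0 < x < s, and the numbers in the example (T = 100, dynamic slack 2 to 3, static slack 5) are made up.

```python
def throughput(x, T, d_min, d_max, s):
    """Throughput versus tokens trapezoid of a linear pipeline."""
    if x <= 0 or x >= s:
        return 0.0                        # empty or full pipelines do not move
    if x < d_min:
        return T * x / d_min              # data limited
    if x <= d_max:
        return float(T)                   # handshake (internal cycle) limited
    return T * (s - x) / (s - d_max)      # bubble limited

def forward_latency(x, T, d_min, d_max, s):
    """Tokens divided by throughput."""
    return x / throughput(x, T, d_min, d_max, s)

def backward_latency(x, T, d_min, d_max, s):
    """Holes divided by throughput."""
    return (s - x) / throughput(x, T, d_min, d_max, s)

params = dict(T=100, d_min=2, d_max=3, s=5)
for x in (1, 2, 2.5, 3, 4):
    print(x, throughput(x, **params), forward_latency(x, **params), backward_latency(x, **params))
```

In the data-limited region the forward latency stays constant, and in the bubble-limited region the backward latency stays constant, as the text describes.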
The throughput versus tokens characterization $\Theta(x)$ has enough information to predict both
steady-state throughput and latency. It takes four parameters to completely describe a
throughput trapezoid. The parameters may include any four of static slack, minimum dynamic slack,
maximum dynamic slack, peak throughput, forward latency, or backward latency. Only static
slack can be calculated from a circuit diagram; the others require a timing simulation or
estimate. Note that dynamic slack may vary independently of static slack, although it must remain
bounded by zero and the static slack. This means that the peak is not necessarily at half the
static slack, as common rules of thumb imply.
Figure: Two Linear Pipelines (A and B)
The Throughput Versus Tokens for Two Linear Pipelines figure displays throughput versus tokens
triangles for two pipelines made up of the weak-condition logic halfbuffer. These two buffers are
shown schematically in the Two Linear Pipelines figure. The pipelines were designed with different
static slack, transistor sizes, dynamic slack, and peak throughput. They have no internal cycles.
However, they do appear to have a slightly chopped-off peak. This is because the coincident arrival
of two inputs into a C-element has a somewhat larger switching delay, since the early input can't
precharge some of the channel. Data points were taken from HSPICE simulations for a  μm CMOS
process and fit with a triangle. The A pipeline has static slack , dynamic slack , and peak
throughput  MHz. The B pipeline has static slack , dynamic slack , and peak throughput  MHz.
   Composition of Linear Pipelines
Only trivial circuits consist entirely of a homogeneous linear pipeline. Many computations combine
different types of pipelines that branch and join and feed back in loops. Now that we understand
the dynamic behavior of a homogeneous linear pipeline, we can use that to build up to more
complicated systems.
Figure: Throughput Versus Tokens for Two Linear Pipelines (throughput in MHz versus tokens; fitted curves Ta(x), Tb(x); data sets ring_a.dat, ring_b.dat)
The general approach is to introduce independent variables for each of N homogeneous linear
pipelines, $\{x_1 \ldots x_N\}$, that indicate the number of tokens in that pipeline in steady state.
Each section of linear pipeline must operate within its throughput versus tokens trapezoid:
\[ \langle\, \forall i : 1 \le i \le N : t_i \le \Theta_i(x_i) \,\rangle \]
Connections between the pipelines amount to additional constraints on the system. Many
forms of synchronization can be described by the constraint
\[ \langle\, \forall j, k : j, k \in S : \frac{t_j}{w_j} = \frac{t_k}{w_k} \,\rangle \]
This means that the throughputs of the pipelines of set S are constrained to be equal,
with the possible use of a weighting factor $w_j$. This synchronization can describe
the constraint introduced between the inputs and outputs of a buffering logic cell that always
receives from all inputs and sends to all outputs (all $w_j = 1$). A series connection is just a
synchronization constraint applied to two pipelines with $w = 1$. A strictly interleaved split is
modeled as a three-way synchronization with $w = 1$ for the input and $w = \frac{1}{2}$ for the
two outputs. Similarly, a strictly interleaved merge is modeled with $w = \frac{1}{2}$ for the two
inputs and $w = 1$ for the output.
Unfortunately this type of synchronization implies that the throughputs of each of the con	
nected pipelines are in steady state If the synchronization is more complicated and produces
non uniformly spaced tokens the assumptions all break down This happens if the token ow is
data dependent or if the unit produces outputs in bursts For these cases the simpli
ed steady
state model of pipeline dynamics does not apply and a full search for the critical path given a
closed environment may be necessary
Rings can introduce additional constraints on the $x_i$. To be in steady state, the total number
of tokens in the ring must remain constant, so $x_1 + x_2 + \cdots + x_N = C$, where the x's are
for the segments around the ring and C is the total. C is also known initially, since upon reset the
buffers are known to be full or empty. Branching and joining paths also constrain the number of
tokens in each branch to be equal, with a constant offset if any tokens were created in the branches
on reset. The weights along the paths must also be equal to remain in steady state.
Once all the constraints are known, the legal range of $t_i$ can be solved as a function of the $x_i$.
If the pipelines are all connected to each other in some manner via the synchronization model,
then all the throughputs will be multiples of each other, and one pipeline (usually an input or
output) can be chosen to have the system throughput. All of these constraints are piecewise
linear and convex, so this is a straightforward linear programming problem to solve for the
system throughput as a function of the $x_i$. Since CMOS circuits transition as early as they can,
the steady-state behavior will converge to the fastest allowed operating points. There may be
many degenerate solutions for $x_i$ at the peak throughput, but only the throughput is important.
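As a small illustration of solving these constraints, consider a ring made of two linear segments that share a fixed number of tokens. The sketch below replaces the linear program with a brute-force scan over the token distribution, which is an assumption made to keep the example dependency-free; the function names and the segment parameters are likewise illustrative.

```python
def trapezoid(x, T, d_min, d_max, s):
    """Throughput versus tokens trapezoid of one linear segment."""
    if x <= 0 or x >= s:
        return 0.0
    if x < d_min:
        return T * x / d_min
    if x <= d_max:
        return float(T)
    return T * (s - x) / (s - d_max)

def ring_throughput(tokens, seg_a, seg_b, steps=1000):
    """Best steady-state throughput of a two-segment ring holding `tokens` in total.

    Both segments must run at the same throughput (series synchronization, w = 1),
    and x_a + x_b = tokens. The fastest legal distribution is found by scanning x_a.
    """
    best = 0.0
    for i in range(steps + 1):
        x_a = tokens * i / steps
        x_b = tokens - x_a
        t = min(trapezoid(x_a, **seg_a), trapezoid(x_b, **seg_b))
        best = max(best, t)
    return best

seg = dict(T=100, d_min=2, d_max=2, s=5)   # hypothetical triangle for each segment
for c in (2, 4, 10):
    print(c, ring_throughput(c, seg, seg))
```

Taking the maximum over distributions reflects the observation that the circuit converges to the fastest allowed operating point. A real tool would pose the same constraints as a linear program instead of scanning.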
  Examples of Composite Circuits
The effect of various compositions on the two pipelines A and B of the Two Linear Pipelines figure
is illustrated for several simple but common cases. If a pipeline is connected in a ring, the number
of tokens in it remains constant, and the throughput can be predicted directly from its throughput
versus tokens curve, as in the Throughput Versus Tokens for Two Linear Pipelines figure.
Figure: Parallel Composition of Two Pipelines (A and B in parallel; all weights 1)
The Throughput Versus Tokens for Parallel Composition of Two Pipelines figure shows the behavior
when the inputs and outputs of the pipelines are synchronized as in the Parallel Composition of Two
Pipelines figure. Both pipelines must run at the same throughput and must contain the same
number of tokens. The combined throughput versus tokens curve is simply the intersection of
the original curves, which may no longer be a trapezoid but is still convex. As expected, the
resulting peak is not greater than either component. More surprisingly, the peak is actually
smaller than both of the original pipelines. This is of crucial importance in slack matching,
since two equally fast, highly optimized pipelines can perform poorly in parallel if they don't
also have the same dynamic slack. Note that the static slack is irrelevant if the dynamic slacks
match. For the simple but common case of the intersection of two triangles, where one is not
wholly contained in the other, the resulting peak throughput and dynamic slack are found at
the intersection of the right side of the triangle with lesser dynamic slack and the left side of the
triangle with greater dynamic slack. Call the dynamic slack, static slack, and peak throughput
of pipeline A $d_a$, $s_a$, and $T_a$, and likewise for pipeline B. Let $d_a \le d_b$. The
intersection is at
\[ x = \frac{T_a d_b s_a}{T_b s_a - T_b d_a + T_a d_b}, \qquad t = \frac{T_a T_b s_a}{T_b s_a - T_b d_a + T_a d_b} \]
Figure: Throughput Versus Tokens for Parallel Composition of Two Pipelines (throughput in MHz versus tokens; curves T(x), Ta(x), Tb(x); data set parallel_ab.dat)
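The intersection formulas are easy to evaluate for trial values. The following sketch assumes the case described above (two triangles, with the peak on the right leg of the pipeline with lesser dynamic slack and the left leg of the other); the function name and the numbers are illustrative assumptions.

```python
def parallel_peak(T_a, d_a, s_a, T_b, d_b):
    """Peak (tokens, throughput) of two synchronized triangles, assuming d_a <= d_b.

    The right leg of A, T_a (s_a - x) / (s_a - d_a), is intersected with
    the left leg of B, T_b x / d_b.
    """
    denom = T_b * s_a - T_b * d_a + T_a * d_b
    x = T_a * d_b * s_a / denom
    t = T_a * T_b * s_a / denom
    return x, t

# Matched dynamic slack: the full peak throughput is kept.
print(parallel_peak(T_a=100, d_a=4, s_a=10, T_b=100, d_b=4))
# Extreme mismatch with equal T and s: d_a = 0 and d_b = s.
print(parallel_peak(T_a=100, d_a=0, s_a=10, T_b=100, d_b=10))
```

With equal peak throughput and static slack, the extreme mismatch halves the throughput and places the peak at half the static slack, while matched dynamic slacks lose nothing.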
The degradation of throughput due to dynamic slack mismatch between parallel pipelines is
fairly gradual. Suppose the peak throughput and static slack of the two pipelines were equal.
The worst possible dynamic slack mismatch would be when $d_a = 0$ and $d_b = s$. This would yield
a throughput of $t = \frac{T}{2}$ with $x = \frac{s}{2}$. Less extreme mismatches run faster.
Figure: Series Composition of Two Pipelines (A followed by B; weights 1)
Figure: Throughput Versus Tokens for Series Composition of Two Pipelines (throughput in MHz versus tokens; curves T(x), Ta(x), Tb(x); data set series_ab.dat)
The Throughput Versus Tokens for Series Composition of Two Pipelines figure shows the behavior
when the two pipelines are connected in series, as in the Series Composition of Two Pipelines
figure. Although there are different numbers of tokens in the two pipelines, it is not important
to distinguish how many tokens are in each part. In fact, at most throughputs there will be
many degenerate solutions with the same throughput but a different distribution of tokens.
Instead we consider the throughput versus the total number of tokens, allowing the distribution
to automatically optimize itself. For each particular throughput we can see what range of tokens
could be in each pipeline, as given by the triangle. To get the resulting range, we add the small
ends and the large ends of the ranges. Obviously we can't have a throughput higher than the
slower pipeline. This method traces out the trapezoid of the series composition figure. For the
simple case of two triangles, this leads to the following equations for T, $d_{min}$, $d_{max}$:
\[ T = \min(T_a, T_b) \]
\[ d_{min} = d_a\frac{T}{T_a} + d_b\frac{T}{T_b} \]
\[ d_{max} = s_a - \frac{T}{T_a}(s_a - d_a) + s_b - \frac{T}{T_b}(s_b - d_b) \]
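The series equations can be exercised with a short sketch. The function name and the two example triangles are assumptions chosen only to make the arithmetic easy to follow; the static slack of the composite is taken to be the sum of the two static slacks.

```python
def series(T_a, d_a, s_a, T_b, d_b, s_b):
    """Trapezoid parameters of two triangular pipelines connected in series."""
    T = min(T_a, T_b)                      # limited by the slower pipeline
    d_min = d_a * T / T_a + d_b * T / T_b  # sum of the small ends of the token ranges
    d_max = (s_a - (T / T_a) * (s_a - d_a)) + (s_b - (T / T_b) * (s_b - d_b))
    return {"T": T, "d_min": d_min, "d_max": d_max, "s": s_a + s_b}

# Hypothetical pipelines: A peaks at 200 with d = 4, s = 10; B peaks at 100 with d = 3, s = 8.
print(series(T_a=200, d_a=4, s_a=10, T_b=100, d_b=3, s_b=8))
```

Here the faster pipeline A, running at half its peak, can hold anywhere from 2 to 7 tokens, while B is pinned at its apex of 3 tokens, so the composite peak is flat from 5 to 10 tokens.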

Figure: Interleaved Composition of Two Pipelines (A and B; weight 1 at the input and output, 0.5 on each branch)
Figure: Throughput Versus Tokens for Interleaved Composition of Two Pipelines (throughput in MHz versus tokens; curves T(x), Ta(x), Tb(x))
The Throughput Versus Tokens for Interleaved Composition of Two Pipelines figure shows the
behavior of strict interleaving between the two pipelines. The constraint here is almost the same
as for parallel pipelines. However, the results are interpreted differently, since the tokens in
each branch are different. This takes the intersection but doubles its throughput and number of
tokens. In practice it is impossible to make a perfect interleaver with an arbitrarily high peak
throughput. A realistic split and merge can be modeled as a short linear
pipeline with reasonable throughput in series with the ideal interleaver.
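A sketch of the ideal interleaver follows from the parallel case: each branch holds half of the tokens and runs at half of the overall throughput. The function names and the example triangles are illustrative assumptions, and the realistic split and merge overhead is not modeled.

```python
def triangle(x, T, d, s):
    """Throughput versus tokens triangle of one linear pipeline."""
    if x <= 0 or x >= s:
        return 0.0
    return T * x / d if x <= d else T * (s - x) / (s - d)

def interleaved_throughput(x_total, pipe_a, pipe_b):
    """Ideal strict interleaving between two branches.

    The branches hold equal numbers of tokens, so the branch curve is the
    intersection of the two triangles; overall tokens and throughput are doubled.
    """
    x_branch = x_total / 2
    return 2 * min(triangle(x_branch, **pipe_a), triangle(x_branch, **pipe_b))

branch = dict(T=100, d=2, s=5)   # hypothetical identical branches
for x in (2, 4, 8):
    print(x, interleaved_throughput(x, branch, branch))
```

Two identical branches that each peak at 100 with 2 tokens give an interleaved peak of 200 with 4 tokens in total.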
  Tree Buffers
A binary tree buffer can be made by recursively interleaving between linear buffers. Realistic
alternating split and merge cells are modeled as slack $\frac{1}{2}$ buffers followed or preceded
by an ideal interleaver. Such buffers have long been known to have desirable properties, including
a forward latency and energy dissipation that is only logarithmic in the static slack. An
understanding of slack matching illustrates several other interesting properties.
To compute the overall throughput versus tokens curve for a binary tree buffer, the easiest
approach is to figure out the range of tokens in each segment for any given overall throughput. At
each interleaving the throughput is reduced by half. Adding up all the lower ends of the ranges
and all the upper ends will give the overall range of tokens at any given overall throughput.
For an example, assume that the tree is k stages deep and therefore interleaves between $2^k$
linear buffers. Each split and merge will be assumed to have an s and d consistent with a
normal buffering cell. A split is followed by a perfect interleaving split, and a merge is preceded
by a perfect interleaving merge. There are a total of n linear buffers in the center, evenly
divided among the $2^k$ paths. Assume each path has $\frac{ns}{2^k}$ total static slack and
$\frac{nd}{2^k}$ total dynamic slack. Assume all peak throughputs are T, and that all the
throughput versus tokens curves are triangles. The total static slack is $ns + 2s(2^k - 1)$.
If a pipeline is run at a throughput $t \le T$, we can calculate the range of tokens it might contain
as follows. The triangle is described by the equations
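\[ t = T\,\frac{x}{d} \quad \text{if } x \le d, \qquad t = T\,\frac{s - x}{s - d} \quad \text{if } x \ge d \]
Given t, we solve for $x_{min}$ and $x_{max}$ as
\[ x_{min} = \frac{t}{T}\,d, \qquad x_{max} = s - \frac{t}{T}(s - d) \]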
For our binary tree buffer we must add up the $x_{min}$ and $x_{max}$ for all the linear buffers in
the tree, for the particular t each one runs at. After some simplification this produces the
expressions
\[ x_{min} = \frac{1}{2^k}\,nd + 2kd \]
\[ x_{max} = ns - \frac{1}{2^k}(ns - nd) + 2s(2^k - 1) - 2k(s - d) \]
Note that the first terms result from the linear buffers in the middle of the tree, while the
final terms account for the slack inherent in the splits and merges. The equations can also be
separated into incremental terms (the coefficients of n) and overhead terms:
\[ \alpha_{min} = \frac{1}{2^k}\,d, \qquad \alpha_{max} = s - \frac{1}{2^k}(s - d) \]
\[ \beta_{min} = 2kd, \qquad \beta_{max} = 2s(2^k - 1) - 2k(s - d) \]
\[ x_{min} = \alpha_{min}\,n + \beta_{min}, \qquad x_{max} = \alpha_{max}\,n + \beta_{max} \]
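The closed forms can be cross-checked by adding up the token range of every segment of the tree at peak overall throughput. The sketch below is an illustration under stated assumptions: the function names are made up, the values d = 0.25 and s = 0.5 per cell and n = 16 are hypothetical (they are not the values used for the table), and each level of splits or merges is assumed to run at half the throughput of the level above it.

```python
def tree_token_range(k, n, d, s):
    """Sum (x_min, x_max) over all segments of a depth-k binary tree buffer at peak throughput."""
    x_min = x_max = 0.0
    for j in range(k):
        r = 1 / 2**j                  # relative throughput t/T of level-j splits and merges
        cells = 2 * 2**j              # 2**j splits plus 2**j merges at this level
        x_min += cells * r * d
        x_max += cells * (s - r * (s - d))
    paths = 2**k
    r = 1 / paths                     # each center path runs at T / 2**k
    path_s, path_d = n * s / paths, n * d / paths
    x_min += paths * r * path_d
    x_max += paths * (path_s - r * (path_s - path_d))
    return x_min, x_max

def closed_form(k, n, d, s):
    """x_min and x_max from the simplified expressions."""
    x_min = n * d / 2**k + 2 * k * d
    x_max = n * s - (n * s - n * d) / 2**k + 2 * s * (2**k - 1) - 2 * k * (s - d)
    return x_min, x_max

for k in range(5):
    print(k, tree_token_range(k, 16, 0.25, 0.5), closed_form(k, 16, 0.25, 0.5))
```

For k = 0 the range collapses to the single point nd, as expected for a plain linear triangle, and the range widens quickly as k grows.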
Suppose d and s take values typical for weak-condition halfbuffering cells.
Then we get the following table
k   $\alpha_{min}$   $\alpha_{max}$   $\beta_{min}$   $\beta_{max}$
The throughput versus tokens curve for a k =  , n =   binary tree buffer is shown in the Throughput
Versus Tokens for Binary Tree Buffer figure. It has static slack  , minimum dynamic slack  , and
maximum dynamic slack  .
As we increase k, $\alpha_{min}$ rapidly approaches 0 and $\alpha_{max}$ approaches s. In effect, the
overall throughput can be maintained with very near the entire range of tokens $0 \le x \le s$. This
can be extremely useful in many ways. In some circuits a ring contains a variable number of tokens
at different times. If the ring was implemented in a linear fashion, it would not always operate
at peak throughput. However, with a binary tree buffer in the ring, it might operate at peak
throughput for a wide range of tokens.
Yet another signi
cant property of a binary tree 
fo is that for some purposes it can be much
smaller than an equivalent linear 
fo Suppose a buer is to operate at peak throughput and
contain a large number of tokens To maintain peak throughput the number of tokens must

Figure  Binary Tree Buer
be within the range of dynamic slack For a linear buer the dynamic slack might be around


of the static slack For a tree buer the maximum dynamic slack can be much closer to the
static slack Since area is proportional to the number of buers and hence the static slack a
binary tree buer that can keep the same number of tokens moving fast might be up to   times
smaller than a linear buer In practice the area overhead of the alternating split and merge
must also be considered but for very large n tree buers this overhead is very small From the
table we see that the 
max
the incremental increase in maximum dynamic slack of a depth
 
fifo is already  % of the static slack, so no excessive wiring is necessary to obtain the area
benefit of using binary tree fifos. Naturally, if the objective is to have large static slack at low
throughput, the linear buffer will still be denser.
 Slack Matching
Pipeline dynamics should be used to guide the design of a pipelined system. Several designs
might be proposed and compared in their expected peak throughputs, but a few guidelines can
be used to produce very good circuits without experimentation.
   Heuristics for Slack Matching
First specify the topology of communication and the static slacks of all cells and channels The
static slack must be enough to avoid deadlock but no more Assume that all buering cells
will have a characteristic dynamic slack d and peak throughput T  The dynamic slack is more
directly relevant than the static slack to fast operation The goal of slack matching is to add
additional dynamic slack so that the whole system will maintain the T of the individual pieces
It is most convenient to target a dynamic slack per cell which is the inverse of an integer
The choice should reect the properties of the buer implementation Since the dynamic slack
is just the forward latency divided by the cycle time an estimate of these 
gures can yield a
reasonable dynamic slack For the weak condition halfbuers with  transitions per cycle d
should probably be


 and for the precharge logic with  transitions per cycle, d should be

	


Figure  Throughput Versus Tokens for Binary Tree Buffer (vertical axis: throughput t/T; horizontal axis: tokens)
It is possible to pick other values like



or


 but the peak throughput will depend on how
closely matched the d is to the buffer implementation. One approach would be to optimize the
throughput of a pipeline independent of the latencies, measure its dynamic slack, and pick the
nearest inverse integer dynamic slack. The reason for fixing the dynamic slack as well as the peak
throughput of all cells is that it becomes much easier to analyze the dynamics of a composition
of pipelines.
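The selection rule described above (dynamic slack as forward latency over cycle time, rounded to the nearest inverse integer) can be sketched roughly as follows. This is only an illustration: the function name and the example latencies are assumptions and do not come from the text.

```python
from fractions import Fraction
import math

def nearest_inverse_integer_slack(forward_latency, cycle_time):
    """Return (m, d) where d = 1/m is the inverse-integer dynamic slack
    closest to the measured value forward_latency / cycle_time."""
    measured = Fraction(forward_latency) / Fraction(cycle_time)
    if not 0 < measured <= 1:
        raise ValueError("expected 0 < forward_latency <= cycle_time")
    # The two inverse integers that bracket the measured dynamic slack.
    lo = math.floor(1 / measured)
    hi = math.ceil(1 / measured)
    # Pick whichever 1/m is closer to the measured value (ties go to lo).
    m = min((lo, hi), key=lambda k: abs(Fraction(1, k) - measured))
    return m, Fraction(1, m)

# Hypothetical figures: latency and cycle time counted in transitions.
cells_per_token, d = nearest_inverse_integer_slack(3, 10)
```

With these made-up inputs the measured slack 3/10 lies between 1/4 and 1/3 and is closer to 1/3, so the sketch would target three cells per token.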
Identify any rings in the topology and predict the number of tokens they will contain. From
pipeline dynamics we know a ring will operate at peak throughput when the number of tokens
is within the range of dynamic slack. If a ring has less dynamic slack than the number of tokens
in it, extra buffers should be added until the dynamic slack equals the number of tokens. If the
dynamic slack of the ring is already greater than the number of tokens in it, the circuit should
be redesigned. The d was chosen to be the inverse of an integer so that an integral number of
buffering cells in a loop will be optimal. For instance, if d 


 there should be  buffers per
token in a loop.
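A minimal sketch of the ring rule, assuming every cell has the same dynamic slack d = 1/m for an integer m. The function name, parameter names, and example numbers are hypothetical.

```python
def buffers_to_add_to_ring(tokens, cells, cells_per_token):
    """Number of extra buffers needed so that the ring's dynamic slack
    (cells / cells_per_token) equals the number of tokens it holds.

    cells_per_token is the integer m with d = 1/m (an assumed input).
    """
    needed_cells = tokens * cells_per_token
    if cells > needed_cells:
        # Dynamic slack already exceeds the token count: adding buffers
        # cannot help, so the text's advice is to redesign the circuit.
        raise ValueError("ring has more dynamic slack than tokens; redesign")
    return needed_cells - cells

# Hypothetical ring: 3 tokens, 8 cells, 4 cells per token.
extra = buffers_to_add_to_ring(tokens=3, cells=8, cells_per_token=4)
```

Because d is an inverse integer, the target cell count is always a whole number, which is the reason the text gives for restricting d in this way.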
If a ring is expected to contain a variable number of tokens, a deep binary tree in the ring
can expand the range of dynamic slack such that the ring keeps operating at peak throughput.
Note that the pipeline dynamics analysis assumes steady-state operation, so it won't work if the
number of tokens in the ring changes throughout the computation. However, if the number is
only changed rarely, as when loading numbers into a ring buffer for use over many cycles, then
the system will probably have time to reach steady state, and the pipeline dynamics analysis will
predict the performance for the most common and time-consuming behavior.

Any paths which branch and later join will have reduced throughput unless they have equal
dynamic slacks. Since we have assumed all cells have the same d, we can simply count the number
of buffering cells along each path. If paths are unequal, extra buffers should be added to the
shorter segment. In a feed-forward pipeline with arbitrary branching and joining, it is always
possible to add buffers until all paths have equal dynamic slack.
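One way to equalize the paths in a feed-forward network is sketched below. It assigns each cell a level equal to the longest path of cells leading to it, then pads every channel so that it spans exactly one level. This leveling scheme is an assumption chosen for simplicity, not a method stated in the text, and it does not try to minimize the number of added buffers. The cell names are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+

def equalize_paths(channels):
    """channels: (src, dst) pairs of cells in an acyclic network.
    Returns {(src, dst): number of extra buffers to insert on that channel}
    so that all paths between any two cells contain the same number of cells."""
    channels = list(channels)
    preds = {}
    for src, dst in channels:
        preds.setdefault(src, set())
        preds.setdefault(dst, set()).add(src)
    # level[c] = number of cells on the longest path leading into c.
    level = {}
    for cell in TopologicalSorter(preds).static_order():
        level[cell] = max((level[p] + 1 for p in preds[cell]), default=0)
    # A channel spanning more than one level is padded with buffers.
    return {(s, d): level[d] - level[s] - 1 for s, d in channels}

# Hypothetical network: a long path a-b-c-d and a shortcut a-d.
padding = equalize_paths([("a", "b"), ("b", "c"), ("c", "d"), ("a", "d")])
```

In this example the shortcut channel from a to d would receive two extra buffers, so both paths from a to d pass through the same number of cells.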
Adding extra buffers to slack match maximizes the throughput but makes the forward latency
the worst case always. In an extreme example, an N-bit ripple adder operating on simultaneously
arriving N-bit numbers will have a latency of  to N depending on the length of the longest
carry chain. The average case is O(log N). The throughput would vary along with the latency.
Slack matching the ripple adder adds a triangle of buffers to delay the upper bits so that they
arrive after the same dynamic slack as the rippling carry. A similar triangle of buffers is added
to realign the output. The throughput now remains constant at the fastest cycle time of a full
adder, but the latency is always N. Several better solutions can be obtained through architectural
modifications. A fast rippling adder like a carry-select or carry-skip adder can minimize the
forward latency while keeping constant high throughput. Alternately, all numbers could be
bit-skewed if that decision isn't inconvenient elsewhere.
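The two triangles of buffers can be pictured with a small sketch. It assumes that one buffer matches the dynamic slack of one full-adder carry stage, so that bit i needs i buffers on its inputs and N - 1 - i on its output; that one-to-one correspondence is an assumption made for illustration, as is the function name.

```python
def ripple_adder_skew_buffers(n_bits):
    """Per-bit buffer counts for slack matching an n_bits ripple adder,
    assuming one buffer per carry stage (an illustrative assumption)."""
    # Bit i must wait for the carry to ripple through i lower stages.
    input_delay = list(range(n_bits))
    # Lower bits finish early and are delayed to realign the result.
    output_delay = [n_bits - 1 - i for i in range(n_bits)]
    return input_delay, output_delay

inputs, outputs = ripple_adder_skew_buffers(4)
total_per_operand = sum(inputs) + sum(outputs)
```

Under this assumption the buffer count grows quadratically with the word length, which is one reason the architectural alternatives mentioned above may be preferable.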
Anywhere dynamic slack is added for the purposes of slack matching, it is possible to use a
binary tree buffer. For the purposes of dynamic slack matching, binary trees can be significantly
smaller, so trees of various depths should be considered to see which is smallest. Due to the
overhead of interleaving, less deep trees will be better for smaller dynamic slack, but deeper
trees will be better for longer dynamic slack. In most cases a linear buffer or depth  tree is
probably appropriate. An additional benefit of the trees is reduced power dissipation.
For moderate amounts of dynamic slack, full-buffers are often superior to half-buffers. If a buffer
is designed to run at higher than the target throughput, it can maintain the system throughput
with a range of dynamic slack, just like a binary tree buffer. Since the right leg of the full-buffer
triangle extends twice as far as that of the half-buffer, its maximum dynamic slack extends farther at
lower system throughputs. Suppose we have a half-buffer and a full-buffer with d 


which peak
at T  Evaluating the range at T shows that the half-buffer has d
max
   but the full-buffer has
d
max
   Therefore we could use almost half the number of these full-buffers to achieve the
same dynamic slack. If this compensates for the increased size of the full-buffer implementation,
the full-buffers will be a smaller way to slack match. Full-buffers were used for most slack matching
in the MiniMIPS.
    Appropriateness of Slack Matching
Slack matching aims to maximize the throughput of a collection of linear pipelines. If the logic
is not buffering, or if early completion units are desired, it is not directly applicable. Also, if the
system does not maintain steady-state operation, the pipeline dynamics analysis does not apply.
Therefore slack matching is best suited for highly pipelined systolic systems with a regular
pattern of computation. This is reasonable for many special-purpose hardware chips such as
digital filters, compression, encryption, and other types of signal processing.
However even in a more complex system such as a CPU slack matching can be used to
improve the design Individual pieces may not always be used at full throughput but slack
matching can be used to improve their momentary peak throughputs Reasonably isolated units

may actually operate at full throughput for many cycles The best examples of this are iterative
computations done by rings such as multiplication or division Slack matching is an essential
element to optimizing the performance of such units Also units can be partially slack matched
so that their peak throughput may not be as high as that of linear pipelines but is adequate
for their use Finally the simpli
ed timing model described by pipeline dynamics may be used
to perform timing simulations of complex systems be reducing the complexity of representing
various subunits
 Transistor Sizing
Given a topology of communicating processes, it is possible to identify the critical path, a
cyclic sequence of transitions which limits the cycle time of the system, as described in 
However, the search for this critical path can be time-consuming, and even when it is found it
may not be obvious how to improve the performance of the system. Transistor sizing can speed
up transitions on the critical path at the cost of slowing down transitions not on the critical
path. If this is done aggressively, most paths will be tied as the critical path, and no further
improvements can be made by transistor sizing.
In the higher-level optimization of slack matching, certain assumptions were made about the
components of the circuit, namely that they had the same peak throughput T and dynamic slack
d. Under these assumptions the overall throughput of the system was optimized. In this context,
the purpose of transistor sizing is to satisfy the assumptions made during slack matching. That
is, the forward and backward latencies and any internal cycles should meet the desired timing
constraints. Once these timing constraints are met, another metric could be optimized, such as
the energy per cycle or estimated area.
These latencies can be computed by adding up the transition delays along the paths. The
transition delays depend on the width of the transistors driving them and the capacitive load they
drive, which is also determined by other transistor widths. Expressions for the worst forward, backward,
and internal delays are computed from the circuit, and the transistor widths are adjusted by
gradient descent until the constraints are met. If the constraints can't be met, less ambitious
timing constraints should be tried. The target latencies should be consistent with the dynamic
slack assumed for the cells, or it would invalidate the slack matching.
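The sizing loop can be illustrated with a toy sketch. It assumes a simple chain of gates in which each stage's delay is its capacitive load divided by its own width, and gate capacitance is taken to equal width. This delay model, the step size, and all numbers are assumptions for illustration only and are not the thesis's actual delay expressions.

```python
def path_latency(widths, load):
    """Sum of stage delays along a chain of gates (toy model).
    Stage i drives the gate capacitance of stage i+1 (taken as its width);
    the last stage drives a fixed load. Delay = capacitance / drive width."""
    loads = list(widths[1:]) + [load]
    return sum(c / w for c, w in zip(loads, widths))

def size_chain(widths, load, target, step=0.05, min_width=1.0, max_iters=10000):
    """Adjust widths by gradient descent until path_latency <= target.
    Returns the sized widths, or None if the constraint was not met."""
    widths = list(widths)
    eps = 1e-6
    for _ in range(max_iters):
        if path_latency(widths, load) <= target:
            return widths
        # Central finite-difference gradient of the latency w.r.t. each width.
        grad = []
        for i in range(len(widths)):
            up = widths[:i] + [widths[i] + eps] + widths[i + 1:]
            down = widths[:i] + [widths[i] - eps] + widths[i + 1:]
            grad.append((path_latency(up, load) - path_latency(down, load)) / (2 * eps))
        # The first gate is held fixed: its width is the load seen upstream.
        for i in range(1, len(widths)):
            widths[i] = max(min_width, widths[i] - step * grad[i])
    return None  # caller should retry with a less ambitious target

sized = size_chain([1.0, 1.0, 1.0], load=27.0, target=12.0)
```

Returning `None` when the target cannot be reached mirrors the advice above to relax the timing constraints and try again.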
The delay of a single node transition can be approximated in many ways. The most popular
method seems to be the tau model, which uses the effective resistance of the transistors in the
channel and the capacitance on the output node to estimate an RC delay for the transition.
An alternative is to construct a table which characterizes the peak currents through various
networks of transistors and to assume the delay is proportional to the capacitive load divided
by the peak current. This latter technique has the advantage of including significant process
non-idealities, most importantly velocity saturation. Parasitic capacitances (non-gate capacitances)
are generally smaller and depend greatly on the layout and physical distance between cells, which
may not be known initially. Hence it is reasonable to initially size the transistors assuming no
parasitic capacitances, or guessing for major sources of capacitance such as long wires. After
doing most of the layout, parasitic capacitances can be extracted from the layout and used to
resize the transistors. Most likely one iteration of this procedure will be adequate, since the
parasitic wiring capacitances don't change much with changing transistor widths, and the change
in diffusion capacitance can be easily predicted.
This approach to transistor sizing, using slack matching assumptions to set timing constraints,
is quite efficient. The only delays that must be evaluated are local to each buffer cell and
depend only on the capacitive loads of other cells on the outputs. Identical cells in identical
environments will give the same delays and therefore need only be evaluated once. The algorithm
will take memory only proportional to the number of unique cells in unique environments in
the circuit. Likewise, the gradient descent will involve similar local calculations. For a fairly
complex application like the digital filter, there are only about  unique cells to size. However,
there are undoubtedly hundreds of thousands of potential critical paths. This approach to
optimizing transistors in an asynchronous pipelined circuit shares much in common with the
usual synchronous approach. Both methods rely only on local constraints. Both constrain the
forward latency, but the asynchronous method must also constrain the backward latency and
internal cycle time.
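The observation that identical cells in identical environments need only be evaluated once amounts to memoizing the delay calculation on a (cell, environment) key. A rough sketch follows, with a placeholder delay model and hypothetical cell names and sizes.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def cell_delay(cell_type, drive_width, output_load):
    """Delay of one cell in one environment. The key (cell_type, drive_width,
    output_load) identifies a unique cell in a unique environment.
    The formula is a placeholder; cell_type only distinguishes cache entries."""
    return output_load / drive_width

def evaluate(instances):
    """Evaluate every instance; repeated keys are served from the cache."""
    return [cell_delay(*inst) for inst in instances]

# Hypothetical netlist: many instances, only two unique cell/environment pairs.
instances = [("buf", 2.0, 4.0)] * 1000 + [("split", 2.0, 8.0)] * 500
delays = evaluate(instances)
unique_evaluations = cell_delay.cache_info().misses
```

Memory and work then scale with the number of distinct keys rather than with the number of instances or the number of potential critical paths.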
 Conclusions
A complex asynchronous circuit may be designed by composing the pipelined cells described in
chapter  , evaluating the pipeline dynamics, and adding extra buffers to increase the system
throughput to match the local handshake throughputs. Transistor sizing is used at the
end to optimize the slowest handshake cycles and perhaps the latencies. Although the slack
matching only applies precisely to circuits with regular systolic patterns of communication, the
same concepts can be applied to portions of a more irregular system such as a microprocessor.
With the new techniques presented in this thesis, asynchronous quasi-delay-insensitive design
has become more competitive with the synchronous alternatives.
References
  A J Martin Compiling Communicating Processes into DelayInsensitive Circuits Distributed
Computing Vol No 	 pp 


	 
 
 UV Cummings AM Lines AJ Martin An Asynchronous Pipelined Lattice Structure Filter
Advanced Research in Asynchronous Circuits and Systems IEEE Computer Society Press 	
  AJ Martin AM Lines et al The Design of an Asynchronous MIPS R Microprocessor
Proceedings of the th Conference on Advanced Research in VLSI IEEE Computer Society Press

 	 R Manohar The Impact of Asynchrony on Computer Architecture PhD Thesis Caltech Tech
nical Report 
  T Williams Latency and Throughput Tradeos in SelfTimed Asynchronous Pipelines and Rings
Stanford Tech Report CSLTR	 May 
  S M Burns Performance Analysis and Optimization of Asynchronous Circuits PhD Disserta
tion Caltech 

