Concurrency reduction of untimed latch protocols - theory and practice by Stevens, Kenneth & Varanasi, Santosh N.
2010 IEEE Symposium on Asynchronous Circuits and Systems
Concurrency Reduction of Untimed Latch Protocols - Theory and Practice
Santosh N. Varanasi, Kenneth S. Stevens, Graham Birtwistle
Electrical and Computer Engineering, University of Utah; Department of Computer Science, The University of Sheffield
Abstract: A systematic investigation into concurrency reduction of untimed asynchronous 4-phase latch controllers is reported. Starting with a state graph that exhibits maximal concurrency, rules are provided for systematically reducing its states and thereby curtailing its behaviors. The rules predict liveness and occupancy, as well as the regularity and behavior of their pipelines. The rules also reveal the precise extent of the design space and thus provide a secure platform on which to study the implications of concurrency reduction on power, performance and area by implementing and evaluating the complete set of abstracted controllers. This complete characterization enhances the understanding and usage of concurrency and its reduction in handshake protocols. Trade-offs have been observed and reported which will aid designers in trying to find the best protocols for a required specification. Finally, the best synthesized protocols in this class have been identified.
I. Introduction
This study is motivated by the desire to gain a better understanding of concurrency and synchronization, to examine the impact composition has on protocols and how to verify such systems, and to validate some assumptions of concurrency reduction on synthesized hardware implementations. It is an extension of previous work which derived a family of untimed 4-phase latch controller protocols where data is bundled and valid before the request signal rises [2]. In this paper we give this family a simpler structural categorization, use it to characterize some behavioral properties, and present experimental results about the VLSI implementation of the complete family.
The origins for this study lie in work carried out designing, specifying and verifying control signal structures over a range of asynchronous microprocessors culminating in variants (one shown in Fig. 1) on the Manchester AMULET3 [11], [1]. When experimenting with architectural changes to data paths by varying pipeline depths and widths, we noticed that upon the minimization of our models down to the smallest equivalent state graph, pipeline width had no impact. Each variation would minimize down to be equivalent to some single pipeline, but rarely one with our initial building blocks.
Each asynchronous pipeline stage consists of control and a data path. When the data path is abstracted out, only the controller with its handshake signals remains. Such a controller is represented as a single stage controller, ST, in Fig. 2. Such single stage controllers are composed in parallel to create series pipelines SPd of arbitrary depth d. Series pipelines are often connected in parallel with fork and join components to form a parallel pipeline PPw,d of width w and depth d. Such structured pipeline segments can be observed in the microprocessor in Fig. 1.

Fig. 1. Abstracted control structure of a simplified AMULET3

The results in this paper address the evaluation of data abstracted linear pipelines where each stage contains the same protocol. Pipelines with feedback and the rich set of timed protocols (e.g. burst-mode and relative timed) are not addressed here.
A. What has been done
The key phases in this study have been:
1. Incubation: Some 40 published latch controller designs (usually specified as STGs) were surveyed and translated to a generalized state graph notation in which internal state variables and latch control signals were hidden whilst retaining the constraints they imposed on how the external pipelining signals interleave. We then ran experiments for each protocol when composed into single pipelines SPd of depths 1..8 and parallel pipelines PPw,d of widths and depths 1..8 (Fig. 2).














Fig. 2. Single (ST), Series (SP), and Parallel (PP) Pipeline Control Graphs
2. Generalization: All designs and their pipelines were now expressed solely in terms of the same external pipeline control signals and could thus be compared. The most concurrent protocol, called max, was identified; each abstracted published design could be completely embedded into it. A systematic method of reducing concurrency by cutting away states from max was developed to generate the complete family of protocols. Upon examination of the 40 published STGs, we noted that the standard way of restricting behavior was to constrain upstream pipeline signals and downstream pipeline signals separately. Each constraint would engender a characteristic pattern of states being removed, or cut away, from max. These we generalized and call the set of all upstream cut patterns R and the set of all downstream cut patterns L.
3. Search for structure: The family design space is formed by applying all pair combinations of cuts (one from L and one from R) on max. The orthogonal cut sets L and R have lattice structures, and can be used to specify, relate and order all family members [5], [12]. The cut pairs for a specific abstracted design determine its liveness, behavior, and capacity when pipelined.
4. Implementation: The effect of concurrency reduction was studied by synthesizing the complete set of pipelined controllers and evaluating them for throughput, latency, and power.
II. Modeling 4-Phase Pipeline Stages
We use Milner's Calculus of Communicating Systems (CCS) to model and reason about protocols [18]. CCS has a number of pertinent attributes that make it attractive for this work. It is straightforward to capture signal level behavior. It has a simple formal semantics to support reasoning about designs. Flow graph structure and hierarchy are part of the language, including semantics for how internal hidden behavior, represented as τ, affects externally observable signals. This allows us to formally reason about the signal hiding techniques we used for protocol abstraction. It has reliable public domain tool support, the Concurrency Workbench (CWB) [19]. Finally, the CWB implements the very powerful modal-μ property checking calculus [21].

Fig. 3. Bundled Data Controller and Data Latch

Along with these positive points, CCS suffers from the usual state explosion problems. In practice this means that CCS is perhaps best suited to exploring abstract views and control properties of systems, rather than data path logic.
A. The MAX Protocol Abstraction
Fig. 3 shows a bundled data pipeline stage. The latch is responsible for holding the current data value captured from input bus dIN. The latch controller (LC) is responsible for synchronizing the input and output channels with the data stream. Since the latch controller protocol works the same for all bus values, dIN/dOUT can be abstracted by tokens indicating when data can change and become stable.
The model development used in this paper is an abstraction of the 4-phase bundled data protocol using a normally closed (opaque) latch (the approach is equally valid for normally open latches). The following five constraints specify the safety properties of the protocol:
s0: Liveness: there is a unique quiescent state which can carry out only one action, lr↑, and which is reachable from all other states.
s1: A new data value must be stable on dIN before the input channel request lr↑ is asserted.
s2: The data is captured in the latch before the input channel acknowledgment la↑ occurs.
s3: The data must be passed through the latch before the output channel request rr↑.
s4: The latch must remain closed, keeping dOUT stable, until the output channel acknowledge signal ra↑ is asserted.
The latch behavior is specified as follows. Enable request and acknowledgment signals rEn and aEn control the opening and closing of the latch. As the dIN and dOUT actions of Fig. 3 are not modeled, markers open and closed are inserted to show the state of the latch. We assume that if a new data value is valid on the input dIN when the latch is opened, it will have time to be stored in the latch and propagate to the output dOUT before the latch is closed. The separating dot "." between actions in a CCS definition may be read as "and some time later".
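\[ LATCH = rEn{\uparrow}.open.aEn{\uparrow}.rEn{\downarrow}.closed.aEn{\downarrow}.LATCH \]
The most concurrent protocol max is obtained by delaying handshake signals in the specification only to prevent a safety violation. A CCS specification that results in the most concurrent protocol is given in Eqn. 1.

Fig. 4. Minimized state graph of max, configured as a shape

\[
\begin{aligned}
L &= lr{\uparrow}.gS.rEn{\uparrow}.aEn{\uparrow}.rEn{\downarrow}.aEn{\downarrow}.pV.la{\uparrow}.lr{\downarrow}.la{\downarrow}.L \\
R &= gV.rr{\uparrow}.ra{\uparrow}.pS.rr{\downarrow}.ra{\downarrow}.R \\
S &= gS.pS.S \qquad V = pV.gV.V \\
LC &= (L \mid R \mid S \mid V) \setminus \{gS, pS, gV, pV\} \\
LATCH &= rEn{\uparrow}.open.aEn{\uparrow}.rEn{\downarrow}.closed.aEn{\downarrow}.LATCH \\
max &= (LC \mid LATCH) \setminus \{rEn, aEn\}
\end{aligned}
\tag{1}
\]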
The trace variables open and closed have done their job and are now omitted. All the handshakes between L and LATCH are treated as CCS τ moves (silent internal actions of arbitrary duration). Thus the interplay between the LATCH and L is equivalent to:
\[
\begin{aligned}
L &= lr{\uparrow}.gS.\tau.\tau.\tau.\tau.pV.la{\uparrow}.lr{\downarrow}.la{\downarrow}.L \\
LATCH &= \tau.\tau.\tau.\tau.LATCH
\end{aligned}
\]
where the four handshakes in L between gS and pV are all reduced to τ. As each of these τ signals represents nothing more than an arbitrary delay, their net effect is that of a single "." in CCS. In addition the contribution from the (normally closed) LATCH is completely silent. The same argument would apply had we used a normally open (transparent) latch. For example, Efthymiou and Garside [7] give normally open/normally closed variations on four different latches. Each pair has the same minimized state graph after hiding.
Process S is a token that ensures property s4 holds and new data is not written into the latch until the previous data has been consumed and the latch has space for the token. Process V is a token that ensures that property s3 holds and data has been stored in the latch before the downstream request is asserted. Process L internally ensures property s2 holds by completing the handshake with the latch before the input channel acknowledgment can assert. Safety property s1, that the data arrives before lr↑, is assumed to hold by correct system timing when s3 holds in the upstream controller. Maximal concurrency is obtained by releasing the tokens as early as possible. S is released, allowing latch storage, upon ra↑. V is released as soon as data is stored in the latch.
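The token discipline of Eqn. 1 can be explored with a small interleaving simulator. The sketch below is illustrative only and is not the Concurrency Workbench: it assumes the latch handshake has already been collapsed to a single "." as argued above, writes rising and falling edges as "+" and "-", and treats gS, pS, gV and pV as hidden two-party synchronizations. The helper names (`steps`, `closure`, `accepts`, `deadlock_free`) are hypothetical.

```python
from collections import deque

# Each process is a cyclic action sequence. "+"/"-" mark rising/falling edges of
# the visible handshake signals; gS, pS, gV, pV are the hidden token actions.
L = ["lr+", "gS", "pV", "la+", "lr-", "la-"]   # latch handshake reduced to "."
R = ["gV", "rr+", "ra+", "pS", "rr-", "ra-"]
S = ["gS", "pS"]
V = ["pV", "gV"]
PROCS = (L, R, S, V)
HIDDEN = {"gS", "pS", "gV", "pV"}
START = (0, 0, 0, 0)


def advance(state, *who):
    """Return the state after each process index in `who` takes one step."""
    nxt = list(state)
    for k in who:
        nxt[k] = (nxt[k] + 1) % len(PROCS[k])
    return tuple(nxt)


def steps(state):
    """Yield (label, next_state); label is None for a hidden synchronization."""
    offers = [proc[i] for proc, i in zip(PROCS, state)]
    for k, act in enumerate(offers):
        if act not in HIDDEN:
            yield act, advance(state, k)        # visible edge, never blocked
            continue
        for j in range(k + 1, len(PROCS)):      # hidden action needs a partner
            if offers[j] == act:
                yield None, advance(state, k, j)


def closure(states):
    """All states reachable from `states` using hidden moves only."""
    seen, todo = set(states), deque(states)
    while todo:
        for label, nxt in steps(todo.popleft()):
            if label is None and nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return seen


def accepts(trace):
    """True if `trace` is a possible sequence of visible edges from reset."""
    current = closure({START})
    for edge in trace:
        current = closure({n for s in current for lab, n in steps(s) if lab == edge})
        if not current:
            return False
    return True


def deadlock_free():
    """True if every reachable state has at least one successor."""
    seen, todo = {START}, deque([START])
    while todo:
        successors = [n for _, n in steps(todo.popleft())]
        if not successors:
            return False
        for n in successors:
            if n not in seen:
                seen.add(n)
                todo.append(n)
    return True
```

Under these assumptions, `accepts(["lr+", "la+", "lr-", "la-", "lr+"])` holds because the input channel may run a full cycle ahead of the output channel, whereas a second `"la+"` is refused until the S token has been returned after ra↑.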
B. Shape Representation
Fig. 4 displays the minimized state graph of the concurrent protocol max. We call this specific state and signal configuration a shape. Horizontally the labels show the input channel signals; vertically and wrapped around are the output channel handshake signals. The initial state is marked with the circle •. The 4×2 block of states on the right of the shape are reached when the input channel gets ahead of the output channel. The leftmost three states are reached when the converse is true and the input channel must catch up with the output channel. The neck on the right is where the device is ensuring that the current value held is not overwritten until it has been passed downstream.

Fig. 5. STG for the abstracted max protocol

CCS tracks all possible interleavings. Notice that after an initial lr↑ action, max permits (i) L to complete the action sequence la↑.lr↓.la↓.lr↑ before R carries out its rr↑, and (ii) R to complete the action sequence rr↑.ra↑.rr↓.ra↓ before L carries out its la↑ action. An equivalent STG of the max shape is shown in Fig. 5.
A less cluttered shape for max is derived by removing the arcs as shown below. This is the shape notation we prefer to use.
o o o • o o o o o
o o o o o
o o o o o o o o o
o o o o o o o o o
III. Concurrency Reduction on MAX with Cuts
Once each protocol has been modeled as a shape, it is clear that less concurrent abstracted protocols contain fewer states. This can be shown in our compact representation of a shape by replacing the o with a . if the state is unreachable (cut away) in a particular protocol.
Table I pictures the shapes of max and three published designs: KG due to Kol and Ginosar [13]; BCKLLS due to Blunno et al. [3]; and FD6 due to Furber and Day [10]. Note that max is the union of two highly concurrent published protocols, KG and BCKLLS; the third, FD6, is their intersection. Our experiments confirmed that the max shape is the union of all 40 published designs we investigated in the incubation phase.
Less concurrent shapes can be generated by systematically removing (or cutting away) states from max, just as the max shape can be formed by the union of other shapes. Observation of the shapes of published designs indicated that concurrency reduction removed states from the left and right sides of max. Taking our cue from Table I, we decided to partition our concurrency reduction rules into two sets: L on the left and R on the right. We call our systematic concurrency reduction rules cuts.




TABLE I
The relationship of the max shape and three published designs

Fig. 8. Left cut L denotation and range. The top row is duplicated at the bottom of the shape to more easily show the Left cut ordering.

Fig. 9. The shape (above the duplicated line) resulting from cutaway L0123
A. Concurrency Reduction from Right Cuts R
The states removed in a right cut are denoted as Rabcd as shown in Fig. 6. Rabcd denotes the removal from max of a states from the right end of row 1, b from row 2, c from row 3, and d states from the right end of row 4. The maximal cutaway per row is 4 for rows 1 and 2, and 8 for rows 3 and 4, as shown by the dashed box. If we cut away more states, liveness constraints will be violated.
o o o o  o o o o o
o o o o o
o o o o o o o o o





Fig. 6. Right cut R  denotation and range
The result o f cut R 2152 is depicted in Fig. 7.
o o o o  o o o
o o o o .
o o o





Fig. 7. The shape resulting from cutaway R2152 
The family of all R cuts is generated by the constraints:
\[
0 \le a, b \le 4 \qquad 0 \le c, d \le 8 \\
a \ge b \;\land\; b + 4 \ge c \;\land\; c \ge d \;\land\; d \ge a
\tag{2}
\]
B. Concurrency Reduction from Left Cuts L
Left cuts, denoted Labcd, remove from max a states from the left of row 2, b from row 3, c from row 4, and d from the left of row 1. The potential candidates for a left cut now lie in a 3×4 block of states in the dashed box of Fig. 8. The liveness rule that requires the initial state to be present and reachable from any state is violated if any more than these states are removed. To emphasize the ordering of left cuts, Fig. 8 temporarily employs a new representation by duplicating the top row after the last row of the shape.
The result of cut L0123 is depicted in Fig. 9.
The family of all L cuts is generated by the constraints:
\[
0 \le a, b, c, d \le 3 \\
a \le b \;\land\; b \le c \;\land\; c \le d
\tag{3}
\]
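The cut constraints of Eqns. 2 and 3 translate directly into membership predicates. The sketch below assumes the inequalities are non-strict, which is the reading under which the cuts named in the text (R0000, R2152, L0123) are members; the function names are hypothetical.

```python
from itertools import product


def is_right_cut(a, b, c, d):
    """Eqn. 2: a, b, c, d states cut from the right end of rows 1, 2, 3, 4."""
    in_range = 0 <= a <= 4 and 0 <= b <= 4 and 0 <= c <= 8 and 0 <= d <= 8
    return in_range and a >= b and b + 4 >= c and c >= d and d >= a


def is_left_cut(a, b, c, d):
    """Eqn. 3: a, b, c, d states cut from the left of rows 2, 3, 4 and 1."""
    return all(0 <= x <= 3 for x in (a, b, c, d)) and a <= b <= c <= d


def parse(name):
    """Turn a cut name such as 'R2152' into its digit tuple (2, 1, 5, 2)."""
    return tuple(int(ch) for ch in name[1:])


# Enumerate both families over the ranges allowed by the constraints.
left_cuts = [t for t in product(range(4), repeat=4) if is_left_cut(*t)]
right_cuts = [t for t in product(range(5), range(5), range(9), range(9))
              if is_right_cut(*t)]
```

For example, `is_right_cut(*parse("R2152"))` and `is_left_cut(*parse("L0123"))` both hold, while a cut such as (0, 0, 5, 0) fails the b + 4 ≥ c term of Eqn. 2.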
C. Using cuts to generate the family and to define liveness
The complete family of protocol shapes is generated by applying all pair combinations of left cuts and right cuts to max. Not all these shapes will be valid: for example the shape L2222 ∘ R4444 is not live since it deletes all the states from row 2 of the shape. Using an obvious cut indexing, shape Labcd ∘ Rabcd is live iff Eqn. 4 holds. Thus the liveness of a shape can be calculated directly from its cuts from max.
\[
L_a + R_b < 5 \;\land\; L_b + R_c < 9 \;\land\; L_c + R_d < 9 \;\land \\
L_a + R_a < 5 \;\land\; L_b + R_b < 5 \;\land\; L_c + R_c < 9 \;\land\; L_d + R_d < 9
\tag{4}
\]
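Eqn. 4 can be evaluated mechanically from the two cut tuples. The following sketch is a direct transcription, assuming cuts are passed as 4-tuples (a, b, c, d); the function names `liveness_terms` and `is_live` are hypothetical.

```python
def liveness_terms(left, right):
    """Return the seven (label, sum, bound) terms of Eqn. 4 for one shape."""
    La, Lb, Lc, Ld = left
    Ra, Rb, Rc, Rd = right
    return [
        ("La+Rb", La + Rb, 5), ("Lb+Rc", Lb + Rc, 9), ("Lc+Rd", Lc + Rd, 9),
        ("La+Ra", La + Ra, 5), ("Lb+Rb", Lb + Rb, 5), ("Lc+Rc", Lc + Rc, 9),
        ("Ld+Rd", Ld + Rd, 9),
    ]


def is_live(left, right):
    """A shape is live iff every term of Eqn. 4 is strictly below its bound."""
    return all(total < bound for _, total, bound in liveness_terms(left, right))


def violations(left, right):
    """Labels of the Eqn. 4 terms that a shape violates (empty if live)."""
    return [label for label, total, bound in liveness_terms(left, right)
            if total >= bound]
```

Applied to the example above, `is_live((2, 2, 2, 2), (4, 4, 4, 4))` is false, and `violations` reports the three terms whose bound is 5, while max itself, `is_live((0, 0, 0, 0), (0, 0, 0, 0))`, is live.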
D. The Untimed Family
The right and left cut constraints in Eqns. 2 and 3 express all cuts, including the burst-mode and relative timed protocols. Timed protocols occur when the arrival of an input from the environment can be delayed based on another protocol input or output signal. We restrict the evaluation in this paper to untimed (delay insensitive (DI) and speed independent (SI)) protocols. Constraint rule R1 must additionally hold for all untimed protocols. Delay insensitive protocols must also obey constraint R2.
1) R1: input signals lr and ra must always be accepted
2) R2: output signals may be delayed only by inputs
The SI family: When rule R1 is added to the cut and liveness constraints of Eqns. 2, 3 and 4 we obtain the speed independent family of cuts. States must be removed in 1×2 pairs by left cuts and 2×1 pairs by right cuts. For example, referring to Fig. 4, one cannot remove just the rightmost state in any row (e.g. cut R0011) or the input lr is delayed (resulting in a timed design). The state to its left must also be removed (giving cut R0022). R1 is enforced by the following cut equation.
\[
R : a, b, c, d \text{ are even} \qquad L : a = b \;\land\; c = d
\tag{5}
\]
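The SI restriction of Eqn. 5 can be layered on the base cut constraints to enumerate the speed independent cuts explicitly. The sketch below restates Eqns. 2 and 3 (assuming non-strict inequalities) so that it is self-contained; the helper names are hypothetical.

```python
from itertools import product


def is_right_cut(a, b, c, d):
    """Base right-cut constraint (Eqn. 2)."""
    in_range = 0 <= a <= 4 and 0 <= b <= 4 and 0 <= c <= 8 and 0 <= d <= 8
    return in_range and a >= b and b + 4 >= c and c >= d and d >= a


def is_left_cut(a, b, c, d):
    """Base left-cut constraint (Eqn. 3)."""
    return all(0 <= x <= 3 for x in (a, b, c, d)) and a <= b <= c <= d


def is_si_right(cut):
    """Eqn. 5 for right cuts: every count is even."""
    return is_right_cut(*cut) and all(x % 2 == 0 for x in cut)


def is_si_left(cut):
    """Eqn. 5 for left cuts: a = b and c = d."""
    a, b, c, d = cut
    return is_left_cut(*cut) and a == b and c == d


def name(prefix, cut):
    """Format a cut tuple in the paper's notation, e.g. ('R', (0,0,2,2)) -> 'R0022'."""
    return prefix + "".join(str(x) for x in cut)


si_left = sorted(name("L", t) for t in product(range(4), repeat=4)
                 if is_si_left(t))
si_right = sorted(name("R", t)
                  for t in product(range(5), range(5), range(9), range(9))
                  if is_si_right(t))
```

Under this reading the enumeration yields 10 left cuts and 25 right cuts; R0011 is a valid cut by Eqn. 2 but is excluded from the SI family, whereas R0022 is included.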
The DI family: Adding rules R1 and R2 to our base cut constraints creates the delay insensitive protocols. This is more restrictive than the SI cuts, requiring states to be removed in 2×2 blocks. This removes "output ordering" in the protocol. For example, referring to Fig. 4, if the rightmost two states are cut in the top row, output la↓ is delayed, producing cut R2022. (The bottom right four states must also be removed for liveness by Eqn. 2.) To obey R2 and prevent output la↓ being delayed by output rr↑, the right two states must also be removed in the second row. This produces the DI cut R2222. R1 and R2 are enforced by the following equation on both L and R cuts.
\[
R, L : a, b, c, d \text{ are even} \;\land\; a = b \;\land\; c = d
\tag{6}
\]
E. L and R Cut Lattices
The DI family is a subset of the SI family. Rather than subtracting the DI cuts out, we prefer to keep them all and refer to the result as the DI/SI family. This results in 10 left cuts and 25 right cuts. Composing members of the L and R cuts to create protocol shapes gives 250 possible protocols, of which 91 are not live as their cuts violate the liveness constraint Eqn. 4.

Fig. 10. The Symmetric Lattices of Untimed DI/SI Left and Right Cuts

Both sets of cuts form symmetric lattices and each cut has a complement. Both lattices are coherent and complete. They are both shown in Fig. 10. The lattice of left cuts has 10 members and is symmetric about an axis through cuts L0033 and L1122. Each cut Labcd has a complement given by L(3-d)(3-c)(3-b)(3-a). The cuts on the axis are self-complementary. The lattice of right cuts has 25 members and is symmetric about an axis through R2262, R2244, R4044 (which are self-complementary). Each cut Rabcd has a complement given by R(4-b)(4-a)(8-d)(8-c).
IV. Experimental DI/SI Facts and Patterns
This paper reports on homogeneous linear pipelines without feedback. Three important behaviors were revealed by experiments on these pipelines.
1) PPw,d (see Fig. 2) is independent of w. Seen from the outside, each structured parallel pipeline behaves like a single pipeline SPd of the same depth. The behavior of these pipelines always emulates a DI protocol and can be predicted from the cuts of the shape.
2) There are only 23 possible structured parallel PPw,d behaviors and they are the 23 live DI shapes.
3) Some protocols change shape when composed into single pipelines of depth 2 or more. Interestingly, they may gain or lose states.

TABLE II
Pipeline Protocol Behaviors
Category   Basic shape      SPd shape        PPw,d shape
DI         L0000 ∘ R0000    L0000 ∘ R0000    L0000 ∘ R0000
REGULAR    L1111 ∘ R0000    L1111 ∘ R0000    L0000 ∘ R0000
2REGULAR   L2233 ∘ R0040    L3333 ∘ R0040    L2222 ∘ R0000

Table II indicates the three possible categories of valid pipelined behavior that arise in DI/SI shapes:
1) DI: these shapes (here max = L0000 ∘ R0000) retain their left and right profiles when pipelined singly or in parallel.
2) REGULAR: these shapes (here L1111 ∘ R0000) retain their left and right profiles when singly piped, but gain states to become a DI shape (in this case max) when piped in parallel.
3) 2REGULAR: these shapes (here L2233 ∘ R0040) change shape when singly pipelined (here it loses two left states and acts as L3333 ∘ R0040 from depth two onwards), and, when run in parallel, change shape again to a DI protocol (to L2222 ∘ R0000, gaining two states on the left and four on the right).
A. Tableau of DI/SI experiments
The results of the experiments which cover the whole design space are displayed in an L by R tableau in Table III.
1) •: a shape is DI if and only if both its cuts are DI.
2) △: a shape is regular if and only if both its cuts are regular or just one is DI.
3) □: a shape is 2regular if one or both of its cuts is/are 2regular.
4) .: shows a non-live protocol.

TABLE III
Tableau over all DI/SI cuts
L0000 L0011 L1111 L0022 L1122 L0033 L1133 L2222 L2233 L3333  L ∘ R
  •     □     △     •     △     △     △     •     □     △    R0000
  □     □     □     □     □     □     □     □     □     □    R0020
  △     □     △     △     △     △     △     △     □     △    R0040
  •     □     △     •     △     △     △     •     □     △    R0022
  △     □     △     △     △     △     △     △     □     △    R0042
  △     □     △     △     △     △     △     △     □     .    R2022
  △     □     △     △     △     △     △     △     □     .    R2042
  •     □     △     •     △     △     △     •     □     △    R0044
  □     □     □     □     □     □     □     □     □     .    R2044
  △     □     .     △     .     △     .     .     .     .    R4044
  •     □     △     •     △     △     △     •     □     .    R2222
  □     □     □     □     □     □     □     □     □     .    R2242
  △     □     △     △     △     .     .     △     .     .    R2262
  •     □     △     •     △     △     △     •     □     .    R2244
  △     □     △     △     △     .     .     △     .     .    R2264
  △     □     .     △     .     △     .     .     .     .    R4244
  △     □     .     △     .     .     .     .     .     .    R4264
  •     □     △     •     △     .     .     •     .     .    R2266
  □     □     .     □     .     .     .     .     .     .    R4266
  •     □     .     •     .     △     .     .     .     .    R4444
  □     □     .     □     .     .     .     .     .     .    R4464
  △     .     .     .     .     .     .     .     .     .    R4484
  •     □     .     •     .     .     .     .     .     .    R4466
  △     .     .     .     .     .     .     .     .     .    R4486
  •     .     .     .     .     .     .     .     .     .    R4488
•: 23   △: 76   □: 60   .: 91   /250
One striking result is that the L and R cuts have orthogonal and persistent behavior. For example, L0011 cuts 2regularly (in the same way) whichever R cut it is composed with. Similarly, R0040 cuts regularly whichever L cut it is composed with.
The tableau is divided into blocks with a DI shape • in its top-left corner. This is the most state rich shape in that block. The least state rich shape sits in its bottom right corner (only six of these are live). All shapes in a specific block have the same parallel pipelining behavior. Mathematicians may prefer DI shapes because they retain that shape when pipelined; engineers may prefer others if they give rise to faster or lower power implementations for the same pipelining behavior. For example, the designs KG, BCKLLS and FD6 by Kol and Ginosar, Blunno et al., and Furber and Day respectively all have max's behavior when composed into parallel pipelines. The doubly latched KG even has max's behavior when composed into a serial pipeline of depth two or more. Within each block, any 2regular shape will change shape to a regular or DI shape from depth two when singly pipelined. Any regular or 2regular shape will change to the DI shape of its own block when pipelined in parallel, even from depth one. So the 23 viable DI blocks have an underlying structural significance which would not have been revealed had we not experimented with parallel pipelines.
Finally the tableau splits into 3 levels. All shapes with R cuts between R0000..R2262 can achieve full occupancy, and between R2244..R4264 half occupancy. The rest are unpipelined.
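The tableau totals can be cross-checked by enumeration. The sketch below assumes non-strict inequalities in Eqns. 2 and 3 and strict ones in Eqn. 4; the sets of 2regular cuts and of unpipelined R cuts are not derived from any equation but are read off the tableau and its ordering, so they are assumptions of this illustration. All names are hypothetical.

```python
from collections import Counter
from itertools import product


def right_ok(a, b, c, d):
    """Eqn. 2 plus the SI rule of Eqn. 5 (all counts even)."""
    base = a <= 4 and c <= 8 and a >= b and b + 4 >= c and c >= d and d >= a
    return base and all(x % 2 == 0 for x in (a, b, c, d))


def left_ok(a, b, c, d):
    """Eqn. 3 plus the SI rule of Eqn. 5 (a = b and c = d)."""
    return a <= b <= c <= d <= 3 and a == b and c == d


def is_live(left, right):
    """Liveness constraint of Eqn. 4."""
    La, Lb, Lc, Ld = left
    Ra, Rb, Rc, Rd = right
    return (La + Rb < 5 and Lb + Rc < 9 and Lc + Rd < 9 and La + Ra < 5
            and Lb + Rb < 5 and Lc + Rc < 9 and Ld + Rd < 9)


def is_di(cut):
    """Eqn. 6: all counts even, a = b and c = d."""
    a, b, c, d = cut
    return all(x % 2 == 0 for x in cut) and a == b and c == d


LEFT = [t for t in product(range(4), repeat=4) if left_ok(*t)]
RIGHT = [t for t in product(range(5), range(5), range(9), range(9))
         if right_ok(*t)]

# Assumption: cuts observed to behave 2regularly, as read from the tableau.
TWO_REGULAR_LEFT = {(0, 0, 1, 1), (2, 2, 3, 3)}
TWO_REGULAR_RIGHT = {(0, 0, 2, 0), (2, 0, 4, 4), (2, 2, 4, 2),
                     (4, 2, 6, 6), (4, 4, 6, 4)}
# Assumption: R cuts listed after R4264 in the tableau are the unpipelined ones.
UNPIPELINED = {(2, 2, 6, 6), (4, 2, 6, 6), (4, 4, 4, 4), (4, 4, 6, 4),
               (4, 4, 8, 4), (4, 4, 6, 6), (4, 4, 8, 6), (4, 4, 8, 8)}


def classify(left, right):
    """Category of the shape left ∘ right following the tableau legend."""
    if not is_live(left, right):
        return "non-live"
    if left in TWO_REGULAR_LEFT or right in TWO_REGULAR_RIGHT:
        return "2regular"
    if is_di(left) and is_di(right):
        return "DI"
    return "regular"


counts = Counter(classify(l, r) for l in LEFT for r in RIGHT)
pipelined_live = sum(is_live(l, r) for l in LEFT for r in RIGHT
                     if r not in UNPIPELINED)
```

With these assumptions `counts` reproduces the tableau totals (23 DI, 76 regular, 60 2regular and 91 non-live out of 250), and `pipelined_live` evaluates to 137 live shapes in the full and half occupancy rows.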
V. Concurrency Reduction
The remainder of the paper discusses a study to determine the impact of concurrency reduction on a protocol family. All protocols are derived from the single most concurrent protocol shape based on systematic concurrency reduction rules of the left and right cuts. All of the pipelined protocols in this untimed family were synthesized, placed and routed, and their physical designs were characterized in order to perform this evaluation.
The theoretical part of the paper defines a complete protocol family consisting of 137 different specifications. This provides perhaps the first opportunity to perform a large scale systematic study of the effect of regular concurrency reduction upon a complete class of protocols. The theory identifies the fact that many specifications are indistinguishable when placed in pipelines that are common design topologies. It also shows that many properties of the protocols are persistent. This opens up a choice space for a designer to pick amongst various designs in order to match a particular requirement and optimize the metrics most important for the design while meeting specific protocol requirements.
A tractable design space is presented by abstracting out the data path logic. However, this also results in substantial inaccuracy in the reported results if the goal is to build controllers that include a data path. We expect there to be a substantially larger penalty in implementing the latch clocking signals for the more concurrent protocols than for the protocols of lesser concurrency.
Concurrency reduction results in a complex interplay between logic level optimization and system level interaction across the handshake channels. In general, concurrency reduction tends to reduce the complexity of the logic, which can speed up the response time of the controller. However, it can also result in system level performance degradation by delaying output signals on a channel when they otherwise would be able to proceed in a more concurrent protocol. Thus, some amount of concurrency reduction produces an improved design, but too much can degrade performance.
Our initial hope was to find that concurrency reduction produced a convex function that placed the optimal design somewhere in the middle of the concurrency reduction spectrum. We hoped this would be true both globally across the full design space as well as locally inside protocol equivalence classes. While this trend does generally hold, there is substantial noise with significant exceptions to the norm.

TABLE IV
Number of state variables generated by Petrify
L0000 L0011 L1111 L0022 L1122 L0033 L1133 L2222 L2233 L3333 L ∘ R
2 4 2 2 2 2 R0000
3 2 2 2 2 1 2 1 1 R0020
2 2 2 2 2 2 1 2 1 1 R0040
3 3 2 1 2 2 1 2 1 1 R0022
2 2 2 - 2 2 1 2 1 1 R0042
2 2 2 2 2 2 1 2 1 D R2022
2 1 1 1 1 1 0 0 D R2042
3 2 2 2 2 2 1 2 1 1 R0044
2 2 1 1 1 1 0 0 D R2044
2 1 D 1 D 1 D D D D R4044
2 2 2 2 2 2 1 2 1 D R2222
2 1 1 2 1 1 0 0 D R2242
2 1 1 1 1 D D D D R2262
2 1 1 1 1 1 0 0 D R2244
2 1 1 1 0 D D D D R2264
2 1 D 1 D 1 D D D D R4244
1 0 D 0 D D D D D D R4264

TABLE V
Controller Area in μm²
L0000 L0011 L1111 L0022 L1122 L0033 L1133 L2222 L2233 L3333 L ∘ R
271.6 387.7 244.3 350.2 203.2 196.4 R0000
404.8 - 285.3 288.7 240.9 302.3 217.0 223.7 237.4 189.6 R0020
261.3 339.9 292.1 264.7 339.9 309.1 244.3 206.6 213.5 240.9 R0040
346.7 428.7 203.2 213.5 333.1 210.1 206.6 175.9 155.4 162.2 R0022
418.5 268.2 257.9 - 281.8 331.1 206.6 251.1 189.6 172.5 R0042
319.4 350.2 333.1 206.6 285.3 298.9 196.4 179.3 179.3 D R2022
398.0 268.2 213.5 196.4 203.2 206.6 169.1 175.9 162.2 D R2042
247.6 196.4 237.4 162.2 199.9 227.1 172.5 206.6 152.0 155.4 R0044
206.6 210.1 186.1 189.6 186.1 189.6 162.2 124.7 155.4 D R2044
278.4 193.0 D 155.4 D 165.6 D D D D R4044
350.2 213.5 179.3 165.6 172.5 162.2 152.0 162.2 124.8 D R2222
288.7 223.7 186.1 199.8 172.5 186.1 138.3 131.5 131.5 D R2242
220.3 206.6 175.9 165.6 162.2 D D 155.4 D D R2262
179.3 162.2 128.1 117.8 134.9 128.1 104.2 97.3 83.7 D R2244
196.4 182.7 152.0 128.1 134.9 D D 107.9 D D R2264
189.6 169.1 D 134.9 D 121.2 D D D D R4244
145.2 148.6 D 104.2 D D D D D D R4264

In the end we are interested in implementing efficient systems and are searching to find more efficient designs than have heretofore been discovered. The exhaustive nature of this evaluation points to locations in the design space to search for design optimization. Our literature search uncovered published implementations for only 40 of the 137 untimed shapes [3], [6], [8], [9], [10], [13], [14], [15], [16], [22], [25], [26]. None of those had the same protocol as the best-in-class in the study performed here.
VI. Circuit Characterization
The complete set of abstracted controllers was synthesized, placed and routed, and characterized using post layout extraction employing the static Artisan 12T library on IBM's 65nm 10sf process [23]. Each of the controllers has been characterized for cycle time, forward and backward latency and area. Power results are not reported here. Simulation is performed in ModelSim on the post layout design with delays extracted from SoC Encounter based on the layout and parasitics of the design. Most of the flow has been fully automated since we started with 137 different pipelined controllers.
The characterization starts with the controller behavior specified as a state graph "shape" in CCS. The specification is then synthesized and technology mapped to the Artisan library using Petrify [4]. We created a technology mapping file in the genlib format that was used by Petrify and applied the exhaustive decomposition algorithm. Of the 137 controllers, Petrify could not find a valid state variable assignment for six controllers, including max.

Fig. 11. Controller with smallest area and backward latency: L2233 ∘ R2244

Petrify has the capability of adding reset to a design. However, those results are inconsistent, and so we opted to manually add reset to the controllers. This process was aided by Verilog2CCS, software that we wrote to map a Verilog module to a formal CCS specification. Verilog2CCS forces the inputs low and simulates the Verilog to determine the logic value of each net. If any nets are undefined, reset is added by modifying one or more gates to drive the net to the proper state. An attempt was made to have reset create as small an impact on area and power as possible.
Characterization is performed by placing each structural Verilog controller into a 4-deep linear pipeline. This pipeline is sufficient to evaluate our design metrics. Simple behavioral interfaces are added to the left and right side of the pipeline that enable simulation control of the external handshake channels. The left interface implements assign ack = go_l & ~req and the right interface assign ack = go_r & req. Latency through the left





TABLE VI
Forward latency in ps. Optimal average cut in bold.
L0000  L0011  L1111  L0022  L1122  L0033  L1133  L2222  L2233  L3333  L ∘ R
457    -      497    200    318    155    269    478    474    501    R0020
339    367    387    150    534    234    303    366    477    543    R0040
133    486    325    293    533    236    273    362    317    379    R0022
186    309    291    -      331    164    265    405    383    382    R0042
103    210    422    203    291    107    253    295    370    D      R2022
289    343    293    206    234    188    206    267    162    D      R2042
124    199    301    157    275    130    240    375    299    388    R0044
262    220    255    225    273    186    214    319    253    D      R2044
93     203    D      180    D      172    D      D      D      D      R4044
291    204    198    180    182    100    194    242    243    D      R2222
244    206    271    174    227    68     158    304    277    D      R2242
135    214    304    152    198    D      D      314    D      D      R2262
195    126    165    113    186    118    126    226    205    D      R2244
198    177    202    116    156    D      D      249    D      D      R2264
115    146    D      135    D      58     D      D      D      D      R4244
88     93     D      85     D      D      D      D      D      D      R4264
R0000: 607 177 257 552 402 401 (four cells empty; column positions not recoverable)

TABLE VII
Backward latency in ps, optimal full buffered cut in bold
L0000  L0011  L1111  L0022  L1122  L0033  L1133  L2222  L2233  L3333  L ∘ R
299    -      188    492    237    239    233    228    235    174    R0020
278    647    308    446    484    300    248    193    274    283    R0040
292    117    232    274    306    290    257    154    196    180    R0022
416    378    261    -      330    368    285    221    216    203    R0042
286    168    189    184    178    221    174    155    161    D      R2022
370    468    274    225    228    263    221    192    173    D      R2042
499    439    292    332    357    445    327    219    263    339    R0044
465    438    385    324    340    375    338    355    293    D      R2044
658    458    D      392    D      419    D      D      D      D      R4044
538    400    376    320    346    274    288    275    183    D      R2222
357    402    333    286    300    303    259    270    219    D      R2242
341    477    463    315    339    D      D      342    D      D      R2262
175    124    120    131    115    135    97     108    78     D      R2244
177    219    195    157    127    D      D      138    D      D      R2264
153    185    D      141    D      121    D      D      D      D      R4244
147    142    D      140    D      D      D      D      D      D      R4264
R0000: 178 579 302 249 195 165 (four cells empty; column positions not recoverable)
† half buffered

Fig. 17. Backward latency averaged across left cuts
Fig. 18. Backward latency averaged across right cuts
controller. Specifically, this is the delay from lr↑ to rr↑ in an 
idle controller. Our simulations measured the delay across the 
four pipeline stages in the design and then divided this delay by 
four to get the latency per controller. Table VI shows forward 
latency in picoseconds.
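As a worked instance of this measurement, the sketch below divides a delay measured across the pipeline by the number of stages. The function name and the two timestamps are hypothetical; they are not values from the reported simulations.

```python
def per_controller_latency(t_first_ps, t_last_ps, stages=4):
    """Average latency per controller from a delay measured across `stages` stages."""
    return (t_last_ps - t_first_ps) / stages


# Hypothetical event times in ps: lr rises at the pipeline input,
# rr rises at the pipeline output.
latency = per_controller_latency(1000, 1272)
print(latency)
```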
Concurrency reduction on the incoming channel generally 
reduces the latency as shown in Fig. 16. This improvement is 
directly related to reduction in the complexity of the design as 
concurrency is reduced. This effect of concurrency reduction 
is similar to that of area and state variable reduction.
Concurrency reduction on the outgoing channel displays an 
interesting competition between two effects: it increases 
protocol latency but decreases controller latency 
(Fig. 13). Left cuts that delay rr↑ will substantially retard 
forward latency due to the reduction in protocol concurrency. 
This occurs when the first components (La and Lb) of the left 
cut increase. Thus as concurrency is reduced from L00xx to 
L11xx and so forth, the delay of rr↑ increases substantially.
However, the Lc and Ld components in the left cut generally 
decrease forward latency through logic simplification.
The circuit synthesized by Petrify with the 
smallest forward latency is shown in Fig. 14. This design, 
L0033 ∘ R4244, contains the maximal right cut and maximal 
left cut where the La and Lb cuts are zero. The smallest 
forward latency in a fully buffered design is L0033 ∘ R2242, 
just 10ps slower. Note that for the same amount of buffering 
in an application such as a FIFO, a half buffered protocol 
needs to pass through twice as many controllers. Therefore 
the fastest forward latency for the same amount of buffering 
is 116ps for the best half buffered protocol versus 68ps for 
the best fully buffered protocol.
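The equal-buffering comparison is simple arithmetic, sketched below. The per-controller values of 58 ps and 68 ps are read from Table VI; the function name is hypothetical, and the factor of two is the assumption stated in the text that a half buffered protocol passes through twice as many controllers for the same storage.

```python
def latency_for_equal_storage(per_controller_ps, half_buffered):
    """Forward latency per stored token, normalized to equal buffering."""
    controllers_per_token = 2 if half_buffered else 1
    return per_controller_ps * controllers_per_token


best_half = latency_for_equal_storage(58, half_buffered=True)
best_full = latency_for_equal_storage(68, half_buffered=False)
print(best_half, best_full)
```

After normalization the fully buffered design is the faster of the two, even though its per-controller latency is 10 ps larger.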
C. Backward Latency
There are various ways of measuring backward latency. 
Here we define backward latency as the delay in a stalled 
pipeline from the time that the output channel acknowledges 
that data has been latched until the input channel begins the 
return-to-zero transitions. Specifically, this is the delay from 
ra↑ to la↑ in these controllers. Our simulations filled the 
four-deep pipeline with the maximum number of tokens and 
then measured this delay by allowing the output channel to 
accept the token, measuring the delay to la↑ on the input 
channel. This value was divided by four to get the average per 
controller in the 4-deep pipeline. Backward latency is shown 
in Table VII. Figure 11 shows the controller with the smallest 
backward latency.

Fig. 19. Cycle time averaged across left cuts

TABLE VIII
Cycle time in ps
L0000  L0011  L1111  L0022  L1122  L0033  L1133  L2222  L2233  L3333  L ∘ R
861    -      751    841    606    834    607    804    808    754    R0020
669    974    726    749    1077   673    601    631    844    929    R0040
688    826    615    632    935    620    563    588    564    612    R0022
773    723    621    -      717    801    592    701    675    673    R0042
674    880    925    505    762    801    562    632    681    D      R2022
892    873    632    482    507    592    530    555    514    D      R2042
678    703    661    530    687    750    636    654    638    832    R0044
800    736    692    641    698    648    611    754    627    D      R2044
804    728    D      643    D      694    D      D      D      D      R4044
987    708    627    606    630    613    611    623    522    D      R2222
732    730    636    607    650    652    542    733    641    D      R2242
580    823    894    581    651    D      D      808    D      D      R2262
790    547    623    620    656    722    566    745    643    D      R2244
804    844    849    674    609    D      D      878    D      D      R2264
590    710    D      601    D      559    D      D      D      D      R4244
525    523    D      576    D      D      D      D      D      D      R4264
R0000: 833 844 745 848 672 633 (four cells empty; column positions not recoverable)
Concurrency reduction on the outbound channel generally 
reduces the latency as shown in Fig. 17. This improvement is 
a second order effect and is directly related to reduction in the 
complexity of the design as concurrency is reduced.
Concurrency reduction on the inbound channel shows some 
very interesting properties. Consider full buffered protocols. 
Performance initially improves and then dramatically 
decreases. This can be explained by referring to our shape in 
Fig. 4. The final states in the first two rows of the shape 
are reached in a fully stalled pipeline where the second data 
token is offered on the input channel. If the last state on the 
second row exists, then maximal progress has been made in a 
stalled pipeline. The tail on the right of the shape in the last 
two rows allows a protocol to quickly respond from a stalled 
condition. If this tail is removed, then backward latency will be 
significantly impacted due to a lack of protocol concurrency. 
The tail will be removed with Rc ≥ 4 (Rxx4x) cuts. R2022 has 
been observed as the optimal average right cut for backward 
latency in fully buffered protocols as shown in Fig. 18. This 
cut applies the largest possible amount of concurrency 
reduction without removing the top first states of the "tail" 
in the third row of the shape, which would negatively impact 
protocol concurrency.
Fig. 20. Cycle time averaged across right cuts
Half buffered protocols show an interesting phenomenon 
with a substantial reduction in backward latency. These 
controllers only store data in every other latch when stalled. This 
results in an interesting artifact where every other controller 
stalls in a different location in the shape, one waiting for 
rising ra↑, the other for falling ra↓. This results in a very fast 
backward latency. However, note that since these protocols 
only store data in half the latches, they need to pass through 
twice as many controllers for equal storage as the full buffered 
protocols (or 2× the latency shown). Taking that into account, 
they are slower than the best full buffered protocols.
D. Cycle Time
Cycle time provides information about the throughput of 
the pipeline. It is measured as the largest delay between 
the insertion of two tokens in the pipeline. Twelve tokens 
are inserted into an empty four-deep pipeline as fast as the 
pipeline will accept. All tokens are immediately consumed at 
the output channel. The slowest delay between the insertion 
of two adjacent tokens is recorded as the cycle time. Note 
that for this number to be correct, the left and right pipeline 
interfaces that control token and bubble insertion must have a 
cycle time less than the controller itself. The cycle time of all 
the controllers is listed in Table VIII.
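The cycle time measurement can be sketched as the largest gap between adjacent token insertions. The twelve timestamps below are hypothetical and only illustrate the shape of such a trace, in which the first tokens enter the empty pipeline quickly before the insertion rate settles; the function name is also an assumption.

```python
def cycle_time(insert_times_ps):
    """Largest delay between the insertion of two adjacent tokens."""
    gaps = [later - earlier
            for earlier, later in zip(insert_times_ps, insert_times_ps[1:])]
    return max(gaps)


# Hypothetical insertion times (ps) for twelve tokens entering an
# empty four-deep pipeline as fast as it will accept them.
insert_times = [0, 300, 610, 930, 1452, 1974, 2496, 3018,
                3540, 4062, 4584, 5106]

print(cycle_time(insert_times))
```

Taking the maximum rather than the mean keeps the fast initial fill of the empty pipeline from understating the reported cycle time.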
Cycle time averaged across different left cuts is graphed in 
Figure 19. This graph shows that for left cuts the controller 
with the best average performance is near the middle of the 
concurrency reduction range. On average, cut L1133 results 
in the best throughput. Figure 20 graphs cycle time averaged 
across different right cuts. This graph also demonstrates a 
tradeoff between circuit simplicity and protocol handshake 
delays for full buffered protocols. The best delay lies in the

where data is bundled and valid before the incoming channel 
request (lr↑) signal rises. The L and R cuts form separate 
symmetric lattices and give structure: they predict occupancy, 
the regularity of piped behaviors, and the behavior of 
non-homogeneous pipelines. The lattice product enables one to 
relate and compare protocols: it also reveals the design space.
The complete family of untimed four-phase asynchronous 
handshake controllers is characterized. This provides 
comparative data to help understand the effect of concurrency reduction 
on the area and performance (power was omitted for lack of 
space). The controllers are shown to be correct abstractions of 
controllers with full data path control. The logic is synthesized, 
placed and routed from formal specifications and then 
tech-mapped using the IBM 65nm 10sf Artisan library. Reset was 
manually added to each controller.
We have shown that concurrency reduction generally 
increases the performance of designs up to a point, after which 
the designs begin to degrade in performance. We showed this 
is likely the case due to competing factors of overall faster 
designs as the circuits are simplified through concurrency 
reduction, versus larger protocol delays as certain handshake 
signals become stalled due to inefficiency in the protocols 
of highly reduced concurrency. Cycle time demonstrates this 
effect, as three of the four highest throughput designs are all 
full buffered near the middle of the concurrency reduction, 
with three being in the R2042 cut. There is a notable exception 
that half buffered protocols have some surprising efficiencies 
that make them more competitive than one might expect. 
This is particularly exaggerated with backward latency. The 
ultra inefficient unpipelined protocols, with high input channel 
concurrency reduction, were not included in the graphs.
A final contribution is the data that points engineers to 
designs that optimize each of the performance metrics. The 
best synthesized circuits are published for each metric as the 
complete design space was explored.
X. Acknowledgments
This work was supported in part through a gift by Sun 
Microsystems. We acknowledge the helpful suggestions of Jordi 
Cortadella and Luciano Lavagno in improving this document. 
Thanks are due to researchers who have explained or formally 
documented their circuits (usually as STGs or CSP) so they 
are clear to the community at large. This body of work enabled 
us to model real practical designs rather than experiment with 
a few idealized ones, kept us grounded, and was sufficiently 
large to help guide our research directions.
References
[1] Graham Birtwistle and Matthew Morley. Case Study: Specifying and 
Property Checking TK, and Asynchronous AMULET-like Microprocessor. 
In Alex Yakovlev and Reinder Nouta, editors, Asynchronous 
Interfaces: Tools, Techniques, and Implementations, pages 13-22, July 
2000.
[2] Graham Birtwistle and Kenneth S. Stevens. The family of 4-phase latch 
protocols. In 14th International Symposium on Asynchronous Circuits 
and Systems, pages 71-82. IEEE, April 2008.
[3] I. Blunno, J. Cortadella, A. Kondratyev, L. Lavagno, K. Lwin, and
C. Sotiriou. Handshake protocols for de-synchronization. In 
International Symposium on Asynchronous Circuits and Systems, pages 
149-158. IEEE, Apr 2004.
[4] Jordi Cortadella, Michael Kishinevsky, Alex Kondratyev, Luciano 
Lavagno, and Alex Yakovlev. Petrify: a tool for manipulating 
concurrent specifications and synthesis of asynchronous controllers. IEICE 
Transactions on Information and Systems, E80-D(3):315-325, 1997.
[5] B. A. Davey and H. A. Priestley. Introduction to Lattices and Order. 
Cambridge University Press, Cambridge, England, 1990.
[6] Paul Day and J. Viv Woods. Investigation into micropipeline latch design 
styles. IEEE Transactions on VLSI Systems, 3(2):264-272, June 1995.
[7] Aristides Efthymiou and Jim D. Garside. Adaptive pipeline structures for 
speculation control. In Ninth International Symposium on Asynchronous 
Circuits and Systems, pages 46-55. IEEE, May 2003.
[8] S. B. Furber and J. Liu. Dynamic logic in four-phase micropipelines. 
In Second International Symposium on Advanced Research in 
Asynchronous Circuits and Systems, pages 11-16. IEEE Computer Society 
Press, March 1996.
[9] Stephen B. Furber. A small compendium of 4-phase micropipeline 
latch control circuits. Technical Report v0.3, 17/01/99, University of 
Manchester, Dept. of Computer Science, 1999.
[10] Stephen B. Furber and Paul Day. Four-phase micropipeline latch control 
circuits. IEEE Transactions on VLSI Systems, 4(2):247-253, June 1996.
[11] J. D. Garside, S. B. Furber, and S-H Chung. AMULET3 Revealed. In 
5th International Symposium on Advanced Research in Asynchronous 
Circuits and Systems, pages 51-59, April 1999.
[12] G. Graetzer. Lattice Theory: First concepts and distributive lattices. W. 
H. Freeman and Company, San Francisco, 1971.
[13] Rakefet Kol and Ran Ginosar. A doubly-latched asynchronous pipeline. 
In Proceedings of the International Conference on Computer Design 
(ICCD), pages 706-711, Oct 1996.
[14] M. Lewis, J. D. Garside, and L. E. M. Brackenbury. Reconfigurable 
latch controllers for low power asynchronous circuits. In International 
Symposium on Asynchronous Circuits and Systems, pages 27-35, April 
1999.
[15] Andrew M. Lines. Pipelined asynchronous circuits. Master’s thesis, 
California Institute of Technology, Pasadena, CA, 1998.
[16] JianWei Liu. Arithmetic and Control Components for an Asynchronous 
System. PhD thesis, Department of Computer Science, University of 
Manchester, 1997.
[17] Peggy B. McGee and Steven M. Nowick. A Lattice-Based Framework 
for the Classification and Design of Asynchronous Pipelines. In 
Proceedings of the Design Automation Conference (DAC05), pages 
491-496. IEEE/ACM, June 2005.
[18] Robin Milner. Communication and Concurrency. Computer Science. 
Prentice Hall International, London, 1989.
[19] Faron G. Moller and Perdita Stevens. The Edinburgh Concurrency 
Workbench (Version 7). University of Edinburgh, October 1992.
[20] Kenneth S. Stevens, Yang Xu, and Vikas Vij. Characterization of 
Asynchronous Templates for Integration into Clocked CAD Flows. In 
15th International Symposium on Asynchronous Circuits and Systems, 
pages 151-161. IEEE, May 2009.
[21] Colin Stirling. An Introduction to Modal and Temporal Logics for CCS. 
In A. Yonezawa and T. Ito, editors, Concurrency: Theory, Language, and 
Architecture, number 491 in LNCS, pages 2-20. Springer-Verlag, 1991.
[22] Ivan E. Sutherland. Micropipelines. Communications of the ACM, 
32(6):720-738, June 1989. Turing Award Paper.
[23] Santosh N. Varanasi. Performance Analysis of Four-Phase Untimed 
Asynchronous Handshake Protocols. Master’s thesis, University of Utah, 
Salt Lake City, Utah, May 2009.
[24] Yang Xu and Kenneth S. Stevens. Automatic Synthesis of Computation 
Interference Constraints for Relative Timing. In 26th International 
Conference on Computer Design, pages 16-22. IEEE, Oct. 2009.
[25] Eslam Yahya and Marc Renaudin. QDI Latches Characteristics and 
Asynchronous Linear-Pipeline Performance Analysis. In Integrated 
Circuit and System Design, Power and Timing Modeling, Optimization 
and Simulation, Lecture Notes in Computer Science, pages 583-592. 
Springer, 2006.
[26] Kenneth Y. Yun, Peter A. Beerel, and Julio Arceo. High-performance 
asynchronous pipeline circuits. In Second International Symposium on 
Advanced Research in Asynchronous Circuits and Systems, pages 17-28. 
IEEE Computer Society Press, March 1996.