Slack Matching Quasi Delay-Insensitive Circuits by Prakash, Piyush & Martin, Alain J.
Slack Matching Quasi Delay-Insensitive Circuits
Piyush Prakash, Alain J. Martin
Department of Computer Science
California Institute of Technology
Pasadena, CA 91125
Abstract— Slack matching is an optimization that determines
the amount of buffering that must be added to each channel of
a slack elastic asynchronous system in order to reduce its cycle
time to a speciﬁed target. We present two methods of expressing
the slack matching problem as a mixed integer linear program-
ming problem. The ﬁrst method is applicable to systems com-
posed of either full-buffers or half-buffers but not both. The sec-
ond method is applicable to systems composed of any combination
of full-buffers and half-buffers.
I. INTRODUCTION
Quasi delay-insensitive (QDI) circuits are typically described
as collections of independent processes that share values exclu-
sively through message passing. Instead of a global clock, local
handshakes are used for synchronization. It has been observed
that the cycle time of an asynchronous system can be greater
than that of the slowest module in the system running on its
own. Systems are often designed with a target cycle time, τ0.
Slack matching is a technique to reduce a system’s cycle time by
inserting buffering into communication channels. Slack match-
ing is only guaranteed to be safe on systems that are slack elas-
tic, i.e., systems such that an arbitrary amount of buffering can
be added to any communication channel without affecting the
correctness of the system [5]. We study the problem of adding
buffers to a slack elastic system to reduce the overall cycle time
of the system to τ0.
Slack matching has been compared to the retiming problem
in synchronous design. While slack matching has been shown
to be NP-complete [2], retiming for optimal throughput can be
solved in polynomial time [3].
We restrict our attention to slack matching systems consist-
ing of processes that use each communication channel exactly
once per cycle. Furthermore, we assume that the communica-
tion actions are implemented using four phase handshakes.
The static slack of a pipeline is the maximum number of
messages that can be inserted into the pipeline, with none being
removed. If a pipeline of n instances of a process has static
slack n/2, the process is said to be a half-buffer. If a pipeline
of n instances of a process has static slack n, the process is
said to be a full-buffer.
Let p be a pipeline, with p.in being its input channel and
p.out being its output channel. Let p operate in an environment
that keeps the number of messages in p constant. Lines [4]
and Williams [10] have shown that the throughput (reciprocal
of cycle time) of p varies with the number of messages, m, in p.
Lines [4] showed that a plot of throughput of p versus m resembles
either a triangle or trapezoid. Figure 1 shows an example
of such a plot. The plot can be divided into three regions based
upon its slope.
1) Each pipeline stage has some delay between the start of
a handshake on its input and the start of the correspond-
ing handshake on an output. If the cycle time of p, with
m messages is τ , then the mean delay between two mes-
sages in p must be less than p’s cycle time, τ . Increasing
m reduces this mean delay, allowing p to operate at a
higher throughput. This corresponds to the region where
the slope is positive.
2) If a pipeline stage does not contain its static slack number
of messages, it can acknowledge events on its input chan-
nels without waiting for an acknowledgment on its out-
put channel. However, as m is increased, some pipeline
stages will contain a number of messages equal to their
static slack. Such stages introduce stalls in the pipeline,
waiting for an acknowledgment on the output channel be-
fore producing an acknowledgment on the input channel.
When this happens, increasing m increases the number
of pipeline stages that stall in this manner. Each stall
sequences events that could occur concurrently, were m
to be smaller. Thus increasing m reduces the system’s
throughput. This corresponds to the region where the
slope is negative.
3) Each pipeline stage must also have some internal limits to
its cycle time. This corresponds to the region of the plot
where the slope is zero. If 1 and 2 always constrain
p’s cycle time to be above the internal cycle time of all
pipeline stages in p, then the plot is a triangle. However,
if 1 and 2 do not limit p’s cycle time to be above the
internal cycle time of all pipeline stages, then p can only
operate as fast as the stage with the greatest internal cycle
time and the graph is a trapezoid.
Let dmin(p.in, p.out, τ0) and dmax(p.in, p.out, τ0) be re-
spectively the minimum and maximum number of messages
that pipeline p can contain while operating with cycle time τ0 or
less. The interval [dmin(p.in, p.out, τ0), dmax(p.in, p.out, τ0)]
is the dynamic slack of p at cycle time τ0.
A. Motivation
Consider the system shown in ﬁgure 2, which has recon-
vergent fanout. Let the system be such that when a message
is inserted into pipeline A, a message must also be inserted
into pipeline B. Similarly, when a message is removed from
pipeline A, a message is also removed from pipeline B. Thus,
there is a ﬁxed relationship between the number of messages in
Proceedings of the 12th IEEE International Symposium on Asynchronous Circuits and Systems (ASYNC’06) 
0-7695-2498-2/06 $20.00 © 2006 IEEE 
Authorized licensed use limited to: CALIFORNIA INSTITUTE OF TECHNOLOGY. Downloaded on April 15,2010 at 18:11:06 UTC from IEEE Xplore.  Restrictions apply. 
Fig. 1. Throughput of a pipeline versus number of messages. (Plot: throughput on the vertical axis, reaching 1/τ0; number of messages on the horizontal axis, with dmin(p.in, p.out, τ0) and dmax(p.in, p.out, τ0) marked.)
Fig. 2. System with reconvergent fanout. (Diagram labels: Pipeline A, Pipeline B, P1, P2, Q1, S, F, M, T.)
the two pipelines. Let A and B both be initialized with no mes-
sages. When the system operates at cycle time τ0, the number
of messages in pipelines A and B must be within their respec-
tive dynamic slacks at τ0. If the dynamic slacks do not inter-
sect, then the cycle time of one of the pipelines is greater than
τ0, hence the system’s cycle time must also be greater than τ0.
Adding buffers to a pipeline p increases dmin(p.in, p.out, τ0)
and dmax(p.in, p.out, τ0); thus, additional buffers can be added
to either A or B until the dynamic slacks at τ0 of the two
pipelines intersect.
Figure 3 shows how the cycle time of a ring of half-buffers,
that contains one message, varies with the number of buffers
on the ring. Slack matching does not change the number
of messages on a ring; it can only increase the number of
buffers on a ring r. Let r.a be a channel of r. Increas-
ing the number of buffers on r raises dmax(r.a, r.a, τ0) and
dmin(r.a, r.a, τ0). The number of messages is dictated by the
algorithm the system implements and is assumed ﬁxed. Con-
sider a ring, r, with m messages on the ring. If m is within
r’s dynamic slack at τ0, then there is no need to slack match
the ring; it already operates at the target cycle time. If m is
greater than dmax(r.a, r.a, τ0), buffers can be added to r un-
til m ∈ [dmin(r′.a, r′.a, τ0), dmax(r′.a, r′.a, τ0)] where r′ is a
ring obtained by adding buffers to r. Thus, in this case slack
matching can be used to improve the ring’s cycle time. If m
is less than dmin(r.a, r.a, τ0), slack matching cannot be used
to improve the cycle time of the ring since adding buffers will
only increase dmin(r.a, r.a, τ0) and dmax(r.a, r.a, τ0).
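The three cases for a ring can be summarized in a small sketch. The function name `classify_ring` and its return strings are hypothetical and chosen only for illustration; the inputs are the message count m and the endpoints of the ring's dynamic slack at τ0.

```python
def classify_ring(m, d_min, d_max):
    """Classify a ring with m messages against its dynamic slack [d_min, d_max].

    Hypothetical helper: the labels below paraphrase the three cases in the text.
    """
    if d_min <= m <= d_max:
        # m already lies in the dynamic slack: the ring meets the target.
        return "meets target"
    if m > d_max:
        # Adding buffers raises d_min and d_max, so the interval can reach m.
        return "add buffers"
    # m < d_min: adding buffers only moves the interval further away from m.
    return "cannot be slack matched"


if __name__ == "__main__":
    print(classify_ring(2, 1, 3))
    print(classify_ring(4, 1, 3))
    print(classify_ring(0, 1, 3))
```

Only the second case benefits from slack matching, because buffers can be added but the number of messages is fixed by the algorithm the system implements.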
Fig. 3. Relationship between cycle time of a ring and the number of buffers on the ring. (Plot title: Cycle Time of a Ring of Buffers vs. Number of Buffers; vertical axis: cycle time, 14 to 30; horizontal axis: number of buffers, 0 to 16.)
B. Outline
This paper presents two methods of slack matching a system
by solving a mixed integer linear program (MILP). The systems
are described as handshaking expansion (HSE), with the HSE
annotated with delays. The HSE notation used in this paper is
described in the appendix. Event-Rule (ER) systems [1] con-
structed from the HSE are analyzed to determine the system’s
cycle time. Section II reviews some basic results about ER systems.
In section III, we state sufficient conditions under which a
system composed of a specified class of half-buffers has cycle
time less than or equal to the target, τ0. In section IV, we show
how to generate a polynomial sized set of constraints that must
be satisﬁed in order for the system to be slack matched. In sec-
tion V, we present results from the application of this method
to control units in two asynchronous microprocessors. In sec-
tion VI, we express the problem of slack matching systems
composed of full-buffers, half-buffers and slack-zero buffers as
one of satisfying a set of linear equations with a subset of the
variables being restricted to the integers. When the cost func-
tion is linear in the number of slack matching buffers added,
mixed integer linear programming (MILP) solvers can be used
to slack match a system. Section VII relates our results to prior
work. We conclude in section VIII with a discussion of future
work.
II. EVENT-RULE SYSTEMS
In the following sections, we will be analyzing the ER system
that corresponds to a collection of HSE to determine a system’s
cycle time. In this section we review the deﬁnition of ER sys-
tems [1] and state the results about ER systems that are used in
the analysis.
An ER system is a pair (E,R) where E is a set of events
and R is a set of rules that constrain the timings of the events.
Each rule of a general ER system, r ∈ R, is of the form e α→ f ,
where e ∈ E and f ∈ E are respectively the source and target
of r and α ∈ [0,∞) is the delay of r. The set of sources of an
event f is {e|e α→ f ∈ R}. Similarly, the set of targets of an
event e is {f |e α→ f ∈ R}. Whilst E and R may be inﬁnite, the
set of sources of any event e must be ﬁnite.
A timing function of an ER system is any function $t : E \to [0,\infty)$ such that $t(f) \ge t(e) + \alpha$ for every rule $e \xrightarrow{\alpha} f \in R$. The timing simulation, $\hat{t}$, of an ER system is the timing function such that for any other timing function $t$, $\hat{t}(e) \le t(e)$ for all $e \in E$.
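As an illustration of these definitions, the following sketch computes the timing simulation of a small ER system. It assumes the system is finite and acyclic, which the text does not require; the function names and the three-event example are hypothetical.

```python
def timing_simulation(events, rules):
    """Return the least timing function of a finite, acyclic ER system.

    rules is a list of (source, target, delay) triples. Assumption: no cycles,
    so the recursion below terminates.
    """
    preds = {e: [] for e in events}
    for src, dst, delay in rules:
        preds[dst].append((src, delay))
    t_hat = {}

    def earliest(e):
        # An event fires as soon as every rule targeting it is satisfied;
        # events with no sources fire at time 0.
        if e not in t_hat:
            t_hat[e] = max((earliest(s) + d for s, d in preds[e]), default=0.0)
        return t_hat[e]

    for e in events:
        earliest(e)
    return t_hat


def is_timing_function(t, rules):
    """Check t(f) >= t(e) + alpha for every rule and t >= 0 everywhere."""
    return all(v >= 0 for v in t.values()) and all(
        t[f] >= t[e] + alpha for e, f, alpha in rules
    )


if __name__ == "__main__":
    rules = [("a", "b", 2), ("a", "c", 1), ("b", "c", 3)]
    print(timing_simulation(["a", "b", "c"], rules))
```

Any other timing function of the same rules, for example one that delays every event by a constant, is pointwise greater than or equal to the result.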
A. Repetitive ER Systems
Oftentimes, the behavior of a system of unbounded size can
be described by a ﬁnite structure. Consider the set of events E
generated from a ﬁnite set E′ by E = E′×N. The elements of
E′ are referred to as transitions. Each event (u, i) ∈ E is the
ith occurrence of transition u.
The set of rules $R$ can be generated from a finite set of 4-tuples $R'$ such that $r' = (u, v, \alpha, \varepsilon) \in R'$ where $R' \subseteq E' \times E' \times [0,\infty) \times \mathbb{Z}$. $r'$ is written as $(u, i - \varepsilon) \xrightarrow{\alpha} (v, i)$. $u$ and $v$ are respectively the source and target transitions of $r'$. $\alpha$ is the delay between the $(i - \varepsilon)$th occurrence of transition $u$ and the $i$th occurrence of transition $v$. $\varepsilon$ is said to be the occurrence index offset of $r'$. The rule set, $R$, is generated by instantiating (an infinite number of times) each element of $R'$. In order to ensure that both the source and target of each rule are in $E$, it is required that $i \ge \varepsilon$. The pair (E′, R′) is said to be a repetitive
ER system, and (E,R) the general ER system generated from
the repetitive ER system.
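The instantiation step can be sketched as follows. Since the general rule set is infinite, this hypothetical helper `instantiate` only unrolls the first `n_occurrences` occurrences of each transition; the truncation and the tuple layout are assumptions made for illustration.

```python
def instantiate(repetitive_rules, n_occurrences):
    """Unroll repetitive rules (u, v, alpha, offset) into general rules.

    Each result is ((u, i - offset), (v, i), alpha). Only instances whose
    source and target both fall in occurrences 0 .. n_occurrences - 1 are
    kept, which is a finite stand-in for the requirement that both lie in E.
    """
    general_rules = []
    for u, v, alpha, offset in repetitive_rules:
        for i in range(n_occurrences):
            if 0 <= i - offset < n_occurrences:
                general_rules.append(((u, i - offset), (v, i), alpha))
    return general_rules


if __name__ == "__main__":
    # Two transitions: a enables b in the same occurrence, and b enables
    # the next occurrence of a (occurrence index offset 1).
    for rule in instantiate([("a", "b", 2, 0), ("b", "a", 3, 1)], 2):
        print(rule)
```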
A linear timing function, $\bar{t}$, is a timing function such that for each event $e = (u, i)$, $\bar{t}(u, i) = x_u + i \cdot p_u$ where $x_u$ is the offset of transition $u$ and $p_u$ its period. The collapsed constraint graph, $G'$, of $(E', R')$ is the directed graph such that each node corresponds to a transition and each edge corresponds to an element of $R'$. Burns [1] shows that if two nodes, $u$ and $v$, lie on the same cycle in $G'$, then $p_u = p_v$. If the collapsed constraint
graph of a repetitive ER system is strongly connected, all nodes
have the same period, p.
The minimum period linear timing function of a repetitive
ER system is the linear timing function that minimizes p. The
minimum value of p can be shown to be a number arbitrarily
close to the average time between successive occurrences of a
transition u. This value of p is said to be the cycle time of the
repetitive ER system. Burns [1] shows that the cycle time of
a repetitive ER system can be determined from its collapsed
constraint graph as
$$p = \max_{\text{all cycles } c} \left\{ \frac{\text{sum of delays on } c}{\text{sum of occurrence index offsets on } c} \right\}. \tag{1}$$
Burns further showed that in maximizing equation (1), only
simple cycles (ones that do not have sub-cycles) need to be considered.
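Equation (1) can be evaluated directly on a small collapsed constraint graph by enumerating its simple cycles. The sketch below is a brute-force illustration, not the method used in [1]; the edge layout `(u, v, delay, offset)` and the function name `cycle_time` are assumptions, and the enumeration is exponential in the worst case.

```python
from fractions import Fraction


def cycle_time(edges):
    """Maximum of (sum of delays) / (sum of offsets) over all simple cycles.

    edges: list of (u, v, delay, offset). Assumption: every cycle has a
    positive offset sum; otherwise the ratio is undefined and we raise.
    """
    nodes = sorted({n for u, v, _, _ in edges for n in (u, v)})
    rank = {n: k for k, n in enumerate(nodes)}
    out = {n: [] for n in nodes}
    for u, v, delay, offset in edges:
        out[u].append((v, delay, offset))
    best = None

    def dfs(start, node, delay, offset, visited):
        nonlocal best
        for nxt, d, o in out[node]:
            if nxt == start:
                if offset + o <= 0:
                    raise ValueError("cycle with non-positive offset sum")
                ratio = Fraction(delay + d) / (offset + o)
                if best is None or ratio > best:
                    best = ratio
            elif rank[nxt] > rank[start] and nxt not in visited:
                # Only visit higher-ranked nodes so each simple cycle is
                # found once, from its lowest-ranked node.
                dfs(start, nxt, delay + d, offset + o, visited | {nxt})

    for s in nodes:
        dfs(s, s, 0, 0, {s})
    return best


if __name__ == "__main__":
    print(cycle_time([("x", "y", 3, 1), ("y", "x", 5, 1)]))
```

In the two-node example the single cycle has total delay 8 and offset sum 2, so the ratio in equation (1) is 4.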
B. Pseudo-repetitive ER Systems
Whilst many systems can be modeled as repetitive ER sys-
tems, many systems of interest consist of a ﬁnite set of ini-
tial events followed by a set of repeated transitions. Such sys-
tems are modeled as pseudo-repetitive ER systems. A pseudo-
repetitive ER system is a 6-tuple (E0, E1, E′0, E′1, R0, R′1)
where E0 is a ﬁnite set of initial events, E′0 a ﬁnite set of ini-
tial transitions, E1 an inﬁnite set of repeated events, E′1 a ﬁnite
set of repeated transitions, R0 a finite set of initial rules and
R′1 a ﬁnite set of repeated rules. Note that E0 ⊂ E′0 × N and
E1 ⊆ E′1 × N. The elements of R0 take the form of rules in a
general ER system, and their source must be an initial event.
The elements of R′1 are of the form of rules in a repetitive
ER system. Both the source and the target of a repeated rule
must be repeated transitions. The general ER system that cor-
responds to a pseudo-repetitive ER system is constructed by
setting E = E0 ∪ E1 and letting R be the union of R0 and the
rules of R′1 instantiated in such a manner that both the target
and source of the rule are in E1.
Burns [1] shows that the cycle time of a pseudo-repetitive
ER system is approximated by that of the repetitive system
(E′1, R′1).
Example 1: Figure 4 shows the collapsed constraint graph
of the repetitive part of the pseudo-repetitive ER system that
describes the HSE:
PCHB ≡ lo↑; [ri]; ro↓; [¬li]; lo↓;
*[[¬ri ∧ li]; ro↑; lo↑; [ri]; ro↓; [¬li]; lo↓]
All edges have an occurrence index offset of 0 unless they are
marked with a rectangular box which denotes an occurrence
index offset of 1.
Fig. 4. Constraint graph of PCHB. (Diagram nodes: li↑, li↓, lo↑, lo↓, ri↑, ri↓, ro↑, ro↓; li↑ and ro↑ each appear twice.)
III. SUFFICIENT CONDITIONS FOR SLACK MATCHING
SYSTEMS OF HALF-BUFFERS
In this section, we state sufﬁcient conditions that guarantee
that the cycle time of a system composed of half-buffers is at
or below the target cycle time, τ0. Rather than consider the col-
lapsed constraint graph of the entire system, we represent the
system as a process graph. First we deﬁne a process graph.
Next we constrain the HSEs of the processes that comprise the
system. We then label paths in the collapsed constraint graph
of a process satisfying the aforementioned restrictions and state
assumptions about the delays and occurrence index offsets of
these paths. Given these assumptions, we state sufﬁcient con-
ditions on the process graph of the system to guarantee that the
system’s cycle time is at most τ0. These conditions ensure that
each pipeline in the system can simultaneously contain a num-
ber of messages within its dynamic slack.
A process graph G = (V,E) is a directed graph where each
vertex represents a process, and each edge a channel between
two processes. An edge e = (u, v) denotes a channel that is
an output channel of process u and input channel of v. The
source of edge e = (u, v), src(e), is deﬁned to be u and the
sink, snk(e), is deﬁned to be v. A path in a graph consists
of a sequence of one or more distinct edges {ei}, such that
src(ei+1) = snk(ei). The length of path p, denoted as |p|,
is the number of edges that comprise the path. For any path
p, the source of p, src(p), is src(e0) and the sink of p, snk(p),
is snk(e|p|−1). A cycle is a path p such that src(p) = snk(p).
An undirected path between vertices a and b is a sequence of
n ≥ 1 distinct edges, {ei} such that there exists a sequence of
n+1 vertices {vi} with the property that for all edges ei, either
src(ei) = vi∧snk(ei) = vi+1 or snk(ei) = vi∧src(ei) = vi+1.
Furthermore v0 = a and vn = b. Let S(u) denote the set of
edges e such that either e = u or there exists an undirected
path in G between snk(u) and src(e) that traverses neither u
nor e. Let K(u) denote the set of edges e such that there is an
undirected path in G between snk(u) and snk(e) that traverses
neither u nor e.
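The sets S(u) and K(u) can be computed by a literal reading of the definitions above. The sketch below is one such reading, under stated assumptions: edges are identified by their index in a list of `(src, snk)` pairs, and an undirected path must contain at least one edge, so when the two endpoints coincide a closed path is required. The six-edge example graph is hypothetical.

```python
from collections import deque


def _joined(edges, banned, a, b):
    """True if an undirected path of >= 1 distinct edges joins a and b
    without using any edge index in banned."""

    def reach(start, goal, skip):
        seen, queue = {start}, deque([start])
        while queue:
            v = queue.popleft()
            if v == goal:
                return True
            for k, (s, t) in enumerate(edges):
                if k in skip or v not in (s, t):
                    continue
                w = t if v == s else s
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        return False

    if a != b:
        return reach(a, b, banned)
    # a == b: leave a on some edge k, then come back without reusing k.
    for k, (s, t) in enumerate(edges):
        if k in banned or a not in (s, t):
            continue
        w = t if a == s else s
        if w == a or reach(w, a, banned | {k}):
            return True
    return False


def S(edges, u):
    """Edges e with e == u, or snk(u) joined to src(e) avoiding u and e."""
    return {e for e in range(len(edges))
            if e == u or _joined(edges, {u, e}, edges[u][1], edges[e][0])}


def K(edges, u):
    """Edges e with snk(u) joined to snk(e) avoiding u and e."""
    return {e for e in range(len(edges))
            if _joined(edges, {u, e}, edges[u][1], edges[e][1])}


if __name__ == "__main__":
    # Hypothetical fork F feeding two branches that rejoin at M.
    g = [("S", "F"), ("F", "A"), ("F", "B"), ("A", "M"), ("B", "M"), ("M", "T")]
    print(sorted(S(g, 0)), sorted(K(g, 0)))
```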
We only consider systems composed of processes such that
their repetitive portion is an implementation of one of the following
reshufflings, presented in Lines [4].
PCHB ≡ *[[¬ri ∧ li]; ro↑; lo↑; [ri]; ro↓; [¬li]; lo↓]
WCHB ≡ *[[¬ri ∧ li]; ro↑; lo↑; [ri ∧ ¬li]; ro↓; lo↓]
B1 ≡ *[[¬ri ∧ li]; ro↑; lo↑; [ri ∧ ¬li]; lo↓; ro↓]
B4 ≡ *[[¬ri ∧ li]; ro↑; lo↑; [ri]; ro↓, ([¬li]; lo↓)]
B5 ≡ *[[¬ri ∧ li]; ro↑; lo↑; [ri ∧ ¬li]; ro↓, lo↓]
If a process has more than two channels, we assume that the
channels of the process can be partitioned into two disjoint sets
L and R such that the projection of the process’s HSE onto
the variables implementing any pair of channels Li ∈ L and
Rj ∈ R is a half-buffer with one of the aforementioned reshufflings.
The variables implementing Li correspond to the variables
li and lo in these reshufflings, and the variables implementing
Rj correspond to the variables ri and ro. The channels in L
are said to be input channels of the process, and the channels in
R are output channels. We adopt the convention that the four
phase handshake on channel Lj is implemented by the variables
lij and loj, where loj is the variable assigned to by the process
and lij is a variable that is only read by the process. Similarly,
the variables rij and roj implement the communication chan-
nel Rj .
In order to determine the occurrence index offsets in the
repetitive ER system, we need to specify the initial state of
the system. We assume that each process, P , can be writ-
ten as a straight line handshaking expansion [1] of the form
P ≡ S; ∗[T ] such that the following conditions hold:
• All assignments occur at most once in S and exactly once
in T .
• The projection of T onto the statements in S is S.
• If an assignment, a, appears in S, then S also contains all
assignments a′ such that a; a′ appears in the projection of
T onto the set {a, a′}.
• If a wait on a variable w, [f(w)] appears in S, then S also
contains all assignments a such that [f(w)]; a appears in
the projection of T onto the set {w, a}.
These restrictions ensure that each channel is used exactly once
per cycle of the process. Since the ith occurrence of any assign-
ment can only depend on either the (i− 1)th or ith occurrence
of a preceding assignment, all occurrence index offsets in the
repetitive ER system that models such a process are either 0
or 1. An LR buffer can be initialized to either a state where
the ﬁrst communication action on the output channel precedes
the ﬁrst communication on the input channel or vice versa. In
the former case, there is an initial communication on the output
channel.
Source \ Sink   lo_k↑       lo_k↓       ro_k↑       ro_k↓
li_j↑           σ^{jk}_0    σ^{jk}_1    ρ^{jk}_0    ρ^{jk}_1
li_j↓           σ^{jk}_2    σ^{jk}_3    ρ^{jk}_2    ρ^{jk}_3
ri_j↑           λ^{jk}_0    λ^{jk}_1    σ^{jk}_4    σ^{jk}_5
ri_j↓           λ^{jk}_2    λ^{jk}_3    σ^{jk}_6    σ^{jk}_7
TABLE I
CLASSIFICATION OF PATHS IN THE CONSTRAINT GRAPH OF A PROCESS
We state sufﬁcient conditions for a system to have cycle time
at most τ0 in terms of the dynamic slack at τ0 of each process
in the system. We prove that these conditions are sufﬁcient by
considering all cycles in the collapsed constraint graph of the
system. In order to facilitate this proof, we introduce the fol-
lowing notation for the paths in the collapsed constraint graph
of a process P .
The input variables of a process P are deﬁned to be vari-
ables that implement communication channels of P such that
the variables are not assigned to by P . The output variables
of a process P are deﬁned to be the variables that implement
communication channels of P and are assigned to by P . We
classify the paths between transitions on input variables of P
and output variables of P as follows. Each path has a label of
the form $a^{jk}_n$ where $a \in \{\sigma, \lambda, \rho\}$. Any path between a transition on a variable $\mathit{li}_j$ and a transition on variable $\mathit{ro}_k$ has $a = \rho$. Similarly, paths between a transition on variable $\mathit{ri}_j$ and $\mathit{lo}_k$ have $a = \lambda$. Paths between $\mathit{li}_j$ and $\mathit{lo}_k$ and those between $\mathit{ri}_j$ and $\mathit{ro}_k$ have $a = \sigma$. The superscript identifies the channels to which the source and sink of the path belong. The subscripts identify the input and output transitions, as shown in table I. We use the notation $\delta(p)$ to denote the sum of the delays of the rules that constitute the path $p$. Let $\varepsilon(p)$ denote the sum of the occurrence index offsets along $p$. The notation $a^{jk}_n(P_i)$ is used to denote an $a^{jk}_n$ path of process $P_i$.
Let the delays on paths between the input and output variables of a process, $i$, be bounded by $f^{jk}_i$, $b^{jk}_i$, $w^{jk}_i$, $s^{jk}_i$, $x^{jk}_i$ and $u^{jk}_i$ as in table II. The table also shows the occurrence index offsets on these paths in three cases: no initial communication on the input or output channels (N), an initial communication on the output channel (S) and an initial communication on the input channel (R). If the occurrence index offset of a path is $a$ greater than that shown in the table, the delay of the path is permitted to be $a\tau_0$ greater than the corresponding bound. Let $l$, $m$ and $n$ be processes such that channels $(l,m)$ and $(m,n)$ exist. Define $M((l,m),(m,n))$, the number of initial messages in process $m$ between channels $(l,m)$ and $(m,n)$, to be $\frac{1}{2}$ if there is an initial communication on either channel $(l,m)$ or $(m,n)$, 0 otherwise. Let $d_{min}((l,m),(m,n),\tau)$ be $\frac{f^{ln}_m}{\tau}$ and $d_{max}((l,m),(m,n),\tau)$ be $\frac{\tau - 2b^{nl}_m}{2\tau}$.
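The per-process quantities just defined are simple enough to compute exactly. The sketch below is illustrative only: the function names are hypothetical, `f` and `b` stand for the bounds written f and b above for one process, and exact rational arithmetic is an implementation choice rather than something the text prescribes.

```python
from fractions import Fraction


def d_min(f, tau):
    """Lower end of the dynamic slack of a process: f / tau."""
    return Fraction(f) / Fraction(tau)


def d_max(b, tau):
    """Upper end of the dynamic slack of a process: (tau - 2b) / (2 tau)."""
    return (Fraction(tau) - 2 * Fraction(b)) / (2 * Fraction(tau))


def initial_messages(initial_on_input, initial_on_output):
    """M for a process: 1/2 if either adjacent channel has an initial
    communication, 0 otherwise."""
    return Fraction(1, 2) if (initial_on_input or initial_on_output) else Fraction(0)


if __name__ == "__main__":
    # Hypothetical delays: f = 4, b = 3, target cycle time 20.
    print(d_min(4, 20), d_max(3, 20), initial_messages(False, True))
```

Note that the interval is non-empty exactly when f + b is at most τ/2, since f/τ ≤ (τ − 2b)/(2τ) rearranges to 2f + 2b ≤ τ.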
First we consider systems comprised of processes with ex-
actly one input channel and one output channel. We state re-
strictions on the collapsed constraint graphs of each process in
the system which allow us to determine whether the system’s
cycle time is at or below the target τ0.
Path(p)         δ(p)                   ε(p): N   ε(p): S   ε(p): R
λ^{jk}_0(i)     ≤ b^{jk}_i + τ0/2      1         1         0
λ^{jk}_1(i)     ≤ b^{jk}_i             0         0         0
λ^{jk}_2(i)     ≤ b^{jk}_i             1         0         0
λ^{jk}_3(i)     ≤ b^{jk}_i + τ0/2      1         0         1
ρ^{jk}_0(i)     ≤ f^{jk}_i             0         1         0
ρ^{jk}_1(i)     ≤ f^{jk}_i + τ0/2      1         1         1
ρ^{jk}_2(i)     ≤ f^{jk}_i + τ0/2      0         1         1
ρ^{jk}_3(i)     ≤ f^{jk}_i             0         0         1
σ^{jk}_0(i)     ≤ x^{jk}_i             1         1         0
σ^{jk}_1(i)     ≤ u^{jk}_i + τ0/2      1         1         1
σ^{jk}_2(i)     ≤ x^{jk}_i + τ0/2      1         1         1
σ^{jk}_3(i)     ≤ u^{jk}_i             0         0         1
σ^{jk}_4(i)     ≤ w^{jk}_i + τ0/2      0         1         0
σ^{jk}_5(i)     ≤ s^{jk}_i             0         0         0
σ^{jk}_6(i)     ≤ w^{jk}_i             0         0         0
σ^{jk}_7(i)     ≤ s^{jk}_i + τ0/2      1         0         1
TABLE II
BOUNDS ON DELAYS IN PROCESS i
Assumption 1: Let the input channel of any process i be la-
beled Lk and the output channel be labeled Rl. Furthermore,
let the delays and occurrence index offsets of paths between
the input and output variables of each process be bounded as in
table II.
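• For each process, $i$, $f^{kl}_i + b^{lk}_i \le \frac{\tau_0}{2}$.
• For each process, $i$, inequalities (2)–(5) hold.
• Inequalities (6)–(9) hold for all pairs of processes $i$ and $j$ such that output channel $R_l$ of $j$ is the input channel $L_k$ of $i$.
$$w^{kk}_i + x^{ll}_i \ge f^{lk}_i + b^{lk}_i \tag{2}$$
$$w^{kk}_i + u^{ll}_i \ge f^{lk}_i + b^{lk}_i \tag{3}$$
$$s^{kk}_i + x^{ll}_i \ge f^{lk}_i + b^{lk}_i \tag{4}$$
$$s^{kk}_i + u^{ll}_i \ge f^{lk}_i + b^{lk}_i \tag{5}$$
$$w^{kk}_i + x^{ll}_j \le \frac{\tau_0}{2} \tag{6}$$
$$w^{kk}_i + u^{ll}_j \le \frac{\tau_0}{2} \tag{7}$$
$$s^{kk}_i + x^{ll}_j \le \frac{\tau_0}{2} \tag{8}$$
$$s^{kk}_i + u^{ll}_j \le \frac{\tau_0}{2} \tag{9}$$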
Next, we state sufﬁcient conditions under which a system
comprised of such processes has cycle time at most τ0. This set
of sufﬁcient conditions is speciﬁed as a set of linear constraints
with a subset of the variables being restricted to the integers.
Note that since each process has exactly one input and one out-
put channel, the process graph of such a system is a collection
of independent rings and lines.
Lemma 1: Let G = (V,E) be the process graph of a system,
composed of processes that have exactly one input channel and
one output channel, satisfying assumption 1. Let the variables
$d_{(i,j),(j,k)}$ take on a value in the dynamic slack of process $j$ between channels $(i, j)$ and $(j, k)$. Let the variables denoted $S_{ab}$ be such that $S_{ab} + \sum_{i=0}^{|p|-2} M(e_i, e_{i+1})$ is in the dynamic slack of the pipeline corresponding to the path $p$, where $p$ is a path with $a$ as the first edge and $b$ as the last edge. The system has cycle time at most $\tau_0$ if there exist $S_{ab}$ and $d_{a,b}$ such that equations (10)–(13) hold.
$$d_{a,b} \in [d_{min}(a, b, \tau_0),\, d_{max}(a, b, \tau_0)] \quad \forall a, b \in E : src(b) = snk(a) \tag{10}$$
$$S_{aa} = 0 \quad \forall a \in E \tag{11}$$
$$S_{ab} = S_{ac} + d_{c,b} - M(c, b) + S_{bb} \quad \forall a, c, b \in E : c \in S(a) \wedge snk(c) = src(b) \tag{12}$$
$$S_{ab} = -d_{b,a} + M(b, a) \quad \forall a, b \in E : b \in S(a) \wedge snk(b) = src(a) \tag{13}$$
Proof: Intuitively, the lemma postulates that for a pipeline, $p$, composed of processes with exactly one input channel and one output channel,
$$d_{min}(p.in, p.out, \tau_0) \le \sum_{i=0}^{|p|-2} d_{min}(e_i, e_{i+1}, \tau_0)$$
and
$$d_{max}(p.in, p.out, \tau_0) \ge \sum_{i=0}^{|p|-2} d_{max}(e_i, e_{i+1}, \tau_0).$$
Constraints (10)–(13) enforce the requirement that it be possible for each
ring and pipeline in the system to simultaneously contain a
number of messages within its dynamic slack at τ0.
Let C be the collapsed constraint graph of the system repre-
sented by G. We prove the lemma by showing for all cycles in
C, the ratio of the delay along the cycle to the sum of occur-
rence index offsets along the cycle is at most τ0.
Let a handshake cycle be a cycle in C such that all the ver-
tices it traverses are either variables local to a process or are
variables that implement exactly one channel. By enumerating
all handshake cycles and computing upper bounds on the ratio
of the delay to the sum of the occurrence index offsets along the
cycle, it is easily seen that a handshake cycle cannot constrain
the system’s cycle time to be greater than τ0.
Consider the collapsed constraint graph of a pipeline of n >
0 processes P = {Pi} such that the output channel of process
Pi is an input channel of Pi+1.
Let the critical path between a pair of vertices in a collapsed
constraint graph be the path, p, such that δ(p) − τ0 · ε(p) is maximized.
We will state the critical path in a pipeline of processes
that satisfy the claim. We use these critical paths to show that
no cycle in the collapsed constraint graph of a ring of processes
satisfying the claim can constrain the ring’s cycle time to be
greater than τ0.
Consider all paths $p$ such that $src(p) \in \{P_0.li↑, P_0.li↓\}$, $snk(p) \in \{P_{n-1}.ro↑, P_{n-1}.ro↓\}$ and $p$ only traverses $\rho(P_i)$ paths. Let there be no initial messages on the input channel of $P_0$ and the output channel of $P_{n-1}$. It can be shown that if there are $m$ messages on the pipeline, the critical $\rho_0(P)$, $\rho_1(P)$, $\rho_2(P)$ and $\rho_3(P)$ paths have delay at most $\sum f_i$, $\frac{\tau_0}{2} + \sum f_i$, $\frac{\tau_0}{2} + \sum f_i$ and $\sum f_i$ respectively. The sum of occurrence index offsets is respectively $m$, $m + 1$, $m$ and $m$.
Similarly, consider paths $p$ such that $src(p) \in \{P_{n-1}.ri↑, P_{n-1}.ri↓\}$, $snk(p) \in \{P_0.lo↑, P_0.lo↓\}$ and $p$ only traverses $\lambda(P_i)$ paths. It can be shown that when the number of processes is $n$ such that $n = 2x + y$, where $x$ and $y$ are non-negative integers such that $y \in \{0, 1\}$, the critical $\lambda_0(P)$, $\lambda_1(P)$, $\lambda_2(P)$ and $\lambda_3(P)$ paths have delays at most $y \cdot \frac{\tau_0}{2} + \sum b_i$, $(1 - y) \cdot \frac{\tau_0}{2} + \sum b_i$, $(1 - y) \cdot \frac{\tau_0}{2} + \sum b_i$, and $y \cdot \frac{\tau_0}{2} + \sum b_i$ respectively. The sum of the occurrence index offsets is respectively $x + y - m$, $x - m$, $x - m + 1$ and $x + y - m$.
Consider a ring, r, of n processes {Pi} such that the output
channel of process Pi is the input channel of process Pi+1 and
the output channel of Pn−1 is the input channel of P0. Let Pr be
the pipeline Pr = {Pi}. Any cycle in the collapsed constraint
graph of r that traverses only ρ paths in each Pi must fall into
one of three cases.
1) The cycle consists of a $\rho_0$ path in $P_r$: (10)–(13) guarantee that $\sum d_{e_i, e_{(i+1) \bmod n}} = \sum M(e_i, e_{(i+1) \bmod n})$ for all rings. Thus, $\sum \frac{f_i}{\tau_0} \le m$. Recall that the critical $\rho_0$ path has delay at most $\sum f_i$ and the sum of the occurrence index offsets along this path is $m$. Thus the cycle cannot constrain the cycle time to be greater than $\tau_0$.
2) The cycle consists of a $\rho_3$ path in $P_r$: The same analysis as for the previous case holds.
3) The cycle consists of a $\rho_1$ path in $P_r$ followed by a $\rho_2$ path in $P_r$: The cycle has delay $\tau_0 + \sum 2f_i$ and the sum of the occurrence index offsets is $2m + 1$. Since (10)–(13) imply that $\sum \frac{f_i}{\tau_0} \le m$, such a cycle cannot constrain the cycle time to be greater than $\tau_0$.
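For case 3, the final step can be written out explicitly. The following short derivation uses only the delay and offset sums stated in that case, and shows that the cycle's ratio in equation (1) is at most $\tau_0$ exactly when $\sum \frac{f_i}{\tau_0} \le m$:

$$\frac{\tau_0 + 2\sum f_i}{2m + 1} \le \tau_0 \iff \tau_0 + 2\sum f_i \le 2m\tau_0 + \tau_0 \iff \sum f_i \le m\tau_0 \iff \sum \frac{f_i}{\tau_0} \le m.$$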
A similar analysis can be used to show that no cycle in the
collapsed constraint graph of ring {Pi} consisting solely of λ
paths can constrain the cycle time to be greater than τ0.
An inductive argument can then be used to show that no cycle
in C constrains the cycle time to be greater than τ0, proving the
lemma.
Next we consider systems composed of processes that have
at least one input channel and at least one output channel.
Assumption 2:
• Assumption 1 is satisfied for any system obtained by projecting each process onto a pair consisting of an input channel and an output channel.
• If $L_k$ and $L_n$ are input channels of process $i$ and $R_l$ and $R_m$ are output channels of $i$, then $f^{kl}_i + b^{mn}_i \le \frac{\tau_0}{2}$.
• For any pair of output channels Rj and Rk and any input
channel Li of process m, the constraints in table III hold.
• For any pair of input channels Lj and Lk and any output
channel Rl of process m, the constraints in table IV hold.
(a)
Path(p)      δ(p)               ε(p)
σ^{jk}_4     ≤ δ(σ^{kk}_4)      ε(σ^{kk}_4)
σ^{jk}_5     ≤ δ(σ^{kk}_5)      ε(σ^{kk}_5)
σ^{jk}_6     ≤ δ(σ^{kk}_6)      ε(σ^{kk}_6)
σ^{jk}_7     ≤ δ(σ^{kk}_7)      ε(σ^{kk}_7)
(b)
Path         Delay
σ^{jk}_4     ≤ f^{ik}_m + τ0/2
σ^{jk}_5     ≤ f^{ik}_m
σ^{jk}_6     ≤ f^{ik}_m
σ^{jk}_7     ≤ f^{ik}_m + τ0/2
TABLE III
CONSTRAINTS ON DELAYS OF PATHS IN PROCESSES WITH MULTIPLE OUTPUTS
(a)
Path(p)      δ(p)               ε(p)
σ^{jk}_0     ≤ δ(σ^{jj}_0)      ε(σ^{jj}_0)
σ^{jk}_1     ≤ δ(σ^{jj}_1)      ε(σ^{jj}_0)
σ^{jk}_2     ≤ δ(σ^{jj}_2)      ε(σ^{jj}_0)
σ^{jk}_3     ≤ δ(σ^{jj}_3)      ε(σ^{jj}_0)
(b)
Path         Delay
σ^{jk}_0     ≤ b^{lk}_m
σ^{jk}_1     ≤ b^{lk}_m + τ0/2
σ^{jk}_2     ≤ b^{lk}_m + τ0/2
σ^{jk}_3     ≤ b^{lk}_m
TABLE IV
CONSTRAINTS ON DELAYS OF PATHS IN PROCESSES WITH MULTIPLE INPUTS
When the processes have more than one input channel and
more than one output channel, two distinct pipelines may
have the same input channel and output channel. Thus, the dif-
ference in the number of messages in the two pipelines is con-
stant and equal to the difference in the number of initial mes-
sages in the pipelines. Intuitively, (10)–(13) ensure that in such
cases, it is possible for both pipelines to simultaneously contain
a number of messages within their dynamic slack.
Recall that if a message is sent on channel (a, b) then a mes-
sage is inserted into all pipelines that have input channel (a, b).
Similarly if a message is received on channel (c, d) then a mes-
sage has been removed from all pipelines that have (c, d) as
their output channel. Constraints of the form (14)–(17) ensure
that it is possible for all pipelines to simultaneously contain
a number of messages within their dynamic slack whilst still
obeying this relationship.
$$S_{ab} = K_{ac} + d_{i,c} - M(i, c) + M(i, b) + S_{bb} \quad \forall a, b, i, c \in E : c \in K(a),\ src(c) = snk(i) = src(b) \tag{14}$$
$$K_{ab} = K_{ac} - d_{b,c} - M(b, c) - S_{bb} \quad \forall a, b, c \in E : c \in K(a),\ src(c) = snk(b) \tag{15}$$
$$K_{ab} = S_{ac} - d_{c,o} + M(c, o) - M(b, o) - S_{bb} \quad \forall a, b, o, c \in E : c \in S(a),\ snk(c) = src(o) = snk(b) \tag{16}$$
$$K_{ab} = -d_{i,a} + M(i, a) - M(i, b) \quad \forall a, b, i \in E : b \in K(a),\ src(a) = src(b) = snk(i) \tag{17}$$
Lemma 2: Let G = (V,E) be the process graph of a system, composed of processes that have at least one input channel and at least one output channel, satisfying assumption 2. The system has cycle time at most τ0 if there exist $S_{ab}$, $d_{a,b}$ and $K_{ab}$ such that equations (10)–(17) hold.
Fig. 5. Pipeline with re-convergent fanout (node labels recovered from the figure: D, S, A, C, T, B)
Proof: We now proceed to show that no cycle in the sys-
tem’s collapsed constraint graph constrains the cycle time to be
greater than τ0. Consider a pipeline with reconvergent fanout,
as in figure 5. The critical $\rho_0$ path in pipeline A has delay at most $\sum_{i \in A} f_i$ and the sum of occurrence index offsets is $m_A$, the number of initial messages in pipeline A. Similarly, the critical $\lambda_0$ path in B has delay at most $y_B \cdot \frac{\tau_0}{2} + \sum_{i \in B} b_i$ and the sum of occurrence index offsets is $x_B + y_B - m_B$, where $n_B = 2x_B + y_B$ is the number of processes in pipeline B and $m_B$ the number of initial messages in the pipeline. Let the cycle contain a $\sigma^{jk}_4$ path in process C and a $\sigma^{jk}_0$ path in D.
The delay along this cycle is bounded by
$$ \sum_{i \in A} f_i + \sum_{i \in B} b_i + f_C + b_D + (1 + y_B)\,\frac{\tau_0}{2} \tag{18} $$
The sum of the occurrence index offsets along this path is given by $m_A + x_B + y_B - m_B + 1$. Assumption 2 and (10)–(17) guarantee that there exists $K_{(C,A)(C,B)}$ such that:
$$ K_{(C,A)(C,B)} \ge \sum_{i \in A} \frac{f_i}{\tau_0} + \frac{f_C}{\tau_0} - m_A \tag{19} $$
$$ K_{(C,A)(C,B)} \le \sum_{i \in B} \frac{\tau_0 - 2b_i}{2\tau_0} + \frac{\tau_0 - 2b_D}{2\tau_0} - m_B \tag{20} $$
From (19) and (20) it is easily seen that such a cycle cannot
constrain the cycle time of the system to be greater than τ0. A
similar analysis can be used to eliminate all other cycles in the
case of reconvergent fanout.
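The step from (19) and (20) to this conclusion can be spelled out as follows. This is a reconstruction from (18)–(20) and the offset sum stated above, not text from the original proof, and it assumes that the sum over B in (20) ranges over all $n_B = 2x_B + y_B$ processes of pipeline B.

Chaining (19) and (20) eliminates $K_{(C,A)(C,B)}$:
$$ \sum_{i \in A} \frac{f_i}{\tau_0} + \frac{f_C}{\tau_0} - m_A \;\le\; \frac{n_B + 1}{2} - \frac{1}{\tau_0}\Big(\sum_{i \in B} b_i + b_D\Big) - m_B . $$
Multiplying by $\tau_0$, substituting $n_B = 2x_B + y_B$ and rearranging gives
$$ \sum_{i \in A} f_i + f_C + \sum_{i \in B} b_i + b_D \;\le\; \tau_0 \Big( x_B + \frac{y_B}{2} + \frac{1}{2} + m_A - m_B \Big) . $$
Adding $(1 + y_B)\,\frac{\tau_0}{2}$ to both sides turns the left-hand side into the bound (18) and the right-hand side into
$$ \tau_0 \,\big( m_A + x_B + y_B - m_B + 1 \big), $$
which is $\tau_0$ times the sum of the occurrence index offsets along the cycle. The ratio of delay to occurrence index offsets on this cycle is therefore at most $\tau_0$.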
An inductive argument can then be used to show that any
cycle in the system’s constraint graph does not restrict the sys-
tem’s cycle time to be greater than τ0.
IV. SLACK MATCHING ALGORITHM
In this section, we describe an algorithm for slack matching
a system composed of processes satisfying assumption 2. We
formulate the slack matching problem as a mixed integer linear
program (MILP) that can be solved to determine the placement
of slack matching buffers. Channels that carry data are mod-
eled as data-less channels. If the system is not closed, a source
is connected to each of the system’s input channels and a sink
to each of the system’s output channels. It is assumed that the
environment is such that it does not constrain the system’s cycle
time to be greater than the target, τ0. We state the slack match-
ing problem as a MILP and show how to generate this MILP
efﬁciently.
Consider a process graph G = (V,E) representing a system comprised of processes satisfying assumption 2. The system represented by G is slack matched at τ0 if the condition of lemma 2 holds, i.e., if there exist Sab, da,b and Kab such that (10)–(17) hold.
A. MILP for slack matching
Slack matching is performed by adding buffers along com-
munication channels in such a manner that the resulting system
satisﬁes constraints (10)–(17). Slack matching only changes
a system by adding buffers to communication channels. Let
b be the buffer that is used for slack matching a system. Let
b.L and b.R respectively be the input and output channels of
b. [dmin(b.L, b.R, τ0), dmax(b.L, b.R, τ0)] denotes the dynamic
slack of b. Replacing constraints of the form See = 0 by ones
of the form (21) and (22), we obtain the set of constraints from
lemma 2 for the system with ne slack matching buffers added
to each channel e. Slack matching a system now reduces to
determining non-negative integers ne such that the system of
equations (10),(12)–(17) and (21)–(22) is satisﬁed.
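$$ S_{ee} \in [\, n_e \cdot d_{min}(b.L, b.R, \tau_0),\; n_e \cdot d_{max}(b.L, b.R, \tau_0) \,] \quad \forall e \in E \tag{21} $$
$$ n_e \in \mathbb{N} \quad \forall e \in E \tag{22} $$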
Since there may be multiple solutions to the set of equations,
any cost function linear in ne may be used to drive the opti-
mization.
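As a small illustration of constraint (21) taken in isolation, the sketch below finds the fewest buffers whose combined dynamic slack interval contains a required value, and evaluates a cost that is linear in the buffer counts. The function names, the numeric values, the use of exact fractions and the search cap `n_max` are all assumptions made for illustration. In the MILP itself, See is a variable and every ne is chosen jointly with the remaining constraints, so this is not a substitute for solving the program.

```python
from fractions import Fraction


def min_buffers(s_ee, d_min, d_max, n_max=64):
    """Smallest n with n*d_min <= s_ee <= n*d_max, or None if no n <= n_max works.

    Mirrors the interval in (21) for a single channel with a fixed target s_ee.
    """
    for n in range(n_max + 1):
        if n * d_min <= s_ee <= n * d_max:
            return n
    return None


def linear_cost(counts, weights):
    """A cost linear in n_e, e.g. weights[e] = assumed per-buffer energy on channel e."""
    return sum(weights[e] * counts[e] for e in counts)


# Hypothetical buffer with dynamic slack [1/4, 1/2]
n = min_buffers(Fraction(3, 2), Fraction(1, 4), Fraction(1, 2))
cost = linear_cost({"e1": 2, "e2": 1}, {"e1": 3, "e2": 5})
```

With these made-up numbers, two buffers cover at most a slack of 1, so three are needed to reach 3/2.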
B. Generating the MILP
The set of constraints for slack matching a system can be computed in $O(m^2 n^2)$ time where $m = |E|$ and $n = |V|$.
Constraints of the form (21) and (22) can be generated in $O(m)$ time by looping over the edge set.
It takes $O(mn + m^2)$ time to generate $m \times m$ matrices $S$ and $K$, such that
$$ S_{ij} = \begin{cases} 1, & i \in S(j) \\ 0, & i \notin S(j) \end{cases} \tag{23} $$
and
$$ K_{ij} = \begin{cases} 1, & i \in K(j) \\ 0, & i \notin K(j) \end{cases} \tag{24} $$
This is done by running m breadth-first searches (BFS), one on each graph Gm, where Gm is the undirected version of the graph obtained by removing edge m from G. The undirected version of a directed graph G = (V,E) is the graph G′ = (V,E′) where E′ is such that ∀(u, v) ∈ E : (u, v) ∈ E′ ∧ (v, u) ∈ E′. The BFS on Gm is rooted at snk(m) and is modified so that for each vertex v, it records all the edges in Gm that could possibly be the last edge on a path from snk(m) to v. The matrices S and K can be constructed from these lists of edges.
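One way to picture the modified search is sketched below. It is only an approximation of the bookkeeping described above: it records every edge over which the search arrives at a vertex, which covers walks as well as simple paths, and the exact rule used in the paper, as well as the step that turns these lists into S and K, is not reproduced here. The function name and the edge-list representation are assumptions.

```python
from collections import deque


def last_edges_bfs(vertices, edges, removed):
    """BFS on the undirected graph G with edge `removed` deleted, rooted at snk(removed).

    edges: list of (src, snk) pairs, identified by their index.
    Returns a dict mapping each vertex to the set of edge indices over which
    the search arrives at that vertex.
    """
    adj = {v: [] for v in vertices}
    for idx, (u, v) in enumerate(edges):
        if idx == removed:
            continue
        adj[u].append((v, idx))  # both directions: undirected version
        adj[v].append((u, idx))

    root = edges[removed][1]  # snk of the removed edge
    last = {v: set() for v in vertices}
    seen = {root}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for w, idx in adj[u]:
            last[w].add(idx)  # edge idx can end a walk root -> ... -> u -> w
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return last


# Hypothetical 4-process graph; remove edge 0 = (A, B), so the root is B
example = last_edges_bfs(
    ["A", "B", "C", "D"],
    [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")],
    removed=0,
)
```

Each search touches every remaining edge a constant number of times, so one search costs O(n + m) and m searches cost O(mn + m²), in line with the bound quoted above.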
Given matrix S, constraints of the form (10) and (13) can be generated in $O(m^2)$ time by looping over all pairs of edges.
There are $O(m^2 n)$ 3-tuples of edges (a, b, c) such that c ∈ S(a) ∧ snk(c) = src(b). Given matrices S and K, constraints of the form (12), (15) and (17) can be generated by simply looping over all such 3-tuples.
There are $O(m^2 n^2)$ 4-tuples of edges (a, b, c, i) such that c ∈ K(a) ∧ src(c) = snk(i) = src(b). Given matrices K and S, constraints of the form (14) and (16) can be generated in $O(m^2 n^2)$ time by looping over all such 4-tuples.
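The enumeration of 4-tuples for constraints of the form (14) can be sketched as below. The generator name and the edge-list representation are assumptions, and the contents of K in the usage example are made up; the matrix convention follows (24), so `K[c][a] == 1` means c ∈ K(a).

```python
def four_tuples(edges, K):
    """Yield (a, b, c, i) with c in K(a) and src(c) == snk(i) == src(b).

    edges: list of (src, snk) pairs, identified by their index.
    K: m x m 0/1 matrix with K[c][a] == 1 iff edge c is in K(a), as in (24).
    """
    out_edges, in_edges = {}, {}
    for idx, (u, v) in enumerate(edges):
        out_edges.setdefault(u, []).append(idx)
        in_edges.setdefault(v, []).append(idx)

    m = len(edges)
    for a in range(m):
        for c in range(m):
            if not K[c][a]:
                continue
            v = edges[c][0]  # the process at src(c)
            for b in out_edges.get(v, []):      # src(b) == src(c)
                for i in in_edges.get(v, []):   # snk(i) == src(c)
                    yield (a, b, c, i)


# Hypothetical graph: edge 0 = P->Q, edge 1 = Q->R, edge 2 = Q->T,
# with edge 1 assumed to belong to K(0)
edges = [("P", "Q"), ("Q", "R"), ("Q", "T")]
K = [[0, 0, 0], [1, 0, 0], [0, 0, 0]]
tuples = list(four_tuples(edges, K))
```

Iterating over pairs (a, c) first and then over the channels of the single process src(c) keeps the work proportional to the number of tuples produced.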
| Channel | # Buffers (hand) | # Buffers (MILP) |
|---|---|---|
| ExtControl - CBUFE | 1 | 1 |
| ExtControl - Router | 1 | 1 |
| ExtControl - IntCtrl | 1 | 1 |
| CBUF - IntCtrl | 1 | 1 |
| Router - SplD1 | 2 | 2 |
| Router - MrgI0 | 1 | 1 |
| Router - MrgExt | 3 | 3 |
| PCNH - PCIH | 1 | 0 |
| PCNH - PCUH | 1 | 1 |
| PCNL - PCUL | 1 | 1 |
| PCIL - PCUL | 1 | 1 |

TABLE V
SLACK MATCHING BUFFERS FOR LUTONIUM FETCH
Owing to cycle time constraints on the internal cycle, the number of channels of a process is usually bounded, and significantly smaller than n. In this case, the constraints can be generated in $O(k^4 n^2)$ time, where k is the bound on the number of channels of a process.
V. RESULTS
The algorithm from section IV was implemented in Modula-
3 [8] and used with glpsol, a freely available MILP solver.
Two large examples were studied: the fetch loop of the Lutonium [7], an asynchronous 8051 microcontroller, and a control loop in the fetch unit of the MiniMIPS microprocessor [6].
A. Example I: Lutonium Fetch Loop
This algorithm was used to slack match the fetch loop of the
Lutonium microcontroller. The objective function minimized
was the estimated energy consumption of the slack matching
buffers. Whilst the instruction memory is not implemented as a
pipeline of half buffers, it can be modeled as one. The memory
is modeled as a stage such that dmax = dmin, with dmin being
determined by the memory’s latency.
Figure 6 shows the fetch loop of the Lutonium microcon-
troller.
Table V shows the buffers needed to slack match the sys-
tem. The table also lists the results of slack matching when
performed by hand on the system. Observe that there are fewer buffers on the byte channel in the pc increment loop when slack matching is performed using this algorithm (the only row of Table V that differs is PCNH - PCIH, with 0 buffers instead of 1); all other channels have an identical number of buffers.
B. Example II: Control Loop of MiniMIPS
Figure 7 shows a loop in the fetch of the MiniMIPS. Table VI
shows the buffers required to slack match this loop. It also
shows the results when slack matching was performed by hand.
Note that extra buffers included when slack matching was per-
formed by hand may be needed because a ring composed of a
mixture of half buffers and full buffers was not included when
generating the MILP.
Fig. 7. Control Loop in the MiniMIPS fetch (process labels recovered from the figure: BJ, T, MPE, VA, IJ)
| Channel | # buffers (hand) | # buffers (MILP) |
|---|---|---|
| VA-MPE | 4 | 1 |
| IJ-MPE | 4 | 1 |

TABLE VI
SLACK MATCHING BUFFERS FOR THE CONTROL LOOP IN THE MINIMIPS FETCH
VI. ALGORITHM FOR SLACK MATCHING CIRCUITS
COMPOSED OF FULL-BUFFERS AND HALF-BUFFERS
We have presented an algorithm for slack matching systems composed solely of half-buffers. However, many systems of interest are composed of processes with heterogeneous static slacks. In this section, we present an algorithm to slack match such systems. We require that the buffers used for slack matching, referred to as slack matching buffers, satisfy assumption 1. This technique does not rely on the abstraction of dynamic slack. Instead, the algorithm determines a number of buffers to be added to each channel such that there are no cycles in the resulting system’s collapsed constraint graph that constrain the cycle time to be greater than τ0.
We restrict our attention to collections of HSEs composed of processes that can be written as a straight-line handshaking expansion [1] of the form P ≡ S; *[T] such that no assignment appears more than once in T. Given the collapsed constraint graph, C, of the system, S, to be slack matched, consider the collapsed constraint graph C′ of the system S′ obtained by adding nuv slack matching buffers to each channel (u, v) in S. A set of linear constraints can be derived to determine the values of nuv such that S′ is slack matched. Thus, if an objective function linear in nuv is used, S can be slack matched by solving an MILP.
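Recall from Burns [1] that the cycle time of a repetitive ER system, (E,R), can be determined as the minimum τ for which there exist offsets $a_i \ge 0$ such that
$$ a_v - a_u + \epsilon_{uv}\,\tau \ge \alpha_{uv} \quad \forall (u, v, \alpha_{uv}, \epsilon_{uv}) \in R. \tag{25} $$
Here $\alpha_{uv}$ is the delay and $\epsilon_{uv}$ the occurrence index offset of the rule from u to v.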
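For a fixed τ, the inequalities (25) form a system of difference constraints, so their feasibility can be checked with a Bellman-Ford style relaxation. The sketch below is one such check, not the method used in the paper; the function name and the tuple layout of a rule are assumptions. Starting every offset at zero and only ever raising it keeps all offsets non-negative.

```python
def feasible_offsets(vertices, rules, tau):
    """Return offsets a_i >= 0 satisfying (25) for this tau, or None if none exist.

    rules: list of (u, v, alpha, eps), each meaning a_v - a_u + eps*tau >= alpha.
    """
    a = {v: 0 for v in vertices}
    for _ in range(len(vertices)):
        changed = False
        for u, v, alpha, eps in rules:
            bound = a[u] + alpha - eps * tau  # smallest a_v this rule allows
            if bound > a[v]:
                a[v] = bound
                changed = True
        if not changed:
            return a
    # Still relaxing after |V| passes: some cycle has total delay
    # greater than tau times its total occurrence index offset.
    return None


# Hypothetical two-transition ring: total delay 5, total offset 1
rules = [("u", "v", 3, 0), ("v", "u", 2, 1)]
ok = feasible_offsets(["u", "v"], rules, 5)
too_fast = feasible_offsets(["u", "v"], rules, 4)
```

In this made-up ring the cycle has delay 5 and offset 1, so offsets exist for τ = 5 but not for τ = 4, which matches reading the cycle time as the smallest feasible τ.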
Fig. 6. Lutonium fetch loop (process labels recovered from the figure: IntCtrl, Irpt-Check, ExtControl, MrgI0, MrgExt, CBUFE, CBUF, CBUFIE, Router, PCUL, PCUH, PCNL, PCNH, PCIL, PCIH, IMEM, SplD0, SplD1, IDecode)
The number of edges and vertices in the collapsed constraint graph of a pipeline, Puv, of nuv slack matching buffers depends on nuv. Each of these edges has a constant delay and occurrence index offset. We will show how to represent the collapsed constraint graph of Puv as a graph with a constant number of edges and vertices such that the occurrence index offsets and delays of the edges depend on nuv. Let xuv and yuv be non-negative integers such that nuv = 2xuv + yuv with yuv ≤ 1. We represent the collapsed constraint graph of Puv as shown in figure 8. The vertices li↑ and ro↑ are duplicated for the sake of clarity.
Fig. 8. Constraint graph of Puv (vertices: li↑, lo↑, lo↓, li↓, ro↑, ri↑, ro↓, ri↓; li↑ and ro↑ are drawn twice)
Recall that the slack matching buffers satisfy assumption 1,
thus as shown in the proof of lemma 1, when nuv ≥ 1, the de-
lays along critical λ(Puv) and ρ(Puv) paths can be expressed
as linear functions of xuv and yuv. The sum of the occurrence
index offsets along these paths can also be expressed as lin-
ear functions of xuv and yuv. Similar techniques to those used
in the proof of lemma 1 can be used to determine the critical
σ(Puv) paths and show that the delay and sum of occurrence
index offsets along these paths are constant for nuv ≥ 1.
The delay of any path between a transition on an input vari-
able of Puv and a transition of an output variable of Puv is
zero when nuv = 0. When nuv = 0, the representation of
Puv can introduce cycles in C′ that would not be present, were
the pipeline Puv removed and the corresponding input and out-
put variables of Puv connected with wires. In particular, when
nuv = 0, λ1(Puv), λ2(Puv), ρ1(Puv), ρ2(Puv) and σ∗(Puv)
paths should not exist in the collapsed constraint graph of Puv .
Thus care must be taken to ensure that the introduction of these
paths does not result in cycles that artiﬁcially constrain the cy-
cle time of S′ to be greater than τ0.
Let Prs be a pipeline of slack matching buffers and t be a
transition on an input variable of Prs. Let At be the critical
path between Puv.ro↑ and t and Bt be the critical path between
Puv.ro↓ and t such that At and Bt do not traverse edges in
the constraint graph of any pipeline of slack matching buffers.
When the delay of the ρ1(Puv) edge is set to zero, and its oc-
currence index offset is chosen to be greater than (26), there
exists a cycle in C′ that traverses ρ1(Puv) which constrains the
cycle time to be greater than τ0 only if there exists a cycle in
C′ that traverses ρ0(Puv)(but not ρ1(Puv)) and constrains the
cycle time to be greater than τ0.
$$ \max_t \left\lceil \frac{\delta(B_t) - \delta(A_t)}{\tau_0} \right\rceil + \epsilon(A_t) - \epsilon(B_t). \tag{26} $$
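Evaluating (26) is a direct computation once the critical paths are known. The sketch below assumes the maximum is taken over the whole expression, since both At and Bt depend on t, and it assumes the delay and occurrence index offset of each path are already available; the function name and the input layout are hypothetical.

```python
import math
from fractions import Fraction


def rho1_offset_bound(paths, tau0):
    """Evaluate (26) over all transitions t.

    paths: dict mapping t to ((delta_A, eps_A), (delta_B, eps_B)), the delay and
    occurrence index offset of the critical paths A_t and B_t.
    Delays and tau0 should be exact (int or Fraction) so the ceiling is exact.
    """
    return max(
        math.ceil(Fraction(d_b - d_a) / tau0) + e_a - e_b
        for (d_a, e_a), (d_b, e_b) in paths.values()
    )


# Made-up path data for two transitions, with tau0 = 4
bound = rho1_offset_bound(
    {"t1": ((4, 1), (9, 0)), "t2": ((2, 0), (3, 1))},
    tau0=4,
)
```

Since the text asks for an occurrence index offset greater than (26), the smallest admissible integer offset in this example would be `bound + 1`.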
Similar lower bounds on the occurrence index offsets of ρ2(Puv), λ1(Puv), and λ2(Puv) paths can be derived. In order to determine a lower bound on the occurrence index offset of the σ0(Puv) path, consider any path, p, from Puv.ro↑ to Puv.ri↑ that does not traverse any edges in the collapsed constraint graph of any pipeline of slack matching buffers. The occurrence index offset of the σ0(Puv) path must be at least $\epsilon(p)$. The occurrence index offsets of the other σi(Puv) edges can be determined in a similar manner.
In order to express the delays and occurrence index offsets
of the various paths in Puv as a linear function of the program
variables, it is useful to determine whether nuv is zero. We
do this by introducing a binary variable zuv such that zuv ≤
nuv ≤ zuv · Nmax where Nmax is the maximum number of
slack matching buffers that can be placed on any channel.
Thus, given C′, we can construct inequalities of the form (25). Note that τ is set to be τ0, and $\epsilon_{ab}$ and $\alpha_{ab}$ may be linear functions of the three variables xuv, yuv and zuv if the rule (a, b) corresponds to a path in the collapsed constraint graph of Puv. Furthermore, we need to introduce the constraints nuv = 2xuv + yuv and zuv ≤ nuv ≤ zuv · Nmax. We can then solve a MILP with these constraints to determine nuv such that the objective function is minimized. The objective function must be linear in nuv.
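The effect of the two auxiliary constraints can be checked by brute force for a single channel. The sketch below lists every (x, y, z) that satisfies nuv = 2xuv + yuv with yuv ≤ 1 and zuv ≤ nuv ≤ zuv · Nmax for a given buffer count; the function name is an assumption, and a MILP solver would of course treat n, x, y and z as variables rather than enumerate them.

```python
def feasible_assignments(n, n_max):
    """All (x, y, z) with n == 2*x + y, y in {0, 1}, z in {0, 1}, z <= n <= z*n_max."""
    result = []
    for z in (0, 1):
        for y in (0, 1):
            rest = n - y
            if rest >= 0 and rest % 2 == 0 and z <= n <= z * n_max:
                result.append((rest // 2, y, z))
    return result


no_buffers = feasible_assignments(0, 8)    # z is forced to 0
five_buffers = feasible_assignments(5, 8)  # z is forced to 1, x = 2, y = 1
too_many = feasible_assignments(9, 8)      # exceeds n_max: infeasible
```

For every admissible n there is exactly one assignment, so z acts as an indicator of whether the channel carries any slack matching buffer at all.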
VII. PRIOR WORK
Chapter 2 of Lines [4] studies asynchronous pipelines and
presents some necessary conditions for a system composed of
half-buffers to be slack matched. Pénzes [9] presents an execution model for production rule sets. Using this model, the
slack matching problem is formulated as an integer linear pro-
gramming problem for some examples. However, the integer
linear program needed to slack match any given system is not
speciﬁed. No comment is made on the size of the integer lin-
ear program generated relative to the size of the system being
slack matched. This method is similar to the technique for slack
matching presented in section VI. We explicitly state the in-
teger linear program that needs to be solved in order to slack
match a system. Chapter 5 of Wong [11] presents a method for
slack matching systems composed of processes with the PCHB
reshufﬂing. If the delays of all processes are identical, the slack
matching algorithm presented is similar to that in section IV.
However, for heterogeneous systems Wong [11] needs to study
all the paths in the FBI graph of a system in order to set up the
slack matching problem. FBI graphs are an execution model
for systems composed of processes with the PCHB reshufﬂing.
The FBI graph of a process is essentially a simplified collapsed constraint graph of the process. In contrast, we present a method to frame the slack matching problem as a mixed integer linear program based on the process graph even in the heterogeneous case. We show how to construct this program in polynomial time.
VIII. CONCLUSIONS AND FUTURE WORK
Two methods of expressing the slack matching problem as a
MILP have been presented.
The method in section IV provides a set of conditions for
slack matching systems composed of a speciﬁed class of half-
buffers. A polynomial time algorithm has been presented to
generate the MILP. This method of generating an MILP, and
then solving using general purpose MILP solvers has been ap-
plied to circuits from the Lutonium [7] and the MiniMIPS [6].
Similar conditions to those in section III can be derived for
slack matching circuits comprised of a restricted set of full-
buffers. The next step would be to derive similar constraints
for systems comprised of both full-buffers and half-buffers.
For the circuits studied so far, solving the MILP has not taken large amounts of time. However, if solving the MILP for slack matching larger systems does take excessive amounts of time, faster heuristics should be explored.
The method in section VI imposes fewer restrictions on the
processes that comprise the system being slack matched; how-
ever the MILP generated is larger than that generated using the
ﬁrst method. The next step is to extend this method to permit
processes with conditional communication.
ACKNOWLEDGMENTS
The research described in this paper was supported by the
Defense Advanced Research Projects Agency (DARPA) as part
of the program in Power-Aware Computing and Communica-
tions (PAC/C) and monitored by the Air Force Ofﬁce of Scien-
tiﬁc Research.
REFERENCES
[1] S.M. Burns. Performance Analysis and Optimization of Asynchronous Circuits. PhD thesis, California Institute of Technology, 1990.
[2] S. Kim and P. Beerel. Pipeline optimization for asynchronous circuits: complexity analysis and an efficient optimal algorithm. In Proc. International Conference on Computer-Aided Design, 2000.
[3] C. Leiserson, F. Rose, and J. Saxe. Optimizing synchronous circuitry by retiming. In Third Caltech Conference on VLSI, March 1983.
[4] A.M. Lines. Pipelined asynchronous circuits. Master’s thesis, California Institute of Technology, 1995.
[5] R. Manohar and A.J. Martin. Slack elasticity in concurrent computing. In J. Jeuring, editor, Proc. 4th International Conference on the Mathematics of Program Construction, Lecture Notes in Computer Science 1422, pages 272–285. Springer Verlag, 1998.
[6] A.J. Martin, A. Lines, R. Manohar, M. Nyström, P. Pénzes, R. Southworth, U. Cummings, and T.K. Lee. The design of an asynchronous MIPS R3000 microprocessor. In Proc. 17th Conference on Advanced Research in VLSI, 1997.
[7] A.J. Martin, M. Nyström, K. Papadantonakis, P.I. Pénzes, P. Prakash, C.G. Wong, J. Chang, K.S. Ko, B. Lee, E. Ou, J. Pugh, E. Talvala, J.T. Tong, and A. Tura. The Lutonium: A sub-nanojoule asynchronous 8051 microcontroller. In Proc. 9th IEEE Intl Symposium on Advanced Research in Asynchronous Circuits and Systems, May 2003.
[8] G. Nelson. Systems programming with Modula-3. Prentice Hall, 1991.
[9] P. Pénzes. Pipeline composition for asynchronous circuits. Unpublished, September 1999.
[10] T. Williams. Latency and throughput tradeoffs in self-timed asynchronous pipelines and rings. Technical report, Stanford, 1990.
[11] C.G. Wong. High-Level Synthesis and Rapid Prototyping of Asynchronous VLSI Systems. PhD thesis, California Institute of Technology, 2004.
APPENDIX
All variables in a collection of HSE are boolean variables. The statements in a HSE are assignments to these variables. a↑ and a↓ respectively denote the assignment of the values true and false to variable a. Sequential composition is denoted by ; and parallel composition by a comma. The deterministic selection statement [G1 → S1 [] G2 → S2] waits for one of the guards Gi to be true and then executes Si. The looping statement *[G → S] executes S while the guard G holds. The statement [G → skip] is abbreviated [G]. Similarly, the loop *[true → S] is abbreviated *[S].
