Design, implementation, and validation of a new class of interface circuits for latency-insensitive design by Cheng-hong Li et al.
Design, Implementation, and Validation of a New Class of
Interface Circuits for Latency-Insensitive Design
Cheng-Hong Li, Rebecca Collins, Sampada Sonalkar, and Luca P. Carloni
Department of Computer Science - Columbia University in the City of New York
Abstract—With the arrival of nanometer technologies wire
delays are no longer negligible with respect to gate delays, and
timing-closure becomes a major challenge to System-on-Chip
designers. Latency-insensitive design (LID) has been proposed
as a “correct-by-construction” design methodology to cope with
this problem. In this paper we present the design and imple-
mentation of a new class of interface circuits to support LID
that offers substantial performance improvements with limited
area overhead with respect to previous designs proposed in the
literature. This claim is supported by the experimental results
that we obtained completing semi-custom implementations of the
three designs with a 90nm industrial standard-cell library. We
also report on the formal veriﬁcation of our design: using the
NuSMV model checker we veriﬁed that the RTL synthesizable
implementations of our LID interface circuits (relay stations
and shells) are correct reﬁnements of the corresponding abstract
speciﬁcations according to the theory of LID.
I. INTRODUCTION
One of the most critical issues in designing Systems-
on-Chip (SOC) with nanometer technology processes is the
increasing impact of global wire delays: as more and smaller
processing cores are accommodated on a chip, global (inter-
core) wires do not scale in delay as local (intra-core) wires do
because they need to span physical distances that represent
signiﬁcant proportions of the die [1], [2]. As the delays
of global wires are no longer negligible compared to gate
delays, the chip becomes a distributed system, thereby posing
a serious challenge to the traditional CAD ﬂows that are based
on the synchronous design paradigm [3]. Furthermore, since
wire delays are hard to predict at early stages of the design
process, an increasing number of design exceptions in terms of
post-layout timing violations forces costly design re-iterations
(timing-closure problem).
Latency-insensitive design (LID) [4], [5], has been proposed
as a “correct-by-construction” design methodology to handle
the increasing impact of global communication latency in
nanometer integrated circuit design without forcing major
departures from traditional and well-established design ﬂows.
Given a synchronous system speciﬁcation, e.g. a register-
transfer level (RTL) netlist of logic blocks speciﬁed and val-
idated using a hardware-description language, a functionally-
equivalent latency-insensitive system can be automatically
derived by encapsulating each sequential logic block (referred
to as a pearl or core) within an automatically generated interface
process (a shell). The advantage of this transformation is
that any communication channel connecting two core/shell
pairs can now present a varying latency in terms of number
of clock cycles without affecting the functional correctness
of the original design. In practice the latency of a channel
is changed through the insertion of relay stations, which are
clocked buffers with twofold storage capacity and simple
flow-control logic.
Fig. 1. Shell encapsulation, relay station insertion, and channel back-pressure.
Hence, LID provides a sound way to
address the problem of interconnect delay in nanometer design
by simplifying the application of wire pipelining for global
communication channels at any stage of the design process
and without requiring any re-design of the cores. Furthermore,
it simpliﬁes the assembly and reuse of pre-designed cores
for building complex SOCs because these can be arbitrarily
complex sequential logic blocks as long as they are stallable:
this is the only prerequisite for LID and it can be easily
implemented with clock gating mechanisms [4], [5].
In practice, the LID methodology calls for three steps: (1)
a strictly synchronous (or strict) system is originally designed
and validated as a netlist of stallable cores; (2) a patient system
is automatically derived from the strict system by encapsulat-
ing each core within a shell; (3) any number of relay stations
can be inserted on any channel between any pair of shells.
Fig. 1 shows a latency-insensitive system with five shell-core
pairs connected by point-to-point, unidirectional channels. The
shell logic and relay stations implement a latency-insensitive
protocol, which is designed to accommodate any variation
of channels’ latency while guaranteeing that the functional
behavior of the original strict system is preserved (semantics
preservation).
A formal deﬁnition of the properties of relay stations and
shells is given in a denotational framework as part of the
theory of LID [5]. At the core of LID lies the notion of
latency-equivalence: two signals are latency equivalent if they
present the same ordered streams of data items but possibly
with different timing. In a synchronous model of computation
the existence of a clock guarantees a common time reference
among signals and, therefore, a signal must present an event
at each clock cycle [6], [7]. LID distinguishes between the
occurrence of an informative event (a valid data item or valid
token) and a stalling event (void token). Any class of latency-
equivalent signals contains a single reference signal that does
not present stalling events (a strict signal) while all the other
members of the equivalence class (stalling signals) contain the
same sequence of informative events interleaved by one or
more stalling events. Following the tagged-signal model [7],
the notions of latency-equivalent signals, strict signals, and
stalling signals are extended to sets of signals (behaviors) and
sets of behaviors (processes) [5].
1-4244-1050-9/07/$25.00 ©2007 IEEE
         cycle             1    2    3    4    5
LID-2ss  data              A    B    C    C    ...
         void              0    0    0    0    ...
         stop              0    1    1    1    ...
         receiver stalled  0    1    1    1    ...
         sender stalled    0    0    0    1    ...
LID-1ss  data              A    B    B    ...  ...
         void              0    0    0    ...  ...
         stop              0    1    0    ...  ...
         receiver stalled  0    1    1    ...  ...
         sender stalled    0    0    1    ...  ...
Fig. 2. Simulations of the two latency-insensitive protocols with different
back-pressure mechanisms.
In a nutshell, LID allows one to derive from the original
reference strict system speciﬁcation, which contains only strict
processes, any possible latency-equivalent implementation,
which contains only patient processes. Each strict process
abstracts the core in the original speciﬁcation while the
corresponding latency-equivalent patient process is obtained
by composing the core with a shell. While the original cores
are not designed to process void tokens, a shell-core pair is a
patient process, i.e. it can tolerate the arrival of a void token
at any of its I/O channel ports at any given clock cycle and
be able to eventually continue with its correct operations.
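The notion of latency equivalence used above can be illustrated on finite traces. The following is only a minimal sketch: the theory of LID is defined over signals in a denotational framework, whereas this example compares finite prefixes, and the names `VOID`, `strict_stream`, and `latency_equivalent` are hypothetical helpers introduced here for illustration.

```python
VOID = None  # hypothetical marker for a stalling event (void token)


def strict_stream(signal):
    """Drop the stalling events and keep the ordered informative events."""
    return [event for event in signal if event is not VOID]


def latency_equivalent(signal_a, signal_b):
    """Two finite traces are treated as latency equivalent when they carry
    the same ordered stream of valid tokens, regardless of timing."""
    return strict_stream(signal_a) == strict_stream(signal_b)


strict = ["A", "B", "C"]                       # strict signal: no void tokens
stalling = ["A", VOID, "B", VOID, VOID, "C"]   # same tokens, different timing
reordered = ["B", "A", "C"]                    # same tokens, different order

print(latency_equivalent(strict, stalling))
print(latency_equivalent(strict, reordered))
```

In this sketch the strict trace plays the role of the reference signal of its equivalence class, and the stalling trace is one of the other members of that class.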
In a practical implementation, void tokens are used to
capture latency variations on communication channels and are
processed by the shells in a way that makes them transparent to
the cores. In particular, relay stations, which are not present in
the original strict design, are initialized with void tokens when
introduced in the patient design to pipeline a given channel.
Void tokens are then processed by the shells while remaining
transparent to the cores. Informally, any shell acts according
to an AND-firing policy: it stalls its core whenever a valid
token is missing on at least one of its input channels. As a
shell stalls its core, potential valid tokens that may be present
on other input channels are stored locally in input queues
within the shell for future processing by the core. In this way
each shell dynamically absorbs the latency variations across
the channels by realigning the valid tokens before presenting
them to the core. Whenever it is not stalled, the core processes
valid tokens on its inputs as it does in the original strict system.
Since in practice a queue can only have a ﬁnite size, a
downlink shell must be able to inform an uplink shell that it is
necessary to postpone the production of valid tokens for some
cycles (backpressure). In the denotational framework of the theory
of LID, a backpressure event at a given clock cycle is also
abstracted as the occurrence of a void token on the channel
between the two shells [5]. While the theory of LID deﬁnes the
general properties that any latency-insensitive protocol must
obey, many possible protocol speciﬁcations and supporting
interface-circuit implementations are conceivable in practice.
A protocol that relies on just two control bits, a void bit to
identify invalid data and a stop bit to implement backpressure,
was ﬁrst presented in [4] and discussed in more detail together
with the supporting interface circuits in [3], [8].
Contribution. The latency-insensitive protocol that is dis-
cussed in [3], [4], [8] stipulates that a shell or relay station is
stalled whenever the stop bit is kept high for two consecutive
clock cycles. In this paper we refer to this protocol as LID-2ss,
which stands for two-stop-to-stall. The top of Fig. 2 reports
a simulation trace of a channel according to LID-2ss where
the receiver is being stalled at cycle 2. Because the receiver
is stalled, valid token A is not processed and thus is buffered
by the receiver’s shell. To avoid buffer overﬂow and possible
loss of the data, the receiver stalls the sender by asserting the
stop bit both at cycle 2 and 3. Notice that the sender only
stalls at cycle 4 holding the valid token C on its output port
after receiving two stop signals. This means token B needs
to be buffered by a queue in the receiving shell together with
token A. In fact, both the shell queues and the relay stations
have storage capacity equal to two according to the library of
interface circuits that were proposed to support LID-2ss.
In this paper we describe a simpler latency-insensitive
protocol labeled as LID-1ss, which stands for one-stop-to-
stall, that is based on a different back-pressure convention.
In the new protocol, a shell or a relay station stalls whenever
it receives a single stop signal, as reported by the simulation
trace in the bottom part of Fig. 2: here, the receiver asserts
the stop bit only at cycle 2, and the sender begins to stall
immediately at cycle 3. In our design a queue of capacity
equal to one in the receiver’s shell is sufﬁcient since only data
token A must be buffered there during stalling while B is
preserved uplink in the channel for future processing. Notice
that our new protocol LID-1ss does not allow us to reduce the
storage capacity of a relay station to one because this would
reduce the performance of a latency-insensitive system by half
as explained in the theory of LID [5]. However, it does allow
us to reduce the storage capacity of a shell input queue to one
with respect to the original protocol LID-2ss because we can
take advantage of the storage capacity within the core 1.
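The difference between the two back-pressure conventions can be sketched in a few lines. This is an illustrative reading of the traces in Fig. 2, not the circuit itself: it assumes that a sender stalls in a given cycle exactly when the stop bit was high in each of the preceding n cycles, with n = 2 for LID-2ss and n = 1 for LID-1ss. The function name `sender_stalled` and its parameters are hypothetical.

```python
def sender_stalled(stop, stops_to_stall):
    """Return the 'sender stalled' row for a given 'stop' row.

    stop[k] is the stop bit observed in cycle k + 1. Assumption: the sender
    stalls in a cycle when stop was high in each of the previous
    `stops_to_stall` cycles.
    """
    stalled = []
    for t in range(len(stop)):
        window = stop[max(0, t - stops_to_stall):t]
        stalled.append(int(len(window) == stops_to_stall and all(window)))
    return stalled


lid_2ss = sender_stalled([0, 1, 1, 1], stops_to_stall=2)  # stop row, top of Fig. 2
lid_1ss = sender_stalled([0, 1, 0], stops_to_stall=1)     # stop row, bottom of Fig. 2
print(lid_2ss, lid_1ss)
```

Under this assumption the sender in the LID-2ss trace keeps emitting tokens for one cycle longer after the first stop than the sender in the LID-1ss trace, which is why the receiving shell must buffer token B in addition to token A.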
We contribute a new set of interface circuits (i.e. shells and
relay stations) that support the LID-1ss protocol and offer
substantial improvements with respect to previous works in
the literature. In particular,
• they offer shorter logic delay and have smaller area
overhead than the circuits supporting the original latency-
insensitive protocol LID-2ss discussed in [3], [4], [8];
• they offer shorter logic delay and, for many systems,
enable higher processing throughput than the interface
circuits for synchronous elastic architectures that were
recently proposed in [10].
We also report on our work to validate both our design
and the original design: using the NuSMV model checker we
formally veriﬁed that the RTL synthesizable implementations
of the key LID building blocks (relay stations and shells) are
correct refinements of the corresponding abstract specifications
according to the theory of LID [5].
The paper is organized as follows. In Sec. II we briefly
overview the related work on latency-insensitive design. The
RTL logic of the interface circuits supporting our LID-1ss
protocol is described in detail in Sec. III. We then discuss
the formal verification of these circuits in Sec. IV. Finally, in
Sec. V we present a comprehensive set of experimental results
that provide a comparative analysis of LID-1ss, LID-2ss, and
SEA in terms of logic delay, effect on system's processing
throughput, and area overhead.
1 To discuss how the performance of a latency-insensitive system can be
optimized through relay-station insertion and the sizing of shell input queues
goes beyond the scope of this paper and we refer to [9].
II. RELATED WORK
The LID methodology has recently attracted some interest, and
several extensions and related approaches have been proposed
[10]–[15]. Indeed, while it speciﬁes the fundamental properties
of any latency-insensitive protocol, the denotational framework
used to develop the theory of LID [5] leaves open the
possibility of developing various protocol speciﬁcations that
in turn may lead to practical implementations with different
characteristics.
The simpler protocol that we discuss in this paper was
already assumed in [13], [14]. Chelcea and Nowick presented
a mixed-timing relay station that stalls for one clock cycle if
a stop signal is received [13]. As they focus on describing
a complete class of low-latency FIFO interfaces for mixed-
timing systems, they do not discuss the design of shell blocks
to support LID. Lu and Koh use max-plus algebra to analyze
the performance of a latency-insensitive system with back-
pressure [14]. The model of the protocol that they adopt
assumes that a sender is stalled when one or more of its
receivers asserts the stop bit. However, neither the design
of the shell nor the design of a relay station is provided.
Conversely, in this paper we contribute the complete interface
logic for a single-clock synchronous system at the RTL level.
Cortadella et al. recently proposed synchronous elastic
architectures (SEAs) [10] that are based on the synchronous
elastic ﬂow (SELF) protocol; SELF is a new approach to
LID that “combines the modularity of asynchronous design
with the efﬁciency of synchronous implementations” [10].
Like the LID-2ss protocol that was originally proposed in [4]
and the LID-1ss one that we discuss in the present paper,
SELF also relies on valid and stop bits. Further, SEAs rely
on sequential buffers, called elastic buffers (EB), to pipeline
long channel wires, as LID relies on relay stations. On the
other hand, SEAs do not use the idea of shell interfaces with
input queues that store valid tokens during stalling. Instead,
in a SEA it is possible to have elastic buffers with multiple
input/output channels thanks to special elastic fork and join
control structures [10]: when stalling occurs, each valid but
unused token is held by its immediate sender. Robustness with
respect to latency variations is achieved in SEAs by combining
elastic buffers, fork and join structures while performing an
elasticization transformation on the original circuit. This step
consists essentially of replacing each ﬂip-ﬂop in the core
with two transparent latches of different polarity, similar to
a master-slave structure, but with independent enable signals
for the two latches so that “a mechanism for double-pumping
in one cycle” [10] can be realized. By properly setting the
enable signals the elasticized core can either operate as usual,
or be stalled, or store two output data in the two back-
to-back latches. However, using enable signals to control
clocked latches may incur significant area overhead because
additional steering logic is needed [16].
Fig. 3. Elasticizing a core with SEA interface circuits and clock gating.
[Block diagram: a processing core (flip-flops FF, combinational logic, inputs In1 and In2, outputs out1 and out2, clock) before elasticization, and the same core after elasticization with latches (LH) enabled by En0 and En1 and with control, join, and fork blocks that handle the valid and stop signals of each input and output channel.]
In this paper, a slight
modiﬁcation is made in the SEA interface circuits: the latches
are driven by gated clock signals to avoid extra steering logic
for stalling the core and storing two unconsumed data tokens.
This technique was ﬁrst proposed by Jacobson et al. for their
synchronous interlocked pipelines [16]. The elasticization of a
processing core is illustrated in Fig. 3, where the shaded boxes
represent the logic implementing the SEA interface circuits
and stalling mechanism. In particular the join control structures
differ subtly from LID-1ss interface circuits with respect to
the timing of sending a stop bit to a sender. In a LID-1ss
interface this is sent whenever a queue is full. Instead the
join control structure of a processing core with multiple input
channels requests all valid tokens to be resent (by asserting
the corresponding stop bits) whenever at least one invalid
token arrives at the same clock cycle. This may have a negative
impact on the performance of a SEA because: (a) it degrades
the overall system throughput and (b) it limits the maximum
clock frequency at which the ﬁnal circuit can run due to long
combinational paths spanning two interconnect channels. In
Section V we present a detailed discussion of these issues in
the context of a comparative analysis of the interface circuits
for the two approaches.
Suhaib et al. [17] propose a framework for validating
families of latency-insensitive protocols by taking a system,
transforming it into a latency-insensitive system and then com-
paring the output behavior of the original system with the one
of the transformed system on a subset of possible inputs. This
technique is good for the development and debugging phase of
new latency-insensitive protocols because it can uncover many
bugs quickly without requiring an exhaustive veriﬁcation. As
described in Sec. IV, our approach is more applicable to a later
phase in the design of the circuit implementation of a latency-
insensitive protocol. In particular, we formally verify the RTL
implementation of relay station and shell in a modular fashion
so that a previously veriﬁed synchronous system does not need
to be re-veriﬁed after it has been transformed into a latency-
insensitive system. This approach has several advantages. New
systems can be veriﬁed independently of the architecture they
will operate on. In addition, formally verifying the shell is
quite demanding in terms of computational memory: to verify
an entire system implementation with numerous cores, each
encapsulated in its own shell would be prohibitively expensive
at the same level of rigor.
III. A SIMPLIFIED LATENCY-INSENSITIVE PROTOCOL AND
ITS IMPLEMENTATIONS
In this section we discuss in detail the implementation of
the simpliﬁed latency-insensitive protocol LID-1ss that we
introduced in Section I. Brieﬂy, the new protocol differs from
the original LID-2ss protocol discussed in [4] in the back-
pressure mechanism: the LID-1ss protocol uses a single stop
bit to stall a sender. For both the shell and the relay station,
we ﬁrst present sample simulations of their I/O behaviors and
then explain the details of the RTL designs.
Shell. Fig. 4 shows a sample simulation trace of a two-input-
two-output shell and its core with the assumption that both in-
put queues have a capacity of two. A block diagram of the shell
and its stallable core module is illustrated in Fig. 5(a). The core
implements a function $f$: $(C_{t+1}, D_{t+1}) = f(A_t, B_t)$, where
$A_t$ and $B_t$ are the data tokens arriving on input channels In1 and
In2, while $C_t$ and $D_t$ are the tokens produced by the core on
output channels Out1 and Out2 at time $t$, respectively.
Several scenarios are illustrated in this trace. In cycle 1 both
channels In1 and In2 present valid data tokens, and, therefore,
the core can be ﬁred to produce valid output tokens (C2 and
D2) at cycle 2. At cycle 2 the void input token of channel
In1 (void bit is high) causes the shell to stall the core at cycle
3. Therefore, both the output tokens at cycle 3 are marked as
void with their voidOut bits being asserted by the shell.
The scenario in which the shell receives back-pressure
happens at cycle 5, when the downlink receiver of channel
Out2 asserts the stopIn2 bit. Thus the output token D4 is
regarded as void at cycle 5. The core is stalled at cycle 6, and
both C4 and D4 are repeated at cycle 6. However, since the
downlink receiver of channel Out1 has already sampled C4,
the void bit is set for the repeated C4 so the same token will
not be sampled twice on channel Out1. The accompanying
void bit of D4, on the other hand, is not set because token
D4 on channel Out2 has not been sampled yet. In this case
D4 is sampled at the end of cycle 6 (when the clock edge
arrives to start cycle 7).
What follows from cycle 6 shows the case when an input
queue is full. The stop request from the downlink of channel
Out2 causes the input queue of channel In2 to be ﬁlled up at
cycle 6 (two valid tokens are stored in channel In2’s queue
at the end of cycle 5, due to the stalls at cycle 3 and 6), thus
a stop request is raised to the uplink sender of channel In2.
Note that at cycle 6 the shell is not able to store token B6.
The same token is thus resent on channel In2 and is sampled
by the shell at cycle 7.
Next we present the details of the shell RTL logic design.
Fig. 5(a) reports a block diagram of a two-input-two-output
shell, and the logic functions of the controller are listed in
Fig. 5(b). The control logic is general and can be easily scaled
to handle an arbitrary number of input and output channels. All
the logic functions are quite simple and can be implemented
with few logic gates. The clock gating signal ﬁre decides
whether the core module is ﬁred or stalled. It is asserted when
each channel presents a valid token either directly from the
channel input or from its input queue, and no stop request has
arrived on any output channel.
      cycle      1   2   3   4   5   6   7   8   9   10  11
In1   dataIn1    A1  A1  A2  A3  A4  A5  A6  A6  A6  A8  A9
      voidIn1    0   1   0   0   0   0   0   1   1   0   0
      stopOut1   0   0   0   0   0   0   0   0   0   0   0
In2   dataIn2    B1  B2  B3  B4  B5  B6  B6  B6  B6  B8  B9
      voidIn2    0   0   0   0   0   0   0   1   1   0   0
      stopOut2   0   0   0   0   0   1   0   0   0   0   0
Out1  dataOut1   C1  C2  C2  C3  C4  C4  C5  C6  C7  C7  C8
      voidOut1   0   0   1   0   0   1   0   0   0   1   0
      stopIn1    0   0   0   0   0   0   0   0   0   1   0
Out2  dataOut2   D1  D2  D2  D3  D4  D4  D5  D6  D7  D7  D8
      voidOut2   0   0   1   0   0   0   0   0   0   1   0
      stopIn2    0   0   0   0   1   0   0   0   0   0   0
Fig. 4. Sample I/O behavior of the new shell. Shaded data tokens are bubbles.
[Block diagram (a): each input channel (dataIn_i, voidIn_i, stopOut_i) feeds a bypassable queue, i.e. a FIFO with signals enq_i, deq_i, full_i, and empty_i whose output is multiplexed with the incoming data under control of bypass_i; the queues feed the stallable core module, which produces dataOut1 and dataOut2; the control block reads voidIn{1,2} and stopIn{1,2}, drives stopOut{1,2} and voidOut{1,2}, and produces the signal fire that gates clk.]
(b)
$$\mathit{fire} = \prod_{i \in I} \left( \overline{\mathit{voidIn}_i} + \overline{\mathit{empty}_i} \right) \cdot \prod_{j \in O} \overline{\left( \mathit{stopIn}_j \cdot \overline{\mathit{voidOut}_j} \right)}$$
$$\forall j \in O: \quad \mathit{voidOut}_j^{+} = \begin{cases} 0 & \text{if } \mathit{stopIn}_j \cdot \overline{\mathit{voidOut}_j} \text{ is true} \\ \overline{\mathit{fire}} & \text{otherwise} \end{cases}$$
$$\forall i \in I: \quad \mathit{stopOut}_i = \mathit{full}_i$$
$$\forall i \in I: \quad \mathit{enq}_i = \overline{\mathit{voidIn}_i} \cdot \left( \overline{\mathit{fire}} + \overline{\mathit{empty}_i} \right) \cdot \overline{\mathit{full}_i}$$
$$\forall i \in I: \quad \mathit{deq}_i = \overline{\mathit{empty}_i} \cdot \mathit{fire}$$
$$\forall i \in I: \quad \mathit{bypass}_i = \mathit{empty}_i$$
Fig. 5. (a) A block diagram of a two-input-two-output shell and a stallable
core module. (b) Logic functions of the shell controller.
The second condition can be
detected by checking the current stopIn and voidOut bits for
each output channel. If the voidOutj bit is high for some
output channel j, the downlink receiver of channel j has
received the latest valid token. In this case the core module
can proceed even if the receiver requests to stop.
The voidOutj bit tells the downlink module on output
channel j whether the current token is valid. It is
a sequential signal buffered by an edge-triggered flip-flop. The
condition $\mathit{stopIn}_j \cdot \overline{\mathit{voidOut}_j} = \text{true}$ means that the downlink
module on channel j is not able to process the current (also
the latest) valid data token. In this case the core module will
be stalled, the current token will be repeated, and voidOutj
will be set low. In all other cases the value of the voidOutj
bit depends on whether the core module will be ﬁred.
The major data-path components in a shell are the bypassable
queues that store unused valid tokens from the input
channels. Their minimum forward latency is zero. Each bypassable
queue is implemented as a standard FIFO whose
output is multiplexed with the incoming data of the channel.
If the queue is empty, the controller selects the data token
from the input channel and passes it to the core module.
cycle     1   2   3   4   5   6   7   8   9   10  11
dataIn    A1  A1  A2  A2  A3  A4  A4  A5  A6  A7  A7
voidIn    0   1   0   1   0   0   1   0   0   0   0
stopOut   0   0   0   0   0   0   0   0   0   1   0
dataOut   *   A1  A1  A2  A2  A3  A4  A4  A5  A5  A6
voidOut   0   0   1   0   1   0   0   0   0   0   0
stopIn    0   0   0   0   1   0   1   0   1   0   0
Fig. 6. Sample I/O behavior of the new relay station.
[Block diagram (a): dataIn feeds the auxiliary flip-flop (enable auxEn) and, through a multiplexer controlled by sel, the main flip-flop (enable mainEn), whose output is dataOut; a second multiplexer controlled by sel selects between voidIn and the constant 0 for the void flip-flop, which drives voidOut; the control block reads voidIn, voidOut, and stopIn and drives sel, mainEn, auxEn, and stopOut.]
(b)
state       input condition                 next state  sel  mainEn  auxEn  stopOut
processing  !stopIn + (!voidIn & voidOut)   processing  0    1       0      0
processing  stopIn & voidIn                 processing  0    0       0      0
processing  stopIn & !voidIn & !voidOut     stalling    0    0       1      0
stalling    stopIn                          stalling    0    0       0      1
stalling    !stopIn                         processing  1    1       0      1
Fig. 7. (a) Block diagram of the new relay station; (b) The state transition
diagram of its controller.
The internal queue is a sequential element: all of the operations
(i.e. enqueue and dequeue) and the update of its status (i.e.
full or empty) take place at each clock edge. Hence all of the
stopOut signals, which are the full signals from the queue,
are sequential signals.
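The controller functions of Fig. 5(b) can be exercised against the control bits of the trace in Fig. 4. The following is a behavioral sketch only, not the RTL: it tracks queue occupancy instead of data, it assumes that the voidOut flip-flops reset to 0 (matching the first column of Fig. 4), and it reads the controller functions with the complemented terms described in the text (a valid token is one whose void bit is low). The name `simulate_shell` and the `FIG4_*` constants are hypothetical.

```python
def simulate_shell(void_in, stop_in, capacity=2):
    """Cycle-by-cycle model of the shell controller (control bits only).

    void_in[i][t] and stop_in[j][t] are the voidIn/stopIn bits of input
    channel i and output channel j in cycle t + 1.
    Returns (voidOut traces, stopOut traces).
    """
    n_in, n_out = len(void_in), len(stop_in)
    occupancy = [0] * n_in   # number of tokens held in each input queue
    void_out = [0] * n_out   # assumed reset value of the voidOut flip-flops
    void_out_trace = [[] for _ in range(n_out)]
    stop_out_trace = [[] for _ in range(n_in)]
    for t in range(len(void_in[0])):
        empty = [n == 0 for n in occupancy]
        full = [n == capacity for n in occupancy]
        for j in range(n_out):
            void_out_trace[j].append(void_out[j])
        for i in range(n_in):
            stop_out_trace[i].append(int(full[i]))  # stopOut_i = full_i
        # A stop request matters only if the current output token is valid.
        blocked = [bool(stop_in[j][t]) and not void_out[j] for j in range(n_out)]
        # fire: every input has a valid token (channel or queue), no output blocked.
        fire = (all(not void_in[i][t] or not empty[i] for i in range(n_in))
                and not any(blocked))
        for i in range(n_in):
            enq = not void_in[i][t] and (not fire or not empty[i]) and not full[i]
            deq = not empty[i] and fire
            occupancy[i] += int(enq) - int(deq)
        # Next voidOut: keep the blocked token valid, otherwise void iff stalled.
        void_out = [0 if blocked[j] else int(not fire) for j in range(n_out)]
    return void_out_trace, stop_out_trace


# Input rows of Fig. 4 (cycles 1 to 11).
FIG4_VOID_IN = [[0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0],
                [0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0]]
FIG4_STOP_IN = [[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
                [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]]

void_out_rows, stop_out_rows = simulate_shell(FIG4_VOID_IN, FIG4_STOP_IN)
print(void_out_rows)
print(stop_out_rows)
```

With queues of capacity two, as assumed for Fig. 4, the sketch reproduces the voidOut and stopOut rows of the figure, including the single stop request raised on channel In2 at cycle 6.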
Relay station. Fig. 6 reports sample I/O behaviors of a relay
station. From cycle 1 to 4, the relay station simply relays the
received data, void or not, from its input channel to its output
channel. At cycle 9, the relay station receives a stop request
from its downlink receiver. It then stalls (and repeats its output
token) for one cycle to avoid overflowing its downlink receiver.
Meanwhile, the incoming data token at cycle 9 is buffered in
the relay station's internal storage, and the stop request is sent
to its uplink sender at the next clock cycle.
Sometimes, an optimization can be applied to avoid stalling
the relay station when the downlink receiver asserts the stopIn
bit. This is shown at cycle 5 to 6. At cycle 5 the relay
station receives the stop request and emits a void token at
the same time. Because the void token will not be sampled by
its downlink receiver, the relay station can safely continue to
relay data tokens at cycle 6 without being stalled.
Another optimization occurs when the relay station absorbs
a stop request instead of relaying it to its uplink sender. For
instance, at cycle 7 the relay station receives a void token from
its uplink and a stop request from its downlink. It can actually
discard the void token received at cycle 7, instead of buffering
it, and simply repeat its current output at cycle 8. In this way,
it avoids propagating the stop request.
Fig. 7(a) shows an implementation of the relay station for
the proposed latency-insensitive protocol; Fig. 7(b) reports
the state transition diagram of its controller. The new relay
station uses two edge-triggered ﬂip-ﬂops to store incoming
data tokens, and one ﬂip-ﬂop to buffer the voidOut bit.
The two ﬂip-ﬂops storing data tokens provide the necessary
twofold storage capacity. The output of the main ﬂip-ﬂop is the
data output of the relay station. The controller decides when
to update the three ﬂip-ﬂops and sets stopOut and voidOut
bits according to the protocol. The control logic is discussed
next.
The controller is a two-state Mealy ﬁnite state machine with
three input and four output signals. The initial state is the
processing state, which enables the main ﬂip-ﬂop and sets
the stopOut bit low. In the stalling state, instead, the relay
station uses both the main and the auxiliary ﬂip-ﬂops to store
data tokens, and requests the uplink sender to stop sending
more data tokens by asserting its stopOut bit. Note that the
value of the stopOut bit depends only on the current state of
the controller, and thus no combinational path exists between
stopIn and stopOut.
The switching from the processing state to the stalling state
is triggered by the condition that the stopIn bit is high, and
both the voidIn and voidOut bits are low. The asserted stopIn
bit indicates that the receiver is not able to process the output
data token of the relay station. Hence the relay station has
to maintain its output token by keeping the same data in the
main ﬂip-ﬂop. On the other hand, the relay station must save
the incoming valid token (indicated by low values of voidIn
and stopOut) in the auxiliary ﬂip-ﬂop, and enter the stalling
state. Note that the incoming voidIn bit is not saved in the
void ﬂip-ﬂop, because in this case it is always low (this is part
of the condition to switch from the processing to the stalling
state) and thus can be easily recovered.
The relay station goes back from the stalling to the process-
ing state when its downlink receiver deasserts the stopIn bit,
indicating that it is ready to receive more valid data tokens.
Then, the relay station moves the token saved in the auxiliary
ﬂip-ﬂop to the main ﬂip-ﬂop. It also updates the void ﬂip-ﬂop
with a constant low value because the accompanying void bit
of the data token in the auxiliary ﬂip-ﬂop must be deasserted.
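The two-state controller described above can be checked against the trace of Fig. 6 with a small behavioral model. This is a sketch under stated assumptions rather than the synthesizable design: the reset contents of the main and void flip-flops ("*" and 0) are chosen only to match the first column of Fig. 6, and the function name `simulate_relay_station` is hypothetical.

```python
def simulate_relay_station(data_in, void_in, stop_in,
                           reset_data="*", reset_void=False):
    """Behavioral model of the relay station (main, aux, and void flip-flops).

    Returns one (dataOut, voidOut, stopOut) tuple per cycle.
    """
    state = "processing"
    main, aux, void_ff = reset_data, None, reset_void
    trace = []
    for data, v_in, s_in in zip(data_in, void_in, stop_in):
        stop_out = state == "stalling"   # depends on the current state only
        trace.append((main, int(void_ff), int(stop_out)))
        if state == "processing":
            if s_in and v_in:
                pass                     # absorb the stop: drop the void token
            elif s_in and not void_ff:
                aux = data               # valid input, valid output held: buffer it
                state = "stalling"
            else:
                main, void_ff = data, bool(v_in)   # relay the incoming token
        elif not s_in:
            main, void_ff = aux, False   # move the buffered token to main
            state = "processing"
    return trace


# Input rows of Fig. 6 (cycles 1 to 11).
FIG6_DATA_IN = ["A1", "A1", "A2", "A2", "A3", "A4", "A4", "A5", "A6", "A7", "A7"]
FIG6_VOID_IN = [0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
FIG6_STOP_IN = [0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0]

fig6_trace = simulate_relay_station(FIG6_DATA_IN, FIG6_VOID_IN, FIG6_STOP_IN)
for cycle, outputs in enumerate(fig6_trace, start=1):
    print(cycle, outputs)
```

The three branches of the processing state correspond to the two optimizations and the stall discussed for cycles 5, 7, and 9 of Fig. 6: a stop that arrives while the output is void is ignored, a stop that coincides with a void input is absorbed, and a stop that coincides with a valid input and a valid output moves the controller to the stalling state.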
IV. FORMAL VERIFICATION OF THE LID PROTOCOL
IMPLEMENTATIONS
An important compositional result is proven as part of
the theory of latency-insensitive design [5]: if all modules
in a strict system are replaced by corresponding latency-
equivalent patient modules, then the resulting system is patient
and latency equivalent to the original one. Naturally, this
theoretical result is not enough to guarantee that a particular
implementation of a latency-insensitive system is correct. The
theory tells us that we can build a patient system out of
patient parts, but we must also verify that the parts (the actual
implementations of the shells and relay stations) are patient.
On the other hand, we can verify the implementations of
shells and relay stations in isolation because according to
the compositionality rule for latency equivalence of patient
processes, a system composed of shell-core pairs and relay
stations is also latency equivalent to the original strict system.
We ﬁrst translated by hand the synthesizable VERILOG code
implementing the logic of the shell and relay station described
in Section III into the NuSMV language [18]. Then we used
the NuSMV model checker to verify that they are correct
reﬁnements of the speciﬁcations given in the LID theory.
Fig. 8. Verification framework for a relay station.
[Block diagram: the environment drives dataIn, voidIn, and stopIn; the relay station produces dataOut, voidOut, and stopOut; a queue with its control logic (push, pop) and a monitor observe these signals.]
In particular we veriﬁed the design for properties related to
latency equivalence, liveness, and storage capacity. For a relay
station this is sufﬁcient to prove that it is a patient process. The
shell is a little trickier. For the shell, patience also depends on
the functionality of the core that the shell encapsulates and the
shell implementation varies slightly depending on the number
of input and output channels of its core.
Verification approach. Fig. 8 and Fig. 9 illustrate our
verification approach for the relay station and the shell, respectively.
The verification framework consists of the component-under-
verification (CUV) together with the environment, queue, and
monitor modules. The environment generates the data items, the
void bits, and the stop bits in an unconstrained manner: at each
clock cycle, the environment may non-deterministically choose
a value for dataIn, and non-deterministically set voidIn and
stopIn to either true or false. This enables verification
under all possible input sequences; if the property fails for any
input sequence, a counterexample is generated. The monitor
checks the property to be verified by
comparing the stream(s) of valid data produced by the CUV
with the stream(s) of data that passed through the queue.
The correct functioning of a latency-insensitive component is
checked under the assumption that its environment obeys the
latency-insensitive protocol, i.e., the environment holds a data
token until it is sampled by the component. We do not impose
this assumption on the environment and instead track the
sampling of data tokens according to the latency-insensitive
protocol.
The queue is a FIFO used to store the valid data tokens
sampled by the monitor until they are matched with the output
tokens. It has standard push and pop operations for adding new
valid tokens to the tail of the queue and popping valid tokens
off the head of the queue. A valid data token is pushed in
the queue whenever the CUV latches in the token. Similarly
a valid data token is popped off the queue whenever the CUV
outputs a data token. These decisions are made by the queue
control logic based on the values of the stop and void bits.
The queue’s pop signal is forwarded to the monitor, and when
a pop occurs the monitor compares the queue’s output to the
CUV’s output.
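The environment/queue/monitor arrangement described above can be mimicked in a few lines of Python. This is only a minimal sketch under stated assumptions: it uses random simulation rather than model checking, and `ToyRelayStation` is a hypothetical two-slot store-and-forward buffer invented for illustration, not the RTL of Section III. The names `run_scoreboard`, `latched`, and `emitted` are likewise assumptions.

```python
import random
from collections import deque


class ToyRelayStation:
    """Hypothetical two-slot store-and-forward buffer (not the paper's RTL)."""

    CAPACITY = 2

    def __init__(self):
        self.slots = deque()

    def step(self, data_in, void_in, stop_in):
        # Outputs depend only on the state at the start of the cycle.
        stop_out = len(self.slots) == self.CAPACITY
        void_out = len(self.slots) == 0
        data_out = None if void_out else self.slots[0]
        latched = (not void_in) and (not stop_out)   # token sampled by the CUV
        emitted = (not void_out) and (not stop_in)   # token accepted downstream
        if emitted:
            self.slots.popleft()
        if latched:
            self.slots.append(data_in)
        return data_out, latched, emitted


def run_scoreboard(cycles, seed=0):
    """Drive the CUV with unconstrained random inputs and check its output."""
    rng = random.Random(seed)
    cuv = ToyRelayStation()
    queue = deque()          # monitor queue of sampled, not yet matched tokens
    matched = 0
    max_occupancy = 0
    for _ in range(cycles):
        data_in = rng.randrange(256)
        void_in = rng.random() < 0.5
        stop_in = rng.random() < 0.5
        data_out, latched, emitted = cuv.step(data_in, void_in, stop_in)
        if emitted:
            # Pop the oldest sampled token and compare it with the CUV output.
            if not queue:
                raise AssertionError("CUV produced a token it never sampled")
            expected = queue.popleft()
            if expected != data_out:
                raise AssertionError("loss, duplication, or reordering")
            matched += 1
        if latched:
            queue.append(data_in)
        max_occupancy = max(max_occupancy, len(queue))
    return matched, max_occupancy
```

Calling `run_scoreboard(10000)` returns the number of matched tokens and the largest queue occupancy observed; for this toy model the occupancy stays within the assumed capacity of two. A model checker explores all input sequences exhaustively, which a random run of this kind does not.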
Fig. 9. Verification framework for a shell.
For the verification of the relay station a simple FIFO is
sufficient because the relay station itself has simple store-and-
forward behavior. For the verification of the shell, we also
need a core module to perform computation on the given
inputs and produce output data. We chose a 2-input, 2-output
core that computes in parallel the two-input NAND and NOR
logic operations and stores the results in two internal flip-flops.
Separate queues are maintained for each incoming channel,
and a second core module is instantiated outside the shell.
When both input queues have valid data tokens, these are
passed to the core and the results are stored in an output queue.
The monitor compares the output of the shell with the data in
the output queue.
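The reference path for the shell check (input queues feeding a second copy of the core) can be sketched as follows. This is an illustrative model only: tokens are assumed to be single bits, a void token is represented by `None`, and the function names `core` and `reference_outputs` are hypothetical.

```python
from collections import deque


def core(a, b):
    """Reference core: two-input NAND and NOR on single-bit tokens."""
    return (1 - (a & b), 1 - (a | b))


def reference_outputs(channel_a, channel_b):
    """Pair up valid tokens from both channels and apply the core to each pair.

    Void tokens (None) are skipped, so the result depends only on the
    order of the valid tokens, not on when they arrive.
    """
    queue_a = deque(t for t in channel_a if t is not None)
    queue_b = deque(t for t in channel_b if t is not None)
    output_queue = []
    while queue_a and queue_b:
        output_queue.append(core(queue_a.popleft(), queue_b.popleft()))
    return output_queue
```

For example, `reference_outputs([1, None, 0, 1], [None, 1, 1, None, 0])` pairs the valid tokens as (1, 1), (0, 1), (1, 0) and returns `[(0, 0), (1, 0), (1, 0)]`. A monitor would compare this list with the valid tokens emitted by the shell-core pair.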
Formal Properties. We checked the properties of latency
equivalence, liveness, and storage capacity. The latency equiv-
alence property expresses that there is no loss, duplication or
reordering of valid tokens in a data stream. To test latency
equivalence of the relay station, we checked that the relay
station’s outgoing data stream is latency equivalent to its in-
coming data stream. To verify latency equivalence of the two-
input two-output shell, we compared the data tokens produced
by the core alone and those produced by the core/shell pair.
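On finite traces, the latency equivalence check amounts to comparing two streams after discarding void tokens. The sketch below assumes void tokens are represented by `None` and that both traces contain the same number of valid tokens; the formal definition is over infinite streams, so this is only an approximation for illustration.

```python
VOID = None  # assumed encoding of a void token


def valid_tokens(stream):
    """Return the valid tokens of a stream in order, dropping void tokens."""
    return [token for token in stream if token is not VOID]


def latency_equivalent(stream_1, stream_2):
    """True if both streams carry the same valid tokens in the same order."""
    return valid_tokens(stream_1) == valid_tokens(stream_2)
```

For instance, `[1, VOID, 2, 3]` and `[VOID, 1, 2, VOID, 3]` are latency equivalent, whereas a stream that drops, repeats, or swaps a valid token is not.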
The liveness property expresses progress in the system. A
component is live if it produces meaningful data provided the
environment allows it. We imposed a fairness constraint on the
environment for the void and stop bits so that the environment
generates valid data items inﬁnitely often and enables the
downlink stream inﬁnitely often. The liveness property states
that the component generates valid data tokens inﬁnitely often
and enables the uplink stream inﬁnitely often.
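One possible way to write this fairness assumption and liveness guarantee for the relay station is as a single LTL formula. The formula below is an illustrative rendering using the signal names of Fig. 8 (with $\mathbf{G}$ for "always" and $\mathbf{F}$ for "eventually"); it is an assumption about the shape of the property, not the specification actually given to NuSMV.

```latex
\[
\bigl(\mathbf{G}\,\mathbf{F}\,\neg\mathit{voidIn} \;\wedge\; \mathbf{G}\,\mathbf{F}\,\neg\mathit{stopIn}\bigr)
\;\rightarrow\;
\bigl(\mathbf{G}\,\mathbf{F}\,\neg\mathit{voidOut} \;\wedge\; \mathbf{G}\,\mathbf{F}\,\neg\mathit{stopOut}\bigr)
\]
```

The left-hand side states that the environment supplies valid data and releases the downlink stop infinitely often; the right-hand side states that the component then emits valid data and releases the uplink stop infinitely often.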
The storage capacity property checks that the number of
data items in the monitor queue never exceeds the storage
capacity of the component. The relay station capacity is equal
to two. The storage capacity of the shell depends on the size
of its internal queue, which is at least equal to one.
The above properties were verified individually for the shell
and relay station Verilog implementations. All of the properties
passed verification. The latency equivalence property was
also tested on known erroneous implementations of both the
shell and relay station; as expected, the verification failed and
generated counterexamples. The verification was performed
with NuSMV version 2.4.1 on a machine with two AMD Opteron
processors and 3.5 GB of memory running Red Hat Fedora Core 6
Linux. Time and memory usage from the
verification experiments are summarized in Table I.
TABLE I
MEMORY AND TIME STATISTICS FOR THE VERIFICATION TASKS.
Property              Module name     Time        Memory
Latency equivalence   Relay station   0.2 sec     7.2 MB
Latency equivalence   Shell           15.5 min    2.4 GB
Liveness              Relay station   5.5 sec     14.3 MB
Liveness              Shell           1.4 hours   2.4 GB
Fig. 10. Marked graph models of (a) LID-1ss and (b) SEA interface circuits.
V. COMPARISONS OF LID INTERFACE CIRCUITS AND
SYNCHRONOUS ELASTIC ARCHITECTURES
In this section we present a comparative analysis of the new
class of interface circuits implementing the proposed latency-
insensitive protocol LID-1ss versus the interface circuit imple-
mentation of the original LID-2ss protocol and the interface
circuits for synchronous elastic architectures (SEAs) proposed
in [10]. We completed the semicustom design of the three
classes of circuits with a 90nm industrial standard-cell library
in order to compare them in terms of system throughput, logic
delay, as well as area overhead.
In Section II we provided a brief overview of SEAs and
clariﬁed that they do not use the concept of shell interfaces
but rely instead on elastic fork and join structures. In the
sequel, however, whenever it is convenient we will use the
term “shell” also to refer to the SEA interface logic for a
processing core and, in particular, to the composition of the
control logic of the substitute elastic buffer with the fork and
join control structures.
System Throughput. Making a system robust with respect
to communication latency through the application of either
LID or elasticization may have a negative impact on its
performance measured as processing throughput. This is defined
as the ratio of the number of valid tokens over the number
of valid tokens plus void tokens that the system processes
over time. Since both a relay station (RS) and an elastic
buffer (EB) are initialized with a void token and since void
tokens may create more void tokens whenever they stall a
computation, the placement of RSs or EBs on channels that
belong to feedback loops and/or re-convergent paths may
induce permanent degradation of the system throughput. The
system throughput can be computed exactly by using either
marked graph models [9], [10], [19], or equivalently max-plus
algebra [14]. Fig. 10 shows the marked graph models for the
interface circuits of LID-1ss and SEA [10]. Note that in the
shell model the sizes of the shell queues are represented by a
variable q whose value may be set statically (at design time)
to optimize performance [9]. These models are compositional
as they inherit their topological structure from the modeled
system. Fig. 11 reports the LID-1ss model and the SEA model
for the system shown in Fig. 12(a). Note that in the LID-1ss
model each transition takes a single time unit to ﬁre. Instead
in the SEA model a transition takes half a time unit to ﬁre
because it is a latch-based design.
The maximum sustainable processing throughput of a LID
or SEA system is equal to the reciprocal of the cycle time
of its corresponding marked graph model: the cycle time is
equal to the largest cycle metric across all its cycles; the
cycle metric is equal to the sum of the firing times of the
transitions along the cycle divided by the number of tokens
along the cycle (an invariant number in a marked graph) [20]
(see footnote 2). For both models in Fig. 11 we highlighted the
critical cycles, i.e., cycles having the highest cycle metric. The
LID-1ss-based implementation has a throughput of 3/4 = 0.75,
assuming all input queues in a shell have a capacity of one [9], [14].
The throughput of the SEA version, on the other hand, is lower:
2/3 ≈ 0.67. In this particular example, the ideal system throughput,
equal to 1, can still be achieved for both implementations. For the
LID-1ss version it is necessary either to insert an additional
relay station between cores B and C (or A and B) or to
raise to two the size of the input queue in the C shell for
the channel B → C. The second approach is called optimal
channel queue sizing [9], [14]. Since the SEA join structures
do not use queues, the only solution to improve the throughput
is to insert an additional elastic buffer between cores B and
C (or A and B).
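The throughput computation described above (reciprocal of the largest cycle metric) can be sketched for small marked graphs by enumerating simple cycles. The code below is a brute-force illustration, not one of the efficient maximum cycle mean algorithms of [21], [22], and the example graph in the usage note is hypothetical: it is not the model of Fig. 11. The edge format `(source, target, tokens)` and the function names are assumptions.

```python
from fractions import Fraction


def simple_cycles(edges):
    """Yield every simple cycle once, as a list of (source, target, tokens)."""
    adjacency = {}
    for edge in edges:
        adjacency.setdefault(edge[0], []).append(edge)
    nodes = sorted({n for u, v, _ in edges for n in (u, v)})
    for start in nodes:
        # Only walk through nodes larger than `start`, so each cycle is
        # reported exactly once, from its smallest node.
        stack = [(start, [], {start})]
        while stack:
            node, path, seen = stack.pop()
            for edge in adjacency.get(node, []):
                target = edge[1]
                if target == start:
                    yield path + [edge]
                elif target > start and target not in seen:
                    stack.append((target, path + [edge], seen | {target}))


def throughput(edges, firing_time):
    """Reciprocal of the largest cycle metric (firing time / tokens)."""
    worst = Fraction(0)
    for cycle in simple_cycles(edges):
        time = sum(Fraction(firing_time[u]) for u, _, _ in cycle)
        tokens = sum(t for _, _, t in cycle)
        if tokens == 0:
            raise ValueError("token-free cycle: the marked graph deadlocks")
        worst = max(worst, time / tokens)
    if worst == 0:
        raise ValueError("the graph has no cycles")
    return 1 / worst


# Hypothetical example: a ring of four unit-time transitions holding
# three tokens, plus a two-transition cycle holding two tokens.
EXAMPLE_EDGES = [
    ("t0", "t1", 1), ("t1", "t2", 1), ("t2", "t3", 1), ("t3", "t0", 0),
    ("t1", "t0", 1),
]
EXAMPLE_TIMES = {"t0": 1, "t1": 1, "t2": 1, "t3": 1}
```

In this example the four-transition ring has cycle metric 4/3 and the short cycle has cycle metric 1, so `throughput(EXAMPLE_EDGES, EXAMPLE_TIMES)` evaluates to 3/4. Adding a token to the ring (for instance by enlarging a queue) lowers its metric to 1 and restores the ideal throughput.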
For certain systems, however, an SEA-based implementation
cannot achieve the same system throughput as an
implementation based on either LID-1ss or LID-2ss. This
is due to the particular structure of these systems, which may
present particular combinations of reconvergent paths and/or
feedback loops. For example, for the system shown in Fig. 12(b)
an implementation based on LID-1ss or LID-2ss can achieve
higher system throughput than a SEA-based implementation.
Note that the system has a similar reconvergent path from A to
C as the example in Fig. 12(a), but it has two additional cycles:
(A, B, E, A) and (B, C, D, B). In a LID-1ss implementation,
to achieve the ideal throughput equal to 1 it is necessary to
increase the input queue size of channel B → C in C’s shell
to 2. In this case, however, it is impossible for a corresponding
SEA to achieve such an ideal throughput because, at best, one
can insert an additional elastic buffer between B and C (or
A and B), which brings the throughput up to 3/4 (the cycle
with the inserted EB becomes the new critical cycle).
The two examples in Fig. 12 show the impact on system
throughput that input queues at a join point have. Insufficient
queue size at a join point, as in the LID-1ss shell with queues
of size one or in the SEA join structures that lack queues,
degrades the system throughput (see footnote 3). The reason is the
following: whenever an input queue is full at a join point, the uplink
sender, informed by the stop signal (back-pressure), must
re-send the same data token until the queue has room to accept it.
The more such re-sending happens, as in a SEA join structure,
the more throughput degradation may occur.
Footnote 2: This can be computed by solving the maximum cycle mean problem for
which a number of efficient algorithms have been proposed [21], [22].
Footnote 3: It should be possible to derive an implementation of interface circuits for
the SELF protocol that instead of being based on SEA join structures uses
input queues like in a LID shell block.
Fig. 11. Marked graph models of the example in Fig. 12(a): (a) LID-1ss, (b) SEA.
Fig. 12. Examples of systems with unbalanced reconvergent paths.
Interface Logic Delay. The delay of LID and SEA interface
logic affects the overall system performance in two ways. First,
the longest combinational logic path within an interface or
across two communicating interfaces might become the new
critical path of the system, and thus determine the maximum
clock frequency at which the system can run. Second, when
pipelining a wire using repeaters, either relay stations (RS) or
elastic buffers (EB), the smaller the cross-interface logic delay
between two communicating interfaces is, the farther apart the two
interfaces can be placed without inserting repeaters in between.
Thus the deployment of interfaces with smaller cross-interface
logic delay can result in fewer RSs/EBs being used
for wire pipelining. Because each inserted RS/EB introduces
an additional void token into the system and may potentially
reduce system throughput, it is desirable to design interfaces
with minimal cross-interface logic delay.
In order to analyze the logic delays of the various interface
circuits we synthesized their RTL Verilog implementations (see footnote 4)
with a 90nm industrial standard cell library using SYNOPSYS
DESIGN COMPILER. As shown in Fig. 13, the interface logic
is assumed to drive optimally buffered wires [1], [23]. The
critical logic delays within each individual interface and across
the logic of communicating interfaces are then extracted using
the DESIGN COMPILER static timing analyzer.
For the LID-2ss and LID-1ss designs, which are based
on edge-triggered ﬂip-ﬂops (FFs), the slack is derived by
subtracting the maximum logic delay between two ﬂip-ﬂops
and the ﬂip-ﬂop setup time from the clock period. For the
SEA design, which is based on level-sensitive latches, the
slack is calculated by subtracting the maximum logic delay
between two active-high (or active-low) latches and latch setup
time from the clock period (see footnote 5). When calculating cross-interface
slacks, as shown in Fig. 13 (LID-2ss and LID-1ss) and Fig. 14
(SEA), the delays of forward paths (data and void/valid) tf
and of backward paths (stop) tb are both considered (without
counting delays of buffered wires across the channel).
Footnote 4: We derived the LID-2ss and LID-1ss implementations, and obtained a gate-
level circuit implementation from the authors of SEA. We slightly changed
the latter to avoid excess area overheads, as discussed in Section II.
Footnote 5: Although a latch-based design allows time borrowing, the total delays over
a path spanning a chain of active-high and -low latches must stay within a
fixed number of clock periods determined by the number of high-low latch
pairs. To simplify the analysis without sacrificing accuracy, we assumed that
the path between two active-high (or -low) latches must be within one clock
period.
Fig. 13. Long wires are optimally buffered by repeaters.
Fig. 14. Combinational paths due to the join structure (left) and SEA slack
computation (right).
Fig. 15(a) and Fig. 15(b)-15(e) summarize the results of our
analysis of the impact of logic delay on system performance
in terms of, respectively, the minimum slacks and the maximum
physical lengths of interconnects allowed by the three sets of
interface logic. Fig. 15(a) reports the minimum slacks
left in each interface logic and in the four possible combinations
of communicating interface logic when running at a 500 MHz
clock rate, while ignoring the delays of buffered interconnects.
The channels are assumed to be 64 bits wide, and each
core has two input channels. The more slack an interface
logic has, the faster the clock rate that can be applied. LID-1ss has
more slack in all but one scenario, and thus supports faster
clock rates than LID-2ss and SEA. Conversely, the slack of
the shell-shell pair in SEA is significantly lower. This may either
limit the system clock frequency, or require the insertion of
an additional elastic buffer between the two shells to increase
available slack. But inserting an elastic buffer introduces a void
token and, therefore, it may lower the system throughput.
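The relation between slack and achievable clock rate can be made concrete with a little arithmetic. The sketch below assumes that the slack is simply the clock period minus the logic delay and setup time, and that wire delay is ignored, as in Fig. 15(a); the function names are hypothetical, and the two slack values used in the usage note are the RS-RS entry for LID-1ss and the shell-shell entry for SEA from Fig. 15(a).

```python
def period_ns(clock_mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / clock_mhz


def slack_ns(clock_mhz, logic_delay_ns, setup_ns):
    """Slack = clock period - (maximum logic delay + setup time)."""
    return period_ns(clock_mhz) - logic_delay_ns - setup_ns


def max_clock_mhz(slack_at_ref_ns, ref_clock_mhz=500.0):
    """Highest clock rate at which the slack stays non-negative.

    The time consumed by logic and setup is recovered from the slack
    reported at the reference clock; wire delay is ignored.
    """
    consumed_ns = period_ns(ref_clock_mhz) - slack_at_ref_ns
    return 1000.0 / consumed_ns


def wire_budget_ns(slack_at_ref_ns, clock_mhz, ref_clock_mhz=500.0):
    """Time left for interconnect delay at another clock rate (may be negative)."""
    consumed_ns = period_ns(ref_clock_mhz) - slack_at_ref_ns
    return period_ns(clock_mhz) - consumed_ns
```

With a 1.5 ns slack at 500 MHz, logic and setup consume 0.5 ns, so `max_clock_mhz(1.5)` gives 2000 MHz before any wire delay is added. With a 0.92 ns slack, they consume 1.08 ns, which corresponds to roughly 926 MHz, so `wire_budget_ns(0.92, 1000)` is negative: under these assumptions the timing constraint cannot be met at 1 GHz even with zero-length wires.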
Fig. 15(b)-15(e) report maximum allowable wire lengths be-
tween four different pairs of communicating interface circuits
at various clock frequencies. LID-1ss allows the maximum
interconnect lengths in all four possible scenarios. The “X”
marks indicate that at the given clock frequency the timing
constraint is not met in the corresponding pair of communicat-
ing interfaces, so additional RS/EB must be inserted between
them or the pair must be physically close to avoid long
interconnect wires. The former solution might decrease system
throughput; the latter might constrain physical design tools.
The maximum physical lengths of interconnects allowed
between the RS-shell or shell-shell pairs in SEA are shorter
than what the corresponding slacks imply. This is because the
join structure used in the two-input “shell” in SEA creates
multiple combinational paths running across a single channel
twice or spanning across two channels, as indicated in Fig. 14.
Therefore the slack available between the two-input shell
and its uplink counterparts is shared among the interface
logic and the corresponding forward path and backward path
between them. As a result, the join structure allows a much
shorter physical length for the interconnects, and physical
design tools must be used to carefully “balance” the lengths of
the “joined” wires to avoid timing violations. These combina-
tional paths are introduced by the interface logic with multiple
input channels (here the two-input shell), regardless of whether
the senders are elastic buffers or other processing cores.
Notice that the combinational paths created by the SEA join
structure are unavoidable. In fact, the lack of input queues
at the receiver’s end forces the buffering of unused valid data
tokens at the immediate sender’s end. Hence a multi-input core
receiving an invalid token must request the re-transmission
of all the valid tokens received at the same clock cycle as
they arrive. Consequently, combinational paths between the
communicating interface logic are required.
The above analysis of logic delay shows that the proposed
LID-1ss interface logic can support a higher system clock rate
and throughput than its LID-2ss and SEA counterparts. The
reason is that the interface logic of LID-1ss has more slack,
and requires a smaller number of wire pipelining elements
(relay stations) because it allows longer interconnects between
its interface logic. The latch-based SEA design does provide
additional flexibility to the physical design tools because time
borrowing allows an elastic buffer to tolerate varying wire
delays and thus to be placed within a wider area.
Area Overhead Comparisons. Shell interfaces, relay stations
and elastic buffers do occupy active silicon area and therefore
represent a necessary area overhead of any latency-insensitive
design approach. We analyzed and compared area overhead
ﬁgures for the three approaches discussed in this paper after
performing logic synthesis and technology mapping.
Fig. 16(a) reports the area overhead of the shell designs
in LID-2ss and LID-1ss (for queues of size one and
two) over a range of different channel widths; Fig. 16(b)
shows the corresponding overhead incurred in elasticization
of processing cores with different numbers of flip-flops. The
area overhead of the LID-1ss shell with queue of size two is
roughly the same as that of the LID-2ss shell, while the
LID-1ss shell with queue of size one is smaller. In fact, the
area of a shell is dominated by the area of its queues, which
depends on the widths of the input channels. For a SEA, the
area overhead of elasticizing a processing core grows with the
number of ﬂip-ﬂops contained in the core. This is because the
substitute latches require a little more area than the replaced
edge-triggered ﬂip-ﬂops.
Fig. 16(c) compares the area overhead of the three LID
shells and their SEA counterpart when they are used to
encapsulate different instances of a 32 × 32 pipelined
multiplier synthesized from the SYNOPSYS DESIGNWARE IP core
library. For a number of pipeline stages varying from 2 to 6
the bar diagram reports the absolute area of the synthesized
multipliers as well as the area of the corresponding shells. The
overhead ratio between each shell's area and the multiplier's
area is labeled on top of each corresponding bar. As expected,
the absolute area of the shells in LID-2ss and LID-1ss is
constant regardless of the number of pipeline stages, but the area
overhead ratio of the LID-1ss shell drops from 16% to
13% (in the case of input queue size q = 1) as the multiplier's
logic grows (the same trend applies to LID-2ss). In contrast,
the area of the SEA “shell” grows slightly with the number
of pipeline stages, and its area overhead ratio grows from 5%
to 10%. In this example the area overhead of LID-1ss and
LID-2ss is significant but this is greatly reduced for IP cores
that are more complex than a pipelined multiplier.
          shell   RS     shell-RS   RS-RS   RS-shell   shell-shell
LID-1ss   1.23    1.28   1.32       1.5     1.33       1.24
LID-2ss   1.14    1.23   1.32       1.32    1.1        1.27
SEA       1.24    1.00   1.21       1.44    1.31       0.92
(a) Slacks (in nanoseconds) of interface logic at 500 MHz clock rate.
(b)-(e) Maximum wire length (mm) versus clock frequency (500, 750, 1000,
and 1250 MHz) for LID-1ss, LID-2ss, and SEA: (b) 2-in-2-out shell → RS;
(c) relay station → relay station; (d) RS → 2-in-2-out shell;
(e) 2-in-2-out shell → 2-in-2-out shell.
Fig. 15. Minimum slacks and maximum physical lengths of interconnects
allowed by interface logic. The input queue size of LID-1ss shell is two.
Fig. 16(d) reports the area of relay stations and elastic
buffers over a range of different channel widths. The area
overhead of the latch-based SEA elastic buffers is 2/3 of that of their
LID-2ss and LID-1ss counterparts thanks to the clever use of
two latches to provide the needed twofold capacity. Due to
the more complex steering logic between their flip-flops,
LID-1ss relay stations are slightly larger than the LID-2ss ones.
(a) 2-in-2-out shells in LID-2ss and LID-1ss (q=1 and q=2): area (um2)
versus channel width (bits). (b) 2-in-2-out EB control in SEA: area (um2)
versus number of the core's flip-flops. (c) Area overhead of encapsulating
a pipelined multiplier: area (um2) versus pipeline stages. (d) RS and EB:
area (um2) versus channel width (bits).
Fig. 16. Area of synthesized interface circuits.
VI. CONCLUDING REMARKS
We proposed a new class of interface circuits to support
latency-insensitive design based on LID-1ss, a simpler latency-
insensitive protocol. We presented a detailed experimental
analysis comparing the LID-1ss interface circuits to those
supporting the original protocol discussed in [4], [8], that
we called LID-2ss, as well as to the interface circuits for
synchronous elastic architectures that were proposed in [10].
We showed that LID-1ss offers clear improvements in terms
of area overhead and logic delay with respect to LID-2ss. With
respect to the interface circuits for synchronous elastic archi-
tectures the LID-1ss interface circuits have smaller logic delay
and, for many systems, enable higher processing throughput.
VII. ACKNOWLEDGEMENTS
The authors would like to thank Jordi Cortadella for
providing the SEA interface circuits and Michael Theobald
and Franjo Ivančić for helpful discussions. This research is
partially based upon work supported by the NSF under Grant
No. 0541278, an NDSEG fellowship, and the GSRC.
REFERENCES
[1] R. Ho, K. W. Mai, and M. A. Horowitz, “The future of wires,” IEEE
Proc., vol. 89, no. 4, pp. 490–504, Apr. 2001.
[2] D. Matzke, “Will physical scalability sabotage performance gains?”
IEEE Computer, vol. 30, pp. 37–39, Sep. 1997.
[3] L. P. Carloni and A. L. Sangiovanni-Vincentelli, “Coping with latency
in SOC design,” IEEE Micro, vol. 22, no. 5, pp. 24–35, Sep-Oct 2002.
[4] L. P. Carloni, K. L. McMillan, A. Saldanha, and A. L. Sangiovanni-
Vincentelli, “A methodology for “correct-by-construction” latency in-
sensitive design,” in Proc. of the Intl. Conf. on Computer-Aided Design.
San Jose, CA: IEEE, Nov. 1999, pp. 309–315.
[5] L. P. Carloni, K. L. McMillan, and A. L. Sangiovanni-Vincentelli,
“Theory of latency-insensitive design,” IEEE Trans. on Computer-Aided
Design of Integrated Circuits and Systems, vol. 20, no. 9, pp. 1059–1076,
Sep. 2001.
[6] A. Benveniste, P. Caspi, S. Edwards, N. Halbwachs, P. L. Guernic, and
R. de Simone, “The synchronous language twelve years later,” Proc. of
the IEEE, vol. 91, no. 1, pp. 64–83, Jan. 2003.
[7] E. A. Lee and A. Sangiovanni-Vincentelli, “A Framework for Comparing
Models of Computation,” IEEE Trans. on Computer-Aided Design of
Integrated Circuits and Systems, vol. 17, no. 12, pp. 1217–1229, Dec.
1998.
[8] L. P. Carloni, “The role of back-pressure in implementing latency-
insensitive systems.” Electr. Notes Theor. Comput. Sci., vol. 146, no. 2,
pp. 61–80, 2006.
[9] R. Collins and L. Carloni, “Topology-based optimization of maximal
sustainable throughput in a latency-insensitive system,” to appear in
Proc. of the Design Automation Conf. (DAC), Jun. 2007.
[10] J. Cortadella, M. Kishinevsky, and B. Grundmann, “Synthesis of syn-
chronous elastic architectures,” in Proc. of the Design Automation Conf.,
2006, pp. 657–662.
[11] A. Agiwal and M. Singh, “An architecture and a wrapper synthesis
approach for multi-clock latency-insensitive systems,” in Proc. of the
Intl. Conf. on Computer-Aided Design, 2005, pp. 1006–1013.
[12] M. R. Casu and L. Macchiarulo, “A new approach to latency insensitive
design,” in Proc. of the Design Automation Conf., 2004, pp. 576–581.
[13] T. Chelcea and S. M. Nowick, “Robust interfaces for mixed-timing
systems,” IEEE Trans. on Very Large Scale Integrated Systems., vol. 12,
no. 8, pp. 857–873, 2004.
[14] R. Lu and C.-K. Koh, “Performance analysis of latency-insensitive
systems,” IEEE Trans. on Computer-Aided Design of Integrated Circuits
and Systems, vol. 25, no. 3, pp. 469–483, Mar. 2006.
[15] M. Singh and M. Theobald, “Generalized latency-insensitive systems
for single-clock and multi-clock architectures,” in Proc. of the Conf. on
Design, Automation and Test in Europe, 2004, pp. 1008–1013.
[16] H. Jacobson, P. Kudva, P. Bose, P. Cook, S. Schuster, E. Mercer, and
C. Myers, “Synchronous interlocked pipelines,” in Proc. of the Intl.
Symp. on Asynchronous Circuits and Systems, Apr. 2002, pp. 3–12.
[17] S. Suhaib, D. Mathaikutty, D. Berner, and S. Shukla, “Validating families
of latency insensitive protocols,” IEEE Trans. on Computers, vol. 55,
no. 11, pp. 1391–1401, 2006.
[18] A. Cimatti, E. Clarke, F. Giunchiglia, and M. Roveri, “NUSMV: a new
Symbolic Model Veriﬁer,” in Proc. of the Intl. Conf. on Computer-Aided
Veriﬁcation, July 1999, pp. 495–499.
[19] L. P. Carloni and A. L. Sangiovanni-Vincentelli, “Performance analysis
and optimization of latency insensitive systems,” in Proc. of the Design
Automation Conf., Jun. 2000, pp. 361–367.
[20] C. V. Ramamoorthy and G. S. Ho, “Performance evaluation of
asynchronous concurrent systems using Petri nets,” IEEE Trans. on Software
Engineering, vol. 6, no. 5, pp. 440–449, Sep. 1980.
[21] R. M. Karp, “A characterization of the minimum cycle mean in a
digraph,” Discrete Mathematics, vol. 23, pp. 309–311, 1978.
[22] A. Dasdan and R. Gupta, “Faster maximum and minimum mean cycle
algorithms for system-performance analysis,” IEEE Trans. on Computer-
Aided Design of Integrated Circuits and Systems, vol. 17, pp. 889–899,
Oct. 1998.
[23] J. M. Rabaey, A. Chandrakasan, and B. Nikolić, Digital integrated
circuits: a design perspective. Prentice-Hall, Inc. Upper Saddle River,
NJ, USA, 2002.