Another Glance at Relay Stations in Latency-Insensitive Design  by Boucaron, Julien et al.
Julien Boucaron1
Jean-Vivien Millo2
Robert De Simone3
AOSTE Team
INRIA
2004, Route des Lucioles - BP 93
Sophia Antipolis, France
Abstract
We revisit the formal modeling of relay stations, which are speciﬁc connection elements used in
the theory of Latency-Insensitive Design of Globally-Asynchronous/Locally-Synchronous systems.
Relay stations are in charge of taking into account the physical mandatory latencies, while handling
the regulation of signal/data traﬃc so as to avoid starvation, deadlock and congestion of local IP
synchronous computation blocks. Since they were proposed by Carloni et al., the structure and behaviors
of these relay stations have been amply characterized and analyzed. But previous works did not
provide a fully formal and cycle-accurate description of these mechanisms, amenable to formal
veriﬁcation for instance (instead, mainly simulation models were developed). Due to the needed
precision of the whole scheme we feel such a formal description might be needed. We describe such
an attempt here.
Keywords: Latency Insensitive, Relay Station, Shell, GALS, Formal Veriﬁcation, Marked Graph,
Synchronous, Esterel, SyncCharts
1 Email: jboucaro@sophia.inria.fr
2 Email: jvmillo@sophia.inria.fr
3 Email: rs@sophia.inria.fr
Electronic Notes in Theoretical Computer Science 146 (2006) 41–59
1571-0661 © 2006 Elsevier B.V. 
www.elsevier.com/locate/entcs
doi:10.1016/j.entcs.2005.05.035
Open access under CC BY-NC-ND license.
1 Introduction
Long wire interconnect latencies may induce timing-closure difficulties in modern
SoC designs, with propagation of signals across the die in a single clock
cycle becoming problematic. The theory of latency-insensitive design (LID),
proposed originally by L. Carloni, K. McMillan and A. Sangiovanni-Vincentelli
[21,22], offers solutions for this issue. The theory can roughly be described as
follows: an initial fully synchronous reference specification is first desynchronized
as an asynchronous network of synchronous block components (a GALS sys-
tem). Then proper interconnect mechanisms are introduced to resynchronize
the global system, but allowing speciﬁed (integer-time) latencies at intercon-
nects, under the form of ﬁxed-sized lines of so-called relay stations. These
relay stations, together with “shell” wrappers around the synchronous “pearl”
IP blocks, are in charge of managing the signal value ﬂows. With their help
proper regulation is performed between computation blocks that may be tem-
porarily unable to run, either because of input data unavailability, or because
of the inability of the rest of the network to store their results if they were
produced. The second problem comes from the boundedness of hardware re-
sources, and the ﬁxed-size buﬀering capacity of the interconnects (the lines of
relay stations).
Since their invention relay stations have been a subject of attention for a
number of research groups. Extensive modeling, characterization and analy-
sis were provided in [12,15,14]. Still, the modeling level has not completely
reached a fully formal stage, so that proofs of correctness are still informal,
either based on textual proof hints, or simulation model executions. We shall
use a paper by Casu and Macchiarulo [25], which provides such an
(excellent) modeling, as our starting point. We depart from their description
on a number of features, though (for instance they do not include the output
functions as part of their FSM state machines describing the control structure
of each relay station).
Each relay station can be conceived as a cell, to be part of a line of n, then
composing the sectioned wire with a latency of n clock cycles. Relay stations
implement a given protocol, that will in a sense be preserved by their chaining,
only increasing the mandatory latency duration. Each station can receive a
valid signal data from its predecessor (either a shell around an IP block or
another station), and pass it down in the next clock cycle to its successor.
The relay station can also receive in the reverse direction a regulation signal,
implementing a “back-pressure” feature, to indicate that the successor node
is unable to accept more data. In this case the station should refrain from
sending its value and keep it instead. It should also still be able to receive the
next one in this cycle (as the previous node was not warned of the congestion
yet), and if necessary should propagate the back-pressure congestion signal to
the previous node in the next clock cycle. The “next-cycle delays” are needed
to respect the physical latency assumption. Of course there are also times
where no valid data is transmitted from the previous node because upstream
computations were temporarily halted due to lack of inputs. It should thus be
noted that any relay station needs a capacity to hold two values simultaneously,
in case it cannot propagate the current one while a new one simultaneously
arrives. It can also be empty, if valid data are produced more slowly than
consumed.
Currently the role of relay stations is two-fold: they implement the on-
line scheduling scheme requested for proper handling of congestion risks, by
back-pressure mechanisms; they also provide the temporary storage for data
for as long as they cannot be forwarded further down the line. The second
role is debatable: if the data were allowed to continue their route, they could
be stored at the destination shell, if it would provide a dedicated buﬀer with
the same size as the accumulated buﬀering capacity of all the relay stations
on this line. Even better, moving all storage to a single spatial location would
then ease the physical synthesis burden. This was noted in [25]. Of course the
traﬃc regulation and the back-pressure mechanisms should still be applied in a
mandatory fashion, since otherwise the end destination buﬀer could overﬂow.
But they would only stall back data traﬃc and computation at shell level,
not halfway through the interconnects. Back pressure mechanisms now show
the net eﬀect of retro-propagating information on the congestion and traﬃc
jam reported “downwards”. They do so only when needed, but as early as
feasible, while respecting the latencies needed to travel through the long wires.
The paper is organized as follows:
In section 2 we recall brieﬂy the basic contextual deﬁnition of synchronous
circuits (for local components) and GALS systems (as networks of local syn-
chronous computation components connected by unbounded buﬀers). We
mention some initialization issues, solved as in [17] by the data valueless ab-
straction of GALS models into Petri Net Marked Graphs. It should be noted
here that the body of theoretical results developed around Marked Graphs,
also called Event Graphs in the literature, can provide a number of useful
analytical results for the characterization of such systems [16,5]. This is also
true in the case of places with bounded capacity, and it provides answers to
issues mentioned in previous papers on LID systems. In particular it provides
suﬃcient conditions for proper initialization of data in lines, so as to guarantee
liveness as absence of deadlock but also congestion altogether. On the other
hand, Event Graphs (as all Petri Net subclasses) are inherently asynchronous
as a concurrency model, and their application to scheduling and “maximal
progress” remains for us to be investigated. Here again answers might already
exist in the literature.
In section 3 we provide abstract requirements and formal constraints to
be satisﬁed by relay stations models. We are starting from the model of
[25], which itself somehow summarizes previous works. We provide our formal
model, under the form of a SyncChart, with regular features and output signal
clear timing speciﬁcation. Our model is amenable to description in Esterel
[7] or SyncCharts [3,4], thereby allowing formal methods and model-checking
techniques [9]. Of course this could also be possible by providing a direct
netlist description in blif format for instance, but we gain syntactic ﬂexibility,
to describe easily the combination of several relay stations into a wire of great
latency for instance.
In section 3.2 we specify formally a number of correctness properties, that
can be established on a line of relay stations. Of course brute-force model-
checking does not allow to reason on parametric models (where here the pa-
rameter would be the latency length n of the line), so we need to instantiate
several constant length values.
We describe the shell wrappers (here very close to the version of [25]) in section
4. Again we model-checked them to establish correctness properties.
Related work in the direction of using static scheduling to optimize cy-
cle allocation was started in [11,24,13], under the naming of recycling and
inspired by software pipelining cycle allocation techniques. It extends and
refers somehow to the paradigms of sequential circuit retiming [20].
We conclude with several open questions. The main topic for extension
that attracts our attention is the following one: currently the design method-
ology starts from a monolithic synchronous speciﬁcation. This is needed to
retain several important synthesis techniques from commercial EDA ﬂows.
But if one can recognize that this seemingly synchronous description in fact
contains information indicating timing flexibility and potential decomposition
into smaller synchronous “pearls”, how could we efficiently extend the
approach to use this extra knowledge? Here we are referring to so-to-speak
asynchronous processes (with the word “asynchronous” here applied to the
computation model), rather than to buffered connections (where the word
“asynchronous” is applied to the communication model). Examples of such extra
information could be provided by the user (as multirate/multiclock modeling
extensions, or exclusive control modes) [1,8,27]. It could also be extracted
by dynamic semantic analysis, as is done in the iso/endochrony theory of
Benveniste et al [23,6] (to the best of our understanding).
2 Preliminaries
Synchronous circuit:
A synchronous circuit is associated with a clock. It has a signal interface
consisting of three sets of (Boolean) input, register and output signals, and
an internal state consisting of a set of (Boolean) registers (or ﬂip-ﬂops). On
each clock tick, it produces current outputs and next-instant register values
from the current values of inputs and registers.
Formally, a synchronous circuit is thus a structure $\langle I, O, R, Out, Next \rangle$,
where
• $I$ is a set of Boolean input variables $\{I_0, \ldots, I_{n-1}\}$. We call the vector
$I = \langle I_0, \ldots, I_{n-1} \rangle \in \mathbb{B}^n$ an input event. It represents the valuation of all
input variables at a given instant.
• $O$ is the set of Boolean output variables $\{O_0, \ldots, O_{m-1}\}$. We call the vector
$O = \langle O_0, \ldots, O_{m-1} \rangle \in \mathbb{B}^m$ an output event. It represents the valuation of
all output variables at a given instant.
• $R$ is the set of Boolean register variables $\{R_0, \ldots, R_{p-1}\}$. We call
$R = \langle R_0, \ldots, R_{p-1} \rangle \in \mathbb{B}^p$ the current state. We also use the next state
$R' = \langle R'_0, \ldots, R'_{p-1} \rangle$, using primed names.
• $Out$ is a vector $\langle Out_0, \ldots, Out_{m-1} \rangle$ of Boolean functions,
$Out_j : (\mathbb{B}^n \times \mathbb{B}^p) \to \mathbb{B}$. So each function $Out_j$ defines the value of output variable $O_j$
from the current values of input and register variables.
• $Next$ is a vector $\langle Next_0, \ldots, Next_{p-1} \rangle$ of Boolean functions,
$Next_j : (\mathbb{B}^n \times \mathbb{B}^p) \to \mathbb{B}$. So each function $Next_j$ defines the next value of register
variable $R'_j$ from the current values of input and register variables.
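The structure above can be illustrated by a minimal Python sketch. The class name `SyncCircuit`, the method `tick` and the toggle example are hypothetical and chosen only for illustration; the sets I, O and R are left implicit in the arities of the function vectors.

```python
from typing import Callable, Sequence, Tuple

Bits = Tuple[bool, ...]
BoolFn = Callable[[Bits, Bits], bool]  # (input event, current state) -> Boolean


class SyncCircuit:
    """Sketch of a structure <I, O, R, Out, Next> (names are illustrative)."""

    def __init__(self, out_fns: Sequence[BoolFn], next_fns: Sequence[BoolFn],
                 init_regs: Bits) -> None:
        assert len(next_fns) == len(init_regs)  # one Next_j per register R_j
        self.out_fns = list(out_fns)
        self.next_fns = list(next_fns)
        self.regs = tuple(init_regs)

    def tick(self, inputs: Bits) -> Bits:
        """One clock tick: outputs and next state both read the current state."""
        outputs = tuple(f(inputs, self.regs) for f in self.out_fns)
        next_regs = tuple(f(inputs, self.regs) for f in self.next_fns)
        self.regs = next_regs
        return outputs


# Hypothetical example: one input (enable), one register, one latched output.
toggle = SyncCircuit(
    out_fns=[lambda i, r: r[0]],            # Out_0 = R_0
    next_fns=[lambda i, r: r[0] != i[0]],   # Next_0 = R_0 xor I_0
    init_regs=(False,),
)
trace = [toggle.tick((True,)) for _ in range(3)]
```

Because `Out_0` reads only the register, this example is a Moore-style circuit: its output never depends combinatorially on the current input.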
Synchronous or asynchronous networks of synchronous circuits
One can build larger circuits by setting local (IP) synchronous components
in parallel, establishing desired point-to-point interconnections of inputs to
outputs of diﬀerent blocks. This is displayed in ﬁgure 1, if one assumes for
the connections simple wires, and that all components run on the same clock.
The result is then a compound netlist, homogeneous in nature with the local
component synchronous circuits.
On the other hand one can also assume that local synchronous components
are not globally synchronized, and that connections are established through
“ideal” unbounded FIFO queues. This builds another interpretation of figure
1, as a global data-flow network. Now each component can be allowed to run
only when all its input data values are present. The effect of its run is to consume
one input value on each input channel, and to produce one output value
on each output channel. It can be conceived of as a fully unrestricted GALS
system. We shall use this stage of representation only as an intermediate step
for conceptual modeling.
Fig. 1. Network of synchronous IP blocks (synchronous or asynchronous)
As noted in [17], the unrestricted GALS model maps directly to Event/Marked
Graphs (a well-known subclass of Petri Nets) when disregarding values carried
as signal data. This association helps prove that, under some careful initial-
ization conditions, this asynchronous version is functionally equivalent to the
previous, fully synchronous one (see below the discussion on initialization).
Marked Graphs
Also called Event graphs in the literature, they form a speciﬁc subclass of
Petri Nets where places have exactly one input transition and one output tran-
sition [16]. In our case transitions represent local synchronous components,
which indeed consume one data item on each input channel, and produce one on
each output channel in each step. With data abstracted as “tokens”, the place
marking represents the number of data items currently contained in the interconnect
FIFO queue.
Marked/Event Graphs are “free-choice” nets. Various executions only dif-
fer in relative schedulings of ﬁrings of individual transitions, and these behav-
iors are conﬂuent: the ﬁring of a given transition cannot disallow the one of
another if it was previously allowed. Also, the sum of all place markings in a
given graph cycle remains invariant all along any execution. A Petri Net (PN)
is called live if any transition can still be executed (possibly after a number of
steps) from any reachable marking. It was proved in [16] that a Marked
Graph is live if each graph cycle contains at least one token in one of its places.
Figure 2 shows a Marked Graph associated with the previous GALS network
(in its asynchronous form).
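The firing rule and the cycle invariant can be tried out on a small example. The following Python sketch is only illustrative: the three-transition ring, the place names and the helpers `enabled` and `fire` are assumptions, not taken from figure 2.

```python
# Each place has exactly one input and one output transition (Marked Graph).
# place name -> (input transition, output transition); hypothetical ring a -> b -> c -> a
PLACES = {"p_ab": ("a", "b"), "p_bc": ("b", "c"), "p_ca": ("c", "a")}


def enabled(t, places, marking):
    """A transition is enabled when every one of its input places holds a token."""
    return all(marking[p] > 0 for p, (_, dst) in places.items() if dst == t)


def fire(t, places, marking):
    """Consume one token per input place, produce one per output place."""
    if not enabled(t, places, marking):
        raise ValueError(f"transition {t} is not enabled")
    new = dict(marking)
    for p, (src, dst) in places.items():
        if dst == t:
            new[p] -= 1
        if src == t:
            new[p] += 1
    return new


marking = {"p_ab": 0, "p_bc": 0, "p_ca": 1}  # one token on the single cycle
history = [marking]
for t in ["a", "b", "c", "a"]:
    marking = fire(t, PLACES, marking)
    history.append(marking)

# All three places lie on the same cycle, so their sum must stay constant.
cycle_sums = [sum(m.values()) for m in history]
```

With one token on its only cycle, this ring can fire forever; with an all-zero initial marking no transition would ever be enabled.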
Fig. 2. A live Marked Graph associated with the previous GALS picture
Marked Graphs with Place Capacities
In GALS theories (such as Latency-Insensitive design and others), the
purpose is usually to build a model “in between” the fully synchronous and
asynchronous ones. In particular it is important in SoC design to be able to
restrict interconnects so as to use only bounded space. The general philosophy
is thus: first, desynchronize the fully synchronous specification; second, resynchronize
it by careful scheduling mechanisms in a way that respects mandatory
physical latencies, while using only bounded communication resources. At the
abstract PN level, this boundedness can be modeled with place capacities (the
scheduling issue will be dealt with later).
Capacities are introduced in Petri Nets by requesting that a given place
cannot hold more than n tokens, n being the capacity of that place. Capacities
can be traced back to the foundation of PN history, without a clear seminal
paper (see [5] for definitions and [2] for a proof of the equivalence theorem). In
fact it is immediate to replace a PN with capacities by an equivalent one
without capacities: for each existing place, add a new place whose initial marking
is the difference between the original place capacity and its initial
marking. This new place is connected to the same transitions as the original one,
but in the reverse direction. Figure 3 displays a Petri net with capacities (here of 1 for simplicity),
and the equivalent Petri net with duplicated backward places.
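The completion just described is mechanical, as the following Python sketch suggests. The dictionary encoding, the `_rev` naming of backward places and the helper names `complete` and `enabled` are assumptions made for this illustration only.

```python
def complete(places):
    """Replace capacities by backward places.

    places: name -> (src transition, dst transition, marking, capacity)
    returns: name -> (src transition, dst transition, marking), capacity-free
    """
    net = {}
    for name, (src, dst, marking, capacity) in places.items():
        assert 0 <= marking <= capacity
        net[name] = (src, dst, marking)
        # Backward place: reversed arcs, marking = free room left in the original.
        net[name + "_rev"] = (dst, src, capacity - marking)
    return net


def enabled(t, net):
    """Ordinary firing rule on the completed net (no capacity test needed)."""
    return all(m > 0 for (_, dst, m) in net.values() if dst == t)


# Hypothetical example: a single place of capacity 1, currently full.
bounded = {"p": ("a", "b", 1, 1)}
net = complete(bounded)
```

Here the producer `a` is disabled in the completed net because the backward place `p_rev` is empty, which is exactly the effect the capacity had in the original net.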
Of course the bounded capacity raises new liveness problems, this time
because of congestion and overﬂow instead of starvation and lack of available
data tokens. Fortunately we can use the important fact that the above com-
pletion preserves the Marked Graph subclass, and inside this context solutions
will be found. As will appear later, the latency-insensitive relaxed-synchronous
version of our GALS system will have the capacity to hold 2n data tokens
on a connection line consisting of n relay stations.
The ﬁnal models produced in LID theory are (on ﬁrst approximation)
latency-bounded, resynchronized versions of marked graphs with capacities.
In the sequel we shall call them relaxed-synchronous systems, as they combine
both synchronous features (all components and interconnects run on the same
clock), and user-imposed interconnect minimal latencies (a constant integer
delay for the line to transmit its signal/data values). While the data are still
in transit, computation parts are paused by their surrounding shells, using
clock-gating mechanisms. To respect the ﬁxed-sized buﬀering ability a back-
pressure congestion control protocol is applied across relay-stations.
Fig. 3. From a Petri net with capacity (a) to another Petri net without capacity but with the
same behavior (b)
Initial and well-formedness conditions
We consider here various issues of proper initialization and structural well-
formedness of the networks, ensuring for each semantics, be it synchronous
or asynchronous, both starvation-freedom (or PN liveness), and congestion-
freedom (or PN safety). We also brieﬂy consider the quantitative issue of
production rate (or throughput).
We recall the well-known fact that any graph can be decomposed as a
directed acyclic graph (DAG) of strongly connected components (SCC), with
a SCC possibly containing a single node.
Concerning synchronous networks of synchronous components, a
valid signal/data must be present on each wire at the clock tick. In order to
achieve this (while assuming it from the network primary inputs), one usu-
ally imposes that there is no combinatorial loop across the network. In other
words each loop in the network graph must cross a register, which produces
its output one clock cycle after it received its input. Here the network
graph consists of the local dependencies inside the components plus the
interconnections between components. This is a strictly weaker condition than
imposing that all component outputs are latched (as in Moore fashion), even
though the second assumption is often recommended for composite design
style, and is actually implicitly adopted in some of the GALS literature. Note
here that the program of splitting up long combinatorial wires into sections
is only fulﬁlled if not all local outputs are latched. Still, if it is the case one
remains capable of turning unit delays into arbitrarily chosen delays.
Concerning asynchronous networks of synchronous components it
is also the case that the network is live (so that all local components get ﬁred
inﬁnitely often) iff there is at least one token in each network cycle loop
(provided the primary inputs each provide an inﬁnite stream of signal/data of
course). This is a direct consequence of the result of Marked Graphs liveness.
This matches closely the corresponding assumption on synchronous networks,
provided the register is in fact a latch on a local output (but still not all
outputs need to be latched, only one in each network communication cycle).
The latched output can then be, in a sense, drawn from the local component
to become the seed initial value of the interconnecting FIFO queue. Of course
initialization with more values in the queues is feasible with liveness preserved
(the more tokens the better in this case). But it is problematic to figure out
how to obtain these seed values in general if starting from a fully synchronous
speciﬁcation with which to retain functional equivalence.
Considering relaxed-synchronous versions, where bounded capacity
channels are replacing the unbounded buﬀering capacity of FIFO queues, a
new kind of liveness problem is raised. Because of potential congestion, local
computation blocks can now get blocked because their output channels are
not ready to accept their results, which they could not store without overflow.
This issue is theoretically solved by requesting that the completed PN
does not allow any blank cycle. Here the PN completion consists in adding
the backward places to play the role of capacities. In other words each graph
cycle in the completed graph should contain at least a token mark in one of
its places. The net of ﬁgure 3 is a typical counterexample of this: with places
each of capacity one, the net on the left is blocked; this is made explicit as
blank cycles in the completed net on the right.
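The blank-cycle condition can be checked by a plain graph search: keep only the places that hold no token and look for a directed cycle among them. The sketch below is a minimal Python illustration; the function name `has_blank_cycle`, the list encoding and the two-transition example are assumptions (the exact marking of figure 3 is not reproduced here).

```python
def has_blank_cycle(places):
    """places: list of (src transition, dst transition, tokens).

    Returns True if some directed cycle goes only through token-free places.
    """
    succ = {}
    for src, dst, tokens in places:
        succ.setdefault(src, [])
        succ.setdefault(dst, [])
        if tokens == 0:
            succ[src].append(dst)  # keep only empty places as edges

    WHITE, GREY, BLACK = 0, 1, 2
    color = {t: WHITE for t in succ}

    def visit(t):
        color[t] = GREY
        for u in succ[t]:
            if color[u] == GREY or (color[u] == WHITE and visit(u)):
                return True  # back edge: a cycle of empty places
        color[t] = BLACK
        return False

    return any(color[t] == WHITE and visit(t) for t in succ)


# Hypothetical ring a <-> b, both places marked, capacities ignored: live.
ring = [("a", "b", 1), ("b", "a", 1)]
# Same ring with capacity 1 per place: backward places carry 1 - 1 = 0 tokens.
ring_completed = ring + [("b", "a", 0), ("a", "b", 0)]
# Capacity 1 per place, but only one place marked: backward markings are 0 and 1.
half_completed = [("a", "b", 1), ("b", "a", 0), ("b", "a", 0), ("a", "b", 1)]
```

In this example the fully loaded ring is live without capacities but blocked once capacities of one are imposed, while the half-loaded ring has no blank cycle in its completion.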
As we shall see later, a channel of n relay stations has a buﬀering capacity
of 2n signal/data values. In the (frequent) case where the line is assumed to
be initialized with only one value, then the virtual backward places all contain
at least a token, thereby deﬁnitely disallowing blank cycles.
It has often been remarked in the GALS literature that, ultimately, a
(simply connected) relaxed-synchronous network could run no faster than the
speed of its slowest simple cycle loop. First, any SCC is restricted to the speed
of its slowest cycle (after perhaps an initial phase where enough internal tokens
can allow some parts to take “almost one lap in advance”). Then, whenever
the part located upfront from a SCC starts running ahead, tokens accumulate
at the entrance of this SCC until the bounded buﬀer gets ﬁlled, after which
point there is no choice but to run the SCC behavior part. The same holds for the
downstream parts, which need data production from the upstream and SCC
parts to be fed to run. It was established that the rate of the slowest loop is
computed as the ratio of the number of data tokens to the overall buffering
capacity along the loop.
3 Relay Station
We now come to the main part of this article. The purpose is to implement
ﬁxed-size communication channels that divide the long wires into sections,
such that a signal/data can be propagated from one section to the next only in
the next clock cycle. Similarly the signals needed to implement the congestion
control back-pressure must also respect these traveling delays. To this end,
relay stations were introduced in [21]. They are speciﬁc hardware elements
that provide the proper interface between sections (and also the shells at the
channel’s ends). These elements must have some buﬀering activities, to store
data “on route” of course, but also to park these additional data which might
discover that because of congestion, the channel downstream cannot accept
them.
3.1 Relay Station Modeling
Despite the number of publications describing relay stations in the literature,
they are usually informally characterized. Neither the precise constraints
representing their physical time requirements (in clock cycles), nor their formal
model and the proper satisfaction of these constraints, is fully described. The paper that comes
nearest to this is [25]. However, its authors do not use a pure synchronous modeling
in their FSM (Finite State Machine). We shall deal now with all these issues.
Fig. 4. Relay Station - Block Diagram
We borrow from [25] the interface of input/output signals. It is depicted
in figure 4. The data reception is represented by an input signal val_in being
raised (it corresponds to ¬τ in the former articles on LID). It is a pure Boolean
signal (we can abstract the data values). Then the RS passes the data on with
a corresponding val_out signal. Concerning back-pressure, the RS can receive
a halting order from downstream with the signal stop_out being raised. It then
transmits it upstream with a stop_in signal
(so stop_out is an input, and stop_in is an output).
Pseudo-physical requirement:
It is important to note that signal/data cannot be propagated combinatorially
from one section to the next:
• $val\_in \hookrightarrow_{next} val\_out$.
• $stop\_out \hookrightarrow_{next} stop\_in$.
On the other hand, there can be combinatorial relations between stop_in
and val_in (resp. between stop_out and val_out), as they belong to the same
section.
Fig. 5. (a) Relay Station Structure (b) Relay Station SyncChart
So relay stations need registers (ﬂip-ﬂops for instance) to retain the sig-
nal between reception and propagation. In fact, as shown in [21], they need
two such slots, in case a new data arrives while the current one cannot be
propagated. Then, the congestion mechanism is supposed to guarantee that
no further data can be received (and thus lost), because they are retained
elsewhere upstream. This provides the abstract ﬁgure 5 (a).
3.1.1 Relay Station - SyncChart
We represent in figure 5 (b) the relay station as a SyncChart [3,4], with explicit
states, thus handling both the output and next-state functions. We
introduce this SyncChart using, as state encoding, the number of free registers
within the relay station.
The SSM (Safe State Machine) contains 3 states, corresponding to the
occupation of the registers:
empty when no data are currently buffered in the RS; in this state the RS
simply waits for a valid input data, and stores it in its main register (going to
state half). stop_out signals are ignored, and not propagated upstream, as
this cell can absorb traffic.
half when it holds one data; then the RS cell transmits its current, previously
received signal data only if it does not receive a halting stop_out
signal (remember this combinatorial relation is correct, being inside a section).
If halted, it retains its data, but must also accept a potential new one
from upstream (as it has not sent any back-pressure holding signal yet). If such
a new data arrives, it becomes full, with the second value occupying its “emergency”
auxiliary register. If the RS can transmit (stop_out false), it either
goes back to empty or receives a new valid data, then remaining in the
same state. In that case it still makes no provision to propagate
back-pressure (in the next clock cycle), as this is still unnecessary due to its
own buffering capacity.
full when it contains two data; then it raises in any case the stop_in signal,
propagating to the upstream section the hold-out stop_out signal received
in the previous clock cycle. If it does not itself receive a new stop_out, then
the line downstream was cleared enough so that it can transmit its data;
otherwise it keeps it and remains halted.
NB: In the SyncChart, a signal is emitted (denoted by /) only when true; it is false otherwise.
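The three states can be summarized as a small Mealy-style step function. The Python sketch below is one reading of the prose description above, not a transcription of the SyncChart of figure 5 (b); the names `State` and `rs_step` are hypothetical, and treating a val_in received in state full as an error is an assumption that reflects the overflow property discussed later.

```python
from enum import Enum


class State(Enum):
    EMPTY = 0
    HALF = 1
    FULL = 2


def rs_step(state, val_in, stop_out):
    """One clock cycle of a relay station.

    Inputs:  val_in (from upstream), stop_out (from downstream).
    Returns: (next_state, val_out, stop_in).
    """
    if state is State.EMPTY:
        # Nothing to send; stop_out is ignored and not propagated.
        return (State.HALF if val_in else State.EMPTY), False, False
    if state is State.HALF:
        if stop_out:
            # Halted: keep the datum, and park a newcomer in the auxiliary register.
            return (State.FULL if val_in else State.HALF), False, False
        # Free to transmit: send the stored datum, possibly latch a new one.
        return (State.HALF if val_in else State.EMPTY), True, False
    # FULL: back-pressure is raised towards upstream in any case.
    if val_in:
        raise RuntimeError("overflow: val_in received while full (assumed illegal)")
    if stop_out:
        return State.FULL, False, True
    return State.HALF, True, True
```

In this reading val_out depends combinatorially on stop_out, which is allowed since both belong to the same section, whereas stop_in depends only on the current state, so a stop_out can reach stop_in no earlier than the next clock cycle.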
Discussion
With this precise cycle-accurate model, one can for instance wonder whether
it would be feasible to improve the design to be able, while full, to both propagate
its current data and accept a new one, remaining full. Of course this
should be useless in practice, because the val_in signal could not be received
(since the previous cell, when warned by its stop_out, blocks its val_out, which
becomes the current RS’s val_in). But if the RS is connected to another element,
the shell for instance, the constraint stop_out ⇒ ¬val_out has to be checked
and guaranteed, or at least appropriate behavior must be checked. This can
easily be done by model-checking our formal description.
3.2 Correctness properties and formal veriﬁcation
Keeping with the kind of remarks of the previous discussion, one can phrase
a number of correctness properties to hold on a relay station, or a line of
relay stations (or later, a network comprising shells and pearls). Remember
that correctness criteria for liveness (seen as freeness from both deadlock and
congestion) were already established as PN graph markings conditions, linked
to data initialization in section 2. Instances of additional properties are:
• relay stations cannot overflow or underflow.
• data order is preserved.
• at any point in time, the number of valid data produced from a line is
bounded relative to the number entered (the line holds between 0 and
2 × length_line data at any time):
$\#(val\_in) + Init\_line - 2 \times length\_line \le \#(val\_out) \le \#(val\_in) + Init\_line$
where Init_line is the number of data initially residing in the line of RSs,
and length_line is the number of RSs.
• a line of n relay stations cannot notify congestion to its source unless it
receives enough similar back-pressure signals, given its initial content.
• conversely, a line receiving enough back-pressure hold-out signals and data
will eventually get filled and notify congestion.
Fig. 6. Overﬂow, underﬂow observer for RS
The ﬁrst property can be checked by the observer in ﬁgure 6.
The check will then consist in proving that such states are unreachable
in all RSs. The second property could be modeled in a restricted case by
“tagging” the successive data signals with indices, and then checking that
these indices are returned by the line in the same order as they were entered
at the other end. The simplest scheme is to alternate 0 and 1 tags, providing
an alternating-bit-protocol type of verification.
We checked these properties by model-checking, with (low-range) constants
replacing the integer parameters, and observers built from these formulas.
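As a lightweight complement to model-checking, the same properties can be exercised by random simulation, which is a sanity check and not a proof. The Python sketch below models each relay station as a queue of at most two tagged data, with an initially empty line (Init_line = 0); the function name `simulate_line`, the probabilities and the use of increasing integer tags instead of alternating bits are assumptions of this sketch.

```python
import random
from collections import deque


def simulate_line(n, cycles, seed=0, p_send=0.7, p_stop=0.4):
    """Simulate a line of n relay stations under random traffic.

    Returns (sent, received): the tags entered and the tags delivered, in order.
    """
    rng = random.Random(seed)
    bufs = [deque() for _ in range(n)]  # station k holds 0, 1 or 2 data
    sent, received = [], []
    next_tag = 0
    for _ in range(cycles):
        # stop_in of a station depends only on its own state (full).
        stop_in = [len(b) == 2 for b in bufs]
        sink_stop = rng.random() < p_stop
        # stop_out of station k is stop_in of station k+1 (or of the sink).
        stop_out = stop_in[1:] + [sink_stop]
        # A non-empty station forwards its oldest datum unless halted.
        val_out = [len(bufs[k]) > 0 and not stop_out[k] for k in range(n)]
        # The source respects the constraint stop_in => no val_in.
        src_val = (rng.random() < p_send) and not stop_in[0]

        # All stations move simultaneously: first take out, then put in.
        moving = [bufs[k].popleft() if val_out[k] else None for k in range(n)]
        if src_val:
            bufs[0].append(next_tag)
            sent.append(next_tag)
            next_tag += 1
        for k in range(1, n):
            if moving[k - 1] is not None:
                bufs[k].append(moving[k - 1])
        if moving[n - 1] is not None:
            received.append(moving[n - 1])

        # Observers: no overflow, and bounded difference between in and out.
        assert all(len(b) <= 2 for b in bufs)
        assert len(sent) - 2 * n <= len(received) <= len(sent)
    return sent, received


sent, received = simulate_line(n=3, cycles=500)
# With a permanently halted sink, the line should fill up to 2n data.
sent_blocked, received_blocked = simulate_line(n=3, cycles=50, p_send=1.0, p_stop=1.0)
```

Underflow cannot be observed in this encoding because a station only emits when its queue is non-empty; data order is checked by comparing `received` against the prefix of `sent`.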
Fig. 7. Shell - Circuit
4 Shell wrappers
4.1 Shell modeling
Here our model follows rather closely the one of Casu and Macchiarulo [25].
It is depicted in ﬁgure 7.
Shell equations
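• $stop\_in_i = \left(\bigvee stop\_out_{0 \ldots N}\right) \wedge flipflop_{out}$.
• $VAL\_IN_i = val\_in_i \vee flipflop_{out}$.
• $val\_out_{0 \ldots N_1} = clock = \neg\left(\bigvee stop\_out_{0 \ldots N_2}\right) \wedge \bigwedge VAL\_IN_{0 \ldots N_3}$.
• $flipflop_{in} = \left((val\_in_i \vee flipflop_{out}) \wedge stop\_out_{0 \ldots N_2}\right) \vee \left(val\_in_i \wedge flipflop_{out} \wedge \neg stop\_out_{0 \ldots N_2}\right)$.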
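The four equations can be transcribed literally into a step function. The Python sketch below rests on several assumptions: the function name `shell_step` and the list encoding are hypothetical, a bare range such as stop_out 0...N is read as the disjunction of all stop_out signals, VAL_IN 0...N is read as the conjunction of all VAL_IN signals, and each input channel i is given its own flip-flop.

```python
def shell_step(val_in, stop_out, ff):
    """One clock cycle of a shell, as a literal reading of the equations.

    val_in:   list of Booleans, one per input channel
    stop_out: list of Booleans, one per output channel
    ff:       list of Booleans, current flip-flop outputs, one per input channel
    Returns (stop_in, val_out, clock, ff_next).
    """
    stop = any(stop_out)                              # internal stop signal
    VAL_IN = [v or f for v, f in zip(val_in, ff)]     # arriving now or buffered
    clock = (not stop) and all(VAL_IN)                # pearl fires this cycle
    val_out = [clock] * len(stop_out)                 # same value on every output
    stop_in = [stop and f for f in ff]                # only for loaded registers
    ff_next = [((v or f) and stop) or (v and f and not stop)
               for v, f in zip(val_in, ff)]
    return stop_in, val_out, clock, ff_next


# All inputs present, no back-pressure: the pearl is clocked.
fire = shell_step([True, True], [False], [False, False])
# Back-pressure with one datum arriving: it is latched, the pearl is paused.
hold = shell_step([True, False], [True], [False, False])
# Next cycle: the latched datum plus a fresh one let the pearl fire.
resume = shell_step([False, True], [False], hold[3])
```

Since `clock` is false whenever `stop` is true, the constraint stop_out ⇒ ¬val_out holds by construction in this reading. Note that, read this literally, a flip-flop latches only under back-pressure; buffering a datum while another input is still missing, as described in the text, is not captured by this transcription and may require a different reading of the garbled indices.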
As mentioned in section 2, one can consider the case where shells and pearls
have potential zero-delay propagation (as long as there is no combinatorial
loop involving only shells, without crossing a relay station). The shells will
need the ability to store data that have already arrived, awaiting others still
missing.
The Shell works as follows:
• The internal pearl's clock and all val_out_i valid output signals are generated
once all val_in are available, while stop is false. The internal stop signal itself
is the disjunction of all incoming stop_out_j signals from the outgoing
channels.
• Meanwhile, the buffering register of a given input channel is used as long as
not all other input data are available.
• Thus, the internal pearl's clock is set to false whenever a backward stop_out_j
is true, or a forward val_in_i is false. In that case the registers that are already
busy hold their true value, while the others may receive a valid datum "just now".
• stop_in_i signals are raised towards all channels whose corresponding register
was already loaded (a datum was received before and has not yet been consumed),
to warn them not to propagate any value in this clock cycle. Of course such a
signal cannot be sent when the datum is being received in the current cycle, as
it would raise a causality paradox (and a combinatorial cycle).
• Flip-flop registers are reset when the pearl's clock is raised, allowing it to
compute for one step and to consume its input data. The signal stop_in_i is
raised only when flipflop_i is already holding a value and a stop_out is
raised.
We should remember the constraint demanded by the relay stations for
proper functioning, namely that on each output channel from the producer (in
this case the shell), one has stop_out_j ⇒ ¬val_out_j, which holds here.
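The constraint follows from the clock equation of section 4.1: every val_out_j equals the pearl clock, which is false as soon as any stop_out_j is true. As a sanity check, this can be confirmed by brute-force enumeration for small shells. The function names and the tuple encoding below are hypothetical, and the clock is computed under the assumption that the internal stop is the disjunction of all stop_out_j signals.

```python
from itertools import product


def clock(val_in, flipflop_out, stop_out):
    """Pearl clock: no backward stop, and every input valid or buffered."""
    stop = any(stop_out)
    return (not stop) and all(v or f for v, f in zip(val_in, flipflop_out))


def constraint_holds(n_inputs, n_outputs):
    """Check stop_out_j => not val_out_j over every input combination."""
    width = 2 * n_inputs + n_outputs
    for bits in product([False, True], repeat=width):
        val_in = bits[:n_inputs]
        flipflop_out = bits[n_inputs:2 * n_inputs]
        stop_out = bits[2 * n_inputs:]
        # Every valid output equals the clock in this model.
        val_out = [clock(val_in, flipflop_out, stop_out)] * n_outputs
        if any(s and v for s, v in zip(stop_out, val_out)):
            return False
    return True


print(constraint_holds(2, 2))
```

Such an enumeration only covers the combinational constraint for fixed small sizes; it is not a substitute for the model-checking described in section 4.2.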
4.2 Correctness properties and formal veriﬁcation
Keeping relay stations in mind, we want to show the following property:
• a datum cannot be accepted before the previous one is processed: data order is
preserved.
The property can be checked simply, because the shell is connected
synchronously to a relay station (or another shell), and thus the relay station
cannot send any data to the shell while the shell is holding a datum. As said
before, the shell can hold only one datum from each channel, so it cannot
overwrite or lose this datum until all the needed data are present for it to react.
The data order is preserved because, by hypothesis, the interconnection network
is only point-to-point and cannot lose data or alter data ordering, and because
the shell waits for all data before reacting; thus the partial order of the
desynchronized design is compatible with the synchronous one. We can also apply
the alternating bit protocol verification in this case. The Shell is deadlock-free
because we already established this as conditions on PN markings. The model has
been built with Esterel and formally verified with the model checker Xeve [9] and the
visual tool Autograph [26]. Xeve is a BDD-based model checker which can,
in addition to proving properties, provide a canonical minimal automaton for
the component (up to bisimulation). Autograph allows the user to display
this minimal automaton graphically (and to recognize its characteristics, when
the number of states remains manageable).
5 Further topics
So far the theory developed here only considers the case where local
synchronous components all consume and produce data on all input and output
channels in each computation step, and where they all run on the same clock.
In this favorable case functional determinacy and conﬂuence are guaranteed,
with latencies only impacting the relative ordering of behaviors. So it can
be proved that the relaxed-synchronous version produces the same output
streams from the same input streams as the fully synchronous speciﬁcation
(indeed the rank of a datum in a stream corresponds to its time in the
synchronous model, thereby reconstructing the structure of successive instants).
Several papers considered extensions in the context of GALS systems, but
then ignored the issue of functional correspondence with an initial well-clocked
speciﬁcation, which is our important correctness criterion.
This strong assumption can be weakened in a number of ways. Some are
related to the relative speeds and cadences of components, expressed in clock
cycle rates; some arise from the extension of Marked/Event graphs to more
general subclasses of Petri nets in the asynchronous setting; and the most
important ones link the two.
This concern is reﬂected elsewhere in the Ptolemy environment [19], where
the so-called SDF [18] domain corresponds to a slight extension of Event
Graphs, whereas potential conﬂict choices are introduced in more general BDF
and DDF domains [10] with the inherent problem that static bounded FIFO
scheduling becomes undecidable.
One can extend the framework by allowing diﬀerent cadences (so that
various processing blocks run at diﬀerent speeds, expressed as integer multiples
of the master clock). More generally, each component can be assigned its
own clock, with the assumption that all clocks are subclocks of a master
clock, but not necessarily periodic. One can then build multi-rate/multi-clock
systems. But, unless global rates are perfectly equalized around each loop,
this might require that components with different clocks be fed streams of data
of unequal lengths. Usually the link with a fully synchronous specification
is attempted by introducing a speciﬁc absent value for every interconnection
signal, so that subclocks are deﬁned as ticking only during the instants where
a given triggering signal is not absent.
In general PN theory a place can be supplied tokens (here abstracting the
data put in a FIFO channel) from various transitions (here processing ele-
ments). It thus merges the two ﬂows (as a mux). The place can also oﬀer
its tokens at the other end to various consumers, thus operating a fork (or
a demux) of the data flow carried through the channel. In other words,
tokens are shared. It then becomes difficult to imagine that the rank of a datum
in a channel stream will recall the instant it was exchanged in a fully synchronous
speciﬁcation. Still, one can design a “locally-synchronous” version of places
(we consider here the case of two producers and two consumers to this place):
it has a main running clock, and two subclocks (one for input and one for out-
put), so that data are taken from one input channel when the input subclock
is raised, from the other otherwise (and similarly for output).
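The locally-synchronous place described above can be sketched as follows, for two producers and two consumers. This is only one possible reading: the function name `run_place`, the internal FIFO, and the choice of delivering a token one main-clock tick after it was stored are assumptions of the sketch, not part of the model discussed here. Each subclock is given as a list of Booleans, one per tick of the main clock.

```python
from collections import deque


def run_place(prod_a, prod_b, in_subclock, out_subclock):
    """Simulate a shared place over the ticks of its main clock.

    in_subclock[t]  True: take a datum from producer A, else from producer B.
    out_subclock[t] True: deliver to consumer A, else to consumer B.
    Producers must supply enough data for the ticks that select them.
    """
    src_a, src_b = iter(prod_a), iter(prod_b)
    fifo = deque()
    cons_a, cons_b = [], []
    for in_tick, out_tick in zip(in_subclock, out_subclock):
        # Fork (demux): hand the oldest stored token to exactly one consumer.
        if fifo:
            token = fifo.popleft()
            (cons_a if out_tick else cons_b).append(token)
        # Merge (mux): accept a datum from exactly one producer.
        fifo.append(next(src_a) if in_tick else next(src_b))
    return cons_a, cons_b, list(fifo)


to_a, to_b, pending = run_place(
    ["a0", "a1"], ["b0", "b1"],
    in_subclock=[True, False, True, False],
    out_subclock=[True, True, False, False],
)
print(to_a, to_b, pending)
# prints: ['a0'] ['b0', 'a1'] ['b1']
```

Because each subclock selects exactly one side per tick, the two producers (and the two consumers) never access the place in the same instant, which is the mutual exclusion the text relies on.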
Of course the two kinds of extensions are linked, since channel sharing im-
poses that multiple productions or consumptions do not clash, so that it can
be established that they are mutually exclusive (by being driven on exclusive
subclocks). The key issue is to guarantee liveness and throughput in
the global system. This should be attained by devising the proper scheduling,
which should generate the clock pulses at proper rates (in latency and
cadence), so that data flow smoothly through the system. Several steps exist in
this direction, with the notion of multiclock systems and clock calculus in
synchronous languages [1,8]. The correctness criterion is that no component
should ever require the presence of a signal datum that is absent, and that
signal data are not inappropriately lost (sometimes it is acceptable to ignore and
discard them). Studies were also conducted as to when a seemingly monolithic
synchronous specification in fact exhibits asynchronous behaviors based on
independent clocks underneath [27].
Finally, the goal would be to deﬁne a general GALS modeling framework,
where GALS components could be put in GALS networks (to this day the
framework is not compositional in the sense that local components need to be
synchronous). A system would consist again of computation and interconnect
communication blocks, this time each with appropriate triggering clocks, and
of a scheduler providing the subclocks computation mechanism, based on their
outer main clock and several signals carrying information on control ﬂow.
An anonymous referee brought to our attention some recent work
[28,29], where relay stations are dispensed with and replaced by parallel lines
which carry the data alternately in a round-robin fashion. More work is
needed to compare our approach with the formal model provided there.
Finally, it was often suggested in previous papers that latency equalization
could be a solution direction; the intent is to add extra (non physical) laten-
cies to ensure that all proper input data are provided simultaneously to the
local computing block. The “non-physical” new latencies can then be shifted
up and down the network, under some semantic-preserving constraint, to op-
timize the global cycle allocation of computation activities. They can even
be used to allow resynthesis of components under less stringent constraints.
Nevertheless, preliminary investigations showed us that the property of
"equalizability" is far from obvious to characterize, and there are simple
examples of networks where latencies cannot be evenly leveled. More investigation
of this interesting criterion is in order.
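To make the intent of latency equalization concrete, the following sketch handles only the easy case of an acyclic network, where leveling is always possible. It computes, for each channel, the number of extra (non-physical) latencies needed so that all inputs of a block arrive in the same cycle. The function name `equalize`, the dictionary encoding of channels, and the assumption that blocks compute in zero cycles are all hypothetical. Networks with loops, where equalizability may fail, are outside its scope (the standard-library sorter raises `CycleError` on them).

```python
from graphlib import TopologicalSorter  # Python 3.9 or later


def equalize(latency):
    """Return extra latency per channel for an acyclic network.

    latency maps a channel (producer, consumer) to its physical latency.
    """
    # Collect the predecessors of every block.
    preds = {}
    for producer, consumer in latency:
        preds.setdefault(consumer, set()).add(producer)
        preds.setdefault(producer, set())

    # Earliest cycle at which every input of a block can be present.
    arrival = {}
    for node in TopologicalSorter(preds).static_order():
        arrival[node] = max(
            (arrival[p] + latency[(p, node)] for p in preds[node]),
            default=0,
        )

    # Pad each channel up to the arrival time of its consumer.
    return {(p, c): arrival[c] - arrival[p] - lat
            for (p, c), lat in latency.items()}


padding = equalize({("A", "B"): 1, ("A", "C"): 3, ("B", "C"): 1})
print(padding)
# prints: {('A', 'B'): 0, ('A', 'C'): 0, ('B', 'C'): 1}
```

Here block C receives data from A after 3 cycles directly and after 2 cycles through B, so one extra latency on the channel from B to C aligns both inputs.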
References
[1] P. Amagbedon, L. Besnard, and P. Le Guernic. Implementation of the Data-ﬂow synchronous
language Signal. In Proceedings PLDI’95, 1995.
[2] C. André. Structural transformations given B-Equivalent PT-Nets. In Application and Theory
of Petri Nets, 1981.
[3] C. André. Representation and Analysis of Reactive Behaviors: A Synchronous Approach. In
Computational Engineering in Systems Applications, pages 19–29, 1996.
[4] C. André. Semantics of S.S.M. Technical report, I3S, CNRS, Esterel Technologies, 2003.
[5] F. Baccelli, G. Cohen, G.J. Olsder, and J.-P. Quadrat. Synchronization and Linearity. Wiley,
1992.
[6] A. Benveniste, B. Caillaud, and P. Le Guernic. From synchrony to asynchrony. In Proceedings
CONCUR’99, volume 1664 of LNCS, 1999.
[7] G. Berry and G. Gonthier. The Esterel Synchronous Programming Language: Design,
Semantics, Implementation. Science of Computer Programming, 19(2):87–152, 1992.
[8] G. Berry and E. Sentovich. Multiclock Esterel. In Proceedings CHARME’01, volume 2144 of
LNCS, 2001.
[9] Amar Bouali. Xeve, an ESTEREL Veriﬁcation Environment. In CAV ’98: Proceedings of the
10th International Conference on Computer Aided Veriﬁcation, pages 500–504, London, UK,
1998. Springer-Verlag.
[10] J.T. Buck. Scheduling Dynamic Dataﬂow Graphs with Bounded Memory Using the Token Flow
Model. PhD thesis, University of California, Berkeley, 1993.
[11] Luca P. Carloni and Alberto Sangiovanni-Vincentelli. Combining Retiming and Recycling to
Optimize the Performance of Synchronous Circuit. In The Proceedings of the 16th Symposium
on Integrated Circuits and System Design, 2003.
[12] Luca P. Carloni and Alberto L. Sangiovanni-Vincentelli. Performance analysis and
optimization of latency insensitive systems. In Design Automation Conference, pages 361–
367, 2000.
[13] Mario R. Casu and Luca Macchiarulo. A New Approach to Latency Insensitive Design. In
DAC’2004, 2004.
[14] Ajanta Chakraborty and Mark R. Greenstreet. A Minimalist Source-Synchronous Interface.
In Proceedings of the 15th IEEE ASIC/SOC Conference, pages 443–447, September 2002.
[15] Tiberiu Chelcea and Steven M. Nowick. Robust Interfaces for Mixed-Timing Systems with
Application to Latency-Insensitive Protocols. In Design Automation Conference, pages 21–26,
2001.
[16] F. Commoner, Anatol W. Holt, Shimon Even, and Amir Pnueli. Marked Directed Graph.
Journal of Computer and System Sciences, 5:511–523, October 1971.
[17] J. Cortadella, A. Kondratyev, L. Lavagno, and C. P. Sotiriou. A Concurrent Model for De-
Synchronization. In 12th International Workshop on Logic and Synthesis, 2003.
[18] E.A. Lee and D.G. Messerschmitt. Synchronous dataﬂow. In Proc. of IEEE, 1987.
[19] Edward A. Lee. Overview of ptolemy project. Technical report, University of California,
Berkeley, July 2003.
[20] C.E. Leiserson and J.B. Saxe. Retiming Synchronous Circuits. Algorithmica, 6, 1991.
[21] Luca P. Carloni, Kenneth L. McMillan, and Alberto L. Sangiovanni-Vincentelli. Theory of
Latency-Insensitive Design. IEEE Transactions on Computer-Aided Design of Integrated
Circuits and Systems, 2001.
[22] Luca P. Carloni, Kenneth L. McMillan, Alexander Saldanha, and Alberto L. Sangiovanni-
Vincentelli. A Methodology for Correct-by-Construction Latency Insensitive Design. In THE
BEST OF ICAD, 1999.
[23] Dumitru Potop-Butucaru, Benoît Caillaud, and Albert Benveniste. Concurrency in
synchronous systems. In Proceedings ACSD’04, 2004.
[24] François R. Boyer, El Mostapha Aboulhamid, Yvon Savaria, and Michel Boyer. Optimal Design
of Synchronous Circuits Using Software Pipelining. In Proceedings of the ICCD’98, 1998.
[25] Mario R. Casu and Luca Macchiarulo. A Detailed Implementation of Latency Insensitive
Protocols. In FMGALS 2003 Proceedings, 2003.
[26] Valérie Roy and Robert de Simone. Auto/autograph. In CAV '90: Proceedings of the 2nd
International Workshop on Computer Aided Veriﬁcation, pages 65–75, London, UK, 1991.
Springer-Verlag.
[27] Montek Singh and Michael Theobald. Generalized Latency-Insensitive Systems for Single-
Clock and Multi-Clock Architectures. In DATE’04, 2004.
[28] Syed Suhaib, David Berner, Deepak Mathaikutty, Jean-Pierre Talpin, and Sandeep Shukla.
Presentation and formal veriﬁcation of a family of protocols for latency insensitive design.
Technical report, Virginia Tech, February 2005.
[29] Syed Suhaib, David Berner, Deepak Mathaikutty, Jean-Pierre Talpin, and Sandeep Shukla.
A functional programming framework for latency insensitive protocol validation. Technical
report, Virginia Tech, March 2005.
