Synchronous distribution of SIGNAL programs by Aubry, Pascal et al.
Synchronous distribution of SIGNAL programs
Pascal Aubry, Paul Le Guernic, Sylvain Machard
To cite this version:
Pascal Aubry, Paul Le Guernic, Sylvain Machard. Synchronous distribution of SIG-
NAL programs. 29th Hawaii International Conference on System Sciences (HICSS-29),
Jan 1996, Maui, Hawaii, United States. IEEE Computer Society, pp.656-665, 1996,
<10.1109/HICSS.1996.495517>. <hal-00544057>
HAL Id: hal-00544057
https://hal.archives-ouvertes.fr/hal-00544057
Submitted on 7 Dec 2010
In Proceedings of the 29th Hawaii International Conference on System Sciences (HICSS-29), Hawaii, January 1996, to appear; available at http://www.irisa.fr/prive/aubry/hicss/.
Synchronous distribution of Signal programs
P. Aubry P. Le Guernic S. Machard
{Pascal.Aubry,Paul.LeGuernic,Sylvain.Machard}@irisa.fr
IRISA/INRIA, Campus de Beaulieu, F-35042 Rennes Cedex
Abstract
Signal, a synchronous and data-flow oriented language, allows the user to design safe real-time applications. Its compiler uses a single formalism called "Synchronized Data-Flow Graphs" (footnote 1) all along the design chain, from specification to proof and verification. We show how this formalism can be kept until distributed code generation. The implementation described here, called synchronous distribution, respects the semantics of Signal. We finally show the limits of SDFGs and conclude on the need for another model describing the dynamic behaviours of distributed executions.
1 Introduction
Based on the hypothesis of a discrete logical time, the synchronous languages [1] (Signal, Lustre [2], Esterel [3]) have proved their efficiency for the design of critical and safe real-time applications. They are characterized by strong semantics that allow the programmer to use verification and proof techniques. Signal [4], one of the synchronous languages, is declarative and data-flow oriented. In this language, a signal is a sequence of values; the set of instants at which a signal is present is called its clock. The Signal compiler [5] is a formal system able to solve equations and to reason upon logic. All along the compilation process, it manipulates only equations and dependencies, finally structured in an internal representation called Synchronized Data-Flow Graph.
SDFGs can be seen as a generalization of Directed Acyclic Graphs (also called DAGs). Indeed, the signals (vertices of SDFGs) are present only at some instants (their clock) and the dependencies (edges of SDFGs) are clock-labelled (i.e. take effect only at the instants of a given clock). Each object of an SDFG is located in a hierarchy of clocks, which results from what is called the clock calculus. These graphs are used in the final phases of the compilation to produce textual outputs, sequential code [6] and hardware synthesis [7].

For different reasons (delocalization of sensors, fault tolerance, increased frequency, ...), many real-time applications require code distribution. As translations between different representations are error-prone, the safety requirements of critical applications are ensured in particular by the preservation of a single formalism all along the design process. From specification to final implementation, through simulation, proof and verification, the Signal compiler uses a single formalism: Signal equations and dependencies.

Footnote 1: Hereafter called SDFGs.
We intend to prove that this formalism can be kept even until the final phase of distributed code generation, once equations have been assigned to processors, and that this approach allows the preservation of semantic properties in a distributed system. The implementation uses mixed static/dynamic scheduling.

First, we give a short overview of Signal (section 2) to present the SDFGs. Then we show how communications can be introduced in SDFGs (section 3) and how the control part of Signal programs can be distributed (section 4) in order to extract sub-graphs corresponding to different processors (section 5). Finally, we show possible implementations (section 6) and give some perspectives of this work.
2 Signal
2.1 The Signal language
As Signal is a dataflow-oriented language, it describes processes which communicate through sequences of (typed) values with an implicit timing: signals. For instance, a signal X denotes the sequence \((x_t)_{t \in \mathbb{N} \setminus \{0\}}\) of data indexed by time.

Kernel of Signal. The kernel of the Signal language includes the operators on signals and the process operators. Four kinds of operators act on signals:

• Instantaneous functions are a class of operators which encompasses all the usual functions (and, <=, +, fft, ...) extended to act on signals. Let f be a symbol which denotes an n-ary function acting on signals and [[f]] the corresponding function acting on values; the Signal process Y := f{X1, ..., XN} specifies that \(\forall t \geq 1,\ y_t = [\![f]\!](x1_t, \ldots, xn_t)\). In the specified behavior, one may notice that the value \(y_t\) carried by Y at instant t is equal to the function [[f]] applied to the values held by X1, ..., XN at the same instant. This is the result of a particular specification approach: the (strong) synchronous approach (see [8] for an overview). In the dataflow synchronous approach [1], the execution of the operators is assumed to have zero duration (in fact, this theoretical point of view means that durations are not taken into account); only the logical precedence of values on a signal represents the passing of time. Therefore, firing waits and implicit queueing of data are suppressed at the specification level.

• The shift register makes explicit the memorization of data; it enables the reference to a previous value of a signal. For instance, the process Y := X $ 1 defines a basic process such that \(y_1 = v0\) and \(\forall t > 1,\ y_t = x_{t-1}\), where v0 denotes an initial and constant value associated with the declaration of Y. In contrast to the next two operators, the signals referred to in instantaneous functions or in the shift register must be bound to the same time index, i.e. the same clock.

• The selection operator allows us to extract some data of a signal through a boolean condition. The process Y := X when B specifies that Y carries the same value as X each time X carries a value and B carries the value true (B must be a boolean signal). Otherwise, Y is absent, i.e. Y carries no value.

• The merge operator combines flows of data. The process Y := X1 default X2 defines Y by merging the values carried by X1 and X2, giving priority to X1's data when both signals are simultaneously present.
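To make the four kernel operators concrete, the following Python sketch simulates them on finite traces. It is only an illustration under our own assumptions, not part of the Signal toolchain: a trace is a list with one entry per instant, the marker ABSENT and the function names fmap, delay, when and default are hypothetical.

```python
ABSENT = None  # hypothetical marker: the signal carries no value at this instant

def fmap(f, *xs):
    """Instantaneous function Y := f{X1,...,XN}: operands must share one clock."""
    out = []
    for vals in zip(*xs):
        present = [v is not ABSENT for v in vals]
        if any(present) and not all(present):
            raise ValueError("operands are not synchronous")
        out.append(f(*vals) if all(present) else ABSENT)
    return out

def delay(x, init):
    """Shift register Y := X $ 1: previous present value of X, on the clock of X."""
    out, prev = [], init
    for v in x:
        if v is ABSENT:
            out.append(ABSENT)
        else:
            out.append(prev)
            prev = v
    return out

def when(x, b):
    """Selection Y := X when B: X's value when X is present and B is true."""
    return [xv if (xv is not ABSENT and bv is True) else ABSENT
            for xv, bv in zip(x, b)]

def default(x1, x2):
    """Merge Y := X1 default X2: X1 has priority when both are present."""
    return [a if a is not ABSENT else b for a, b in zip(x1, x2)]
```

For instance, delay([1, ABSENT, 3, 4], 0) stays absent at the second instant, which reflects the fact that the shift register keeps the clock of its operand.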
The four previous operators specify basic processes. The specification of complex processes is achieved with the parallel composition operator: the composition of two processes P1 and P2 is denoted (| P1 | P2 |). In the composed process, the names common to P1 and P2 refer to common signals; they stand for the communication links between P1 and P2. This parallel composition is an associative, commutative and idempotent operator ((| P | P |) is equivalent to P).
The last feature of the kernel is the possibility to reduce the scope of a signal: in the process P / X, the signal X is local to the process P.

Extended operators. Built on the previous primitive operators, some built-in features often used by programmers have been added:

• Clock extraction makes explicit the clocks of signals. C := event X means that C is the clock of X and is equivalent to C := (X = X).

• Synchronization between signals induces new constraints in programs. synchro{X,Y} means that the two signals X and Y have the same clock (i.e. must be present at the same instants). synchro{X,Y} is a process equivalent to
(| C := (event X) = (event Y) |) / C.

• The memory cell allows the programmer to keep the previous values of a signal at the true occurrences of a boolean. Y := X cell B is present when X is present or B is present and true. In the first case, it is equal to X; otherwise, it is equal to the last occurrence of X. Y := X cell B is equivalent to
(| synchro{Y, (event X) default (when B)}
| ZY := Y $ 1
| Y := X default ZY |) / ZY
The specification of Signal programs is architecture-independent. This independence comes from the synchronous specification approach and the dataflow/equational style of the Signal language. Therefore, the inference of reliable and efficient implementations is achieved in two steps. First, the specification is validated independently of any target architecture: the next subsection describes the compilation process and the final representation of the compiled programs, Synchronous Data-Flow Graphs. Then we show how transformations of these graphs can lead to distributed implementations.
2.2 Synchronous Data-Flow Graphs
As Signal is equational, the Signal compiler is not a simple translator from high-level specifications to executable code. It is a formal system able to reason upon logic and clocks.

• The first step of the compilation is the reduction of the input source into the kernel language.

• Clock-nodes are then created, gathering all the synchronous signals of the program. A clock c can be:
  - the clock of a signal X; it is hereafter noted \(\hat{X}\) and is present when X, which must be an input signal of the program, is present.
  - the positive (respectively negative) sampling of a boolean condition b; it is hereafter noted \([b]\) (resp. \([\neg b]\)) and is present when b is present and true (resp. false).
  - the upper bound of two other clocks \(c_1\) and \(c_2\); it is hereafter noted \(c_1 \vee c_2\) and is present when at least one of \(c_1\) and \(c_2\) is present.
  - the lower bound of two other clocks \(c_1\) and \(c_2\); it is hereafter noted \(c_1 \wedge c_2\) and is present when \(c_1\) and \(c_2\) are both present.
  - the complement of a clock \(c_1\) in a clock \(c_2\); it is hereafter noted \(c_2 \setminus c_1\) and is present when \(c_2\) is present and \(c_1\) is absent.

• The main phase of the compilation is called the clock calculus. By analysing all the clocks of the program, it builds clock trees by placing clocks under their father (a clock in a tree cannot be present if its father is absent). For instance, the clocks \([b]\) and \([\neg b]\) are placed under the clock \(\hat{b}\), as siblings:

[Figure: the clocks \(\hat{b}\), \([b]\) and \([\neg b]\), initially unrelated, are rearranged into a tree with root \(\hat{b}\) and children \([b]\) and \([\neg b]\).]

The result of the clock calculus is a forest of trees whose roots are all independent of one another. If there is a single root, the forest is reduced to a single tree; this means that the compiler has found a main clock, faster than any other one in the program. Such programs are called endochronous. During this analysis, circuits in the definitions of clocks are detected, and clock constraints are also established.

• Each clock c of the hierarchy owns instructions (Signal equations) describing the computations of the signals present at the instants of c. An instruction can be:
  - a definition of a signal. It defines a signal X by an equation like X := exp.
  - an external call. It specifies that an external function P has to be called when c is present.
  - a delay, which is a special definition of a signal. It defines a signal ZX by an equation like
    ZX := X $ n window m
    where X is a signal, and n and m are integer values. ZX is an m-array containing at any instant t the values \(X_{t-(m+n-1)}, \ldots, X_{t-n}\).
These instructions induce dependencies between signals. As signals are not always present (as they are in DAGs), dependencies are clock-labelled: a dependency between two signals X and Y is effective only at the instants of its dependency-clock c:
\[ X \xrightarrow{c} Y. \]
The clock c is always included in \(\hat{X}\) and \(\hat{Y}\), which means that the dependency can be effective only if X and Y are both present. Two additional rules apply to dependencies, characterizing serialization:
\[ X \xrightarrow{c_1} Y \xrightarrow{c_2} Z \implies X \xrightarrow{c_1 \wedge c_2} Z, \]
and parallelism:
\[ \left. \begin{array}{l} X \xrightarrow{c_1} Y \\ X \xrightarrow{c_2} Y \end{array} \right\} \implies X \xrightarrow{c_1 \vee c_2} Y. \]
A transitive closure on the dependencies of the graph reveals circuits like:
\[ X_1 \xrightarrow{c_1} X_2 \xrightarrow{c_2} \cdots \xrightarrow{c_{n-1}} X_n \xrightarrow{c_n} X_1. \]
Thanks to the rule of serialization, the clock of this circuit is \(c = \bigwedge_i c_i\); if \(c = \emptyset\) (the never-present clock), the circuit is never effective and the scheduling of the computation depends on the clocks \(c_i\). Such a circuit is nevertheless rejected at compile-time, because of the cost of the analysis needed to tell sometimes-effective circuits from never-effective ones (footnote 5).
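The serialization and parallelism rules can be illustrated with a small Python sketch. It rests on a simplifying assumption of ours: a clock is represented as the explicit set of instants at which it is present over a finite horizon, whereas the compiler reasons on clocks symbolically. The names serialize, parallel and circuit_clock, and the sample condition b, are hypothetical.

```python
from functools import reduce

def serialize(c1, c2):
    """X -c1-> Y -c2-> Z induces X -> Z at the lower bound of c1 and c2."""
    return c1 & c2

def parallel(c1, c2):
    """Two edges X -c1-> Y and X -c2-> Y merge into one edge at the upper bound."""
    return c1 | c2

def circuit_clock(labels):
    """Clock of a circuit: the lower bound of all its edge labels."""
    return reduce(serialize, labels)

# Hypothetical boolean condition b observed over five instants.
b = [True, False, True, True, False]
b_true = frozenset(t for t, v in enumerate(b) if v)        # clock [b]
b_false = frozenset(t for t, v in enumerate(b) if not v)   # clock [not b]
b_hat = b_true | b_false                                   # clock of b itself

# A circuit going through one edge labelled [b] and one labelled [not b]
# has an empty clock: it is never effective.
never = circuit_clock([b_true, b_false])
```

In this representation, a never-effective circuit is simply one whose clock is the empty set; deciding this symbolically, for every run of the program, is the costly analysis mentioned above.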
This short description of SDFGs shows their two main aspects: a clock hierarchy, which can be used to reduce control computations, and a data-flow graph in which dependencies (edges) between signals (vertices) are clock-labelled. In the rest of this paper, we consider the dual graph: nodes compute signals and clocks.
2.3 An example
Let us consider the following example, specifying in
Signal a counter V synchronous with a boolean RST
and reset to zero when RST is true:
process P =
{ ? boolean RST ! integer V }
(| synchro{ RST, V }
| V := (0 when RST) default (ZV + 1)
| ZV := V $ 1
|) where integer ZV init -1
end
With such an input, the compiler finds three different clocks in the program: \(\widehat{rst}\), \([rst]\) and \([\neg rst]\). The signals V, ZV and RST are synchronous and their clock is \(\widehat{rst}\). The clocks \([rst]\) and \([\neg rst]\) are sub-clocks of \(\widehat{rst}\). The final SDFG of this program is:

[Figure: SDFG of the process P, with the signal nodes RST, V and ZV, the constant 0, the clocks \(\widehat{rst}\), \([rst]\) and \([\neg rst]\), and dependencies labelled by these clocks.]

On this SDFG, signal names are bold-faced and clocks are emphasized. Solid and dotted arrows represent clock and data dependencies respectively; the clocks labelling the dependencies are located next to the arrows.

Footnote 5: Those circuits are also rejected for historical reasons: at the moment, the compiler generates only monoprocessor sequential executable code, statically scheduled at compile-time.
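The behaviour specified by the process P can be checked with a few lines of Python. This is a hand-written simulation under the assumption that RST is present at every instant of the trace; the function name counter is ours and the code is not produced by the Signal compiler.

```python
def counter(rst):
    """Simulate process P on a finite sequence of RST values."""
    zv = -1                       # integer ZV init -1
    out = []
    for r in rst:
        v = 0 if r else zv + 1    # V := (0 when RST) default (ZV + 1)
        out.append(v)
        zv = v                    # ZV := V $ 1
    return out
```

With RST false at the first instant, V starts at 0 because ZV is initialized to -1; each true occurrence of RST brings V back to 0.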
3 Communications in SDFGs
We assume in this section that all the nodes (signals and clocks) of the graph have been located on a set of processors. This point of view fits some requirements of real-time applications well: though speed is an important criterion to evaluate reactive systems, distribution is also needed for specific reasons such as the delocalization of sensors. Readers interested in the distribution of Signal programs on quantitative criteria may refer to the Signal/Syndex interface [9]. The way the nodes of the SDFG are assigned to the processors is not shown here. One can think of directives set by the user at the source level (pragmas for instance) or, after the creation of the SDFG, of an interactive tool taking place in the design process just before the distribution itself.
3.1 Data communication between two nodes
Let us consider a signal x produced by a node p on a processor P and consumed by a node q on a processor Q at a clock clk included in \(\hat{x}\):

[Figure: node p on processor P, node q on processor Q, and an edge from p to q labelled x, clk.]

If P = Q, the graph is left unchanged. Otherwise, a communication is needed between P and Q. We introduce a communication node \(C_{x,P \to Q}\):

[Figure: an edge from p to \(C_{x,P \to Q}\) labelled \(x_P\), clk, and an edge from \(C_{x,P \to Q}\) to q labelled \(x_Q\), clk; the communication node sits between the processors P and Q.]

The signal x, produced by the node p and consumed by the node \(C_{x,P \to Q}\) at the clock clk, is renamed \(x_P\), and a new signal \(x_Q\), produced by the node \(C_{x,P \to Q}\) and consumed by the node q at the clock clk, is introduced. This simple operation can be seen in Signal as a flow renaming, as the synchronous hypothesis says that computation durations are null (or at least ignored from a practical point of view). Thus, communication nodes do not change the semantics of a program.
The communication node introduced above can be split in two, a write node \(W_{x_P \to Q}\) on P and a read node \(R_{P \to x_Q}\) on Q:

[Figure: on P, an edge from p to \(W_{x_P \to Q}\) labelled \(x_P\), clk; on Q, an edge from \(R_{P \to x_Q}\) to q labelled \(x_Q\), clk; a dependency labelled clk links the write node to the read node.]

The reader should note that a dependency (at the clock clk) is left between the two nodes \(W_{x_P \to Q}\) and \(R_{P \to x_Q}\), ensuring that the dependency between the nodes p and q is left unchanged. We can then affirm that the new graph obtained by the introduction of read/write nodes is deadlock-free, i.e. it introduces no circuit. It is obvious that the memory is kept bounded by all the transformations explained above. Finally, from the Signal point of view, read/write nodes are seen as external functions; as the synchronous hypothesis states that such processes have a null computation duration, the response time is theoretically also kept bounded. In practice, we have to ensure that computation durations are bounded, to get a globally bounded response time for the program. As quantitative aspects of distribution are not studied here, we assume that communication durations are all bounded. We have shown that the introduction of communication nodes (footnote 6) in SDFGs does not change the semantics of the initial program.
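The transformation of one inter-processor edge into a write node and a read node can be sketched in Python as follows. The graph encoding (edges as tuples (producer, consumer, signal, clock) and a placement dictionary) and the names insert_comm_nodes and is_acyclic are assumptions made for this illustration; the sketch handles each edge separately and does not merge clocks per destination processor.

```python
def insert_comm_nodes(edges, place):
    """Replace every inter-processor edge p -> q by p -> W -> R -> q."""
    new_edges, new_place = [], dict(place)
    for (p, q, x, clk) in edges:
        P, Q = place[p], place[q]
        if P == Q:                        # same processor: edge left unchanged
            new_edges.append((p, q, x, clk))
            continue
        w, r = f"W[{x}:{P}->{Q}]", f"R[{x}:{P}->{Q}]"
        new_place[w], new_place[r] = P, Q
        new_edges += [(p, w, f"{x}_{P}", clk),   # x renamed x_P on the producer side
                      (w, r, None, clk),          # pure precedence, no data
                      (r, q, f"{x}_{Q}", clk)]    # x_Q on the consumer side
    return new_edges, new_place

def is_acyclic(edges):
    """Kahn-style check that the dependency graph contains no circuit."""
    nodes = {n for e in edges for n in e[:2]}
    indeg = {n: 0 for n in nodes}
    for _, dst, _, _ in edges:
        indeg[dst] += 1
    ready = [n for n in nodes if indeg[n] == 0]
    seen = 0
    while ready:
        n = ready.pop()
        seen += 1
        for src, dst, _, _ in edges:
            if src == n:
                indeg[dst] -= 1
                if indeg[dst] == 0:
                    ready.append(dst)
    return seen == len(nodes)
```

Because the write node keeps a precedence edge towards the read node, a path from p to q still exists in the new graph, and an acyclic input stays acyclic.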
3.2 Data communications in a complete graph
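In this sub-section, we show how the principle described above can be extended to a complete graph, such as this one:

[Figure: a node p on processor P produces x, which is consumed on each processor \(Q_i\) (\(i \in \{1..k\}\)) by the nodes \(q_i^1, \ldots, q_i^{n_i}\); the edge from p to \(q_i^j\) is labelled x, \(c_i^j\).]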
For this, let us consider the more general case of a node p of a processor P producing a signal x consumed by different nodes located on a set of processors \(\{Q_i\}_{i \in \{1..k\}}\). The difficulty in transforming such a sub-graph comes from the presence of many clocks. The choice made to preserve the dependencies at the right clocks is to introduce one communication for each processor \(Q_i\):

[Figure: p sends \(x_P\) to each communication node \(C_{x,P \to Q_i}\) at the clock \(c_i\); each \(C_{x,P \to Q_i}\) delivers \(x_i\) to the nodes \(q_i^1, \ldots, q_i^{n_i}\) of \(Q_i\) at the clocks \(c_i^1, \ldots, c_i^{n_i}\).]
The clock dependence between the node p and any node \(C_{x,P \to Q_i}\) is set to
\[ c_i = \bigvee_{j=1}^{n_i} c_i^j. \]
The communication nodes can still be seen as simple renamings (of \(x_P\) into \(x_{Q_i}\)). The introduction of read/write nodes is then straightforward thanks to the dependence between two corresponding read/write nodes:

[Figure: on P, p feeds each write node \(W_{x_P \to Q_i}\) with \(x_P\) at the clock \(c_i\); each write node precedes, at the clock \(c_i\), the read node \(R_{P \to x_i}\) on \(Q_i\), which delivers \(x_i\) to the nodes \(q_i^j\) at the clocks \(c_i^j\).]

Footnote 6: Hereafter called comm-nodes.
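The clock at which x is sent to each destination processor is the upper bound of the clocks at which x is consumed there. A minimal Python sketch of this computation follows, assuming (as a simplification of ours) that clocks are explicit sets of instants; the function name comm_clocks and the data layout are hypothetical.

```python
from collections import defaultdict

def comm_clocks(consumers, place):
    """Return, for each destination processor, the upper bound of the
    clocks at which its nodes consume the signal.

    consumers: list of (node, clock) pairs, clock being a frozenset of instants
    place:     dict mapping each consumer node to its processor
    """
    clocks = defaultdict(frozenset)
    for node, clk in consumers:
        clocks[place[node]] |= clk    # union = upper bound in this representation
    return dict(clocks)
```

One communication node per destination processor is enough: every consumer on that processor reads the received copy of x at its own clock, which is included in the communication clock.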
Of course, such transformations can be performed for every edge of a Signal graph. The result is a new graph, in which the dependencies between two nodes located on different processors are no longer data dependencies but (simple) clock-labelled dependencies (footnote 7). Each sub-graph is located on a single processor. The independent sub-graphs can now be extracted from the complete graph to get independent programs:

[Figure: sub-graphs located on the processors \(P_i\), \(P_j\) and \(P_k\).]

Presented this way, the problem of distribution seems very simple. In fact, we have intentionally hidden two main issues: the distribution of the control, and the consequences on their semantics of the extraction of sub-graphs into independent programs (footnote 8). They are discussed in the next sections.
3.3 Discussion
To be efficient, the algorithm must not unmark any already-marked node. This could happen when introducing a new clock-node depending on a signal x whose production node p has already been treated (marked). As the communications between p and the nodes consuming x may be affected by the new consumption of x, communication nodes would then have to be changed. To prevent this, only nodes without unmarked successors are treated. The proof below shows that this is always possible.
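The rule "only nodes without unmarked successors are treated" amounts to visiting the graph in reverse topological order. The Python sketch below shows only this ordering constraint; it is not the distribution algorithm itself, it does not add clock-nodes during the traversal, and the name treat_in_order and the successor-dictionary encoding are our assumptions.

```python
def treat_in_order(succ):
    """Return an order in which every node is treated after all its successors.

    succ: dict mapping every node of the graph to the list of its successors.
    """
    marked, order = set(), []
    while len(marked) < len(succ):
        ready = [n for n in succ
                 if n not in marked and all(s in marked for s in succ[n])]
        if not ready:
            # every unmarked node has an unmarked successor: there is a circuit
            raise RuntimeError("no treatable node left")
        n = sorted(ready)[0]      # deterministic choice, only for the example
        order.append(n)
        marked.add(n)
    return order
```

The error branch corresponds exactly to the deadlock situation discussed in the proof: it can only be reached when the graph contains a circuit.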
As a first approximation, let us say that the communication clock chosen for a signal is
\[ c_Q = \bigvee_{i \in \{1..k\}} c_i, \]
which corresponds to the lower bound of the possible clocks. Other possible clocks are \(\hat{x}\) (the production clock of x) and all the clocks between \(c_Q\) and \(\hat{x}\) in the clock hierarchy. In these cases, no supplementary node is introduced (because the communication clock is always already in the graph), but communications are then fired more often than necessary.
3.4 Proof
To ensure the correctness of this algorithm, we have to prove that it never locks, that it always ends, that the initial dependencies between the initial nodes of the graph are left unchanged and that the new graph is an SDFG according to subsection 2.2.

Footnote 7: The only meaning is a temporal precedence between the two nodes, which has to be respected by any scheduler.
Footnote 8: To generate executable code.
Deadlocks may happen if the choice (third line of the algorithm) of a node p is impossible. If no node were added to the graph, the absence of deadlock would be obvious. As some new clock-nodes may be introduced (as unmarked), we have to prove that this cannot lead to the following deadlock situation: all the unmarked nodes have unmarked successors. This is strictly equivalent to saying that there is a cycle in the graph. We then have to prove that the new nodes do not induce cycles. As this is the crucial point of the proof, it is detailed in 3.5.
The only way for the algorithm to loop infinitely is to generate more nodes than it consumes: the new nodes are introduced unmarked, which means that the number of unmarked nodes may not decrease at each step. In fact, as the initial number of clocks of the graph is fixed, and as the algorithm only introduces upper bounds of already existing clocks, the number of new clocks is also bounded (footnote 9), which means that the algorithm always ends.
The last point to verify is the correctness of the new graph with regard to the properties of the first one. All the objects introduced (the clocks \(c_Q\) and the signals \(x_Q\)) are placed exactly as in the clock calculus of the Signal compilation. Let us note that the signals x do not need to be moved after the introduction of communication nodes. Moreover, thanks to the dependencies between two corresponding read/write nodes, the dependencies between the initial nodes are left unchanged. Finally, this algorithm is correct if the communication clock does not introduce circuits in the graph.
3.5 Communication clocks
Let us consider this Signal program:
(| { I, B1, J } := f{ E }
| I1 := I when B1
| I2 := I when B2
| X := I1 default J
| B2 := g{ X } |)
The compilation process introduces the clocks \(c_1\) and \(c_2\), denoting respectively the instants at which the booleans B1 and B2 are true. To keep the associated SDFG as clear as possible, we do not show the main clock of the program, synchronous with E, I, B1, J, X and B2:

[Figure: SDFG of the program, with the nodes E, I, B1, J, I1, I2, X and B2 and the clocks \(c_1\) and \(c_2\).]
If we assume that the programmer wants to produce I on a processor P and I1 and I2 on a processor Q, a comm-node \(C_I\) from P to Q is needed. The upper bound of the consumption clocks is \(c_1 \vee c_2\) and is noted \(c_{12}\). The introduction of \(c_{12}\) obviously entails a circuit in the graph:

[Figure: the same SDFG with the comm-node \(C_I\) inserted after I and fired at the clock \(c_{12}\).]

Footnote 9: Because the upper bound is associative and commutative, and the clocks are stored in the graph in a single way.
After analysis, the circuit is never effective because its dependency-clock is null. This problem, first encountered in [10], is left unresolved. Indeed, we neither found examples where communication clocks induce effective circuits nor managed to prove that those circuits are always never-effective.
As the correctness of the previous algorithm depends on the choice of the communication clock, we choose \(c_Q\) in such a way that no circuit is introduced. If \(\bigvee_{i \in \{1..k\}} c_i\) is not present in the graph, it is introduced, produced on the processor Q (footnote 10). If firing the comm-nodes at this clock induces circuits, we choose the slowest ancestor clock, up to \(\hat{X}\), that can fire the comm-nodes without introducing any new circuit. In the worst case, the clock chosen is \(\hat{X}\) and X is communicated as often as it is produced.
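The choice of the communication clock can be pictured as a climb in the clock hierarchy. The following Python sketch assumes that the hierarchy is given as a father dictionary and that a predicate creates_circuit, which we do not implement here, tells whether firing the comm-nodes at a given clock closes a circuit; both names, and choose_comm_clock itself, are hypothetical.

```python
def choose_comm_clock(wanted, parent, x_hat, creates_circuit):
    """Climb from the wanted clock towards x_hat until no circuit appears.

    wanted:          upper bound of the consumption clocks (a descendant of x_hat)
    parent:          dict mapping each clock to its father in the hierarchy
    x_hat:           production clock of the communicated signal
    creates_circuit: hypothetical predicate, True if firing the comm-nodes
                     at the given clock introduces a circuit in the graph
    """
    clk = wanted
    while clk != x_hat and creates_circuit(clk):
        clk = parent[clk]
    return clk
```

The loop stops at the slowest acceptable clock, and falls back on the production clock of the signal in the worst case, which is always a valid choice since the signal is then communicated every time it is produced.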
4 Control distribution
Two corresponding read/write nodes are activated at the same clock. This clock should then be present on each processor. With our hypothesis that no signal is duplicated, the clock is produced on one processor and should then be communicated to the other one: data communications imply control communications.

Let us assume that clocks are implemented as booleans (footnote 11), with true/false values standing for present/absent states. They can then be treated exactly like the other signals. The only difference is that they cannot be communicated at their own clock but at a clock faster than it (its communication clock). This implies the implementation of this communication clock on the sending and the receiving processor: control communications imply other control communications. Of course, the phenomenon is bounded because the number of clocks of any program is fixed by the compiler while creating the SDFG, and thus the depth of any clock is also bounded.

A possible choice is to distribute all the clocks of the program on each processor. This trivial and systematic method is very expensive and may generate useless communications in most programs. To improve the distribution, a reduction of the clock hierarchy on each processor is needed.
4.1 Extraction of useless clocks
The first step of the simplification of the hierarchy is the elimination of useless clocks. To see which clocks can be immediately suppressed, we introduce a classification.

• A clock c is said to be necessary on a processor P if and only if at least one of these assertions is true:
  - an instruction has to be fired on P when c is present (a signal has c as its clock).
  - a signal computed on P depends on c.
  - another clock, different from any sub-clock, computed on P depends on c.

• To keep the richness of the initial hierarchy, we say that a clock c is useful if it is necessary or if at least two sub-hierarchies of c own necessary clocks. This way the evaluation of sub-clocks is not necessary when their ancestor is not present, which reduces the computation of the control part of a program (see the production of sequential mono-processor code in [6]). These useful clocks have to be implemented on P.

• Conversely, a clock c is said to be useless on a processor P if and only if:
  - no instruction has to be fired when c is present (no signal has c as its clock).
  - no signal on P depends on c.
  - no other useful clock produced on P depends on c.
  - all the sub-clocks of c are useless (on P).

• The other clocks of the program, which are neither useless nor useful, are clocks with no instruction and no successor (signal or clock) other than their useful sub-clocks. Those clocks are said to be intermediate.

Footnote 10: Because all the clocks \(c_i\) are already on this processor.
Footnote 11: This hypothesis is justified in [6].
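The classification above can be computed by one bottom-up traversal of the clock tree. The Python sketch below assumes that the set of necessary clocks on the processor is already known (it is given as input rather than derived from the instructions and dependencies) and that the hierarchy is a single tree; the name classify and the data layout are hypothetical.

```python
def classify(children, necessary, root):
    """Label every clock of the tree rooted at `root` for one processor.

    children:  dict mapping a clock to the list of its sub-clocks
    necessary: set of clocks that are necessary on the processor
    """
    labels = {}

    def visit(c):
        # number of sub-hierarchies of c owning at least one necessary clock
        rich = sum(visit(k) for k in children.get(c, []))
        if c in necessary or rich >= 2:
            labels[c] = "useful"
        elif rich == 0:
            labels[c] = "useless"
        else:
            labels[c] = "intermediate"
        # tell the father whether this sub-hierarchy owns a necessary clock
        return c in necessary or rich > 0

    visit(root)
    return labels
```

In this reading, a clock is useless exactly when its whole sub-hierarchy contains no necessary clock, and intermediate when it is not necessary itself and exactly one of its sub-hierarchies contains necessary clocks.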
As a useless clock c does not precede any signal or useful clock computed on P, there is no need to implement it on P. Useless clocks can be removed from the clock hierarchy without any problem. The resulting hierarchy is said to be purified. The following figure shows such a transformation on a clock hierarchy projected on a processor P:

[Figure: a clock hierarchy on P before and after purification; the useless clocks are removed, the useful and intermediate clocks are kept.]

In the hierarchies, useful and intermediate clocks on P are represented with the symbols u and i respectively, and useless clocks with a third symbol. As all the useful clocks have to be implemented on P, reducing the purified hierarchy means removing intermediate clocks.
4.2 Extraction of intermediate clocks
As said before, a possible choice is to communicate any clock c when its father in the clock hierarchy is present (when c is the main clock of the program, there is no need to communicate it when absent because no computation is needed). This leads to the duplication of almost all the clocks on each processor, keeping all the richness of the hierarchy but obviously generating useless communications: the communication of a clock may require the communication of all its ancestors in the clock hierarchy.

An extreme solution consists in moving all the useful clocks of the hierarchy up to the main clock of the program, which is then present on all the processors. But as the reduction is meant to improve the distributed program, thus theoretically leading to better implementations, it must preserve the richness of the hierarchy. What we propose here is an intermediate solution.
Let us consider a clock c useful on P. If it is not produced on P, as it must be implemented (because it is used on P), it must be read from the processor Q producing it. As clocks are read as booleans, the problem is to determine at which clock c will be read (let us note this clock r). This clock must obviously be an ancestor of c in the hierarchy (faster than c). If we want to remove from the graph a maximum number of intermediate clocks while preserving the hierarchy, the best choice for r is the nearest useful ancestor of c, noted n (this way, all the intermediate clocks between n and c become useless). As the boolean used to communicate the clock c must be written and read at the same clock, n must be present on Q, which is not always the case in practice.

If n is useful on Q, c can be communicated when n is present:

[Figure: hierarchies on P and on Q with the clocks n and c. On P: b_c := read_c() and c_P := when b_c. On Q: c_Q := ... and write_c{c_Q default not n}.]

On P, \(b_c\) is read and \(c_P\) is produced when n is present; on Q, \(c_Q\) is still produced when its father in the clock hierarchy is present and \(c_Q\) is written to P when n is present. The two intermediate clocks between c and n can be removed from the processor P. If all the clocks can be communicated this way, the gain is optimal and all the intermediate clocks are suppressed. With our example, this leads to the following hierarchy on P:

[Figure: the hierarchy on P reduced to its useful clocks.]

If n is not useful on Q:

• One possibility is to communicate c at a clock faster than n. In the worst case, this would lead to making all the communications at the fastest clock of the program. A part of the hierarchy would then be lost, so this solution is not retained.

• Another possibility is to force n to be implemented on Q. That would increase the computation frequency on Q if n is faster than the fastest useful clock of Q. This solution is rejected.

• As c is produced on Q, its immediate ancestor a is useful on Q. This shows that there is at least one clock between n and c that is useful on Q. We choose to communicate c at the fastest clock f between n and c that is useful on Q (in the worst case, this clock is a). This intermediate clock f becomes useful on P. In our previous example, if a is different from f, we see that a can be removed from P:

[Figure: hierarchies on P and on Q with the clocks n, f, a and c. On P: b_c := read_c() and c_P := when b_c. On Q: c_Q := ... and write_c{c_Q default not f}.]

On P, \(b_c\) is read and \(c_P\) is produced when f is present; on Q, \(c_Q\) is still produced when its father in the clock hierarchy is present and \(c_Q\) is written to P when f is present. All the (intermediate) clocks between c and f can be removed from the clock hierarchy.
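The choice of the clock at which c is communicated can be summarized by the following Python sketch. It is a simplified reading of the two cases above, under assumptions of ours: the hierarchy is given as a father dictionary, the sets of clocks useful on P and on Q are already known, and the main clock is useful on P so that the search for n terminates. The name read_clock is hypothetical.

```python
def read_clock(c, parent, useful_P, useful_Q):
    """Clock at which c, produced on Q, is communicated to P."""
    # n: nearest ancestor of c that is useful on P
    n = parent[c]
    while n not in useful_P:
        n = parent[n]
    if n in useful_Q:
        return n                       # first case: communicate when n is present
    # second case: fastest clock strictly between n and c that is useful on Q
    candidates, a = [], parent[c]
    while a != n:
        if a in useful_Q:
            candidates.append(a)       # collected from slowest to fastest
        a = parent[a]
    if not candidates:
        raise ValueError("no clock between n and c is useful on Q")
    return candidates[-1]
```

According to the text, the error branch should never be reached, since the immediate ancestor of c is useful on the processor producing c.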
Let us remark that in the worst case, this method sets new clocks as useful in the hierarchy only among the ancestors of c. As control communications must be introduced on the whole hierarchy, the traversal of the clock tree must be made from the leaves up to the root. Moreover, the clocks must be treated one by one across all the processors: if we applied this method processor by processor on the hierarchy, we might not benefit from the clocks introduced by the communications needed on the other processors.
4.3 Root clocks
Let q be the quickest useful clock on a processor P after purification of the hierarchy (thanks to the previous algorithm, q is the last useful clock encountered). If q is not the main clock of the program, it is produced by another processor and sent to P. When q is absent, no computations are needed on P and thus there is no need to communicate q to P: only the present (true) values of the clock are of interest. We can then communicate q only when it is present. Moreover, the communication clock no longer needs to be faster than q:
Finally, when q is the quickest clock of a process P, only the present values of q need to be sent to P and the communication clock can be q itself (footnote 12).
The roots on each processor are detected during the purification of the hierarchy and stored in an array called root. When introducing communication nodes, the algorithm checks whether the current clock is a root on the current processor and sets up the communications accordingly.
12. In an implementation, the reception of a false value can be the signal for the processor P to end its execution.
4.4 Gathering communications
As the number of communications is one of the quality criteria of a distribution, we want to gather communications where possible. First of all, because of the formalism of the SDFG, two comm-nodes can be grouped only if they are fired at the same clock. Secondly, grouping is only of interest if the source and destination processors are the same. But these conditions are not sufficient; indeed, grouping nodes this way may introduce circuits in the SDFG. Let us consider the following SDFG, where all the nodes are synchronous (fired at the same clock c):
[Figure: SDFG distributed over processors P and Q, with computation nodes 1, 2, 3, 4 and comm-nodes A, B, C.]
If we aggregate the comm-nodes A and C into a single comm-node AC, we introduce a circuit:
[Figure: the same SDFG with A and C merged into a single comm-node AC, showing the resulting circuit.]
Of course, circuits have to be rejected. Even when the aggregation of several comm-nodes into a single one does not introduce circuits, it is possible but not always desirable, because it can reduce concurrency by forcing a new synchronization (within one instant) in the program. Let us now consider this new example:
[Figure: SDFG on processors P and Q, with computation nodes 1, 2, 3, 4 and comm-nodes A and B.]
All the nodes are again assumed to be synchronous. If we gather the comm-nodes A and B (see below), we force the nodes 3 and 4 to become ready for execution at the same moment, whereas each could otherwise be executed independently of the other:
[Figure: the same SDFG with A and B merged into a single comm-node AB.]
In order to prevent a useless reduction of concurrency, we gather comm-nodes if and only if 1) their source and destination processors are the same, 2) they are fired at the same clock, 3) their successors on the destination processor are the same, and 4) their predecessors on the source processor are the same (footnote 13). Condition 2) tells us that the optimization can be performed during any traversal of the clock hierarchy, while condition 3) ensures that no new circuit is introduced in the graph.
13. In fact, this condition can be extended later to "their predecessors belong to the same cluster" when considering non-atomic nodes.
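The four gathering conditions amount to grouping comm-nodes that share the same key. The sketch below illustrates this under stated assumptions: the `CommNode` record and the `gather` function are hypothetical names, nodes are taken to be atomic, and the example nodes are invented for illustration (they are not the ones of the figures above).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CommNode:
    name: str
    src: str            # source processor
    dst: str            # destination processor
    clock: str          # clock at which the node is fired
    preds: frozenset    # predecessors on the source processor
    succs: frozenset    # successors on the destination processor


def gather(comm_nodes):
    """Group comm-nodes that satisfy conditions 1) to 4) together."""
    groups = defaultdict(list)
    for node in comm_nodes:
        key = (node.src, node.dst, node.clock, node.preds, node.succs)
        groups[key].append(node.name)
    return sorted(sorted(names) for names in groups.values())


# Hypothetical comm-nodes, all fired at the same clock c from P to Q.
nodes = [
    CommNode("A", "P", "Q", "c", frozenset({"1"}), frozenset({"3"})),
    CommNode("B", "P", "Q", "c", frozenset({"2"}), frozenset({"4"})),
    CommNode("C", "P", "Q", "c", frozenset({"1"}), frozenset({"3"})),
]
groups = gather(nodes)
```

Here A and C are gathered because they agree on all four criteria, whereas B stays alone so that node 4 does not have to wait for the data needed by node 3.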
5 Extraction of sub-graphs
We said before that the graph obtained after trans-
formation keeps all the properties of the initial graph.
By simply extracting the sub-graphs from the complete
graph, we lose the temporal dependencies between the
newly introduced read/write nodes, which could lead
to incorrect implementations. To convince the reader of
this, let us consider the following graph:
[Figure: SDFG distributed over processors 1, 2 and 3, whose communications go through the comm-nodes A to H.]
Computation nodes and comm-nodes are represented by circles and squares respectively, each comm-node standing for one write-node and its corresponding read-node.
5.1 Direct extraction
If we simply extract from this graph the nodes lo-
cated on the processor 3, we get the sub-graph
[Figure: sub-graph of processor 3, with the read-nodes rC, rD, rG, rH and the write-nodes wA, wE, wF.]
which can, in the absence of additional information, lead to incorrect implementations; here is, for instance, a possible reinforcement of the dependencies made by a static scheduler:
[Figure: the sub-graph of processor 3 with dependencies reinforced by a static scheduler.]
Obviously, although this reinforcement does not introduce any cycle in the sub-graph itself, it would lead to an easily predictable deadlock, because there exist dependencies between the pairs of read/write nodes. This matter is fully described in [6] and solved by the "abstract graph method". What we propose here is just another algorithm to get correct sub-graphs: firstly, we make explicit the dependencies between the read/write-nodes of the whole SDFG by a transitive closure applied on each processor; secondly, we show a fast algorithm (applied to these nodes only) to get the minimal set of dependencies to add to the read/write-nodes of each processor in order to extract independent sub-graphs.
5.2 Reinforcement of sub-graphs
To be sure that any correct reinforcement (footnote 14) of the dependencies by a scheduler leads to a correct execution, we must add some dependencies between the input/output nodes of the graph. Moreover, the dependencies should be minimal: too strong a reinforcement rules out some possible executions. One possibility is a transitive closure on the complete graph. Already implemented in the Signal compiler, this easy solution is quite expensive because the only dependencies we want to add are dependencies from write-nodes to read-nodes on the same processor.
14. Briefly, a reinforcement is said to be correct if it does not introduce circuits in the SDFG.
Firstly, we make explicit the dependencies between comm-nodes located on the same processor. Applied to the previous graph, we get the following dependencies:
[Figure: the previous graph with the dependencies between comm-nodes of the same processor made explicit.]
After this selective transitive closure on each processor,
a global transitive closure applied to the very limited
subset of communication nodes reveals the necessary
dependencies that should be added. This abstraction
of the subgraphs to their interface results in faster and
correct extractions. The algorithms leading to the ex-
tractions will be explained in a complete version of this
paper.
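Since the algorithms themselves are deferred to the complete version of the paper, the following is only a naive sketch of the two-step idea, written under our own assumptions: a closure restricted to each processor abstracts every sub-graph to its read/write-nodes, then a closure over these interface nodes alone reveals the write-to-read dependencies on a same processor. The names `reachable` and `added_dependencies` and the example graph are hypothetical, and the sketch returns every implied dependency without trying to compute a minimal set.

```python
def reachable(edges, start):
    """Nodes reachable from start by following at least one edge."""
    seen, stack = set(), [start]
    while stack:
        for nxt in edges.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def added_dependencies(edges, proc, writes, reads):
    """Write -> read dependencies implied on each processor by the other ones."""
    interface = writes | reads
    abstract = {}
    for x in interface:
        # Step 1: closure restricted to the processor of x.
        local = {u: [v for v in vs if proc[v] == proc[x]]
                 for u, vs in edges.items() if proc[u] == proc[x]}
        abstract[x] = reachable(local, x) & interface
        # Communication edges leave the processor (write-node to read-node).
        abstract[x] |= {v for v in edges.get(x, ()) if proc[v] != proc[x]}
    # Step 2: closure over the interface nodes only.
    return {(w, r)
            for w in writes
            for r in reachable(abstract, w)
            if r in reads and proc[r] == proc[w]}


# Hypothetical two-processor graph with communications C, A and B.
edges = {"n0": ["wC"], "wC": ["rC"], "rC": ["m"], "m": ["wA"],
         "wA": ["rA"], "rA": ["n1"], "n1": ["wB"], "wB": ["rB"]}
proc = {"n0": 1, "wC": 1, "rA": 1, "n1": 1, "wB": 1,
        "rC": 2, "m": 2, "wA": 2, "rB": 2}
deps = added_dependencies(edges, proc, {"wC", "wA", "wB"}, {"rC", "rA", "rB"})
```

In this example, processor 1 must write C before reading A, and processor 2 must write A before reading B; a scheduler that ignored either constraint could deadlock, as in the direct extraction of section 5.1.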
5.3 Example
Applied to our previous example, the new dependencies added by the algorithm are:
[Figure: the comm-nodes A to H with the new dependencies added by the algorithm.]
Projected on each processor, we see on this example that necessary and sufficient dependencies have been added:
[Figure: the sub-graphs extracted for processors 1, 2 and 3, including the added dependencies.]
6 Simulation
All the previous modifications of the SDFG lead to one sub-graph per processor. We have proved in the previous sections that the composition of these sub-graphs is equivalent to the initial graph, but a new issue, specific to distributed systems, appears for simulation. On a single processor, the executions of consecutive instants are exclusive because all the instructions of an instant must be terminated before the following instant starts. On a distributed system, without any additional information, the processor producing the main clock of the program (predecessor of any node) may be ready to execute an instant while the execution of the previous one has not ended on other processors. Obviously, this contradicts the semantics of Signal, which states that instants are successive, but it may be wanted by the user to get a data-flow-like simulation. In this section, we see some possible implementations leading to:
- synchronous executions, where the execution of an instant cannot start while the previous instant has not ended;
- asynchronous executions, where overlaps between instants are allowed but bounded, controlled by FIFO-queued communications and validated at compile-time.
After these different macro-behaviours, we show the implementations of the sub-graphs and finally the communications between processors.
6.1 Synchronous executions
We want here to obtain a synchronous execution preserving the semantics of Signal: the execution of an instant cannot start while the previous instant has not ended. This can be done if and only if the processor P producing the main clock c_m of the program is informed of the end of the execution of the current instant on all the other processors (footnote 15). As we want to stay within the Signal formalism, this will be done in three steps.
We denote by T_Q the set of computation and clock nodes of processor Q without any successor (footnote 16). For each processor Q, if T_Q is not empty, we introduce a virtual node e_Q succeeding all the elements of T_Q. This way, the execution of an instant is ended when e_Q is executed. As the execution of e_Q must be transmitted to P, if P ≠ Q, e_Q is a write-node sending a dummy value to the processor P.
We introduce on P the corresponding read-nodes e_Q^P, all of them preceding another new node called e_P.
The nodes added by this transformation are shown
on the following graph:
[Figure: processors P, Q and R with the main clock c_m, the t-nodes and the nodes added by the transformation.]
where the nodes belonging to T_i are represented by the symbol t. Each node e_i is set to the main clock of the process i. This is not the slowest possible choice: the upper bound of all the clocks of the t-nodes of the processor i would be the best one, but it may not be present in the SDFG on i and its introduction may introduce circuits, as seen in 3.5. It is easy to see that the three steps do not introduce circuits because the t-nodes initially had no successor. The reader may note that the node e_P is not necessary: the mere presence of the read-nodes e_i^P ensures that there will be no overlap between instants if c_m is not fired before the previous executions of these nodes have ended.
15. Because c_m precedes all the nodes of the SDFG.
16. The read-nodes always have successors on Q and the write-nodes on another processor.
Like the desynchronizations introduced by the oversampling in the communications of clocks and by the reduction of the main clock on each processor, the transformations above preserve the observational semantics of the initial program. The asynchronous executions of section 6.2 do not.
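The end-of-instant handshake of this section (the write-nodes e_Q, the read-nodes e_Q^P and the node e_P) can be illustrated with a small simulation. This is a sketch under our own assumptions, not generated Signal code: processors are modelled as Python threads, the function `run` and its event log are hypothetical, and a `None` value is used to stop the workers, in the spirit of footnote 12.

```python
import queue
import threading


def run(instants, workers=("Q", "R")):
    """Simulate P driving the workers with one handshake per instant."""
    log, lock = [], threading.Lock()
    ticks = {w: queue.Queue() for w in workers}   # main clock sent by P
    ends = {w: queue.Queue() for w in workers}    # dummy values sent back to P

    def record(event):
        with lock:
            log.append(event)

    def worker(name):
        while True:
            k = ticks[name].get()
            if k is None:                 # end of the execution
                return
            record(("end", name, k))      # all t-nodes of instant k are done
            ends[name].put(True)          # write-node e_name: dummy value to P

    threads = [threading.Thread(target=worker, args=(w,)) for w in workers]
    for t in threads:
        t.start()
    for k in range(instants):
        record(("start", "P", k))         # the main clock c_m is fired
        for w in workers:
            ticks[w].put(k)
        for w in workers:
            ends[w].get()                 # read-nodes e_w^P, preceding e_P
    for w in workers:
        ticks[w].put(None)
    for t in threads:
        t.join()
    return log


log = run(3)
```

Because P blocks on every dummy value before firing the main clock again, each "start" event of instant k+1 appears in the log after all the "end" events of instant k, whatever the interleaving of the threads.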
6.2 Asynchronous executions
Seen from the processor P producing the main clock, synchronous executions behave as if the other processors were acting on its authority: the only desynchronizations observed are within an instant. In other words, from a temporal point of view, as overlaps between instants are not allowed, the gain in time is minimal: each processor is kept idle from the moment it completes its execution until the beginning of the next instant, although it could already start executing that instant.
What we describe here is another desynchronization of the program: if we want to make preemption by idle parts of the program possible, it is easy to see that the Signal formalism is not sufficient because SDFGs only deal with static properties. Asynchronous executions require some deep transformations that cannot be performed without a fine knowledge of dynamic behaviours: another model, describing the execution of synchronous programs, is needed.
Indeed, let us assume that we do not introduce the previous nodes e_i and e_i^P. As some processors may execute their sub-graphs faster than other processors, without any acknowledgment, such asynchronous executions can lead to the accumulation of tokens on the communication media between the processors. A simple way to solve this problem is to bound the desynchronization between two processors by using FIFO-queued communications. This also requires a dynamic scheduler that refuses the emission of values by write-nodes if the corresponding FIFO is full. Obviously, these asynchronous implementations require some long preliminary studies that we cannot develop here.
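The effect of a bounded FIFO can be illustrated by a small deterministic simulation. It is only a sketch under our own assumptions: the function `simulate`, the step-based timing and the `consumer_period` parameter are hypothetical, and a real implementation would rely on a dynamic scheduler rather than on a global loop.

```python
from collections import deque


def simulate(steps, capacity, consumer_period):
    """A fast producer and a slow consumer linked by a bounded FIFO."""
    fifo = deque()
    produced = consumed = max_lead = 0
    for step in range(steps):
        # The write-node may emit only if the FIFO is not full.
        if len(fifo) < capacity:
            fifo.append(produced)
            produced += 1
        # The consumer reads one value every consumer_period steps.
        if step % consumer_period == consumer_period - 1 and fifo:
            fifo.popleft()
            consumed += 1
        max_lead = max(max_lead, produced - consumed)
    return produced, consumed, max_lead


result = simulate(steps=12, capacity=2, consumer_period=3)
```

The number of values produced but not yet consumed is always the length of the FIFO, so the lead of the producer over the consumer cannot exceed the chosen capacity.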
6.3 Scheduling
During the extraction of sub-graphs onto the processors, dependencies between comm-nodes located on the same processor have been added to allow any static scheduler to rule the execution of sub-graphs. Thus, a possible choice for the implementation of sub-graphs is to generate statically sequenced code; all the techniques described in [6] can be applied to the sub-graphs.
On a single processor, executable code statically sequenced at compile-time always yields efficient programs: the choices made for the scheduling do not have any effect on losses due to idle periods. On several processors, an "at least partially dynamic" scheduler is needed because idle periods in statically sequenced code can be reduced only with a fine knowledge of execution costs on the different processors. As Signal specifications are architecture-independent, this information is missing and scheduling must be done at run-time. Of course, a dynamic scheduler is much more expensive than a static one. The final implementation should then statically sequence as many nodes of the sub-graphs as possible while leaving enough freedom between statically sequenced parts to make the best of concurrency.
To achieve this goal, we apply the clustering techniques shown in [11] to each sub-graph to get clusters that can be implemented in a procedural way.
6.4 Communications
The first implementations of the distribution of Signal programs have been made on Unix; processors are Unix processes (possibly on several machines at different sites). As one of our goals is to eventually provide separate compilation and hardware/software implementations, we wanted an abstraction of communications. The first implementations are currently made on the P.O.M. (Parallel Observable Machine [12]) developed at I.R.I.S.A. in the PAMPA team.
In the near future, we plan to use CORBA [13]. Indeed, CORBA provides a standardization of object-oriented communications between applications. The abstraction of physical communications is made through an IDL (footnote 17), object-oriented and commonly defined by many hardware designers and software developers. Final implementations should use the support of a run-time kernel and libraries. The generated programs will shift the responsibility for communications onto CORBA's Broker, used as a "black box". Of course, the use of such a standard reduces the performance of the final implementation, but allows more portability.
7 Perspectives
We have shown a complete method to generate distributed implementations from a Signal program. What we did not mention here is the way the nodes of the SDFG are assigned to processors. From our point of view, since the motivations for distribution are only qualitative, the user's directives should be set in the Signal source program to make the user's work easier. Moreover, this way the directives can be kept at the source level through successive improvements of the specification, while the graph level is just a temporary representation, invisible to programmers. This feature is currently implemented in version V4 of Signal.
A big improvement of the control distribution would be the definition of cost functions with quantitative information on durations. As this is not part of the scope of Signal because of its architecture-independence, the method presented in section 4 to reduce the clock hierarchies on the processors could be adapted to quantitative tools.
Finally, and this is the main part of our future work, Signal specifications should allow asynchronous executions. Our present studies have shown the limits of the SDFGs, which rule only static properties. This leads us to define a new model for the execution of synchronous programs, able to represent the dynamic behaviours of applications.
17. Interface Definition Language, described in [13].
References
[1] Albert Benveniste and Gérard Berry. The synchronous approach to reactive and real-time systems. Proceedings of the IEEE, 79(9):1270-1282, September 1991.
[2] P. Caspi, D. Pilaud, N. Halbwachs, and J. A. Plaice. Lustre: a declarative language for programming synchronous systems. In 14th ACM Symposium on Principles of Programming Languages, pages 178-188, Munich, 1987.
[3] F. Boussinot and R. De Simone. The Esterel language. Proceedings of the IEEE, 79(9):1293-1304, September 1991.
[4] Paul Le Guernic, Thierry Gautier, Michel Le Borgne, and Claude Le Maire. Programming real-time applications with Signal. Proceedings of the IEEE, 79(9):1321-1336, September 1991.
[5] Loïc Besnard. Compilation de Signal: horloges, dépendances, environnement. PhD thesis, Université de Rennes 1, France, September 1992. In French.
[6] O. Maffeïs. Ordonnancements de graphes de flots synchrones; Application à Signal. PhD thesis, Université de Rennes 1, France, January 1993. In French.
[7] Mohammed Belhadj. Conception d'architectures en utilisant Signal et VHDL. PhD thesis, Université de Rennes 1, France, December 1994. In French.
[8] Albert Benveniste, Paul Caspi, Paul Le Guernic, and Nicolas Halbwachs. Data-Flow Synchronous Languages. In J.W. de Bakker, W.P. de Roever, and G. Rozenberg, editors, Lecture Notes in Computer Science 803, Proc. of the REX School/Symposium, Noordwijkerhout, Netherlands, pages 1-45, Springer-Verlag, June 1993.
[9] C. Lavarenne, O. Segrouchni, Y. Sorel, and M. Sorine. The Syndex software environment for real-time distributed systems design and implementation. In European Control Conference, pages 1684-1689, June 1991.
[10] Bruno Chéron. Transformations syntaxiques de programmes Signal. PhD thesis, Université de Rennes 1, France, September 1991. In French.
[11] Bernard Le Goff and Paul Le Guernic. The granules, glutton: an idea, an algorithm to implement on multiprocessor. In R. Cori and M. Wirsing, editors, STACS 88, Lecture Notes in Computer Science, volume 294, AFCET, Springer-Verlag, Bordeaux, France, February 1988.
[12] F. Guidec and Y. Mahéo. POM: a virtual parallel machine featuring observation mechanisms. Research report 902, IRISA, January 1995.
[13] OMG, editor. The Common Object Request Broker: Architecture and Specification. Object Management Group, 1992.
