Distributed Implementation of SIGNAL: Scheduling & Graph Clustering by Maffeis, Olivier & Le Guernic, Paul
To cite this version:
Olivier Maffeis, Paul Le Guernic. Distributed Implementation of SIGNAL: Scheduling & Graph
Clustering. Third International Symposium Organized Jointly With The Working Group
Provably Correct Systems, Procos, Sep 1994, Lübeck, Germany. Springer-Verlag, pp.547-566, 1994,
LNCS vol. 863. <hal-00544101>
HAL Id: hal-00544101
https://hal.archives-ouvertes.fr/hal-00544101
Submitted on 7 Dec 2010
Distributed Implementation of SIGNAL:
Scheduling & Graph Clustering (*)
Olivier Maffeïs (1) and Paul Le Guernic (2)
(1) GMD I5 - SKS, Schloss Birlinghoven, 53754 Sankt Augustin, Germany
(2) IRISA, Campus de Beaulieu, 35042 Rennes Cedex, France
Abstract. This paper introduces the scheduling strategy and some key
tools which have been designed for the distributed implementation of
Signal, a real-time synchronous dataflow language. First, we motivate a
scheduling strategy with respect to the reactivity and time-predictability
requirements bound to real-time computing. Then, several key tools to
implement this scheduling strategy are described. These tools act
on the concept of Synchronous-Flow Dependence Graph (SFD Graph),
which defines a generalization of Directed Acyclic Graphs and constitutes
the abstract representation of Signal programs. The tools presented in
this paper are: (a) the abstraction of SFD graphs, which enables grain-size
tuning according to the target architecture, (b) the notion of scheduling
over SFD graphs and (c) qualitative clustering tools based on the notion
of Compositional Deadlock Consistency.
1 Introduction
Although distributed architectures are becoming increasingly popular,
implementing large-scale applications onto them still remains a very difficult
challenge. The problem of implementing a program onto a distributed architecture
is often stated as: partitioning and scheduling the nodes of a Directed Acyclic
Graph (DAG) \(\langle N, \Gamma \rangle\) onto a set of potentially heterogeneous
processors \(\{P_i \mid i = 1, \ldots, p\}\). In the abstract representation
\(\langle N, \Gamma \rangle\) of the application, N is a set of nodes (tasks) which
stand for indivisible (footnote 3) program computations, and \(\Gamma\) is a set of
arcs which represent the precedence constraints (including the data paths). If the
scheduling goal is the minimization of the program completion time, the associated
scheduling problem is NP-complete even if the number of processors is unbounded [18].
Since the optimal solution of this scheduling problem can only be computed by
exponential complexity algorithms (unless P=NP), the purpose is to define fast
heuristic techniques which efficiently compute optimal or near-optimal solutions
for restricted scheduling problems, i.e. fast heuristic techniques acting on a
previously stated scheduling strategy.
(*) This work was initiated at IRISA (Institut de Recherche en Informatique et
Systèmes Aléatoires). It was completed at RAL (Rutherford Appleton Laboratory,
England) and GMD, where it was supported by an ERCIM (European Research
Consortium for Informatics and Mathematics) fellowship.
Footnote 3: In the sense that no attempt is made to use intranode parallelism.
In this paper, we present the scheduling strategy and some key tools we designed
for the distributed implementation of Signal [13], a real-time synchronous
dataflow language. The scheduling strategy we chose is presented in section 2; it
takes into account the responsiveness, robustness, efficiency and
time-predictability requirements that real-time implementations must satisfy.
The tools implementing this scheduling strategy act on a generalization
of the notion of Directed Acyclic Graphs called Synchronous-Flow Dependence
Graphs (SFD Graphs). These graphs, which were initially designed to
represent Signal programs abstractly, now constitute the graph format shared
by the languages Esterel [5], Argos [17], Lustre [8] and Signal; SFD graphs
are presented in section 3.
The following three sections define three key tools over SFD graphs to
implement our scheduling strategy: (a) the abstraction of SFD graphs enabling
grain-size variations in section 4, (b) the notion of compile-time scheduling over
SFD graphs in section 5 and (c) clustering tools based on a new qualitative
criterion, namely the compositional deadlock consistency, in section 6.
2 Scheduling Strategy
The possible scheduling strategies go from fully dynamic up to fully static. In
the fully dynamic scheduling strategy, the assignment of the tasks to the
processors and their firing time are determined at run-time. In contrast, the fully
static approach realizes these two actions at compile time. Static or dynamic
task assignment is the first criterion which motivates the choice of a scheduling
strategy. A dynamic assignment scheduling strategy induces efficient parallel
program executions if the target architecture has relatively low communication costs
compared with the processor performance; a fully dynamic scheduling strategy
performs well for shared-memory architectures with few processors but not for
large-scale distributed memory architectures. Due to the relatively weak
scalability of the dynamic assignment scheduling strategies, we chose a scheduling
strategy with a static assignment of the tasks.
2.1 Related Work
Much work has been done on static assignment scheduling strategies, but
it often considers only particular target architectures (the processors are
homogeneous, the network topology is a ring, etc.) or particular application graphs
(chains, trees, etc.) [3] for which polynomial optimal algorithms can be found.
One relevant approach for a general purpose multiprocessor scheduling
strategy has been proposed by Kim & Browne [9]. Their scheduling strategy
decomposes the application graph into a set of linear clusters. A linear cluster is a
set of nodes in which, for every couple of nodes, one precedes the other. The
clustering algorithm iteratively merges the nodes on the critical path. After some
clustering refinements, each cluster is mapped to one processor of the target
architecture.
In [18], Sarkar proposed a similar scheduling strategy: a clustering phase
(called internalization prepass) followed by a mapping phase. His clustering
algorithm considers the arcs of the application graph in descending order with
respect to their communication weight. It iteratively merges the extremity nodes
of the first arc in the list if this clustering does not increase the parallel
completion time; this clustering algorithm performs non-linear clustering. When the
arc list has been exhausted, the mapping phase is carried out using a list scheduling
algorithm.
Another relevant approach has been defined by Gerasoulis & Yang in [7].
This approach, implemented in the PYRROS environment [20], uses a clustering
algorithm which merges the extremity nodes of the highly weighted arc on
the Dominant Sequence path. The Dominant Sequence path is the longest (i.e.
critical) path in the estimated static scheduling (footnote 4). As in Sarkar's
approach, this algorithm performs non-linear clustering.
2.2 The SIGNAL Approach
To implement Signal programs on a distributed architecture with p processors,
we advocate a slightly different scheduling strategy:
(a). Gather the nodes of the application graph into u clusters (\(u \geq p\)).
(b). Merge the u clusters into p connected virtual processors.
(c). Map the p virtual processors onto the p physical processors.
(d). Partition each virtual processor i into \(v_i\) clusters.
(e). Compute a static schedule for each cluster; the resulting sequences
of code will be dynamically scheduled.
In contrast with the other scheduling strategies, which are fully static, our
strategy envisages mixed static/dynamic scheduling at the processor level (step (e)).
This modification has been motivated by the kind of applications we have to
cope with: reactive instead of transformational systems [16]. In transformational
systems, the inputs are defined before the execution of the system. Therefore,
an implementation of such a system may schedule the reading of its inputs at
compile time; a fully static scheduling strategy may induce efficient implementations
for transformational systems. As we consider real-time systems, a timely kind
of reactive systems, the inputs are supplied at run-time and we have to cope
with robustness, responsiveness and time-predictability. Therefore, a scheduling
strategy with some dynamic scheduling is more suitable for the implementation
of real-time systems since it enables some flexibility in task firing. The dynamic
scheduling is intended to provide the implementation with responsiveness and
robustness to the variations in time of the input occurrences. However, this dynamic
scheduling must be strictly confined to satisfy the time-predictability requirement.
Footnote 4: Note that Critical Path and Dominant Sequence path are equivalent
notions in a linear clustering algorithm.
Note that, in our scheduling strategy, we have extended the notion of cluster.
Usually, a cluster is defined as a set of tasks which are executed on the same
processor. In the expression of our scheduling strategy, we only consider a cluster
as a set of tasks which are treated as a whole in the next non-clustering steps. In
the distribution steps (steps (a), (b) and (c)), this implies that all the tasks of a
cluster will be implemented on the same processor. In the implementation steps
(steps (d) and (e)), it signifies that dynamic scheduling will only be considered
between the clusters, a sequence of code being associated with each of the
\(v_i\) clusters.
In the sequel of this paper, we do not claim to provide a complete set of tools
to implement our scheduling strategy, but only to present some key qualitative
tools:
- abstraction of SFD graphs in section 4. With this abstraction, SFD graphs
  may constitute the only modeling along the inference of a parallel
  implementation;
- compile-time scheduling of SFD graphs in section 5. With this notion, a first
  step towards the inference of parallel implementations over SFD graphs is
  achieved;
- clustering tools based on a new qualitative scheduling criterion:
  compositional deadlock consistency in section 6. These clustering tools may be
  used in the implementation of steps (a) and (d) of our scheduling strategy.
Before presenting all these tools, let us briefly present the notion of
Synchronous-Flow Dependence Graph (SFD Graph).
3 Synchronous-Flow Dependence Graphs
Let us illustrate the notion of Synchronous-Flow Dependence Graph (SFD Graph)
over a simple Signal example, a counter with reset:
(| ZV:= V $1
| V:= ( 1 when RST )default( ZV+1 )
|)
The $ operator is used to recall past values: the process ZV:= V $1 means
that ZV carries the previous value of V. The when operator filters data according
to a boolean condition and the default is a merge with priority:
V:= ( 1 when RST )default( ZV+1 )
specifies that V is reset to 1 when RST holds true; otherwise it increments its
previous value. Sequences of values that RST, ZV and V may take are:
RST : ...         t           ...
ZV  : ... 5 6 7 8 9 1 2 3 4 5 ...
V   : ... 6 7 8 9 1 2 3 4 5 6 ...
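The sequences above can be reproduced with a small simulation. This is only a sketch written in Python rather than Signal: it considers the instants where V is present, models an absent RST as `None`, and the names `counter_with_reset` and `previous_v` are illustrative assumptions, not part of the Signal environment.

```python
from typing import List, Optional, Sequence, Tuple


def counter_with_reset(rst: Sequence[Optional[bool]],
                       previous_v: int) -> List[Tuple[int, int]]:
    """Return the (ZV, V) pairs produced at each instant where V is present."""
    trace = []
    zv = previous_v                      # ZV := V $1 (previous value of V)
    for reset in rst:
        # V := (1 when RST) default (ZV + 1); None models "RST absent" (assumption)
        v = 1 if reset else zv + 1
        trace.append((zv, v))
        zv = v                           # the current V becomes the next ZV
    return trace


# RST is true at the fifth instant only, as in the sequences of the text.
rst_flow = [None, None, None, None, True, None, None, None, None, None]
trace = counter_with_reset(rst_flow, previous_v=5)
print("ZV:", [zv for zv, _ in trace])
print("V :", [v for _, v in trace])
```

With the reset placed at the fifth instant, the two printed rows coincide with the ZV and V rows given in the text.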
This behavior is abstractly represented in two connected structures: an equation
system which translates the relations among the occurrences of data, and a
dependence graph which represents the flows of data.
An Equational Control Modeling
The equation system is expressed over a set C of characteristic functions called
clocks: for any signal V, its clock \(\hat{v}\) is equal to 1 if V is carrying a data
at the considered instant, it is equal to 0 if no data is present on V. The logical
relations among the occurrences of data, which are implicit in Signal processes,
are translated into equations over clocks. For instance, the counter example is
translated as:
ZV:= V $1                            \(\widehat{zv} = \hat{v}\)                      (i)
V:= ( 1 when RST )default( ZV+1 )    \(\hat{v} = \widehat{rst} \vee \widehat{zv}\)   (ii)
As expressed in equation (i), each time V is holding a value, ZV is carrying a
value (in fact the previous value of V). Equation (ii) expresses that V carries
a value whenever a reset occurs or ZV holds a value. Formally, the equation
system \(\Sigma\) which encodes the occurrence relations evolves in a boolean algebra
\(B = \langle C, \vee, \wedge, \hat{0}, \hat{1} \rangle\) where:
C is a set of clocks; B is called a clock algebra;
\(\hat{0}\) denotes the least element of B which stands for the never present clock;
it is used to denote something that never happens;
\(\hat{1}\) is the greatest element of B, the always present clock.
As boolean algebras are lattices, an alternative representation of the clock
algebra B is achieved through a partial order:
\(\langle C, \leq \rangle\) with \(\hat{x} \leq \hat{y} \iff \hat{x} \vee \hat{y} = \hat{y}\) (\(\iff \hat{x} \wedge \hat{y} = \hat{x}\))
Over the counter encoding, we can deduce that \(\hat{v} = \widehat{rst} \vee \hat{v}\)
or equivalently \(\widehat{rst} \leq \hat{v}\). This result intuitively means that the
activity of the counter includes the reset operations.
A Clock-Labeled Dependence Graph
The equation system describes algebraically the reachable control states of the
process. Over the counter example, the relation \(\widehat{rst} \leq \hat{v}\) induces
that the control state where \(\hat{v} = 0, \widehat{rst} = 1\) is unreachable.
According to the reachable control states, different flows of data may occur; the
different flows of data which may occur in the counter are abstractly represented by
the dependence graphs in Fig. 1-a, Fig. 1-b and Fig. 1-c, one for each reachable
control state.
In Fig. 1-a where RST occurs (\(\widehat{rst} = 1\)), ONE (a constant signal) is
assigned to V, the value of ZV is not used. Otherwise, V is defined (footnote 5) by
the value of ZV as depicted in Fig. 1-b. When \(\hat{v} = 0\) (the counter is not
counting), nothing happens, as presented in Fig. 1-c.
Footnote 5: For presentation reasons, we have substituted ZV+1 by ZV.
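[Figure 1: dependence graphs over the nodes one, zv and v, one per reachable control state:
(a) \(\hat{v} = 1, \widehat{rst} = 1\); (b) \(\hat{v} = 1, \widehat{rst} = 0\); (c) \(\hat{v} = 0, \widehat{rst} = 0\).]
Fig. 1. The Data-Dependencies According to the Control States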
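The three control states of Fig. 1 can be recovered by brute force from the clock equations (i) and (ii). The sketch below is an illustration only: it encodes each clock as 0 or 1 at a given instant and enumerates all assignments; the function name `reachable_control_states` is an assumption, and a real clock calculus works symbolically rather than by enumeration.

```python
from itertools import product
from typing import Dict, List


def reachable_control_states() -> List[Dict[str, int]]:
    """Enumerate the clock valuations satisfying equations (i) and (ii)."""
    states = []
    for v, zv, rst in product((0, 1), repeat=3):
        equation_i = (zv == v)              # (i)  zv^ = v^
        equation_ii = (v == (rst | zv))     # (ii) v^ = rst^ OR zv^
        if equation_i and equation_ii:
            states.append({"v": v, "zv": zv, "rst": rst})
    return states


for state in reachable_control_states():
    print(state)
```

Exactly three valuations survive, and none of them has `rst` equal to 1 while `v` is 0, which is the inclusion of the reset clock in the clock of V stated above.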
The abstract representation of the flows of data using Synchronous-Flow
Dependence Graphs (SFD Graphs) is defined by superimposing all the possible
data-dependence graphs. Superimposing all the data-dependence graphs drawn
in Fig. 1 induces the SFD graph depicted in Fig. 2.
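[Figure 2: SFD graph over the nodes one, zv and v, with the edge one -> v labeled
\(\widehat{rst}\) and the edge zv -> v labeled \(\widehat{\overline{rst}}\).
Node labels: \(f_N(v) = f_N(zv) = \hat{v}\) and \(f_N(one) = \widehat{rst}\),
with \(\widehat{zv} = \hat{v}\), \(\widehat{rst} \leq \widehat{zv}\) and
\(\widehat{\overline{rst}} = \hat{v} \wedge (\hat{1} \setminus \widehat{rst})\).]
Fig. 2. A Synchronous-Flow Dependence Graph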
The paths taken by the data according to the control states are described over
SFD graphs by means of two mappings \(f_N\) and \(f_\Gamma\). These two mappings
respectively label its nodes and its edges:
- \(f_N(one) = \widehat{rst}\) means that ONE is only present when RST occurs (footnote 6);
- \(f_\Gamma(zv, v) = \widehat{\overline{rst}}\) means that V is defined from ZV when RST does not occur.
The new clock-label \(\widehat{\overline{rst}}\) denotes the control state (b) in Fig. 1:
\(\widehat{\overline{rst}} = 1\) when \(\hat{v} = 1\) and \(\widehat{rst} = 0\).
The definition of \(\widehat{\overline{rst}}\) is (footnote 7):
\(\widehat{\overline{rst}} = \hat{v} \wedge (\hat{1} \setminus \widehat{rst})\).
Formally, an SFD graph is defined by:
\(\langle G, C, \Sigma, f_N, f_\Gamma \rangle\) is a Synchronous-Flow Dependence Graph (SFD graph) iff:
- \(G = \langle N, \Gamma, I, O \rangle\) is a dependence graph \(\langle N, \Gamma \rangle\)
  with communication nodes: the inputs I and the outputs O are such that
  \(I \subseteq N\), \(O \subseteq N\) and \(I \cap O = \emptyset\);
- \(\langle C, \Sigma \rangle\) is an equational control representation where \(\Sigma\)
  is a set of constraints over a set C of characteristic functions called clocks;
- \(f_N : N \rightarrow C\) is a mapping labeling each node with a clock; it specifies the
  existence condition of the nodes;
- \(f_\Gamma : \Gamma \rightarrow C\) is a mapping labeling each edge with a clock; it
  specifies the existence condition of the edges.
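Footnote 6: Note that the clock of a constant signal is defined in a demand-driven way.
Footnote 7: \(\widehat{\overline{rst}}\) is not equivalent to \(\overline{\widehat{rst}}\),
which is the complement of \(\widehat{rst}\): \(\overline{\widehat{rst}} = \hat{1} \setminus \widehat{rst}\).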
Directed Acyclic Graphs (DAGs) are a very common abstract program
representation [1] of the flows of data which may occur in a program. An SFD graph is
nothing but a set of directed graphs packed together, the way these graphs are
packed being described by a boolean labeling of the elements of this graph. For
this reason, we say that SFD graphs are a generalization of DAGs. In contrast
with DAGs, the clock labeling provides SFD graphs with a dynamical feature.
To express precedence constraints, this clock labeling imposes two constraints
which are implicit for DAGs:
- an edge cannot exist if one of its extremity nodes does not exist.
  This property, translated into the clock algebra (the image set of the
  mappings), is:
  \(\forall (x, y) \in \Gamma \quad f_\Gamma(x, y) \leq f_N(x) \wedge f_N(y)\)
- a cycle of dependencies stands for a deadlock.
  This property is verified over DAGs by definition. Over SFD graphs, it is
  expressed as:
  An SFD graph \(\langle G, C, \Sigma, f_N, f_\Gamma \rangle\) is deadlock free iff,
  for every cycle \(x_1, \ldots, x_n, x_1\) in G,
  \(f_\Gamma(x_1, x_2) \wedge f_\Gamma(x_2, x_3) \wedge \ldots \wedge f_\Gamma(x_n, x_1) = \hat{0}\)
  Intuitively, this equation translates the property that a deadlock does not
  exist if all the dependencies of a cycle in an SFD graph cannot be present at
  the same time.
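The deadlock-freedom condition can be checked mechanically once clocks are given a concrete encoding. The sketch below assumes, for illustration only, that a clock is a finite set of control states (so that the meet is set intersection and the null clock is the empty set); under this assumption, every cycle has a null conjunction of labels exactly when, for each control state, the arcs present in that state form an acyclic graph. The names `has_cycle`, `is_deadlock_free` and the states "a", "b", "c" are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple

Clock = FrozenSet[str]        # assumption: a clock is a set of control states
Arc = Tuple[str, str]


def has_cycle(arcs: List[Arc]) -> bool:
    """Kahn's algorithm: a cycle exists iff some node can never be removed."""
    nodes = {n for arc in arcs for n in arc}
    indegree = {n: 0 for n in nodes}
    for _, dst in arcs:
        indegree[dst] += 1
    ready = [n for n in nodes if indegree[n] == 0]
    removed = 0
    while ready:
        node = ready.pop()
        removed += 1
        for src, dst in arcs:
            if src == node:
                indegree[dst] -= 1
                if indegree[dst] == 0:
                    ready.append(dst)
    return removed != len(nodes)


def is_deadlock_free(f_gamma: Dict[Arc, Clock], states: Iterable[str]) -> bool:
    """True iff no control state makes all the arcs of some cycle present."""
    return not any(
        has_cycle([arc for arc, clock in f_gamma.items() if s in clock])
        for s in states
    )


STATES = ("a", "b", "c")      # hypothetical control states
exclusive = {("x", "y"): frozenset({"a"}), ("y", "x"): frozenset({"b"})}
overlapping = {("x", "y"): frozenset({"a", "b"}), ("y", "x"): frozenset({"b"})}
print(is_deadlock_free(exclusive, STATES))
print(is_deadlock_free(overlapping, STATES))
```

In the first graph the two arcs of the cycle are never present together, so the cycle is harmless; in the second they coexist in state "b" and the check reports a deadlock.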
As Signal is a dataflow language, SFD graphs naturally define a fine-grain
parallel representation of programs. Implementing Signal programs onto a
parallel architecture requires tuning the grain of the abstract program representation
according to the target parallel architecture. For this purpose, we present in
the next section the notion of abstraction over SFD graphs. This notion of
abstraction frees SFD graphs from the fine-grain representation to which they were
initially bound.
4 Abstraction of Synchronous-Flow Dependence Graphs
A key concept in software engineering is the concept of abstraction [10] which
supplies the sufficient information to compose processes leaving aside any
internal feature: it is the key concept for modularity. In programming languages, an
abstracted process is often confined to an identifier and a set of input/output
nodes. In some high-level languages, process abstraction may include (a) formal
parameters to introduce some program genericity or (b) some high-order inputs
like the procedure entry level in ADA [19].
More generally, the concept of abstraction is designed for the verification of
global properties by the composition of synthesized representations. If \(R_1\) and
\(R_2\) are two representations with some semantics and \(\mid\) is a composition
operator, the Abs synthesizing mechanism for the compositional verification of the
property P must verify the following relation:
\(P(R_1) \wedge P(R_2) \wedge P(Abs(R_1) \mid Abs(R_2)) \Longrightarrow P(R_1 \mid R_2)\)
According to their mixed nature, abstraction of SFD graphs involves two
synthesizing mechanisms to verify deadlock freedom and to perform control
consistency [15] by composition:
- A synthesis of the internal dependencies.
  This synthesis, required to verify deadlock by composition, is achieved through
  the transitive closure of the dependence graph and its projection (sub-graph)
  upon the input and output nodes. The transitive closure of SFD graphs is
  simply computed with the two following rules.
  rule of series:   \(x \xrightarrow{\hat{c}} y \xrightarrow{\hat{d}} z \;\Rightarrow\; x \xrightarrow{\hat{c} \wedge \hat{d}} z\)
  rule of parallel: \(x \xrightarrow{\hat{c}} y\) and \(x \xrightarrow{\hat{d}} y \;\Rightarrow\; x \xrightarrow{\hat{c} \vee \hat{d}} y\)
- A clock equation projection.
  This control projection synthesizes the relations (equivalence, inclusion,
  exclusion) among the clocks which label (a) the edges of the synthesized graph
  to enable compositional deadlock detection and (b) the input-output nodes
  to perform control consistency [15] by composition.
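The synthesis of the internal dependencies can be sketched as a Warshall-style closure in which the rule of series is a meet and the rule of parallel is a join. The code below is an illustration under the assumption that clocks are encoded as finite sets of control states (intersection for the meet, union for the join); it covers only the dependency synthesis, not the clock equation projection, and the names `transitive_closure`, `abstract_dependencies` and the states "s1", "s2", "s3" are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple

Clock = FrozenSet[str]        # assumption: a clock is a set of control states
Arc = Tuple[str, str]
EMPTY: Clock = frozenset()    # the null clock


def transitive_closure(nodes: List[str],
                       f_gamma: Dict[Arc, Clock]) -> Dict[Arc, Clock]:
    """Warshall-style closure over clock-labeled arcs."""
    closure = {(x, y): f_gamma.get((x, y), EMPTY) for x in nodes for y in nodes}
    for mid in nodes:
        for x in nodes:
            for y in nodes:
                via = closure[(x, mid)] & closure[(mid, y)]   # rule of series
                closure[(x, y)] = closure[(x, y)] | via       # rule of parallel
    return closure


def abstract_dependencies(nodes: List[str], f_gamma: Dict[Arc, Clock],
                          interface: Iterable[str]) -> Dict[Arc, Clock]:
    """Project the closure upon the interface (input and output) nodes."""
    keep = set(interface)
    closure = transitive_closure(nodes, f_gamma)
    return {arc: clock for arc, clock in closure.items()
            if arc[0] in keep and arc[1] in keep and clock}


graph = {
    ("i", "a"): frozenset({"s1", "s2"}),
    ("a", "o"): frozenset({"s2", "s3"}),
    ("i", "o"): frozenset({"s1"}),
}
print(abstract_dependencies(["i", "a", "o"], graph, interface=["i", "o"]))
```

On this small graph the path through the internal node contributes the state "s2" (series), which is joined with the direct arc present in "s1" (parallel), so the abstraction keeps a single arc from the input to the output labeled with both states.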
The counter example is too small to illustrate the abstraction of SFD graphs.
The reader interested in such an example is referred to [15, 14].
In contrast with a lot of common abstract representations of programs, the
abstraction over SFD graphs provides them not with black box abstractions but
rather with grey box abstractions since it even synthesizes the control. Moreover,
as abstractions of SFD graphs are SFD graphs, all the tools previously dened for
SFD graphs are reusable modularly: modularity may be introduced in the whole
compilation process. With this notion of abstraction, steps (a) and (d) of our
scheduling strategy can be achieved without giving up the SFD graph modeling.
This modeling homogeneity warrants a greater reliability (every modeling change
constitutes a possible source of error) which is a critical requirement for real-
time systems. In the next section, we go one step further towards the inference
of time-predictable parallel implementations by means of the notion of compile-
time scheduling over abstractions of SFD graphs.
5 Compile-Time Scheduling
Compile-time scheduling, that is programming at compile-time the execution of
tasks, can be considered at two levels: at the logical level, compile-time
scheduling is to set the precedence constraints verified at run-time among the tasks; at
the physical level, compile-time scheduling is to define the exact firing time of
the tasks. In this section, we only consider compile-time scheduling at the logical
level since we do not want to introduce quantitative data as required for physical
compile-time scheduling.
When an application is abstractly represented as a directed acyclic graph
\(\langle N, \Gamma \rangle\), scheduling at compile-time a set N of tasks is specified
by adding precedence constraints to \(\Gamma\) while making sure that no deadlock is
introduced. If we call reinforcement the addition of precedence constraints, and
deadlock consistency the action of "making sure that no deadlock is introduced", a
scheduling of \(\langle N, \Gamma \rangle\) is defined as a deadlock-consistent
reinforcement of it. Let us transpose this definition from DAGs to SFD graphs:
- reinforcement: \(\langle N, \Gamma' \rangle\) is a reinforcement of
  \(\langle N, \Gamma \rangle\) iff \(\Gamma \subseteq \Gamma'\).
  Transposing the reinforcement definition to SFD graphs implies:
  \(x \xrightarrow{\hat{k}} y\) is a reinforcement of \(x \xrightarrow{\hat{h}} y\) iff \(\hat{h} \leq \hat{k}\)
  Note that the absence of dependency between two nodes can be equivalently
  represented over SFD graphs by a dependence labeled with the null clock \(\hat{0}\).
  As for DAGs, reinforcement provides a set of SFD graphs based on the same
  node set with an order relation.
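- deadlock consistency: \(\langle N, \Gamma' \rangle\) is deadlock-consistent for
  \(\langle N, \Gamma \rangle\) iff
  \(\langle N, \Gamma \rangle\) deadlock free \(\Longrightarrow\) \(\langle N, \Gamma' \rangle\) deadlock free
  Transposing the notion of deadlock consistency to SFD graphs implies:
  \(x \xrightarrow{\hat{k}} y\) is deadlock-consistent for \(x \xrightarrow{\hat{h}} y\) iff
  \(\forall z_1, \ldots, z_n \in N\) such that \(y \xrightarrow{\hat{l}_0} z_1 \xrightarrow{\hat{l}_1} z_2 \ldots z_n \xrightarrow{\hat{l}_n} x\):
  \(\bigwedge_{i=0}^{n} \hat{l}_i \wedge \hat{h} = \hat{0} \;\Longrightarrow\; \bigwedge_{i=0}^{n} \hat{l}_i \wedge \hat{k} = \hat{0}\)
  Over a transitive closure or an abstraction of an SFD graph, the above
  condition of deadlock consistency is rewritten in a simpler form:
  \(x \xrightarrow{\hat{k}} y\) is deadlock-consistent for \(x \xrightarrow{\hat{h}} y\) iff
  \(\hat{h} \wedge \hat{l} = \hat{0} \;\Longrightarrow\; \hat{k} \wedge \hat{l} = \hat{0}\) with \(y \xrightarrow{\hat{l}} x\)
  As for DAGs, deadlock consistency provides a set of deadlock-free SFD
  graphs based on the same node set with an order relation.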
A compile-time scheduling of a graph is defined as a deadlock-consistent
reinforcement of it. Let us focus on what the combination of these two properties
precisely means over the SFD graph abstraction depicted in Fig. 3. In this figure,
x and y stand for any two nodes; they may be internal, input or output nodes.
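[Figure 3: two nodes x and y, with the arcs from x to y labeled \(f^+\) and S(x, y),
and the arc from y to x labeled \(f^-\).]
Fig. 3. A Basic SFD Graph Abstraction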
The clock which labels the dependency from x to y is denoted \(f^+\), its converse
is denoted \(f^-\). S(x, y) stands for a logical Scheduling of x before y.
As a scheduling is a deadlock-consistent reinforcement, S(x, y) must ensure that
the cycle \(x \xrightarrow{S(x, y)} y \xrightarrow{f^-} x\) does not represent a
deadlock. Therefore, it must satisfy the condition (1).
\(S(x, y) \wedge f^- = \hat{0}\)    (1)
By combining reinforcement with deadlock consistency, we demonstrate that
S(x, y) defines a compile-time scheduling iff the condition (2) holds.
\(f^+ \leq S(x, y) \leq \hat{x} \wedge \hat{y} \wedge (\hat{1} \setminus f^-)\)    (2)
Proof: the lower bound of scheduling is the straightforward expression of the
reinforcement property which is attached to the notion of compile-time scheduling. The
upper bound of scheduling is induced from the conjunction of the deadlock consistency
condition with the inclusion condition. The inclusion condition bound to SFD graphs
imposes that an arc cannot exist if one of its extremity nodes does not. Over the
notations of Fig. 3, this inclusion condition is translated as:
\(S(x, y) \leq \hat{x} \wedge \hat{y}\).
By means of elementary clock calculus, the deadlock consistency property is rewritten
as an inequation:
\(S(x, y) \wedge f^- = \hat{0} \iff S(x, y) \wedge (\hat{1} \setminus f^-) = S(x, y)\)
\(\iff S(x, y) \leq (\hat{1} \setminus f^-)\)
The intuitive meaning of this formally proven upper bound of scheduling is:
x may be scheduled before y at most when
  \(\hat{x} \wedge \hat{y}\)                    x and y are present,
  \(\wedge\, (\hat{1} \setminus f^-)\)          and y does not precede x
By means of clock expressions, different kinds of scheduling may be expressed
at compile-time. If S(x, y) is equal to \(\hat{x} \wedge \hat{y}\), it expresses that x
is scheduled before y as soon as x and y are defined: the underlying scheduling is
static. The existence of a cycle such that \(S(x, y) \wedge S(y, x) = \hat{0}\) denotes
a scheduling depending on boolean conditions evaluated at run-time; it induces
pre-constrained dynamic scheduling. The lack of dependency between x and y, which
occurs when S(x, y) and S(y, x) are both equal to \(\hat{0}\), induces a dynamic
scheduling.
Since scheduling is defined as the conjunction of reinforcement with deadlock
consistency, it provides a set of deadlock-free SFD graphs based on the same
node set with an order relation. Therefore, an execution schema can be designed
progressively by successive reinforcement of a graph. Moreover, this design can
be performed at any level of abstraction since this scheduling is applicable over
SFD graph abstractions.
Besides the proper definition of the notion of compile-time scheduling over
SFD graphs, the purpose of this section was to illustrate the way to express
by clock expressions the control of the execution of processes. The same
technique is used in the next section to define the notion of compositional deadlock
consistency on which our clustering algorithms are based.
6 Compositional Deadlock Consistency
The general problem of partitioning/mapping an application graph onto a set of
processors while minimizing the maximal completion time is NP-complete [18].
Bypassing this complexity can be achieved through clustering heuristics which
detect properties of sub-graphs that are considered as atomic units for the
mapping process. A clustering phase is intended to increase the granularity of the
graph, thereby reducing the size of the mapping problem without compromising
the implementation efficiency. Then, on this size-reduced application graph,
mapping algorithms with higher complexities can be reasonably used.
With respect to subtle variations of the scheduling goals, several clustering
heuristics have been defined in the literature (see [6] for a survey of these
heuristics). The clustering sub-goals that are used can be split in two classes: the
quantitative goals (called performance goals in [6]) and the qualitative ones. Two
different quantitative data are usually added to the application graph
\(\langle N, \Gamma \rangle\) for quantitative scheduling: the execution time
\(e_{ik}\) of the task \(n_i\) (\(n_i \in N\)) on the processor \(P_k\), and the
communication cost \(c_{ij}\) between the tasks \(n_i\) and \(n_j\) when they are
mapped on two directly connected processors (null communication time is assumed if
\(n_i\) and \(n_j\) are mapped on the same processor). According to these
two kinds of quantitative data, the two extreme sub-goals are: the maximization
of the execution efficiency and the minimization of the communication volume.
In contrast with the quantitative goals, which are architecture dependent, the
qualitative goals focus on the shape of the clusters. The qualitative goals which
have been used for clustering include:
- linearity [9]. A linear cluster is a set of nodes in which, for every couple of
  nodes, one precedes the other; the nodes of a linear cluster belong to a single
  path in the dependence graph. As linear clustering merges only sequentially
  executable nodes, it preserves the parallelism embedded in the graphs;
- convexity [18]. Sarkar defines the convexity as the property that ensures that
  a macro-actor can run to completion once all its inputs are available. In other
  words, its execution can be split into three periods sequentially performed:
  waiting for all the inputs; computing; emitting all the outputs. Therefore,
  we say that the execution of convex macro-actors is function-like at the I/O
  level. A graph-theoretic approach to convexity has been studied in [12].
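The linearity criterion can be tested with a plain reachability computation. The sketch below only illustrates the definition given in the list above (every couple of nodes of the cluster is ordered by the precedence relation); it is not the clustering algorithm of [9], it ignores clock labels, and the names `reachable_from` and `is_linear_cluster` are assumptions.

```python
from itertools import combinations
from typing import Iterable, Set, Tuple

Arc = Tuple[str, str]


def reachable_from(node: str, arcs: Set[Arc]) -> Set[str]:
    """Nodes that `node` precedes, directly or transitively."""
    seen: Set[str] = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for src, dst in arcs:
            if src == current and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen


def is_linear_cluster(cluster: Iterable[str], arcs: Set[Arc]) -> bool:
    """True iff, for every couple of nodes of the cluster, one precedes the other."""
    members = list(cluster)
    reach = {n: reachable_from(n, arcs) for n in members}
    return all(b in reach[a] or a in reach[b]
               for a, b in combinations(members, 2))


chain = {("a", "b"), ("b", "c"), ("a", "d")}   # hypothetical dependence graph
print(is_linear_cluster({"a", "b", "c"}, chain))
print(is_linear_cluster({"c", "d"}, chain))
```

The first cluster lies on the single path a, b, c and is linear; the second gathers two nodes that could run in parallel, so merging them would not be a linear clustering.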
In this section, we define a new qualitative criterion, namely Compositional
Deadlock Consistency, which allows one to encompass linear as well as convex
clustering in a single framework. This extension has been motivated by the reactive
feature of real-time systems, which requires considering the environment of the
real-time systems at all their design stages.
6.1 Example
Let us consider the graph in Fig. 4 which depicts the abstraction of a process
with two input signals I1 and I2, and two outputs O1 and O2. In this graph,
the solid arrows (i1 -> o1, i1 -> o2 and i2 -> o1) represent the dependencies
induced from the abstraction of the specification of the process.
[Figure 4, three panels:
(a) Implementation: read(i1); ...; read(i2); ...; emit(o2); ...; emit(o1)
(b) Composition graph over the nodes i1, i2, o1 and o2, showing the Specification,
    Implementation and Environment arcs
(c) Environment: read(o2); ...; emit(i1)]
Fig. 4. A Deadlock between a Process Implementation and its Environment
A topological sort of these nodes may induce the static scheduling in Fig. 4-a.
Transposing this static scheduling over the graph in Fig. 4-b introduces the
dashed arrows. If we compose the implementation in Fig. 4-a with an
environment implementing the scheme in Fig. 4-c, a deadlock is created. At the graph
level, this deadlock is denoted by the cycle \(i2 \dashrightarrow o2 \rightarrow i2\).
This deadlock is present at the implementation level but not at the specification
level since it includes a dashed arrow. As the scheduling \(i2 \dashrightarrow o2\)
may create a deadlock with an environment which is correct with respect to the
process specification, this scheduling is said not to be compositionally
deadlock-consistent.
In contrast, the scheduling \(o2 \dashrightarrow o1\) is compositionally
deadlock-consistent since it does not create a deadlock with the environment in
Fig. 4-c, and this environment is the only one which can read outputs and emit
inputs of the process without creating a deadlock with it at the specification level.
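The example can be replayed on a plain directed graph, leaving clocks aside. In the sketch below the specification arcs are the three solid arrows listed above, and the environment arc from o2 to i2 is the one implied by the cycle described in the text; the helper names `has_cycle` and `creates_deadlock` are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Arc = Tuple[str, str]


def has_cycle(arcs: Set[Arc]) -> bool:
    """Depth-first search for a cycle in a directed graph."""
    successors: Dict[str, List[str]] = {}
    for src, dst in arcs:
        successors.setdefault(src, []).append(dst)
        successors.setdefault(dst, [])
    state: Dict[str, int] = {}     # absent: unvisited, 1: on the stack, 2: done

    def visit(node: str) -> bool:
        state[node] = 1
        for nxt in successors[node]:
            if state.get(nxt) == 1:
                return True        # back arc: a cycle is closed
            if nxt not in state and visit(nxt):
                return True
        state[node] = 2
        return False

    return any(node not in state and visit(node) for node in successors)


SPECIFICATION = {("i1", "o1"), ("i1", "o2"), ("i2", "o1")}   # solid arrows
ENVIRONMENT = {("o2", "i2")}       # assumption: arc taken from the cycle in the text


def creates_deadlock(scheduling_arc: Arc) -> bool:
    """Does adding this scheduling arc deadlock the process with the environment?"""
    return has_cycle(SPECIFICATION | ENVIRONMENT | {scheduling_arc})


print(has_cycle(SPECIFICATION | ENVIRONMENT))
print(creates_deadlock(("i2", "o2")))
print(creates_deadlock(("o2", "o1")))
```

The environment alone is acyclic with the specification, scheduling i2 before o2 closes the cycle through the environment, and scheduling o2 before o1 does not, which matches the discussion of the two schedulings above.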
6.2 Definition
Let us focus on what the notion of compositional deadlock consistency precisely
means over the generic SFD graph abstraction depicted in Fig. 5.
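[Figure 5: generic SFD graph abstraction with the input nodes i1, ..., ik, ..., ip,
the output nodes o1, ..., ol, ..., oq and two nodes x and y. Arc labels:
\(f_{kX}\) (from ik to x), \(f^+\) and S(x, y) (from x to y), \(f^-\) (from y to x),
\(f_{Yl}\) (from y to ol), \(f_{kl}\) (from ik to ol) and \(F_{env}\) (from ol to ik).]
Fig. 5. A Generic SFD Graph Abstraction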
In this figure, \(i_1 \cdots i_p\) represent the input nodes, \(o_1 \cdots o_q\) the
output ones, and x and y stand for any two nodes which may be internal nodes as well
as interface nodes (footnote 8). Translated over the notations in Fig. 5, the notion
of compositional deadlock consistency imposes that S(x, y) must verify:
\(\forall i_k, o_l \quad f_{kX} \wedge S(x, y) \wedge f_{Yl} \wedge F_{env} = \hat{0}\)    (3)
In this equation, \(F_{env}\) denotes a dependency from \(o_l\) to \(i_k\) resulting
from the composition with an environment. This environment is acceptable if it is not
deadlocked with the specification of the process. Thus, the following condition
must be verified:
\(\forall i_k, o_l \quad F_{env} \wedge f_{kl} = \hat{0}\)
As for the proof of the upper bound of scheduling (formula (2) in section 5), the
above condition can be equivalently rewritten as the inequation
\(F_{env} \leq (\hat{1} \setminus f_{kl})\).
Consequently, the condition of compositional deadlock consistency (formula (3))
is rewritten as:
\(\forall i_k, o_l \quad S(x, y) \wedge f_{kX} \wedge f_{Yl} \wedge (\hat{1} \setminus f_{kl}) = \hat{0}\)
This quantified equation can be rewritten as inequation (4).
\(S(x, y) \leq \bigwedge_{k,l} \bigl(\hat{1} \setminus (f_{kX} \wedge f_{Yl} \wedge (\hat{1} \setminus f_{kl}))\bigr)\)    (4)
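Proof: The equation $S(x,y) \wedge f_{kX} \wedge f_{Yl} \wedge (\hat{1} - f_{kl}) = \hat{0}$ can be rewritten as:
$$\begin{array}{ll}
 & S(x,y) \wedge \bigl(\hat{1} - (f_{kX} \wedge f_{Yl} \wedge (\hat{1} - f_{kl}))\bigr) = S(x,y) \quad \forall i_k \in I,\ o_l \in O \\
\Leftrightarrow & S(x,y) \leq \hat{1} - \bigl(f_{kX} \wedge f_{Yl} \wedge (\hat{1} - f_{kl})\bigr) \quad \forall i_k \in I,\ o_l \in O \\
\Leftrightarrow & S(x,y) \leq \bigwedge_{k,l} \Bigl(\hat{1} - \bigl(f_{kX} \wedge f_{Yl} \wedge (\hat{1} - f_{kl})\bigr)\Bigr)
\end{array}$$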
Footnote 8: If x is the input node ik, it is equivalent to consider for the rest of this paper that
$f_{kX}$ is equal to $\hat{x}$. A symmetric remark holds if y is the output node ol.
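The equivalence used in the proof of inequation (4) can be checked mechanically on a toy model. The sketch below assumes, purely for illustration, that a clock is a finite set of instants (a Python frozenset): the null clock is the empty set, conjunction is intersection, the complement to the top clock is set difference and the order is set inclusion. The names UNIVERSE, all_clocks, deadlock_free, below_bound and forms_agree are hypothetical and are not taken from the Signal compiler.

```python
from itertools import combinations

# Assumption: a tiny finite set of instants stands for the top clock.
UNIVERSE = frozenset(range(3))


def all_clocks(universe):
    """Enumerate every subset of the universe, i.e. every clock of the toy model."""
    items = sorted(universe)
    return [frozenset(c)
            for r in range(len(items) + 1)
            for c in combinations(items, r)]


def deadlock_free(s, f_kx, f_yl, f_kl):
    """Equation form: S ^ f_kX ^ f_Yl ^ (1 - f_kl) is the null clock."""
    return not (s & f_kx & f_yl & (UNIVERSE - f_kl))


def below_bound(s, f_kx, f_yl, f_kl):
    """Inequation form: S <= 1 - (f_kX ^ f_Yl ^ (1 - f_kl))."""
    return s <= UNIVERSE - (f_kx & f_yl & (UNIVERSE - f_kl))


def forms_agree():
    """Compare both forms on every combination of clocks of the toy model."""
    clocks = all_clocks(UNIVERSE)
    return all(
        deadlock_free(s, a, b, c) == below_bound(s, a, b, c)
        for s in clocks for a in clocks for b in clocks for c in clocks
    )
```

Calling forms_agree() compares the two forms for one pair (ik, ol) over all 4096 clock combinations of this small universe; the quantification over k and l then amounts to intersecting the resulting bounds.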
6.3 Fully Deadlock Consistent Compile-Time Scheduling
By combining the compile-time scheduling characterization (formula (2)) with
inequality (4), we define the criterion of fully deadlock consistent compile-time
scheduling (fdc scheduling), which is formally characterized by:
S(x, y) defines a fully deadlock consistent compile-time scheduling of
x before y iff $f^{+} \leq S(x,y) \leq S^{\top}(x,y)$ with:
$$S^{\top}(x,y) = \hat{x} \wedge \hat{y} \wedge (\hat{1} - f^{-}) \wedge \bigwedge_{k,l} \Bigl(\hat{1} - \bigl(f_{kX} \wedge f_{Yl} \wedge (\hat{1} - f_{kl})\bigr)\Bigr) \qquad (5)$$
The proof of this inequality is straightforward. The complex clock expression
which specifies the upperbound of scheduling may be intuitively read as:
x may be scheduled before y iff
- x does not precede y: $\hat{x} \wedge \hat{y} \wedge (\hat{1} - f^{-})$, and
- if a scheduling path ik, x, y, ol is created ($f_{kX} \wedge f_{Yl}$), then ik precedes ol by specification ($f_{kl}$): $\bigwedge_{k,l} \bigl(\hat{1} - (f_{kX} \wedge f_{Yl} \wedge (\hat{1} - f_{kl}))\bigr)$.
The two main promising properties of this scheduling criterion are: (a) it may
induce architecture-independent clustering since it is a qualitative scheduling
criterion; (b) as it is based on the abstraction of SFD graphs, it may be applied to
any subset of nodes: it defines an any-level scheduling criterion. Exploiting these
properties to perform clustering requires applying this criterion carefully in order
to avoid the NP-complete problems that its general use would encounter. The practical
uses of this new scheduling criterion for clustering are presented in the next section.
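As an illustration of formula (5), the following sketch evaluates the upperbound of fdc scheduling on a toy model in which a clock is assumed to be a finite set of instants (a frozenset). The function names, the parameter f_excluded (the clock subtracted from the top clock in the first part of formula (5)) and the numeric values are hypothetical; they only serve to make the formula executable.

```python
def fdc_upper_bound(universe, clock_x, clock_y, f_excluded, interface_terms):
    """Evaluate the right-hand side of formula (5) on set-valued clocks.

    interface_terms holds one triple (f_kX, f_Yl, f_kl) per pair (ik, ol).
    """
    bound = clock_x & clock_y & (universe - f_excluded)
    for f_kx, f_yl, f_kl in interface_terms:
        # Forbid the instants at which the path ik, x, y, ol would exist
        # although ik does not precede ol by specification.
        bound &= universe - (f_kx & f_yl & (universe - f_kl))
    return bound


def is_fdc_scheduling(s, f_plus, upper):
    """Check the double inequality f+ <= S(x, y) <= upperbound."""
    return f_plus <= s <= upper


# Hypothetical data: four instants, one (input, output) pair.
universe = frozenset({0, 1, 2, 3})
upper = fdc_upper_bound(
    universe,
    clock_x=universe,
    clock_y=universe,
    f_excluded=frozenset({3}),
    interface_terms=[(frozenset({0, 1}), frozenset({1, 2}), frozenset())],
)
```

With these values, instant 3 is removed by the first part of the formula and instant 1 by the interface term, so a candidate scheduling clock is accepted only if it lies between the lower bound and the remaining instants.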
7 Clustering
By applying the fdc scheduling criterion to a set of nodes, the associated process
may constitute a cluster by:
- Linear Clustering if all the nodes may belong to a single path of fdc scheduling.
Note that, as an fdc scheduling is a reinforcement of a graph, any linear
clustering over a graph (as performed in [9]) is a linear clustering over an fdc
scheduling of this graph. But, in contrast with Kim & Browne's linear clustering,
linear clustering over fdc scheduled graphs may reduce the parallelism
embedded in the initial graph;
- Convex Clustering if all the nodes may belong to a single path of fdc
scheduling where inputs and outputs are not alternating. Therefore, any
convex cluster is a linear cluster.
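The difference between the two kinds of cluster can be illustrated on a single scheduling path. The sketch below assumes that "not alternating" means that no input occurs after an output along the path; the function name is_convex_path and the node names are hypothetical.

```python
def is_convex_path(path, inputs, outputs):
    """Return True when no input appears after an output along the path.

    Any path qualifies as a linear cluster; it is also a convex one
    when all its inputs precede all its outputs.
    """
    seen_output = False
    for node in path:
        if node in outputs:
            seen_output = True
        elif node in inputs and seen_output:
            return False
    return True


inputs = {"i1", "i2"}
outputs = {"o1", "o2"}
convex = is_convex_path(["i1", "i2", "o1", "o2"], inputs, outputs)
alternating = is_convex_path(["i1", "o2", "i2", "o1"], inputs, outputs)
```

Internal nodes are ignored by the test, which only looks at the relative order of interface nodes.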
The practical use of the fdc scheduling criterion to do linear and convex clustering
will encounter NP-complete problems at two levels:
- complex calculi in a boolean algebra lead to NP-complete problems. This first
obstacle has been overcome with the heuristic algorithm that implements
the clock calculus [2]. Despite the breakthrough achieved by this heuristic
algorithm, the boolean calculi submitted to it must be as simple as possible;
- optimal partitioning/clustering of general graphs with respect to non-trivial
criteria is an NP-complete problem. To cope with this obstacle, we can use
optimal algorithms with exponential complexity on very small (sub-)graphs,
polynomial but often sub-optimal algorithms on large graphs, or a combination
of both.
The first optimization achieved by both the convex and the linear clustering
algorithms is to do clustering in two steps. Firstly, only a reduced-size problem
is considered by restricting the scope of fdc scheduling from any pair of nodes to
pairs of interface nodes. Secondly, the properties detected at the interface level
are propagated to the internal nodes to perform convex and linear clustering.
7.1 Convex Clustering
A naive algorithm for convex clustering at the interface level would be to enumerate
the possible elementary paths of the maximal fdc schedulings of an interface
abstraction, the maximal fdc schedulings of a graph being computed by recursively
substituting each arc by its upperbound of fdc scheduling. If one of these
paths does not alternate inputs and outputs, the associated process may define
a convex cluster.
The major drawback of this naive algorithm is its complexity: it requires
two phases (computation of the maximal fdc schedulings and path enumeration)
which have an exponential complexity in the general case. Consequently, we
have investigated the other possibility, which goes through the upperbound of
fdc scheduling of an interface graph. The upperbound of fdc scheduling of a graph
is computed by substituting in parallel each arc by its upperbound of fdc scheduling.
This upperbound is the superimposition of all the maximal fdc schedulings. For
instance, let us consider the interface abstraction depicted in Fig. 6-a. In this
abstraction, we assume that $f_N(i_1) = f_N(i_2) = f_N(o_1) = f_N(o_2) = \hat{k}$ and
$\hat{h} \leq \hat{k}$. The upperbound of fdc scheduling of this abstraction is the SFD graph
in Fig. 6-b.
In the general case, the upperbound of fdc scheduling does not define a
scheduling as it may include cycles representing deadlocks. The upperbound of
scheduling depicted in Fig. 6-b includes two of these cycles, one between i1 and
i2 and the other between o1 and o2. The conjunction $\hat{h} \wedge \hat{k}$ of the clocks
labeling the dependencies of these cycles is equal to $\hat{h}$ since $\hat{h} \leq \hat{k}$: the cycles
exist at $\hat{h}$. In contrast with these first two cycles, the third elementary cycle, which
occurs between i2 and o2, does not stand for a deadlock since $\hat{h} \wedge (\hat{k} - \hat{h}) = \hat{0}$.
This remark is in fact a general property, as proved in [14]:
no deadlock cycle including inputs and outputs may occur
at the upperbound of fdc scheduling.
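The clock at which a cycle exists is the conjunction of the clocks labeling its arcs. This can be replayed on a toy model where, by assumption, a clock is a finite set of instants; the values chosen for h and k below are hypothetical and only need to satisfy the inclusion of h in k.

```python
from functools import reduce


def cycle_clock(universe, labels):
    """Conjunction of the clocks labeling the arcs of a cycle."""
    return reduce(lambda acc, clock: acc & clock, labels, universe)


universe = frozenset(range(6))
k = frozenset({0, 1, 2, 3})
h = frozenset({0, 1})  # assumption: h is included in k

# Cycle whose arcs are labeled h and k: it exists at h.
inputs_cycle = cycle_clock(universe, [h, k])

# Cycle whose arcs are labeled h and k - h: it never exists.
mixed_cycle = cycle_clock(universe, [h, k - h])
```

The first conjunction reduces to h, which is why the cycles among inputs and among outputs exist at that clock, while the second one is empty, so the cycle mixing an input and an output never materializes.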
[Figure: (a) an interface graph over the nodes i1, i2, o1 and o2; (b) its upperbound of fdc scheduling. The arcs are labeled with the clocks $\hat{h}$, $\hat{k}$ and $\hat{k} - \hat{h}$.]
Fig. 6. A Graph and its Upperbound of Fdc Scheduling
As no cycle may alternate inputs and outputs, a cycle among inputs implies that
these inputs can be scheduled in a sequence without outputs; a similar argument
applies to cycles among outputs. This property of the cycles occurring at the
upperbound of fdc scheduling motivates the following algorithm, which performs
convex clustering:
1. compute the upperbound of fdc scheduling among the inputs;
2. for each set of inputs belonging to a cycle at $\hat{h}$: cluster to this set of inputs
the internal and output nodes which depend exclusively on these inputs.
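The two steps above can be sketched as follows, under a strong simplifying assumption: the upperbound among the inputs has already been restricted to the instants of one clock, so that its arcs can be handled as plain unlabeled edges. The function names, the dictionary encoding of graphs and the example graph are hypothetical.

```python
def reachable(graph, start):
    """Nodes reachable from start by following at least one arc."""
    seen, stack = set(), list(graph.get(start, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    return seen


def input_cycles(upper_bound, inputs):
    """Group the inputs lying on a common cycle of the upperbound graph."""
    groups = []
    for i in sorted(inputs):
        if any(i in g for g in groups):
            continue
        group = {i} | {j for j in inputs
                       if j in reachable(upper_bound, i)
                       and i in reachable(upper_bound, j)}
        if len(group) > 1:
            groups.append(group)
    return groups


def convex_clusters(dependencies, upper_bound, inputs):
    """Attach to each cycle of inputs the nodes depending only on it."""
    all_nodes = set(dependencies) | {n for s in dependencies.values() for n in s}
    clusters = []
    for group in input_cycles(upper_bound, inputs):
        cluster = set(group)
        for node in all_nodes - set(inputs):
            sources = {i for i in inputs if node in reachable(dependencies, i)}
            if sources and sources <= group:
                cluster.add(node)
        clusters.append(cluster)
    return clusters


# Hypothetical process: i1 and i2 feed a and o1, i3 feeds o2.
inputs = {"i1", "i2", "i3"}
dependencies = {"i1": {"a"}, "i2": {"a"}, "a": {"o1"}, "i3": {"o2"}}
upper_bound = {"i1": {"i2"}, "i2": {"i1"}}  # cycle among i1 and i2 only
clusters = convex_clusters(dependencies, upper_bound, inputs)
```

In this example, i1 and i2 form a cycle and absorb a and o1, whereas o2 stays outside the cluster because it depends on i3. The quadratic use of reachable keeps the sketch short; it is not meant to reflect the complexity of an actual implementation.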
Note that a symmetric convex clustering algorithm may start from the outputs
instead of the inputs. This variation of the clustering algorithm may be useful if
there are fewer outputs than inputs, since it then deals with a smaller problem.
Applied to the interface graph in Fig. 6-a, this algorithm detects that the
associated process defines a convex cluster at $\hat{h}$.
7.2 Linear Clustering
Convex clustering may result in a partition into processes which can run to
completion once all their inputs are available; in these processes, all the inputs
may precede all the outputs at the implementation level. Looking for a partition
into linear clusters which are not convex clusters leads to searching for fdc
scheduling paths which alternate inputs and outputs. Therefore, one way to reduce
the search space of the algorithm which does this search is to start from
an fdc scheduling dependency connecting an output to an input. Starting from
such a scheduling dependency, the algorithm may proceed by looking backward
and then forward to get the longest path of fdc scheduling. Prior to this
algorithm, a transitive reduction algorithm may be applied to reduce
the search space even more.
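The backward-then-forward search can be sketched as a greedy path extension. This is only an approximation of the search described above: it assumes an acyclic scheduling graph restricted to one clock, it picks the first candidate in alphabetical order at each step, and it therefore does not guarantee the longest path in general. The function name extend_path and the four-node example are hypothetical.

```python
def extend_path(sched, seed):
    """Grow a path around a seed arc (output, input), backward then forward.

    sched maps each node to the set of its successors in the scheduling graph.
    """
    preds = {}
    for src, dsts in sched.items():
        for dst in dsts:
            preds.setdefault(dst, set()).add(src)

    path = list(seed)
    # Backward: prepend predecessors of the current head.
    while True:
        candidates = sorted(preds.get(path[0], set()) - set(path))
        if not candidates:
            break
        path.insert(0, candidates[0])
    # Forward: append successors of the current tail.
    while True:
        candidates = sorted(sched.get(path[-1], set()) - set(path))
        if not candidates:
            break
        path.append(candidates[0])
    return path


# Hypothetical scheduling with two inputs and two outputs.
sched = {"i1": {"o2"}, "o2": {"i2"}, "i2": {"o1"}}
path = extend_path(sched, ("o2", "i2"))
```

Starting from the output-to-input arc restricts the search to paths that alternate inputs and outputs at least once, which is exactly the case not covered by convex clustering.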
Applied to the graph in Fig. 6-a, the algorithm starts from the fdc scheduling
dependency o2 → i2 at $\hat{k} - \hat{h}$. Then, by going backward, the node i1 is added
as the starting point of this scheduling path. By going forward, the node o1 is
appended to the path. Finally, this algorithm detects that the associated process
defines a linear cluster but not a convex one at $\hat{k} - \hat{h}$. By combining this
result with the convex clustering detected on the same set of nodes, linear and
possibly convex clusters are detected. By this combination, the process abstractly
represented in Fig. 6-a defines a linear cluster at:
$$\hat{h} \vee (\hat{k} - \hat{h}) = \hat{k}$$
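The last equality relies on the assumption made for Fig. 6 that $\hat{h}$ is lower than or equal to $\hat{k}$. A possible step-by-step justification, assuming that the difference of two clocks is their conjunction with the complement, i.e. $\hat{k} - \hat{h} = \hat{k} \wedge (\hat{1} - \hat{h})$, is:

```latex
\begin{align*}
\hat{h} \vee (\hat{k} - \hat{h})
  &= \hat{h} \vee \bigl(\hat{k} \wedge (\hat{1} - \hat{h})\bigr) \\
  &= (\hat{h} \vee \hat{k}) \wedge \bigl(\hat{h} \vee (\hat{1} - \hat{h})\bigr)
     && \text{(distributivity)} \\
  &= (\hat{h} \vee \hat{k}) \wedge \hat{1} \\
  &= \hat{h} \vee \hat{k} \\
  &= \hat{k} && \text{(since } \hat{h} \leq \hat{k}\text{)}
\end{align*}
```

The process is therefore a convex cluster at $\hat{h}$ and a linear, non-convex one at the remaining instants of $\hat{k}$.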
After this definition of the convex and linear clustering algorithms, let us conclude
this paper by presenting the way we intend to implement the five-step
scheduling strategy we advocated, and how the fdc scheduling criterion is used
in this framework.
7.3 Scheduling Strategy Implementation
At the beginning of this paper, we advocated a five-step scheduling strategy:
(a). Gather the nodes of the application graph into u clusters ($u \geq p$).
(b). Merge the u clusters into p connected virtual processors.
(c). Map the p virtual processors onto the p physical processors.
(d). Partition each virtual processor i into $v_i$ clusters.
(e). Compute a static schedule for each cluster; the resulting sequences
of code will be dynamically scheduled.
The two clustering steps (a) and (d) will be based on the convex and linear clustering
algorithms previously presented. Steps (b) and (c) will be implemented by
means of the coupling of the Signal software design environment with the Syndex
system. Syndex, which stands for Synchronous Distributed Executive, is a
system which enables the inference of implementations over various distributed
architectures. It performs this inference by mapping SFD graphs over a graph
representation of the architecture (footnote 9). This inference is performed in three steps:
(a) the user may constrain some mapping of processes onto processors; (b) Syndex
completes the mapping and produces scheduled distributed code for the
target architecture; and (c) Syndex provides the user with static analyses of
the performance of the inferred implementation. An iteration among these three
steps is required to infer, for complex applications, an efficient implementation on
a distributed, possibly heterogeneous, architecture.
Implementing step (e) may once again take advantage of the fdc scheduling
criterion, but in a slightly different way than for clustering.
Defining an implementation requires an order relation. From the upperbound of
fdc scheduling of an interface graph, two ways exist to get an order relation: break
the cycles or merge the nodes belonging to a cycle. Using these two methods over
the upperbound in Fig. 6-b, we infer the two graphs in Fig. 7, which respectively
define:
Footnote 9: In fact, the graph representation of the architecture may be a hypergraph since the
target architecture may include buses.
- an interface execution scheme in Fig. 7-a.
The first step in the inference of the high-level execution scheme in Fig. 7-a
is to break the cycles representing deadlocks by removing the edges between
inputs and between outputs at the clock at which convex clustering was performed.
This leads to suppressing in Fig. 6-b the arcs i2 → i1 and o1 → o2.
The second step is the unfolding of the acyclic graph according to the different
control states referred to in the remaining cycles. The remaining cycle
between i2 and o2, which does not denote a deadlock ($\hat{h} \wedge (\hat{k} - \hat{h}) = \hat{0}$),
imposes a conditional scheduling denoted by the labeled fork-join in the graph
in Fig. 7-a. Note that static (i.e. non-conditional) scheduling is achieved in
the two extreme cases: $\hat{h} = \hat{k}$ and $\hat{h} = \hat{0}$.
- a communication scheme in Fig. 7-b.
The cycle between the input nodes expresses that, when $\hat{h}$ occurs, i1 may
be scheduled before or after i2 without creating a deadlock. For this reason,
the values on i1 and i2 can be received together without creating
a deadlock. In other words, the communications of the values of i1 and i2
may be vectorized at $\hat{h}$ if they come from the same processor. To express the
design of such a communication scheme at the graph level, it is sufficient to
partition the nodes according to the cycles. Applied to the upperbound graph
in Fig. 6-b, we may deduce the input communication interface presented
in Fig. 7-b. In this implementation, the values carried by i1 and i2 are
communicated together at $\hat{h}$ through the new node ci12: $f_N(ci_{12}) = \hat{h}$. A
symmetric result may be achieved over the outputs.
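The two ways of obtaining an order relation can be sketched on a plain set of arcs. The sketch assumes that the graph has already been restricted to the clock at which convex clustering was performed, so clock labels are left out, and it removes both directions of each interface cycle rather than a single arc. The function names and the set of arcs are hypothetical and do not reproduce Fig. 6-b exactly.

```python
def break_interface_cycles(arcs, inputs, outputs):
    """Drop every arc connecting two inputs or two outputs."""
    return {(a, b) for (a, b) in arcs
            if not ({a, b} <= inputs or {a, b} <= outputs)}


def merge_nodes(arcs, group, merged_name):
    """Replace the nodes of group by one node and drop the self-loops."""
    def rename(node):
        return merged_name if node in group else node

    return {(rename(a), rename(b)) for (a, b) in arcs
            if rename(a) != rename(b)}


# Hypothetical upperbound restricted to one clock.
inputs = {"i1", "i2"}
outputs = {"o1", "o2"}
arcs = {("i1", "i2"), ("i2", "i1"), ("o1", "o2"), ("o2", "o1"),
        ("i1", "o1"), ("i2", "o2")}

execution_scheme = break_interface_cycles(arcs, inputs, outputs)
communication_scheme = merge_nodes(arcs, {"i1", "i2"}, "ci12")
```

Breaking the cycles keeps only the input-to-output arcs, which gives an acyclic execution order, whereas merging i1 and i2 into a single node models the gathered communication of both values.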
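[Figure: (a) an execution scheme over the nodes i1, i2, o1 and o2, with arcs labeled $\hat{h}$ and $\hat{k} - \hat{h}$; (b) an input communication scheme over i1, i2 and the communication nodes ci1, ci2 and ci12.]
Fig. 7. Execution and Communication Schemes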
8 Conclusion
The paper has motivated a scheduling strategy for the distributed implementation
of Signal programs. This scheduling strategy differs from the usual one by
the dynamic scheduling it includes. This variation has been motivated by the
reactivity requirements that Signal, as a real-time language, must fulfill.
For the implementation of this scheduling strategy, we have defined several
tools, all of them acting on Synchronous-Flow Dependence Graphs (SFD
Graphs). These graphs, which constitute the abstract representation of Signal
programs, define a generalization of the notion of Directed Acyclic Graph. Three
tools are defined in this paper to implement this scheduling strategy:
- Abstraction.
This first tool is intended to free SFD graphs from the fine-grain parallel
abstract representation to which they were initially bound. By means of this
abstraction, we are able to tune the grain size of the representation according to
that of the target architecture without giving up the SFD graph
modeling;
- Compile-time Scheduling.
This definition of the notion of scheduling constitutes the first step towards
the inference of implementations. The purpose of this definition was also to
illustrate the method to express over SFD graphs the scheduling of processes
with a complex control;
- Clustering.
A new qualitative criterion, namely compositional deadlock consistency, is
defined and used at several steps in the scheduling strategy. In particular, this
new criterion is used to implement the two clustering steps of our scheduling
strategy. This new criterion makes it possible to embrace in a single framework two
usual qualitative clustering criteria, linearity and convexity.
The abstraction tool is currently integrated into the Signal software design
environment; the programming of the clustering tools is underway. The Signal
software design environment intends to encompass all the stages of the design of
real-time systems. This environment includes (a) a graphic specification interface
to specify real-time systems, (b) several formal verification tools to prove
properties and thereby enhance the safety of the implementations, and (c) tools to
infer implementations over sequential architectures as well as distributed ones.
The inference of distributed implementations for Signal programs is only
partially implemented in the Signal compiler; the architecture-dependent
transformations are performed by the Syndex system [11]. The coupling between the
Signal compiler and the Syndex system is achieved by means of a textual
decompilation of SFD graphs [4]; an extended version of this decompilation defines
the common graph format shared by Esterel [5], Argos [17], Lustre [8] and
Signal.
References
1. A. V. Aho, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques, and
Tools. Addison-Wesley, 1986.
2. T. Amagbegnon, L. Besnard, and P. Le Guernic. Arborescent canonical form of
boolean expressions. Research Report 826, IRISA, June 1994.
3. S. H. Bokhari. Partitioning problems in parallel, pipelined, and distributed
computing. IEEE Trans. on Computers, 37(1):48-57, January 1988.
4. P. Bournai, C. Lavarenne, P. Le Guernic, O. Maffeis, and Y. Sorel. Interface
SIGNAL-SynDEx. Research Report 2206, INRIA France, Rennes, March 1994.
5. F. Boussinot and R. De Simone. The Esterel language. Proceedings of the IEEE,
79(9):1293-1304, Sept. 1991.
6. A. Gerasoulis and T. Yang. A comparison of clustering heuristics for clustering
dags on multiprocessors. Journal of Parallel and Distributed Computing, Special
Issues on Scheduling and Load Balancing, 16(4):276-291, Dec. 1992.
7. A. Gerasoulis and T. Yang. A static-dataflow scheduling tool for scalable parallel
architectures. In Summer School on Scheduling Theory and its applications, pages
382-417. Chateau de Bonas (Gers), INRIA, Sept. 1992.
8. N. Halbwachs, P. Caspi, P. Raymond, and D. Pilaud. The synchronous data flow
programming language Lustre. Proceedings of the IEEE, 79(9):1305-1321, Sept. 1991.
9. S. J. Kim and J. C. Browne. A general approach to mapping of parallel computation
upon multiprocessor architectures. In Int. Conf. on Parallel Processing,
volume III, pages 1-8, 1988.
10. C. W. Krueger. Software reuse. ACM Computing Surveys, 24(2):131-183, June
1992.
11. C. Lavarenne, O. Segrouchni, Y. Sorel, and M. Sorine. The Syndex software
environment for real-time distributed systems design and implementation. In European
Control Conference, volume 2, pages 1684-1689, June 1991.
12. B. Le Goff, P. Le Guernic, and J. Araoz Durand. Semi-granules and schielding
for off-line scheduling. Research Report 1228, INRIA France, Rocquencourt, May
1990.
13. P. Le Guernic, T. Gautier, M. Le Borgne, and C. Le Maire. Programming real-time
applications with Signal. Proceedings of the IEEE, 79(9):1321-1336, Sept.
1991.
14. O. Maffeis. Ordonnancements de graphes de flots synchrones; Application à Signal.
PhD thesis, Université de Rennes 1, France, Jan. 1993.
15. O. Maffeis and P. Le Guernic. Combining dependability with architectural
adaptability by means of the Signal language. In 3rd Int. Workshop on Static Analysis,
pages 99-110. LNCS no 724, Springer-Verlag, Sept. 1993.
16. Z. Manna and A. Pnueli. The Temporal Logic of Reactive and Concurrent Systems.
Springer-Verlag, 1991.
17. F. Maraninchi. The Argos language: Graphical representation of automata and
description of reactive systems. In IEEE Workshop on Visual Languages, Oct.
1991.
18. V. Sarkar. Partitioning and Scheduling Parallel Programs for Multiprocessors.
Research Monographs in Parallel and Distributed Computing. MIT Press, Cambridge,
Massachusetts, and Pitman Publishing, London, U.K., 1989.
19. USDD. Reference Manual for the Ada Programming Language. United States,
Department of Defense, 1983. ANSI:MIL-STD-1815A-1983.
20. T. Yang and A. Gerasoulis. Pyrros: Static task scheduling and code generation for
message-passing multiprocessors. In Proc. of the 6th ACM Int. Conf. on
Supercomputing, pages 428-437, 1992.
This article was processed using the LaTeX macro package with LLNCS style
