Efficient Configuration of Protocol Software for Multiprocessors by Fischer, Stefan et al.
REIHE INFORMATIK
6/94
Ecient Conguration of Protocol Software
for Multiprocessors
{ Revised Version {
S. Fischer und W. Eelsberg
Universitat Mannheim
Seminargebaude A5
D-68131 Mannheim

Ecient Conguration of Protocol Software for
Multiprocessors
Stefan Fischer and Wolfgang Eelsberg
Praktische Informatik IV, University of Mannheim
D-68131 Mannheim, Germany
email:fstes, eelsbergg@pi4.informatik.uni-mannheim.de
November 25, 1994
Abstract
Ecient implementation of communication software is of crucial importance for high-
speed networks. One way to improve the runtime performance of protocol implementa-
tions in the network nodes is the use of parallelism. Formal description techniques like
Estelle improve the specication process in many respects and allow for semiautomatic
code generation. Therefore, they are now widely accepted. We present a code generator for
Estelle that compiles and automatically congures protocol software for a multiprocessor.
Software modules are distributed over the available processors and executed concurrently.
We report performance results on a KSR1 with 28 available processors under the OSF/1
operating system.
1 Introduction
Existing protocol suites such as the ISO/OSI or the INTERNET protocols were designed with
relatively slow links in mind. The end systems were fast enough to process complex protocols
because transmission speed was slow compared to the time needed for protocol execution.
Given the fast transmission media based on fiber optics that are now available, current end
systems are too slow for high-performance communication. Communication software has become
the major bottleneck in high-speed networks [CT90, Svo89]. Efficient implementation of the
protocol stack is of crucial importance for the networking future. One approach to improve
protocol performance is the use of parallelism [HEHK92, BZ92, DDK+90, PS92]. Parallel
execution of protocol entities leads to a significant speedup in protocol execution [Zit92].
For specification purposes, the formal description technique Estelle (among others) was
standardized by ISO [ISO89]. Formal description techniques improve the correctness of
specifications by avoiding ambiguities and by enabling formal verification. In addition, they
allow semiautomatic code generation. Generating code from specifications has several
advantages: The code can be maintained more easily since the system is specified in an
abstract, problem-oriented language, and the code is well-structured. It is also much easier
to port an implementation to another system. But one of the major problems has been the
performance of implementations produced automatically from a formal specification.
Often, today's code generators are primarily designed to produce rapid prototypes for
simulation purposes, e.g. [SS93]. Some generators already make use of parallelism [JJ89,
Pet91], but mainly for validation and simulation purposes. In contrast, we have implemented
a code generator for Estelle aiming at good performance of the generated code at runtime
[FH93]. It maps an Estelle specification to a multiprocessor system. The runtime system of
our compiler allows the concurrent execution of protocol entities.
Though the first results we obtained with our system were good, they were not close to the
theoretical limit for the possible speedup for the execution of Estelle specifications derived
in [Hof94]. As a consequence, we present in this paper an improved solution for the design of
our Estelle code generator. The protocol software is configured for a multiprocessor in an
optimized way. Unlike other approaches ([PPVW92] or [HP91]), we do not mean by configuration
the selection of certain protocols but the mapping of certain software parts to selected
processors. A similar approach can be found in [MT93], where a formal language as well as an
implementation environment for the development of parallel systems is described.
The paper is organized as follows: Section two briefly introduces Estelle and concentrates on
its features for the description of parallelism. Section three compares the theoretical limit
for the speedup for parallel execution of Estelle specifications to results we obtained with
our Estelle compiler. In section four, we describe a better design for the compiler, taking
into account machine, operating system and Estelle specification characteristics. Section five
presents current performance results. Section six concludes the paper.
2 Specifying Parallelism with Estelle
Estelle specifications consist of a hierarchically ordered set of finite state machines,
so-called modules. Modules communicate via bidirectional channels. A channel connects two
interaction points (one per module), both of which have unlimited queues for asynchronous
message reception associated with them. Estelle modules can be nested: Within the body of a
module, other modules, called child modules, can be defined. Thus all modules of a
specification form a tree.
When a module has no attribute, it is said to be inactive, i.e. it has no transition besides the
initialization (often the root module is inactive). All other modules are active.
The execution sequence of the modules is controlled in two ways: according to their position
within the tree, and by means of an attribute given to each module. The basic tree rule is that
a parent module always takes precedence over its children, i.e. a child can only execute if the
parent has nothing to do. A parent and a child can never run in parallel.
The module attributes control the parallelism between modules at the same level of the
hierarchy. There are four attributes: systemprocess, systemactivity, process and activity.
In the following we use the term system module for systemprocesses and systemactivities.
Then the following Estelle rules apply:
• Every active module must have one of the four attributes.
• A system module cannot be contained in another attributed module.
• Each process module and each activity module must be contained (perhaps indirectly)
  in a system module.
• A process module or a systemprocess module can contain other process or activity modules.
• An activity module or a systemactivity module can only contain other activity modules.
• Children whose parent module is of type process or systemprocess may all run in
  parallel. Children whose parent module is of type activity or systemactivity are
  mutually exclusive, i.e. only one of them can run at a time.
As a consequence, a module containing a system module must be inactive; it is typically
located at the root of the tree. Also, in each path of the tree, from the root to a leaf,
there is exactly one system module, i.e. each active module belongs to exactly one system
module.
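The structural rules listed above can be checked mechanically. The following Python sketch is only an illustration: the `Module` class and the `check_tree` function are hypothetical names, not part of any Estelle tool, and the checker covers only the containment rules stated in this section.

```python
from dataclasses import dataclass, field
from typing import List, Optional

SYSTEM = {"systemprocess", "systemactivity"}
PROCESS_LIKE = {"process", "systemprocess"}
ACTIVITY_LIKE = {"activity", "systemactivity"}


@dataclass
class Module:
    name: str
    attribute: Optional[str] = None  # None models an inactive (unattributed) module
    children: List["Module"] = field(default_factory=list)


def check_tree(module, parent_attr=None, inside_system=False):
    """Return a list of rule violations found in the subtree rooted at `module`."""
    errors = []
    attr = module.attribute
    if attr in SYSTEM:
        # A system module cannot be contained in another attributed module.
        if parent_attr is not None or inside_system:
            errors.append(f"{module.name}: system module inside an attributed module")
    elif attr in ("process", "activity"):
        # Process and activity modules must be contained in a system module.
        if not inside_system:
            errors.append(f"{module.name}: not contained in a system module")
    elif attr is not None:
        errors.append(f"{module.name}: unknown attribute {attr!r}")
    if parent_attr in ACTIVITY_LIKE and attr != "activity":
        errors.append(f"{module.name}: only activity modules allowed under {parent_attr}")
    if parent_attr in PROCESS_LIKE and attr not in ("process", "activity"):
        errors.append(f"{module.name}: only process or activity modules allowed under {parent_attr}")
    for child in module.children:
        errors += check_tree(child, attr, inside_system or attr in SYSTEM)
    return errors


if __name__ == "__main__":
    tree = Module("root", None, [
        Module("sys", "systemprocess", [
            Module("entity", "process", [Module("conn", "activity")]),
        ]),
    ])
    print(check_tree(tree))
```

An inactive root with one systemprocess child, as in the usage example, yields no violations; placing a process module below a systemactivity module yields one.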
The dynamics are as follows. At runtime, a module instance can only be dynamically created
and destroyed by its parent module. Thus the number of module instances can vary at runtime,
but their relative position within the tree is predetermined. When the system is initialized,
exactly one instance of each system module is created. As opposed to the activity modules
and the process modules, the structure of the systemactivity modules and systemprocess
modules is static at runtime. The system modules themselves are mutually independent and
can run asynchronously and in parallel.
The motivation behind these semantics is that a typical communication system has static parts
and dynamic parts (to be created at runtime). For example, a protocol entity implemented as a
process can accept a new CONNECT request and then create a new child module to handle the
new connection. All child module instances for parallel connections will then be independent
of each other, and able to execute in parallel.
Most existing execution environments for Estelle run on sequential machines and use a very
simple runtime scheduler to guarantee the above semantics. The execution is organized in
cycles. The cycle begins with the system module. If it has a transition to execute ("fire"),
it will do so, ending the cycle. If not, it passes the right to fire to its children. There,
the same procedure is repeated, all the way down to the leaves of the tree. The number of
children allowed to execute their transitions in parallel is determined by the attribute of
the parent module: Children whose parent module is of type process or systemprocess may all
run in parallel. An activity or systemactivity module may only allow one child to execute a
transition in each cycle.
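The cyclic selection described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions: `Node`, its `ready` flag (standing for "has a transition to fire") and `select_for_cycle` are hypothetical names, and real schedulers also handle transition priorities and message queues, which are left out here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    attribute: str            # systemprocess, systemactivity, process or activity
    ready: bool = False       # assumption: True if a transition can fire in this cycle
    children: List["Node"] = field(default_factory=list)


def select_for_cycle(node):
    """Return the names of the modules that fire in one cycle below `node`."""
    if node.ready:
        # The parent takes precedence: it fires and its children stay idle.
        return [node.name]
    fired = []
    exclusive = node.attribute in ("activity", "systemactivity")
    for child in node.children:
        selected = select_for_cycle(child)
        if selected and exclusive:
            # Children of an activity-type parent are mutually exclusive.
            return selected
        fired += selected
    return fired


if __name__ == "__main__":
    system = Node("S", "systemprocess", False, [
        Node("A", "process", True),
        Node("B", "process", False, [Node("C", "activity", True)]),
        Node("D", "process", True),
    ])
    print(select_for_cycle(system))
```

On a sequential machine the selected modules are executed one after another; a parallel runtime may execute them concurrently.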
It should be noted that asynchronous parallelism in Estelle specifications could be increased
considerably by slightly extending syntax and semantics, thereby maintaining the practically
relevant logic of the semantics, while relinquishing the unnatural restrictions introduced
by the cyclic execution model. This extension provides much more potential for efficient
implementation on multiprocessors. The details can be found in [BG93].
Obviously it is possible to implement an Estelle module tree on a multiprocessor, taking
advantage of the two possible forms of parallelism in Estelle specifications:
• Asynchronous parallelism between all system modules
• Synchronous parallelism in subtrees rooted at process or systemprocess modules
  (synchronized by the parent module).
In order not to violate the Estelle semantics described above, the parallel implementation
still has to guarantee that parent and child modules never run in parallel, nor any of the
modules inside a subtree rooted at an activity or systemactivity module.
3 Theoretical Limit and Practical Results
When implementing Estelle specifications on a multiprocessor, the possible speedup is bounded
by two factors: first, by the number of processors, and second, by the Estelle semantics
described in section 2.
Obviously, the speedup can never be greater than the number of processors. For the limitations
introduced by Estelle semantics, we developed a formula for an upper bound of the speedup
[Hof94]:
Theorem 1 Let $m$ be the total number of modules in a specification, $b$ the number of leaf
modules, $h$ the height of the module tree, $t_{sel}$ the average time for the selection of a
transition and $t_{exe}$ the average time for its execution. Then, the following upper bound
is valid for the maximum speedup $s_{max}$:

$$ s_{max} \le \frac{m + b \, \frac{t_{exe}}{t_{sel}}}{h + \frac{t_{exe}}{t_{sel}}} \qquad (1) $$

For $t_{exe} \ll t_{sel}$, this leads to the following bound:

$$ s_{max} \le \frac{m}{h} \qquad (2) $$

For $t_{exe} \gg t_{sel}$, the bound computes to:

$$ s_{max} \le b \qquad (3) $$
[Figure 1: Generalized Protocol Stack for Execution Time Measurements. Labels in the diagram:
Root Module, Initiator, Responder, GPI, Transportpipe, Start, Stop, h_GP, v.]
The derivation and proof of these formulas can be found in [Hof94].
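To get a feeling for formula (1), the bound can be evaluated for different ratios of execution time to selection time. The sketch below is illustrative: the function name `speedup_bound` and the optional `processors` cap are assumptions introduced here, the cap reflecting the statement that the speedup can never exceed the number of processors.

```python
def speedup_bound(m, b, h, ratio, processors=None):
    """Upper bound on the speedup according to formula (1).

    m: total number of modules, b: number of leaf modules,
    h: height of the module tree, ratio: t_exe / t_sel.
    """
    bound = (m + b * ratio) / (h + ratio)
    if processors is not None:
        # The speedup can never be greater than the number of processors.
        bound = min(bound, processors)
    return bound


if __name__ == "__main__":
    # Example values: m = 6, b = 5, h = 2.
    for ratio in (0.0, 1.0, 10.0, 1000.0):
        print(ratio, round(speedup_bound(6, 5, 2, ratio), 3))
```

For a ratio of zero the result equals m/h, as in formula (2); for a very large ratio it approaches b, as in formula (3).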
These formulas help us to structure our Estelle specification for optimal parallelism: first,
the module hierarchy has to be designed as flat as possible. This keeps the height h small
and the number of leaf modules b big, thus increasing the upper bound in formula (1). Second,
the number of modules should be big, increasing the overall parallelization potential. As a
consequence, large modules should be checked to see whether they can be split into more
modules. When doing this, the ratio $t_{exe}/t_{sel}$ has to be taken into account: the time
for transition selection should be smaller than the transition's execution time.
To test our compiler, we used the generalized protocol stack shown in Figure 1. The Generalized
Protocol Instances (GPI) take messages from their upper or lower interaction point, do some
processing, and send the message to the opposite (lower or upper) interaction point. The GPI
module is specified in a manner that allows us to connect GPI modules to each other. The
internal processing is produced by some module-internal Estelle commands: a loop with a
variable upper bound that performs some computations.
This model is very flexible with respect to the number of connections $v$, the height of the
protocol stack $h_{GP}$ and the time spent on protocol processing. It is possible to vary all
these parameters easily. By the third parameter, the time spent on protocol processing, we
model the ratio $t_{exe}/t_{sel}$. The higher the number of iterations in the loop, the higher
the time spent on protocol processing, and the higher the ratio.
For measurements, we always set up all connections, start the timer, send 1000 data requests
from the initiator to the responder and wait for the last ACK indication. Then we stop the
timer. With a varying number of connections, height of the stack and processing overhead, we
got the results shown in Table 1. The parameters m, b and h can be computed as follows:

$$ m = (3 + 2 h_{GP}) \, v + 1 $$
$$ b = m - 1 $$
$$ h = 2 $$

Thus, we have a very flat specification with a huge number of leaf modules, resulting in a
high degree of parallelism.
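As a quick cross-check, the three formulas for m, b and h can be evaluated for the measured combinations of v and h_GP. The helper name `stack_parameters` in this Python sketch is hypothetical; the printed values correspond to the m, m/h and b columns of Table 1.

```python
def stack_parameters(v, h_gp):
    """Return (m, b, h) for v connections and a GPI stack of height h_gp."""
    m = (3 + 2 * h_gp) * v + 1   # total number of modules
    b = m - 1                    # every module except the root is a leaf
    h = 2                        # the module tree is flat
    return m, b, h


if __name__ == "__main__":
    for v in range(1, 5):
        for h_gp in range(1, 5):
            m, b, h = stack_parameters(v, h_gp)
            print(f"v={v} h_GP={h_gp} m={m} m/h={m / h} b={b}")
```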
The results are already good and show a significant speedup when executing protocols in
parallel. However, they are not close to the theoretical limit (which is maximally 28, the
number of processors). In the next section, we describe the determinants of the time needed
for parallel protocol execution and some ways to increase the speedup.
                      Speedup           theor. Limit      Speedup/Module
 v  h_GP    m    1 000   5 000  10 000     m/h     b    1 000   5 000  10 000
 1     1    6      1.8     1.9     1.9       3     5      0.3     0.3     0.3
 1     2    8      2.9     3.4     3.5       4     7      0.4     0.4     0.4
 1     3   10      3.9     4.7     4.8       5     9      0.4     0.5     0.5
 1     4   12      4.6     5.8     6.0       6    11      0.4     0.5     0.5
 2     1   11      3.0     3.5     3.5     5.5    10      0.3     0.3     0.3
 2     2   15      5.1     6.5     6.8     7.5    14      0.3     0.4     0.5
 2     3   19      6.7     8.6     8.9     9.5    18      0.4     0.5     0.5
 2     4   23      8.0    11.1    11.5    11.5    22      0.3     0.5     0.5
 3     1   16      4.3     5.2     5.5       8    15      0.3     0.3     0.3
 3     2   22      7.0     9.5     9.7      11    21      0.3     0.4     0.4
 3     3   28      9.2    12.1    13.3      14    27      0.3     0.4     0.5
 3     4   34     10.4    15.4    15.9      17    33      0.3     0.5     0.5
 4     1   21      5.3     6.9     7.0    10.5    20      0.3     0.3     0.3
 4     2   29      8.5    11.8    12.7    14.5    28      0.3     0.4     0.4
 4     3   37     10.6    15.8    17.2    18.5    36      0.3     0.4     0.5
 4     4   45      9.5    13.5    15.2    22.5    44      0.2     0.3     0.3
Table 1: Measurement Results

4 Improving the Compiler
For our experiments, we have modified the Estelle compiler Pet/Dingo from NIST [SS93].
The most important drawback of this modified compiler is that it has a very simple mapping
algorithm: each Estelle module is mapped onto an operating system thread of OSF/1, and all
the threads implementing the modules of one system module tree are packed into one OSF/1
process. This approach does not take the actual execution environment into account. We see
three components that influence the execution time of protocols:
• the machine architecture
• the operating system
• the structure of the specification
These issues will be discussed in the remainder of this section.
4.1 The Machine
Concerning the machine^1 on which the protocol stack is running, we identify two important
performance parameters: the number of available processors, and the communication cost
between each pair of processors.
The number of processors is important as it is a bound on the number of threads. If the number
of threads exceeds the number of processors, we can show that we lose more time with thread
synchronization and context switching than we gain by parallelism. This implies that it is
better not to fully parallelize a protocol implementation when the number of threads exceeds
a certain limit. In section 4.3, we show how to adapt the degree of parallelism smoothly.
^1 We do not elaborate on the selection of MIMD as our parallel architecture. More details on
this point are given in [FH93].
Another problem on many multiprocessors is the communication cost between processors. This
includes thread synchronization and data exchange. Often, threads or processes running on
one processor need to access data located in the memory of some other processor. The data
first has to be located and then transferred. Let us look at the KSR1 architecture, our
implementation machine. At most 32 processors are connected to a ring which they use to
communicate. If more than 32 processors are used, they have to be connected to different
rings, which are then interconnected. Data exchange between processors in one ring is
executed during one round trip: on receiving a data request, the processor that holds the
data puts it on the ring. If no processor on the ring has the data, the request is forwarded
to the higher-level ring controller. Obviously, communication cost will then increase.
The solution to this problem is to map communicating threads (i.e. Estelle modules) onto
processors in a cost-minimizing way. Threads which are not communicating may be mapped
to processor pairs with higher communication costs.
For our new configuration process, the machine configuration will be stored once in a
configuration file. It contains the number of processors and, for each pair of processors,
the communication costs.
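The idea of a cost-minimizing mapping can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the pairwise cost matrix stands in for the contents of the configuration file (whose actual format is not described here), the traffic weights are invented, the toy machine has two rings of two processors each, and the exhaustive search is only practical for very small examples.

```python
from itertools import permutations


def mapping_cost(mapping, traffic, cost):
    """Sum of traffic weight times processor-pair cost over communicating threads."""
    return sum(w * cost[mapping[a]][mapping[b]] for (a, b), w in traffic.items())


def best_mapping(threads, processors, traffic, cost):
    """Try every one-thread-per-processor assignment and keep the cheapest."""
    best = None
    for chosen in permutations(processors, len(threads)):
        mapping = dict(zip(threads, chosen))
        total = mapping_cost(mapping, traffic, cost)
        if best is None or total < best[0]:
            best = (total, mapping)
    return best


# Hypothetical machine: processors 0 and 1 share a ring, as do 2 and 3.
# Communication within a ring costs 1, across rings 5.
COST = [
    [0, 1, 5, 5],
    [1, 0, 5, 5],
    [5, 5, 0, 1],
    [5, 5, 1, 0],
]
# Hypothetical traffic: t0 and t1 communicate heavily, t1 and t2 rarely.
TRAFFIC = {("t0", "t1"): 10, ("t1", "t2"): 1}

if __name__ == "__main__":
    print(best_mapping(["t0", "t1", "t2"], [0, 1, 2, 3], TRAFFIC, COST))
```

In this example the heavily communicating threads t0 and t1 end up on the same ring, while t2 is placed on the other one.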
4.2 The operating system
For our work, we decided to use a Mach-based operating system which we consider well-suited
for MIMD computers. However, there are still some subtle differences between different
Mach-based systems. One important aspect is the thread synchronization mechanism. Estelle
modules often have to synchronize with their parent or child modules, and as modules are
mapped to threads, thread synchronization plays an important role in protocol execution.
Basically, there is one mechanism available in any Mach-based operating system:
synchronization using locks and condition variables. We use it as follows: To synchronize the
threads running parent and child modules, we use two shared memory variables and a lock and a
condition variable for each of them. One variable is used by the parent thread to specify the
action that should be performed by the child (e.g. execute a transition). The other one
contains the number of child modules that received the command. Any child that finishes its
action will decrement that variable by one. When the variable reaches zero, the parent knows
that all children are ready. It will immediately continue with its own work.
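The scheme with two shared variables can be sketched with Python's `threading` module, in which a `Condition` object bundles a lock with a condition variable. This is an illustrative analogue, not the Mach or OSF/1 interface: the class and method names are hypothetical, and the `generation` counter is an addition made here so that a child can tell a new command from the previous one.

```python
import threading


class ParentChildSync:
    def __init__(self, n_children):
        self.n_children = n_children
        self.command = None      # shared variable 1: action requested by the parent
        self.generation = 0      # assumption: distinguishes successive commands
        self.pending = 0         # shared variable 2: children that have not finished
        self.cmd_cond = threading.Condition()
        self.done_cond = threading.Condition()

    def run_children(self, command):
        """Parent side: publish a command and wait until all children are done."""
        with self.done_cond:
            self.pending = self.n_children
        with self.cmd_cond:
            self.command = command
            self.generation += 1
            self.cmd_cond.notify_all()
        with self.done_cond:
            while self.pending > 0:
                self.done_cond.wait()

    def wait_for_command(self, last_seen):
        """Child side: block until a command newer than `last_seen` arrives."""
        with self.cmd_cond:
            while self.generation == last_seen:
                self.cmd_cond.wait()
            return self.command, self.generation

    def child_done(self):
        """Child side: decrement the counter; the last child wakes the parent."""
        with self.done_cond:
            self.pending -= 1
            if self.pending == 0:
                self.done_cond.notify()


def child(sync, executed, index):
    seen = 0
    while True:
        command, seen = sync.wait_for_command(seen)
        if command == "execute":
            executed[index] += 1   # stands for executing one transition
        sync.child_done()
        if command == "stop":
            return


def demo(n_children=3, rounds=5):
    sync = ParentChildSync(n_children)
    executed = [0] * n_children
    threads = [threading.Thread(target=child, args=(sync, executed, i))
               for i in range(n_children)]
    for t in threads:
        t.start()
    for _ in range(rounds):
        sync.run_children("execute")
    sync.run_children("stop")
    for t in threads:
        t.join()
    return executed
```

Calling `demo(3, 5)` runs five synchronized rounds with three children; each child executes once per round.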
On some systems, there is also a so-called barrier synchronization. A barrier has a barrier
master thread and some barrier slave threads. Both may check in to and check out from a
barrier. When slaves check in, they cannot continue their work until the master checks in.
Afterwards, both master and slaves run in parallel. When the master checks out, it has to
stop until all the slaves have checked out.
For use in our Estelle synchronization, the parent module is mapped to the master thread
while all its children become slave threads. When the parent wants the children to perform an
action, it sets the shared variable accordingly and checks in. The slaves begin their
execution by first reading the variable and then performing the requested action. The master
checks out immediately after its check-in, thus waiting for the completion of the children.
A child that has finished simply checks out.
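The check-in and check-out protocol can be approximated in Python with two `threading.Barrier` objects. This is a loose analogue under stated assumptions: Python has no master and slave roles, so one barrier models the check-in and another the check-out, and Python barriers block instead of busy waiting. The function name `run_with_barriers` is hypothetical.

```python
import threading


def run_with_barriers(n_children=3, rounds=4):
    """Run `rounds` parent-child synchronization rounds; return per-child counts."""
    check_in = threading.Barrier(n_children + 1)    # master plus all slaves
    check_out = threading.Barrier(n_children + 1)
    shared = {"command": None}                      # the single shared variable
    executed = [0] * n_children

    def slave(index):
        while True:
            check_in.wait()                  # cannot continue until the master checks in
            command = shared["command"]      # read the requested action first
            if command == "execute":
                executed[index] += 1         # stands for executing one transition
            check_out.wait()                 # a finished child simply checks out
            if command == "stop":
                return

    def master(command):
        shared["command"] = command          # set the shared variable, then check in
        check_in.wait()
        check_out.wait()                     # check out at once: wait for all children

    threads = [threading.Thread(target=slave, args=(i,)) for i in range(n_children)]
    for t in threads:
        t.start()
    for _ in range(rounds):
        master("execute")
    master("stop")
    for t in threads:
        t.join()
    return executed
```

Compared with the lock-based scheme, only the command variable is shared; the counter of unfinished children is replaced by the check-out barrier.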
Interestingly, the mechanism of barrier synchronization comes very close to Estelle's
parent-child synchronization. In addition, the implementation uses one shared variable less
than lock synchronization, avoiding a considerable amount of communication between the
processors. That results in a much more efficient synchronization. However, barriers have a
serious drawback: threads waiting at a barrier perform busy waiting. As soon as the number of
threads becomes larger than the number of processors, or other processes run on that machine,
the performance will degrade significantly. As a consequence, barriers should only be used
when the number of threads is small. For exact measurement results, see section 5.
4.3 The specification structure
The solution of mapping modules to threads presented so far still does not take the structure
of the specification into account. There are three important issues which limit the processor
usage:
1. The computational complexity of modules running in parallel may be very different.
2. The computational complexity of some modules is so small that synchronization time
exceeds protocol processing time.
3. Estelle's prohibition of parent-child parallelism leaves at least one processor running
idle. When the parent module passes the right to execute to its children, it waits for all
responses, wasting one processor.
The solution for all three problems is the "intelligent" reduction of Estelle parallelism by
mapping more than one module onto one thread. As a result, we get an increase in the overall
system utilization, as there are more processors available for other modules.
From the three problems described above, we derive the following mapping rules:
1. Threads which are running in parallel but are synchronized due to Estelle semantics
(i.e. children of the same parent) should have nearly equal computational complexity.
Threads that have finished will then not have to wait a long time. For asynchronous modules
(i.e. systems), we need no such rule, as threads running in such a way do not have to wait
for each other.
2. Modules whose synchronization time exceeds their protocol processing time should not run
exclusively in their own thread. Running them sequentially together with another module
will save synchronization time.
3. The thread executing a parent module should always include one child module. This
thread will then not wait, but execute a child module in parallel to other threads running
other child modules of the same level.
[Figure 2: Mapping of modules to threads in the first solution. Panels: (a) Example thread
mapping in the first approach; (b) Thread synchronization in the first approach
(6 processors).]
A comparison of the traditional and the improved thread mapping scheme is illustrated in
Figures 2 and 3. Let us assume that modules $M_0$, $M_1$, $M_2$, $M_3$, $M_4$ and $M_5$ have
average execution times of 20, 80, 70, 10, 30 and 30 milliseconds, respectively^2. So modules
$M_1$ and $M_2$ have nearly equal protocol execution times, while module $M_3$ has a much
shorter one. For that reason, in the new approach, it is grouped together with modules $M_4$
and $M_5$. In the traditional solution, we have a total execution time of 100 ms^3 using six
processors; in the new one, we also have 100 ms^4, but using only three processors. The
remaining three processors may be assigned to other modules.
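The arithmetic of this example can be retraced with a short Python sketch. The execution times are those given in the text; the tree shape (with $M_4$ and $M_5$ as children of $M_3$) is an assumption read off Figure 2, and the function names are hypothetical. Synchronization overhead is ignored.

```python
# Average execution times in milliseconds, as given in the text.
TIMES = {"M0": 20, "M1": 80, "M2": 70, "M3": 10, "M4": 30, "M5": 30}
# Assumed module tree: M0 is the parent of M1..M3, M3 the parent of M4 and M5.
TREE = {"M0": ["M1", "M2", "M3"], "M3": ["M4", "M5"]}


def one_thread_per_module(module):
    """Traditional mapping: the parent runs first, then all child subtrees in parallel."""
    children = TREE.get(module, [])
    longest_child = max((one_thread_per_module(c) for c in children), default=0)
    return TIMES[module] + longest_child


def grouped(parent, groups):
    """Improved mapping: after the parent, each group runs sequentially on one thread.

    The groups run in parallel; the first group shares the parent's thread.
    """
    return TIMES[parent] + max(sum(TIMES[m] for m in group) for group in groups)


if __name__ == "__main__":
    print("traditional:", one_thread_per_module("M0"), "ms on", len(TIMES), "threads")
    groups = [["M1"], ["M2"], ["M3", "M4", "M5"]]
    print("improved:   ", grouped("M0", groups), "ms on", len(groups), "threads")
```

Both mappings need 100 ms in this model, but the grouped one occupies three threads instead of six.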
To determine the runtime of real protocol modules, we instrument Estelle implementations with
time measurement routines allowing us to extract protocol execution times during runtime. So
we are able to detect modules which have very small execution times, and we are also able to
detect modules which are running in parallel but have very different execution times.
Currently, our system does not yet allow us to compare protocol execution and synchronization
time. This comparison is highly implementation- and machine-dependent. Such numbers must be
added manually to the machine configuration file. For the KSR1, we extracted the
synchronization time from [PB93].
^2 We concentrate on average execution times. Experience shows that there is very little data
dependency in protocol processing.
^3 20 ms from $M_0$ and 80 ms from $M_1$, which is the longest-running of the three modules
running in parallel.
^4 Again, 20 ms from $M_0$ and 80 ms from $M_1$. The sequential execution of modules $M_3$,
$M_4$ and $M_5$ only adds up to 70 ms, less than the execution time of the concurrent module
$M_1$.
[Figure 3: Mapping of modules to threads in the improved solution. Panels: (a) Example thread
mapping in the second approach; (b) Thread synchronization in the second approach
(3 processors).]
4.4 The new configuration process
The new protocol software mapping and configuration methodology takes into account all three
factors described above.
Configuration Methodology:
1. Generate the implementation code from the specification, compile it and link it with the
modified libraries implementing the Estelle semantics and the module mapping.
2. The protocol software runs sequentially for a certain amount of time specified in the
configuration file. Each module measures its runtime.
3. The protocol is configured for parallel execution. The configuration process uses no more
threads than there are processors available.
4. The protocol software runs in parallel for the amount of time specified in the
configuration file. This is the operational phase in the software life cycle. Each module
keeps measuring its runtime.
5. Go to step 3.
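Step 3 can be illustrated with a small grouping heuristic. The Python sketch below is not the algorithm of our compiler: it uses the well-known longest-processing-time-first heuristic as a stand-in, treats the modules as independent (ignoring parent-child constraints and communication costs), and the name `configure` is hypothetical. It only shows how measured runtimes can drive a mapping that never uses more threads than processors.

```python
import heapq


def configure(runtimes, n_processors):
    """Group modules into at most n_processors threads, balancing the measured load.

    runtimes: dict mapping module name to measured runtime.
    Returns a list of module-name lists, one list per thread.
    """
    n_threads = min(n_processors, len(runtimes))
    heap = [(0.0, i) for i in range(n_threads)]     # (current load, thread index)
    threads = [[] for _ in range(n_threads)]
    # Place the longest-running modules first, always on the least loaded thread.
    for name, runtime in sorted(runtimes.items(), key=lambda kv: (-kv[1], kv[0])):
        load, index = heapq.heappop(heap)
        threads[index].append(name)
        heapq.heappush(heap, (load + runtime, index))
    return threads


if __name__ == "__main__":
    measured = {"M1": 80, "M2": 70, "M3": 10, "M4": 30, "M5": 30}
    print(configure(measured, 3))
```

Repeating this grouping with fresh measurements after each operational phase corresponds to the loop formed by steps 3 to 5.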
As a result, we get dynamic protocol configuration depending on the number of processors,
the Estelle specification structure and the execution time of all the protocol modules. For a
comparison between the traditional methodology and ours, see Fig. 4.
[Figure 4: Configuration process. Panels: (a) The traditional approach; (b) An improved
solution.]
5 Experimental Results
The described process was implemented [Ber94]. As the goal of our code generation was to
obtain efficient protocol implementations for high-speed networks, we were very interested in
the performance improvements between a standard configuration and the improved version.
Thread Synchronization
First, we implemented the barrier synchronization mechanism. Table 2 presents measurement
results for Estelle specifications implemented with either 8, 15 or 22 threads. For all three
measurements, the maximum number of processors was 16. Obviously, the synchronization
times do not depend on the number of data requests exchanged between modules. Barrier
synchronization is between 10% and 25% faster than lock synchronization, but only as long
as the number of threads is smaller than the number of processors. Once the number of threads
exceeds the number of processors, barrier synchronization performance decreases heavily due
to busy waiting of threads at the barrier.

                 Speedup with 16 processors and
Data requests    8 threads    15 threads    22 threads
          100         1.25          1.17          0.04
          200         1.21          1.14          0.05
          300         1.22          1.13          0.05
          400         1.17          1.11          0.05
          500         1.25          1.11          0.05
          600         1.18          1.12          0.06
          700         1.14          1.14          0.05
          800         1.11          1.16          0.05
          900         1.19          1.13          0.06
         1000         1.14          1.15          0.05
Table 2: Barrier vs. Lock Synchronization
Conguration Methodology
To compare the new conguration methodology to the fully parallel version, we performed
measurements with a slightly modied specication. We left out connection establishment
and began immediately with the data transfer phase assuming that the connection is already
established. Also, vertical dependencies were eliminated. We implemented a conguring and a
fully parallel version as well as { for comparison purposes { a sequential version. The conguring
version was restricted to four processors. The sequential version used one processor, while,
during one experiment, we had the fully parallel version run on four processors and, during
another, on as many processors as were needed to get full parallelism. For all experiments, we
varied the machine load between one and eight connections. Thus, the fully parallel version
used at most 17 processors (2 per connection and one for the root module). The results are
shown in Figure 5.
The diagram shows that the new configuration methodology is far better than the fully
parallelized implementation. It is better not only when both run on the same number of
processors, but also when it runs on far fewer processors than the fully parallel version^5.
We achieve a speedup between 3.3 and 3.5 compared to the sequential execution. This is very
close to the optimum of four (the number of processors for the configuring version). It is
also obvious that it is unreasonable to use (many) more threads than there are processors
available. The parallel version on four processors shows a dramatic runtime increase as the
number of threads increases.
In Figures 6 and 7, we see a typical snapshot of the processor usage of the two versions
(fully parallel and configured) on the KSR. In the parallel version, usage differs widely
between individual processors, and the average load of a processor is relatively small. The
configured version shows an equal load distribution on its four processors. For each
processor, the load is relatively high. Thus, we achieve a better usage of the processors
involved in system execution, resulting in the same or even better performance.
^5 This last point is mainly due to the specification structure, which introduces a small
disadvantage for the parallel version because of synchronization variable accesses.
[Figure 5: Runtime Comparison between different implementations. Plot of runtime [s] against
the number of connections (machine load, 1 to 8) for four versions: sequential, configured
with 4 processors, parallel with 4 processors, parallel with up to 20 processors.]
These results have two important implications: First, parallel protocol implementations will
not only be successful on massively parallel supercomputers such as the KSR or an Intel
Paragon, but also on much lower priced multiprocessor workstations equipped with two, four
or even more processors, e.g. a SUN SPARC 20. Second, we could easily restrict the protocol
execution to fewer processors and achieve the same performance. This may be used for Quality
of Service issues. With our compiler, we can e.g. restrict a certain set of connections in a
multimedia system (the non-time-critical ones) to one or two processors and can then reserve
the other processors for connections with very high performance requirements, supporting the
users' QoS specification.
6 Conclusions and Outlook
In previous work, we developed an Estelle code generator that automatically derives parallel
code from an Estelle specification and configures parallel protocol software for a
multiprocessor. As we were not fully satisfied with the performance results in comparison
with theoretical speedup bounds, we developed an improved code generator and runtime system
which uses information about the specification structure and the runtime environment to
increase the speedup. We performed several measurements with the new system and showed its
usefulness, especially for machines with few processors.
[Figure 6: Typical processor usage for the parallel version]
[Figure 7: Typical processor usage for the configured version]
Our current measurements are based on the generalized protocol stack described in Fig. 1 and
on some protocol subset specifications. But we are also interested in results for "real"
systems. We already have results for the ISO Session and Presentation Layers [Hof94], and we
are currently implementing an application layer protocol for multimedia systems using our
technology [KFE94].
We are also thinking about a Quality of Service-based mapping of Estelle modules to
processors. Thus, the configuration would become more requirement-driven and could be
controlled by the user. Currently, it is driven by actual module load and does not take into
account any QoS specifications.
References
[Ber94] Nicole Berier. Ein Werkzeug zur ezienten Implementierung von Kommunikation-
sprotokollen auf Multiprozessorsystemen. Master's thesis, University of Mannheim,
Praktische Informatik IV (in German), to appear in December 1994.
REFERENCES 16
[BG93] Jan Bredereke and Reinhard Gotzhein. Increasing the concurrency in Estelle. In
Richard L. Tenney, Paul D. Amer, and Ümit Uyar, editors, Formal Description
Techniques VI - Forte'93, Boston, USA. Participants' proceedings, October 1993.
[BZ92] T. Braun and M. Zitterbart. Parallel Transport System Design. In Danthine and
Spaniol [DS92], pages H3:1-H3:16.
[CT90] D. D. Clark and D. L. Tennenhouse. Architectural Considerations for a New
Generation of Protocols. In SIGCOMM '90 Symposium Communication Architectures
& Protocols, pages 200-208, Philadelphia, September 1990.
[DDK+90] W. Doeringer, D. Dykeman, M. Kaiserswerth, B. Meister, H. Rudin, and
R. Williamson. A Survey of Light-Weight Transport Protocols for High-Speed
Networks. IEEE Transactions on Communications, 38(11):2025-2039, November
1990.
[DS92] A. Danthine and O. Spaniol, editors. 4th IFIP conference on high performance
networking, Liege, 1992.
[FH93] Stefan Fischer and Bernd Hofmann. An Estelle Compiler for Multiprocessor
Platforms. In Richard L. Tenney, Paul D. Amer, and Ümit Uyar, editors, Formal
Description Techniques VI - Forte'93, Boston, USA. Participants' proceedings,
October 1993.
[HEHK92] B. Hofmann, W. Effelsberg, T. Held, and H. König. On the Parallel Implementation
of OSI Protocols. In IEEE Workshop on the Architecture and Implementation of
High Performance Communication Subsystems, Tucson, Arizona, February 1992.
[Hof94] B. Hofmann. Deriving Efficient Protocol Implementations from Estelle
Specifications. PhD thesis, Universität Mannheim, Praktische Informatik IV, 1994. (in
German).
[HP91] N.C. Hutchinson and L.L. Peterson. The x-Kernel: An Architecture for
Implementing Network Protocols. IEEE Transactions on Software Engineering, January
1991.
[ISO89] Information processing systems - Open Systems Interconnection - Estelle: A formal
description technique based on an extended state transition model. International
Standard ISO 9074, 1989.
[JJ89] C. Jard and J. M. Jézéquel. A Multi-Processor Estelle to C-Compiler to Prototype
Distributed Algorithms on Parallel Machines. In E. Brinksma, G. Scollo, and C. A.
Vissers, editors, Protocol Specification, Testing, and Verification, IX, pages 161-174.
IFIP WG 6.1, Elsevier Science Publishers B.V. (North-Holland), Amsterdam,
1989.
[KFE94] Ralf Keller, Stefan Fischer, and Wolfgang Effelsberg. Implementing Movie Control,
Access and Management - from a Formal Description to Working Multimedia
System. In Liba Svobodova, editor, Int. Conference on Distributed Computing Systems
- ICDCS14, Poznan, Poland. Participants' Proceedings. IEEE, June 1994.
[MT93] A. Mitschele-Thiel. The DSPL Programming Environment. In Proc. Conf. on
Programming Models for Massively Parallel Computers, Berlin. IEEE Computer
Society Press, September 1993.
[PB93] Jean-Daniel Pouget and Helmar Burkhart. Performance Studies on the KSR1.
In Robert Schumacher, editor, One Year KSR1 at the University of Mannheim -
Results and Experiences, Technical Report No. 35/93. Computing Center, University
of Mannheim, 1993.
[Pet91] D. Peter. Entwurf, Realisierung und Integration eines Protokolls zur verteilten
Ausführung von Estelle-Spezifikationen. Master's thesis, Universität Hamburg,
February 1991.
[PPVW92] T. Plagemann, B. Plattner, M. Vogt, and T. Walter. A Model for Dynamic
Configuration of Light-Weight Protocols. In IEEE Third Workshop on Future Trends
of Distributed Computing Systems, 1992.
[PS92] T.F. La Porta and M. Schwartz. A High-Speed Protocol Parallel Implementation:
Design and Analysis. In Danthine and Spaniol [DS92], pages C3:1-C3:16.
[SS93] Rachid Sijelmassi and Brett Strausser. The PET and DINGO tools for deriving
distributed implementations from Estelle. Computer Networks and ISDN Systems,
25(7):841-851, 1993.
[Svo89] L. Svobodova. Measured Performance of Transport Service in LANs. Computer
Networks and ISDN Systems, 18(1):31-45, 1989.
[Zit92] M. Zitterbart. Parallel Protocol Implementations on Transputers - Experiences
with OSI TP4, OSI CLNP, and XTP. In IEEE Workshop on the Architecture and
Implementation of High Performance Communication Subsystems, Tucson,
Arizona, February 1992.
