Modula-2* and its Compilation
Michael Philippsen and Walter F. Tichy
Universität Karlsruhe
email: philippsen@ira.uka.de
This paper appeared in First International Conference of the Austrian Center for Paral
lel Computation Salzburg Austria  pages 	
 Springer Verlag Lecture Notes in
Computer Science  
Abstract
Modula an extension of Modula is a program
ming language for writing highly parallel programs in
a machineindependent problemoriented way The
novel attributes of Modula are that programs are
independent of the number of processors independent
of whether memory is shared or distributed and inde
pendent of the control modes SIMD or MIMD of a
parallel machine
This article briefly describes Modula-2* and discusses its major advantages over the data-parallel programming model. We also present the principles of translating Modula-2* programs to MIMD and SIMD machines and discuss the lessons learned from our first compiler, which targets the Connection Machine. We conclude with important architectural principles required of parallel computers to allow for efficient compiled programs.
1 Introduction
Highly parallel machines with thousands and tens of thousands of processors are now being manufactured and used commercially. These machines are of rapidly growing importance for high-speed computation. They have also initiated a major shift within Computer Science, from the sequential to the parallel computer. One of the major problems we face in the use of these new machines is programmability: how to write, with no more than ordinary effort, programs that bring the raw power of a parallel computer to bear on a problem.
Two major approaches to the programming problem can be distinguished. The first is to automatically parallelize sequential software. Although there is overwhelming economic justification for it, this approach will meet with only limited success in the short to medium term (see, for instance, [ ]). The goal of automatically producing parallel programs can only, if ever, be achieved by program transformations that start with the problem specification and not with a sequential implementation. In a sequential program, too many opportunities for parallelism have been hidden or eliminated.
The second approach is to write programs that are explicitly parallel. We claim that only minor extensions of existing programming languages are required to express highly parallel programs. Thus, programmers will need only moderate additional training, mainly in the area of parallel algorithms and their analysis. This area, fortunately, is well developed (see, for instance, textbooks [ ] and [ ]). In compiler technology, however, new techniques must be found to map machine-independent programs to existing architectures, while at the same time parallel machine architecture must evolve to efficiently support the features that are required for problem-oriented programming styles.
We take the approach of expressing parallelism explicitly, but in a machine-independent way. In section 2 we analyze the problems that plague most parallel programming languages today. Section 3 then presents Modula-2*, an extension of Modula-2 [ ], for the explicit formulation of highly parallel programs. The extension is small and easy to learn, but provides a programming model that is far more general and machine-independent than other proposals. Next, we discuss compilation techniques for targeting MIMD and SIMD machines and report on experience with our first Modula-2* compiler [ ] for the Connection Machine. We conclude with properties of parallel machine architectures that would improve the efficiency of high-level parallel programs.
2 Related Work
Most current programming languages for parallel and highly parallel machines, including *LISP, C*, MPL, VAL, Sisal, Occam, Ada, FORTRAN, Blaze, Dino, and Kali [ ], suffer from some or all of the following problems:

• Whereas the number of processors of a parallel machine is fixed, the problem size is not. Because most of the known parallel languages do not support the virtual processor concept, the programmer has to write explicit mappings for adapting the process structure of each program to the available processors. This is not only a tedious and repetitive task, but also one that makes programs non-portable.
• Co-locating data with the processors that operate upon the data is critical for the performance of distributed memory machines. Poor co-location results in high communication costs and poor performance. Good co-location is highly dependent on the topology of the communication network and must, at present, be programmed by hand. It is a primary source of machine dependence.
• All parallel machines provide facilities for inter-process communication, most of them by means of a message passing system. Nearly all parallel languages support only low-level send and get communication commands. Programming communication with these primitives, especially if only nearest-neighbor communication is available, is a time-consuming and error-prone task.
• There are several control modes for parallel machines, including MIMD, SIMD, dataflow, and systolic modes. Any extant parallel language targets exactly one of those control modes. Whatever the choice, it severely limits portability as well as the space of solutions.
Modula provides solutions to the basic problems
mentioned above The language abstracts from the
memory organization and from the number of physi
cal processors Mapping of data to processors is per
formed by the compiler optionally supported by high
level directives provided by the programmer Com
munication is not directly visible Instead reading
and writing in a virtually shared address space sub
sumes communication A shared memory however is
not required Parallelism is explicit and the program
mer can choose among synchronous and asynchronous
execution mode at any level of granularity Thus pro
grams can use SIMDmode where proper synchroniza
tion is dicult or use MIMDmode where synchro
nization is simple or infrequent The two modes can
even be intermixed freely
The dataparallel approach discussed in  and ex
empli
ed in languages such as LISP C and MPL
is currently quite successful because it has reduced
machine dependence of parallel programs Data
parallelism extends a synchronous SIMD model with
a global name space which obviates the need for ex
plicit message passing between processing elements
It also makes the number of virtual processing ele
ments a function of the problem size rather than a
function of the target machine
The dataparallel approach has three major ad
vantages  It is a natural extension of sequential
programming The only parallel instruction a syn
chronous forall statement is a simple extension of
the well known for statement and is easy to under
stand  Debugging dataparallel programs is not
much more dicult than debugging sequential pro
grams The reason is that there is only a single lo
cus of control which dramatically simpli
es the state
space of a program compared to that of an MIMD
program with thousands of independent loci of con
trol  There is a wide range of dataparallel al
gorithms Most parallel algorithms in textbooks are
dataparallel compare for instance   According
to Fox  more than  of the  existing paral
lel applications he examined fall in the class of syn
chronous dataparallel programs Furthermore sys
tolic algorithms as well as vectoralgorithms are spe
cial cases of dataparallel algorithms
But dataparallelism at least as de
ned by cur
rent languages has some drawbacks  It is a syn
chronous model Even if the problem is not amenable
to a synchronous solution there is no escape In par
ticular parallel programs that interact with stochastic
events are awkward to write and run ineciently 
There is no nested parallelism This means that once
a parallel activity has started the involved processes
cannot start up additional parallel activity A paral
lel operation simply cannot expand itself and involve
more processes This property seriously limits parallel
searches in irregular search spaces for example The
eect is that dataparallel programs are strictly bi
modal They alternate between a sequential and a par
allel mode where the maximal degree of parallelism is

xed once the parallel mode is entered To change the
degree of parallelism the program 
rst has to stop all
parallel activity and return to the sequential mode 
The use of procedures to structure a parallel program
in a topdown fashion is severely limited The problem
here is that it is not possible to call a procedure in par
allel mode when the procedure itself invokes parallel
operations this is a consequence of  Procedures
cannot allocate local data and spawn data parallel op
erations on it unless they are called from a sequential
program Thus procedures can only be used in about

half of the cases where they would be desirable They
also force the use of global data structures on the pro
grammer
When designing Modula-2*, we wanted to preserve the main advantages of data-parallel languages while avoiding the above drawbacks. The following list contains the main advances of Modula-2* over data-parallel languages:
• The programming model of Modula-2* is a strict superset of data-parallelism. It allows both synchronous and asynchronous parallel programs.
• Modula-2* is problem-oriented in the sense that the programmer can choose the degree of parallelism and mix the control mode (SIMD-like or MIMD-like) as needed by the intended algorithm.
• Parallelism may be nested at any level.
• Procedures may be called from sequential or parallel contexts and can generate parallel activity without any restrictions.
• Modula-2* is translatable effectively for both SIMD and MIMD architectures.
3 The Language Modula-2*
Modula has been chosen as a base for a paral
lel language because of its simplicity There are no
reasons why similar extensions could not be added
to other imperative languages such as FORTRAN
or ADA The necessary extensions were surprisingly
small They consist of synchronous and asynchronous
versions of a forall statement plus simple optional
declarations for mapping array data onto processors
in a machine independent fashion An interconnec
tion network is not directly visible in the language We
assume a shared address space among all processors
though not necessarily shared memory There are no
explicit message passing instructions instead reading
and writing locations in shared address space subsume
message passing This approach simpli
es program
ming dramatically and assures network independence
of programs The burden of distinguishing between lo
cal and nonlocal references and substituting explicit
message passing code for the latter is placed on an
optimizing compiler The programmer can in	uence
the distribution of data with a few simple declara
tions but these are only hints to the compiler with no
eect on the semantics of the program whatsoever
3.1 Overview of the forall statement
The forall statement creates a set of processes that execute in parallel. In the asynchronous form, the individual processes operate concurrently and are joined at the end of the forall statement. The asynchronous forall simply terminates when the last of the created processes terminates. In the synchronous form, the processes created by the forall operate in unison until they reach a branch point, such as an if or case statement. At branch points, the set of processes partitions into two or more subsets. Processes within a single subset continue to operate in unison, but the subsets are not synchronized with respect to each other. Thus, the union of the subsets operates in MSIMD¹ mode. A statement causing a partition into subsets terminates when all its subsets terminate, at which point the subsets rejoin to continue with the following statement.
Variants of both the synchronous and asynchronous form of the forall statement have been introduced by previously proposed languages, such as Blaze, C*, Occam, Sisal, VAL, *LISP [ ], and others [ ]. Note also that vector instructions are simple instances of the synchronous forall.
None of the languages mentioned above includes both forms of the forall statement, even though both are necessary for writing readable and portable parallel programs. The synchronous form is often easier to handle than the asynchronous form, because it avoids synchronization hazards. However, the synchronous form may be overly constraining and may lead to poor machine utilization. The combination of synchronous and asynchronous forms in Modula-2* actually permits the full range of parallel programming styles between SIMD and MIMD.
The syntax of the forall is as follows²:

FORALL ident ":" SimpleType IN (PARALLEL | SYNC)
    StatementSequence
END
The identifier introduced by the forall statement is local to the statement and serves as a runtime constant for every process created by the forall. SimpleType is an enumeration or a subrange. The forall creates as many processes as there are elements in SimpleType and initializes the runtime constant of each process to a unique value in SimpleType. The created processes all execute the statements in StatementSequence.

¹ MSIMD: Multiple SIMD. Few (but more than one) instruction streams operate on many data streams. A compromise between SIMD and MIMD.
² We use the EBNF syntax notation of the Modula-2 language definition, with keywords in upper case, | denoting alternation, [...] optionality, and (...) grouping of the enclosed sentential forms.
3.2 The asynchronous forall
The created processes execute StatementSequence concurrently, without any implicit intermediate synchronization. The execution of the forall terminates when all created processes have finished. Thus, the asynchronous forall contains only one synchronization point, at the end. Any additional synchronization must be programmed explicitly with semaphores and the operations WAIT and SIGNAL.
In the following example, an asynchronous forall statement implements a vector addition.

FORALL i:[0..N-1] IN PARALLEL
    z[i] := x[i] + y[i]
END
Since no two processes created by the forall access the same vector element, no temporal ordering of the processes is necessary. The N processes may execute at whatever speed. The forall terminates when all processes created by it have terminated.
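The join-at-the-end behavior can be imitated in an ordinary threaded language. The following Python sketch is only an illustration and not part of Modula-2*; the helper name forall_parallel and the use of a thread pool are assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def forall_parallel(index_range, body):
    """Run body(i) for every i concurrently; return only after all have
    finished. The return is the single synchronization point."""
    indices = list(index_range)
    with ThreadPoolExecutor(max_workers=max(1, len(indices))) as pool:
        # list() waits for every call and re-raises any exception from a body.
        list(pool.map(body, indices))


def vector_add(x, y):
    n = len(x)
    z = [None] * n

    def body(i):
        # Each process writes only z[i], so no two processes conflict.
        z[i] = x[i] + y[i]

    forall_parallel(range(n), body)
    return z
```

Because every body touches a distinct element, the result does not depend on the order or relative speed of the workers.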
A more complicated example, illustrating recursive process creation, is the following. Procedure ParSearch searches a directed, possibly cyclic graph in parallel fashion. It can best be understood by comparing it with depth-first search, except that ParSearch runs in parallel. It starts with a root of the graph and visits nodes in the graph in a parallel and largely unpredictable fashion.
PROCEDURE ParSearch( v: NodePtr );
BEGIN
    IF Marked( v ) THEN RETURN END;
    FORALL s:[0..v^.successors-1] IN PARALLEL
        ParSearch( v^.succ[s] )
    END;
    visit( v );
END ParSearch;
The procedure ParSearch simply creates as many processes as a given node has successors and starts each process with an instance of ParSearch. Before visiting a node, ParSearch has to test whether the node has already been visited and marked. Since multiple processes may reach the same node simultaneously, testing and setting the mark is done in a critical section (implemented with a semaphore associated with each node) by the procedure Marked. If the graph is a tree, no marking is necessary.
3.3 The synchronous forall
The processes created by a synchronous forall execute every single statement of StatementSequence in unison. To illustrate this mode, its semantics for selected statements is described in some detail below.
• A statement sequence is executed in unison by executing all its statements in order and in unison.
• In the case of branching statements such as IF C THEN SS1 ELSE SS2 END, the set of participating processes divides into disjoint and independently operating subsets, each of which executes one of the branches (SS1 and SS2 in the example) in unison. Note that, in contrast to other data-parallel languages, no assumption about the relative speeds or relative order of the branches may be made. The execution of the entire statement terminates when all processes of all subsets have finished.
• In the case of loop statements such as WHILE C DO SS END, the set of processes for any iteration divides into two disjoint subsets, namely the active and the inactive ones (with respect to the loop statement). Initially, all processes entering the loop are active. Every iteration starts with the synchronous evaluation of the loop condition C by all active processes. The processes for which C evaluates to FALSE become inactive. The rest forms the active subset, which executes statement sequence SS in unison. The execution of the whole loop statement terminates when the subset of active processes becomes empty.
Hence synchronous parallel operation closely re
sembles the lockstep operation of SIMD machines
with an important generalization for parallel branches
As an example consider the computation of all
post
x sums of a vector V of length N  The pro
gram should place into V i the sum of all elements
V i   V N   A recursive doubling technique as
in reference  computes all post
x sums in OlogN 
time where N is the length of the vector
Figure 1 illustrates the process. The program operates by computing partial sums of length s = 2^j, where j counts the iterations. The inner forall creates N processes. Note that there is a one-to-one mapping between process numbers and elements of the vector. In each iteration, the length of the partial sums is doubled by parallel summation of neighboring sums. The if statement inside the forall disables all processes that must not participate in the computation during a given iteration.

VAR V : ARRAY [0..N-1] OF REAL;
VAR s : CARDINAL;
BEGIN
    s := 1;
    WHILE s < N DO
        FORALL i:[0..N-1] IN SYNC
            IF (i+s) < N THEN
                V[i] := V[i] + V[i+s]
            END
        END;
        s := s * 2
    END
END
start:              0     1     2     3     4     5     6     7     8     9     10
after iteration 1:  0,1   1,2   2,3   3,4   4,5   5,6   6,7   7,8   8,9   9,10  10
after iteration 2:  0,3   1,4   2,5   3,6   4,7   5,8   6,9   7,10  8,10  9,10  10
after iteration 3:  0,7   1,8   2,9   3,10  4,10  5,10  6,10  7,10  8,10  9,10  10
after iteration 4:  0,10  1,10  2,10  3,10  4,10  5,10  6,10  7,10  8,10  9,10  10
(An entry a,b stands for the partial sum of elements a through b.)
Figure 1: Computing postfix sums of a vector
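The recursive doubling scheme can be checked with a small sequential simulation. The Python sketch below is an illustration under the assumption that the synchronous forall reads all right-hand sides before any element is written; the copy named old models that behavior, and the function name postfix_sums is chosen for this example.

```python
def postfix_sums(v):
    """Recursive doubling: afterwards v[i] holds v[i] + ... + v[n-1]."""
    v = list(v)
    n = len(v)
    s = 1
    rounds = 0
    while s < n:
        # Synchronous forall: all right-hand sides are read from the old
        # vector before any element is overwritten.
        old = list(v)
        for i in range(n):          # conceptually n processes in unison
            if i + s < n:           # processes with i + s >= n are disabled
                v[i] = old[i] + old[i + s]
        s *= 2
        rounds += 1
    return v, rounds
```

For a vector of 11 elements the loop runs four times, matching the four rows below the starting row in the figure.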
3.4 Allocation of array data
Colocation of data with the processors that access
the data is important for parallel machines without
uniform access time to memory locations Poor align
ment of data and processors may cause excessive com
munication overhead We therefore provide a simple
machineindependent construct for controlling the al
location of array data This construct is optional and
does not change the meaning of a program it aects
only performance A compiler for a machine with uni
form memory access time may ignore the construct
The allocation of array data to processors is controlled with one allocator per dimension. The modified declaration syntax for arrays is as follows:

ArrayType = ARRAY SimpleType [allocator]
            {"," SimpleType [allocator]} OF type
allocator = LOCAL | SPREAD | CYCLE | RANDOM |
            SBLOCK | CBLOCK
Array elements whose indices differ only in dimensions that are marked LOCAL are associated with the same processor. This facility is used to avoid distribution of data in a given dimension.
Dimensions with allocator SPREAD are divided into segments, one for each of the available processors. A vector with n elements is assigned to P processors by allocating a segment of length ⌈n/P⌉ to each processor. This allocation utilizes all available processors while minimizing the cost of nearest-neighbor communication.
Dimensions with allocator CYCLE are distributed in a round-robin fashion over the available processors. Given P processors, the elements of a vector whose indices are identical modulo P are associated with the same processor. In contrast to SPREAD, CYCLE maximizes the cost of nearest-neighbor communication, since neighboring array elements are always in different processors. In return, it leads to better processor utilization if a parallel algorithm operates on subsegments of a vector at a time.
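The two one-dimensional mappings can be written down as owner functions. The following Python sketch assumes 0-based indices and processors numbered 0 to P-1; the function names are hypothetical and serve only to contrast the two allocators.

```python
def spread_owner(i, n, p):
    """SPREAD: contiguous segments of length ceil(n / p)."""
    segment = (n + p - 1) // p
    return i // segment


def cycle_owner(i, p):
    """CYCLE: round-robin; indices equal modulo p share a processor."""
    return i % p


def remote_neighbors(owner, n):
    """Count index pairs (i, i+1) whose elements live on different processors."""
    return sum(1 for i in range(n - 1) if owner(i) != owner(i + 1))
```

With n = 12 and P = 4, SPREAD separates only 3 of the 11 neighboring pairs, whereas CYCLE separates all 11. Conversely, an algorithm working on the first four elements keeps two processors busy under SPREAD and all four under CYCLE.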
Dimensions with allocator RANDOM are distributed randomly over the available processors. In contrast to CYCLE, RANDOM leads to better processor utilization if a parallel algorithm accesses the dimension in a random pattern.
If SPREAD, CYCLE, or RANDOM applies to several successive dimensions, then these dimensions are unrolled into one pseudo-vector with a length that is the product of the lengths of the individual dimensions. This scheme idles fewer processors than applying SPREAD, CYCLE, or RANDOM to individual dimensions.
Allocators SBLOCK and CBLOCK apply SPREAD and CYCLE, respectively, to each dimension individually. For two successive dimensions, SBLOCK has the effect of creating rectangular subarrays and assigning those to the available processors. With this arrangement, nearest-neighbor communication in all dimensions is best supported when the interconnection network can be configured into the same number of dimensions as the arrays.
CBLOCK for two dimensions also creates two-dimensional subarrays, but the rows and columns of these subarrays are then distributed independently in a round-robin fashion over the processor grid. Again, SBLOCK minimizes nearest-neighbor communication, while CBLOCK allows high processor utilization if smaller subarrays are processed in parallel.
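A possible reading of the two block allocators for a two-dimensional array is sketched below in Python. The processor grid of pr by pc processors, the 0-based indices, and the function names are assumptions of this sketch; the paper does not prescribe how the grid is chosen.

```python
def sblock_owner(i, j, rows, cols, pr, pc):
    """SBLOCK: SPREAD in each dimension, giving rectangular subarrays."""
    seg_r = (rows + pr - 1) // pr
    seg_c = (cols + pc - 1) // pc
    return (i // seg_r, j // seg_c)


def cblock_owner(i, j, pr, pc):
    """CBLOCK: CYCLE in each dimension of the processor grid."""
    return (i % pr, j % pc)


def owners_of_subarray(owner, rows, cols):
    """Set of processors that hold elements of the top-left rows x cols block."""
    return {owner(i, j) for i in range(rows) for j in range(cols)}
```

For a 4 x 4 array on a 2 x 2 grid, the top-left 2 x 2 subarray lives on a single processor under SBLOCK but is spread over all four processors under CBLOCK.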
4 Implementing Modula-2*
When discussing the principles of compiling Modula-2*, we first present the more difficult case of compiling for MIMD and then introduce the simplifications for SIMD. The latter have actually been implemented in our CM-2 compiler. Although this first compiler does not contain sophisticated optimization, it helped us in understanding the main sources of optimizations when compiling for massively parallel machines.
4.1 Asynchronous forall
4.1.1 Asynchronous forall: MIMD implementation
A straightforward approach to implementing the forall is to create the required number of lightweight processes, or threads. Threads of a forall share the same program (the enclosed statement sequence), but have their own stacks. The resulting runtime structure is a tree of stacks.
On a distributed memory machine, it pays to replicate the entire program to all processors. Code replication can be accomplished quickly during start-up, either using a broadcast facility or recursive doubling.
The main problems when compiling forall statements are thread creation, thread termination, and load balancing. All of these problems must be solved in a parallel fashion. Sequential implementations would cause a serious bottleneck and, for algorithms with fine granularity, result in essentially sequential programs. Note also that foralls may be nested, so there may be several new sets of threads being created simultaneously.
The process reaching a forall, called the spawner, must create the set of threads prescribed by the forall. The spawner actually creates only the initial thread of the forall, called the leader. (A simple optimization is to let the spawner take on the role of the leader.) The leader then replicates itself. All threads created keep replicating themselves again and again, until the required number is obtained. This method is another variant of recursive doubling. A small number of parameters controls the replication. When a thread has replicated itself a sufficient number of times, it simply jumps to the beginning of the forall's code sequence and begins execution.
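One way such a replication could be parameterized is for each thread to carry the range of forall indices it is still responsible for. The Python simulation below is a sketch under that assumption; the paper does not specify which parameters control the replication, and the function name replicate is hypothetical.

```python
def replicate(n):
    """Simulate a leader replicating itself until n >= 1 threads exist.

    Each thread owns a half-open range [lo, hi) of forall indices. In every
    round, each thread whose range holds more than one index hands the upper
    part to a newly created thread. Returns the index of every final thread
    and the number of rounds needed.
    """
    threads = [(0, n)]          # the leader is responsible for everything
    rounds = 0
    while any(hi - lo > 1 for lo, hi in threads):
        next_threads = []
        for lo, hi in threads:  # conceptually, all threads act in parallel
            if hi - lo > 1:
                mid = (lo + hi + 1) // 2
                next_threads.append((lo, mid))   # keep the lower part
                next_threads.append((mid, hi))   # new thread: upper part
            else:
                next_threads.append((lo, hi))
        threads = next_threads
        rounds += 1
    return sorted(lo for lo, _ in threads), rounds
```

Because the largest range is halved in every round, the number of rounds grows with the logarithm of n: 1000 threads are created in 10 rounds.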
Synchronization and thread termination at the forall's end follow the same pattern. Each thread has a semaphore for receiving termination signals from other threads. A thread that reaches the end of its forall first waits for termination signals from all the threads it spawned during the replication process, then signals its creating thread and destroys itself. If all n threads of a forall terminate at about the same time, then the leader learns about the combined termination in time proportional to O(log n), signals the spawner, and kills itself (or simply resumes the role of the spawner).
The problem of load balancing is to distribute threads over the available processors so that (1) the load on the processors is equalized, (2) the threads are co-located with their data, and (3) co-scheduling of threads within the same forall instance becomes possible. Again, a centralized solution must be avoided. One possibility is for processors to keep a running total of ready processes and the overall average. The overall average can be updated periodically (say, at the end of a time-slice) by another recursive doubling technique in which all processors participate. Newly created threads are moved between neighboring processors depending on the current load in comparison to the average. Under certain circumstances, migrating a long-running thread, including its data, to another processor may be advantageous. In addition, static compiler analysis can indicate preferred processors for co-locating data and threads.
Coscheduling of threads in the same forall is nec
essary to avoid delays inherent in context swaps when
the threads communicate Without coscheduling
communicating threads may enter a situation where
they execute alternatingly or in coroutine fashion in
stead of in parallel  Coscheduling can be ac
complished by increasing the thread priority with the
nesting depth of foralls or by providing special mech
anisms for task forces ie for scheduling groups of
threads simultaneously
Obviously thread creation termination and load
balancing must be as fast as possible Various opti
mizations for bulk thread generation are feasible but
will not be discussed here for lack of space
The above techniques have not been implemented in our first compiler, since the CM-2 is a SIMD machine. However, work has started on a Modula-2* compiler targeting a Transputer cluster, where the techniques will be used. We are also exploring special hardware facilities to speed up these tasks.
 Asynchronous forall SIMD implementa
tion
The synchronous nature of a SIMD machine, coupled with the broadcast bus from the front-end, makes all three of thread creation, termination, and load balancing operate in constant time or nearly constant time. For generality, assume nested foralls: m threads each execute a forall statement, each creating n new threads. Thus, the number of threads to be created is t = n · m. If n is not uniform for all the m spawners, then an O(log m)-time summation, instead of a constant-time multiplication, must be performed to compute t.
Once t is known, t stacks are created by assigning to each of p processors a segment of ⌈t/p⌉ stacks. This operation takes constant time and balances the load perfectly. Process termination also takes constant time, since there is no synchronization overhead. However, it may be necessary to provide each thread with some initial data, such as its number, during creation. Spreading this information again takes logarithmic time, but, as demonstrated by the Connection Machine, special instructions for spreading data are so fast that in practice they can be regarded as constant.
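The arithmetic of bulk thread creation is small enough to spell out. The Python sketch below assumes that stacks are numbered consecutively from 0 and that processor k receives the k-th segment; the function names are chosen for this illustration.

```python
def total_threads(counts):
    """Number of threads to create, given the count n_j of each of m spawners."""
    return sum(counts)          # equals n * m when every n_j is the same n


def assign_stacks(t, p):
    """Give each of p processors a contiguous segment of ceil(t / p) stacks."""
    segment = (t + p - 1) // p
    return [range(min(k * segment, t), min((k + 1) * segment, t))
            for k in range(p)]
```

With t = 10 threads and p = 4 processors, the segments hold 3, 3, 3, and 1 stacks, so no processor carries more than ⌈t/p⌉ threads.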
What remains to be discussed is the scheduling of instructions. Since the asynchronous forall prescribes no scheduling of the threads at all, the compiler writer can choose one that works well on a given SIMD or MSIMD machine. We describe briefly the implementation we chose for the Connection Machine CM-2. We assume initially that the number of available processors equals the number of threads.
Activity Bits. The central idea of control flow on SIMD computers is deactivation and reactivation of processors, controlled by an activity bit associated with each processor. When the activity bit is off, the processor does not execute the instructions issued by the front-end. This facility is sufficient for simulating the usual control flow constructs in a parallel context. All that is needed is a stack of activity bits for each thread. The top of each activity stack is stored into the activity bit of a processor. Suitable manipulation of the activity bits turns threads on and off as required by the instruction stream issuing from the front-end.
There are two small extensions of the usual control flow mechanism for SIMD machines. They are needed for recursion and for exit and return statements. First, consider parallel loops, i.e., loops within a forall. On a SIMD machine, the front-end repeatedly issues the instructions for the loop body until the termination conditions of all threads executing the loop are met. The usual technique is to evaluate a thread's termination condition directly into its activity bit. Before each iteration, the front-end tests whether there are any positive activity bits left. If not, the loop terminates. An exit statement may also terminate a loop by turning off the activity bit of the corresponding thread. However, since an exit statement may be nested several levels deep within a loop, it must not only set the topmost activity bit to false, but all those that have been stacked since the last loop was entered. Similar considerations apply to the return statement.
Consider the following example:

FORALL i:[0..N-1] IN PARALLEL
    LOOP
        IF ODD(i) THEN EXIT END;
        SS
    END
END
When control flow reaches the exit, two activity bits have been stacked for each thread: one for the loop and one for the if statement. To prevent a thread that has already executed the exit from being reactivated after the if, its top two activity bits must be set to FALSE.
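The stacking and clearing of activity bits can be traced in a sequential simulation. The Python sketch below uses a variant of the example whose loop terminates for every thread (each thread exits once a private counter reaches its limit); the variable names and the list-based activity stacks are assumptions of the illustration, not the CM-2 implementation.

```python
def run_parallel_loop(limits):
    """Simulate, on a SIMD machine, for every thread i:

        LOOP
          IF count[i] >= limits[i] THEN EXIT END;
          count[i] := count[i] + 1
        END
    """
    n = len(limits)
    count = [0] * n
    stacks = [[True] for _ in range(n)]      # one activity stack per thread
    issued = 0                               # loop bodies issued by the front-end

    for st in stacks:                        # entering LOOP: push a bit
        st.append(st[-1])
    while any(st[-1] for st in stacks):      # front-end: any active thread left?
        issued += 1
        for i, st in enumerate(stacks):      # entering IF: push the condition bit
            st.append(st[-1] and count[i] >= limits[i])
        for st in stacks:                    # EXIT: clear the IF and the LOOP bit
            if st[-1]:
                st[-1] = False
                st[-2] = False
        for st in stacks:                    # END of IF: pop its bit
            st.pop()
        for i, st in enumerate(stacks):      # rest of the body, active threads only
            if st[-1]:
                count[i] += 1
    for st in stacks:                        # leaving LOOP: pop its bit
        st.pop()
    return count, issued
```

If the exit cleared only the topmost bit, popping the if-bit would expose a loop bit that is still true and the thread would wrongly resume; clearing both bits keeps it inactive until the whole loop has ended.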
Recursion termination is similar to loop termination. If a recursive call occurs inside a parallel if or case, then the front-end must sense whether there is any active thread left in a branch. If not, then the branch terminates. Without this provision, unbounded recursion would ensue.
Parallel Procedure Call. Because procedures can be called from both sequential and parallel contexts, each procedure must be compiled twice: once for executing entirely on the front-end in sequential mode, and a second time for executing within a forall statement. The difference is that in the parallel version, the procedure call and return instructions are executed only on the front-end. Thus, we need two types of stacks. On the front-end, we stack return addresses. On the stacks associated with the parallel threads, we store parameters and local data. This division is a direct consequence of SIMD and would even occur if front-end and parallel processors had the same instruction set. On the CM-2, the instruction sets differ, and so the sequential and parallel versions are completely different.
Our compiler relies on a minor language restriction: procedures may not be nested within each other. The reason is that up-level addressing is quite expensive. Since it is in general unpredictable in what context a procedure is called, each memory access would have to distinguish at runtime whether it references data on the front-end or the parallel processors.
Processor Virtualization Simulating more
threads than there are processors available is called
processor virtualization In SIMD mode it is not pos
sible to simply create new processes on demand and
let the operating system schedule them Instead the

frontend has to issue the instructions implementing
the body of a forall in a loop The number of iter
ations of this loop is given by the ratio of threads to
available processors
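The front-end loop can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the scheme used by any particular machine: run_forall, body, and the round-robin mapping of thread r*P + p to physical processor p are hypothetical choices made for the example.

```python
import math


def run_forall(num_threads, num_processors, body):
    """Issue `body` once per round of virtual threads.

    Returns the number of rounds, ceil(num_threads / num_processors).
    """
    rounds = math.ceil(num_threads / num_processors)
    for r in range(rounds):
        # Within one round the processors conceptually work simultaneously;
        # physical processor p simulates thread r * num_processors + p.
        for p in range(num_processors):
            thread = r * num_processors + p
            if thread < num_threads:  # surplus processors stay inactive
                body(thread)
    return rounds


squares = [0] * 10


def body(i):
    squares[i] = i * i


rounds = run_forall(10, 4, body)
print(rounds)  # 3
```

With 10 threads on 4 processors the loop runs three times, and in the last round two processors are idle.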
The PARIS instruction set of the CM provides
automatic processor virtualization. This means that
processor virtualization is transparent to the
programmer. The firmware simulates as many threads as
required. The maximum number of threads is limited
only by the available memory, because the local
memory of each processor must be shared out among the
assigned threads.
Our Modula compiler uses the automatic proces
sor virtualization However this virtualization is quite
expensive The main reason is that the virtualiza
tion actually implements synchronous virtualization
which requires many temporary variables In essence
this virtualization wraps every single instruction into a
virtualizing loop even though a loop around the entire
body of a forallwould suce since the asynchronous
forall prescribes no scheduling of threads The latter
simulation would be obviously much more ecient
Synchronous forall
Synchronous forall: MIMD implementation
The synchronous forall requires many more
synchronization points than the asynchronous form. There
must be a synchronization point between every two
statements inside a forall and, in the case of the
assignment, even within a single statement. A parallel
assignment of the form L := R means that the value
of R is evaluated synchronously and stored in a
temporary. Similarly, the address represented by L is
evaluated synchronously and stored in a temporary. Only
after both of these parallel evaluations have completed
can the assignment be made. Otherwise interference
is possible, as in the assignment A[i] := A[i-1].
A synchronization point is implemented with a
scheme similar to the one used to terminate an
asynchronous forall, except that now the threads do not
terminate but wait for a signal to proceed. First,
a logarithmic reduction informs the leader that all
threads in the process have reached the
synchronization point. Then a logarithmic doubling process sends
signals back out to the threads to continue.
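The two logarithmic phases can be simulated sequentially. The sketch below is written in Python under simplifying assumptions: thread 0 plays the leader, all threads have already arrived, and barrier_rounds is a hypothetical name; it counts communication rounds rather than modeling real message passing.

```python
def barrier_rounds(n):
    """Simulate one synchronization point for n threads.

    Returns (reduction_rounds, doubling_rounds, released), where
    released is True if every thread received the signal to continue.
    """
    arrived = [True] * n  # all threads have reached the barrier

    # Reduction: pairwise AND toward thread 0, halving the number of
    # partial results in every round.
    reduction_rounds = 0
    stride = 1
    while stride < n:
        for t in range(0, n, 2 * stride):
            if t + stride < n:
                arrived[t] = arrived[t] and arrived[t + stride]
        stride *= 2
        reduction_rounds += 1

    # Doubling: the set of threads holding the signal doubles each round.
    go = [False] * n
    go[0] = arrived[0]
    doubling_rounds = 0
    have = 1
    while have < n:
        for t in range(have):
            if t + have < n:
                go[t + have] = go[t]
        have *= 2
        doubling_rounds += 1

    return reduction_rounds, doubling_rounds, all(go)


print(barrier_rounds(8))  # (3, 3, True)
```

For n threads both phases take ceil(log2 n) rounds, which is why each synchronization point is costly.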
Clearly synchronization points are expensive We
are currently investigating methods to eliminate them
where possible For instance the synchronization
point inside an assignment is not necessary if the left
and right hand sides do not interfere Furthermore by
scheduling processes in a certain fashion the overlaps
may be reduced greatly Even synchronization points
between statements can be eliminated if there are no
dependencies Much of the dependency analysis de
veloped for parallelizing compilers applies here
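What "do not interfere" means can be shown by brute force. The following Python sketch is an assumption-laden illustration, not the static analysis a compiler would perform: assignment_needs_sync is a hypothetical helper that enumerates, for a concrete set of threads, the locations each thread writes and reads.

```python
def assignment_needs_sync(threads, write_index, read_indices):
    """Return True if some thread reads a location written by another thread."""
    writers = {}
    for t in threads:
        writers.setdefault(write_index(t), set()).add(t)
    for t in threads:
        for location in read_indices(t):
            # Interference: a different thread stores into what t reads.
            if writers.get(location, set()) - {t}:
                return True
    return False


# A[i] := A[i] + 1 touches only the thread's own element.
independent = assignment_needs_sync(range(8), lambda i: i, lambda i: [i])

# A[i] := A[i-1] reads the element written by the neighboring thread.
dependent = assignment_needs_sync(range(1, 8), lambda i: i, lambda i: [i - 1])

print(independent, dependent)  # False True
```

A compiler has to decide the same question symbolically from the index expressions, which is where dependence analysis comes in.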
 Synchronous forall SIMD implementa
tion
The SIMD implementation of the synchronous forall
was simple: on the CM, the built-in virtualization does
the job. However, this virtualization cannot take
advantage of the optimizations described above. Instead,
it must make conservative assumptions. The resulting
virtualization is far from efficient. An optimizing
compiler could produce a much faster virtualization in the
majority of cases. Consider the following example:
FORALL i N IN SYNC
Ai 	 Ai 
   
END
Below are two possible virtualizations on p proces
sors expressed in Modula
s 	 CEILINGN p
FORALL j    p IN PARALLEL
FOR i	 js TO MINj
sN
DO
TMPi 	 Ai 
 
TMPi 	 TMPi  
END
END
FORALL j    p IN PARALLEL
FOR i	 js TO MINj
sN
DO
Ai 	 TMPi
END
END
s 	 CEILINGN p
FORALL j    p IN PARALLEL
FOR i	 js TO MINj
sN
DO
reg 	 Ai
reg 	 reg 
 
reg 	 reg  
Ai	 reg
END
END
The first program shows the conservative
virtualization as performed by PARIS. The optimized
version, the second program, exploits the fact that
only one temporary location is required. By using a
single register for it on every processor, the number of
writes to memory is reduced to one third of the
unoptimized version. Furthermore, no synchronization
is necessary. On a SIMD machine this means that
the two loops can be merged; on a MIMD machine
we save the synchronization point. Furthermore, if
the individual processors have a vector capability, the
computation in each processor can even be interleaved.
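The saving in memory writes can be checked with a small sequential simulation. The Python sketch below is an illustration under stated assumptions: the element-wise computation (x + 1) * 2 is chosen arbitrarily for the example, and conservative and optimized are hypothetical function names; each returns the number of writes to memory it performs.

```python
import math


def conservative(a, p):
    """Per-instruction virtualization with a temporary array."""
    n = len(a)
    s = math.ceil(n / p)
    tmp = [0] * n
    writes = 0
    for j in range(p):
        for i in range(j * s, min((j + 1) * s, n)):
            tmp[i] = a[i] + 1   # assumed first operation
            tmp[i] = tmp[i] * 2  # assumed second operation
            writes += 2
    # A synchronization point separates the two loops.
    for j in range(p):
        for i in range(j * s, min((j + 1) * s, n)):
            a[i] = tmp[i]
            writes += 1
    return writes


def optimized(a, p):
    """One loop around the whole body, temporary kept in a register."""
    n = len(a)
    s = math.ceil(n / p)
    writes = 0
    for j in range(p):
        for i in range(j * s, min((j + 1) * s, n)):
            reg = a[i]
            reg = reg + 1
            reg = reg * 2
            a[i] = reg
            writes += 1
    return writes


a1 = list(range(10))
a2 = list(range(10))
w1 = conservative(a1, 4)
w2 = optimized(a2, 4)
print(w1, w2, a1 == a2)  # 30 10 True
```

Both versions compute the same result, but the second performs one memory write per element instead of three.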
While implementing the synchronous forall for the
CM, we have identified the main sources of
optimization in compiling for massively parallel machines. We
have started to include these optimizations in the next
compilers for MasPar, CM, and Transputer, including
the necessary data-dependence analysis.
Recommendations for Parallel Machine Architectures
The following list itemizes some broad requirements
that parallel machine architectures should fulfill to
allow for efficient compiled programs. These
requirements are likely to be encountered when designing the
translation schemes for parallel imperative languages.
• Hardware support for fast process creation and
  synchronization.
• Shared address space. All processors should be
  able to generate addresses for the entire memory
  on the system. In particular, the front end's
  memory should be part of that address space.
  A source of great difficulty in our compiler was
  the many different types of addresses. The
  compiler has to distinguish between local addresses,
  global addresses, addresses in the front end,
  general communication addresses, and communication
  addresses on a grid. Optimizing for all these
  cases is often impossible, even with detailed
  interprocedural analysis. Furthermore, parallel
  pointers are quite expensive to implement without a
  shared address space; one basically has to
  simulate the shared address space in software.
  Note that a shared address space does not imply
  shared memory.
• Uniform communication mechanism. Most parallel
  machines today provide a set of instructions for
  accessing local memory, a second one for accessing
  memory in direct neighbors, and a third set for
  accessing distant memory units. The differences
  in speed are significant and therefore require that
  the compiler detect the faster cases. However, it is
  often impossible to know statically for which case
  to optimize. For instance, we found that in most
  cases it was impossible to determine in the
  compiler whether a procedure would access local or
  non-local memory. The generated code thus has
  to check all three cases at runtime. Such a
  simple and frequently repeated case analysis could be
  done much more efficiently in hardware.
• Autonomous addressing capability. An
  autonomous addressing capability means that each
  processor can generate its own address for
  accessing memory. The Connection Machine does not
  have such a facility; on the CM, each processor
  must use the same address. The lack of
  autonomous addressing not only makes many
  applications awkward to write, but also precludes
  certain optimizations in processor virtualization.
• Single instruction set. SIMD machines today
  typically have different instruction sets for front end
  and parallel processors. This property implies
  that the code generator of the compiler has to
  be written twice. Also, each procedure has to
  be translated twice, doubling code size. A speed
  differential between front end and parallel
  processors, however, does not appear to be a major
  problem.
• Small instruction set. The CM offers a large set of
  PARIS instructions, only a few of which a
  compiler can actually generate. A study determining
  the most frequently used instructions in parallel
  programs is sorely needed.
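The runtime case analysis mentioned under the uniform communication mechanism can be sketched as follows. This Python fragment is a hypothetical illustration: it assumes processors numbered row by row on a two-dimensional grid without wraparound, and access_kind is a name invented for the example; the actual checks in generated code depend on the machine.

```python
def access_kind(me, owner, grid_width):
    """Classify a memory access by where the addressed data resides."""
    if owner == me:
        return "local"
    my_row, my_col = divmod(me, grid_width)
    row, col = divmod(owner, grid_width)
    # Direct neighbors differ by exactly one step in one grid dimension.
    if abs(my_row - row) + abs(my_col - col) == 1:
        return "neighbor"
    return "remote"


print(access_kind(5, 5, 4))   # local
print(access_kind(5, 6, 4))   # neighbor
print(access_kind(5, 15, 4))  # remote
```

When the compiler cannot decide the case statically, a test of this kind precedes every access, which is the overhead that hardware support would remove.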
 Conclusion
Ease of programming as well as portability of
programs will be of overwhelming importance for the
acceptance of highly parallel machines. Modula-2*
supports both: a few extensions of a sequential
programming language suffice for writing highly
parallel, problem-oriented programs, and compilers that
can generate efficient code for a wide range of
parallel machines appear feasible. Improvements in
hardware architecture, operating systems, programming
languages, and compiler technology should eventually
render the current practice of machine-dependent
parallel programming as obsolete as machine-dependent
sequential programming.
References
 Selim G Akl The Design and Analysis of Paral
lel Algorithms Prentice Hall Englewood Clis
New Jersey 
 American National Standards Institute Inc
Washington DC ANSI Programming Language

Fortran Extended Fortran  ANSI X	


 
 Henry E Bal Jennifer S Steiner and Andrew S
Tanenbaum Programming languages for dis
tributed computing systems ACM Computing
Surveys  September 
 Georey C Fox What have we learnt from using
real parallel machines to solve real problems In
Proc of the Third Conference on Hypercube Con
current Computers and Applications volume 
pages  Pasadena CA  ACM Press
New York
 Alan Gibbons and Wojciech Rytter Ecient
Parallel Algorithms Cambridge University Press

 W Daniel Hillis and Guy L Steele Data par
allel algorithms Communications of the ACM
 December 

Charles Koelbel and Piyush Mehrotra. Supporting
shared data structures on distributed memory
architectures. In Proc. of the 2nd ACM SIGPLAN
Symposium on Principles and Practice of
Parallel Programming (PPOPP), March.

Ralf Kretzschmar. Ein Modula-2*-Compiler für
die Connection Machine CM-2. Master's thesis,
University of Karlsruhe, Department of
Informatics, May.

James McGraw, Stephen Skedzielewski, Stephen
Allan, Rod Oldehoeft, John Glauert, Chris
Kirkham, Bill Noyce, and Robert Thomas.
SISAL Language Reference Manual. Lawrence
Livermore National Laboratory, March.

James R. McGraw. The VAL language: Description
and analysis. ACM Transactions on
Programming Languages and Systems, January.

Piyush Mehrotra and John Van Rosendale. The
BLAZE language: A parallel language for
scientific programming. Parallel Computing,
November.

Michael Metcalf and John Reid. Fortran 8x
Explained. Oxford Science Publications.

John K. Ousterhout, Donald A. Scelza, and
Pradeep S. Sindhu. Medusa: An experiment in
distributed operating system structure.
Communications of the ACM, February.

INMOS Limited. Occam Programming Manual.
Prentice Hall, Englewood Cliffs, New Jersey.

M. Rosing, R. Schnabel, and R. Weaver. DINO:
Summary and example. In Proc. of the Third
Conference on Hypercube Concurrent Computers
and Applications, Pasadena, CA.
ACM Press, New York.

Thinking Machines Corporation, Cambridge,
Massachusetts. *Lisp Reference Manual.

Thinking Machines Corporation, Cambridge,
Massachusetts. C* Programming Guide,
November.

Walter F. Tichy. Parallel matrix multiplication on
the Connection Machine. International Journal
of High Speed Computing.

U.S. Government, Ada Joint Program Office.
ANSI/MIL-Std 1815A: Reference Manual for the
Ada Programming Language, January.

Niklaus Wirth. Programming in Modula-2. Third,
corrected edition. Springer-Verlag, Berlin,
Heidelberg, New York.

