Modula-2*: An extension of Modula-2 for highly parallel programs by Tichy, Walter F. & Herter, Christian G.
Research Institute for Advanced Computer Science
NASA Ames Research Center
Modula-2*: An Extension of Modula-2
for Highly Parallel Programs
Walter F. Tichy* and Christian G. Herter
RIACS at NASA Ames Research Center†
RIACS Technical Report 89.34
September 1989
(NASA-CR-188855) MODULA-2*: AN EXTENSION OF
MODULA-2 FOR HIGHLY PARALLEL PROGRAMS
(Research Inst. for Advanced Computer
Science) 22 p CSCL 09B
G3/61
N92-11654
Unclas
0043041
https://ntrs.nasa.gov/search.jsp?R=19920002436 2020-03-17T15:19:53+00:00Z

Abstract
Highly parallel computers with tens of thousands of processors will
be of rapidly growing importance for high-speed computation. Parallel
programs for these machines should be machine-independent, i.e.,
independent of properties that are likely to differ from one parallel
computer to the next. In particular, parallel programs should be inde-
pendent of:
1. memory organization and communication network,
2. number of physical processors available,
3. control mode of the parallel computer (SIMD, MIMD, or MSIMD).
This paper describes extensions of Modula-2 for writing highly par-
allel, portable programs meeting these requirements. The extensions
are:
• Synchronous and asynchronous forms of a forall statement;
• Control of the allocation of data to processors.
Sample programs written with the extensions demonstrate the clarity
of parallel programs when machine-dependent details are omitted.
The principles of efficiently implementing the extensions on SIMD,
MIMD, and MSIMD machines are discussed. The extensions are small
enough to be integrated easily into other imperative languages.
*Supported by Cooperative Agreement NCC 2-387 between the National Aeronautics
and Space Administration (NASA) and the Universities Space Research Association
(USRA).
†Authors' permanent address: University of Karlsruhe, D-7500 Karlsruhe, FRG.
1 Introduction
Highly parallel machines with thousands and tens of thousands of proces-
sors are now being manufactured and used commercially. These machines
will be of rapidly growing importance for high-speed computation. They
also indicate that a fundamental paradigm shift from the sequential to the
parallel computer is in progress. This shift is fundamental because it affects
virtually all areas of computer science, computer engineering, and computer
applications.
Ease of programming will be of overwhelming importance for the accep-
tance of highly parallel machines. At present, writing highly parallel pro-
grams is still a poorly understood and extremely complicated craft. What
makes highly parallel programs difficult to write and maintain is that they
must deal with a plethora of machine-dependent details such as the memory
organization and interconnection network, the number of processors avail-
able, and whether the target machine runs in SIMD or MIMD mode. To
make parallel programs easier to write, maintain, and port, parallel pro-
gramming languages must abstract from machine-dependent details and al-
low programs to be formulated in a problem-oriented way.
The current programming style for machines such as the 65,000-processor
Connection Machine[3,7] is best characterized as "interconnection program-
ming". This style involves exploiting details about the interconnections
among processors and memory units for squeezing the last bit of perfor-
mance out of the available hardware. Interconnection programming has the
same undesirable properties as assembly programming: Programs are diffi-
cult to understand and maintain, and have to be rewritten for every new
machine type. The rationale for interconnection programming is that the
communication networks of today's parallel computers are critical
bottlenecks, much too slow compared to the speed of the processors. However,
given the youth of the field, parallel memory organization and compiler
technology are likely to improve significantly, and might render
interconnection programming as obsolete as assembly programming. It therefore
seems appropriate to design programming languages in which details about the
memory organization and interconnection network are irrelevant. Instead,
programs should simply exchange data by reading and writing memory in
parallel, while fast interconnection hardware and compiler technology im-
plement efficient data transport. The goal is to let the problem dictate the
data exchanges, and not a particular computer architecture.
A related machine dependence involves the number of physical processors.
On most parallel machines today, programmers are repeatedly faced
with the problem of simulating a large number of parallel threads of control
on a comparatively small number of real processors. The resulting programs
are extremely difficult to understand, because the code for multiplexing the
processors and for packaging and shifting the data accordingly may obscure
even simple algorithms. Instead, the problem and the algorithm should dic-
tate the number of processes to be used, and the underlying runtime system
should organize the allocation of data and processes to real memories and
processors.
A third issue when programming parallel machines is whether they ex-
ecute in the modes MIMD, SIMD, or MSIMD. MIMD stands for multiple
instruction streams, multiple data streams and means that each hardware
processor has its own instruction pointer, executing its own program on its
own data. Processors run independently of each other, except when
synchronizing or exchanging data. SIMD stands for single instruction stream,
multiple data streams and means that all processors execute the same
instructions in synchrony on their own data, or idle for some instructions. An
SIMD machine consists of a single control processor and a large number of
processing elements. The control processor stores the program and issues
the instructions to the processing elements. Because of the synchronous ex-
ecution and the elimination of many race conditions, an SIMD machine is
easier to program than an MIMD machine. An SIMD machine also costs
less to build than an MIMD machine. First, it needs less memory, because
the program is stored only once. Second, the processing elements are
simple arithmetic and logic units without program counters, and therefore
cheaper to build than full-fledged, general-purpose CPUs. These savings
are important for machines that incorporate tens of thousands of processing
elements. The drawback is that an SIMD machine may be difficult to utilize
fully: Whenever an instruction is issued, only a portion of the processing
elements may actually be in a state where they can execute it; the rest of
them idle.
A compromise between MIMD and SIMD is MSIMD, short for multiple
SIMD. An MSIMD machine is similar to an SIMD machine, except that the
single controller is replaced by several, each of which may issue a different
stream of instructions. The processing elements can choose dynamically
which instruction stream to follow. The underlying assumption is that, al-
though a parallel program may branch out into several independent threads
of control, the number of such threads is much smaller than the number of
processing elements. For instance, two branches of an IF-statement could
be executed simultaneously on an MSIMD machine with two controllers,
while an SIMD machine would first idle one set of the processing elements,
then the other. MSIMD may also be viewed as VLIW SIMD, or
Very-Long-Instruction-Word SIMD.
It is evident that writing a program explicitly for an MIMD, SIMD, or
MSIMD machine is another source of machine-dependence. For example, a
program written for an MIMD computer such as the N-Cube will normally
not run on an SIMD computer such as the Connection Machine, and vice
versa. To preserve portability, parallel programs should be written in such a
way that the synchronous or asynchronous parallelism is determined by the
problem at hand, not dictated by a particular machine architecture. It is the
task of the compiler to map synchronous or asynchronous parallelism to the
capabilities of the available hardware. The synchronous and asynchronous
language constructs presented below can be executed efficiently on MIMD,
SIMD, and MSIMD architectures.
The rest of the paper discusses Modula-2*, an extension of Modula-
2[8] for writing highly parallel programs. The extensions abstract from
the memory organization and the number of physical processors, and let
the programmer choose explicitly between synchronous and asynchronous
execution. The necessary extensions were surprisingly small. We chose
Modula-2 as a base, because we wanted to start experimenting with a sim-
ple language. Similar extensions can be integrated into other imperative
programming languages, such as C++ or Ada. We also discuss the prin-
ciples of how to implement the constructs on MIMD, SIMD, and MSIMD
machines.
2 Parallel Programming Constructs
The extensions of Modula-2 consist of synchronous and asynchronous
versions of a forall statement, plus a simple, optional declaration for mapping
array data onto processors. Furthermore, the restrictions in Modula-2 on
compile-time evaluation of array bounds had to be lifted, to allow for flexible,
parallel array processing.
For presenting the syntax of the extensions, we use the EBNF notation of
the Modula-2 language definition[8], with keywords in upper case, | denoting
alternation, ( ... ) grouping, and [ ... ] optionality of the enclosed sentential
form.
2.1 Overview of the forall statement
The forall statement creates a set of processes that execute in parallel. In
the asynchronous form, the individual processes operate independently; the
forall simply terminates when the last of the created processes terminates.
In the synchronous form, the processes created by the forall operate in unison,
but may branch out into mutually independent subsets and then rejoin.
Although the asynchronous form is the more general one, the synchronous
form is easier to understand because it causes fewer race conditions, just as
a clocked hardware circuit causes fewer race conditions than an unclocked
one. Where necessary, explicit synchronization of asynchronous processes is
possible with semaphores and the procedures SEND and WAIT, as specified
(though not as implemented) in Chapter 30 of [8].
The syntax of the forall is as follows.
ForallStatement = FORALL ident ":" SimpleType IN (PARALLEL | SYNC)
                    StatementSequence
                  END
The identifier introduced by the forall statement is local to the statement
and follows the usual scope rules. "SimpleType" is an enumeration
or a subrange of another enumeration. The basic enumerations
INTEGER, CARDINAL, LONGINT, CHAR, and BOOLEAN as well as any
user-defined enumeration may be used.
2.2 The asynchronous forall
The action of the asynchronous forall statement
FORALL C : T IN PARALLEL SS END
is as follows.
1. Assume the number of values of type T is N. The statement creates
N processes, each supplied with a constant C bound to a unique value
of T.
2. The N processes execute the statement sequence SS concurrently. No
assumptions about the relative speeds of the processes may be made,
unless explicitly synchronized. The statements in SS may refer to C
or any other identifier global to the statement. If several processes
write the same global variable, then it is indeterminate which value is
eventually stored in it.
3. The forall statement terminates when the last of the N created
processes terminates.
In the following simple example, an asynchronous forall statement im-
plements a vector addition.
FORALL i: [0..N-1] IN PARALLEL z[i] := x[i] + y[i] END
Since no two processes created by the forall access the same variable, no
temporal ordering of the processes is necessary. The N processes may exe-
cute at whatever speed. The forall terminates when all processes created
by it have terminated.
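The behavior of the asynchronous forall can be imitated in Python. The sketch below is only a simulation under stated assumptions: the helper name forall_parallel and the use of a thread pool are illustrative choices, not part of Modula-2*. It shows the two properties that matter here: the bodies run in no guaranteed order, and the statement returns only after the last body has finished.

```python
from concurrent.futures import ThreadPoolExecutor

def forall_parallel(index_range, body):
    """Hypothetical helper: run body(i) once per index, in no particular
    order, and return only when every invocation has finished."""
    with ThreadPoolExecutor() as pool:
        # list() waits for all results and re-raises any worker exception.
        list(pool.map(body, index_range))

x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
z = [0.0] * len(x)

def add_element(i):
    # Each simulated process writes only its own z[i], so no ordering
    # among the processes is needed.
    z[i] = x[i] + y[i]

forall_parallel(range(len(x)), add_element)
```

Because no two bodies touch the same element, the result does not depend on how the pool schedules them.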
Our asynchronous forall is a simplification of the forall statement found
in the dataflow languages VAL and SISAL[6,5]. It can express the same
degree of explicit parallelism as its dataflow variants. However, dataflow
machines can also exploit implicit parallelism, by detecting at runtime that
certain subexpressions are independent, and then executing these subexpres-
sions in parallel. A parallel machine constructed out of numerous, individual
dataflow processors might be able to exploit this type of implicit parallelism.
2.3 The synchronous forall
The synchronous forall statement
FORALL C : T IN SYNC SS END
differs from the asynchronous form only in that the created processes execute
the statement sequence SS synchronously. Roughly stated, synchronous ex-
ecution means that all processes that follow the same path through the
control flow graph execute instructions in lock step. However, processes on
differing control flow paths may execute asynchronously. This scheme is not
SIMD, since control flow may diverge in conditional statements. For exam-
ple, consider the synchronous execution of an if statement with two arms.
First, all processes evaluate the condition synchronously. The evaluation
splits the set of processes into two subsets, depending on the result of the
condition evaluation. The subset with processes containing the value TRUE
then executes one arm synchronously, the other subset the other arm. Both
sets may operate simultaneously. Though processes in the same subset
operate in lock step, the speeds of processes in different arms are incomparable.
When both subsets terminate, they are joined again into one set.
The synchronous forall statement is a generalization of the forall for
SIMD machines described by Hillis and Steele[4]. MSIMD machines can
execute our synchronous forall directly. SIMD machines can also implement
it efficiently, because there is no order implied among diverging control flow
branches. The lack of ordering permits a process scheduling that greatly
reduces the idling of processors. (See Section 3.2 for more details.)
Below is the precise definition of synchronous execution. The definition
is recursive and given for each statement type.
Sequence: A statement sequence of the form
T1; T2; ... Tk
is executed synchronously by executing the statements Ti synchro-
nously in sequence.
Assignment: An assignment statement of the form
L:=R
is executed synchronously by N processes as follows. First, all N
processes evaluate the designator L synchronously, yielding N (not
necessarily different) results each designating a variable. Second, all
N processes evaluate the expression R synchronously, yielding N (not
necessarily different) values. Third, all processes store their values
computed in the second step into their respective variables computed
in the first step. If the third step results in several values being stored
into the same variable, then it is indeterminate which of those values
will actually be stored after the assignment terminates.
if: An if statement of the form
IF E1 THEN TT1
ELSIF E2 THEN TT2
...
ELSE TTk
END
is executed synchronously by N processes as follows. First, all N
processes evaluate expression E1 synchronously. Those processes for
which E1 evaluates to TRUE then execute TT1 synchronously, while
the other processes evaluate E2 synchronously. Those whose evaluation
of E2 yields TRUE then execute TT2 synchronously, and so on. Thus,
each IF and ELSIF clause divides the set of remaining processes into
two independent subsets. The processes remaining after the last
ELSIF clause (if any) finally execute TTk synchronously. No assumptions
may be made about the relative speeds of pairs of processes executing
different expressions Ei or statement sequences TTi. The synchronous
execution of the if statement terminates when the last non-empty subset
of processes terminates.
while: A while statement of the form
WHILE E DO TT END
is executed synchronously by N processes as follows. Assume processes
may be designated either as active or inactive.
1. Designate all N processes as active.
2. All active processes execute expression E synchronously. Those
processes whose evaluation of E yields FALSE are designated as
inactive.
3. If the set of active processes is empty, then the synchronous
execution of the while statement terminates.
4. Otherwise, the active processes execute statement sequence TT
synchronously.
5. Continue with step 2.
forall: A forall statement of the form
FORALL D : U ... TT END
is executed synchronously by N processes as follows.¹ First, all N
processes compute the range U in synchrony. Then each of the N
processes spawns a new set of processes given by U. If the forall
specifies synchronous execution, all processes thus created execute the
statement sequence TT synchronously; otherwise, they execute
asynchronously. Synchronous execution of the forall terminates when all
created processes have terminated.
¹This is a synchronous or asynchronous forall nested within another, synchronous
forall.
WAIT and SEND: Synchronous execution of a WAIT by N processes
causes all N processes to block if any of them blocks. If the N processes
have been blocked, they will continue only after all individual processes
that caused the blocking have been unblocked by a SEND from other
processes. Clearly, the N processes cannot unblock themselves.
The synchronous execution of expressions, designators, procedure calls,
case statements, repeat statements, for statements, loop statements, with
statements, return statements, and exit statements is defined analogously.
Of special importance are procedure and function calls, because they allow
multiple, synchronous subprogram invocations. The definitions are omitted
here for the sake of brevity.
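The rules for synchronous assignment and for the synchronous if statement can be illustrated with a small Python simulation. This is a sketch under assumptions: the function names sync_assign and sync_if and the representation of memory as a Python list are invented for illustration. Where several processes store into the same variable, the definition leaves the outcome indeterminate; the sketch simply lets the last store win, which is one permitted outcome.

```python
def sync_assign(procs, designator, expression, memory):
    """Synchronous L := R for the processes in procs (illustrative helper).

    designator(i) gives the index of the variable process i assigns to;
    expression(i, memory) gives the value process i computes.
    """
    targets = [designator(i) for i in procs]          # step 1: all evaluate L
    values = [expression(i, memory) for i in procs]   # step 2: all evaluate R
    for t, v in zip(targets, values):                 # step 3: all store
        memory[t] = v

def sync_if(procs, condition, then_arm, else_arm):
    """Split procs by the condition, run each arm on its subset, rejoin."""
    flags = [condition(i) for i in procs]             # all evaluate E first
    true_set = [i for i, f in zip(procs, flags) if f]
    false_set = [i for i, f in zip(procs, flags) if not f]
    # The arms are mutually independent; running them one after the other
    # is just one of the orders the definition allows.
    if true_set:
        then_arm(true_set)
    if false_set:
        else_arm(false_set)

# Rotate a vector left by one: every process reads its neighbour's OLD value,
# because all right-hand sides are evaluated before any store happens.
V = [10, 20, 30, 40]
n = len(V)
sync_assign(range(n), lambda i: i, lambda i, m: m[(i + 1) % n], V)
rotated = list(V)

# Even-numbered processes double their element, odd-numbered ones negate it.
W = [1, 2, 3, 4]
sync_if(
    list(range(len(W))),
    lambda i: i % 2 == 0,
    lambda ps: sync_assign(ps, lambda i: i, lambda i, m: 2 * m[i], W),
    lambda ps: sync_assign(ps, lambda i: i, lambda i, m: -m[i], W),
)
```

The rotation would give a different result if each process evaluated and stored in turn, which is exactly the interference the three-step rule excludes.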
2.4 Example
Consider the problem of summing the elements of a vector in parallel. By
using a recursive doubling technique, the sum can be computed in O(log N)
time, where N is the length of the vector. Figure 1 illustrates the process.
The recursive doubling technique operates basically by computing partial
sums of length 2, 4, 8,... N. There is a one-to-one mapping between process
numbers and elements of the vector. By inspecting the assignment statement
we note that only process i will update the i'th element of the vector. In the
first iteration, all odd numbered processes are disabled by the if statement,
since that statement has no second arm. Thus, only the even numbered
processes update their respective vector elements. Each of those processes
does so by retrieving the element to the right of V[i] and adding it to V[i].
(The second condition in the if statement makes sure that the last process
does not attempt to access a non-existing vector element.) In the next
iteration, only the processes divisible by 4 will update their values, but this
time they reach for elements that are a distance of 2 away. These are the
even numbered elements. Note that these elements already contain sums of
subvectors of length 2. The result is that now the updated array elements
contain partial sums of length 4. This process continues by doubling the
length of the partial sums in each step, until V[0] contains the desired result.
Figure 1: Computing the Sum of a Vector
VAR V : ARRAY[0 .. N] OF REAL;
VAR stride: CARDINAL;
BEGIN
  stride := 1;
  WHILE stride <= N DO
    FORALL i : [0 .. N] IN PARALLEL
      IF ((i MOD (stride*2))=0) AND ((i+stride)<=N) THEN
        V[i] := V[i] + V[i+stride]
      END
    END;
    stride := stride * 2
  END (* sum in V[0] *)
END
Note that the process selection is such that in each iteration, none of
the processes interfere. Each process reads and writes its own pair of vector
elements. Thus, we can use the asynchronous form of the forall statement.
The only requirement is that all processes complete before the next iteration
commences, but that property is assured by the semantics of the forall.
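The recursive doubling program can be traced with a sequential Python simulation. This is a sketch, not an implementation of the forall: the function name vector_sum and the returned round counter are additions for illustration, and the inner for loop stands in for the asynchronous forall. Running the processes one after the other is legitimate here because, within one round, the selected processes touch disjoint pairs of elements.

```python
def vector_sum(values):
    """Sum a non-empty sequence by recursive doubling (simulation sketch).

    Returns the sum and the number of doubling rounds that were needed.
    """
    V = list(values)
    N = len(V) - 1            # highest index, as in ARRAY[0 .. N]
    stride = 1
    rounds = 0
    while stride <= N:
        # Stand-in for: FORALL i : [0 .. N] IN PARALLEL ... END
        for i in range(N + 1):
            if i % (stride * 2) == 0 and i + stride <= N:
                V[i] = V[i] + V[i + stride]
        stride *= 2
        rounds += 1
    return V[0], rounds

total, rounds = vector_sum(range(1, 11))   # ten elements, N = 9
```

For ten elements the strides are 1, 2, 4, and 8, so four rounds suffice, in line with the O(log N) bound.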
For illustrating the use of the synchronous forall, consider interchanging
the while and forall statements in the above program. How would that
change the execution of the program? First, each process would now control
its own loop, so the loop control variable stride must be replaced by an
array. Furthermore, the individual processes must now be constrained to
execute synchronously. Otherwise, we would obtain unpredictable results,
because the processes may overtake each other arbitrarily. For instance,
one process might read a vector element that has not yet been updated, or
it might overwrite a vector element whose old value is still needed. The
synchronous forall guarantees that no such interference can happen. The
resulting program is below.
VAR V : ARRAY[0 .. N] OF REAL;
VAR stride: ARRAY[0 .. N] OF CARDINAL;
BEGIN
  FORALL i : [0 .. N] IN SYNC
    stride[i] := 1;
    WHILE stride[i] <= N DO
      IF ((i MOD (stride[i]*2))=0) AND ((i+stride[i])<=N) THEN
        V[i] := V[i] + V[i+stride[i]]
      END;
      (* each process doubles its own stride inside the loop *)
      stride[i] := stride[i] * 2
    END
  END (* sum in V[0] *)
END
This program could be transformed again in such a way that not all processes
execute the loop for the same number of steps. Merging the condition of the
if statement into the condition of the while statement would stop each loop
at the right time. Yet another transformation would use N semaphores to
control the summing process asynchronously.
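The synchronous variant can be simulated in Python by following the rules for the synchronous while, if, and assignment statements. The sketch assumes a helper named sync_vector_sum and models the active set as a plain list; neither appears in the original. Each process keeps its own stride, all active processes test the loop condition together, and all right-hand sides are evaluated before any store.

```python
def sync_vector_sum(values):
    """Lock-step simulation of the synchronous summation (sketch)."""
    V = list(values)
    N = len(V) - 1
    stride = [1] * (N + 1)               # stride[i] := 1, one copy per process
    active = list(range(N + 1))          # while: all processes start active
    while True:
        # All active processes evaluate the loop condition together;
        # those that obtain FALSE become inactive.
        active = [i for i in active if stride[i] <= N]
        if not active:
            break
        # if: evaluate the condition for every active process, then let the
        # TRUE subset perform the assignment synchronously.
        chosen = [i for i in active
                  if i % (stride[i] * 2) == 0 and i + stride[i] <= N]
        new_values = [V[i] + V[i + stride[i]] for i in chosen]  # evaluate R
        for i, v in zip(chosen, new_values):                    # then store
            V[i] = v
        for i in active:
            stride[i] *= 2
    return V[0]

result = sync_vector_sum(range(1, 11))
```

In this version every process runs the same number of iterations, since all stride values stay equal.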
2.5 Allocation of array data
Co-location of data with the processors that operate upon them is impor-
tant for parallel machines without uniform access time to memory locations.
Poor alignment of data and processors may cause excessive communication
overhead. We therefore provide a simple construct for controlling the allo-
cation of array data to the available processors. This construct is optional,
and does not change the meaning of a program; it affects only performance.
A compiler for a machine with uniform memory access time may ignore the
construct.
The allocation of array data to processors is controlled with one allocator
per dimension. The modified declaration syntax for arrays is as follows:
ArrayType = ARRAY SimpleType [allocator]
            {"," SimpleType [allocator]} OF type
allocator = LOCAL | SPREAD | SCATTER
If the allocator is missing, it is assumed to be SPREAD. The interpretation
of the allocator is as follows.
LOCAL: Allocate all elements of a dimension marked LOCAL to a single
processor.
SPREAD: Distribute the elements of a dimension marked SPREAD over
all available processors. Elements whose indices differ only by unity in
the marked dimension must be allocated to the same processor, as far
as that is possible given that all available processors should be used.
SCATTER: Distribute the elements of a dimension marked SCATTER
over the available processors. Elements whose indices differ only by
unity in the marked dimension must be allocated to different processors.
As an example, consider the following declarations.
A: ARRAY [1..L] SPREAD [1..M] LOCAL OF T
B: ARRAY [1..L] SCATTER [1..M] LOCAL OF T
The LOCAL allocator forces each row of A and B into one processor. If
the number of available processors, P, is larger than L, then each row is
allocated to exactly one processor in both cases. If 1 < P ≤ L, then row
r of A is allocated to processor ((r - 1) div ⌈L/P⌉), while row r of B is
allocated to processor ((r - 1) mod P). Thus, SPREAD assigns sequences
of ⌈L/P⌉ successive rows to a single processor, while SCATTER distributes
these sequences over the P processors.
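The two row mappings can be compared with a short Python sketch. It assumes that the SPREAD formula is an integer division of r - 1 by ⌈L/P⌉, which is what the statement about sequences of ⌈L/P⌉ successive rows requires; the function names and the sample sizes (8 rows, 4 processors) are chosen only for illustration.

```python
def spread_processor(r, L, P):
    """Processor for row r (1-based) of an L-row array under SPREAD."""
    rows_per_processor = (L + P - 1) // P     # ceil(L / P) in integers
    return (r - 1) // rows_per_processor

def scatter_processor(r, L, P):
    """Processor for row r (1-based) of an L-row array under SCATTER."""
    return (r - 1) % P

L_rows, P = 8, 4
spread_map = [spread_processor(r, L_rows, P) for r in range(1, L_rows + 1)]
scatter_map = [scatter_processor(r, L_rows, P) for r in range(1, L_rows + 1)]
# spread_map  -> [0, 0, 1, 1, 2, 2, 3, 3]   neighbouring rows share a processor
# scatter_map -> [0, 1, 2, 3, 0, 1, 2, 3]   neighbouring rows are separated
```

With SPREAD, neighbouring rows end up together, which favours nearest-neighbor access; with SCATTER, any contiguous band of rows is shared out over all processors.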
For a multidimensional array, there is at most one dimension where the
difference between SPREAD and SCATTER matters. Consider a multidimensional
array C with n dimensions.
C: ARRAY [l_n..u_n] alloc_n ... [l_1..u_1] alloc_1 OF T
$$ d_i = \begin{cases} 1 & \text{if } alloc_i = \text{LOCAL} \\ u_i - l_i + 1 & \text{otherwise} \end{cases} $$
$$ D(k, l) = \begin{cases} \prod_{i=k}^{l} d_i & \text{if } k \le l \\ 1 & \text{otherwise} \end{cases} $$
Determine the largest m (1 ≤ m ≤ n) such that
P < D(m, n)
where P is the number of available processors. If no such m exists (i.e.,
P > D(1, n)), then there are enough processors to distribute all elements of
dimensions marked SPREAD or SCATTER to different processors. (There
is no difference between SPREAD and SCATTER in this case.) If m ex-
ists, it identifies the dimension where the difference between SPREAD and
SCATTER applies. If the allocator of that dimension is SPREAD, then
the array elements C[j_n, ..., j_m, ...] and C[j_n, ..., (j_m + 1), ...] must be
allocated to the same processor, as far as that is possible, given that all
available processors should be used. If the allocator is SCATTER, then any
two such array elements must be allocated to different processors.²
Dimensions higher than m that are marked SPREAD or SCATTER are
simply distributed over the available processors. Dimensions lower than m
are automatically treated as LOCAL, since there are no additional proces-
sors available to distribute the data. An implementation may also map the
dimensions m and lower into one dimension (i.e., "unroll" them into one,
long vector) and then treat the new dimension according to the allocator of
dimension m.
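The function F defined below provides a suitable mapping of elements
C[j_n, ..., j_1] to processors numbered 0 ... P - 1. Many other choices are
possible, depending on the interconnections among the processors.
$$ F(m, j_n, \ldots, j_1) = \begin{cases} G(m, j_n, \ldots, j_1) \bmod P & \text{if } alloc_m = \text{SCATTER} \\ G(m, j_n, \ldots, j_1) \operatorname{div} \lceil D(m, n)/P \rceil & \text{if } alloc_m = \text{SPREAD} \end{cases} $$
$$ G(m, j_n, \ldots, j_1) = \sum_{i=m}^{n} D(m, i-1) \times S(i) \times (j_i - l_i) $$
$$ S(i) = \begin{cases} 0 & \text{if } alloc_i = \text{LOCAL} \\ 1 & \text{otherwise} \end{cases} $$
²If P = D(m, n), there is again no difference between SPREAD and SCATTER.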
Function F can even be used in the case where P > D(1,n), by setting
m = 1.
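A Python sketch of the mapping function may help to check the definitions against the two-dimensional example of arrays A and B. Several points are assumptions rather than statements from the text: the sum in G is taken over dimensions m through n, index offsets are measured from the lower bound l_i, the SPREAD case uses integer division, and the names make_mapping, dims, and the tuple layout are invented for illustration.

```python
from math import prod

LOCAL, SPREAD, SCATTER = "LOCAL", "SPREAD", "SCATTER"

def make_mapping(dims, P):
    """dims[i-1] = (l_i, u_i, alloc_i) for dimension i = 1..n.

    Dimension n is the leftmost one in the declaration. Returns the
    chosen dimension m and a function F taking the tuple (j_1, ..., j_n).
    """
    n = len(dims)

    def d(i):
        lo, hi, alloc = dims[i - 1]
        return 1 if alloc == LOCAL else hi - lo + 1

    def D(k, l):
        return prod(d(i) for i in range(k, l + 1)) if k <= l else 1

    def S(i):
        return 0 if dims[i - 1][2] == LOCAL else 1

    # Largest m with P < D(m, n); use m = 1 when no such m exists.
    candidates = [m for m in range(1, n + 1) if P < D(m, n)]
    m = max(candidates) if candidates else 1

    def G(j):
        # j[i-1] is the index j_i; offsets are taken from the lower bound.
        return sum(D(m, i - 1) * S(i) * (j[i - 1] - dims[i - 1][0])
                   for i in range(m, n + 1))

    def F(j):
        if dims[m - 1][2] == SCATTER:
            return G(j) % P
        block = -(-D(m, n) // P)          # ceil(D(m, n) / P)
        return G(j) // block

    return m, F

# A: ARRAY [1..8] SPREAD [1..3] LOCAL OF T, with P = 4 processors.
m_a, F_a = make_mapping([(1, 3, LOCAL), (1, 8, SPREAD)], 4)
rows_a = [F_a((1, r)) for r in range(1, 9)]

# B: ARRAY [1..8] SCATTER [1..3] LOCAL OF T, with P = 4 processors.
m_b, F_b = make_mapping([(1, 3, LOCAL), (1, 8, SCATTER)], 4)
rows_b = [F_b((2, r)) for r in range(1, 9)]
```

Under these assumptions the sketch reproduces the row-to-processor assignments stated for A and B, and the column index has no influence because that dimension is LOCAL.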
The SPREAD allocator is used to minimize communication overhead in
case of nearest-neighbor communications, while still utilizing all available
processors. The SCATTER allocator can keep processor utilization high
if segments of an array are not being processed, as for example in LU-
decomposition.
Callahan and Kennedy[1] have made a different proposal for the distribution
of array data. In their proposal, programmers must provide explicit
mapping functions for array indices to processor numbers. In our design,
these mapping functions are created automatically from much simpler al-
locators, while keeping the program independent of the number of physical
processors. On the Connection Machine, the default allocation is equiva-
lent to SPREAD. LOCAL or SCATTER allocations must be programmed
explicitly.
2.6 Other extensions of Modula-2
The original definition of Modula-2 in reference [8] places several restrictions
on arrays. The first concerns open arrays. An open array is an array without
declared bounds. Open arrays are essential for subprograms that operate
on arrays whose size is unknown until runtime. Modula-2 allows only one-
dimensional, open arrays. For convenient handling of higher-dimensional
arrays, open array types should be allowed to have more than one dimension.
Multi-dimensional, open arrays are actually proposed for the ISO-standard
of Modula-2[2].
Another troublesome restriction involves compile-time constants. For
example, the forall statement uses subranges, whose bounds, according to
the original language definition of Modula-2, would have to be compile-
time constants. This restriction is inappropriate for array parameters whose
array bounds are not known until runtime. Similarly, it is often necessary
to create temporary, local arrays whose size is determined by the size of
another array that is passed as a parameter. We therefore propose that
constant expressions are evaluated at runtime. When a constant expression
is used in a constant declaration, the expression is evaluated and used to
initialize the constant. No assignments to constants are permitted. When
used as an array bound, a constant expression is evaluated when the array
is allocated; the array bounds remain unchanged for the lifetime of the
array. Similarly, when used in a subrange of a forall statement, a constant
expression is evaluated once and used to determine the number of processes.
The constant expression is not reevaluated for the duration of the forall
statement.
As an example, consider the procedure Count. Count returns the number
of bits in a bitvector that have the value TRUE. A possible solution is
to sum a vector that is initialized to 1 where the bitvector has the value
TRUE, and to 0 elsewhere. This vector must be allocated at runtime, since
it is unknown what size to choose at compile time.³ It would be wasteful and
awkward to require that the caller provide the array. Count is illustrated
below.
PROCEDURE Count (bits : ARRAY OF BOOLEAN) : CARDINAL;
  VAR temp : ARRAY[0 .. HIGH(bits)] OF INTEGER;
  VAR stride: CARDINAL;
BEGIN
  FORALL i : [0 .. HIGH(bits)] IN PARALLEL
    IF bits[i] THEN temp[i] := 1
    ELSE temp[i] := 0
    END
  END;
  (* Now compute the sum of elements of temp with *)
  (* recursive doubling, as described earlier.    *)
  . . .
  RETURN temp[0]
END Count
Another restriction that can be lifted is that set types must have a base
type with a small cardinality, for example the wordlength of a computer.
With highly parallel machines, there is no rationale for such a severe
restriction. Instead, sets should be allowed to be as large as memory permits. Of
course, an implementation is free to pack a set type densely into memory
words.
³Summing the bitvector itself would not work unless each bit occupies a full word.
3 Implementation of the forall Statement
We consider implementing the synchronous and asynchronous forms of the
forall statement on both synchronous and asynchronous architectures. Par-
ticular emphasis is on how to simulate a large number of processes, p, that
potentially exceeds the number of physical processors, P. We assume p > P
in the following.
3.1 Process-to-processor assignment
An efficient assignment of processes to processors is important when there
are thousands of processors. The assignment can be performed statically by
the compiler, or dynamically by the runtime system. A static assignment
has the advantage of eliminating queues of ready processes. The overhead
for managing these queues might easily exceed the actual work to be done
in fine-grained parallel algorithms such as those presented earlier. On the
other hand, a poor static assignment might not use the available processors
well. Clearly, any reasonable process assignment must take the allocation of
data to processors and the communication network into account.
As an example, consider the following program fragment.
A: ARRAY[1..q] SPREAD OF T;
B: ARRAY[1..r] SPREAD OF T;
FORALL i: [1..p] IN PARALLEL A[e1(p)] := B[e2(p)] END
where e1(p) and e2(p) are expressions in p. Without any further assumption
about these expressions and the relations among p, q, and r, a reasonable
assignment is to spread the p processes over the same processors that are
available to the larger of the arrays A and B. Assume these are P processors.
Let v = ⌈p/P⌉. Processor i would then execute processes vi, ..., v(i + 1) - 1
in sequence.
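The block assignment of processes to processors can be written out in a few lines of Python. The sketch assumes that both processes and processors are numbered from 0, and it clips the last block at p - 1 so that no non-existent process is scheduled; the function name is hypothetical.

```python
def processes_for_processor(i, p, P):
    """Processes run in sequence by processor i (0-based), for p processes
    spread over P processors in blocks of v = ceil(p / P)."""
    v = -(-p // P)                         # ceil(p / P) in integer arithmetic
    first = v * i
    last = min(v * (i + 1) - 1, p - 1)     # the final block may be shorter
    return list(range(first, last + 1))

# Ten processes on four processors: v = 3.
schedule = [processes_for_processor(i, 10, 4) for i in range(4)]
# schedule -> [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
```

When p is not a multiple of P, the trailing processors receive fewer processes, or none at all.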
Note that the process-to-processor assignment may actually change from
statement to statement within a single forall, depending on what data
structures are being accessed. Control may therefore jump processors from one
statement to the next, or even within a statement. The code produced by a
compiler optimizing for memory references may therefore be quite different
from the traditional method of rescheduling a process onto a potentially dif-
ferent processor only at synchronization events. Much research in process
management on highly parallel machines remains to be done.
Many interconnection networks can treat certain communication patterns
better than others. For instance, on a hypercube network, near-neighbor
communication in an n-dimensional grid and communication over
distances that are powers of 2 can be treated more efficiently than others.
Suppose the index expressions in the above program fragment are linear in
p, i.e., of the form c1 × p + c2, where c1 and c2 are constants, perhaps even
powers of 2. In those situations, efficient communication instructions can
be chosen by a compiler optimizing for a hypercube network. Compilers for
other communication networks may be able to exploit other special cases.
Clearly, future research in optimizing compilers must address the problem
of minimizing communication time in highly parallel machines.
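One way to see why such patterns are cheap on a hypercube is the following Python sketch. It assumes that array indices are mapped to hypercube node addresses with a binary-reflected Gray code. This is a common embedding, but it is an assumption of the sketch, not something the text above prescribes, and the function names are illustrative. Under that assumption, indices at distance 1 lie one link apart, and indices whose distance is a larger power of 2 lie two links apart.

```python
def gray(i):
    """Binary-reflected Gray code of a non-negative integer."""
    return i ^ (i >> 1)

def hops(i, j):
    """Hypercube links between the nodes holding indices i and j,
    i.e. the Hamming distance of their Gray-coded addresses."""
    return bin(gray(i) ^ gray(j)).count("1")

def max_hops_for_shift(dim, shift):
    """Worst-case hop count over all index pairs (i, i + shift)
    that fit into a hypercube with 2**dim nodes (no wrap-around)."""
    n = 1 << dim
    return max(hops(i, i + shift) for i in range(n - shift))
```

Shifts that are not powers of 2 can cost more under this mapping: indices 1 and 4, for example, end up three links apart.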
3.2 Implementation of the asynchronous forall
The asynchronous forall is easy to implement. Since no assumptions about
the relative speeds of the processes can be made, an implementation is free
to choose any order, for example a fully asynchronous, a fully synchronous, a
vectorized, a sequential, or even a random order.
Recall that the asynchronous forall statement does not terminate until
all created processes terminate. Thus, on a MIMD machine, all processes
must perform a synchronization step at the end. An asynchronous reduction
tree similar to the one described for summing the elements of a vector can be
used to avoid linear synchronization time when processes terminate nearly
simultaneously. On an SIMD machine, no such synchronization is necessary.
On an MSIMD machine, only the controllers need to synchronize.
An important issue on an SIMD machine is how to fully use the available
processors. Recall that whenever control flow splits into two or more
branches, only one set can be fed instructions at a time, while the other sets
idle.
To avoid the idling of processors, more sophisticated scheduling algo-
rithms are possible. The goal is to assign processes to processors in such
a fashion that nearly all processors remain busy. This goal can be accom-
plished with a simple rescheduling at every branchpoint. For example, con-
sider an if statement with two arms, and assume that each of the P physical
processors is assigned v processes. The usual simulation is to feed both arms
of the if statement v times to the processors, effectively idling half of the
processors. Instead, consider the following scheme. First, each processor
evaluates the condition for all its assigned processes and divides them into
two sets, depending on the results. Next, all P processors select processes
with value TRUE and then receive the instructions for the corresponding arm
from the controller. When all processors are done with processes containing
the value TRUE, the controller switches to the other arm. For sufficiently
high v and evenly distributed values of the conditions, few processors will
actually idle, achieving nearly full utilization. Note that such a scheduling
is valid because the asynchronous forall makes no assumptions about the
relative order of the two arms of the if statement.
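The benefit of this rescheduling can be estimated with a small Python model. The sketch below only counts how often the controller has to issue one arm of the if statement. It assumes that both arms cost the same per issue and ignores the cost of partitioning the processes. The function name `arm_issues` and the input layout are hypothetical.

```python
def arm_issues(conditions):
    """Count arm issues for one if statement on an SIMD machine.

    conditions[i] lists the Boolean condition values of the processes
    assigned to processor i. Returns (naive, rescheduled).
    """
    v = max(len(c) for c in conditions)
    # Usual simulation: both arms are fed once per process slot.
    naive = 2 * v
    # Rescheduled: the TRUE arm is issued until the processor with the
    # most TRUE processes is done, then likewise for the other arm.
    true_rounds = max(sum(c) for c in conditions)
    false_rounds = max(len(c) - sum(c) for c in conditions)
    return naive, true_rounds + false_rounds
```

With four processes per processor and conditions split evenly on every processor, the model drops from 8 issues to 4. If one processor holds only TRUE processes and another only FALSE ones, nothing is gained.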
On an MSIMD machine, several branches can be executing in an overlapped
fashion, until the number of parallel branches exceeds the number
of available controllers. Up to that point, all processors can be kept busy.
After that, the remaining branches are executed in SIMD fashion, possibly
with rescheduling as discussed above.
Finally, the asynchronous forall can be executed efficiently on vector
computers, since order is immaterial. For simulation on sequential
machines, all that needs to be done is to replace the forall statement with a for
statement over the same range. Note, however, that such a simpleminded
simulation may mask many potential programming errors. Perhaps a
randomized order of the processes is more appropriate in helping programmers
detect errors when testing parallel programs on sequential machines.
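A sequential simulation with an optional randomized order might look like the following Python sketch. It is an illustration only: process bodies are modeled as functions `body(i, state)`, and all names are hypothetical. The second helper compares the ascending order with a few random orders and reports a disagreement, which hints that the processes of an asynchronous forall depend on each other.

```python
import random

def run_forall(indices, body, state, shuffle_seed=None):
    """Run body(i, state) once per index on one processor.

    With shuffle_seed=None the indices are visited in ascending order
    (the plain for-loop simulation); otherwise in a seeded random order.
    """
    order = list(indices)
    if shuffle_seed is not None:
        random.Random(shuffle_seed).shuffle(order)
    for i in order:
        body(i, state)
    return state

def looks_order_dependent(indices, body, make_state, trials=20):
    """Heuristic: True if some random order disagrees with ascending order."""
    reference = run_forall(indices, body, make_state())
    return any(
        run_forall(indices, body, make_state(), shuffle_seed=s) != reference
        for s in range(trials)
    )

# Two example process bodies (illustrative only).
def shift(i, x):      # reads an element that another process writes
    x[i + 1] = x[i]

def square(i, x):     # touches only its own element
    x[i] = x[i] * x[i]
```

A result of False does not prove that the processes are independent. It only means that none of the sampled orders exposed a difference.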
3.3 Implementation of the synchronous forall
Clearly, an MSIMD machine can implement the synchronous forall statement
directly. When the number of parallel branches exceeds the number of
controllers, the remaining branches must be simulated by multiplexing the
available processors, as is done on SIMD machines. Since separate control
flow branches execute asynchronously, efficient process scheduling is possi-
ble, as explained in the previous section.
Achieving synchronous execution on an MIMD machine can be expensive.
A simple but inefficient simulation would be to insert a synchronization
command after every instruction. Then each instruction would essentially
take time proportional to the logarithm of the number of available processors,
which would slow down all programs significantly. Fortunately, a
synchronization command is not needed for every instruction, but only before and
after every memory write. This synchronization suffices because processes
are affected by other processes only through changes in memory.
Nevertheless, a modest amount of hardware for simulating SIMD, such as a global
clock for instruction execution, would provide a much more efficient
implementation of synchronous execution.
If the number of processes exceeds the number of processors, it is impor-
tant that the multiplexing does not violate the semantics of the synchronous
forall. Consider the following statement.
FORALL i: [....] IN SYNC x[i+1] := x[i] END
If process i actually executes before process i + 1, a naive implementation
would produce incorrect results, even on a SIMD machine. Instead, all pro-
cesses have to follow faithfully the steps in the definition of the synchronous
execution of the assignment statement. Each process must first evaluate the
left and right sides of the assignment before any write to memory takes place.
This means that each process must save both a pointer and a value and wait
for all processes to complete their evaluation of the left and right hand sides
before the actual assignment. Since evaluating the right and left sides might
cause side effects (through function calls, for instance), these computations
must be carried out such that they appear synchronously even if there are
more processes than processors.
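The difference between the two behaviors can be made concrete with a Python sketch that multiplexes all processes onto a single processor. The bounds `lo` and `hi` stand in for the index range, which the statement above leaves unspecified. They and the function names are assumptions of the sketch.

```python
def sync_assign_shift(x, lo, hi):
    """Synchronous  x[i+1] := x[i]  for i in lo..hi on one processor.

    Phase 1 evaluates the target index and the value for every process
    and saves them; phase 2 performs the writes.
    """
    pending = [(i + 1, x[i]) for i in range(lo, hi + 1)]  # evaluate only
    for target, value in pending:                          # then write
        x[target] = value
    return x

def naive_shift(x, lo, hi):
    """Naive multiplexing: each process evaluates and writes in turn."""
    for i in range(lo, hi + 1):
        x[i + 1] = x[i]
    return x
```

Starting from [1, 2, 3, 4, 5] with processes 0 to 3, the two-phase version shifts the array to [1, 1, 2, 3, 4], whereas the naive version propagates the first element everywhere and yields [1, 1, 1, 1, 1].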
Correct, synchronous execution requires overhead in space and time.
This overhead can be reduced on both SIMD and MIMD architectures if
the synchronous forall can be transformed into the asynchronous form with
no or infrequent synchronization.
4 Conclusions
We have presented simple language constructs for writing highly paral-
lel, machine-independent programs. These constructs can be implemented
efficiently on SIMD, MSIMD, and MIMD computers. Simple, machine-
independent control over the mapping of data to processors allows compilers
to optimize communication time on architectures with distributed memory.
Work on optimizing compilers for a highly parallel machine is in progress.
