The force on the flex: Global parallelism and portability by Jordan, H. F.
NASA Contractor Report 178161 
ICASE REPORT NO. 86-54
ICASE 
NASA-CR-178161 
19860020908 
THE FORCE ON THE FLEX: GLOBAL PARALLELISM AND PORTABILITY 
Harry F. Jordan 
Contract No. NAS1-17070
August 1986 
INSTITUTE FOR COMPUTER APPLICATIONS IN SCIENCE AND ENGINEERING 
NASA Langley Research Center, Hampton, Virginia 23665 
Operated by the Universities Space Research Association 
NASA
National Aeronautics and
Space Administration
Langley Research Center 
Hampton, Virginia 23665 
https://ntrs.nasa.gov/search.jsp?R=19860020908 2020-03-20T14:38:06+00:00Z
THE FORCE ON THE FLEX:
GLOBAL PARALLELISM AND PORTABILITY
HARRY F. JORDAN+ 
ABSTRACT 
A parallel programming methodology, called the force, supports the
construction of programs to be executed in parallel by an unspecified, but
potentially large, number of processes. The methodology was originally
developed on a pipelined, shared memory multiprocessor, the Denelcor HEP,
and embodies the primitive operations of the force in a set of macros which
expand into multiprocessor Fortran code. A small set of primitives is
sufficient to write large parallel programs, and the system has been used to
produce 10,000 line programs in computational fluid dynamics. The level of
complexity of the force primitives is intermediate. It is high enough to
mask detailed architectural differences between multiprocessors but low
enough to give the user control over performance.
The system is being ported to a medium scale multiprocessor, the
Flex/32, which is a 20 processor system with a mixture of shared and local
memory. Memory organization and the type of processor synchronization
supported by the hardware on the two machines lead to some differences in
efficient implementations of the force primitives, but the user interface
remains the same. An initial implementation was done by retargeting the
macros to Flexible Computer Corporation's ConCurrent C language.
Subsequently, the macros were made to produce directly the system calls
which form the basis for ConCurrent C. The implementation of the Fortran
based system is in step with Flexible Computer Corporation's implementation
of a Fortran system in the parallel environment.
+University of Colorado, Boulder, CO 80309-0425.
Research was supported in part by NASA Contract No. NAS1-17070 and by
the Air Force Office of Scientific Research under Grant No. AFOSR 85-0189
while the author was in residence at the Institute for Computer
Applications in Science and Engineering, NASA Langley Research Center,
Hampton, VA 23665.
The Global Parallelism Concept 
The unifying idea behind the programming environment discussed in this 
paper is that of "global" parallelism. In contrast to the dataflow point of view 
we retain the idea of multiple instruction streams but insulate the user from 
the detailed management of the streams on an individual basis. One view of 
this unifying idea is as a way of incorporating parallelism into the structural 
hierarchy of a program. It is in contrast to the encapsulation of parallelism 
into one or more program modules and can be viewed as parallelism with the 
largest possible "grain" size. 
The view of a computation as a hierarchically structured set of functions
is well established and maps into the subroutine calling hierarchy in
most programming languages. The level of the (usually tree structured)
functional hierarchy at which parallelism enters into the description of an
algorithm is an important issue. The leaf level, where SIMD parallelism is
appropriate, can be denoted as fine grained parallelism. As MIMD parallelism
is applied at higher levels, we can speak of algorithms with coarser grained
parallelism. With fine grained parallelism, the major issue in expressing the
computation is to specify exactly what is to be done in parallel in each of the
small grains. Very tight synchronization must be the rule (as in SIMD) for
fine grained parallelism to make sense. In a program with coarse grained
parallelism the amount of code devoted to expressing the parallelism may be
very small and localized in a high level module. In exchange, the specification
of synchronization becomes the major issue and may appear explicitly at any
level of structure, all the way down to the leaf.
One possible way to fit MIMD parallelism into the calling hierarchy is to
try to encapsulate parallelism below a certain level, or grain size. This has
the advantage that the upper levels of the program can be written without
knowing anything about parallel computation. Using the Fork/Join mechanism
[1] to manage parallel processes, a single instruction stream would fork
within some subroutine into multiple streams which would perform a parallel
computation and then join into a single stream before returning from the
subroutine. The drawbacks in this scheme lie in the area of performance. It
is well known that even a small amount of sequential code in an otherwise
parallel program can decrease efficiency significantly on a system with a large
degree of parallelism. The encapsulation idea forces all code above a certain
level of structure to be sequential. Furthermore, there is overhead associated
with managing processes and execution environments in fork and join which
is invoked whenever the program passes into or out of the parallel level of
structure.
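The cost of sequential code can be made concrete with Amdahl's law, which the text alludes to without naming. The following is a minimal sketch under stated assumptions: the sequential fraction is fixed, fork and join overhead is ignored, and the function names are illustrative only.

```python
def speedup(seq_fraction, processes):
    """Amdahl's law: speedup when a fixed fraction of the work is sequential."""
    return 1.0 / (seq_fraction + (1.0 - seq_fraction) / processes)


def efficiency(seq_fraction, processes):
    """Speedup divided by the number of processes (1.0 is ideal)."""
    return speedup(seq_fraction, processes) / processes


if __name__ == "__main__":
    # Assumed example: 5 percent of the work is sequential.
    for p in (1, 10, 50, 200):
        print(p, round(speedup(0.05, p), 2), round(efficiency(0.05, p), 3))
```

Under these assumptions, 5 percent sequential code holds the speedup on 200 processes below 20, so each process is used at under a tenth of its capacity.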
Since encapsulation overheads tend to make larger grained parallelism
more efficient regardless of the grain size, there is a good reason to locate
parallelism at the highest level of program structure in the MIMD
environment. Experience shows that it is quite feasible to write applications
programs with "global" parallelism. In this environment one begins a program
under the assumption that it may be executed by an arbitrary number P of
processes. There is no explicit code for process management. The processes
are managed by entry level, system dependent code which chooses the
number of processes on the basis of hardware structure and available
knowledge of algorithm needs. The explicitly appearing code to deal with
parallelism is all related to process synchronization and data sharing. The
idea of global parallelism applies to the decomposition of algorithms on the
basis of data rather than function. With a high degree of parallelism some
data decomposition of an algorithm is surely necessary since the number of
independent functions is limited. Thus this idea is probably most
appropriate to systems supporting many processes.
The above concept of global parallelism has been incorporated into a
programming methodology called the "force". The force [2] methodology for
parallel programming arose in trying to produce high performance parallel
programs in a shared-memory multiprocessor running up to 200 processes on
the same user program [3]. Multiprogramming was not an issue, and all
emphasis was on single problem solution speed. Partly for performance
measurement purposes and partly for program manageability, a programming
style emerged in which a single piece of code was written which could be
executed by a force of processes in parallel. The number of processes
constituting the force is constant during execution but is bound as late as
the beginning of execution, and may be one. Similar techniques have been
developed for programming some more recent multiprocessors, notably the Bolt
Beranek and Newman Butterfly [4] and the IBM research processor RP3 [5].
Several advantages arise out of independence from the number of
processes. It is not necessary to design algorithms with a detailed
dependence on the, potentially very large, number of processes executing them.
The choice of the optimal number of processes can be made at run time on
the basis of system hardware configuration and load. Since complete
independence from the number of processes implies correct execution with
only one process, the issues of arithmetic correctness and multi-process
synchronization can be separated in the testing of a program.
Statements written in a force program are implicitly executed by all
processes in parallel. Variables appearing in statements are divided into local
variables, having separate instances for each process, and global variables,
shared among all processes of the force. An assignment statement, for
example, may combine the values of global and local variables to produce a
local or global result. If the result is local, no assignment conflict is
possible. If it is global, then assignment conflict must be prevented, either
by allocation of disjoint sections of a global data structure to multiple
processes or by synchronizing the assignment across processes, say by
enclosing it in a critical section or by using producer/consumer
synchronization on the variable assigned. Library or user subroutines which
are either free of side effects or carefully synchronized can be invoked in
parallel, one copy for each process.
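The local/global distinction can be sketched with Python threads standing in for the processes of a force. This is an illustration, not the Fortran macro implementation: run_force, body and doubled_sum are hypothetical names, and the cyclic division of the array is one assumed way of allocating disjoint sections.

```python
import threading


def run_force(body, nprocs, shared):
    """Run the same body in nprocs threads (hypothetical stand-in for Force/Join)."""
    threads = [threading.Thread(target=body, args=(me, nprocs, shared))
               for me in range(nprocs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def body(me, nprocs, shared):
    # 'partial' is local: every process has its own instance, so no conflict.
    partial = 0
    for i in range(me, len(shared["a"]), nprocs):
        # Each process writes a disjoint section of the global array b.
        shared["b"][i] = 2 * shared["a"][i]
        partial += shared["b"][i]
    # The global total is assigned by every process, so the update is
    # enclosed in a critical section.
    with shared["lock"]:
        shared["total"] += partial


def doubled_sum(values, nprocs):
    shared = {"a": list(values), "b": [0] * len(values),
              "total": 0, "lock": threading.Lock()}
    run_force(body, nprocs, shared)
    return shared["total"]
```

Because the body never depends on a particular process count, doubled_sum(range(10), 1) and doubled_sum(range(10), 4) give the same result, which is the property that lets arithmetic correctness be tested with a single process.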
Realization of the Concept 
The programming language associated with the force consists of some
simple extensions to the Fortran language, which are currently implemented
as macros expanded by a language independent preprocessor. The target
Fortran system must, of course, include ways of creating multiple processes
and of supporting synchronized access to global variables. The macros
interact through the variables of a parallel environment, which contains some
general information such as the number of processes and some machine
dependent items.
The macros currently constituting the force can be divided into several
classes, as shown in Fig. 1. The first class deals with parallel program
structure. The macros Force and Forcesub respectively begin parallel main
programs and parallel subroutines. They make the parallel environment
variables available to the macros within that program module as well as making
the number of processes and a unique identifier for the current process
available to the user at run time. An End Declarations macro marks the
beginning of executable code and provides target locations for declarations
and start up code which may be generated by the macros. A Join macro
terminates the parallel main program. It is the last statement executed by
all processes of the force.
Macros of the second class deal with variable declaration. This class
currently includes only Global and Local macros. Global variables are
associated with Fortran common while local variables are ordinary Fortran
variables local to a separately compiled program module. Sharing of local
variables among several program modules, but local to one process, can only be
accomplished by parameter passing. The static allocation flavor of Fortran
makes it difficult to build a structure of common variables with one instance
for each process when the number of processes is not known until execution
time.
Macros of another class distribute work across processes. The most
familiar construct is the DOALL, which is employed when instances of a loop
body for different index values are independent and can thus be executed in
any order. Two versions are provided. The Presched DO divides index values
among processes in a fixed manner which depends only on the index range
and the number of processes. The Selfsched DO allows processes to schedule
themselves over index values by obtaining the next available value of a
shared index as they become free to do work. For situations in which it is
desirable to parallelize over both indices of a doubly nested loop, both
prescheduled, Pre2DO, and self scheduled, Self2DO, macros are available.
Independence of the loop body instances over both indices is, of course,
required for correct operation. A similar construct is the parallel case, Pcase,
which distributes different single stream code blocks over the processes of the
force. Execution conditions can be associated with each block, and any
number of these conditions may be true simultaneously. No order of
evaluation of the conditions is specified, and each will be evaluated by one
arbitrarily selected process. Thus conditions depending only on global
variables are most meaningful.
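The two DOALL scheduling disciplines can be contrasted in a short Python sketch. It rests on assumptions: the text says only that prescheduling is fixed by the index range and process count, so the cyclic rule below is one possible choice, the step is taken to be positive, and all names are hypothetical.

```python
import threading


def presched_indices(me, nprocs, start, stop, step=1):
    """Index values for process `me` under an assumed fixed cyclic assignment.

    The upper bound is inclusive, as in a Fortran DO loop.
    """
    return list(range(start + me * step, stop + 1, nprocs * step))


class SelfSchedIndex:
    """Shared loop index; each call to next() claims one unassigned value."""

    def __init__(self, start, stop, step=1):
        self._next, self._stop, self._step = start, stop, step
        self._lock = threading.Lock()

    def next(self):
        with self._lock:                 # claim-and-advance must be atomic
            if self._next > self._stop:
                return None              # loop exhausted
            value = self._next
            self._next += self._step
            return value


def selfsched_run(nprocs, start, stop):
    """Let nprocs threads schedule themselves; return the values executed."""
    index, done, done_lock = SelfSchedIndex(start, stop), [], threading.Lock()

    def worker():
        while True:
            i = index.next()
            if i is None:
                return
            with done_lock:
                done.append(i)           # stands in for the loop body

    threads = [threading.Thread(target=worker) for _ in range(nprocs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(done)
```

Prescheduling needs no shared state at run time, whereas self scheduling pays for one synchronized access per index value in exchange for balancing loop bodies of uneven cost.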
Macros associated with program structure:
    Force <name> of <# procs> ident <proc #>
        <declarations>
    End declarations
        <force program>
    Join
    Forcesub <name> of <# procs> ident <proc #>
        <declarations>
    End header
        <subroutine body>
    RETURN
    Forcecall <name>(<parameters>)
Declaration macros:
    Global <variable names>
    Local <Fortran declaration>
Macros specifying parallel execution:
    Pcase on <variable>
        <code block>
    Usect
        <code block>
    End pcase
    [Pre|Self]sched DO <n> <var> = <i1>, <i2>, <i3>
        <loop body>
    <n> End [pre|self]sched DO
Synchronizing macros:
    Barrier
        <code block>
    End barrier
    Critical <variable>
        <code block>
    End critical
    Produce <variable> = <expression>      (producer)
    ... = ... Use(<variable>) ...           (consumer)
Figure 1: Specific Macros for a Force Program
At the heart of the force methodology are the synchronization macros.
They characterize the approach to parallel programming and provide the
means for controlling the force so that coherent and deterministic
computation can be performed. Two subclasses of synchronization are control
flow oriented synchronizations and data oriented synchronizations. The key
control oriented synchronization is the barrier since it provides control of
the entire force. Its semantics are that all processes must execute a Barrier
macro before one arbitrarily chosen process executes the code block between
Barrier and End Barrier. When the code block is complete, the entire force
begins execution at the statement following the End Barrier. Although all
but one process are temporarily suspended by a barrier, no process
termination or creation takes place and all local process states are preserved
across the barrier. Operations which depend on the past computation, or
determine the future progress, of the entire force are typically enclosed in a
barrier.
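The barrier semantics described here correspond closely to Python's threading.Barrier with an action callable, which is run by one thread once all have arrived and before any is released. The sketch below uses that correspondence as an analogy only; barrier_demo and its variables are hypothetical and do not reflect how the macros expand.

```python
import threading


def barrier_demo(nprocs):
    partial = [0] * nprocs          # global array, one disjoint slot per process
    result = {"total": None}        # global written only inside the barrier block
    seen = [None] * nprocs          # what each process observes afterwards

    def barrier_block():
        # Executed once, by one arbitrary thread, after all have arrived.
        result["total"] = sum(partial)

    barrier = threading.Barrier(nprocs, action=barrier_block)

    def body(me):
        partial[me] = me + 1        # work preceding the barrier
        barrier.wait()              # plays the role of Barrier ... End barrier
        seen[me] = result["total"]  # local state survived; block result visible

    threads = [threading.Thread(target=body, args=(me,)) for me in range(nprocs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen
```

Every process resumes only after the block has finished, so each one reads the same total, and with a single process the barrier degenerates to ordinary sequential execution of the block.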
Another control based synchronization is the critical section, familiar
from the operating systems literature. Statements between
Critical <variable> and End Critical may only be executed by one process of
the force at a time. This mutual exclusion extends to any other critical
section with the same associated variable. Data oriented synchronization is
provided by the elementary producer-consumer mechanism, in which global
variables have a binary state, full or empty, as well as a value. Execution by
some process of the macro, Produce <variable> = <expression>, waits for
the variable to be in the empty state, sets its value to that of the expression
and makes it full, all in a manner which is atomic with respect to the progress
of any other process. Similarly, the macro, Use(<variable>), appearing in an
expression returns the value of the variable when it becomes full and sets it
empty. Variables in the wrong state may cause these macros to block the
progress of a process. Auxiliary macros for full/empty variables are
Purge <variable>, which sets a variable empty regardless of its previous
state, and Copy(<variable>), which waits for the variable to be full and
returns its value but does not empty it.
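The full/empty discipline of Produce, Use, Copy and Purge can be modeled in software with a condition variable. The class below is a sketch of the semantics stated in the text, not of any hardware or macro implementation; the class name, method names and the pipe_values helper are assumptions made for illustration.

```python
import threading


class FullEmpty:
    """A shared cell carrying a value and a binary full/empty state."""

    def __init__(self):
        self._cond = threading.Condition()
        self._full = False
        self._value = None

    def produce(self, value):
        """Wait for empty, write the value, set full."""
        with self._cond:
            self._cond.wait_for(lambda: not self._full)
            self._value, self._full = value, True
            self._cond.notify_all()

    def use(self):
        """Wait for full, read the value, set empty."""
        with self._cond:
            self._cond.wait_for(lambda: self._full)
            self._full = False
            self._cond.notify_all()
            return self._value

    def copy(self):
        """Wait for full and read the value, leaving the cell full."""
        with self._cond:
            self._cond.wait_for(lambda: self._full)
            return self._value

    def purge(self):
        """Set empty regardless of the previous state."""
        with self._cond:
            self._full = False
            self._cond.notify_all()


def pipe_values(values):
    """Pass a list of values, one at a time, from a producer thread to the caller."""
    cell = FullEmpty()
    producer = threading.Thread(
        target=lambda: [cell.produce(v) for v in values])
    producer.start()
    received = [cell.use() for _ in values]
    producer.join()
    return received
```

Since produce blocks while the cell is full and use blocks while it is empty, the two threads alternate strictly and the values arrive in order without any additional locking.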
A major weakness in the current set of force macros is that it does not
smoothly support decomposition of a program into parallel components on
the basis of functionality. The Pcase macro offers the rudiments of this, but
only allows one process to execute each of the parallel functions. What is
desired is a macro, Resolve, which will resolve the force into components
executing different parallel code sections. The section of code for each
component would start with Component <name> strength <number>, which
would name the component and specify the fraction of the force to be
devoted to this component. The component strengths would be estimated by
the programmer on the basis of any knowledge available about the
computational complexity of each component. A macro, Unify, would reunite the
components into a single force. The implementation of Resolve is
complicated by the conflicting demands of generality and efficiency. If the
number of components is larger than the number of processes in the force, then
inter-component synchronization may deadlock unless the components are
co-scheduled over the available processes. An implementation which
produces process rescheduling at every possible deadlock point and is still
efficient when the number of processes exceeds the number of components is
under development.
Incorporation of a Resolve macro will make it useful to extend the barrier
idea. A barrier should be able to specify whether only the processes in the
current component are to be blocked or whether all processes in the parent
force are to participate. In the case of recursively nested Resolve constructs,
the barrier might specify a nesting level relative to the one in which it
appears.
The Resolve idea promises a mechanism for functional decomposition of
programs into parallel components, but there is one more capability of
parallel programming environments with explicit process management which is
not addressed by the force. This is the ability to give away work to
"available" processes in a dynamic manner during execution. This ability is
most called for by tree algorithms and dynamic divide-and-conquer methods. It
would be desirable for the force to contain a mechanism for efficiently
handling such algorithms without making the user responsible for explicit
process management or losing the benefits of independence of the number of
processes. A mechanism related to Resolve might be applied at each tree node
but could lead to much process management overhead in cases where the correct
thing to do is merely to traverse a subtree with the one remaining process.
Status and Applications 
The force macros described above represent a parallel programming
environment in which process management is suppressed, and programs are
independent of the number of processes executing them, except for
performance. The system makes parallel execution the normal mode; sequential
operation must be explicitly invoked. Two features combine to ensure that
there is no topological structure to the parallel environment. First, processes
are identical in capability, and, second, all variables are either strictly local
to one process or uniformly shared among all of them. This eliminates much
of the complexity of the "mapping problem" encountered in constructing
parallel versions of algorithms for machines with visible processor topology.
Primitive operations of the force are available to support both
fine-grained and coarse-grained parallelism. Many of the primitives, especially
those supporting fine grained interaction, require only local analysis to
determine correctness of the synchronization. This locality strengthens the
case for being able to automate this analysis. The ability to recursively
subdivide the force, coupled with the support for parallelization on the basis
of data partitioning, orients the system towards "massive" parallelism in that
the activity of huge numbers of processes can be compactly specified.
The system is currently tied fairly tightly to shared memory with
undifferentiated processes and, for that reason, does not support message
passing. One could view the Produce and Consume primitives as a weak form
of send and receive operations with the associated variable playing the role of
an unbuffered, one word, message channel.
The force system has been used to produce a parallel Gaussian
elimination subroutine [2] identical in interface and operation to the SGEFA
routine of LINPACK [6]. As well as being effective in this library subroutine
type of application, it has been used to write large parallel fluid dynamics
programs, including SOR algorithms for incompressible flow [7], [8] and
MacCormack's method for a shock tube model [9]. It has also been used to
implement a new parallel pivoting algorithm for solving sparse systems of
linear equations [10].
The Machines 
The issues which arise in implementing the force on a shared memory
multiprocessor will be addressed by considering implementations on two,
fairly different, such machines: the Denelcor HEP [3] and the Flexible
Computer Systems Flex/32 [11]. Not only are the two systems fairly different in
architecture, the HEP being a pipelined multiprocessor while the Flex/32 is
built from multiple microprocessors, but the primitive operations for
establishing and controlling parallel processes which are supported by the
systems are quite different. These parallel primitive operations are a combined
result of hardware, compiler support, operating system and run-time libraries.
A summary of the hardware, parallelism model and primitive operations for
each of the machines follows.
The HEP 
The HEP computer is a multiple instruction stream computer
categorized as MIMD by Flynn [12]. Several processing units, called Process
Execution Modules (PEMs), may be connected to a shared memory consisting of
one or more memory modules as shown in Fig. 2. Even within a single PEM,
however, HEP is still an MIMD computer. Only the number of instructions
actually executing simultaneously, about 12 per PEM, changes when more
PEMs are added to a system. Separate memories store program and data
with smaller memories devoted to registers and frequently used constants.
Only data memory is shared between PEMs. We will concentrate on the
architecture of a single PEM which implements multiprocessing by using the
technique of pipelining.
[Figure 2: Architecture of the HEP Computer. Labeled components: Process
Execution Module; Program Memory, 32K words by 64 b/w; Register Memory,
2Kw by 64b; Constant Memory, 4Kw by 64b; MIMD Processing Unit; Pipelined
Switch; PEM; PEM I/O.]
There are several separate, interacting pipelines in a PEM but the major
flavor of the architecture can be given by considering only one of them, the
main execution pipeline. Heavy use has been made of pipelines in vector
processors (SIMD computers). In such machines the operating units are broken
into small stages with data storage in each stage. Complete processing of a
pair of operands involves the data passing sequentially through all stages of
the "pipeline." Parallelism is achieved by having different pairs of operands
occupying different stages of the pipeline simultaneously. The main
execution pipeline of HEP can be viewed as a unified structure which processes
most instructions using a pipeline with eight steps. Independent instructions
(along with their operands) flow through the pipeline with an instruction
being completely executed in eight steps. Independence of the activities in
successive stages of the pipeline is achieved not by processing independent
components of vectors but by alternately issuing instructions from
independent instruction streams. Multiple copies of process state, including
program counter, are kept for a variable number of processes. A PEM is an MIMD
processor in exactly the same sense in which a pipelined vector processor is an
SIMD machine. In both, independent data items are processed simultaneously
in different stages of the pipeline while in the HEP, independent
instructions occupy pipeline stages along with their data.
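The effect of issuing from independent instruction streams can be illustrated with a toy simulation of an eight step pipeline. The model is an assumption made for illustration, not a description of the actual HEP issue logic: each stream is allowed only one instruction in the pipeline at a time, streams are served round-robin, and memory instructions are ignored.

```python
from collections import deque


def instructions_completed(num_streams, steps, depth=8):
    """Count instructions finished after `steps` pipeline steps.

    Assumes each stream has at most one instruction in flight and rejoins
    the issue queue when that instruction leaves the last stage.
    """
    ready = deque(range(num_streams))   # streams waiting to issue
    pipeline = deque([None] * depth)    # pipeline[0] is the first stage
    completed = 0
    for _ in range(steps):
        finished = pipeline.pop()       # whatever leaves the last stage
        if finished is not None:
            completed += 1
            ready.append(finished)      # that stream may issue again
        # Issue from the next ready stream, or insert a bubble if none.
        pipeline.appendleft(ready.popleft() if ready else None)
    return completed
```

In this model a single stream completes one instruction every eight steps, eight streams keep every stage busy, and further streams add no throughput to this one pipeline.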
The previous paragraph describes the register to register instructions.
Those dealing with main memory (data memory) behave differently. Data
memory is shared between PEMs and words are moved between register and
data memories by means of a class of Storage Function Unit (SFU)
instructions. The relationship between the main execution pipeline and the SFU
is shown in Fig. 3. A process is characterized by a Process Status Word (PSW)
containing a program counter and index offsets into both register memory
and constant memory to support the writing of reentrant code. Under the
assumption that multiple processes will cooperate on a given job or task and
thus share memory, memory is allocated and protected on the basis of a
structure called a task. There are a maximum of 16 tasks, eight supervisor
tasks and eight user tasks. The 128 possible processes are divided into a
maximum of 64 user and 64 supervisor processes which must belong to tasks
of corresponding types. Aside from this restriction a task may have any
number of processes, from zero to 64.
An active process is represented in the hardware by a Process Tag (PT)
which points to one of the 128 possible PSWs. The instruction issuing
operation maintains a fair allocation of resources between tasks first and
between processes within a task second by means of 16 task queues, each
containing up to 64 PTs and a secondary queue called the snapshot queue. PTs
coming one at a time from the snapshot queue cause the issuing of an
instruction from the corresponding process into the execution pipeline.
When an SFU instruction (data memory access) is issued, the PT leaves
the queues of the main scheduler and enters a second set of identical queues
in the SFU. When a PT comes to the head of the SFU snapshot queue a
memory transaction is built and sent, along with the PT, into the attached
node of a pipelined, message-switched switching network. The transaction
propagates through the switch to the appropriate memory bank and returns
to the SFU with status and perhaps data. An SFU instruction behaves as if
it were issued into a pipeline longer than the eight step execution pipeline
but with the same step rate.
[Figure 3: HEP Pipeline Architecture. Labeled components: 16 Task Queues
of up to 64 Process Tags; Instruction & operand fetch; Execution Pipeline;
Store Result; Relink; SFU Instruction Routing; SFU; Relink.]
Hardware support for process synchronization is based on
producer/consumer synchronization. Each cell in register and data memories
has a full/empty state and synchronization is performed by having an
instruction wait for its operands to be full and its result empty before
proceeding. The synchronizing conditions are optionally checked by the
instruction issuing mechanism and, if not fulfilled, cause the PT to be
immediately relinked into its task queue with the program counter of the
PSW unaltered.
Compiler level support consists of minimal language extensions to give
the user access to the parallelism of the hardware. The extensions can be
represented as subroutine calls or incorporated into the language definition.
Since the force is based on Fortran, the extensions to that language are
described. To allow for the fact that an independent process usually requires
some local variables, the process concept is tied to the Fortran subroutine.
The Fortran extension is merely a second version of the CALL statement,
CREATE. Control returns immediately from a CREATE statement, but the
created subroutine, with a unique copy of its local variables, is also executing
simultaneously. The RETURN in a created subroutine has the effect of
terminating the process executing the subroutine. Parameters are passed by
address in both CALL and CREATE.
The only other major conceptual modification to Fortran allows access to
the synchronizing properties of the full/empty state of memory cells. Any
Fortran variable may be declared to be an "asynchronous" variable.
Asynchronous variables are distinguished by names beginning with a $ symbol and
may have any Fortran type. They may appear in Fortran declarative
statements and adhere to implicit typing rules based on the initial letter. If such
a variable appears on the right side of an assignment, wait for full, read and
set empty semantics apply. When one appears on the left of an assignment,
the semantics are wait for empty, write and set full. To initialize the state
(not the value) of asynchronous variables, a new statement, PURGE, sets the
states of asynchronous variables to empty regardless of their previous states.
The HEP Fortran extensions of CREATE and asynchronous variables are
the simplest way to incorporate the parallel features of the hardware into the
Fortran language. Since process creation is directly supported by the HEP
instruction set and any memory reference may test and set the full/empty
state that is associated with each memory cell, the Fortran extensions are
direct representations of hardware mechanisms. The parallel computation
model supported by the Fortran compiler and run time system can thus be
viewed as shown in Fig. 4. A process with its own program counter and
registers may spawn others like it using CREATE, and the processes interact by
way of full/empty shared memory cells.
The parallel programming primitive operations can be characterized as
in Table 1. Note that all the parallel primitives are user level operations
requiring no operating system intervention. Interrupts are not present in the
HEP. Conditions which would normally lead to an interrupt, including
supervisor calls, result in the creation of a supervisor process to handle the
condition and may or may not suspend the process giving rise to the
condition.
[Figure 4: HEP Run Time System Model. Labeled components: Program
Counter; General Registers; Instruction Set Processor; Shared Memory with
full/empty bits.]
    Create
    Quit and save state
    Set location empty
    Produce  - Wait for empty, write and fill
    Consume  - Wait for full, read and empty
Table 1: HEP Parallel Primitives
The Flex/32
The architecture of the Flex/32 is conceptually simpler than that of the
HEP, but the system support for parallelism is more complex. The machine
consists of a set of single board microcomputers connected by several buses to
each other and to some common memory and synchronization hardware. As
shown in Fig. 5, there are a set of local buses, ten of them, each of which can
connect two boards, which are either single board computers consisting of
processor and memory or mass memory boards. Two common buses connect
the local buses together and to the common memory and synchronization
hardware. The memory on the common bus is faster for a processor to access
than that on the mass memory boards, but both are shared by all processors.
The memory on a processor board is accessible only to that processor.
Hardware support for synchronization is supplied by an 8192 bit lock
memory. This structure is meant to remove the requirement for repeated
tests by a processor trying to obtain a lock. There is an interrupt system
connected with each processor, which provides underlying hardware support
for an event signaling mechanism between processors as well as for exception
handling within a single processor.
The processor/memory boards are based on the National Semiconductor
32032 microprocessor chip. There may be one or four megabytes of memory
on a board and a VME bus interface is provided to connect an individual
processor to I/O devices. A self-test system, connected to all processors,
provides a mechanism for testing, bootstrapping and initializing the
multiprocessor.
The process model in the Flex/32 is somewhat different from that of the
HEP and is shown pictorially in Fig. 6. Since not all of the address space is
shared, a process has a certain amount of strictly local memory. The system
also manages a unique identifying tag for each process and maintains a
process state which may be one of: running, non-existent, dormant, ready or
suspended. There is also a received message queue for each process which is
managed by the system.

[Figure 5: Flex/32 Architecture. Two 512K byte shared memory units and the 8192 bit synchronization lock memory sit on the dual common bus; bus switches (S) join it to the local buses, which carry processor-memory boards and mass memory boards.]

[Figure 6: Flex/32 Run Time System - Process Model. Elements shown: program counter, general registers, local memory, process state, tag, and received message queue.]
In addition to a slightly more complicated process model, the Flex/32
system supports a more complex model of synchronization facilities linking
processes. The total systems model is shown in Fig. 7. At the outset,
processes are bound to individual processors. The processors may be
multiprogrammed, so more than one process may be bound to a processor. The
processes share communication and synchronization support supplied by the
operating system. The Signaling Channels implement the Event mechanism
and may be attached to a process as a receiver of the event, an originator, or
both. Lock bits may also be connected to several processors for mutual
exclusion enforcement. The message passing facility is represented by the received
message queue in each process and is thus not shown separately in the system
model.
The Flex/32 system provides numerous parallel processing primitives.
They may be divided into classes dealing with four different parts of the
system model: Processes, Messages, Events and Locks. The structures
associated with each of these parts and the primitives which act on the structures
are summarized in Table 2. The primitives are implemented through system
calls. Since most of them interact with the multiprogramming of single
processors, operating system intervention is usually required. Only a small part
of this fairly extensive parallel programming model is needed to support the
implementation of the force constructs.
[Figure 7: Flex/32 Run Time System - Overall Structure. Elements shown: processors, processes, and signaling channels.]
Process
    Structure:   State: running, suspended, ready, dormant, nonexistent
                 Tag: unique, system-wide identifier
    Primitives:  get tag, start up, kill, create,
                 wait for termination, give up processor
Messages
    Structure:   type, length, pointer, source id, destination id
    Primitives:  send, receive-wait, receive-fail
Events
    Structure:   list of sources and destinations
    Primitives:  configure, remove, activate, wait, passive test,
                 on event call, set timer
Locks
    Structure:       8192 single bits
    Operating mode:  polling or interrupt
    Primitives:      allocate, lock, unlock

Table 2: Flex/32 Parallel Primitives
Implementation of Force Primitives 
Basic hardware support for synchronization on the HEP is through the
produce and consume operations on full/empty memory cells. The basic
hardware support for synchronization on the Flex/32 is supplied by the
common lock memory and the interrupt hardware. Table 3 compares the
implementation of critical sections on the two machines. The implementations are
very similar, but a detailed look at the differences will introduce the issues that
arise in the more disjoint implementations of other primitives to follow.

HEP
    System state and initialization:
        Single full/empty variable - full
    Critical section code:
        Consume critical section variable
        Execute code body
        Produce critical section variable
    Performance:
        Consume and produce are single user-mode instructions,
        but may result in some resource usage by waiting processes.
Flex/32
    System state and initialization:
        Single bit lock - clear
    Critical section code:
        Set critical section lock
        Execute code body
        Clear critical section lock
    Performance:
        Set and clear locks are done by system calls.
        Processor rescheduling is possible, and wakeup of
        a delayed process may be by interrupt or polling.

Table 3: Implementation of Critical Sections

The basic HEP synchronization is somewhat more powerful than is
needed for critical sections. A single full/empty variable suffices to control
entry to the section, but only its state is significant; the value of the variable
is unused. The Flex/32 locks are well suited in complexity to what is needed
for critical section control. The process delay which may be required on critical
section entry is supported by the hardware of the HEP, making critical
section entry a user level operation with no operating system intervention.
On the other hand, a small amount of system resources is consumed by waiting
processes, which may cause congestion if many processes wait simultaneously.
The Flex/32 implements locking and unlocking through system calls.
This is costly in terms of performance but allows processor rescheduling.
Wakeup of blocked processes may either be by polling or by interrupt.
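The two critical section schemes can be compared side by side in a small model. This is a hedged sketch in Python rather than Force macro output: the function names `run_hep_style` and `run_flex_style` are hypothetical, a one-slot `queue.Queue` is assumed as a stand-in for a full/empty variable (get plays the role of consume, put the role of produce), and `threading.Lock` stands in for a single bit of the Flex/32 lock memory.

```python
import queue
import threading


def _run(workers, body):
    """Start `workers` threads running `body` and wait for all of them."""
    threads = [threading.Thread(target=body) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def run_hep_style(workers=4, iterations=500):
    """Critical section guarded by a full/empty variable initialized full."""
    gate = queue.Queue(maxsize=1)
    gate.put(None)              # state "full"; the stored value is never used
    total = [0]

    def body():
        for _ in range(iterations):
            gate.get()          # consume: wait for full, leave empty
            total[0] += 1       # critical section code body
            gate.put(None)      # produce: refill, admitting one waiter

    _run(workers, body)
    return total[0]


def run_flex_style(workers=4, iterations=500):
    """Critical section guarded by a single bit lock initialized clear."""
    lock = threading.Lock()
    total = [0]

    def body():
        for _ in range(iterations):
            lock.acquire()      # set critical section lock
            total[0] += 1       # critical section code body
            lock.release()      # clear critical section lock

    _run(workers, body)
    return total[0]
```

Both functions return `workers * iterations`, since every increment is performed under mutual exclusion. The model says nothing about cost: it does not reproduce the difference between user-mode instructions on the HEP and system calls on the Flex/32.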
There is considerably more structure to the implementation of the Barrier
macro on both machines. Table 4 summarizes the implementations,
including two implementations for the HEP having quite different performance
characteristics. The two HEP implementations emphasize the
difference between suspended and partially active waiting, which was mentioned
in connection with the critical section code. This issue was not important
in connection with critical sections because the control is very simple
and because the probability that many processes will simultaneously wait on
entry to critical sections with the same lock is low. In the Barrier, it is
guaranteed that all processes simultaneously access the same blocking condition.
There is only one implementation for the Flex/32 since all synchronization
support is through operating system calls and involves process suspension
rather than active waiting.

HEP - Active Waiting
    System State Initialization
        Entry lock clear
        Exit lock set
        Counter zero
    Barrier Code
        Wait for entry lock clear
        Count arriving process
        If last process then
            execute code body
            set entry lock
            clear exit lock
        Wait for exit lock clear
        Count exiting process
        If last process then
            set exit lock
            clear entry lock
HEP - Process Suspending
    System State Initialization
        Process state save area empty
        Counter zero
    Barrier Code
        Count arriving process
        If not last one then
            save state and quit
        else
            recreate other processes
            clear counter
Flex/32
    System State Initialization
        Barrier event connected to all processes
        as source/destination
        Counter zero
    Barrier Code
        Lock counter
        Count arriving process
        Clear counter if last
        Unlock counter
        If last process then
            Execute code body
            Activate barrier event
        else
            Wait for barrier event

Table 4: Implementation of Barriers
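The Flex/32 barrier steps in Table 4 can be sketched as follows. This is an illustrative Python model under stated assumptions, not the Flex/32 system calls: `OneShotBarrier` and `demo` are hypothetical names, `threading.Lock` stands in for the counter lock, and `threading.Event` stands in for the barrier event. The sketch handles a single barrier episode only; reusing it would require resetting the event safely, which is not shown.

```python
import threading


class OneShotBarrier:
    """Counter plus event barrier for a single synchronization episode."""

    def __init__(self, nprocs):
        self.nprocs = nprocs
        self.counter = 0                      # counter zero
        self.counter_lock = threading.Lock()  # protects the counter
        self.event = threading.Event()        # barrier event

    def arrive(self, body):
        with self.counter_lock:               # lock counter
            self.counter += 1                 # count arriving process
            last = self.counter == self.nprocs
            if last:
                self.counter = 0              # clear counter if last
        # the counter is unlocked on leaving the with block
        if last:
            body()                            # execute code body
            self.event.set()                  # activate barrier event
        else:
            self.event.wait()                 # wait for barrier event


def demo(nprocs=5):
    """Run nprocs threads through the barrier and return the event log."""
    log = []
    barrier = OneShotBarrier(nprocs)

    def worker(i):
        log.append(("before", i))
        barrier.arrive(lambda: log.append(("body", None)))
        log.append(("after", i))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(nprocs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

In the returned log the single "body" entry follows every "before" entry and precedes every "after" entry, which is the ordering the barrier is meant to guarantee.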
The critical section and the Barrier implementations serve to give an
idea of the range of differences in the implementation of Force primitives on
the two architectures. Many of the primitives, such as prescheduled DOALL,
did not change at all between the machines, while others, such as
self-scheduled DOALL, build on the same techniques used in the critical section
and Barrier. One other implementation issue which deserves mention is the
implementation of a data oriented synchronization on a machine which has
hardware support only for control oriented synchronization.
The Force includes primitive operations for the simplest data oriented
synchronization, produce and consume. The HEP hardware supports these
operations directly, using the full/empty state bit for each memory cell. In
the Flex/32, locks are separate items, not associated with data. To implement
producer/consumer synchronization, a boolean data item must be allocated
to the full/empty state and a lock must be allocated to bind the data
transfer to the state change as an atomic unit. The lock itself cannot be used
to model the full/empty state because there is no way to bind it to the data
transmission. Furthermore, since the full/empty state is a data item, the
system supported process waiting mechanism cannot be used to wait for its
change. Critical section code must be repeatedly executed to monitor a
change in the state variable. In contrast, it is very easy to model the
lock/unlock synchronization using produce/consume. The full/empty state
of a memory cell is used for the lock and the value of the cell is simply
ignored.
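The construction of produce and consume from a lock and a boolean state item can be sketched as below. This is a minimal Python model, not Flex/32 code: `LockedCell` and `demo` are hypothetical names, `threading.Lock` stands in for a lock bit, and `time.sleep(0)` is an assumed way of yielding between polls. The point illustrated is that the waiting process must re-enter the critical section repeatedly to test the state variable.

```python
import threading
import time


class LockedCell:
    """Produce/consume built from a lock and a boolean full/empty item."""

    def __init__(self):
        self.lock = threading.Lock()  # binds data transfer to state change
        self.full = False             # boolean item modeling full/empty
        self.value = None

    def produce(self, value):
        while True:
            with self.lock:           # critical section
                if not self.full:
                    self.value = value
                    self.full = True
                    return
            time.sleep(0)             # state was full: yield, then poll again

    def consume(self):
        while True:
            with self.lock:           # critical section
                if self.full:
                    self.full = False
                    return self.value
            time.sleep(0)             # state was empty: yield, then poll again


def demo(n=50):
    """Send n integers through one cell and return them as received."""
    cell = LockedCell()
    received = []

    def consumer():
        for _ in range(n):
            received.append(cell.consume())

    t = threading.Thread(target=consumer)
    t.start()
    for i in range(n):
        cell.produce(i)
    t.join()
    return received
```

The lock is held only while the state is tested and the data moved, so the polling loop, not the lock, carries the waiting.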
Conclusions 
The implementation of a parallel programming environment on two
shared memory multiprocessors with quite different architectures has been
described. The primitive operations of the system make fairly efficient
implementations possible on both machines. One major difference has to do with
whether parallelism is supported directly by hardware accessible to the user
or is supported only through the operating system. In the latter case, the
implementer must work in terms of the software run-time model presented
by the system rather than in terms of a model related more directly to the
hardware, which makes the prediction and optimization of performance
somewhat more difficult. The mechanism by which processes wait at a
synchronization is a key issue. If the waiting mechanism is tied to
multiprogramming through the operating system call, throughput will be optimized,
but a large overhead will be incurred for potentially short synchronization
delays.
The use of interrupts in the system architecture leads to natural support
for the Event concept. The implementation of Barrier type synchronizations
can be tied to the event concept fairly naturally. On machines which do not
support events, attention must be paid to minimizing the utilization of
resources by waiting processes. The Barrier differs from the critical section in
this regard because it is guaranteed that many processes will simultaneously
wait at the Barrier while critical section conflict is probabilistic, and the
likelihood of many processes waiting at the entry to a critical section is low in a
normally constructed program.
Standard Bibliographic Page
1. Report No.: NASA CR-178161; ICASE Report No. 86-54
2. Government Accession No.:
3. Recipient's Catalog No.:
4. Title and Subtitle: THE FORCE ON THE FLEX: GLOBAL PARALLELISM AND PORTABILITY
5. Report Date: August 1986
6. Performing Organization Code:
7. Author(s): Harry F. Jordan
8. Performing Organization Report No.: 86-54
9. Performing Organization Name and Address: Institute for Computer Applications in Science and Engineering, Mail Stop 132C, NASA Langley Research Center, Hampton, VA 23665-5225
10. Work Unit No.:
11. Contract or Grant No.: NAS1-17070
12. Sponsoring Agency Name and Address: National Aeronautics and Space Administration, Washington, DC 20546
13. Type of Report and Period Covered: Contractor Report
14. Sponsoring Agency Code: 505-31-83-01
15. Supplementary Notes: Langley Technical Monitor: J. C. South. Additional support provided by AFOSR Grant No. 85-0189. Final Report.
16. Abstract
A parallel programming methodology, called the force, supports the
construction of programs to be executed in parallel by an unspecified, but
potentially large, number of processes. The methodology was originally developed on a
pipelined, shared memory multiprocessor, the Denelcor HEP, and embodies the
primitive operations of the force in a set of macros which expand into multiprocessor
Fortran code. A small set of primitives is sufficient to write large parallel
programs, and the system has been used to produce 10,000 line programs in
computational fluid dynamics. The level of complexity of the force primitives is
intermediate. It is high enough to mask detailed architectural differences between
multiprocessors but low enough to give the user control over performance.
The system is being ported to a medium scale multiprocessor, the Flex/32, which
is a 20 processor system with a mixture of shared and local memory. Memory
organization and the type of processor synchronization supported by the hardware
on the two machines lead to some differences in efficient implementations of the
force primitives, but the user interface remains the same. An initial
implementation was done by retargeting the macros to Flexible Computer Corporation's
ConCurrent C language. Subsequently, the macros were caused to directly produce
the system calls which form the basis for ConCurrent C. The implementation of
the Fortran based system is in step with Flexible Computer Corporation's
implementation of a Fortran system in the parallel environment.
17. Key Words (Suggested by Author(s)): multiprocessors, shared-memory, parallel programming
18. Distribution Statement: 61 - Computer Programming and Software; 62 - Computer Systems; Unclassified - Unlimited
19. Security Classif. (of this report): Unclassified
20. Security Classif. (of this page): Unclassified
21. No. of Pages: 21
22. Price: A02
For sale by the National Technical Information Service, Springfield, Virginia 22161
NASA Langley Form 63 (June 1985)
