A performance analysis of the PASLIB version 2.1X SEND and RECV routines on the finite element machine by Knott, J. D.
NASA Contractor Report 172205 NASA-CR-172205
19830026326 
A Performance Analysis of the PASLIB Version 2.1X 
SEND and RECV Routines on the Finite Element Machine 
J. D. Knott 
Kentron International, Inc. 
Kentron Technical Center 
Hampton, Virginia 23666 
Contract NAS1-16000
August 1983 
National Aeronautics and
Space Administration
Langley Research Center
Hampton, Virginia 23665
NF02512 
https://ntrs.nasa.gov/search.jsp?R=19830026326 2020-03-21T01:01:02+00:00Z
1. SUMMARY
The Finite Element Machine is an experimental array processor
designed to support research in parallel algorithms and architectures.
This report presents the results of a case study of communications
using the SEND and RECV system software routines on the Finite Element
Machine, followed by a discussion of the impact of communications
overhead on the efficiency of parallel algorithms.
2. INTRODUCTION
Computer performance has traditionally been determined by the
architecture of the computer and the level of technology used to
implement that architecture. Computer performance can be enhanced by
using a faster technology for a given architecture, or through the use
of innovative architectures which allow the various components of
computation to proceed in parallel. The technological approach has
brought about significant increases in computational power, but
technology is thought by many to be approaching its physical limits.
In contrast, the exploration of parallel architectures is in its
infancy. The use of highly parallel architectures to provide
supercomputer performance has become a major area of interest in
computer science research. An example of such research is the Finite 
Element Machine (FEM) currently under construction at the NASA Langley 
Research Center. 
The Finite Element Machine is an array of 36 asynchronous
microcomputers (the Array), each capable of functioning as a
stand-alone computer, coupled to a minicomputer front end (the Controller).
Current topics of interest in the FEM project include computer
architecture, data management, Array control, and parallel algorithms.
A critical factor affecting algorithm structure and efficiency on
array computers is the interconnection of processors and the
efficiency of inter-processor I/O. This paper presents an overview of
the architecture of the Finite Element Machine and the systems level
software implemented to support applications, and investigates the
performance of I/O on the Array using the SEND and RECV system library 
routines. Several examples are then given to illustrate the impact of 
I/O performance on the parallel decomposition of algorithms. 
3. THE FINITE ELEMENT MACHINE ARCHITECTURE 
The Finite Element Machine is an experimental parallel array 
processor designed to support research in architectures and algorithms 
for parallel computation [1]. FEM consists of a minicomputer front-end
and an array of 36 asynchronous microcomputers. A block diagram of the
FEM architecture is shown in figure 1. An overview of the FEM
architecture is given below. For a detailed explanation of the
architecture of the Array, see [2]. 
3.1 The Controller 
The Controller is a conventional sequential minicomputer which is 
used to provide program development tools and mass storage for the 
Array. Programs for the Finite Element Machine are compiled and link
edited on the Controller and then downloaded to the Array. The
Controller also hosts the user interface to the Array which supports
problem definition, task initiation and monitoring, interactive
debugging of user tasks on the Array, and the uploading and analysis
of results.
3.2 The Array 
Each microcomputer in the Array contains 3 circuit boards known as
the CPU, I/O-1, and I/O-2 boards.
The CPU board contains a 16 bit microprocessor, 16 Kbytes of
Erasable Programmable Read Only Memory (EPROM), 32 Kbytes of dynamic
Random Access Memory (RAM), a floating point arithmetic unit, two
timers, and serial and parallel I/O interfaces. There is no shared
memory in the Finite Element Machine. However, processors can share
information over any of the four communication paths provided by the
architecture. These communication paths consist of a network of 
nearest neighbor connections (local links), a time multiplexed global 
bus, a cooperative binary flag network, and a cooperative sum/maximum 
computation network. The SEND and RECV routines only utilize the local 
links and the global bus. 
The I/O-1 board contains circuitry for twelve reconfigurable
serial communication links and the cooperative sum/maximum network.
Each of the local communication links is a 1.5 MHz bit serial
interface with an associated hardware FIFO buffer capable of storing
up to 16 words (16 bits per word) of input data. The local links are
normally connected in a toroidal eight nearest neighbor scheme (see
figure 2). The link configuration is determined by the physical
connections made on the front edge of the I/O-1 board and is in no way
constrained by software. This allows the connectivity of processors in
the Array to be modified (prior to execution) to support a wide range
of interconnection schemes. The perfect shuffle, perfect shuffle
nearest neighbor, and the cube connected cycle are but a few examples
of the interconnection topologies possible on FEM [3,4].
The I/O-2 board contains the circuitry for the global bus and the
cooperative flag network. The global bus is a 1.25 MHz time
multiplexed 16-bit parallel bus which can transmit to individual
processors in the Array, or to all cooperating processors in the
Array. All processors on the global bus have equal priority and the
bus awards priority on a first come, first served basis [5]. The global
bus has hardware FIFO buffers on the input and output lines, each 
capable of storing up to 64 words of data. 
4. SYSTEM SOFTWARE 
Three system software packages support applications on FEM. FEM
Array Control Software (FACS) provides the user interface to the
Array, the Nodal Executive (Nodal Exec) is the microcomputer operating
system, and the PASCAL Library (PASLIB) supports PASCAL access to the
unique architectural features of the Finite Element Machine.
4.1 FEM Array Control Software (FACS)
FACS is a collection of menu driven, user friendly commands used 
to control the Finite Element Machine. FACS resides on the Controller 
and communicates with the Array via the global bus. The Controller 
appears to the Array as just another processor on the global bus, and 
interfaces directly with the Nodal Executive operating system on a 
request/acknowledge basis. 
4.2 Nodal Executive 
Nodal Exec is the microcomputer operating system embedded in EPROM 
on each processor in the Array. Nodal Exec contains a kernel and a set 
of command routines. The kernel provides standard operating system 
functions such as monitoring, I/O primitives, interrupt handling, and
memory management. The command routines support the Controller/Array 
interface and are accessed via the operating system monitor. Command 
routine services are requested by the Controller for such functions as 
downloading object code, starting and stopping tasks, and uploading
results from the Array. 
4.3 PASCAL Library (PASLIB) 
PASLIB is a PASCAL callable subroutine library which allows 
programmers to access the nonstandard architectural features of FEM. 
Application programs for the Finite Element Machine are written in 
PASCAL with PASLIB procedures linked in as external procedures. PASLIB
routines are analogous to the supervisor call on conventional systems
in that they provide an interface to the Nodal Exec operating system
on the microcomputer, but they additionally provide run-time support
for mathematical functions utilizing the floating point unit, and for 
the special I/O capabilities of the Array. 
5. I/O PERFORMANCE OF SEND AND RECV 
SEND and RECV are PASLIB routines used to transmit and receive 
data items over the global bus or local communication links. Data
items range in size from 1 to 255 sixteen bit words. This section 
describes the operation of the SEND and RECV PASLIB routines and their 
associated interrupt handlers. 
5.1 SEND 
The SEND routine is used to transfer data to a neighboring
processor over the global bus or a local link. Data can be transferred
in a synchronous or asynchronous I/O mode although the mode is
transparent to the SEND call itself.
When the SEND call is executed the processor branches to the SEND 
subroutine and copies the data to be transferred into an output 
buffer. SEND then calculates and appends a checksum, places the buffer 
on the appropriate output queue (either local or global), enables send 
interrupts, and returns to the calling program.
When the send interrupt occurs, control is transferred to the 
interrupt handling routine and data is transferred from the output 
queue to the receiving processor. If the receiver can handle the 
incoming data as fast as or faster than the transmitting processor, 
one interrupt entry is all that is required. However, if the receiver
removes the incoming data at a slower rate, the hardware FIFO buffer 
fills and the transmitting processor will return control to the 
interrupted process until the FIFO's can again accept data. 
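The SEND sequence described above can be sketched in Python as a rough model. This is an illustrative sketch only, not the PASLIB or Nodal Exec code: the function names, the 16-bit wraparound checksum, and the use of a Python list as the hardware FIFO are all assumptions, since the report does not specify the checksum algorithm or the internal data structures.

```python
from collections import deque


def checksum16(words):
    # Assumed checksum: 16-bit wraparound sum. The report does not say
    # which checksum SEND actually computes.
    return sum(words) & 0xFFFF


def send(data, output_queue):
    """Model of the SEND call: copy the data, append a checksum, enqueue."""
    if not 1 <= len(data) <= 255:
        raise ValueError("data items range from 1 to 255 sixteen-bit words")
    buffer = [w & 0xFFFF for w in data]   # copy into an output buffer
    buffer.append(checksum16(buffer))     # append the checksum
    output_queue.append(buffer)           # local or global output queue
    # On the real machine, send interrupts would be enabled here and
    # control would return to the calling program.


def send_interrupt(output_queue, fifo, fifo_depth):
    """Model of one entry into the send interrupt handler.

    Moves words from the head buffer into the FIFO until the buffer is
    empty or the FIFO is full. Returns True when the buffer is finished.
    """
    buffer = output_queue[0]
    while buffer and len(fifo) < fifo_depth:
        fifo.append(buffer.pop(0))
    if not buffer:
        output_queue.popleft()
        return True
    return False  # FIFO full: return to the interrupted process
```

In this model a slow receiver shows up as repeated `send_interrupt` entries for a single `send`, which is the effect the timing study goes on to measure.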
5.2 RECV 
The RECV routine is used to receive data transmitted over the 
global bus or a local link. The operation of the RECV call is 
determined by the I/O mode, either synchronous or asynchronous. 
When data is detected on either the global bus or local links, a
receive interrupt is generated and the receiving processor branches to 
the receive interrupt handling routine. 
In the synchronous I/O mode, the interrupt routine buffers
incoming data in a first-in first-out software queue and a call to the 
RECV routine will return the first buffer on the queue. If the queue 
is empty, a RECV call in synchronous mode will wait for data to 
arrive. 
In contrast, the asynchronous mode keeps a copy of the last 
complete data buffer received and uses a temporary buffer to assemble 
incoming data. Whenever the buffer being assembled is complete, the
temporary buffer is written over the permanent copy. The RECV routine
will not wait for data to be received in the asynchronous mode and
will return only the most recently received buffer; any given buffer 
may be read more than once, or may be overwritten by a more recent 
buffer before it can be read. 
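The difference between the two receive modes can be illustrated with a small Python model. The class and method names are hypothetical, and the model returns None where the real synchronous RECV would wait; it is meant only to show the queue semantics versus the latest-copy semantics described above.

```python
from collections import deque


class SyncReceiver:
    """Synchronous mode: buffers are queued first-in first-out."""

    def __init__(self):
        self.queue = deque()

    def on_receive_interrupt(self, buffer):
        self.queue.append(list(buffer))

    def recv(self):
        # The real RECV waits when the queue is empty; this model
        # returns None instead so that it never blocks.
        return self.queue.popleft() if self.queue else None


class AsyncReceiver:
    """Asynchronous mode: only the last complete buffer is kept."""

    def __init__(self):
        self.latest = None

    def on_receive_interrupt(self, buffer):
        # A newly completed buffer overwrites the permanent copy.
        self.latest = list(buffer)

    def recv(self):
        # Never waits; the same buffer may be returned more than once.
        return self.latest
```

With two buffers delivered before any RECV, the synchronous model returns them in order, while the asynchronous model returns only the second one and will return it again on the next call.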
5.3 METHODOLOGY 
The performance of I/O on the Array is determined by the overhead 
of the initial PASLIB routine call and the subsequent interrupt
processing activity. The time for the initial PASLIB calls for the
SEND and RECV routines is governed by two factors, a fixed overhead
for context switching, buffer allocation, and table lookups, and an
incremental cost for data manipulation based on the buffer size. 
Similarly, the interrupt processing routines associated with the SEND 
and RECV calls consist of a fixed and incremental cost. 
Although it is possible to determine the asymptotic rates of 
performance for the communications hardware and the instruction 
execution times for the associated PASLIB and interrupt handling 
routines, these figures do not accurately predict the performance of
I/O under actual operating conditions. The
interrupt processing routines may be entered one or more times for any 
given SEND call and it is precisely this factor which makes it 
difficult to predict the performance of I/O on the Array without 
measuring it in operation. 
In order to measure the performance of the SEND and RECV PASLIB 
routines in actual use, a series of timing studies was run utilizing
both synchronous and asynchronous communication modes on the local
links and global bus. Since the SEND and RECV routines are entered 
only once per call, these times were measured first and then used as a 
basis for determining the interrupt times. Interrupt times were then 
determined by measuring the time for a series of I/O transmissions and 
subtracting the known cost of program control statements and the 
PASLIB routine calls. The results of the timing runs for the SEND and 
RECV routines are given below. 
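The subtraction step in this methodology can be written out as a short Python sketch. The function name and parameters are hypothetical, and the numbers in the usage note are invented purely to show the arithmetic; they are not measurements from the study.

```python
def interrupt_time_ms(total_ms, transfers, control_ms, call_ms_each):
    """Estimate the interrupt processing time per transfer.

    total_ms      -- measured time for the whole series of transfers
    transfers     -- number of I/O transmissions in the series
    control_ms    -- known cost of the program control statements
    call_ms_each  -- known cost of one PASLIB routine call
    """
    if transfers <= 0:
        raise ValueError("transfers must be positive")
    interrupt_total = total_ms - control_ms - transfers * call_ms_each
    return interrupt_total / transfers
```

For example, a hypothetical run of 100 transfers taking 1000 ms in total, with 50 ms of loop control and 0.7577 ms per call, would leave about 8.74 ms of interrupt processing per transfer.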
5.4 SEND PERFORMANCE
The SEND call itself consists of a fixed amount of code for
context swapping and mapping, and a variable amount of code which is
dependent on the size of the buffer being transmitted. This time will
vary for different buffer sizes but will remain constant for each call
with the same buffer size. Likewise, the send interrupt routine
consists of a fixed overhead for entry and exit and a fixed time for
each data item transferred, but the number of entries into the
interrupt handler depends on the availability of the transmitting
medium and can vary from call to call. The first task to accomplish in
measuring the SEND time was to determine the fixed time for the PASLIB
routine call.
The fixed overhead and cost per word for the SEND procedure call
were measured and found to be the same for both communications paths in
either I/O mode. This was expected since the I/O mode is transparent
to the SEND routine. Table 1 gives the fixed overhead and the cost per
word for all calls to the SEND routine.
                    Fixed Overhead    Cost Per Word
                    ==============    =============
All SEND Calls          0.7123            0.0227

Table 1. SEND Procedure Call - Time in Milliseconds
Once the cost of the SEND call was known, the time spent in
processing send interrupts was determined. The results for both I/O
modes are given in figures 3.a through 3.d. Here it can be seen that
although the SEND call rate is constant, the interrupt rate changes at
a fixed point for both the local links and the global bus. This is
directly attributable to the receive interrupt processing rate and the
depth of the hardware FIFOs. The receive interrupt rate, as we shall
see in the next section, is substantially slower than the send
interrupt rate. Therefore, the send interrupt is loading the hardware
FIFOs faster than the receive interrupt can read them, and as soon as
the FIFOs fill up the send interrupt routine exits and waits for room.
With the additional overhead for repeated send interrupt calls, the
send interrupt cost increases dramatically. This occurs on about the
23rd transmission on the local links (16 word FIFOs) and after
approximately 180 transmissions on the global bus (combined FIFO depth
of 128 words). If we graph this data as time per word (figures 4.a
through 4.d), it is obvious that this transition point yields the
lowest cost per word when sending data.
5.5 RECV PERFORMANCE 
The RECV PASLIB procedure and the receive interrupt processing 
routine both contain a fixed overhead and a cost per word. The total 
time for a RECV call is constant for any given buffer size because
there is exactly one procedure call per buffer transfer. The receive
interrupt time can vary depending on the availability of data and the
number of entries into the interrupt processing routine, although it
was discovered that the receive interrupt routine will be entered no
more than once per SEND in the present version of the Nodal Exec
operating system.
The fixed overhead and cost per word for the RECV procedure call
were measured in the synchronous and asynchronous I/O modes on both the
local links and global bus. The fixed overhead and cost per word for
each of the RECV I/O modes are given in Table 2.
                    Fixed Overhead    Cost Per Word
                    ==============    =============
Local Sync              0.3492            0.0548
Global Sync             0.3492            0.0548
Local Async             0.3817            0.0163
Global Async            0.3817            0.0163

Table 2. RECV Procedure Call - Time in Milliseconds
In contrast to the SEND routine, the fixed overhead and cost per
word of the RECV call differ for synchronous and asynchronous
transmissions. This was expected since the I/O mode is embedded in the
RECV and receive interrupt routines. The asynchronous receive must 
lock out receive interrupts while reading the last complete copy of 
data to prevent the interrupt routine from overwriting the data being 
read. As a result, the fixed overhead for the asynchronous mode is 
greater than the fixed overhead for the synchronous mode. However, the 
synchronous cost per word is much greater than the asynchronous 
because the synchronous mode must handle queue pointers in a circular 
buffer whereas the asynchronous simply reads from a fixed buffer. 
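The trade-off between the higher fixed overhead of the asynchronous RECV call and its lower cost per word can be checked with a short Python sketch using the Table 2 values. The function names are hypothetical, and the model covers only the RECV procedure call, not the receive interrupt time.

```python
# (fixed overhead, cost per word) in milliseconds, from Table 2.
# Local and global values are identical, so only the mode is keyed.
RECV_COSTS_MS = {
    "sync": (0.3492, 0.0548),
    "async": (0.3817, 0.0163),
}


def recv_call_ms(mode, words):
    """Linear cost model for one RECV procedure call."""
    if not 1 <= words <= 255:
        raise ValueError("buffers hold 1 to 255 words")
    fixed, per_word = RECV_COSTS_MS[mode]
    return fixed + per_word * words


def break_even_words():
    """Buffer size at which the sync and async call costs are equal."""
    sync_fixed, sync_word = RECV_COSTS_MS["sync"]
    async_fixed, async_word = RECV_COSTS_MS["async"]
    return (async_fixed - sync_fixed) / (sync_word - async_word)
```

Under this model the break-even point falls below one word (about 0.84), so the asynchronous call is already the cheaper of the two for a one-word buffer and the gap widens with buffer size.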
Having found the cost of the RECV call, a series of timing tests 
was run to determine the interrupt processing time. The results are 
given in figures 5.a through 5.d. Here it can be seen that both the
RECV call rate and the interrupt processing rate are linear. This is 
expected since the receive interrupt processing rate is slower than 
the send interrupt rate, and indicates that all data is received in 
one interrupt. The representation of the interrupt time on a per word 
basis is asymptotic (figures 6.a through 6.d) and therefore the larger 
the buffer the less receiving a buffer costs. 
6. IMPACT ON ALGORITHMS 
The cost of I/O on parallel processing arrays is a fundamental 
factor in the performance of most algorithms. In general, efficient
decomposition of algorithms into parallel components requires that the 
time saved for arithmetic operations exceed the time for the I/O 
required to support the distributed computations. This can be 
demonstrated by doing an analysis of a simple summation of 32 single 
precision values. Given that a single precision add takes 475 us, and 
using the total I/O time of 3400 us for two word (single precision) 
transfers, let us look at several approaches to the partial summation 
problem as outlined by Hockney and Jesshope [6]. The first approach is
to simply compute the sum sequentially on one processor. This requires
n-1 additions and no transfers. Given n = 32, the sequential approach
takes (n-1) * 475 us, or a total of 14,725 us. A second approach is to
distribute pairs of values to sixteen processors, sum the pairs, pair
results, sum the pairs, pair results, etc. This method requires log2 n
parallel additions and (log2 n)-1 parallel transfers and would seem to
promise an improvement over the sequential method. However, the total
time for this method is (5 * 475) + (4 * 3400) = 15,975 us, which is more
than the sequential time. Here we have eliminated 26 adds at a savings
of 12,350 us, but we have introduced an additional 13,600 us for
transferring data. A third method for summation might be to divide the
work between only two processors, requiring (n/2) additions and only 1
transfer. The total time for this third approach is only 11,000 us, a
25% reduction of the sequential time. This savings results from
eliminating 15 adds while introducing only one transfer. Clearly,
given the ratio of I/O to arithmetic cost the solution time is more 
dependent on the I/O introduced by the transmission of data than it is 
on the degree of parallelism achieved in the algorithm. This might 
require an approach other than the original FEM concept [7] of
assigning one finite element node to each processor. While assigning
one finite element node per processor minimizes the time for floating
point arithmetic by distributing the work across the Array, it can
increase the total solution time by introducing highly inefficient
I/O. The Multi-Color SOR algorithm [8] demonstrates the effectiveness
of balancing the I/O and arithmetic on the Array by assigning multiple
nodes to a single processor. 
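The arithmetic for the three summation strategies can be reproduced with a few lines of Python. The function names are hypothetical; the 475 us add time, the 3400 us two-word transfer time, and the operation counts are taken from the text, and n is assumed to be a power of two.

```python
ADD_US = 475        # single precision add, from the text
TRANSFER_US = 3400  # total I/O time for a two-word transfer, from the text


def log2_exact(n):
    """Return log2(n) for a power of two, using integer arithmetic."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two and at least 2")
    return n.bit_length() - 1


def sequential_us(n):
    # n-1 additions on one processor, no transfers.
    return (n - 1) * ADD_US


def pairwise_tree_us(n):
    # log2 n parallel additions and (log2 n)-1 parallel transfers.
    steps = log2_exact(n)
    return steps * ADD_US + (steps - 1) * TRANSFER_US


def two_processor_us(n):
    # n/2 additions (including the final combine) and one transfer.
    return (n // 2) * ADD_US + TRANSFER_US
```

For n = 32 these give 14,725 us, 15,975 us, and 11,000 us respectively, matching the figures quoted above.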
Since the I/O SEND and RECV routines contain a fixed overhead, one 
way to decrease the overall cost of I/O is to transmit blocks of data. 
By transmitting data in blocks, the overall cost per word is decreased
because the fixed overhead for I/O is distributed over the entire
block of data. Figures 7.a, 7.b, and 7.c provide a summary of the
performance of the SEND and RECV routines in version 2.1X of PASLIB.
The total I/O time is given on a cost per word basis in figures 8.a
and 8.b. These graphs provide a basis for the optimization of
algorithms by allowing FEM applications programmers to choose the
optimal I/O mode and medium for any given buffer size. For example,
the cost per word of transmitting 20-word buffers (10 single precision
real numbers) on the global bus in synchronous mode is only a third of
the cost per word for 2-word buffers. By selecting the appropriate
communication link and mode, and by taking advantage of blocking, the
cost of I/O can be substantially reduced. 
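The effect of blocking on the fixed overhead can be illustrated with a small Python sketch. It uses only the procedure call costs from Tables 1 and 2 for a synchronous transfer (reading the SEND fixed overhead as 0.7123 ms) and leaves out the interrupt processing time, so it will not reproduce the totals plotted in figures 7 and 8; the function name is hypothetical.

```python
# Procedure call costs in milliseconds, from Tables 1 and 2.
SEND_FIXED, SEND_PER_WORD = 0.7123, 0.0227
RECV_SYNC_FIXED, RECV_SYNC_PER_WORD = 0.3492, 0.0548


def call_cost_per_word_ms(words):
    """SEND plus synchronous RECV call cost per word for one buffer.

    Interrupt processing time is deliberately excluded, so this is a
    lower bound on the true cost per word.
    """
    if not 1 <= words <= 255:
        raise ValueError("buffers hold 1 to 255 words")
    fixed = SEND_FIXED + RECV_SYNC_FIXED
    per_word = SEND_PER_WORD + RECV_SYNC_PER_WORD
    return fixed / words + per_word
```

Even in this partial model the fixed overhead dominates small buffers: the call cost per word falls from about 0.61 ms for a 2-word buffer to about 0.13 ms for a 20-word buffer.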
Another factor directly impacting the cost of I/O is the 
connectivity of the Array. The ability to transmit data directly to 
distant processors significantly increases I/O performance on 
algorithms requiring such transfers. Let us consider the above 
summation problem on two possible switching networks. Assuming that 
data is initially distributed across thirty-two processors (one value 
per processor), the basic algorithm is to pair values and sum the
pairs. This is repeated until one processor contains the sum of all 
values. Using a nearest neighbor (strictly left-right) connectivity 
the summation would require n-1 parallel shifts (transfers) and log2 n
parallel additions at a total cost of 107,775 us. This same algorithm
executed on the Array using a nearest neighbor perfect shuffle
connectivity would still require log2 n additions, but the transfers
are reduced to log2 n for a total cost of only 19,375 us. Obviously,
the ability to transfer to any processor in the Array offers a large
decrease in execution time (more than fivefold in this example) for
algorithms requiring long distance communication.
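The two connectivity cases can be compared with a short Python sketch. The function names are hypothetical; the add and transfer times and the operation counts come from the text, and n is assumed to be a power of two.

```python
ADD_US = 475        # single precision add, from the text
TRANSFER_US = 3400  # total I/O time for a two-word transfer, from the text


def log2_exact(n):
    """Return log2(n) for a power of two, using integer arithmetic."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two and at least 2")
    return n.bit_length() - 1


def left_right_us(n):
    # Strictly left-right nearest neighbor links:
    # n-1 parallel shifts and log2 n parallel additions.
    return (n - 1) * TRANSFER_US + log2_exact(n) * ADD_US


def perfect_shuffle_us(n):
    # Nearest neighbor perfect shuffle links:
    # log2 n transfers and log2 n parallel additions.
    steps = log2_exact(n)
    return steps * (TRANSFER_US + ADD_US)
```

For n = 32 this yields 107,775 us and 19,375 us, a ratio of roughly 5.6 to 1, and the gap grows with n because the left-right transfer count is linear in n while the shuffle count is logarithmic.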
Many algorithms requiring long distance communication can be
implemented by properly configuring the local links to provide the
necessary communications paths. In cases where this is not possible,
the global bus can provide the point to point communications required
across the Array. The only restriction on using the global bus in this
capacity is the potential for bus contention when many processors are
attempting to transfer at once. The timing results of this study
indicate that bus contention will never be an issue on the current
implementation of the 36 processor FEM. In fact, the global bus
bandwidth is sufficient to support all 36 processors simultaneously
transmitting at their maximum rate on the global bus. The rate at 
which processors can place data on the global bus is governed by the 
send interrupt routine. The fastest interrupt time per word, 108.5 us, 
was for asynchronous global bus transmissions of 255 words. However, 
this time includes the overhead for repeated calls to the send 
interrupt routine. To determine the maximum rate at which any 
processor could load data onto the global bus, it was necessary to 
examine the code contained in the inner loop of the send interrupt 
routine. This code was found to take approximately 55 us per word 
which for 36 processors gives a maximum rate of one word every 1.53
us. Since the global bus hardware is capable of transferring one 16
bit data word and its associated identifiers every 800 ns, there is no
possibility of a bus conflict in the current version of Nodal Exec. 
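The contention argument reduces to comparing two intervals, which can be sketched in Python as follows. The function names are hypothetical; the 55 us inner loop time and the 800 ns bus transfer time are the figures given in the text, and the model assumes every processor loads words at its maximum rate with no other overhead.

```python
SEND_LOOP_NS = 55_000  # inner loop of the send interrupt, per word
BUS_WORD_NS = 800      # global bus time to transfer one word


def aggregate_interval_ns(processors):
    """Average time between words offered to the bus by all processors."""
    if processors <= 0:
        raise ValueError("processors must be positive")
    return SEND_LOOP_NS / processors


def bus_saturated(processors):
    """True if the processors can offer words faster than the bus moves them."""
    return aggregate_interval_ns(processors) < BUS_WORD_NS


def max_processors_without_saturation():
    """Largest processor count the bus can serve at the full software rate."""
    return SEND_LOOP_NS // BUS_WORD_NS
```

With 36 processors the aggregate interval is about 1.53 us per word against 0.8 us per bus transfer, and under these assumptions the bus would not saturate until more than 68 processors were sending at full rate.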
7. CONCLUDING REMARKS 
The Finite Element Machine is an excellent testbed for the 
exploration of parallel architectures and algorithms. The fact that 
point-to-point communications are available for all processors in the 
Array with no potential for hardware contention on the global bus 
opens the door to a broad range of research algorithms and readily 
facilitates architecture simulation and modeling. 
The I/O performance data provided by this study identifies the 
sections of code in the Nodal Exec operating system and PASLIB which 
will benefit most from optimization in subsequent versions of system
software. Any optimization of these routines in future revisions of 
the system software will invalidate these timings. However, several 
concepts highlighted by this study will remain valid regardless of 
future I/O performance on FEM, and also apply to MIMD architectures in 
general. 
First, it is clear that whatever the ratio of I/O time to 
arithmetic on the Array, this ratio is critical to the efficiency of 
the algorithm. Decomposition of algorithms to maximize the parallelism 
in the problem is not a guarantee of efficiency and can actually cause 
a loss in performance over a sequential implementation of the same 
algorithm. Second, the I/O performance on the Array can be vastly 
improved by properly blocking the data to exploit the buffering 
provided by the hardware FIFOs and to distribute the fixed overhead 
over a number of words. In general, reducing the cost per word of I/O 
allows for greater distribution of arithmetic across the Array, which 
allows efficient exploitation of an algorithm's parallelism. 
REFERENCES 
1. Storaasli, O.O.; Peebles, S.W.; Crockett, T.W.; Knott, J.D.; and
Adams, L.: The Finite Element Machine: An Experiment in
Parallel Processing. NASA TM 84514, July 1982.
2. Jordan, Harry F., Ed.: The Finite Element Machine Programmer's
Reference Manual. Computer Systems Design Group, University
of Colorado, Boulder, 1979.
3. Stone, H. S.: Parallel Processing with the Perfect
Shuffle. IEEE Transactions on Computers, Vol. C-20, No. 2,
Feb. 1971, pp. 153-161.
4. Preparata, F. P.; and Vuillemin, J.: The Cube-Connected Cycles:
A Versatile Network for Parallel Computation. Communications
of the ACM, Vol. 24, No. 5, May 1981, pp. 300-309.
5. Knott, J.D.; and Crockett, T.W.: Fair Dynamic Arbitration for a
Multiprocessor Communications Bus. Computer Architecture
News, Vol. 10, No. 5, September 1982, pp. 4-9.
6. Hockney, R.W.; and Jesshope, C.R.: Parallel Computers. Adam
Hilger Ltd., Bristol, Great Britain, 1981.
7. Jordan, Harry F.; and Sawyer, Patricia L.: A
Multi-Microprocessor System for Finite Element Structural
Analysis. Trends in Computerized Structural Analysis and
Synthesis. A. K. Noor and H. G. McComb, Jr., Editors,
Pergamon Press, Oxford, 1978, pp. 21-29.
8. Adams, L.; and Ortega, J.: A Multi-Color SOR Method for Parallel
Computation. Proceedings of the 1982 International Conference
on Parallel Processing, August 1982, pp. 53-57.
FIGURES
(Plots not reproduced. Figures 3 through 8 plot time in milliseconds
against number of words. Figures 3 through 6 show the procedure call,
interrupt, and total times; figures 7 and 8 compare the local and
global, synchronous and asynchronous cases.)
Figure 1. FINITE ELEMENT MACHINE BLOCK DIAGRAM
Figure 2. EIGHT NEAREST NEIGHBOR TOPOLOGY
Figure 3.a. SEND LOCAL SYNCHRONOUS (PER SEND)
Figure 3.b. SEND LOCAL ASYNCHRONOUS (PER SEND)
Figure 3.c. SEND GLOBAL SYNCHRONOUS (PER SEND)
Figure 3.d. SEND GLOBAL ASYNCHRONOUS (PER SEND)
Figure 4.a. SEND LOCAL SYNCHRONOUS (PER WORD)
Figure 4.b. SEND LOCAL ASYNCHRONOUS (PER WORD)
Figure 4.c. SEND GLOBAL SYNCHRONOUS (PER WORD)
Figure 4.d. SEND GLOBAL ASYNCHRONOUS (PER WORD)
Figure 5.a. RECV LOCAL SYNCHRONOUS (PER RECV)
Figure 5.b. RECV LOCAL ASYNCHRONOUS (PER RECV)
Figure 5.c. RECV GLOBAL SYNCHRONOUS (PER RECV)
Figure 5.d. RECV GLOBAL ASYNCHRONOUS (PER RECV)
Figure 6.a. RECV LOCAL SYNCHRONOUS (PER WORD)
Figure 6.b. RECV LOCAL ASYNCHRONOUS (PER WORD)
Figure 6.c. RECV GLOBAL SYNCHRONOUS (PER WORD)
Figure 6.d. RECV GLOBAL ASYNCHRONOUS (PER WORD)
Figure 7.a. TOTAL SEND TIME (PER SEND)
Figure 7.b. TOTAL RECV TIME (PER RECV)
Figure 7.c. TOTAL I/O TIME (PER BUFFER)
Figure 8.a. TOTAL I/O TIME (PER WORD)
Figure 8.b. TOTAL I/O TIME (PER WORD)
1. Report No.: NASA CR-172205
4. Title and Subtitle: A Performance Analysis of the PASLIB Version 2.1X SEND
and RECV Routines on the Finite Element Machine
5. Report Date: August 1983
7. Author(s): J. D. Knott
9. Performing Organization Name and Address:
Kentron International, Inc.
Kentron Technical Center
3221 N. Armistead Ave.
Hampton, VA 23666
11. Contract or Grant No.: NAS1-16000
12. Sponsoring Agency Name and Address:
National Aeronautics and Space Administration
Washington, D.C. 20546
13. Type of Report and Period Covered: Contractor Report
505-37-13-01
15. Supplementary Notes:
Langley Technical Monitor: Dr. Olaf O. Storaasli
16. Abstract:
The Finite Element Machine is an experimental array processor designed to
support research in parallel algorithms and architectures. This report presents
a case study of communications using the SEND and RECV system software routines
on the Finite Element Machine, followed by a discussion of the effect of I/O
performance on the efficiency of parallel algorithms.
17. Key Words (Suggested by Author(s)):
parallel array processor, parallel architecture, Finite Element Machine
