Problems in characterizing barrier performance by Jordan, Harry F.
Problems in Characterizing
Barrier Performance
by
Harry F. Jordan
Computer Systems Design Group
Department of Electrical and Computer Engineering
University of Colorado
Boulder, CO 80309-0425
CSDG 88-3
October 1988

Introduction
The barrier is a synchronization among all executing processes, all of which
encounter a barrier construct at some point in their execution. The synchronization
requires that all processes execute the barrier construct before any process can proceed
past it to the next executable statement. It was introduced in connection with hardware
support for global synchronization in the Finite Element Machine [1] and has since been
used in various parallel languages [2], [3], [4] and incorporated in parallel language stan-
dards proposals [5], [6].
The barrier is usually used to satisfy a number of data dependences simultaneously
by imposing sequentiality on the production and use of data items. A common usage, as
suggested in the definition of a barrier appearing in Fig. 1, is to synchronize the produc-
tion and use of many parts of a complex data structure simultaneously, without dealing
with data items individually. The barrier is one of a class of synchronizations which can
be called "generic," to indicate that processes are not identified by name. The
synchronization condition is specified by the quantifier "all". An example of a
synchronization which is not generic is the rendezvous, in which two specifically named
processes wait for each other.

Barrier Synchronization
All processes must enter a Barrier statement before any process can complete it.

Process A            Process B            ...  Process P
Compute part of Q    Compute part of Q    ...  Compute part of Q
Barrier              Barrier              ...  Barrier
Use Q                Use Q                ...  Use Q

Figure 1: Use of the Barrier Synchronization

This work was supported in part by the Office of Naval Research under Grant No.
N00014-86-K-0204 and in part by NASA Langley Research Center under Contract
No. NAS 1-17070.

There are several variations on the semantics of the barrier. Perhaps the most
important is the way in which the set of processes quantified by "all" is defined. In some
systems, implicit knowledge of a parallel execution environment defines the set, while in
other systems a simple count of the number of processes to arrive before the barrier is
satisfied is used. In the Force language [2], the semantics is modified by including a
section of code between the beginning and end of a barrier construct. This code is executed
sequentially by one processor after all processors have arrived at the barrier and before
any process leaves it.
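
The pattern of Fig. 1, together with a Force-style sequential section, can be sketched in Python. This is only an illustration, not the Force implementation: it uses the standard library's threading.Barrier, whose optional action callable is run by one thread once all have arrived and before any is released. The function name run_barrier_phases and the arithmetic performed on Q are assumptions made for the example.

```python
import threading


def run_barrier_phases(num_procs):
    """Each thread computes part of Q, all meet at a barrier, then all use Q."""
    q = [0] * num_procs            # shared structure Q, one part per process
    totals = []                    # filled in by the barrier's sequential section
    results = [None] * num_procs

    def serial_section():
        # Run by exactly one thread after all have arrived at the barrier
        # and before any thread leaves it.
        totals.append(sum(q))

    barrier = threading.Barrier(num_procs, action=serial_section)

    def process(pid):
        q[pid] = (pid + 1) ** 2              # compute part of Q
        barrier.wait()                       # Barrier
        results[pid] = totals[0] - q[pid]    # use Q (needs every part)

    threads = [threading.Thread(target=process, args=(p,)) for p in range(num_procs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return q, totals, results


if __name__ == "__main__":
    print(run_barrier_phases(4))
```

Without the barrier, a fast process could read parts of Q that slower processes have not yet written, which is the data dependence the construct is meant to satisfy.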
Implementation of the Barrier
There are numerous ways of implementing barriers on existing multiprocessors, and
the alternatives have performance implications for specific architectures. A discussion of
the linear versus the logarithmic organization of barrier implementations was given by
Axelrod [7] while a broader set of barrier implementations was studied and measured in
[8]. A number of the choices made in implementing barriers are summarized in Fig. 2.
For simplicity, they are given as simple dichotomies which can be combined in various
ways to yield a number of very different implementations with potentially different per-
formance, depending on the underlying multiprocessor architecture.
The discussion will be focused primarily on shared memory multiprocessors, since
the performance measurements to be studied were selected from shared memory machine
examples. Most of the interprocess communication patterns for barriers can appear in
either shared memory or message passing systems. The only exception is the self-
scheduled updating of a shared arrival count, which is difficult in message passing
machines without prescheduling at least one process to maintain the count. In the
prescheduled case, processes report arrival at the barrier in some predetermined order,
while a self-scheduled barrier allows processes to execute their arrival reporting code in
any order. Since the barrier's purpose is to eliminate time skew between arriving
processes, the situation illustrated in Fig. 3 is the normal case for a self-scheduled barrier
using a critical section to update shared arrival variables. If the arrival were
prescheduled so that processes had to execute their critical sections in a fixed order, say
from left to right in the figure, then the arrival section of the barrier would take longer for
most arrival orders.
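
The cost of prescheduling the arrival phase can be illustrated with a small timing model. This is a sketch under simplifying assumptions that are not taken from the measurements: every critical section takes the same time cs_time, and a process enters its critical section as soon as it has arrived and the section is free. The names arrival_phase_end and count_slower_orders are hypothetical.

```python
from itertools import permutations


def arrival_phase_end(arrivals, cs_time, order=None):
    """Time at which the last arrival critical section completes.

    arrivals[i] is the time process i reaches the barrier.
    order is a fixed sequence of process indices (prescheduled arrival);
    None means self-scheduled, i.e. first come, first served.
    """
    if order is None:
        order = sorted(range(len(arrivals)), key=lambda i: arrivals[i])
    t = 0.0
    for i in order:
        # A process cannot start before it arrives or before the
        # previous critical section in the sequence has finished.
        t = max(t, arrivals[i]) + cs_time
    return t


def count_slower_orders(arrivals, cs_time):
    """Count fixed orders that finish later than the self-scheduled case."""
    best = arrival_phase_end(arrivals, cs_time)
    orders = list(permutations(range(len(arrivals))))
    slower = sum(1 for o in orders if arrival_phase_end(arrivals, cs_time, o) > best)
    return slower, len(orders)


if __name__ == "__main__":
    skewed = [5.0, 1.0, 3.0, 0.0]
    print(arrival_phase_end(skewed, 1.0))                      # self-scheduled
    print(arrival_phase_end(skewed, 1.0, order=[0, 1, 2, 3]))  # prescheduled
    print(count_slower_orders(skewed, 1.0))
```

In this model the self-scheduled barrier finishes one critical section after the last arrival, since the earlier critical sections are hidden behind the arrival skew, while a fixed order can force early arrivals to wait for a late predecessor.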
Shared memory --- Message passing
Prescheduled --- Self-scheduled
Master-slave --- Symmetric
Test&set --- Read/write
Logarithmic structure --- Linear structure
Distributed exit --- Broadcast exit
Figure 2: Alternatives for Barrier Implementation
[Figure 3 diagram not reproduced. Axes: Processes (horizontal), Time (vertical).
Legend: per process work; critical section.]
Figure 3: A Self-Scheduled Barrier with Skewed Process Arrival
The master-slave barrier structure is often associated with prescheduling, but it is
perfectly possible to have the first process arriving at the barrier become the master and
execute code distinct from that executed by the slaves. The characteristic of a symmetric
implementation is that all processes execute the same synchronization code, which
differs only in that the indices of certain synchronization variables, or the destinations of
some synchronization messages may be computed from the fixed identity assigned to the
process. The main advantage of the master-slave structure is that communications
between the master and the slaves can be arranged so that any one synchronization
variable is written by only one process. This allows the use of memory cells for which only
read and write are indivisible. Symmetric implementations require something like
test&set to indivisibly test and update a synchronization variable. For machines with
hardware support for synchronization, the difference is small, but it can become large if
all synchronizations are mediated by the operating system.
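
A master-slave barrier in which each synchronization variable has a single writer might look like the following. It is a minimal sketch, not one of the measured implementations: the class name MasterSlaveBarrier, the epoch counter passed by the caller, and the choice of process 0 as a prescheduled master are all assumptions, and the code relies on simple loads and stores of list elements and attributes being indivisible in CPython. Waiting is a busy wait that yields with time.sleep(0).

```python
import threading
import time


class MasterSlaveBarrier:
    """Linear master-slave barrier with broadcast exit, read/write only."""

    def __init__(self, num_procs):
        self.num_procs = num_procs
        self.arrived = [0] * num_procs   # arrived[i] is written only by process i
        self.release = 0                 # written only by the master (process 0)

    def wait(self, pid, epoch):
        """Block until all processes have called wait() with this epoch."""
        if pid == 0:
            # Master: read each slave's flag, then broadcast the release.
            for i in range(1, self.num_procs):
                while self.arrived[i] < epoch:
                    time.sleep(0)
            self.release = epoch
        else:
            # Slave: report arrival, then wait for the master's release.
            self.arrived[pid] = epoch
            while self.release < epoch:
                time.sleep(0)


def run_master_slave_demo(num_procs, num_phases):
    """Run several barrier-separated phases and return the order of events."""
    barrier = MasterSlaveBarrier(num_procs)
    log = []

    def process(pid):
        for epoch in range(1, num_phases + 1):
            log.append(epoch)            # work belonging to this phase
            barrier.wait(pid, epoch)

    threads = [threading.Thread(target=process, args=(p,)) for p in range(num_procs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


if __name__ == "__main__":
    print(run_master_slave_demo(4, 3))
```

No test&set or lock appears anywhere, at the price of a distinguished master whose code differs from that of the slaves.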
The most clear cut choice in an implementation is whether to use a linear or loga-
rithmic pattern of communication among the processes. For systems with many physical
processors, the logarithmic organization makes far better use of the system parallelism.
In systems with only a few physical processors, the slightly higher computational com-
plexity of the logarithmic structure may mean that the linear barrier can be more
efficient. Depending on the nature of a machine's synchronization support, 16 processors
is usually enough to make the logarithmic barrier a better choice. Figure 4 shows two
examples of logarithmic barriers. The first is the double tree barrier, which has
distributed, master-slave arrival and exit phases. The second is the butterfly barrier, which is
symmetric and self-scheduled, and whose arrival and exit phases are not distinct.
The choice of broadcast versus distributed exit is also dependent on the particular
machine hardware. If the machine supports efficient broadcast from one source to
multiple destinations, say by having one process write a variable which is read by many
others, then the exit phase of a barrier may use this capability. If the barrier arrival code
distinguishes a unique process, either the master or the last to arrive, this processor may
broadcast a release to all other processes. For example, in the double tree barrier of Fig.
4, the inverted exit tree may be replaced by a one level broadcast from the master. Of
course, if some distributed mechanism is used by the system to support the broadcast,
there may be no performance gain.
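
The communication pattern of a butterfly barrier can be checked with a few lines of Python. The sketch assumes the usual pairwise exchange form, in which process i synchronizes with process i XOR 2^k in round k, and a process count that is a power of two; the text above does not spell the pattern out, so both are assumptions, as are the function names.

```python
def butterfly_rounds(num_procs):
    """Partner of every process in each round of a butterfly barrier."""
    if num_procs < 1 or num_procs & (num_procs - 1):
        raise ValueError("this sketch assumes a power-of-two process count")
    rounds = []
    stride = 1
    while stride < num_procs:
        # In this round, process i pairs with the process whose index
        # differs from i in exactly one bit.
        rounds.append([i ^ stride for i in range(num_procs)])
        stride <<= 1
    return rounds


def arrivals_known(num_procs):
    """Set of processes whose arrival each process has learned of."""
    known = [{i} for i in range(num_procs)]
    for partners in butterfly_rounds(num_procs):
        # Each pairwise synchronization merges what the two partners know.
        known = [known[i] | known[partners[i]] for i in range(num_procs)]
    return known


if __name__ == "__main__":
    for k, partners in enumerate(butterfly_rounds(8)):
        print("round", k, partners)
    print(all(s == set(range(8)) for s in arrivals_known(8)))
```

After log2(P) rounds every process has learned, directly or indirectly, of the arrival of all the others, which is why no separate exit phase is needed.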
Accounting for Barrier Performance
There are several different influences on time accounting in measurements of barrier
performance. The major ones are:
  delays in arrival of processes;
  the waiting mechanism;
  the code of the barrier implementation;
  synchronization delays, i.e. critical sections;
  and swapped processes, perhaps due to interrupts.

[Figure 4 diagrams not reproduced. Two panels, "Double tree barrier" and "Butterfly
barrier", each with axes Processes (horizontal) and Time (vertical).]
Figure 4: Two Examples of Logarithmic Barriers

The purpose of the barrier is to synchronize processes arriving at different times, so the
ideal performance is to wait for the latest arriving process. In addition to the ideal
behavior, a differential delay among the arriving processes can mask some of the barrier
code execution. This masking is most effective for a self-scheduled barrier which uses a
critical section to update a shared count of arriving processes. If a prescheduled,
master-slave barrier implementation is used, then the order of arrival of processes makes a large
difference in the amount of barrier code which can be masked by the arrival time
differential. Thus, in measuring the performance of such barriers, either a fixed arrival
order must be specified and guaranteed, or a sufficiently large sample of random arrival
orders must be taken to obtain average performance. The arrival order for
self-scheduled, symmetric barriers is irrelevant, and no measurement precautions need be
taken with respect to arrival order.
Once processes start arriving at the barrier, the early arrivals must be caused to wait
for the later ones. The waiting mechanism may be busy waiting, virtualization of
processes (swapping them out), or something intermediate between the two. Busy
waiting wastes processor cycles during long delays, while virtualization is associated with an
irreducible minimum overhead, which is often quite long. Intermediate positions are
possible in systems which support a lightweight process model or by giving up the
processor only after an initial busy waiting period, the duration of which is determined by
the swapping overhead of the particular system. Our interest has been primarily in
tightly coupled parallel scientific codes, so the use of process virtualization for waiting
has been avoided wherever the underlying system has allowed.
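
The intermediate waiting strategy, busy waiting for a bounded period and only then giving up the processor, can be sketched as follows. The function name spin_then_block and the use of a threading.Event as the release signal are assumptions for illustration; in a real system the spin limit would be chosen from the measured swapping overhead.

```python
import threading
import time


def spin_then_block(event, spin_seconds):
    """Wait for event: busy wait up to spin_seconds, then block.

    Returns "spin" if the event was seen during the busy waiting period
    and "block" if the caller had to give up the processor.
    """
    deadline = time.perf_counter() + spin_seconds
    while time.perf_counter() < deadline:
        if event.is_set():        # cheap poll, no system involvement
            return "spin"
    event.wait()                  # long delay: let the system suspend us
    return "block"


if __name__ == "__main__":
    ready = threading.Event()
    ready.set()
    print(spin_then_block(ready, 0.01))     # already released

    late = threading.Event()
    threading.Timer(0.05, late.set).start()
    print(spin_then_block(late, 0.001))     # release comes after the spin limit
```

Short waits are then served at busy waiting cost, while a long wait costs at most the spin period plus the suspension overhead.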
The actual code executed by processes in a specific barrier implementation depends
on the structure of the barrier implementation chosen, as well as on the system primitives
used to implement it. The major implementation influence is the choice of linear or
logarithmic barrier organization. The logarithmic organization requires somewhat more
code. This can make it slightly slower than a linear barrier when used with only a few
processes, but is more than compensated for by the increased parallelism possible in exe-
cuting the code for even a modest number of processes. Much more important than the
amount of code is the nature of the synchronization primitives on which the barrier is
built. For tightly coupled parallel processing, it is important to avoid operating system
overhead wherever possible. Hardware locks used to support critical sections or to
implement interprocess synchronization directly are one good choice. Another is to use a
master-slave implementation based on shared variables capable of atomic read and write.
The time processes spend waiting for synchronization messages from other processes or
for a critical section to be released is a major factor in barrier time accounting.
Instrumentation for Barrier Measurement
The most important instrument for measuring performance in a parallel system is a
low overhead timer. Timers are also important in measuring sequential systems, of
course, but their effect is much easier to subtract out of a strictly serial execution history.
In considering the structure of timers, there are two main aspects to take into account: the
timer update mechanism and the timer sampling mechanism. The time may be updated
by a mechanism which is completely transparent to the processes involved in the parallel
computation. This is usually done by a hardware mechanism, but it is possible to allo-
cate an independent process to do a transparent software update in some systems. The
timer update can involve processor cycle stealing, in which case it is nearly, but not
entirely transparent. It can also be done by a periodic interrupt, which usually performs
other periodic operating systems functions in addition to updating a timer. Timer
sampling can either be done by simply reading a shared variable, or it may require the more
substantial overhead of a call to the run-time or operating system.
Another important aspect of timers for multiprocessors is whether there is a single
system clock, which is accessible to all processors, or whether each processor maintains a
separate hardware or software timer. Having separate timers for each processor elim-
inates competition when many processors try to sample the time simultaneously, but a
single timer gives a more coherent measure across the processors. If it is important to
keep track of the distinction between system time and user time on a per processor basis,
then a timer for each processor may well be a natural choice. Systems range from having
a single hardware timer which is readable by any processor as a shared memory location,
as in the Encore Multimax, to having a software timer per processor, which is updated by
a periodic kernel interrupt, as in the Flexible Computer Systems Flex/32.
An important parameter for a software timer is the uncertainty introduced by the
time spent in the timer interrupt handler. This uncertainty can be expressed as
$$\Delta H = \frac{\text{Timer interrupt service time}}{\text{Interrupt interval}}.$$
Since the timer interrupt may support periodic operating system functions of different
frequencies, ΔH may vary, so that it may be appropriate to use an average value.
There are several aspects of barrier performance which may be measured. Most
obvious is the effect of an ideal barrier, one with no overhead, on the behavior of a sec-
tion of a parallel program. In a simple case, the effect of delaying all processes until the
last one arrives can be calculated analytically, but in more complex situations, especially
data dependent ones, it may be necessary to measure the effect. In addition to the ideal
synchronization performed by the barrier, there are synchronization delays introduced by
interprocessor synchronizing communications used to implement the barrier. A typical
example is the delay at a critical section protecting the update of shared arrival variables.
Finally, there are the processor cycles used to execute code associated with a particular
barrier implementation. If it is assumed that barriers are the right synchronization to use
for a particular parallel algorithm, then the important thing to measure is the difference
between the ideal barrier behavior and that which includes the synchronization and code
of the real implementation.
Given a specific parallel applications program, some simple probes of barrier
behavior are possible using only an elapsed time measurement. This assumes a system
dedicated to the applications program so that no substantial time is used for systems
functions or time multiplexed users during the course of program execution. If the flow
of program control is not altered by the violation of data dependences imposed by
barriers (only the answers are wrong), then a program can be run and timed both with and
without barriers to get a measure of their total effect. Another possibility is to change the
barrier implementation in a known way, say by doubling the processor cycles used in the
barrier code, and to measure the effect of this change on the overall execution time.
These methods are primarily useful to measure the influence of barrier synchronization
on a specific parallel program to determine whether it is important to try to find a less
costly synchronization method.
If measurements are made for the purpose of comparing different barrier implemen-
tations for the best performance, as was done in [8], then barriers should be measured
independently of surrounding code. It is not possible to separate barriers completely
from their execution environment as a result of the dependence of implementation over-
heads on skew and order of arrival times for different processes. A careful set of meas-
urements on barriers with such dependence will include different types of arrival loading.
A common arrival pattern occurs when two successive barriers are separated by a fixed
amount of computation which is the same for each process. The order of arrival at the
second barrier is then determined by the order of their release from the first one. A
configuration having two barriers, separated by random, and different, amounts of work
for each process, represents another useful measurement. Enough samples must be taken
to average both the latest arrival time and the effects of different arrival order, if any.
Another form of arrival loading which occurs commonly in practice corresponds to two
barriers separated by fixed length work which contains a critical section. The time skew
introduced by processes waiting to enter the critical section may influence barrier perfor-
mance. For example, a self-scheduled, linear barrier does an excellent job of masking
critical section skew.
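
A measurement of this kind, two barriers separated by a chosen amount of work per process, might be set up as below. This is a rough sketch rather than the instrumentation used in [8]: it times Python's library threading.Barrier, uses time.sleep as a stand-in for per process work, and samples time.perf_counter, which acts as a single clock shared by all threads. The function name measure_barrier_waits is hypothetical.

```python
import threading
import time


def measure_barrier_waits(work_times):
    """Return the time each process spends in the second of two barriers.

    work_times[i] is the amount of work, in seconds, that process i does
    between the two barriers.
    """
    num_procs = len(work_times)
    first = threading.Barrier(num_procs)     # aligns the starting times
    second = threading.Barrier(num_procs)    # the barrier being measured
    waits = [0.0] * num_procs

    def process(pid):
        first.wait()
        time.sleep(work_times[pid])          # stand-in for per process work
        t0 = time.perf_counter()
        second.wait()
        waits[pid] = time.perf_counter() - t0

    threads = [threading.Thread(target=process, args=(p,)) for p in range(num_procs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return waits


if __name__ == "__main__":
    for pid, w in enumerate(measure_barrier_waits([0.0, 0.05, 0.02])):
        print(pid, round(w, 4))
```

The wait recorded by the last process to arrive approximates the implementation overhead, while the waits of the earlier arrivals also contain the ideal delay imposed by the synchronization. Passing equal work times gives the fixed computation pattern, and drawing them at random over many runs gives the random loading case.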
The nature of the timers used to measure barrier performance is important, and in at
least one case, has a drastic effect on the reliability of the measurement. Since barriers
synchronize all processes, a single clock per system is most natural to their measurement.
This assumes, however, that the whole system is used for the measurement. In shared or
time multiplexed systems the situation is more complicated. A case presenting consider-
able difficulty is a dedicated system, but one which has one software timer per processor.
In addition, the timer interrupts in different processors are asynchronous. The effect of
the measurement uncertainty ΔH interacts with the synchronization function of the
barrier in an unpleasant way. A worst case situation can occur in which all processors are
interrupted while executing a barrier in a sequential order which causes the other
processors to wait on the return of each interrupted processor in order to complete the barrier.
Thus with P processors, the uncertainty of the measurement of barrier completion is not
just ΔH but can be as large as P × ΔH. An example of this situation will be reported in
the next section.
Examples
Barrier performance was one aspect of measurements performed on the Denelcor
HEP shared memory multiprocessor system [9]. The system involved in the measure-
ments consisted of four pipelined multiprocessor modules, called PEMs. Each PEM
could obtain a speed of 10 MIPS (or MFLOPS) by executing 12-15 processes in parallel.
The system could support up to 200 processes executing in parallel, but the pipelined
structure implied that improved performance could not result, even theoretically, for
more than about 64 processes. Each PEM was equipped with a hardware performance
monitor known as the System Performance Indicator, or SPI. The SPI kept track of clock
cycles of 10^-7 second along with numbers of completed instructions in several
categories. The instruction categories were floating point instructions, other register to
register instructions, memory reference instructions, and wave-offs (instructions which
could not be issued because of synchronization).
The barrier measurements made on the HEP system were with respect to a parallel
Gaussian elimination program which used barriers for its synchronization. The flow of
control in Gaussian elimination is not influenced by the correctness of the floating point
computations, which are the operations synchronized by the barriers. At most the time
for pivoting could be influenced, but this was not the case in this program. Figure 5
shows the execution rate in MFLOPS versus number of parallel processes for the Gaussian
elimination program with and without its synchronizing barriers. A summary of the results is
that the synchronized version obtained a maximum speed of 7.5 MFLOPS, corresponding
to 32 MIPS, while the program with barriers removed ran at 9.5 MFLOPS, or 40 MIPS,
which is the maximum speed for a four PEM HEP. The decrease in execution rate as the
number of processes increases beyond the point at which all pipelines are filled is a result
of process contention for shared synchronization variables in the barriers, and of the
increase in barrier complexity with number of processes.
[Figure 5 plot not reproduced. Vertical axis: MFLOPS, with ticks at 5 and 10; horizontal
axis: Number of Processes, 0 to 150. Two series: "+" Unsynchronized and "•" With barriers.]
Figure 5: Barriers in Gaussian Elimination on a 500 by 500 Matrix
Measurements were also done by making changes in the barrier implementation.
The initial implementation counted processes entering the barrier and blocked them with
a single shared memory locking variable until the last process had arrived. The compiled
code for this implementation amounted to about 20 instructions per process. Since four
PEMs could generate memory reference retries at four times the rate that a single memory
module could handle them, it was suspected that memory access congestion made this
implementation inefficient on the four PEM system. An alternate implementation of the
barrier, which suspended processes as they arrived and used the last one to restart them,
executed about 100 instructions per process but reduced execution time by 20%, verify-
ing the memory contention effect. It should be noted that process suspension in this
machine did not involve the operating system, but was possible with four or five user
instructions. A subsequent improvement in the process suspending barrier halved the
number of instructions executed per process and improved execution time by 11.8%.
This allowed the determination of the fraction of execution time occupied by barrier syn-
chronization. In the final Gaussian elimination program about 14% of the time is spent in
barrier synchronization. This illustrates how program specific changes in a well under-
stood code can be used to examine barrier performance.
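
One way to arrive at a fraction of this size from the halving experiment is shown below. The arithmetic is a reconstruction, not taken from the report: it assumes that barrier time is proportional to the number of barrier instructions executed and that the rest of the program is unaffected by the change. The function name is hypothetical.

```python
def barrier_fraction_from_halving(time_reduction):
    """Infer barrier time fractions from a halving experiment.

    time_reduction is the fractional drop in total execution time seen
    when the barrier cost per process is halved. Returns the fraction of
    execution time spent in barriers before and after the change.
    """
    # Halving the barrier cost removes half of the old barrier time, so
    # the removed time equals the barrier time that remains afterwards.
    remaining = time_reduction           # as a fraction of the old total
    before = 2.0 * time_reduction        # old barrier time / old total
    after = remaining / (1.0 - time_reduction)   # new barrier time / new total
    return before, after


if __name__ == "__main__":
    before, after = barrier_fraction_from_halving(0.118)
    print(round(before, 3), round(after, 3))
```

Under these assumptions an 11.8% improvement gives a barrier share of roughly 13% of the final execution time, close to the figure of about 14% quoted above.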
An extensive set of measurements was done by Arenstorf [8] on the Flexible Com-
puter Corporation's Flex/32 running under the MMOS [10] operating system. This sys-
tem consists of 20 single board microprocessors associated with a combination of shared
and private memories. The particular system used allocated two processors to running in
a single processor mode, leaving 18 processors available under MMOS. The
experiments were run with a fixed mapping of one process per processor with no
multiprogramming. The operating system is distributed over the processors, and in particular, has a
software timer per processor. The measurements compared several different implemen-
tations of the barrier for performance in different environments, but of most interest here
is the effect of the software timers running asynchronously on each processor.
The Flex/32 system measurements exhibit the problem, mentioned earlier, of
multiplying the measurement uncertainty ΔH due to the service time for timer interrupts by
the number P of processes. The standard configuration of the MMOS operating system
uses a 20 millisecond interrupt interval with a service time of 0.3 ms. The resulting
ΔH = 1.5% is acceptable for timing single stream phenomena, but using 18 processors
synchronized by barriers, it presents a significant problem. In the worst case sequence of
timer interrupts using all 18 processors, some processor is unavailable to satisfy the
barrier synchronization 27% of the time. The resulting uncertainty in barrier measurements
is not a result of inaccurately sampling the value of the time but of overheads involved in
updating the time. The problem of obtaining accurate times was solved by increasing the
timer interval to one second, thus reducing ΔH to 0.03% and the worst case influence on
barrier measurements to 0.54%. Of course, it was necessary to make very long timing
runs to reduce the effect of the one second accuracy of the time to an acceptable percen-
tage of the measurement.
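
The percentages quoted for the Flex/32 follow directly from the ratio of interrupt service time to interrupt interval, multiplied by the number of processors in the worst case. A short check, with a hypothetical function name:

```python
def timer_uncertainty(service_ms, interval_ms, num_procs):
    """Return the per processor uncertainty and its worst case multiple.

    The first value is the fraction of time one processor spends in the
    timer interrupt handler; the second is that fraction multiplied by
    the number of processors, the worst case for a barrier.
    """
    per_processor = service_ms / interval_ms
    worst_case = num_procs * per_processor
    return per_processor, worst_case


if __name__ == "__main__":
    # Standard MMOS configuration: 0.3 ms service every 20 ms, 18 processors.
    print(timer_uncertainty(0.3, 20.0, 18))
    # Modified configuration: timer interval lengthened to one second.
    print(timer_uncertainty(0.3, 1000.0, 18))
```

The worst case is an upper bound: it assumes that the 18 timer interrupts never overlap and that each one delays completion of the barrier.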
A final measurement of barrier performance is of interest in demonstrating an
extreme case of the effect of all processors not being available to execute the barrier
simultaneously. In this case, it is not timer interrupts which occupy the processors but
multiprogramming activity. The measurements were performed by Benten [11] on the
Sequent Balance 8000, a bus connected, shared memory multiprocessor. This system
had eight processors, organized two per board sharing a cache memory. All memory is
shared and is connected to the bus, so that it is uniformly accessible to all processors.
The version of the system to which we had access ran a Unix style operating system
which was highly multiprogrammed and treated processors as a schedulable resource.
We had no way of locking processes to processors other than ensuring that there was no
other load on the system, and no reason in our own program to time multiplex processors.
A barrier synchronized Gaussian elimination program, similar to the one measured
on the HEP, was run for varying numbers of processes. The character of the results is not
surprising, but the magnitude is interesting. As shown in Fig. 6, when the program was
run on an unloaded system, normal speedup results were observed until multiprogram-
ming was forced to occur by virtue of the number of processes exceeding the number of
physical processors. With nine processes time multiplexed on eight processors, the exe-
cution time is significantly longer than that for one process. This reflects the fact that all
processes must be coscheduled for a barrier which is implemented independently of the
operating system to perform reasonably. An alternative to the barrier must be used for
synchronizing parallel programs unless the operating system's scheduler can cooperate
with the barrier construct. Figure 7 shows that the presence of multiprogramming load
on the system merely causes the effect to be seen for fewer processes and to increase in
magnitude.
[Figure 6 plot not reproduced. Vertical axis: Seconds, 0 to 150; horizontal axis:
Processes, 0 to 10.]
Figure 6: Effect of Lack of Coscheduling, Single Program
[Figure 7 plot not reproduced. Vertical axis: Seconds, 0 to 500; horizontal axis:
Processes, 0 to 10.]
Figure 7: Effect of Lack of Coscheduling, with Multiprogramming Load
Conclusions
The barrier is a convenient synchronization mechanism in multiprocessors, espe-
cially of the shared memory type. Measuring the performance of barriers may be used to
determine their effect on the performance of a parallel program, in order to optimize the
program, and to choose the best method and arrangement of synchronization.
Measurements can also help select among different types of barrier implementation, to suit
the machine environment. The measurement of barrier performance offers some unique
difficulties, which result from the global nature of the synchronization. All processes
must be available for the barrier to complete, and any virtualization of processes will
have a noticeable effect. Another problem is to separate overhead due to a specific bar-
rier implementation from the waiting time imposed purposely on processes by the nature
of the synchronization. Hardware or software support for measuring the waiting time of
processes directly would significantly aid barrier measurement. In the work done to date,
waiting times must be inferred from several indirect time measurements.
References
[1] H. Jordan, "A Special Purpose Architecture for Finite Element Analysis," Proc.
1978 Int'l Conf. on Parallel Processing, IEEE Computer Society Press (1978) pp.
263-266.
[2] H. Jordan, "The Force," in The Characteristics of Parallel Algorithms, L. Jamieson,
D. Gannon and R. Douglass, Eds., Chapter 16, MIT Press, Cambridge, MA (1987).
[3] E. Lusk and R. Overbeek, "Use of Monitors in Fortran: A Tutorial on the Barrier,
Self-scheduling Do Loop and Askfor Monitors," Argonne National Laboratory
Report No. ANL-84-51, Argonne, IL (1985).
[4] A. Osterhaug, Guide to Parallel Programming on Sequent Computer Systems,
Sequent Computer Systems, Inc., Beaverton, OR (1985).
[5] P. Frederickson, R. Jones and B. Smith, "Synchronization and Control of Parallel
Algorithms," Parallel Computing, V. 2, No. 3 (1986) pp. 265-254.
[6] M. Furtney, R. Kuhn, B. Leasure and E. Plachy, "PCF Fortran: Language
Definition," Parallel Computing Forum, Kuck & Associates, 1906 Fox Drive,
Champaign, IL 61820, Version 1 (Aug. 16, 1988).
[7] T. Axelrod, "Effects of Synchronization Barriers on Multiprocessor Performance,"
Parallel Computing, V. 3, No. 2 (1986) pp. 129-140.
[8] N. S. Arenstorf and H. F. Jordan, "Comparing Barrier Algorithms," ICASE Rept.
No. 87-65, NASA Langley Res. Ctr., Hampton, VA, Sept. 1987, to appear in
Parallel Computing.
[9] H. F. Jordan, "Performance and Program Structure in a Large Shared Memory
Multiprocessor," in New Computing Environments: Parallel, Vector and Systolic,
Arthur Wouk, Ed., pp. 201-217, SIAM, Philadelphia, PA, 1986.
[10] Multicomputing multitasking operating system (MMOS) reference manual, Flexible
Computer Corporation, Dallas, TX (1986).
[11] M. S. Benten and H. F. Jordan, "Multiprogramming and the Performance of Parallel
Programs," Proc. 3rd SIAM Conf. on Parallel Processing for Scientific Computing,
Los Angeles, CA (Dec. 1987).

Bibliographic Data Sheet
Report No.: ECE Tech. Rept. 88-1-4
Title and Subtitle: Problems in Characterizing Barrier Performance
Report Date: October 1988
Author(s): Harry F. Jordan
Performing Organization Rept. No.: CSDG 88-3
Performing Organization Name and Address: Computer Systems Design Group,
Department of Electrical and Computer Engineering, University of Colorado,
Boulder, CO 80309-0425
Contract/Grant No.: ONR N00014-86-K-0204
Sponsoring Organization Name and Address: Office of Naval Research,
800 N. Quincy Street, Arlington, VA 22217-5000
Type of Report & Period Covered: Interim
Supplementary Notes: Also supported in part by NASA Langley Research Center under NASA
Contract Number NAS1-17070
Abstract
The barrier is a synchronization construct which is useful in separating a parallel
program into parallel sections which are executed in sequence. The completion of a bar-
rier requires cooperation among all executing processes. This requirement not only
introduces the "wait for the slowest process" delay which is inherent in the definition of
the synchronization, but also has implications for the efficient implementation and meas-
urement of barrier performance in different systems.
Types of barrier implementation and their relationship to different multiprocessor
environments are described. Then the problem of measuring the performance of barrier
implementations on specific machine architectures is discussed. The fact that the barrier
synchronization requires the cooperation of all processes makes the problem of perfor-
mance measurement similarly global. Making non-intrusive measurements of sufficient
accuracy can be tricky on systems offering only rudimentary measurement tools.
Key Words: Multiprocessors, Synchronization, Performance Measurement, Barriers
Security Class (This Report): Unclassified
Security Class (This Page): Unclassified
No. of Pages: 12

