Performance Analysis of Hardware Barrier Synchronization by O'Keefe, M. & Dietz, H.
Purdue University
Purdue e-Pubs
Department of Electrical and Computer
Engineering Technical Reports
Department of Electrical and Computer
Engineering
8-1-1989






O'Keefe, M. and Dietz, H., "Performance Analysis of Hardware Barrier Synchronization" (1989). Department of Electrical and















School of Electrical Engineering
Purdue University
West Lafayette, Indiana 47907







M. O'Keefe and H. Dietz
August 1989
Performance Analysis of Hardware Barrier Synchronization
ABSTRACT
Synchronization among cooperating processors is a critical issue in the performance of high-speed multiprocessors. For current Multiple Instruction stream Multiple Data stream (MIMD) computers, synchronization cost is high. Hence, these architectures can execute only large-granularity parallelism efficiently. In this report we study a new hardware synchronization technique, known as a hardware barrier. Machines using this technique are known as barrier MIMDs. Analytic and simulation studies are employed to show that hardware barrier synchronization can outperform the more common directed synchronization techniques. Barrier synchronization can be viewed as a static synchronization mechanism similar to the implicit synchronization of Very Long Instruction Word (VLIW) architectures. We study two previously developed variations of hardware barrier synchronization, static and dynamic, and suggest a new hybrid approach.
Performance of Barriers
1. Introduction
A barrier is a synchronization point. A processor typically performs the following three steps at a barrier:
[1] Marks itself as present at the barrier.
[2] Waits for all other participating processors to arrive at the barrier.
[3] Proceeds along with the other participating processors past the barrier.
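In software, the three steps above are often built from a shared counter protected by mutual exclusion. The sketch below is one such construction in Python, given only to make the steps concrete; the class name `CountingBarrier` and the demo function `run_demo` are ours, and nothing here models the hardware barrier studied in this report.

```python
import threading


class CountingBarrier:
    """Reusable software barrier built from a counter and a condition variable."""

    def __init__(self, parties):
        self.parties = parties
        self.count = 0
        self.generation = 0
        self.cond = threading.Condition()

    def wait(self):
        with self.cond:                       # every arrival takes the same lock
            gen = self.generation
            self.count += 1                   # step 1: mark as present
            if self.count == self.parties:    # last arrival releases everyone
                self.count = 0
                self.generation += 1
                self.cond.notify_all()
            else:
                while gen == self.generation:  # step 2: wait for the others
                    self.cond.wait()
        # step 3: returning from wait() means proceeding past the barrier


def run_demo(parties=4):
    """Run `parties` threads through one barrier and return the event log."""
    barrier = CountingBarrier(parties)
    log = []
    log_lock = threading.Lock()

    def worker(pid):
        with log_lock:
            log.append(("before", pid))
        barrier.wait()
        with log_lock:
            log.append(("after", pid))

    threads = [threading.Thread(target=worker, args=(p,)) for p in range(parties)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

In the returned log every "before" entry precedes every "after" entry. Note that each arrival has to acquire the same lock, so arrivals are handled one at a time.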
Barrier synchronization has commonly been considered a software technique to provide sequence control, ensuring that events happen in the proper order. For example, a DOALL loop (all iterations may execute in parallel) requires that all iterations synchronize when completed: execution proceeds past the DOALL loop only after all iterations are complete. This can be implemented using fork and join, but the overhead of spawning and killing processes is high using this approach. Counting semaphores can also be used, although the serialization caused by the mutual exclusion necessary for the semaphores also wastes additional cycles [Axe86],[HeF88]. In addition, the processors are only "approximately" synchronized when using other primitives to implement barriers.
Barrier MIMDs are asynchronous Multiple Instruction stream Multiple Data stream architectures that employ a fast hardware synchronization mechanism known as a hardware barrier. They can execute loops, subprograms, and variable-execution-time code in parallel like any MIMD architecture. In addition, the hardware barrier synchronization mechanism is fast and efficient, reducing the execution-time cost of synchronization between processes. In this respect, barrier MIMD architectures are similar to Single Instruction stream Multiple Data stream (SIMD) architectures; the execution-time cost of synchronization is essentially zero, and synchronization is implemented statically at compile-time.
The similarities between barrier MIMDs and statically scheduled architectures such as SIMD and VLIW are explored in Dietz and Schwederski [DiS87]. A contiguous spectrum of properties between SIMD and MIMD is also given there. We will briefly summarize these properties. The number of simultaneous operations specifies how many different operations may be performed on a machine with N processors; the larger this number, the more varied the parallelism that may be executed on a machine. The number of control flow threads is the number of independent program counters in a machine of width N. The relative time synchronization error specifies the time error known to the compiler between two instructions executing on different processors. In SIMD and VLIW execution, this error is very small, enabling static scheduling of instructions without synchronization overhead. A barrier MIMD also has this property, and it can be instruction scheduled with good efficiency. A barrier can always be inserted to reduce the relative timing error among processors to zero.
An important feature which distinguishes two forms of barrier MIMD is the number of synchronization control flow threads, which specifies how many synchronization operations are candidates for current execution. This is the key difference between Static Barrier MIMD (SBM) and Dynamic Barrier MIMD (DBM). SBM barrier execution is characterized by a total ordering imposed on the execution of the barriers that in general will not correspond to the actual ordering that occurs during program execution. Hence, barriers ready to execute may have to wait for other barriers. The hardware mechanism used in Static Barrier MIMD is a queue; barriers are loaded into the queue according to the total ordering of barriers determined at compile-time. Dynamic Barrier MIMD allows the execution of a barrier immediately after the processors associated with the barrier reach it. This requires an associative search of the memory containing the current barriers, and a complex loading mechanism for the associative memory. Hence, the additional performance of the DBM appears to come at significantly higher cost. These issues are considered in more detail in Dietz [DiS87]. This report discusses techniques to reduce the delays introduced by Static Barrier MIMD. Both analytic and simulation studies are employed to determine the utility of these techniques.
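The difference between the two firing rules can be made concrete with a small timing model. The sketch below is a simplification of our own, not a description of the hardware: it assumes each barrier has a single "ready" time (the moment its last processor arrives), that the barriers are unordered and span disjoint processors, and that firing itself takes zero time. The function names are illustrative.

```python
def dbm_fire_times(ready):
    """Dynamic barrier model: each barrier fires as soon as it is ready."""
    return list(ready)


def sbm_fire_times(ready):
    """Static barrier model: list order is queue order.

    A barrier can fire only once it is ready AND every barrier ahead of it
    in the queue has fired, so its fire time is a running maximum.
    """
    fired = []
    latest = float("-inf")
    for t in ready:
        latest = max(latest, t)
        fired.append(latest)
    return fired


def queue_waits(ready):
    """Extra time each barrier spends waiting solely because of queue order."""
    return [f - r for f, r in zip(sbm_fire_times(ready), ready)]
```

For example, with ready times `[5.0, 3.0, 4.0]` in queue order, the static model fires all three barriers at time 5.0, so the second and third barriers wait 2.0 and 1.0 time units that the dynamic model would not impose. If the queue order matches the ready order, the waits are all zero.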
The last feature considered is whether the synchronization primitives are directed or undirected¹. Directed synchronization is more efficient in that it allows one processor to continue execution after it performs the synchronization action; the other processor(s) must wait for the action. Thus, if A is to wait for B and B arrives before A does, B is allowed to continue immediately. Undirected synchronization, such as the hardware barrier discussed in this report, causes all involved processes to wait. Although it is clearly less efficient than directed synchronization, this report will show that under a variety of assumptions the difference is not significant.
The key point made in Dietz et al. [DiS87] concerning barrier machines is that they can synchronize all processors at the clock-cycle level, and this information can be used to satisfy conceptual synchronizations through static code scheduling. In this view, barriers are a mechanism to reduce the relative-time synchronization error between processes rather than purely a synchronization technique.
This report examines the performance of the two barrier MIMD architectures proposed in [DiS87]. Both analytic and simulation results are discussed. A barrier simulator has been programmed and executed to gather data on the performance of the barrier architectures compared to directed MIMD machines. It will be shown that the barrier architectures compare favorably to directed MIMD even under the worst case assumptions for the barrier machine performance. This is true independent of the additional benefits that can be realized with the barrier architectures through the appropriate compiler technology. In addition, we will show that the additional delays introduced in SBM execution due to the total ordering of barriers can be reduced through the appropriate static scheduling techniques and a hybrid approach that combines features of the static and dynamic barrier models.
¹ A directed synchronization is an operation whereby one processor is forced to wait for some action of another processor, but the processor performing the action need not wait upon performing the action.
2. Analytic Models
In this report, a barrier configuration will be represented as in figure 1a. The vertical lines represent concurrently executing processors while the horizontal lines represent barriers across the processors they intersect. The time axis is vertical, with time increasing in the downward direction. A barrier configuration may also be represented as a directed acyclic graph (dag), with the graph nodes representing barriers and edges representing the ordering constraints among the barriers. The dag describes the partial ordering among the barriers, which are a partially ordered set (poset). A dag for the barrier configuration is shown in figure 1b.
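As a rough illustration of the dag representation, the sketch below derives ordering edges from a barrier configuration given as a list of barriers together with the processors each one spans. The four-processor configuration in the usage lines is hypothetical (it is not the configuration of figure 1a), and the function name `barrier_dag` is ours.

```python
def barrier_dag(barriers):
    """Return the set of ordering edges (earlier, later) between barriers.

    `barriers` is a list of (name, processors) pairs listed so that, on every
    processor, barriers appear in that processor's program order. An edge is
    added from the previous barrier seen on a processor to the next one on
    the same processor. Edges implied by transitivity may also appear.
    """
    last_on = {}          # processor -> most recent barrier seen on it
    edges = set()
    for name, processors in barriers:
        for p in processors:
            if p in last_on:
                edges.add((last_on[p], name))
            last_on[p] = name
    return edges


# Hypothetical configuration: b0 and b1 are unordered with respect to each
# other, b2 must follow both, and b3 spans all four processors.
config = [
    ("b0", {0, 1}),
    ("b1", {2, 3}),
    ("b2", {1, 2}),
    ("b3", {0, 1, 2, 3}),
]
edges = barrier_dag(config)
```

Here `b0` and `b1` share no processor, so no edge connects them; they are unordered in the sense used later in this report.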
Figures 1a & 1b: Sample Barrier Configuration & DAG
We will consider a simple example to compare the efficiency of barrier and directed synchronization. Observe the barrier configuration in figure 2. The code executed in a given processor between two barrier synchronizations is referred to as a "region", consistent with the definition of a region found in Dietz [Die87].
Figure 2: Barrier Implementing Directed Synchronization
Assume the four regions $r_0$ through $r_3$ have execution times which are independent and identically distributed. Regions $r_0$ and $r_3$ are the producer and consumer of a synchronization, respectively. Assuming that the time to execute the barrier and directed synchronization operations is equal to zero, it is clear that the directed synchronization version should be faster. But how significant is the difference? In the following paragraphs we examine this question.
Let $F_r(r)$ represent the distribution function of the random variable $r$, the time to execute a region. Let $f_r(r)$ represent the corresponding density function. Also, $b_i$ is a random variable corresponding to the execution time of barrier $i$. Then the distribution function for $b_1$ is given by
$$F_{b_1}(r) = F_r^2(r) \qquad (1)$$
yielding the density function
$$f_{b_1}(r) = 2 f_r(r) F_r(r) \qquad (2)$$
As an example, if $f_r(r)$ is a uniform distribution with a range from 0 to $m$, we have that the expected value (first moment) of $b_1$ is $E[b_1] = \frac{2}{3} m$ and, given the linearity of expected values, $E[b_2] = 2E[b_1] = \frac{4}{3} m$.
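For readers who want the intermediate step, the expected value quoted above can be checked directly from (2), assuming the uniform density $f_r(r) = 1/m$ and distribution function $F_r(r) = r/m$ on $[0, m]$:

```latex
E[b_1] = \int_0^m r \, f_{b_1}(r) \, dr
       = \int_0^m r \cdot 2 \cdot \frac{1}{m} \cdot \frac{r}{m} \, dr
       = \frac{2}{m^2} \cdot \frac{m^3}{3}
       = \frac{2}{3} m ,
\qquad
E[b_2] = 2 E[b_1] = \frac{4}{3} m .
```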
Let $d_2$ be the r.v. representing the execution time of barrier two, given that barrier one is implemented using a directed synchronization. Let $p_0 = r_0 + r_1$ and $p_1 = r_2 + r_3$, hence $d_2 = \max(p_0, p_1)$. Processor zero may start executing region one as soon as it finishes region zero. Clearly, the execution time using directed synchronization will differ from that using barrier synchronization only when $r_0 < r_2$ and $r_1 > r_3$. Given that the $r_i$ are independent and identically distributed (i.i.d.), this occurs with probability 0.25. For this case, we can express the directed sync execution time as
$$d_2 = \max(p_0, p_1) \qquad (3)$$
and then
$$f_{r_0+r_1}(z) = \int_{-\infty}^{\infty} f_{r_0}(z-y) \, f_{r_1}(y) \, dy \qquad (4)$$
Assuming a uniform distribution with range 0 to $m$, it can be shown that
$$E[r_0 + r_1] = E[r_2 + r_3] = m$$
as expected. The densities for $r_0$ and $r_1$ in (4) are actually conditional densities on the event $r_0 < r_2$ and $r_1 > r_3$. The conditional distribution for $r_0$ is given by
$$F_{r_0}(z \mid r_0 < r_2) = \frac{P[(r_0 < z) \cap (r_0 - r_2 < 0)]}{P[r_0 < r_2]} = 2P[(r_0 < z) \cap (r_0 - r_2 < 0)] = 2P[(r_0 < z) \cap (r_4 < 0)]$$
where $r_4 = r_0 - r_2$. Recall that for joint distributions
$$F_{XY}(x, y) = P[X < x, Y < y]$$
hence
$$F_{r_0}(z \mid r_0 < r_2) = 2F_{r_0 r_4}(z, 0)$$
Since $r_0$ and $r_4$ are not independent, the joint distribution is not directly available. Instead, we will approximate the conditional density with $f_{r_0}(z)$. To determine $E[d_2]$, we use (2) and, again assuming a uniform distribution, we have
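$$f_{d_2}(z) = \begin{cases} \dfrac{z^3}{m^4}, & 0 < z < m \\[6pt] \dfrac{2(2m - z)}{m^2}\left(1 - \dfrac{(2m - z)^2}{2m^2}\right), & m \le z < 2m \end{cases}$$
and
$$E[d_2] = \int_{-\infty}^{\infty} z f_{d_2}(z) \, dz = \frac{37}{30} m \approx 1.23m$$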
-00
Recalling the theorem of total probability
$$f(x) = f(x \mid A_1)P(A_1) + \cdots + f(x \mid A_n)P(A_n) \qquad (4)$$
where the events $A_1, A_2, \ldots, A_n$ form a partition of the event space, $f(x \mid A_j)$ represents the conditional density of $x$ given event $A_j$, and $P(A_j)$ represents the probability of event $A_j$. From (4) it follows that
$$E[x] = E[x \mid A_1]P(A_1) + E[x \mid A_2]P(A_2) + \cdots + E[x \mid A_n]P(A_n) \qquad (5)$$
and hence $E[d_2] = (\frac{37}{30} m)(0.25) + (\frac{4}{3} m)(0.75) \approx 1.3083m$ compared to $E[b_2] = \frac{4}{3} m \approx 1.3333m$. Thus, directed synchronization is only about 1.9% faster than barrier synchronization, under the given assumptions. Using the simulator, we obtained statistics for the expected values of the barrier and directed synchronization execution times for uniform distributions and found the difference to be less than 4%. Clearly, this difference is not large.
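The comparison can also be checked numerically. The sketch below is not the authors' simulator; it is a minimal Monte Carlo model under our reading of figure 2: processor 0 runs $r_0$ then $r_1$, processor 1 runs $r_2$ then $r_3$, synchronization itself costs zero time, and in the directed version only the consumer $r_3$ waits for the producer $r_0$. All function and parameter names are illustrative.

```python
import random


def uniform_region(rng, m=1.0):
    """Region time drawn uniformly from [0, m]."""
    return rng.uniform(0.0, m)


def exponential_region(rng, m=1.0):
    """Region time drawn from an exponential distribution with mean m."""
    return rng.expovariate(1.0 / m)


def simulate(sample_region, trials=100_000, seed=1):
    """Estimate (E[b2], E[d2]) for the two-processor, four-region example."""
    rng = random.Random(seed)
    total_barrier = 0.0
    total_directed = 0.0
    for _ in range(trials):
        r0, r1, r2, r3 = (sample_region(rng) for _ in range(4))
        # Barrier version: both processors wait at barrier one, then finish
        # at barrier two.
        total_barrier += max(r0, r2) + max(r1, r3)
        # Directed version: r1 starts right after r0; r3 starts once both
        # r2 (same processor) and r0 (the producer) have finished.
        total_directed += max(r0 + r1, max(r0, r2) + r3)
    return total_barrier / trials, total_directed / trials


barrier_mean, directed_mean = simulate(uniform_region)
```

Because this model keeps the dependence that the analysis above approximates away, its directed estimate is somewhat lower than 1.3083m: with uniform regions on [0, 1] it comes out near 1.283, against 1.333 for the barrier version, a gap of a little under 4%.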
Let us consider the same question given an exponential distribution for the regions, $f_{r_i}(z) = \lambda e^{-\lambda z}$, where the mean $m = \frac{1}{\lambda}$. From (2) we know that
$$f_{b_1}(z) = 2 f_{r_0}(z) F_{r_0}(z) = 2\lambda e^{-\lambda z}(1 - e^{-\lambda z})$$
The expected value of the execution time of $b_1$ is then
$$E[b_1] = \int_0^{\infty} z f_{b_1}(z) \, dz = \frac{3}{2\lambda}$$
from which it follows that
$$E[b_2] = 2E[b_1] = \frac{3}{\lambda} = 3m$$
If we ignore the conditional nature of the densities for the directed case, the resulting densities for $p_0$ and $p_1$ are given by
$$f_{p_0}(z) = f_{p_1}(z) = \lambda^2 z e^{-\lambda z}, \quad z > 0 \qquad (6)$$
for which $E[p_0] = 2m$. The density of $d_2$ can be determined using (2) and (6), and $E[d_2] = 2.75m$, under the given assumptions, and (5) yields only a 2% difference between the barrier and directed performance. Simulation results showed a difference of slightly less than 4%. A similar approach can be applied assuming Gaussian distributions, but the differences are again very small, less than 4%.
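The exponential figures can be checked by hand. The worked steps below assume only the density of $b_1$ given above and the standard integrals $\int_0^\infty z \lambda e^{-\lambda z} dz = 1/\lambda$ and $\int_0^\infty z \lambda e^{-2\lambda z} dz = 1/(4\lambda)$; the last two lines combine the two conditional expectations with the weights 0.25 and 0.75 used in the uniform case.

```latex
E[b_1] = 2 \int_0^\infty z \lambda e^{-\lambda z} \, dz
       - 2 \int_0^\infty z \lambda e^{-2\lambda z} \, dz
       = \frac{2}{\lambda} - \frac{1}{2\lambda}
       = \frac{3}{2\lambda} = 1.5\,m

E[d_2] \approx (2.75\,m)(0.25) + (3\,m)(0.75) = 2.9375\,m

\frac{E[b_2] - E[d_2]}{E[b_2]} = \frac{0.0625\,m}{3\,m} \approx 2.1\%
```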
We have seen that for a very simple case, the execution times of barrier and directed synchronization are quite close for a variety of distributions representing the behavior of different kinds of code. The exponential distribution approximates the execution time of a loop with a data-dependent exit test. The normal distribution would approximate the execution of straight-line code, given the occurrence of variable-time instructions and memory references. The additional wait time caused by the barrier is typically quite low. It should be noted that the assumption that the barrier and directed synchronizations execute in the same amount of time is quite generous to directed synchronization. As shown in the following sections, the hardware barrier can execute in a few clock cycles, compared to several hundred clock cycles for the fastest directed synchronization implementations. On the other hand, in this very simple case there cannot be any delays due to the total ordering in the SBM queue since the three barriers will always execute in the same order as they appear in the queue. In the next section, we quantify the delays caused by the static barrier queue.
3. Effect of SBM Delays
To understand the potential impact of delays imposed by the total ordering in the queue we will consider the following example, shown in figure 3. In this barrier configuration, n barriers are unordered² and there are n! possible orderings. The worst case for the SBM occurs when the code regions between barriers have the same expected execution times; in that case, no assumptions concerning the execution ordering of the barriers can be made, and the placement of the barriers at compile-time is essentially a random selection.
We will first characterize the number of barriers that are delayed by a particular SBM queue ordering, and show that these delays are equivalent to "combining" the delayed and delaying barriers into several larger barriers or even a single barrier. After characterizing the percentage of barriers combined for a given schedule, it is possible to estimate the delay caused by this combining phenomenon.
Consider the case with n = 3. There are six possible execution time orderings of barriers 1, 2, and 3³. Consider execution ordering 3→2→1: barriers 3 and 2 are forced to wait on barrier 1, and the effect is equivalent to the three barriers being combined into a single barrier. This ordering is shown in figure 4.
² Barriers are unordered if there are no constraints on the order in which they may execute.
³ Note that in this discussion, the numbering scheme for the barriers corresponds directly to their ordering in the SBM queue. Hence, barrier 1 is first in the queue, barrier 2 is second, etc.
Figure 3: Configuration with n Unordered Barriers
If the execution ordering is 2→1→3, barrier 2 is forced to wait for barrier 1 to execute, and these two barriers are, in effect, combined. The different execution orderings can be represented as a tree, shown in figure 5.
Figure 4: Effect of "Bad" Static Barrier Order
Figure 5: Tree Representing All Possible Execution Orders
Each level of the tree corresponds to the firing of a particular barrier. The leaves of the tree have been annotated with the number of barriers that are delayed given the particular execution ordering. We can determine the expected value for the percentage of barriers delayed (combined), which we will call the combining quotient, by weighting the number of barriers delayed by the appropriate probability. Under our assumptions, all execution orderings are equiprobable. Hence, the probability that p barriers are delayed is given by $K_n(p)/n!$, where $K_n(p)$ corresponds to the number of execution orderings with p barrier delays given n barriers in the queue. It can be shown that
$$K_n(p) = K_{n-1}(p) + (n-1) \, K_{n-1}(p-1) \qquad (7)$$
This recurrence has been used to generate Table 1.
                combining   probability that p barriers are combined, for p =
 n  E[delays]   quotient      0     1     2     3     4     5     6     7     8     9    10    11
 2    0.50        0.50      0.50  0.50  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
 3    1.17        0.58      0.17  0.50  0.33  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
 4    1.92        0.64      0.04  0.25  0.46  0.25  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
 5    2.72        0.68      0.01  0.08  0.29  0.42  0.20  0.00  0.00  0.00  0.00  0.00  0.00  0.00
 6    3.55        0.71      0.00  0.02  0.12  0.31  0.38  0.17  0.00  0.00  0.00  0.00  0.00  0.00
 7    4.41        0.73      0.00  0.00  0.03  0.15  0.32  0.35  0.14  0.00  0.00  0.00  0.00  0.00
 8    5.28        0.75      0.00  0.00  0.01  0.05  0.17  0.33  0.32  0.13  0.00  0.00  0.00  0.00
 9    6.17        0.77      0.00  0.00  0.00  0.01  0.06  0.19  0.33  0.30  0.11  0.00  0.00  0.00
10    7.07        0.79      0.00  0.00  0.00  0.00  0.02  0.07  0.20  0.32  0.28  0.10  0.00  0.00
11    7.98        0.80      0.00  0.00  0.00  0.00  0.00  0.02  0.09  0.21  0.32  0.27  0.09  0.00
12    8.90        0.81      0.00  0.00  0.00  0.00  0.00  0.01  0.03  0.10  0.22  0.32  0.25  0.08
Table 1: Combining Probabilities for Unordered Static Barriers
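The table entries can be regenerated with a few lines of code. The sketch below uses the recurrence in the form $K_n(p) = K_{n-1}(p) + (n-1) K_{n-1}(p-1)$, which is our reading of (7): the barrier placed last in the queue escapes delay only if it becomes ready after all the others, and is delayed in the remaining n-1 cases. The function names are ours. Dividing the expected number of delays by n-1 reproduces the third column of the table; that normalization is an observation from the printed values, not something stated in the text.

```python
from fractions import Fraction
from math import factorial


def delay_counts(n):
    """Return [K_n(0), ..., K_n(n-1)]: orderings of n barriers with p delays."""
    counts = [1]                       # a single barrier is never delayed
    for size in range(2, n + 1):
        nxt = [0] * size
        for p, count in enumerate(counts):
            nxt[p] += count                    # new barrier ready last: no new delay
            nxt[p + 1] += (size - 1) * count   # ready earlier: one more delay
        counts = nxt
    return counts


def delay_probabilities(n):
    """Probability that exactly p barriers are delayed, for p = 0 .. n-1."""
    total = factorial(n)
    return [Fraction(k, total) for k in delay_counts(n)]


def expected_delays(n):
    """Expected number of delayed barriers among n unordered barriers."""
    return sum(p * prob for p, prob in enumerate(delay_probabilities(n)))


for n in (2, 3, 4, 12):
    e = float(expected_delays(n))
    print(n, round(e, 2), round(e / (n - 1), 2))
```

For n = 4, `delay_counts` returns `[1, 6, 11, 6]`, which divided by 4! = 24 gives the row 0.04, 0.25, 0.46, 0.25.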
The table shows that as the number of unordered barriers in the queue increases, the combining quotient, given in the third column, increases asymptotically. The combining quotient versus number of unordered barriers is given in figure 6. Each row in table 1 corresponds to a certain number of unordered barriers, and the columns labeled 0 through 11 contain the probabilities that p barriers will be combined, where p is the column number. For example, with 11 unordered barriers in the queue, the probability that 7 barriers are combined is 0.21. It can be seen from figure 6 that over 80% of the barriers are combined when there are more than 11 unordered barriers in the queue. The percentage is less for smaller numbers of barriers. When the number is from two to five, less than 70% of the barriers are combined.
Figure 6: Combining Quotient Vs. No. of Unordered Barriers
The analysis suggests that if the barriers are expected to execute at about the same time, many of the barriers are combined. This phenomenon is undesirable in that many processes are forced to wait for the slower processes. To obtain an estimate of these effects, we will determine the expected execution times for processes forced to wait on a barrier, focusing on worst case upper bounds.
Since a large percentage of the barriers in the SBM queue are combined, it is important to know how this affects the execution time of a given program. The upper bound on the additional delays introduced can be determined by assuming that all barriers are combined into a single barrier across all processors. This simplifies the analysis considerably. We will determine the expected execution time for this case for several different distributions of region execution times, but first we give a general upper bound found in Hwang [HwB84]. If random variables $x_1, x_2, \ldots, x_n$ are independent and identically distributed (i.i.d.) with mean $m$ and standard deviation $s$, then
$$E[\max(x_i)] \le m + s \, \frac{n-1}{\sqrt{2n-1}}$$
Hence, we see that the growth in the expected value of the barrier execution time is $O(\sqrt{n})$. It is possible to obtain tighter bounds if particular distributions are assumed.
Let us assume that the region execution times are identically and uniformly distributed, with mean $\frac{m}{2}$ and ranging from 0 to $m$, and there are n regions that all participate in the barrier. First, we note that the distribution of the max function is given by $F_{max}(z) = F_u^n(z)$, where $F_u(z)$ is the uniform distribution function. Hence,
$$f_{max}(z) = \frac{n}{m}\left(\frac{z}{m}\right)^{n-1}, \quad 0 \le z \le m$$
which yields
$$E(z) = \frac{n}{m^n} \int_0^m z^n \, dz = m\left(\frac{n}{n+1}\right)$$
and $E(z) \to m$ as $n \to \infty$, as expected. If we assume that the n regions have an exponential distribution, we can use results concerning mean time to failure for parallel systems [Tri82]. In this case,
$$E(z) = m \sum_{i=1}^{n} \frac{1}{i} \approx m \log_e n$$
where $m$ is the mean of the exponential distribution.
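The two closed forms above are easy to evaluate and to spot-check by sampling. The sketch below is illustrative only; the function names and the small Monte Carlo helper are ours, and the helper assumes the n region times are independent.

```python
import random


def uniform_max_mean(n, m=1.0):
    """Expected maximum of n independent Uniform(0, m) times: m * n / (n + 1)."""
    return m * n / (n + 1)


def exponential_max_mean(n, m=1.0):
    """Expected maximum of n independent exponential times with mean m: m * H_n."""
    return m * sum(1.0 / i for i in range(1, n + 1))


def simulated_max_mean(sample, n, trials=20_000, seed=7):
    """Monte Carlo estimate of E[max] using sample(rng) to draw one region time."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(sample(rng) for _ in range(n))
    return total / trials


n = 8
print(uniform_max_mean(n), simulated_max_mean(lambda rng: rng.uniform(0.0, 1.0), n))
print(exponential_max_mean(n), simulated_max_mean(lambda rng: rng.expovariate(1.0), n))
```

Note that the harmonic sum exceeds $\log_e n$ by an amount that tends to about 0.577 (the Euler-Mascheroni constant), so $m \log_e n$ describes the growth rate rather than the exact value.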
Now let us assume Gaussian distributions with mean $m$ and standard deviation $s$. This yields the following equation for the random variable representing the barrier execution time:
$$z = \max(s z_1 + m, s z_2 + m, \ldots, s z_n + m) = s \max(z_1, z_2, \ldots, z_n) + m$$
where the $z_i$ are i.i.d. Gaussian random variables with mean 0 and standard deviation 1. Clearly, $E(z) = s \, E(\max(z_i)) + m$. Rather than evaluating this equation directly, we will employ an important idea from order statistics discussed in [KrW84] and [Gum58]. Given i.i.d. random variables $(x_1, x_2, \ldots, x_n)$, each having distribution function $G(x)$, their characteristic maximum value $m_n$ is the solution of the equation
$$1 - G(m_n) = \frac{1}{n} \qquad (9)$$
It turns out that $m_n$ is a good estimate of $\max(x_1, x_2, \ldots, x_n)$ for large values of n. Kruskal and Weiss [KrW84] show that for a normal distribution
$$m_n \approx m + s\sqrt{2 \log_e n} \qquad (10)$$
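To see how the general bound and the normal-distribution estimate compare, the sketch below evaluates the distribution-free bound $m + s(n-1)/\sqrt{2n-1}$, the approximation (10), and the exact solution of (9) for a normal $G$ obtained from the inverse normal distribution function in Python's standard library. The function names are ours, and the example parameters (mean 100, standard deviation 20, 16 barriers) are chosen only for illustration.

```python
from math import log, sqrt
from statistics import NormalDist


def general_upper_bound(n, m, s):
    """Distribution-free bound on E[max] of n i.i.d. variables."""
    return m + s * (n - 1) / sqrt(2 * n - 1)


def normal_char_max_approx(n, m, s):
    """Approximation (10) to the characteristic maximum of n normal variables."""
    return m + s * sqrt(2 * log(n))


def normal_char_max_exact(n, m, s):
    """Solve 1 - G(m_n) = 1/n for a normal G. Requires n >= 2."""
    return NormalDist(mu=m, sigma=s).inv_cdf(1 - 1 / n)


n, m, s = 16, 100.0, 20.0
print(round(general_upper_bound(n, m, s), 2))
print(round(normal_char_max_approx(n, m, s), 2))
print(round(normal_char_max_exact(n, m, s), 2))
```

For these parameters the exact solution of (9) lies below the approximation (10), which in turn lies below the general bound; the approximation is an asymptotic one and is loose for small n.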
These bounds, developed for several different distributions, suggest that even in the worst case, when all barriers are combined into a single barrier, the growth in completion time is fairly restrained. Figure 7 shows a plot of these equations for the three different distributions considered here. The number of unordered barriers was scaled by the combining quotient to get a more accurate bound. We assume that the barriers are combined into a single barrier, and that this single barrier executes last. Note that for the normal distribution, the expected execution time is only $2\frac{1}{2}$ standard deviations away from the mean.
Figure 7: Execution Time Vs. No. of Unordered Barriers (vertical axis: execution time in standard deviations from the mean)
Simulation results discussed in the next section support these analytic results. In addition, two techniques, one related to static scheduling and the other concerning the barrier hardware, are proposed to reduce the delays due to combining effects.
4. Staggered Barrier Scheduling
The analysis given in the previous section made the worst case assumption that the unordered barriers were scheduled such that they all had the same expected execution time. In this situation, the compiler has no useful information concerning the ordering of the barriers in the SBM queue. Any random ordering of the barriers would be expected to perform just as well as any other ordering. We now introduce the concept of staggered barrier scheduling. This refers to scheduling barriers so that the expected execution time of a set of unordered barriers $\{b_1, b_2, \ldots, b_i, \ldots, b_n\}$ is a monotone nondecreasing function. Let $E(b_i)$ be the expected execution time of barrier $b_i$. Then the following equation
$$E(b_{i+\phi}) - E(b_i) = \delta \, E(b_i) \qquad (11)$$
defines the stagger coefficient $\delta$ and the integral stagger distance $\phi$. We say that two barriers $b_j$ and $b_k$ are adjacent if $|j - k| = \phi$. The stagger coefficient $\delta$ refers to the percentage difference between the expected execution times of adjacent barriers. Figure 8 shows a schedule of four barriers with a stagger coefficient $\delta = 0.10$ and stagger distance $\phi = 1$.
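As a small worked example of (11), the sketch below lists the expected execution times of a staggered schedule. The equation only relates barriers that are $\phi$ apart, so the sketch makes one extra assumption of its own: the first $\phi$ barriers all share the same base expected time. The function name and the base value of 100 time units are illustrative.

```python
def staggered_means(n, base, delta, phi):
    """Expected execution times E(b_1), ..., E(b_n) satisfying equation (11).

    Assumption (not from the text): the first `phi` barriers all have
    expected time `base`. Each later barrier is (1 + delta) times the
    barrier `phi` positions before it.
    """
    means = []
    for i in range(n):
        if i < phi:
            means.append(base)
        else:
            means.append((1 + delta) * means[i - phi])
    return means


print(staggered_means(4, 100.0, 0.10, 1))   # stagger distance 1
print(staggered_means(4, 100.0, 0.10, 2))   # stagger distance 2
```

With stagger distance 1 the four expected times are approximately 100, 110, 121, and 133.1; with stagger distance 2 they are approximately 100, 100, 110, and 110, so barriers are staggered in pairs.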
Figure 8: Staggered Barrier Schedule ($\phi = 1$, $\delta = 0.10$)
Figure 9 shows a similar schedule of four barriers, except the stagger distance $\phi = 2$.
Figure 9: Staggered Barrier Schedule ($\phi = 2$, $\delta = 0.10$)
The advantage of staggered scheduling is that the barriers can now be expected to execute in a particular order with a higher probability than if there were no staggering. This "expected" execution ordering can then be used as the ordering of the barriers in the SBM queue. Let us consider an example. Let $X_i$ represent the random variable for the execution time of barrier $b_i$. We wish to determine $P[X_{i+m\phi} > X_i]$, the probability that barrier $b_{i+m\phi}$ executes after $b_i$. The former barrier is staggered $m\delta$ percent from the latter. We have
$$P[X_{i+m\phi} > X_i] = P[X_{i+m\phi} - X_i > 0] = 1 - P[X_{i+m\phi} - X_i < 0] = 1 - F_{X_{i+m\phi} - X_i}(0)$$
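and if exponential distributions are assumed, with $\lambda_j$ denoting the rate of $X_j$,
$$P[X_{i+m\phi} > X_i] = \frac{\lambda_i}{\lambda_i + \lambda_{i+m\phi}}$$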
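For a quick feel of how much a given stagger helps, the ordering probability can be evaluated in closed form in two simple cases. The sketch below assumes the two barrier times are independent and treats them directly as exponential or normal random variables with the stated means, which is a simplification of our own; the function names are illustrative. For independent exponentials with means $\mu_i$ and $\mu_j$ the probability that the second exceeds the first is $\mu_j / (\mu_i + \mu_j)$, and for independent normals it is the standard normal distribution function evaluated at the mean difference divided by the standard deviation of the difference.

```python
from math import sqrt
from statistics import NormalDist


def p_after_exponential(mean_i, mean_j):
    """P[X_j > X_i] for independent exponential times with the given means."""
    return mean_j / (mean_i + mean_j)


def p_after_normal(mean_i, mean_j, s_i, s_j):
    """P[X_j > X_i] for independent normal times (means, standard deviations)."""
    return NormalDist().cdf((mean_j - mean_i) / sqrt(s_i ** 2 + s_j ** 2))


# A 10% stagger on a mean of 100, with standard deviation 20 in the normal case.
print(round(p_after_exponential(100.0, 110.0), 3))
print(round(p_after_normal(100.0, 110.0, 20.0, 20.0), 3))
```

With no stagger both functions return 0.5, which is the "random ordering" situation of the previous section; any positive stagger moves the probability above one half.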
Simulation results show that staggered scheduling reduces the delay caused by queue waits, i.e. waits caused solely by the SBM queue ordering. Figure 10 shows the simulation results assuming that region execution times have a normal distribution with $\mu = 100$ and $s = 20$, $\phi = 1$, and $\delta$ set to 0.0, 0.05, and 0.10.
It is evident from figure 10 that staggering the barriers can significantly reduce the accumulated delays caused by queue waits.
Figure 10: Effect of Staggering on Queue Wait Time (horizontal axis: unordered barriers)
5. Hybrid Barrier Mechanism
Recall that the static barrier hardware design (a queue) was much simpler and less expensive than the dynamic barrier design (an associative memory). When first proposed, the static barrier MIMD design was favored due to the smaller hardware requirements, although it could introduce delays not present in a dynamic barrier scheme. In this section, we propose a hybrid barrier mechanism, shown in figure 11, that combines the associative buffer of the dynamic barrier with the queue of the static barrier.
Figure 11: Hybrid Static/Dynamic Barrier Architecture (input labeled "from PE WAIT outputs")
The basic idea is to use the queue to load the barriers into a very small associative memory (in figure 11 the associative memory has four cells). Preliminary simulation results have shown that the associative memory in the hybrid barrier architecture need be no larger than four to five cells to reduce delays caused by the barrier hardware mechanism to almost zero.
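The idea can be explored with a toy timing model. The sketch below is not the simulator used for this report and makes several assumptions of its own: each barrier has a single non-negative "ready" time, barriers are unordered and span disjoint processors, the associative memory is filled from the head of the queue, a resident barrier fires as soon as it is ready, and the freed cell is refilled immediately with the next queued barrier. Function and parameter names are illustrative.

```python
def hybrid_fire_times(ready, cells):
    """Fire times for barriers queued in list order, given `cells` memory cells.

    With cells == 1 this reduces to the pure static queue; with cells >= the
    number of barriers every barrier fires as soon as it is ready.
    """
    n = len(ready)
    fire = [None] * n
    resident = list(range(min(cells, n)))      # barriers currently in the memory
    loaded_at = {i: 0.0 for i in resident}     # time each one entered the memory
    next_in_queue = len(resident)
    while resident:
        # Pick the resident barrier that is able to fire earliest.
        i = min(resident, key=lambda b: max(ready[b], loaded_at[b]))
        fire[i] = max(ready[i], loaded_at[i])
        resident.remove(i)
        if next_in_queue < n:                  # refill the freed cell
            resident.append(next_in_queue)
            loaded_at[next_in_queue] = fire[i]
            next_in_queue += 1
    return fire


def total_delay(ready, cells):
    """Sum over barriers of (fire time - ready time)."""
    return sum(f - r for f, r in zip(hybrid_fire_times(ready, cells), ready))


ready = [5.0, 3.0, 4.0, 1.0]
for cells in (1, 2, 4):
    print(cells, hybrid_fire_times(ready, cells), total_delay(ready, cells))
```

In this example the total delay falls from 7.0 with one cell (the pure queue) to 3.0 with two cells and 0.0 with four. Because the model is so simple, it cannot reproduce second-order effects that a full simulation of the hardware might show.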
Preliminary simulation results are displayed in Figures 12 and 13. The horizontal axis indicates the number of unordered barriers that are to be executed, while the vertical axis represents the total barrier delay, normalized to $\mu$. The region execution times are
Figure 12: Effect of Hybrid Architecture on Queue Wait Time (horizontal axis: unordered barriers; curves: pure SBM and hybrid with AM = 2, 3, and 4)
From Figure 12, it is evident that the hybrid barrier scheme reduces barrier delays almost to zero for small associative buffer sizes. There is an anomaly here for an associative buffer size of two: in this case, the barrier delays are greater than those of the pure static barrier scheme when the number of barriers is greater than about eight. The reasons for this anomaly are currently under investigation, but no clear answer is currently available. This anomaly is of more theoretical than practical significance.
Figure 13 shows the results when staggered scheduling is employed with $\delta = 0.10$ and $\phi = 1$. The effects of staggering alone reduce the delays significantly.
Figure 13: Effects of Staggering + Hybrid Architecture on Queue Wait Time ($\delta = 0.10$; horizontal axis: unordered barriers; vertical axis: queue wait, normalized; curves: pure SBM, hybrid with AM = 2, and hybrid with AM = 3, 4)
6. Conclusions and Further Work
This report has discussed the hardware barrier, a new technique for fast synchronization in parallel processors. We have shown that in some cases the hardware barrier can compete with directed synchronization even under assumptions very favorable to directed synchronization. Whether this is true for the general case is currently under study, but simulations to date suggest that barriers can compete with directed synchronization. The effects of static barrier delays have been quantified, and upper bounds developed. Two techniques for reducing these delays were developed: staggered barrier scheduling and the hybrid barrier mechanism. Simulations run to date have shown these techniques to be especially effective when used together. Additional simulations need to be performed to verify these preliminary results. Various stagger coefficients and stagger distances should be simulated and compared.
The performance analysis studies to date have concentrated almost wholly on scheduling large-grain tasks. However, the initial proposal for the hardware barrier mechanism focused on the small granularity parallelism made available by this fast synchronization technique. Hence, the next phase of the performance analysis will concentrate on fine-grained scheduling. Various heuristics are currently being considered for scheduling barrier MIMDs, and an interface between the code scheduler and the simulator will be constructed to test the results of the code scheduler using the various heuristics.
The overhead and implementation requirements for different synchronization techniques need to be examined. As described in this report, barriers can be designed to execute at the clock cycle level. This is not possible with other techniques which must access shared memory, registers, or even a combining network. In addition, unlike a barrier, these techniques do not provide a statically predictable time to perform the synchronization. Barrier synchronization requires additional hardware, including a synchronization












[DiS87] H.G. Dietz and T. Schwederski, "Extending Static Synchronization Beyond SIMD and VLIW," Technical Report TR-EE 88-25, School of Electrical Engineering, Purdue University, June 1988.
[Die87] H.G. Dietz, The Refined-Language Approach to Compiling For Parallel Supercomputers, Ph.D. Dissertation, Polytechnic University, June 1987.
[Axe86] T.S. Axelrod, "Effects of Synchronization Barriers on Multiprocessor Performance," Parallel Computing, Vol. 3, pp. 129-140, 1986.
[HeF88] D. Hensgen, R. Finkel, and U. Manber, "Two Algorithms for Barrier Synchronization," Int. Journal of Parallel Programming, Vol. 17, No. 1, pp. 1-17.
[KrW84] C.P. Kruskal and A. Weiss, "Allocating Independent Subtasks on Parallel Processors," Int. Conf. on Parallel Processing, pp. 236-240, 1984.
[HwB84] K. Hwang and F.A. Briggs, Computer Architecture and Parallel Processing, McGraw-Hill: New York, 1984, pg. 611.
[Tri82] K.S. Trivedi, Probability and Statistics with Reliability, Queuing, and Computer Science Applications, Prentice-Hall: Englewood Cliffs, NJ, 1982, pp. 217-219.
[Gum58] E.J. Gumbel, Statistics of Extremes, Columbia University Press: New York, 1958.
