An Approach to Scalability Study of Shared Memory Parallel Systems by Anand Sivasubramaniam et al.
Anand Sivasubramaniam Aman Singla Umakishore Ramachandran H. Venkateswaran
College of Computing
Georgia Institute of Technology
Atlanta, GA 30332-0280.
{anand, aman, rama, venkat}@cc.gatech.edu
In Proceedings of the ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 171-180, May 1994.
Abstract
The overheads in a parallel system that limit its scalability need to be
identified and separated in order to enable parallel algorithm design
and the development of parallel machines. Such overheads may be
broadly classified into two components. The first one is intrinsic to
the algorithm and arises due to factors such as the work-imbalance
and the serial fraction. The second one is due to the interaction
between the algorithm and the architecture and arises due to latency
and contention in the network. A top-down approach to scalability
study of shared memory parallel systems is proposed in this research.
We define the notion of overhead functions associated with
the different algorithmic and architectural characteristics to quantify
the scalability of parallel systems; we isolate the algorithmic
overhead and the overheads due to network latency and contention
from the overall execution time of an application; we design and
implement an execution-driven simulation platform that incorporates
these methods for quantifying the overhead functions; and we use
this simulator to study the scalability characteristics of five
applications on shared memory platforms with different communication
topologies.
1 Introduction
Scalability is a notion frequently used to signify the “goodness” of
parallel systems, where the term parallel system is used to denote
an application-architecture combination. A good understanding of
this notion may be used to: select the best architecture platform for
an application domain, predict the performance of an application
on a larger configuration of an existing architecture, identify
application and architectural bottlenecks in a parallel system, and glean
insight on the interaction between an application and an architecture
to understand the scalability of other application-architecture pairs.
In this paper, we develop a framework for studying the inter-play
between applications and architectures to understand their
implications on scalability. Since real-life applications set the standards for
computing, it is meaningful to use such applications for studying the
scalability of parallel systems. We call such an application-driven
approach a top-down approach to scalability study. The main thrust
of this approach is to identify the important algorithmic and
architectural artifacts that impact the performance of a parallel system,
understand the interaction between them, quantify the impact of
these artifacts on the execution time of an application, and use these
quantifications in studying the scalability of the system.
* This work has been funded in part by NSF grants MIPS-9058430 and
MIPS-9200005, and an equipment grant from DEC.
The main contributions of our work can be summarized as follows:
we define the notion of overhead functions associated with the
different algorithmic and architectural characteristics; we develop
a method for separating the algorithmic overhead; we also isolate
the overheads due to network latency (the actual hardware
transmission time in the network) and contention (the amount of time
spent waiting for a resource to become free in the network) from the
overall execution time of an application; we design and implement
a simulation platform that quantifies these overheads; and we use
this simulator to study the scalability of five applications on shared
memory platforms with three different network topologies.
Performance metrics such as speedup [2], scaled speedup [11],
sizeup [25], experimentally determined serial fraction [12], and
isoefficiency function [13] have been proposed for quantifying the
scalability of parallel systems. While these metrics are extremely
useful for tracking performance trends, they do not provide
adequate information needed to understand the reason why an
application does not scale well on an architecture. The overhead functions
that we identify, separate, and quantify in this work help us
overcome this inadequacy. We are not aware of any other work that
separates these overheads (in the context of real applications), and
believe that such a separation is very important in understanding
the interaction between applications and architectures. The growth
of overhead functions will provide key insights on the scalability of
a parallel system by suggesting application restructuring, as well as
architectural enhancements.
Several performance studies address issues such as latency,
contention and synchronization. The scalability of synchronization
primitives supported by the hardware [3, 15] and the limits on
interconnection network performance [1, 16] are examples of such
studies. While such issues are extremely important, it is necessary to
put the impact of these factors into perspective by considering them
in the context of overall application performance. There are studies
that use real applications to address specific issues like the effect of
sharing in parallel programs on the cache and bus performance [10]
and the impact of synchronization and task granularity on parallel
system performance [6]. Cypher et al. [9] identify the architectural
requirements such as floating point operations, communication, and
input/output for message-passing scientific applications. Rothberg
et al. [18] conduct a similar study towards identifying the cache and
memory size requirements for several applications. However, there
have been very few attempts at quantifying the effects of algorithmic
and architectural interactions in a parallel system.
This work is part of a larger project which aims at understanding
the significant issues in the design of scalable parallel systems using
the above-mentioned top-down approach. In our earlier work, we
studied issues such as task granularity, data distribution,
scheduling, and synchronization, by implementing frequently used parallel
algorithms on shared memory [21] and message-passing [20]
platforms. In [24], we illustrated the top-down approach for the
scalability study of message-passing systems. In this paper, we conduct
a similar study for shared memory systems. In a companion
paper [23] we evaluate the use of abstractions for the network and
locality in the context of simulating cache-coherent shared memory
multiprocessors.
The top-down approach and the overhead functions are
elaborated in Section 2. Details of our simulation platform, SPASM
(Simulator for Parallel Architectural Scalability Measurements),
which quantifies these overhead functions, are also discussed in this
section. The characteristics of the five applications used in this
study are summarized in Section 3, details of the three shared
memory platforms are presented in Section 4, and the results of our study
with their implications on scalability are summarized in Section 5.
Concluding remarks are presented in Section 6.
2 Top-Down Approach
Adhering to the RISC ideology in the evolution of sequential
architectures, we would like to use real world applications in the
performance evaluation of parallel machines. However,
applications normally tend to contain large volumes of code that are not
easily portable and a level of detail that is not very familiar to
someone outside that application domain. Hence, computer scientists
have traditionally used parallel algorithms that capture the
interesting computation phases of applications for benchmarking their
machines. Such abstractions of real applications that capture the
main phases of the computation are called kernels. One can go
even lower than kernels by abstracting the main loops in the
computation (like the Lawrence Livermore loops [14]) and evaluating
their performance. As one goes lower, the outcome of the
evaluation becomes less realistic. Even though an application may be
abstracted by the kernels inside it, the sum of the times spent in the
underlying kernels may not necessarily yield the time taken by the
application. There is usually a cost involved in moving from one
kernel to another, such as the data movements and rearrangements
in an application that are not part of the kernels that it is comprised
of. For instance, an efficient implementation of a kernel may need
to have the input data organized in a certain fashion which may not
necessarily be the format of the output from the preceding kernel in
the application. Despite its limitations, we believe that the
scalability of an application with respect to an architecture can be captured
by studying its kernels, since they represent the computationally
intensive phases of an application. Therefore, we have used kernels
in this study.
Parallel system overheads (see Figure 1) may be broadly
classified into a purely algorithmic component (algorithmic overhead),
and a component arising from the interaction of the algorithm and
the architecture (interaction overhead). The algorithmic overhead
is quantified by computing the time taken for execution of a given
parallel program on an ideal machine such as the PRAM [26] and
measuring its deviation from a linear speedup curve. A real
execution could deviate significantly from the ideal execution due to
overheads such as latency, contention, synchronization, scheduling
and cache effects. These overheads are lumped together as the
interaction overhead. In an architecture with no contention
overhead, the communication pattern of the application would dictate
the latency overhead incurred by it. Thus the performance of an
application (on an architecture devoid of network contention) may
lie between the ideal curve and the real execution curve (see Figure
1). Therefore, to fully understand the scalability of a parallel
system it is important to isolate the influence of each component of the
interaction overhead on the overall performance.
Figure 1: Top-down Approach to Scalability Study [plot of Speedup vs. Processors; curves: Linear, Ideal, Real Execution; labeled gaps: Algorithmic Overhead, Interaction Overhead (Contention, Other Overheads)]
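The decomposition described above can be illustrated with a minimal sketch. It assumes three execution times are available for the same problem: the time on one processor, the time on an ideal (PRAM-like) machine with p processors, and the time on the real machine with p processors. The class and function names are illustrative and are not part of SPASM.

```python
from dataclasses import dataclass


@dataclass
class OverheadBreakdown:
    linear_time: float           # T(1) / p, the time under linear speedup
    algorithmic_overhead: float  # ideal time minus linear time
    interaction_overhead: float  # real time minus ideal time


def decompose(t1: float, t_ideal: float, t_real: float, p: int) -> OverheadBreakdown:
    """Split the gap between linear and real execution time on p processors."""
    if p < 1:
        raise ValueError("p must be at least 1")
    linear = t1 / p
    return OverheadBreakdown(
        linear_time=linear,
        algorithmic_overhead=t_ideal - linear,
        interaction_overhead=t_real - t_ideal,
    )
```

The interaction overhead returned here is still a lump sum; separating it into latency, contention, and other components requires the per-message accounting that the simulator performs.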
The key elements of our top-down approach for studying the
scalability of parallel systems are:
• experiment with real world applications
• identify parallel kernels that occur in these applications
• study the interaction of these kernels with architectural features
to separate and quantify the overheads in the parallel system
• use these overheads for predicting the scalability of parallel
systems.
2.1 Implementing the Top-Down Approach
Scalability study of parallel systems is complex due to the several
degrees of freedom that they exhibit. Experimentation, simulation,
and analytical models are three techniques that have been commonly
used in such studies. But it is well-known that each has its own
limitations. The main focus of our top-down approach is to quantify
the overheads that arise from the interaction between the kernels
and the architecture and their impact on the overall execution of the
application. Experimentation on real architectures does not allow
studying the effects of changing individual architectural parameters
on the performance. It is not clear that analytical models can
realistically capture the complex and dynamic interactions between
applications and architectures. Therefore, we use simulation for
quantifying and separating the overheads.
Our simulation platform (SPASM), to be presented in the next
sub-section, provides an elegant set of mechanisms for
quantifying the different overheads we discussed earlier. The algorithmic
overhead is quantified by computing the time taken for execution
of a given parallel program on an ideal machine such as the PRAM
[26] and measuring its deviation from a linear speedup curve. The
interaction overhead is also separated into its component parts. We
currently do not address scheduling overheads (footnote 1). Accesses to
variables in a shared memory system may involve the network, and the
physical limitations of the network tend to contribute to overheads
in the execution. These overheads may be broadly classified as
latency and contention, and we associate an overhead function with
each. The Latency Overhead Function is thus defined as the total
amount of time spent by a processor waiting for messages due to
the transmission time on the links and the switching overhead in
the network assuming that the messages did not have to contend
for any link. Likewise, the Contention Overhead Function is the
total amount of time incurred by a processor due to the time spent
waiting for links to become free by the messages. Shared memory
systems normally provide some synchronization support that is as
simple as an atomic read-modify-write operation, or may provide
special hardware for more complicated operations like barriers and
queue-based locks. While the latter may save execution time for
complicated synchronization operations, the former is more
flexible for implementing a variety of such operations. For reasons of
generality, we assume that only the test&set operation is supported
by shared memory systems. We also assume that the memory
module (at which the operation is performed) is intelligent enough to
perform the necessary operation in unit time. With such an
assumption, the only network overhead due to the synchronization
operation (test&set) is a roundtrip message, and the overheads for
such a message are accounted for in the latency and contention
overhead functions described earlier. The waiting time incurred by
a processor during synchronization operations is accounted for in
the CPU time which would manifest itself as an algorithmic
overhead. The statistics (CPU time, latency overhead, and contention
overhead) are quantified and presented for each interesting mode of
the program execution (see Section 2.2).
Footnote 1: We do not distinguish between the terms process, processor
and thread, and use them synonymously in this paper.
Constant problem size (where the problem size remains
unchanged as the number of processors is increased), memory
constrained (where the problem size is scaled up linearly with the
number of processors), and time constrained (where the problem
size is scaled up to keep the execution time constant with increasing
number of processors) are three well-accepted scaling models used
in the study of parallel systems. Overhead functions can be used
to study the growth of system overheads for any of these scaling
strategies. In our simulation experiments, we limit ourselves to the
constant problem size scaling model.
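The three scaling models can be contrasted with a short sketch. It is only an illustration: the execution-time function passed to `time_constrained` is a hypothetical model supplied by the caller, assumed to increase without bound in the problem size n, and none of these helper names come from the paper.

```python
from typing import Callable


def constant_problem_size(n0: int, p: int) -> int:
    """Problem size stays fixed as processors are added."""
    return n0


def memory_constrained(n0: int, p: int) -> int:
    """Problem size grows linearly with the number of processors."""
    return n0 * p


def time_constrained(n0: int, p: int, exec_time: Callable[[int, int], float]) -> int:
    """Largest n >= n0 with exec_time(n, p) <= exec_time(n0, 1).

    Assumes exec_time(n, p) increases without bound in n (hypothetical model).
    """
    target = exec_time(n0, 1)
    lo = hi = n0
    # Double the upper bound until it exceeds the time budget.
    while exec_time(hi, p) <= target:
        lo, hi = hi, hi * 2
    # Binary search between the last size that fits and the first that does not.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if exec_time(mid, p) <= target:
            lo = mid
        else:
            hi = mid
    return lo
```

For example, with a perfectly parallel quadratic workload `lambda n, p: n * n / p`, quadrupling the processors doubles the time-constrained problem size.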
2.2 SPASM
SPASM is an execution-driven simulator written in CSIM. As with
other recent simulators [5, 7, 17], the bulk of the instructions in the
parallel program is executed at the speed of the native processor
(SPARC in this study) and only the instructions (such as LOADs
and STOREs) that may potentially involve a network access are
simulated. The inputs to the simulator are parallel applications
written in C. These programs are pre-processed (to label shared
memory accesses), the compiled assembly code is augmented with
cycle counting instructions, and the assembled binary is linked with
the simulator code. The system parameters that can be specified to
SPASM are: the number of processors (p), the clock speed of the
processor, the hardware bandwidth of the links in the network, and
the switching delays.
2.2.1 Metrics
SPASM provides a wide range of statistical information about the
execution of the program. It gives the total time (simulated time)
which is the maximum of the running times of the individual parallel
processors. This is the time that would be taken by an execution of
the parallel program on the target parallel machine. Speedup using
p processors is measured as the ratio of the total time on 1 processor
to the total time on p processors.
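These two definitions translate directly into code. The sketch below assumes the per-processor running times are already available as a list; the function names are illustrative.

```python
from typing import Sequence


def total_time(per_processor_times: Sequence[float]) -> float:
    """Total (simulated) time: the maximum running time over all processors."""
    return max(per_processor_times)


def speedup(times_on_1: Sequence[float], times_on_p: Sequence[float]) -> float:
    """Ratio of the total time on 1 processor to the total time on p processors."""
    return total_time(times_on_1) / total_time(times_on_p)
```

Taking the maximum rather than the mean means a single slow processor determines the reported time, which is how work imbalance shows up in the speedup.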
Ideal time is the total time taken by a parallel program to execute
on an ideal machine such as the PRAM. It includes the algorithmic
overhead but does not include the interaction overhead. SPASM
simulates an ideal machine to provide this metric. As we mentioned
in Section 2, the difference between the linear time and the ideal
time gives the algorithmic overhead.
SPASM quantifies both the latency overhead function as well as
the contention overhead function seen by a processor as described
in Section 2. This is done by time-stamping messages when they are
sent. At the time a message is received, the time that the message
would have taken in a contention free environment is charged to the
latency overhead function while the rest of the time is accounted
for in the contention overhead function. Though not relevant to
this study, it is worthwhile to mention that SPASM provides the
latency and contention incurred by a message as well as the latency
and contention that a processor may choose to see. Even though
a message may incur a certain latency and contention, a processor
may choose to hide all or part of it by overlapping computation with
communication. Such a scenario may arise with a non-blocking
message operation on a message-passing machine or with a prefetch
operation on a shared memory machine. But for the rest of this paper
(since we deal with blocking load/store shared memory operations),
we assume that a processor sees all of the network latency and
contention.
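The time-stamping scheme can be sketched as follows. This is not SPASM code: the `Message` and `ProcessorStats` classes are hypothetical, and the contention-free transit time is assumed to be known for each message (for example, computed from the message size, hop count, and link bandwidth).

```python
from dataclasses import dataclass


@dataclass
class Message:
    send_time: float             # time-stamp recorded when the message is sent
    contention_free_time: float  # transit time if no link were ever busy


@dataclass
class ProcessorStats:
    latency_overhead: float = 0.0
    contention_overhead: float = 0.0

    def on_receive(self, msg: Message, receive_time: float) -> None:
        """Charge the contention-free part to latency, the remainder to contention."""
        elapsed = receive_time - msg.send_time
        if elapsed < msg.contention_free_time:
            raise ValueError("message arrived faster than its contention-free time")
        self.latency_overhead += msg.contention_free_time
        self.contention_overhead += elapsed - msg.contention_free_time
```

Because the processor is assumed to block on every load or store, the full elapsed time of each message is charged to it; a model with prefetching would charge only the part that is not overlapped with computation.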
SPASM also provides statistical information about the network.
It gives the utilization of each link in the network and the average
queue lengths of messages at any particular link. This information
can be useful in identifying network bottlenecks and comparing
relative merits of different networks and their capabilities.
It is often useful to have the above metrics for different modes
of execution of the algorithm. Such a breakup would help identify
bottlenecks in the program, and also help estimate the potential gain
in performance that may be possible through a specific hardware or
software enhancement. SPASM provides statistics grouped together
for system-defined as well as for user-defined modes of execution.
The system-defined modes are:
• NORMAL: A program is in the NORMAL mode if it is not
in any of the other modes. An application programmer may
further define sub-modes if necessary.
• BARRIER: Mode corresponding to a barrier synchronization
operation.
• MUTEX: Even though the simulated hardware provides only a
test&set operation, a mutual exclusion lock (implemented using
test-test&set [3]) is available as a library function in SPASM.
A program enters this mode during lock operations. With this
mechanism, we can separate the overheads due to the
synchronization operations from the rest of the program execution.
• PGM_SYNC: Parallel programs may use Signal-Wait
semantics for pairwise synchronization. A lock is unnecessary for
the Signal variable since only one processor writes into it and
the other reads from it. This mode is used to differentiate such
accesses from normal load/store accesses.
The total time for a given application is the sum of the execution
times for each of the above defined modes. The execution time for
each program mode is the sum of the computation time, the latency
overhead and the contention overhead observed in the mode. The
metrics identified by SPASM quantify the algorithmic overhead and
the interesting components of the interaction overhead.
Computation time in the NORMAL mode is the actual time spent in local
computation in an application. The sum of latency and contention
overheads in the NORMAL mode is the actual time incurred for
ordinary data accesses. For the BARRIER and PGM_SYNC modes,
the computation time is the wait time incurred by a processor in
synchronizing with other processors that results from the
algorithmic work imbalance. The computation time in the MUTEX mode
is the time spent in waiting for a lock and represents the serial part
in an application arising due to critical sections. For the BARRIER
and MUTEX modes, the computation time also includes the cost
of implementing the synchronization primitive and other residual
effects due to latency and contention for prior accesses. In all
three synchronization modes, the latency and contention overheads
together represent the actual time incurred in accessing
synchronization variables.
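The mode-wise accounting can be summarized in a few lines. The sketch below is illustrative only: the `ModeTime` record and the sample numbers in the usage note are made up, while the four mode names and the two summation rules follow the text.

```python
from dataclasses import dataclass
from typing import Dict

MODES = ("NORMAL", "BARRIER", "MUTEX", "PGM_SYNC")


@dataclass
class ModeTime:
    computation: float = 0.0  # local work, or wait time in the synchronization modes
    latency: float = 0.0      # latency overhead observed in this mode
    contention: float = 0.0   # contention overhead observed in this mode

    @property
    def execution_time(self) -> float:
        """Execution time of a mode: computation + latency + contention."""
        return self.computation + self.latency + self.contention


def total_execution_time(per_mode: Dict[str, ModeTime]) -> float:
    """Total time of an application: the sum of the execution times of all modes."""
    return sum(per_mode[m].execution_time for m in MODES if m in per_mode)
```

A caller might build `{"NORMAL": ModeTime(100.0, 5.0, 2.0), "MUTEX": ModeTime(30.0, 8.0, 10.0)}` and compare the two `execution_time` values to see which mode dominates.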
3 Application Characteristics
Three of the applications (EP, IS and CG) are from the NAS parallel
benchmark suite [4]; CHOLESKY is from the SPLASH benchmark
suite [19]; and FFT is the well-known Fast Fourier Transform
algorithm. EP and FFT are well-structured applications with regular
communication patterns determinable at compile-time, with the
difference that EP has a higher computation to communication
ratio. IS also has a regular communication pattern, but in addition
it uses locks for mutual exclusion during the execution. CG and
CHOLESKY are different from the other applications in that their
communication patterns are not regular (both use sparse matrices)
and cannot be determined at compile time. While a certain number
of rows of the matrix in CG is assigned to a processor at compile
time (static scheduling), CHOLESKY uses a dynamically
maintained queue of runnable tasks. The reader is referred to [22] for
further details of the applications.
4 Architectural Characteristics
Since uniprocessor architecture is getting standardized with the
advent of RISC technology, we ﬁx most of the processor charac-
teristics by using a 33 MHz SPARC chip as the baseline for each
processor in a parallel system. Such an assumption enables us
to make a fair comparison of the relative merits of the interest-
ing parallel architectural characteristics across different platforms.
Input-output characteristics are beyond the purview of this study.
We use three shared memory platforms with different
interconnection topologies: the fully connected network, the binary
hypercube and the 2-D mesh. All three networks use serial (1-bit
wide) unidirectional links with a link bandwidth of 20 MBytes/sec.
The fully connected network models two links (one in each
direction) between every pair of processors in the system. The cube
platform connects the processors in a bidirectional binary
hypercube topology and uses the e-cube algorithm for routing. The 2-D
mesh resembles the Intel Touchstone Delta system. Links in the
North, South, East and West directions enable a processor in the
middle of the mesh to communicate with its four immediate
neighbors. Processors at corners and along an edge have only two and
three neighbors respectively. An equal number of rows and columns
is assumed when the number of processors is an even power of 2.
Otherwise, the number of columns is twice the number of rows (we
restrict the number of processors to a power of 2 in this study).
Messages in the mesh are routed along the row until they reach
the destination column, upon which they are routed along the
column. Messages on all three platforms are circuit-switched using a
wormhole routing strategy and the switching delay is assumed to
be negligible.
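The mesh shape rule and the two routing disciplines can be sketched as below. Several details are assumptions made for illustration and are not stated in the paper: processors are numbered in row-major order on the mesh, and the e-cube route corrects differing address bits from the lowest dimension to the highest.

```python
from typing import List, Tuple


def mesh_dims(p: int) -> Tuple[int, int]:
    """Return (rows, cols) for p processors, where p is a power of 2."""
    if p < 1 or p & (p - 1):
        raise ValueError("p must be a power of 2")
    k = p.bit_length() - 1  # p == 2**k
    rows = 1 << (k // 2)    # square if k is even, otherwise cols == 2 * rows
    return rows, p // rows


def mesh_route(src: int, dst: int, p: int) -> List[int]:
    """Nodes visited when routing along the row first, then along the column."""
    _, cols = mesh_dims(p)
    r, c = divmod(src, cols)  # row-major numbering (assumed)
    dr, dc = divmod(dst, cols)
    path = [src]
    while c != dc:            # move along the row to the destination column
        c += 1 if dc > c else -1
        path.append(r * cols + c)
    while r != dr:            # then move along the column
        r += 1 if dr > r else -1
        path.append(r * cols + c)
    return path


def ecube_route(src: int, dst: int) -> List[int]:
    """Hypercube route that fixes differing bits in increasing dimension order."""
    path, cur, diff, bit = [src], src, src ^ dst, 0
    while diff:
        if diff & 1:
            cur ^= 1 << bit
            path.append(cur)
        diff >>= 1
        bit += 1
    return path
```

The number of links a message crosses is `len(path) - 1` in both cases; on the hypercube this equals the Hamming distance between the two node addresses.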
The simulated shared memory hierarchy is CC-NUMA (Cache
Coherent Non-Uniform Memory Access). Each node in the
system has a sufficiently large piece of the globally shared memory
such that for the applications considered, the data-set assigned to
each processor fits entirely in its portion of shared memory. There
is also a 2-way set-associative private cache (64 KBytes with 32-byte
blocks) at each node that is maintained sequentially consistent
using an invalidation-based, fully-mapped, directory-based cache
coherence scheme. The memory access time is assumed to be 5 CPU
cycles, while the cache access time is assumed to be 1 CPU cycle.
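The stated cache parameters fix the cache geometry, which the following sketch works out. The modulo set-indexing scheme is a conventional assumption, not something the paper specifies, and the names are illustrative.

```python
from typing import Tuple

CACHE_BYTES = 64 * 1024  # 64 KBytes per node
BLOCK_BYTES = 32         # 32-byte blocks
WAYS = 2                 # 2-way set-associative

NUM_BLOCKS = CACHE_BYTES // BLOCK_BYTES  # 2048 blocks
NUM_SETS = NUM_BLOCKS // WAYS            # 1024 sets


def split_address(addr: int) -> Tuple[int, int, int]:
    """Return (tag, set index, block offset) under modulo set indexing (assumed)."""
    offset = addr % BLOCK_BYTES
    block = addr // BLOCK_BYTES
    return block // NUM_SETS, block % NUM_SETS, offset
```

Two addresses that are a multiple of 32 KBytes apart map to the same set, so at most two such blocks can reside in the cache at once.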
5 Performance Results
In this section, we present results from our simulation experiments
showing the growth of the overhead functions with respect to the
number of processors and their impact on scalability. The simulator
allows one to explore the effect of varying other system parameters
such as link bandwidth and processor speed on scalability. Since
the main focus of this paper is an approach to scalability study, we
have not dwelled on the scalability of parallel systems with respect
to specific architectural artifacts to any great extent in this paper.
We also briefly describe the impact of problem sizes on the system
scalability for each kernel.
Figures 2, 3, 4, 5 and 6 show the “ideal” speedup curves (Section
2) for the kernels EP, IS, FFT, CG and CHOLESKY, as well as the
speedup curves for these kernels on the three hardware platforms.
There is negligible deviation from the ideal curve for the EP kernel
on the three hardware platforms; a marginal difference for FFT and
CG; and a significant deviation for IS and CHOLESKY. For each
of these kernels, we quantify the different interaction overheads
responsible for the deviation during each execution mode of the
kernel. Only the results for IS, FFT and CHOLESKY are discussed
in this section due to space constraints. Details on the other kernels
can be found in [22].
Figure 2: EP: Speedup [plot of Speedup vs. Processors; curves: Linear, Ideal, Real(Full), Real(Cube), Real(Mesh)]
Figure 3: IS: Speedup [plot of Speedup vs. Processors; curves: Linear, Ideal, Real(Full), Real(Cube), Real(Mesh)]
Figure 4: FFT: Speedup [plot of Speedup vs. Processors; curves: Linear, Ideal, Real(Full), Real(Cube), Real(Mesh)]
Figure 5: CG: Speedup [plot of Speedup vs. Processors; curves: Linear, Ideal, Real(Full), Real(Cube), Real(Mesh)]
Figure 6: CHOLESKY: Speedup [plot of Speedup vs. Processors; curves: Linear, Ideal, Real(Full), Real(Cube), Real(Mesh)]
In the following subsections, we show for each kernel the
execution time, the latency, and the contention overhead graphs for the
mesh platform. The first shows the total execution time, while the
latter two show the communication overheads ignoring the
computation time. In each of these graphs, we show the curves for
the individual modes of execution applicable for a particular
kernel. We also present for each kernel the latency and contention
overhead curves on the three architecture platforms. The latency
overhead in the NORMAL mode (i.e. due to ordinary data access)
is determined by the memory reference pattern of the kernel and
the network traffic due to cache line replacement. With a sufficiently
large cache at each node, it is reasonable to assume that this
latency overhead is only due to the kernel, and thus is expected to
be independent of the network topology. Due to the vagaries of
the synchronization accesses, it is conceivable that the
corresponding latency overheads could differ across network platforms for the
other modes. However, in our experiments we have not seen any
significant deviation. As a result, the latency overhead curves for
all the kernels look alike across network platforms. On the other
hand, it is to be expected that the contention overhead will increase
as the connectivity in the network decreases. This is also confirmed
for all the kernels.
5.1 IS
For this kernel, there is a significant deviation from the ideal curve
for all three platforms (see Figure 3). The overheads may be
analyzed by considering the different modes of execution. In this
kernel, NORMAL and MUTEX are the only significant modes of
execution (see Figure 7). The network accesses in the NORMAL
mode are for ordinary data transfer, and the accesses in MUTEX
are for synchronization. The latency and contention overheads
incurred in the MUTEX mode are higher than in the NORMAL mode
(see Figures 8 and 9). As a result of this, the total execution time in
the MUTEX mode surpasses that in the NORMAL mode beyond
a certain number of processors (see Figure 7), which also explains
the dip in the speedup curve for mesh (see Figure 3).
Figure 7: IS: Mode-wise Execn. Time (Mesh) [plot of Time (in millisecs) vs. Processors; curves: NORMAL, BARRIER, MUTEX, PGM_SYNC]
Figure 8: IS: Mode-wise Latency (Mesh) [plot of Time (in millisecs) vs. Processors; curves: NORMAL, BARRIER, MUTEX, PGM_SYNC]
Figure 9: IS: Mode-wise Contention (Mesh) [plot of Time (in millisecs) vs. Processors; curves: NORMAL, BARRIER, MUTEX, PGM_SYNC]
Figure 10: IS: Latency and Contention (Full) [plot of Time (in millisecs) vs. Processors; curves: Latency, Contention]
Figure 11: IS: Latency and Contention (Cube) [plot of Time (in millisecs) vs. Processors; curves: Latency, Contention]
Figure 12: IS: Latency and Contention (Mesh) [plot of Time (in millisecs) vs. Processors; curves: Latency, Contention]
Figures 10, 11 and 12 show the latency and contention overheads
for the three hardware platforms. In IS, since every processor needs
to access the data of all other processors, and since the data is equally
partitioned among the executing processors, the number of accesses
to remote locations grows as $(p-1)/p$. This explains the flattening
of the latency overhead curve for all three network platforms as $p$
increases. On the mesh network the contention overhead surpasses
the latency overhead at around 18 processors. Table 1 summarizes
the overheads for IS obtained by interpolating the datapoints from
our simulation results.
IS                 Full               Cube               Mesh
Comp. Time (ms)    $129.3/p^{0.7}$    $129.3/p^{0.7}$    $129.3/p^{0.7}$
Latency (ms)       $13.2(1 - 1/p)$    $13.2(1 - 1/p)$    $13.2(1 - 1/p)$
Contention (ms)    Negligible         $4.0 \log p$       $0.9p$
Table 1: IS: Overhead Functions
Parallelization of this kernel increases the amount of work to be
done for a given problem size (see [22]). This inherent
algorithmic overhead causes a deviation of the ideal curve from the linear
curve (see Figure 3). This is also confirmed in Table 1, where
the computation time does not decrease linearly with the number
of processors. This indicates the kernel is not scalable for small
problem sizes. As can be seen from Table 1, the contention
overhead is negligible and the latency overhead converges to a constant
with a sufficiently large number of processors on a fully connected
network. Thus for a fully connected network, the scalability of this
kernel is expected to closely follow the ideal curve. For the cube
and mesh platforms, the contention overhead grows logarithmically
and linearly with the number of processors, respectively. Therefore,
the scalability of IS on these two platforms is likely to be worse
than for the fully connected network. From the above observations,
we can conclude that IS is not very scalable for the chosen problem
size on the three hardware platforms. However, if the problem is
scaled up, the coefficient associated with the computation time will
increase thus making IS more scalable.
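The fitted functions in Table 1 can be evaluated directly to compare the three networks at a given processor count. The sketch below makes two assumptions that the paper does not state: the logarithm in the cube contention term is taken to be base 2, and "Negligible" is treated as zero. The function name is illustrative.

```python
import math
from typing import Dict


def is_overheads_ms(p: int, network: str) -> Dict[str, float]:
    """Evaluate the fitted IS overhead functions of Table 1, in milliseconds."""
    if p < 1:
        raise ValueError("p must be at least 1")
    if network == "full":
        contention = 0.0                 # reported as negligible (treated as 0)
    elif network == "cube":
        contention = 4.0 * math.log2(p)  # log base 2 is an assumption
    elif network == "mesh":
        contention = 0.9 * p
    else:
        raise ValueError("network must be 'full', 'cube' or 'mesh'")
    return {
        "computation": 129.3 / p ** 0.7,
        "latency": 13.2 * (1 - 1 / p),
        "contention": contention,
    }
```

Since these functions are interpolations of the simulated data points, values or crossover points computed from them are approximate and need not coincide exactly with numbers read off the plots.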
5.2 FFT
The algorithmic and interaction overheads for the FFT kernel are
marginal. Thus the real execution curves for all three platforms
as well as the ideal curve are close to the linear one as shown in
Figure 4. The execution time is dominated by the NORMAL mode
(Figure 13). The latency and contention overheads (Figures 14 and
15) incurred in this mode are insignificant compared to the total
execution time, despite the growth of contention overhead with
increasing number of processors.
Figure 13: FFT: Mode-wise Execn. Time (Mesh) [plot of Time (in millisecs) vs. Processors; curves: NORMAL, BARRIER, MUTEX, PGM_SYNC]
Figure 14: FFT: Mode-wise Latency (Mesh) [plot of Time (in millisecs) vs. Processors; curves: NORMAL, BARRIER, MUTEX, PGM_SYNC]
Figure 15: FFT: Mode-wise Contention (Mesh) [plot of Time (in millisecs) vs. Processors; curves: NORMAL, BARRIER, MUTEX, PGM_SYNC]
The communication in FFT has been optimized as suggested in
[8] into a single phase where every processor accesses the data of
all the other processors in a skewed manner. The number of such
non-local accesses incurred by a processor grows as $O((p-1)/p^2)$
with the number of processors $p$, and the latency overhead curves for
all three networks reflect this behavior. As a result of skewing the
communication among the processors, the contention is negligible
on the full (Figure 16) and the cube (Figure 17) platforms. On the
mesh (Figure 18), the contention surpasses the latency overhead at
around 28 processors. Table 2 summarizes the overheads for FFT
obtained by interpolating the data points from our simulation results.
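The skewed access pattern and the resulting non-local access count can be illustrated with a minimal sketch. The text does not spell out the exact skew, so the rule used here (at step k, processor i reads from processor (i + k) mod p) is an assumption, as is the premise that each processor needs an N/p^2 slice from every other processor. The function names are hypothetical.

```python
def skewed_schedule(p):
    """For each step k = 1..p-1, list the processor that each processor reads from.

    Assumed skew (not taken from the paper): processor i reads from (i + k) mod p.
    """
    return [[(i + k) % p for i in range(p)] for k in range(1, p)]


def nonlocal_accesses(n, p):
    """Non-local elements fetched by one processor, N(p-1)/p^2.

    Assumes each processor needs an N/p^2 slice from each of the other p-1.
    """
    return n * (p - 1) / p**2


if __name__ == "__main__":
    p = 8
    for targets in skewed_schedule(p):
        # Each step is a permutation, so no two processors hit the same node.
        assert sorted(targets) == list(range(p))

    n = 64 * 1024  # the 64K problem size used in the study
    for procs in (2, 4, 8, 16, 32):
        print(procs, nonlocal_accesses(n, procs))
```

Under these assumptions every step is a permutation of the processors, which is why the end points see no hot spots, and the per-processor non-local count shrinks as processors are added.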
Figure 16: FFT: Latency and Contention (Full)
[Plot: Time (in millisecs) vs. Processors; curves: Latency, Contention]
Figure 17: FFT: Latency and Contention (Cube)
[Plot: Time (in millisecs) vs. Processors; curves: Latency, Contention]
Figure 18: FFT: Latency and Contention (Mesh)
[Plot: Time (in millisecs) vs. Processors; curves: Latency, Contention]
With marginal algorithmic overheads and decreasing number of
messages exchanged per processor (latency overhead), the contention
overhead is the only artifact that can cause deviation from
linear behavior. But with skewed communication accesses, the contention
overhead has also been minimized and begins to show only
on the mesh network where it grows linearly (see Table 2). Thus we
can conclude that the FFT kernel is scalable for the fully-connected
and cube platforms. For the mesh platform, it would take 200 processors
before the contention overhead starts dominating for the
64K problem size. With increase in problem size ($N$), the local
computation that performs a radix-2 Butterfly is expected to grow
as $O((N/p)\log(N/p))$ while the communication for a processor
is expected to grow as $O(N(p-1)/p^2)$. Hence, increase in data
size will increase its scalability on all hardware platforms.

FFT               Full             Cube             Mesh
Comp. Time (s)    $2.5/p$          $2.5/p$          $2.5/p$
Latency (ms)      $49.9/p^{0.9}$   $49.9/p^{0.9}$   $49.9/p^{0.9}$
Contention (us)   Negligible       Small            $63.5p$
Table 2: FFT: Overhead Functions
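The 200-processor estimate for the mesh can be checked from the fitted functions in Table 2. The sketch below is illustrative only. It converts the three mesh entries to microseconds, finds the crossover points by bisection, and evaluates the ratio of the two growth terms for different problem sizes. The helper names, the bisection bounds, and the use of a base-2 logarithm are assumptions made for this example.

```python
import math


def comp_time_us(p):
    return 2.5e6 / p            # Table 2: 2.5/p seconds


def latency_us(p):
    return 49.9e3 / p**0.9      # Table 2: 49.9/p^0.9 milliseconds


def mesh_contention_us(p):
    return 63.5 * p             # Table 2: 63.5p microseconds (mesh)


def crossover(decreasing, increasing, lo=1.0, hi=1e6):
    """Bisect for the p at which `increasing` overtakes `decreasing`."""
    for _ in range(200):
        mid = (lo + hi) / 2
        if increasing(mid) < decreasing(mid):
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def comm_to_comp_ratio(n, p):
    """Ratio of N(p-1)/p^2 to (N/p) log2(N/p), constants ignored."""
    return (n * (p - 1) / p**2) / ((n / p) * math.log2(n / p))


if __name__ == "__main__":
    print(crossover(comp_time_us, mesh_contention_us))   # contention vs. computation
    print(crossover(latency_us, mesh_contention_us))     # contention vs. latency
    for n in (2**16, 2**18, 2**20):
        print(n, comm_to_comp_ratio(n, 32))
```

With these fits, contention overtakes computation at a little under 200 processors, in line with the estimate in the text. The contention and latency fits cross in the low thirties, in the same region as the value of around 28 read from Figure 18; the two need not match exactly because the functions are interpolated. The ratio falls as N grows, which is the sense in which a larger data size improves scalability.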
5.3 CHOLESKY
The algorithmic overheads for CHOLESKY cause a significant deviation
from linear behavior for the ideal curve as shown in Figure
6. An examination of the execution times (Figure 19) shows that the
bulk of the time is spent in the NORMAL mode which performs the
actual factorization. The communication overheads in the NORMAL
mode for the data accesses of the sparse matrix outweigh the
accesses for synchronization variables (Figures 20 and 21). Thus
the time spent in the MUTEX mode (which represents dynamic
scheduling and accesses to critical sections) is insignificant compared
to the NORMAL mode. Although the contention overhead
in the NORMAL mode increases quite rapidly with the number of
processors, the overall impact of communication on the execution
time is insignificant (see Figure 19).
As with FFT, the number of non-local memory accesses made
by a processor decreases with increasing number of processors,
explaining a decreasing latency overhead. The contention overhead
is negligible for the fully-connected network (Figure 22),
grows with increasing processors for the cube (Figure 23), and
becomes more dominant than the latency overhead for the mesh (Figure
24) at around 20 processors. Table 3 summarizes the overheads
for CHOLESKY obtained by interpolating the data points from our
simulation results.
Figure 19: CHOLESKY: Mode-wise Execn. Time (Mesh)
[Plot: Time (in millisecs) vs. Processors; curves: NORMAL, BARRIER, MUTEX, PGM_SYNC]
Figure 20: CHOLESKY: Mode-wise Latency (Mesh)
[Plot: Time (in millisecs) vs. Processors; curves: NORMAL, BARRIER, MUTEX, PGM_SYNC]
Figure 21: CHOLESKY: Mode-wise Contention (Mesh)
[Plot: Time (in millisecs) vs. Processors; curves: NORMAL, BARRIER, MUTEX, PGM_SYNC]
Figure 22: CHOLESKY: Latency and Contention (Full)
[Plot: Time (in millisecs) vs. Processors; curves: Latency, Contention]
Figure 23: CHOLESKY: Latency and Contention (Cube)
[Plot: Time (in millisecs) vs. Processors; curves: Latency, Contention]
Figure 24: CHOLESKY: Latency and Contention (Mesh)
[Plot: Time (in millisecs) vs. Processors; curves: Latency, Contention]
CHOLESKY          Full             Cube             Mesh
Comp. Time (s)    $3.9/p^{0.8}$    $3.9/p^{0.8}$    $3.9/p^{0.8}$
Latency (s)       $1.2/p^{0.9}$    $1.2/p^{0.9}$    $1.2/p^{0.9}$
Contention (ms)   Negligible       Constant         $39.9\log p$
Table 3: CHOLESKY: Overhead Functions
The deviation of the ideal from the linear curve (Figure 6) indicates
that the kernel is not very scalable for the chosen problem size
due to the inherent algorithmic overhead, as in IS. As can be observed
from Table 3, the latency decreases with increasing number
of processors and the scalability of the real execution would thus be
dictated by the contention overhead. The contention on the fully-connected
and cube networks is negligible, thus projecting speedup
curves that closely follow the ideal speedup curve for these platforms.
On the other hand, the contention grows logarithmically on
the mesh, making this platform less scalable. With increasing problem
sizes, the coefficient associated with the computation time in the
above table is likely to grow faster than the coefficients associated
with the communication overheads (verified by experimentation).
Hence, an increase in problem size would enhance the scalability
of this kernel on all hardware platforms.
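The effect of the algorithmic overhead on speedup can be sketched from the exponents of the fitted computation times. This is an illustration only. It assumes that the "Comp. Time" rows of Tables 2 and 3 model the algorithm-only (ideal) execution time and that the fits also hold at p = 1; neither is stated in the text. The function names are hypothetical.

```python
def ideal_speedup(p, exponent):
    """Speedup T(1)/T(p) when time is modeled as c / p**exponent."""
    return p**exponent


def efficiency(p, exponent):
    """Fraction of linear speedup retained on p processors."""
    return ideal_speedup(p, exponent) / p


if __name__ == "__main__":
    # Exponents taken from the Comp. Time rows: FFT 2.5/p, CHOLESKY 3.9/p^0.8.
    for name, exponent in (("FFT", 1.0), ("CHOLESKY", 0.8)):
        for p in (4, 16, 32):
            print(name, p, ideal_speedup(p, exponent), efficiency(p, exponent))
```

Under these assumptions the FFT fit gives linear speedup, while the CHOLESKY fit gives about 16 on 32 processors, which is the kind of deviation of the ideal curve from the linear one that the text attributes to algorithmic overhead.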
6 Concluding Remarks
We used an execution-driven simulation platform to study the scalability
characteristics of EP, IS, FFT, CG, and CHOLESKY on three
shared memory platforms with fully-connected, cube, and mesh
interconnection networks, respectively. The simulator allows
for the separation of the algorithmic and interaction overheads in a
parallel system. Separating the overheads provided us with some
key insights into the algorithmic characteristics and architectural
features that limit the scalability for these parallel systems. Algorithmic
overheads such as the additional work incurred in parallelization
could be a limiting factor for scalability as observed in IS
and CHOLESKY. In shared memory machines with private caches,
as long as the applications are well-structured to exploit locality, the
key determinant to scalability is network contention. This is particularly
true for most commercial shared memory multiprocessors
which have sufficiently large caches.
We have illustrated the usefulness as well as the feasibility of
our top-down approach for understanding the scalability of parallel
systems. This approach can be used to study the impact of other
system parameters (such as link bandwidth and processor speed) on
scalability and provide guidelines for application design as well as
evaluate architectural design decisions.
References
[1] A. Agarwal. Limits on Interconnection Network Performance.
IEEE Transactions on Parallel and Distributed Systems,
2(4):398–412, October 1991.
[2] G. M. Amdahl. Validity of the Single Processor Approach to
achieving Large Scale Computing Capabilities. In Proceedings
of the AFIPS Spring Joint Computer Conference, pages
483–485, April 1967.
[3] T. E. Anderson. The Performance of Spin Lock Alternatives
for Shared-Memory Multiprocessors. IEEE Transactions on
Parallel and Distributed Systems, 1(1):6–16, January 1990.
[4] D. Bailey et al. The NAS Parallel Benchmarks. International
Journal of Supercomputer Applications, 5(3):63–73, 1991.
[5] E. A. Brewer, C. N. Dellarocas, A. Colbrook, and W. E. Weihl.
PROTEUS: A high-performance parallel-architecture simulator.
Technical Report MIT-LCS-TR-516, Massachusetts
Institute of Technology, Cambridge, MA 02139, September
1991.
[6] D. Chen, H. Su, and P. Yew. The Impact of Synchronization
and Granularity on Parallel Systems. In Proceedings of the
17th Annual International Symposium on Computer Architecture,
pages 239–248, 1990.
[7] R. G. Covington, S. Madala, V. Mehta, J. R. Jump, and J. B.
Sinclair. The Rice parallel processing testbed. In Proceedings
of the ACM SIGMETRICS 1988 Conference on Measurement
and Modeling of Computer Systems, pages 4–11, Santa Fe,
NM, May 1988.
[8] D. Culler et al. LogP: Towards a realistic model of parallel
computation. In Proceedings of the 4th ACM SIGPLAN Symposium
on Principles and Practice of Parallel Programming,
pages 1–12, May 1993.
[9] R. Cypher, A. Ho, S. Konstantinidou, and P. Messina. Architectural
requirements of parallel scientific applications with
explicit communication. In Proceedings of the 20th Annual
International Symposium on Computer Architecture, pages
2–13, May 1993.
[10] S. J. Eggers and R. H. Katz. The Effect of Sharing on the Cache
and Bus Performance of Parallel Programs. In Proceedings of
the Third International Conference on Architectural Support
for Programming Languages and Operating Systems, pages
257–270, Boston, Massachusetts, April 1989.
[11] J. L. Gustafson, G. R. Montry, and R. E. Benner. Development
of Parallel Methods for a 1024-node Hypercube. SIAM Journal
on Scientific and Statistical Computing, 9(4):609–638,
1988.
[12] A. H. Karp and H. P. Flatt. Measuring Parallel processor
Performance. Communications of the ACM, 33(5):539–543,
May 1990.
[13] V. Kumar and V. N. Rao. Parallel Depth-First Search. International
Journal of Parallel Programming, 16(6):501–519,
1987.
[14] F. H. McMahon. The Livermore Fortran Kernels: A Computer
Test of the Numerical Performance Range. Technical
Report UCRL-53745, Lawrence Livermore National Laboratory,
Livermore, CA, December 1986.
[15] J. M. Mellor-Crummey and M. L. Scott. Algorithms for
Scalable Synchronization on Shared-Memory Multiprocessors.
ACM Transactions on Computer Systems, 9(1):21–65,
February 1991.
[16] G. F. Pfister and V. A. Norton. Hot Spot Contention and
Combining in Multistage Interconnection Networks. IEEE
Transactions on Computer Systems, C-34(10):943–948, October
1985.
[17] S. K. Reinhardt et al. The Wisconsin Wind Tunnel: Virtual
prototyping of parallel computers. In Proceedings of the ACM
SIGMETRICS 1993 Conference on Measurement and Modeling
of Computer Systems, pages 48–60, Santa Clara, CA, May
1993.
[18] E. Rothberg, J. P. Singh, and A. Gupta. Working sets, cache
sizes and node granularity issues for large-scale multiprocessors.
In Proceedings of the 20th Annual International Symposium
on Computer Architecture, pages 14–25, May 1993.
[19] Jaswinder Pal Singh, Wolf-Dietrich Weber, and Anoop Gupta.
SPLASH: Stanford Parallel Applications for Shared-Memory.
Technical Report CSL-TR-91-469, Computer Systems Laboratory,
Stanford University, 1991.
[20] A. Sivasubramaniam, U. Ramachandran, and H. Venkateswaran.
Message-Passing: Computational Model, Programming
Paradigm, and Experimental Studies. Technical Report
GIT-CC-91/11, College of Computing, Georgia Institute of
Technology, February 1991.
[21] A. Sivasubramaniam, G. Shah, J. Lee, U. Ramachandran,
and H. Venkateswaran. Experimental Evaluation of Algorithmic
Performance on Two Shared Memory Multiprocessors.
In Norihisa Suzuki, editor, Shared Memory Multiprocessing,
pages 81–107. MIT Press, 1992.
[22] A. Sivasubramaniam, A. Singla, U. Ramachandran, and
H. Venkateswaran. An Approach to Scalability Study of
Shared Memory Parallel Systems. Technical Report GIT-CC-93/62,
College of Computing, Georgia Institute of Technology,
October 1993.
[23] A. Sivasubramaniam, A. Singla, U. Ramachandran, and
H. Venkateswaran. Machine Abstractions and Locality Issues
in Studying Parallel Systems. Technical Report GIT-CC-93/63,
College of Computing, Georgia Institute of Technology,
October 1993.
[24] A. Sivasubramaniam, A. Singla, U. Ramachandran, and
H. Venkateswaran. A Simulation-based Scalability Study of
Parallel Systems. Technical Report GIT-CC-93/27, College
of Computing, Georgia Institute of Technology, April 1993.
[25] X-H. Sun and J. L. Gustafson. Towards a better Parallel Performance
Metric. Parallel Computing, 17:1093–1109, 1991.
[26] J. C. Wyllie. The Complexity of Parallel Computations. PhD
thesis, Department of Computer Science, Cornell University,
1979.