Performance prediction: A case study using a multi-ring KSR-1 machine by Sun, Xian-He & Zhu, Jianping
NASA Contractor Report 195066
ICASE Report No. 95-24
S
PERFORMANCE PREDICTION: A CASE STUDY USING
A MULTI-RING KSR-1 MACHINE
Xian-He Sun
Jianping Zhu
(NASA-CR-195066) PERFORMANCE
PREDICIION: A CASE STUOY USING
MULTI-RING KSR-I MACHINE Final
Report (ICASE) 22 p
N95-26697
Unclas
_/60 O049050
Contract No. NAS 1-19480
April 1995
Institute for Computer Applications in Science and Engineering
NASA Langley Research Center
Hampton, VA 23681-0001
Operated by Universities Space Research Association
https://ntrs.nasa.gov/search.jsp?R=19950020277 2020-06-16T07:11:35+00:00Z

Performance Prediction:
A case study using a multi-ring KSR-1 machine*
Xian-He Sun Jianping Zhu
Department of Computer Science
Louisiana State University
Baton Rouge, LA 70803-4020
A_F Engineering Research Center
Dept. of Math. and Stat.
Mississippi State University
Mississippi State, MS 39762
Abstract
As computers with tens of thousands of processors are successfully delivering high
performance power for solving some of the so-called "grand-challenge" applications,
the notion of scalability is becoming an important metric in the evaluation of paral-
lel machine architectures and algorithms. In this study, the prediction of scalability
and its appfication are carefully investigated. A simple formula is presented to show
the relation between scalahility, single processor computing power, and degradation of
parallelism. A case study is conducted on a multi-ring KSR-1 shared virtual memory
machine. Experimental and theoretical results show that the influence of topology vari-
ation of an architecture is predictable. Therefore, the performance of an algorithm on a
sophisticated, hierarchical architecture can be predicted and the best algorithm-machine
combination can be selected for a given application.
*This research was supported in part by the National Aeronautics and Space Administration under NASA con-
tract NAS1-19480 while the first author was in residence at, the Institute for Computer Applications in Science and
Engineering (ICASE), NASA Langley Research Center, Hampton, VA 23681-0001

1 Introduction
With modern technology, parallel processing seems to be the only way to achieve higher perfor-
mance. In recent years, various architectures have been proposed to connect a large number of
processors into a single powerful machine; and various algorithms have been developed on these
machines to explore the potential of high computation power. However, each architecture has
distinct properties, and each algorithm has its own inherent data structures. The performance of
an algorithm on a particular architecture may vary significantly as the system and problem sizes
increase. Predicting the performance of an algorithm-machine combination is difficult and elusive.
There are two commonly used synchronization and communication models: message-passing and
shared-memory. Processes communicate through explicit message passing in the message-passing
model and through shared variables in the shared-memory model. Traditionally, message-passing
is the natural choice of distributed-memory machines. With shared virtual address space, shared
virtual memory can be supported on distributed-memory machines, but requires sophisticated
hardware and system support. Shared virtual memory machines combine the merits of both the
distributed-memory machines and the shared-memory communication model. They are scalable
and provide sequential-like programming environment. However, performance prediction of shared
virtual memory machines is more difficult than that of traditional message-passing machines, be-
cause their communication is implicit and memory access time is non-uniform.
Simply speaking, a scalable architecture is an architecture capable of yielding very high raw
computation power when the system size is large. However the high computation power may not
be realized in solving a given application, since the achievable efficiency of an application may
drop quickly with the increase of system size. To evaluate the ability of maintaining performance,
several metrics have been proposed to measure the scalability of algorithm-machine combinations
[2, 3, 7, 8, 11]. Isospeed scalability [8] is one of the proposed metrics. It measures the ability
of an algorithm-machine combination to maintain unit processor speed. Through a case stud), in
this paper, we investigate issues of performance prediction of shared virtual memory machines.
Performance models are developed in terms of execution time and scalability. Experimental results
on a 64-node Kendall Square KSR-1 show that, when performance information of small scale
systems is available, the performance of large scale systems can be predicted. Thus, machine
architectures and algorithms can be compared in terms of scalability without run-time information.
Since a 64- node KSR- 1 is a shared virtual memory machine with variable memory access times, the
experience learned in this study is reasonably general and should extend to a class of applications.
The paper is organized as follows. In section 2, we first review isospeed scalability, then the
properties of isospeed scalability are discussed. Performance formulas also developed to show the
relations between execution time and scalability and to show possible approaches of predicting
scalability. In Section 3, the regularized least squares application and the KSR-1 architecture are
introduced. Theoreticalanalysisis givento find the performanceboundof the applicationand
to developthe performancemodelof the algorithm-machinecombination.Experimentaldetails
and results,whichmatchthe predictedperformanceclosely,aregivenin Section4. A practical
methodis introducedto measurethememoryaccessdelayandothersystemoverheadof themulti-
levelring, sharedvirtual memorymachine.Performancepredictionwithout run-timeinformation
andselectionof anappropriatealgorithm-architecturecombinationfor a givenapplicationarealso
discussedin Section4. Finally,thesummaryis givenin Section5.
2 Definition and Analysis
One of the main motivations of parallel processing is to solve large problems fast. Considering both
execution time and problem size, what we seek from parallel processing is speed which is defined
as work divided by time. In general, how work should be defined is controversial. For scientific
applications, it is commonly agreed that the floating point (flop) operation count is a good estimate
of work (problem size) 1. The average unit speed (or average speed, in short) is a good measure of
parallel processing speed.
Definition 1 The average unit speed is the achieved speed of the given computing system di-
vided by p, the number of processors.
In the ideal situation, average speed remains constant when system size increases. Hardware
peak performance provided by vendors is usually based on this ideal assumption. If problem size
is fixed, the ideal situation is unlikely to happen in practice, since when problem size is fixed,
the communication/computation ratio is likely to increase with the number of processors, and
therefore, the speed will decrease with increased system size. On the other hand, if system size is
fixed, communication/computation ratio is likely to decrease with increased problem size for most
practical algorithms. For these algorithms, increasing problem size with the system size may keep
the average speed constant. Based on this observation, the isospeed scalability has been formally
defined as the ability to maintain the average speed in [8].
Definition 2 An algoritkm-machine combination is scalable if the achieved average speed of
the algorithm on the given machine can remain constant with increasing numbers of processors,
provided the problem size can be increased with the system size.
For a large class of algorithm-machine combinations, the average speed can be maintained by
increasing problem size [8]. The necessary increase of problem size varies with algorithms, machines,
_Some authors refer to problem size as the parameter that determines the work, for instance, the order of matrices.
In this paper, problem size refers to the work to be performed and we will use problem size and work alternatively.
and their combinations.This variationprovidesa quantitativemeasurementfor scalability. Let
W be the amount of work of an algorithm when p processors are employed in a machine, and
let W r be the amount of work needed to maintain the average speed when pr > p processors are
employed. Then we define the scalability from system size p to system size p_ of the algorithm-
machine combination as follows:
7"(P, P') - p'W ( 1)
pl4,'l
The work I¥ I is determined by the isospeed constraint. When 14''_ = P'W, that is when average
P
speed is maintained with work per processor unchanged, the scalability equals one. It is the ideal
case. In general, work per processor may have to be increased to achieve the fixed average speed,
and scalability is less than one.
Speedup is a widely used performance metric in parallel processing. It is defined as sequential
execution time over parallel execution time and is used to measure the parallel processing gain
over sequential processing. Traditionally, parallel efficiency is defined as speedup divided by p,
where p, the number of processors, is the ideal speedup. The traditional parallel efficiency is the
efficiency in terms of speedup. Contrary to speedup, average speed is an indicator of uniprocessor
efficiency, where uniprocessor efficiency is defined as average unit speed over peak uniprocessor
speed. Maintaining average speed is equivalent to maintaining the uniprocessor efficiency. Under
certain assumptions, maintaining average speed is also equivalent to maintaining the parallel ef-
ficiency [9]. However, in practice, these two approaches may lead to totally different results [9].
Unlike parallel efficiency, average speed does not inherit any deficiency of speedup. It does not
require solving large problems on a single processor and does not give credit to slow computation,
while parallel efficiency does.
Three different approaches have been proposed in [8] to obtain scalability.
1. The scalability can be measured using software by a control program that invokes the appli-
cation program and searches for the run which has the desired fixed average unit speed.
2. The scalability can be computed by first finding the relation between average unit speed and
execution time (or work) and then using equation (1) (or equation (4)).
3. The scalability can be predicted by deriving a general scalability formula.
The third approach, i.e. prediction, is the topic of this study. It is the simplest one among the
three approaches, if a formula can be defined. A prediction formula is given in [8] for applications
where communication cost is independent of problem size and work load is balanced among pro-
cessors. By the definition of scalability (1), scalability can be predicted if and only if the scaled
work size, I47', can be predicted. Proposition 1 provides a way to obtain W r.
Proposition 1 ff parallel degradation exists, then for scalability (1)
W'- ap'To
1 - a/k' (2)
where a is the fixed average speed, A is the computing rate of a single processor, and To is the
parallel processing overhead.
Proof: Since 1¥' is the scaled work satisfying the isospeed requirement,
a-
p%,(w,)
The parallel execution time, T#(W'), can be divided into two parts: ideal parallel processing time
and parallel processing overhead, To.
T1 W'A
p, +To,
where T1 is the sequential execution time and T1/p' is the ideal parallel execution time. Thus,
a -
W' A +Top"
and
[3
Note that in Equation (2), a is the achieved average speed considering the parallel processing over-
head, and A is the computing rate without considering the overhead. When parallel degradation
does exist (i.e. To > 0), /k -1 > a and, therefore, equation (2) is traceable. To > 0 is a necessary
and sufficient condition of Proposition 1.
Combining scalability (1) and equation (2), we have
g,(p,p,) _ W(1 - a/k) (3)
paT°
Equation (3) is very useful. It not only gives a way to predict scalability, but more importantly, it
shows tile following three properties of isospeed scalability.
1. Scalability (1) increases with the decrease of the fixed average speed a.
2. A, the computing rate of a single processor, is the inverse of single processor speed. Equation
(3) shows that scalability increases with single processor speed.
3. Scalability increases with the decrease of degradation of parallelism To.
Property 1 is very reasonable. Scalability is the ability of a computing system to maintain per-
formance when system size is scaled up. Property 1 shows that less effort is needed to maintain
lower efficiency, if we consider aA as the uniprocessor efficiency. Equation (3) gives the relation
between the effort (scalability) and performance (the fixed average speed) of an algorithm-machine
combination. Property 1 also shows that, by adjusting the average speed a, isospeed scalability can
be applied to a large class of algorithm-machine combinations, from massively parallel systems with
relatively weak processing elements to supercomputers with a few powerful processors. Equation
(3) also gives the relation between isospeed scalability, computing power of a single processor, and
degradation of parallelism. Properties 2 and 3 show that isospeed scalability does not give credit
to slow computing and communication. These two properties are very important in evaluation of
computing systems. They distinguish isospeed scalability from parallel metrics based on speedup.
It is known that speedup favors parallel systems with a high communication/computing ratio [9].
Although equation (3) is very useful, using it in performance prediction may not be as simple
as it looks. The degradation of parallelism, To, which contains both communication and workload
imbalance degradation, may be difficult to compute. Also, the single processor rate may vary with
algorithm and problem size, especially for shared virtual memory machines [9]. A detailed case
study is given in the next section to illustrate how the prediction formula could be used in practice,
and how the predicted scalability could be used to evaluate machine architectures.
Finally, equation (4) shows how parallel execution time could be computed from scalability.
rv,(w' ) = (4)
where Tp(W), T#(W _) are the parallel execution times of solving the problem with the work of W
and W _ on a system of p and p_ processors respectively. The computing rate of single processor,
A, is machine dependent. The degradation of parallelism, To, is both architecture and algorithm
dependent. Equation (3) gives a way to find a good algorithm-machine combination in terms of
scalability. Equation (4) shows larger scalability will lead to smaller execution time.
3 The Case Study
In this section, we discuss the case study for solving an application problem on KSR-1 parallel
computers. We first give brief descriptions of the architecture and the application problem, and
then present the measured performance and compare it with the predicted performance.
3.1 The Machine
Our case study was performed on the KSR-1 parallel computer. It has a distributed physical
memory which makes a large ensemble size possible, and a shared address space which allows users
to developprogramsin a shared-memory-likeenvironment.
Ring:1
connectingupto34Ring:0's
SearchEngine:l"_
Rring:0connecting_
upto32processors._. . .._.
Figure1. Configuration of KSR-1 parallel computers.
p: processor M: 32 Mbytes of local memory
Figure 1 shows the architecture of the KSR-1 parallel computer. Each processor on the KSR-1
has 32 Mbytes of local memory. The CPU is a super-scalar processor with a peak performance of
40 Mflops in double precision. Processors are organized into different rings. The local ring (ring:0)
can connect up to 32 processors, and a higher level ring of rings (ring:l) can contain up to 34 local
rings with a maximum of 1088 processors.
Access to non-local data on KSR is provided by a hierarchy of Search Engines. The Search
Engine SE:0 locates data in the local ring, while the Search Engine SE:I provides data access
between local rings. These different Search Engines are connected in a fat-tree-like structure. The
memory hierarchy of KSR is shown in Figure 2.
Group:0 Cache
1GB
I Search
Group:l Cache
34 GB
Engine:0
Engine: 1
Figure 2. Memory hierarchy of KSR-1.
Eachprocessorhas512Kbytesof fast subcache which is similar to the normal cache on other
parallel computers. This subcache is divided into two equal parts: an instruction subcache and a
data subcache. The 32 Mbytes of local memory on each processor is called a local cache. A local
ring (ring:0) with up to 32 processors can have 1 Gbytes total of local cache which is called the
Group:O cache. Access to the Group:0 cache is provided by Search Engine:0. Finally, a higher level
ring of rings (ring:l) connects up to 34 local rings with 34 Gbytes of total local cache which is called
Group:l cache. Access to the Group:l cache is provided by Search Engine:l. The entire memory
hierarchy is called ALLCACHE memory by the Kendall Square Research. Access by a processor
to the ALLCACHE memory system is accomplished by going through different Search Engines as
shown in Figure 2. The latencies for different memory locations [4] are: 2 cycles for subcache. 20
cycles for local cache, 150 cycles for Group:O cache, and 570 cycles for Group:1 cache.
3.2 The Application
The numerical algorithm used in this case study is the Householder Transformation algorithm for
the QR factorization of matrices. It is used for solving the normal equation
ATAx -- ATb (5)
without explicitly forming ATA.
In many cases, for instance the inverse problem of partial differential equations [1], the nor-
mal equation system resulting from the discretization is too ill-conditioned to be solved directly.
Tikhnov's regularization method [10] is frequently used in this case to increase numerical stabil-
ity. The key step in solving the Regularized Least Squares Problem (RLSP) is to introduce a
regularization factor a > 0. Instead of solving (5) directly, we solve the following system
(ATA + oI)x = ATb
for x. Equation (6) can also be written as
or
(6)
(7)
(s)
so that the major task is to carry out the QR factorization for matrix B which has the structure
B __
o111) oli' all)
a(1) _(1) ^(1)
ml (lrn2 " " " Urnn
(9)
where we usually have m _> n with m of the same order as n. Matrix B is neither a complete full
matrix nor a sparse matrix. The upper part is full and the lower part is sparse (in diagonal form).
Because of the special structure in (9), not all elements in the matrix are affected in a particular
transformation step. In the first step, all elements within the frame in matrix (9) will be affected.
In each new step, the frame in (9) will shift downwards one row with the left most column out of
the game. Therefore, at the ith step, the suhmatrix Bi affected in the transformation has the form:
Bi =
a(i) _(0
ii ...... uin
a(i) ...... a(i)
m+i-l,i m+i-l,n
o ... o
(10)
If the columns of matrix Bi of (10) are denoted by b}, i.e.
Bi [bll/ b i= bi+l "'" n], (11)
then the Householder Transformation can be described as:
Householder Transformation
Initialize matrix B
for i = l. n
1. ai = -sign(a!_ ) )(bii r bi/)1/2
2. wi = b i - o_iel
3./3j T i 2wi bj(ai (i)"= --c_iaii _, j=i+l,...,n
4. b) = b_ - /3jwi, j = i + l,...n
end for
Thecalculationof/3j's andupdatingof bi's canbedonein parallelfor differentindicesj.
3.3 Scalability Analysis
Based on the definition of isospeed scalabihty, the work W' at processor number p' should keep the
system ensemble running at the same average speed a as with p processors, so that
W W'
a = pTAw) /T¢(W') (12)
where Tp(W) and Tp,(W') are the execution times using p and p' processors respectively.
For the particular problem discussed here, the run time model is
and the work is
|¥(n) = 2n 3 -4-3n 2,
13)
(14)
where n is the number of columns in a 2n × n matrix to be transformed, p is the number of
processors, r is the rate of computing without communication overhead, and ,/3 is the latency for
access of remote data in the Group:0 cache. We use r, instead of A, to represent the computing
rate, because in practice the computing rate may vary with algorithm, problem size, and system
size. We reserve the notation A for the theoretical computing rate. Following the discussion given
in Section 2, the run time Tp(n) in (13) can apparently be represented as
= + ro(,,p), (15)
where Tc(n, p) is the computing time with ideal parallelism and To(n, p) represents the degradation
of parallelism. We then have
2n 3 + 3n 2
Tc(n,p) = r,
P
To(in,p) (an 2 3n2)r= - + n23.
P
The first term of To is due to the workload imbalance. The second term is due to the communication
(remote memory access) delay. Using relation (2) we get
( __3n t2
W'-- ap'_- p, v+ + n'2 j;3_ (16)
3nt2r
1 -- aT
The matrix size n is the parameter used to adjust the problem size. Substituting
I_[,71 : 2n/3 + 3n '2
into (16), we have
which eventually leads to
3n ¢2
2n,3 + 3nt2 = ap'(---y-r + 3n'2v + nt2fl)
1 -- ar
, 3arp' + aflp' 3
n - 2(1 - at) 2(1- aT)" (17)
Equation (17) is true for any work-processor pair which maintains the fixed average speed, assuming
that 7- and fl are unchanged. In particular,
3arp + aflp 3
n - 2(1 - at) 2(1 - aT)" (18)
Combining equation (17) and (18), we have
3at + aft
(n'- n) _ ----=' (P' - v),aT) (19)
which shows that the variation of n is in direct proportion to the variation of ensemble size, provided
that _- and/3 are independent of the number of processors.
Equation (19) indicates that the matrix size n' must increase at the same rate as the number of
processors pP, to maintain the pre-specified average speed a. If pr = rap, then we will have n _ = mn.
Assuming n is large so that the cubical term in equation (14) is dominant, we have the relation
= m3w( ).
Therefore, the scalability of this algorithm-machine combination can be estimated as
,(p, p') = rap) mp. W 1
pm3H z -- m2.
(2o)
In particular, if m = 2, which means the number of processors is doubled for each case, the
1
scalability will be approximately _.
It is clear from (19) that the parameters 7-and fl must first be determined before we can predict
the execution time and scalability. With the run-time model given by (13), we can estimate r and
/3 in the model to fit the measured run times using the least squares method. Assume that the
executions times Tm (nl),- " ", Tp_ (nk) are available oil Px, P2," • ", Pk processors, with problem sizes
10
beingnl, n2 -.-nk respectively, we will have
where
2 k kE,k:,b,rp,E,k:,c,-E,:l¢,rp, E,:,b,c,
"J- -- Eik__l bz2 Eik_l c_-(Eki bici) 2
k k
EiL1 b,2 E_=, ¢,_-(E_--1 b,_,F
bi 2n3
Pi
(21)
4 Scalability Prediction and Its Application
The peak performance provided by vendors gives the hardware performance limit but can hardly
be used to predict execution time accurately. For most application problems, the sustained speed
is only a small percentage of the peak performance. The same argument applies to communication
latency. The observed latency can be significantly different from the machine specifications. The
architecture specification [4] for KSR-1 gives
r = 0.025 (#s), fll = 7.5 (ps). (22)
To determine the value of r and/3 for this particular algorithm-machine pair, we ran the code on
p = 2 and 4 processors and measured the total execution time Tp(n) with n = 362 and n = 512
respectively. Then v and/3 are calculated using the model in (21). The parameters obtained this
way are
7-' = 0.18 (#s), /3'= 3.37 (ps). (23)
Comparing (22) and (23), we see that r' is significantly larger than _-. The sustained computational
speed is
1
r--; = 5.56 (M flops)
which is about 14% of the peak performance of 40 Mflops. This speed includes all the effects of
subcache misses and other overheads. On the other hand, the value of fl_ in (23) is significantly
smaller than/3 of (22), which means the actual observed communication speed is faster. This can
be attributed to two factoers:
1. Overlapping of communications with computations. In the Householder transformation, one
processor calculates the pivoting column and then broadcasts it to all other processors. This
broadcasting process can be partly overlapped with the other computations.
2. Automatic prefetch. The KSR-1 Fortran compiler analyzes loops and, whenever possible,
generates instructions to prefetch remote data needed for subsequent loops, thus saving data
11
accesstime.
225.0 ....
E
I--
175.0
125.0
75,0
-- measured execution time
....... predicted execution time _'
t
25.0 2 lr2 2_2 3_2 42 52
Number of processors
Figure 3. Measured and predicted execution time
Problem size is scaled up with available memory
Figure 3 shows both the measured execution time and the predicted execution time in seconds.
The predicted execution time is based on equations (13) and (23). The problem size is scaled-up
using the memory-bounded scale-up model [7], i.e. when the number of processors increases, the
matrix size also increases to fill up the available local memory. For the RLSP application, memory
requirement is a square function of the parameter n, and the computation count is a cubical function
of n. That explains why the run time goes up with more processors.
It is clear from the figure that the predicted execution time matches the measured execution
time well until p = 22. After that, the error increases significantly. This is due to the multi-ring
structure of KSR-1. Each ring has 32 processors. Since several of the 32 processors are dedicated
for I/O and control processes and are usually not used in computation, multi-ring communication
is involved even for p less than (but close to) 32. This multi-ring communication requires data
access to the Group:l cache which slows the computations significantly. The listed access time for
the Group:l cache on KSR-1 is [4]
/32 = 28.5 (ps). (24)
Again, the measured access time for our application is significantly different from the listed value,
especially when most communications are within a single ring. To determine the communication
delay for multiple rings, we ran the code on 36 processors and measured the execution time. Then
12
the valueof fl was calculated from (13) by fixing r = 0.18 (its) as given in (23). The new/3 value is
Z"= 6.27 (25)
which is about twice as large as that given in (23).
225.0
175.0 I
E 125.0
I--
75.0 -- measured execution time
....... predicted execution time using 13'
..... predicted execution time using 13"
, i i
25"036 40 44 48 52
Number of processors
Figure 4. Measured execution time and predicted time using the adjusted parameters
Problem size is scaled up with available memory
Figure 4 shows the execution time for p > 32. We see that with the new value of/3", the
predicted run time matches the measured execution time nicely.
Based on the test runs on p = 2, 4 and 36 processors and equation (17), the matrix size n' can
be predicted. Table 1 shows the predicted and measured matrix sizes respectively. The average
I size 1111 21 41 81 161 321 561
predicted - 54 115 238 484 976 2889measured 29 57 109 230 461 1006 2773
Table 1. Predicted and measured matrix size
speed a maintained in this test is 3.25 Mflops, which is about 58% of the sustained speed in (23).
From Table 1 we can see that the predicted matrix size is very close to the actual matrix size
measured by running the code on 8, 16, 32, and 56 processors. The last column in Table 1 shows
the predicted size n _ using fl'. If the fit given in (23) is used in predicting the matrix size, then n _
will be 1715 at p = 56, which is significantly smaller than the measured n _. The difference shows
13
g,(p,p') 1 I 2 4 8 16 32 56
1 1.00000 0.33238 0.07183 0.01652 0.00397 0.00097 0.00007
2 1.00000 0.21611 0.04971 0.01193 0.00292 0.00020
4 1.00000 0.23003 0.05520 0.01352 0.00092
8 1.00000 0.23999 0.05879 0.00398
16 1.00000 0.24499 0.01658
32 1.00000 0.06767
56 1.00000
Table 2. Predicted scalabifity of RLSP-KSR1 Combination
the influence of slower remote memory access of the Group:l cache on scalability.
With the matrix sizes given in Table 1 and the parameters given in (23) and (25), we can compute
the scalability g,(p, pt). Table 2 and 3 give the predicted and measured scalability respectively. We
can see that the predicted and measured scalabilities are fairly close. The prediction at ensemble
size of 56 is based on the justified communication delay/3". Figure 5 depicts the difference between
the measured scalability and the predicted scalability obtained by using/3 _. The curves in the figure
represent measured and predicted _(p,56) respectively with p varying from 1 to 56. Note that in
order to see clearly the difference between the two curves in figure 5, we plotted -log(_b(p, 56)),
instead of g,(p, 56). Therefore, the curve with lower - log(_b(p, 56)) value actually represents higher
scalability than the curve with higher -log(0(p, 56))value.
A single bus is an efficient architecture to support the shared-memory communication model
and has been used successfully in several commercial shared-memory machines. Due to network
contention, the single bus architecture is difficult to use to support a large number of processors
efficiently. All the commercially available machines with bus communication network share less
than 40 processors. In order to build a scalable shared virtual memory machine, the architecture
of KSR-1 is designed as a combination of buses and a fat-tree (see Section 3.1). Each local ring
has 32 processors connected to a single bus. Then, the local rings are connected with the fat-tree-
like structure. Theoretically, the computing system can be scaled up to any number of processors
by increasing the number of levels of the tree. Figure 5 shows the limitation of the ring-tree
approach. The scalability is severely reduced when inter-ring remote access is required. It shows
that, unless inter-ring communication can be improved, uniprocessor efficiency will reduce quickly
with the increase of ensemble size and high computing power may not be achievable by increasing
the number of levels of the fat-tree.
The scalability difference given in figure 5 is based on the measured scalability and the measured
_- and /3q Figure 6 shows the scalability difference with the theoretical performance data A, /31,
and/32, where the average speed is fixed at 58% of the peak performance. It gives the theoretical
difference of the RLSP application when Group:l communication is required. Comparing the
14
II 1 2 4 8 16 32 56
1 1.00000 0.28382 0.08418 0.01830 0.00459 0.00089 0.00007
2 1.00000 0.29660 0.06446 0.01616 0.00313 0.00026
4 1.00000 0.21734 0.05449 0.01054 0.00088
8 1.00000 0.25070 0.04849 0.00406
16 1.00000 0.19343 0.01621
32 1.00000 0.08378
56 1.00000
Table 3. Measured Scalabihty of RLSP-KSR1 combination.
curves in figure 5 with those in figure 6, we can clearly see the similarity. Both figures show
that the scalability with remote cache access is much lower than that without considering remote
data access. The general trends in both figures are very similar. Since the curves in figure 6
were plotted based on machine specification, it shows that, while machine specification does not
provide good estimate of execution time or speed, it does give a foundation to predict the influence
of architecture variation on performance. Equation (3) is a useful tool to predict performance
of an algorithm-machine pair, even when the computing system is scaled up from one level of
architecture hierarchy to two levels. It gives the variation of performance even only the hardware
specification is available. The influence of architecture variation is different on different algorithms.
When architecture scales up from one level of hierarchy to another, an algorithm that performed
worse than another algorithm at a less hierarchical architecture might become better on a more
hierarchical architecture. The scalabihty formula (3) provides a guidehne for chosing algorithms as
system size is scaled up.
Figure 7 shows the scalability curves for the Givens Rotation algorithm [5], which can also be
used to solved the least squares problem. The same machine specifications as those used for figure
6 are used in figure 7. We can see that the scalabihty of the Givens rotation algorithm is worse
than that of the Householder algorithm. However, the difference is decreased when the system
scales up. This demonstrates that the scalability of the Givens algorithm is less affected by the
hierarchical remote cache access than the Householder algorithm is. The Givens algorithm may
provide better scalabihty and, therefore, better execution time when the system size is large enough
so that multi-level ring communication is required. Figure 6 and 7 show how algorithms could be
compared with the notion of scalability.
The average speed a maintained in this study is about 58% of the sustained speed. The efficiency
maintained is reasonably high. The scalabihty given in Table 2 and 3 could be higher if a was lower,
as shown in equation (3). Also, the computing rate r in general varies with the number of processors
and problem size on any machine with memory hierarchy. For our implementation, since the initial
problem size is large and it increases with the number of processors, the computing rate is quite
15
6.00
¢0
It')
¢5.
v
._ 3.00
¢:D
0
"-i-
0.00
-- measured scalability
....... predicted scalability using 13'
I
8 16 2 32
number of processors p
Figure 5. Measured and predicted scalability
Equation (23) is used in prediction
6.00
CO
Q.
._ 3.00
0
"3-
-- predicted scalability using 13_
....... predicted scalability using IB_& 132
"4
i
0.00 L ..... _ , k ,
0 8 1 6 24 32
number of processors p
Figure 6. Predicted scalability using machine specifications
16
6.00
"_ 3.00
o
0.00
0 8 16 24 32
number of processors p
Figure 7. Predicted scalability of Givens rotation using machine specifications
stable. The scalability prediction will be more involved if the computing rate varies with the system
size [6].
5 Conclusion
Recent trends in parallel processing suggest that the issue of performance prediction is becoming
more complex and difficult. Massively parallel computing has been adopted as a cost-effective
way to achieve high computing power. Sophisticated architectures have been proposed to deliver
performance scalability with a large number of processors. Shared virtual memory and other kinds
of system support, that hide the communication and other implementation details from the users,
are becoming more prevalent. At the same time, with various architectures and algorithms available,
performance prediction is becoming critical in of chosing an appropriate algorithm-machine pair
for an application, especially when the machine has a sophisticated, hierarchical architecture. The
study given in this paper is an attempt to combine simple formulas with run-time information
to provide a reasonable prediction on modern parallel computers. A simple prediction formula is
presented. Then, a case study is conducted on a multi-ring KSR-1 virtual memory machine to
illustrate how the formula could be used in practice. Four different aspects are discussed in the
paper. First, a method is proposed to measure the needed run-time parameters. Second, when the
system size is scaled up from one level of architecture hierarchy to another level of hierarchy, an
adjustment is proposed to catch the influence of the architecture variation. Experimental results on
the multi-ring KSR-1 machine shows our predicted performance matches the measured performance
17
well, in both executiontime andscalability.Then,with this casestudy,wehaveshownthat it is
possibleto predictthe influenceof architecturehierarchyonscalabilityby simplyusinghardware
specifications.Finally,wehavediscussedtheissueof choosinganappropriatealgorithmfor agiven
applicationwhenthecomputingsystemis scaledup from onelevelof hierarchyto another.
Twobasicproblemshavebeenaddressedin thisstudy: predictingthe executiontime and pre-
dictingthescalability.Likemostexistingmodels,thepredictionof executiontimereliesonrun-time
information(suchas7-and/3) whichmayvarywith problemandensemblesize.Our experiments
show,however,that while hardwaredoesnot realizethe advertizedperformancein solvingac-
tual applications,the relativeperformanceof architecturesandalgorithmscanbe predictedand
comparedin termsof scalabilitygivena hardwarespecification.
While thenumericalexperimentherewasconductedon a KSR-1machine,the resultgivenin
this study is not limited to the KSR-1 architecture. It is a general result of scalability prediction
and should be useful in evaluation of any scalable architecture and algorithm.
Acknowledgment
The authors are grateful to the Cornell Theory Center for providing us access to the KSR parallel
computer.
References
[1] CHEN, Y. M., ZHu, J. P., CI{EN, W. H., AND WASSERMAN, M. L. GPST inversion
algorithm for history matching in 3-d 2-phase simulators. In IMACS Trans. on Scientific
Computing I (1989), pp. 369-374.
[2] GRAMA, A. Y., GUPTA, A., AND KUMAR, V. Isoefficiency: Measuring the scalability of
parallel algorithms and architectures. IEEE Parallel _ Distributed Technoloty 1, 3 (Aug.
1993), 12 21.
[3] GUSTAFSON, J., MONTRY, G., AND BENNER, R. Development of parallel methods for a
1024-processor hypercube. SIAM J. of Sci. and Stat. Computing 9, 4 (July 1988), 609-638.
[4] KENDALL SQUARE RESEARCH. KSR technical summary. Waltham, USA, 1991.
[5] POTHEN, A., AND RAGHAVAN, P. Distributed orthogonal factorization: Givens and House-
holder algorithms. SIAM J. of Sci. and Star. Computing 10 (1989), 1113-1135.
[6] RAMACHANDRAN, Y., SHAH, G., RAVtKUMAR, S., AND MUTHUKUMARASAMY, 3. Scalability
study of the KSR-1. Technical Report, GIT-CC 93/03, College of Computing, Georgia Institute
of Technology. 1993.
[7] SUN, X.-H., AND NI, L. Scalable problems and memory-bounded speedup. J. of Parallel and
Distributed 6bmputing 19 (Sept. 1993), 27-37.
[8] SUN_ X.-H., AND ROVER, D. Scalability of parallel algorithm-machine combinations. IEEE
Transactions on Parallel and Distributed Systems (June 1994), 599-613.
18
[9] SUN,X.-H., AND ZHU, J. Shared virtual memory and generalized speedup. In Proc. of the
Eighth International Parallel Processing Symposium (April 1994), pp. 637-643.
[10] TIKHNOV, A. N., AND ARSENIN, V. Solution of Ill-posed Problems. John Wiley and Sons,
1977.
[11] WORLEY, P. T. The effect of time constraints on scaled speedup. SlAM J. of Sci. and Stat.
Computing 11, 5 (Sept. 1990), 838-858.
19

Form Approved
REPORT DOCUMENTATION PAGE OMBNo 0704-0188
Public reporting burden for this collection of information is estimated to average ] hour per response, including the time for reviewing instructions, searching existing data sources,
gathering and maintainin_ the data needed, and completing a nd reviewing the collection of information Send comments regarding this burden estimate or any other aspect of this
collection of information, including suggestions for reducingthis burden, to Washington Headquarters Services, Directorate for Information Operations and Reports. 1215 Jefferson
Davis Highway. Suite 1204, Arlington, VA 22202-4302, and to the Office of Management and Budget, Paperwork Reduction Project (0704-0188). Washington, DC 20503
1. AGENCY USE ONLY(Leave blank) 2. REPORT DATE
April 1995
4. TITLE AND SUBTITLE
PERFORMANCE PREDICTION: A CASE STUDY
MULTI-RING KSR-1 MACHINE
6. AUTHOR(S)
Xian-He Sun
Jianping Zhu
7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES)
Institute for Computer Applications in Science
and Engineering
Mail Stop 132C, NASA Langley Research Center
Hampton, VA 23681-0001
9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES)
National Aeronautics and Space Administration
Langley Research Center
Hampton, VA 23681-0001
3. REPORT TYPE AND DATES COVERED
Contractor Report
5. FUNDING NUMBERS
USING A
C NAS1-19480
WU 505-90-52-01
8. PERFORMING ORGANIZATION
REPORT NUMBER
ICASE Report No. 95-24
10. SPONSORING/MONITORING
AGENCY REPORT NUMBER
NASA CR-1950fi6
ICASE Report No. 95-24
ll. SUPPLEMENTARY NOTES
Langley Technical Monitor: Dennis M. Bushnell
Final Report
To appear in the Proceedings of the 28th Hawaii International Conference on System Sciences, January '95
12a. DISTRIBUTION/AVAILABILITY STATEMENT 12b. DISTRIBUTION CODE
Unclassified-Unlimited
Subject Category 60, 61
13. ABSTRACT (Maximum 200 words)
While computers with tens of thousands of processors have successfully delivered high performance power for solving
some of the so-called "grand-challenge" apphcations, the notion of scalability is becoming an important metric in
the evaluation of parallel machine architectues and algorithms. In this study, the prediction of scalability and its
application are carefully investigated. A simple formula is presented to show the relation between scalability, single
processor computing power, and degradation of parallelism. A case study is conducted on a multi-ring KSR-1 shared
virtual memory machine. Experimental and theoretical results show that the influence of topology variation of an
architecture is predictable. Therefore, the performance of an algorithm on a sophisticated, heirarchical architecture
can be predicted and the best algorithm-machine combination can be selected for a given apphcation.
14. SUBJECT TERMS
High Performance Computing; Parallel Processing; Performance Prediction;
Performance Evaluation; Scalability; Computer Architecture
17. SECURITY CLASSIFICATION
OF REPORT
Unclassified
NSN 7540-01-280-5500
18. SECURITY CLASSIFICATION
OF THIS PAGE
Unclassified
19. SECURITY CLASSIFICATION
OF ABSTRACT
15. NUMBER OF PAGES
21
16. PRICE CODE
A03
20. LIMITATION
OF ABSTRACT
Standard Form 298(Rev. 2-89)
Prescribed by ANSI Std Z39-18
298-102

