Latency measures the delay caused by communication between processors and memory modules over the network in a parallel system. Using intensive measurements and simulation, we show that network latency forms a major obstacle to improving parallel computing performance and scalability. We present an experimental metric that uses network latency to measure and evaluate the scalability of parallel programs and architectures. This latency metric is an extension of the isoefficiency function [3] and the isospeed metric [9]. We give a measurement method for using this latency metric, and report experimental results of evaluating the scalability of several scientific computing algorithms on the KSR-1 shared-memory architecture. Our analysis and experiments show that the latency metric is a practical method to effectively predict and evaluate scalability based on measured latencies inherent in the program and the architecture.
1 Introduction
Parallel computing scalability does not yet have a commonly accepted definition. In scientific computations, however, people want to know whether there is a corresponding increase in the performance of a computation as the size of a parallel machine is increased. The increase in computing performance is primarily affected by the overhead patterns inherent to an application program and by the effects of the architecture's interconnection network. Program overhead patterns refer to the synchronization and communication structures of the program, while the effects of the architecture's interconnection network refer to the delays inherent in the communication hardware.
This work is supported in part by the National Science Foundation under research grants CCR-9008991 and CCR-9102854 and under instrumentation grant DUE-9250265, by the U.S. Air Force under research agreement FD-204092-64157, by a grant from Cray Research, and by a Fellowship from the Southwestern Bell Foundation. Part of the experiments were conducted on the BBN TC2000 at Lawrence Livermore National Laboratory, and on the KSR-1 machines at Cornell University and at the University of Washington.
Therefore, parallel computing scalability consists of architecture scalability, which relates to bottlenecks inherent in an architecture design, and algorithm scalability, which relates to the parallelism inherent in an algorithm design.
Scalability aims to measure the ability of a parallel architecture to deliver increased performance once the parallelism of a given algorithm has been effectively exploited. Evaluation of scalability can also be used to predict the performance of large problems on large systems based on the performance of small problems on small systems. A rigorous scalability definition and metric provides an important guideline for precisely understanding the nature of scalability, and for effectively measuring it in practice. Isoefficiency [3] and isospeed [9] are two useful scalability metrics. The former evaluates the performance of an algorithm-machine combination by modeling an isoefficiency function. The latter evaluates the performance of an algorithm-machine combination by measuring the workload increment required by a change of machine size under the isospeed condition. Isoefficiency is considered an analytical method for algorithm scalability evaluation. Although the isospeed metric is an experimental metric, it may face the difficulty of measuring some real machine factors in practice.
In this paper, we present an experimental metric that uses network latency to measure and evaluate parallel program and architecture scalability. We first give the definitions of latency and of the scalability. We then show the analytical relationships among the latency metric, the isoefficiency function and the isospeed metric. Finally, we give a measurement method for using the latency metric. We include experimental measurements on the KSR-1 to show the effectiveness of the latency metric in predicting and evaluating scalability.
The organization of this paper is as follows. Section 2 overviews the two currently available metrics, emphasizing the merits and limits of the isospeed metric. We present our latency metric in Section 3. We overview the architectures and the application programs used for the experiments in Section 4. In Section 5, we present traced and monitored program execution results, and address the importance of using latency to abstract the various effects of network architectures and program structures. Both the measurement methods and the experimental results of several numerical programs on the KSR-1 are presented in Section 6. Finally, we give summaries and conclusions in Section 7.
2 Scalability metrics from efficiency and speed
Overview and background
The definition of scalability derives from Amdahl's law, which is tied to efficiency and speedup. There are two important scalability metrics: the isoefficiency function, based on parallel computing efficiency [3], and the isospeed metric, based on parallel computing speed [9]. The isoefficiency function of a parallel system is determined by expressing the size of a computing problem as a function of the number of processors, subject to maintaining a desired parallel efficiency (between 0 and 1). Specifically, the efficiency is defined as

E = 1 / (1 + T_o(n, W) / (t_c W)),    (2.1)

where T_o is the total overhead caused by all processors to do the computation in parallel, t_c is the average execution time per operation on the architecture, W is the problem size and n is the number of processors. Here t_c W is the sequential runtime of the algorithm.
If the efficiency is to be maintained at a certain value E (0 < E < 1), then from (2.1),

t_c W = (E / (1 - E)) T_o(n, W),  or equivalently  W = K T_o(n, W),    (2.2)

where K = E / ((1 - E) t_c) is a constant. The scalability is determined by the overhead function T_o(n, W): the larger T_o(n, W) is, the lower the scalability of the algorithm on the architecture will be. If an analytical form of T_o for a given algorithm on a given architecture is described as a function of n and W, the isoefficiency curve E(n) using up to n processors can be generated for evaluating the scalability.
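As a concrete, hypothetical illustration (the overhead form is assumed here for exposition, not taken from [3]): if the analysis of an algorithm-machine combination gave a total overhead of T_o(n, W) = c n log n for some constant c, then (2.2) becomes

W = K c n log n,

so the problem size must grow as n log n with the machine size to hold the efficiency at E.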
The isoefficiency function (2.2) captures, in a single expression, the effects of the characteristics of the parallel algorithm as well as of the parallel architecture on which it is implemented. In addition, the isoefficiency function shows that it is necessary to vary the size of a problem on a parallel architecture of changeable size so that the processing efficiency of each processor remains constant. However, this metric has two limits for the experimental evaluation of scalability. First, analytical forms of the program and architecture overhead patterns in a shared-memory architecture may not be as easy to model as in a distributed-memory architecture, because the computing processes involved in a shared-memory system include process scheduling, cache coherence and other low-level program- and architecture-dependent operations which are more complicated than message passing on a distributed-memory system. It would be difficult in practice to use the isoefficiency metric to precisely evaluate the scalability of a program running on a shared-memory system. Second, the metric may not be usable for measuring the scalability of an algorithm-architecture combination through machine measurements; of course, experiments can be used to verify the analytical isoefficiency of an algorithm on a specific architecture. We believe this metric is more appropriate for evaluating the scalability of parallel algorithms.
Sun and Rover [9] take an approach based on algorithm-machine combinations. Their metric starts by defining the average unit speed:
S = W / (N t),    (2.3)

where W is the amount of work, quantitatively given by the number of floating point operations in the program, N is the number of processors, and t is the execution time. Keeping this speed constant as the machine grows from N to N' processors, the scalability is defined as

scale(N, N') = (W N') / (W' N),    (2.4)

where W and W' are the amounts of work (or the problem sizes) for the architecture of size N and for the architecture of size N', respectively. This metric provides more information about architectures and programs because the scalability (2.4) can be well determined through machine measurements.
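As a purely illustrative example with hypothetical numbers: suppose that growing the machine from N = 8 to N' = 16 processors requires tripling the work, W' = 3W, to keep the average speed constant. Then

scale(8, 16) = (W x 16) / (3W x 8) = 2/3,

whereas an ideally scalable combination would need only W' = 2W, giving scale(8, 16) = 1.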
Merits and limits of the isospeed metric
Our experimental metric follows the same line as the isospeed metric. Here we first study the merits and limits of that metric. Since the computing speed is used to define the scalability function, the isospeed metric has the following merits in theory and practice. First, the speed combines two major performance factors: the problem size and the execution time. While the problem size describes a property of the application program, the execution time reflects the effects of the architecture and the efficiency of the program. Second, the speed is a fair quantity for comparisons among various architectures. The execution time includes the pure computing part and the latency part, which is the major performance factor. Finally, the isospeed is easy to measure because the problem size is determined by the number of floating point operations performed in the computation.

However, in terms of algorithm-machine combinations, the isospeed metric may not explicitly and completely measure the architecture effects and the program overhead patterns, because floating point operations are used to measure the work (problem size). We believe the isospeed metric has two limits for precisely measuring and evaluating the scalabilities of application programs and architectures. First, some non-floating-point operations can cause major performance changes. For example, a single assignment to a shared variable in a cache coherent shared-memory system may generate a sequence of remote memory/cache accesses and data invalidations, but this type of operation is excluded from the measurement of the scalability. Second, the latency is included in the total execution time in the metric, but it is not defined in the amount of work, W, in the scalability metric (2.4). In practice, the execution overhead caused by the interconnection network and the program structure is a function of the problem size. Since execution patterns can be precisely monitored by hardware and/or software in more and more modern parallel architectures, scalability can be further evaluated at lower application and system levels to capture more precise performance factors of architecture effects and program overhead patterns. In the next section we propose a new metric, called the latency metric, to enhance the ability of the isospeed metric. Instead of the speed, we use the latency, the average computing delay, as the major factor in the metric. In Section 5 we show that latency carries more precise information about the architecture's interconnection network effects and the overhead patterns inherent in application programs.
3 The latency metric
Definitions and assumptions of the metric
We define the latency metric through a series of formal definitions and theorems.
Definition 1. The parallel computing time of an algorithm implementation, denoted as T_para, is the elapsed time between starting the program and ending the program on a parallel architecture.
Definition 2. The parallel execution time on the ith processor, denoted as T_i, for i = 1, ..., N, is the effective computing time spent on the ith processor during the parallel execution.

Definition 3. The average latency of an implementation-machine combination for problem size W on N processors, denoted as L(W, N), is the average non-computing delay per processor:

L(W, N) = T_para - (1/N) (T_1 + T_2 + ... + T_N).

In the isoefficiency function and the isospeed metric, the scalability is defined for an algorithm-machine combination, and may not be used to measure different parallel implementations of a given algorithm. However, in parallel programming, different implementations of an algorithm may have very different computing performance. In contrast, our latency metric defines the scalability of a parallel algorithm implementation-machine combination, simplified as an implementation-machine combination. The next definition incorporates this concept into the scalability.
Definition 5. For a given efficiency E in (0, 1] of running a program on N processors, if and only if the efficiency of an implementation of an algorithm on a given machine can be made equal to or greater than the given E by increasing the size of the problem, the implementation-machine combination is called scalable.
Definition 5 indicates that an algorithm-machine combination is scalable if and only if we can find an implementation of the algorithm on the machine that is scalable. Recall that a problem-machine combination is scalable if and only if an algorithm for the problem on the machine is scalable.
Before formally defining the scalability latency metric, we describe the parallel computing time T_para based on Amdahl's Law:
T_para = (W t_c) / N + L(W, N),    (3.6)

where W is the size of the problem, N is the number of processors, t_c is the average computing time per operation on the system, and L(W, N) is the average latency time.
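Combining (3.6) with the definition of the efficiency, E = T_seq / (N T_para) with T_seq = W t_c, gives the form of the efficiency that is used repeatedly below (this intermediate step is implicit in the original derivation):

E = (W t_c) / (W t_c + N L(W, N)).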
Definition 6. For a given efficiency E, the scalability of an implementation-machine combination when the system is scaled from N to N' processors is the ratio of the E-conserved average latencies:

scale(E, (N, N')) = L(W, N) / L(W', N'),    (3.7)

where W and W' are the problem sizes that keep the efficiency at E on N and N' processors, respectively. We also call the metric in (3.7) an E-conserved scalability because the efficiency is kept constant.
From the definition of the efficiency in (2.1), (3.7) satisfies the following E-conserved condition:

(W t_c) / (W t_c + N L(W, N)) = (W' t_c) / (W' t_c + N' L(W', N')) = E.    (3.8)
In practice, the value of (3.7) is less than or equal to 1. A large scalability value of (3.7) means small increments in the latencies inherent in the program and the architecture for efficient utilization of an increasing number of processors, and hence the parallel system is considered highly scalable. On the other hand, a small scalability value means large increments in latency and therefore a poorly scalable system. Furthermore, when a set of scale(E, (N, N')) values is measured on a system for different given efficiencies E, an average scalability may be described by an integration of the latency metric from N to N'.
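For illustration only (hypothetical numbers, not measurements): suppose that holding E = 0.25 requires an average latency of L(W, 16) = 40 cycles on 16 processors and L(W', 32) = 50 cycles on 32 processors. Then

scale(0.25, (16, 32)) = 40 / 50 = 0.8,

indicating that the E-conserved latency grows slowly over this range and the combination scales well.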
Analytical relationships among the three metrics
Based on the latency metric and the related definitions in the previous section, we can also establish analytical relationships among the isoefficiency function, the isospeed metric and the latency metric through the following two theorems.

Theorem 1. Assume the average execution time per operation for solving a problem on an architecture, t_c, is a constant. Then the efficiency E and the speed S have the following relationship:

E = S t_c.    (3.9)

This theorem indicates that the isoefficiency is equivalent to the isospeed.
Proof: Let W be the problem size, N the number of processors and T_para the parallel computing time. By definition, E = (W t_c) / (N T_para) and S = W / (N T_para), so E = S t_c.

Theorem 2. The latency scalability metric (3.7) can be derived from both the isospeed metric and the isoefficiency function.

Proof: From the definition of the speed, T_para = W / (N S). Substituting this into (3.6), the latency becomes

L(W, N) = (W / N) (1/S - t_c).    (3.15)

Under the isospeed condition, S is a constant, so (3.15) determines the E-conserved latencies that form the ratio in (3.7); this means the latency metric can be derived from the isospeed metric. Similarly, since T_o(n, W) in the isoefficiency function is the total overhead over all processors, the average latency is L(W, N) = T_o(N, W) / N, so the latency metric can also be derived from the isoefficiency function.

Theorem 2 indicates that the latency scalability metric covers the scalability measures included in both the isospeed metric and the isoefficiency function. Again, another important reason for using the latency is related to real measurements on computer systems: this metric more directly and precisely captures the effects of the architecture's interconnection network and the overhead patterns inherent in the algorithm during program execution. In general, the average latency defined by Definition 3 is an increasing function of the machine size and the problem size, so the scalability scale(N, N') of an implementation-machine combination is less than 1. When scale(N, N') = 1, the average latency of the implementation-machine combination is a constant, independent of the problem size and the machine size. In this case, by the definition of the efficiency, we have
E = (W t_c) / (W t_c + N L(W, N)).

Here the efficiency E and the latency L(W, N) are constants, so the problem size can be expressed as

W = k N,  where  k = L(W, N) E / ((1 - E) t_c).

This indicates that the problem size W increases linearly with the system size N. This case is the ideal algorithm implementation-machine combination, and gives an important quantitative reference for designers of parallel machines, parallel algorithms and parallel algorithm implementations.
Using the latency metric for scalability prediction
Although the scalability defined in Definition 6 is motivated by architecture and system measurements, we can also predict it through a so-called E-conserved latency function, denoted f(N), which is an analytical function of the machine size only. From the efficiency definition, we have

E = (W t_c) / (W t_c + N L(W, N)).    (3.18)

Then the problem size W can be derived from the efficiency definition:

W = E N L(W, N) / ((1 - E) t_c).    (3.19)

Substituting this W function into the latency metric (3.7), the E-conserved latency can always be written as a function f of the machine size only, and the scalability takes the form

scale(E, (N, N')) = f(N) / f(N').    (3.20)
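As a hypothetical illustration (the latency form is assumed, not taken from this paper's measurements): suppose the measured average latency fits L(W, N) = c N for some constant c. Then the E-conserved latency function is f(N) = c N, and

scale(E, (N, N')) = f(N) / f(N') = N / N',

so doubling the machine size halves the scalability value, independently of E.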
Using the E-conserved latency function, the scalabilities can be predicted for any given machine size.

4 Parallel architectures and application programs for the scalability experiments
The architectures we used as scalability testbeds are network-based shared-memory systems, which use a logical shared address space over physically distributed memory. This type of system intends to combine the scalability of network-based architectures with the convenience of shared-memory programming. The shared-memory systems we used were the KSR-1 and the Cerberus shared-memory simulator. We tested three application programs from the Stanford SPLASH set for latency measurement and scalability evaluation on both architectures. We also implemented three other standard numerical algorithms on the KSR-1 for scalability measurement using the latency metric. In this section, we briefly overview the parallel architectures and the application programs used for the scalability experiments.
The KSR-1 system
The KSR-1 [5], introduced by Kendall Square Research, is a hierarchical ring-based shared-memory multiprocessor system with up to 1,088 64-bit custom superscalar RISC processors (20 MHz). A basic ring unit in the KSR-1 has 32 processors. The system uses a two-level hierarchy to interconnect 34 rings (1,088 processors). Each processor has a 32 MB local cache. The basic structure of the KSR-1 is the slotted ring, where the ring bandwidth is divided into a number of slots circulating continuously through the ring. The number of slots in the ring is equal to the number of processors plus the number of directory/routers connecting to the upper ring. A standard KSR-1 ring has 34 message slots, where 32 serve the 32 processors and the remaining two slots are used for the directory/router cells connecting to the level-1 ring. Each slot can be loaded with a packet made up of a 16-byte header and 128 bytes of data; 128 bytes is the basic data unit in the KSR-1, called a subpage. A processor in the ring that is ready to transmit a message waits until an empty slot rotates through the processor's ring interface.
The Cerberus simulator
In order to evaluate cache and cache coherence effects, we also traced one of the application programs on a cache coherent multistage interconnection network multiprocessor simulated by Cerberus [1] on the TC2000. The simulator constructs a multistage interconnection network architecture similar to the TC2000. Each processor has a 64 KB cache with a cache line size of 128 bytes. All processors are connected to a large shared memory. An interleaved shared-memory scheme is supported in the memory system. The cache coherence protocol used is a standard full-map directory cache coherence protocol [2]. The simulator collects detailed statistics on execution behavior, including memory access patterns, cache invalidation patterns, and network traffic.
Three application programs from the SPLASH
Three application programs from the Stanford SPLASH parallel benchmark set [8] have been ported to the KSR-1 and the Cerberus simulator for latency pattern and scalability evaluation. These parallel programs are a molecular dynamics simulation (Water), a rarefied fluid flow simulation (MP3D) and a Cholesky factorization of a sparse matrix (Cholesky). An overview of the program and synchronization structures of the three application programs is listed in Table 1.
Quick Sort (QS)
The parallel quick sort uses a task queue as its basic data structure to allocate tasks onto threads dynamically. Since the task queue is globally shared and accessed in a critical section, its size and structure strongly affect computing performance. In our implementation, we used a single-queue structure. Initially, each thread gets a task, which is a segment of the data to be sorted, from the task queue. It then divides the segment into two smaller segments, puts one of them back into the queue, and keeps partitioning the other repeatedly until the resulting segment reduces to individual elements. Each thread repeats the same process until the task queue is empty. For detailed information about the algorithm, the interested reader may refer to [6]. The latency in the parallel quick sort algorithm mainly comes from the waiting time for entering the critical section of the task queue, and from the time for producing the tasks inside the critical section.
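The following is a minimal sketch of the single-queue scheme described above, written with POSIX threads for portability; the paper's KSR-1 implementation used the machine's own threading primitives, and the sizes, names and termination scheme here are illustrative assumptions.

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NELEM    100000   /* number of elements to sort (illustrative) */
    #define NTHREADS 8        /* number of worker threads (illustrative)   */

    typedef struct Task { int lo, hi; struct Task *next; } Task;

    static int data[NELEM];
    static Task *queue = NULL;      /* the single globally shared task queue */
    static int outstanding = 0;     /* queued tasks plus segments in hand    */
    static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t qcond = PTHREAD_COND_INITIALIZER;

    /* Enqueue a segment; entering this critical section is the main
     * source of latency identified in the text above. */
    static void push(int lo, int hi)
    {
        Task *t = malloc(sizeof *t);
        t->lo = lo; t->hi = hi;
        pthread_mutex_lock(&qlock);
        t->next = queue; queue = t;
        outstanding++;
        pthread_cond_signal(&qcond);
        pthread_mutex_unlock(&qlock);
    }

    static int partition(int lo, int hi)
    {
        int pivot = data[hi], i = lo, tmp;
        for (int j = lo; j < hi; j++)
            if (data[j] < pivot) {
                tmp = data[i]; data[i] = data[j]; data[j] = tmp; i++;
            }
        tmp = data[i]; data[i] = data[hi]; data[hi] = tmp;
        return i;
    }

    static void *worker(void *arg)
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&qlock);
            while (queue == NULL && outstanding > 0)
                pthread_cond_wait(&qcond, &qlock);  /* wait for work      */
            if (queue == NULL) {                    /* empty queue and no */
                pthread_mutex_unlock(&qlock);       /* work in flight     */
                return NULL;
            }
            Task *t = queue; queue = t->next;
            pthread_mutex_unlock(&qlock);
            int lo = t->lo, hi = t->hi;
            free(t);

            while (lo < hi) {
                int p = partition(lo, hi);
                push(p + 1, hi);   /* one half goes back to the shared queue */
                hi = p - 1;        /* keep sorting the other half locally    */
            }
            pthread_mutex_lock(&qlock);
            if (--outstanding == 0)              /* last segment finished:    */
                pthread_cond_broadcast(&qcond);  /* wake idle workers to exit */
            pthread_mutex_unlock(&qlock);
        }
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        for (int i = 0; i < NELEM; i++) data[i] = rand();
        push(0, NELEM - 1);
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        for (int i = 1; i < NELEM; i++)
            if (data[i - 1] > data[i]) { puts("not sorted"); return 1; }
        puts("sorted");
        return 0;
    }

Every push and pop serializes on the queue lock, which is exactly why the waiting time at this critical section dominates the program's latency as the number of threads grows.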
Gauss Elimination (GE)
In this parallel Gauss elimination algorithm, multiple threads are mainly used to perform column-element elimination simultaneously. Whenever a column of elements has been eliminated, all of the threads must wait at a synchronization point for the main thread to reorder the elements of the next row. The parallel algorithm performs this operation from the first row to the last row of the matrix. For detailed information about the algorithm, the interested reader may refer to [7].
The latency in the Gauss elimination program mainly comes from the thread synchronization and from the data movement among memory modules.
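A minimal sketch of this structure follows, using a POSIX barrier (pthread_barrier_t; not available on every platform) for the per-step synchronization point. The row reordering/pivoting done by the main thread is omitted for brevity, the static row assignment shown corresponds to the locality-enhanced GE2 variant discussed in Section 6, and all sizes and names are illustrative assumptions.

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N 256    /* matrix dimension (illustrative)  */
    #define P 4      /* number of threads (illustrative) */

    static double A[N][N + 1];     /* augmented matrix [A | b]      */
    static pthread_barrier_t bar;  /* per-step synchronization point */

    static void *eliminate(void *arg)
    {
        int id = (int)(long)arg;
        for (int k = 0; k < N - 1; k++) {
            /* Row i is always handled by thread i % P: a static schedule
             * that keeps each row local to one processor (the GE2 idea). */
            for (int i = k + 1; i < N; i++) {
                if (i % P != id) continue;
                double f = A[i][k] / A[k][k];
                for (int j = k; j <= N; j++)
                    A[i][j] -= f * A[k][j];
            }
            /* Every thread must see the completed elimination step before
             * the next pivot row is used: the main latency source here. */
            pthread_barrier_wait(&bar);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[P];
        for (int i = 0; i < N; i++)
            for (int j = 0; j <= N; j++)          /* diagonally dominant */
                A[i][j] = (i == j) ? 10.0 * N : rand() % 10;
        pthread_barrier_init(&bar, NULL, P);
        for (long i = 0; i < P; i++)
            pthread_create(&tid[i], NULL, eliminate, (void *)i);
        for (int i = 0; i < P; i++)
            pthread_join(tid[i], NULL);
        printf("A[N-1][N] = %f\n", A[N - 1][N]);
        return 0;
    }

The barrier is crossed N - 1 times by every thread, so the synchronization cost, and the remote traffic for pivot rows produced on other processors, grows directly with the matrix size and the thread count.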
All Pairs Shortest Path (APSP)
The parallel implementation of All Pairs Shortest Path is based on Dijkstra's sequential algorithm. It uses a path matrix to store the connection relations among nodes. For a given number of threads, the partition of the shortest-path search work among the threads can be done statically: each thread is responsible for finding its own set of shortest paths. For detailed information about the algorithm, the interested reader may refer to [6].
The latency of this implementation comes only from the initialization of the program, that is, the creation and joining of the threads.
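A minimal sketch of the static partition follows, with a plain O(n^2) Dijkstra per source vertex; the graph, the sizes and all names are illustrative assumptions, not the paper's code.

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NV 256                 /* number of vertices (illustrative) */
    #define P  4                   /* number of threads (illustrative)  */

    static int w[NV][NV];          /* edge-weight (path) matrix         */
    static int dist[NV][NV];       /* dist[s][v]: shortest s -> v       */

    /* Plain O(NV^2) Dijkstra from source s; writes one row of dist. */
    static void dijkstra(int s)
    {
        int done[NV] = {0};
        for (int v = 0; v < NV; v++) dist[s][v] = w[s][v];
        dist[s][s] = 0;
        for (int it = 0; it < NV; it++) {
            int u = -1;
            for (int v = 0; v < NV; v++)
                if (!done[v] && (u < 0 || dist[s][v] < dist[s][u])) u = v;
            done[u] = 1;
            for (int v = 0; v < NV; v++)
                if (dist[s][u] + w[u][v] < dist[s][v])
                    dist[s][v] = dist[s][u] + w[u][v];
        }
    }

    /* Static partition: thread id handles sources id, id+P, id+2P, ...
     * No locks or barriers are needed after this point, which is why
     * the only latency comes from thread creation and join. */
    static void *search(void *arg)
    {
        for (int s = (int)(long)arg; s < NV; s += P) dijkstra(s);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[P];
        for (int i = 0; i < NV; i++)
            for (int j = 0; j < NV; j++)
                w[i][j] = (i == j) ? 0 : 1 + rand() % 9;  /* complete graph */
        for (long i = 0; i < P; i++)
            pthread_create(&tid[i], NULL, search, (void *)i);
        for (int i = 0; i < P; i++)
            pthread_join(tid[i], NULL);
        printf("dist[0][NV-1] = %d\n", dist[0][NV - 1]);
        return 0;
    }
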
5 Miss latency and scalability prediction
In a large shared-memory multiprocessor, memory modules are distributed, and an interconnection network connects them to construct a shared-memory computing environment. Latency here is the time delay of any non-local cache/memory access through the interconnection network. Since all non-local accesses are caused by local access misses, we call this latency the miss latency. The miss latency in large-scale shared-memory computing is a major performance bottleneck for parallel processing. The major miss latency sources are non-local cache/memory searching/access, synchronization locks and barriers, cache coherence overheads, hot spots, and the management of cache/memory localities. These are important program events with heavy network activity. We quantitatively measure, evaluate and analyze the miss latency through an execution pattern study. The program miss latency is a function of the number of processors used for computing, and quantitatively describes how the network delay of an application program changes as the number of processors increases. Using the execution measurement results, the program miss latency function can also provide an upper bound latency for the program to achieve its maximum speedup with the maximum number of processors. We also use the miss latency as an important factor to evaluate and predict the scalability of application programs on a network-based shared-memory architecture.
Latency and locality analysis on the KSR-1
We ran the three SPLASH programs on the KSR-1 for latency and locality pattern analysis, in order to gain further understanding of the communication patterns inherent in the programs and the architecture. All performance data are collected by the hardware monitor built into the KSR-1. Each KSR-1 processor contains an event monitor unit (EMU) designed to log various types of local cache events and intervals; the job of the EMU is to count events and elapsed time. The idle time is called CEU stall time in the KSR-1 system, where the CEU is the Cell Execution Unit. During an idle period the CEU is stalled, and therefore the Floating Point Unit (FPU) and Integer Processing Unit (IPU) are also stalled. The CEU stalls when the following scenarios occur:

data subcache miss: The CEU requests data from its subcache, but the data is not there, so a data subcache miss occurs. It takes 23 cycles for the subpage to be transferred from the local cache to the subcache via the Cache Control Unit (CCU). Additional stall cycles can occur if a write-back takes place, since block descriptors must be created before the data is transferred; they can also occur if some other processor is requesting data from this subcache.

cache subpage miss: The data is not in the subcache, and the subpage is not in the local cache either; the Cell Interconnect Unit (CIU) sends a message to the directories at this ring level or an upper ring level requesting the subpage. This operation takes about 140-150 cycles on the local ring, plus the 23 cycles of the original subcache miss. A request satisfied from a ring at the upper level takes about 600 cycles.
page miss: The data is not in the subcache, and the subpage is neither in the local cache nor in the directories at the local or upper level. The operating system needs to create a new page descriptor. This operation takes 163 cycles.
cache ins instruction time: The operating system inserts instructions during program execution for cache instruction misses.

io ins instruction time: The I/O processor inserts instructions during program execution for I/O instruction misses.

We divide the idle time into three parts: cache miss time, including the data subcache miss and the cache subpage miss; page miss time; and system instruction miss time, including the cache ins instruction time and the io ins instruction time. Since system instruction misses happen only occasionally, they account for a very small percentage of the idle time. Cache miss time is often a large part of the idle time, and page miss time also contributes considerably. Figure 2 shows that over 80% of the total computing time of an MP3D execution is idle time, spent waiting on replacements for cache data, page data and system instruction misses. In contrast, Figures 3 and 4 present higher locality patterns in the executions of the Water and Cholesky programs, respectively. Their processor idle times are significantly lower because the numbers of the various misses are low; therefore their effective computing times are much higher than that of the MP3D program.
Another important factor in program locality is the frequency of non-local data movement through the rings; this frequency is high when the processor locality of a program is low. Figure 5 presents the three frequency curves of non-local data movement, in packets per second, for the three application programs on different numbers of processors. A communication packet in the KSR-1 is the size of a subpage (128 bytes). The data movement frequency of the MP3D program is significantly higher than those of the Water and Cholesky programs. The high frequency of the MP3D program results from low processor locality and a large number of cache and page misses during execution. Traces of both the execution time distributions and the frequencies of non-local data movement indicate that the Water and Cholesky programs have high hierarchical localities, which can be well exploited by the KSR-1 architecture. Figure 6 presents the average program miss latencies for the three programs, and confirms that the Water and Cholesky programs have high hierarchical localities. The miss latencies of these two programs do not necessarily grow with the number of processors used; thus, the average distance between an arbitrarily chosen pair of communicating processors does not necessarily grow with the number of processors. This is because communications in executions of these two programs are often conducted between "nearby" processors, rather than arbitrary processors such as two processors on different rings. In contrast, the miss latency of the MP3D program is not only significantly higher, but also grows monotonically with the number of processors.

Table 3: The general cache/memory access patterns for the MP3D program.
Execution-driven simulations and tracing on the Cerberus
Performance measurements are limited in their ability to provide insight into the dynamic execution patterns of application programs, because it may be impossible to capture execution activities at lower system and architecture levels; in addition, measurements can only be used for performance evaluation of an existing system. In order to obtain detailed execution patterns of application programs, and to study the effects of some important system modifications, we conducted execution-driven simulations and tracings of the application programs on large-scale shared-memory multiprocessors. The architecture we used is the simulated cache coherent multistage interconnection network architecture provided by the Cerberus simulator, and the target parallel program is MP3D. The problem input size was reduced from 50,000 molecules to 3,000 molecules in order to obtain an affordable simulation time. This reduction does not change the program structure, because the structure is independent of the program input size. Before presenting the detailed execution-driven simulation results, we present the execution time measurements of the MP3D program run on the simulated architecture with different numbers of processors. Table 2 lists the execution times in CPU cycles, using simple lock and simple barrier schemes; the minimum execution time for the program is printed in bold font. As the number of processors increased to 16, the execution time reached its minimum, after which adding processors caused the program to take longer to complete. In the following sections, we present our execution-driven simulation results to provide insight into this performance by investigating the primary factors affecting the parallel scalability of this application program.
Memory access patterns and cache coherence effects
The memory access characteristics of the MP3D program running on the simulated architecture were traced, including shared and private data accesses, access frequencies, data movement, the data locality effects of cache coherence, and other related effects. Table 3 shows the general memory access information for the MP3D program running on the architecture with different numbers of processors. It lists the total number of remote cache/memory accesses through the network, the overall access-hit rate (read/write), the barrier access-hit rate and the lock access-hit rate. The simulation results in Table 3 show that the general access-hit rates, and the rates of access hits caused by barriers and locks, change only slightly as the number of processors is increased. However, the number of remote cache/memory accesses increases significantly with the number of processors. This is because increasing the number of processors computing the MP3D program generates more tasks, so the number of remote cache/memory accesses for task scheduling, data access and data invalidation increases accordingly. In addition, increasing the number of processors generates more processes to be synchronized at barriers and locks, so the number of remote cache/memory accesses for synchronization also increases accordingly. Figure 7 confirms that the increase in the number of remote accesses in the MP3D program comes almost exclusively from the synchronization barriers and locks as the number of processors is increased, while the number of remote accesses for exclusive computing remains at an almost constant level. Figure 8 gives the distribution of average access-hit rates among the four groups of multiprocessor sets used to execute the MP3D program. The traced data show that the average cache access-hit rate of the system is independent of the number of processors used. This result also indicates that the data access structure of MP3D on this multistage interconnection network architecture is reasonably regular.

Table 4: The general cache invalidation patterns for the MP3D program.

Table 4 gives the general cache invalidation information. The average invalidation width is the average number of cache copies invalidated per invalidation. The invalidation width changes only slightly as the number of processors changes; however, the number of invalidations increases significantly with the number of processors. Table 4 also lists the ratio between the number of invalidations caused by barriers (barrier invalidations) and the total number of invalidations in the computation (program invalidations), and the ratio between the number of invalidations caused by locks (lock invalidations) and the program invalidations. The large increase in the number of program invalidations comes mainly from the large increase in the number of barrier invalidations as the number of processors is increased; the number of lock invalidations increases only moderately. The percentage of invalidations caused by synchronization barriers/locks increases from 67% to 87% as the number of processors is increased from 8 to 64. Figure 9 shows the invalidation distributions for the MP3D program. The distributions are dominated by single invalidations, and as the number of processors is increased, the invalidation distribution remains essentially the same.
This result is very similar to the cache invalidation distribution pattern reported by Gupta and Weber [4], where the MP3D program is executed on another simulated shared-memory architecture, a multiprocessor with shared memory partitioned among the processing nodes, infinite caches, and a directory-based cache coherence protocol.
Miss latencies in executions
The simulator also conducts two exclusive miss latency measurements. The synchronization miss latency gives the average remote access delay caused exclusively by the synchronization barriers/locks in the program. The computing miss latency gives the average remote access delay caused exclusively by computation, without any synchronization involved. Figure 10 plots the three miss latency curves of the MP3D program running on up to 64 processors. The major source of network delay is the synchronization barriers/locks in the program. The average program miss latency curve indicates that the upper bound latency for the architecture to achieve the maximum speedup of the program is 42 cycles using 16 processors, because 16 is the largest number of processors at which the maximum speedup is achieved. The computing miss latency curve can be used as an optimistic bound for the computation, assuming there were no synchronization in the computation. In this optimistic case, the execution time would reach its minimum at about 30 processors, because that is where the computing miss latency reaches the critical latency of the program (42 cycles). The synchronization miss latency curve can be used as a pessimistic bound for the computation, assuming the program were dominated by synchronization barriers/locks. In that case, the execution time would reach its minimum with fewer than 8 processors, because the synchronization miss latency at 8 processors (56 cycles) is already larger than the critical latency of the program (42 cycles).
The synchronization effects can be reduced by increasing the input size of the MP3D program: the number of barriers and locks in the program is independent of the size of the problem, while the size of the local computation on each processor grows. As Figure 7 shows, the number of remote accesses for exclusive computing is independent of the number of processors used. However, the miss latency of exclusive computing increases significantly as the number of processors increases; for example, Figure 10 shows that the miss latency of exclusive computing for the MP3D program increases by a factor of five (from 25 cycles to 124 cycles) as the number of processors is increased from 8 to 64. In other words, no matter how efficient the parallel program is, the built-in latency of the architecture, which is a growing function of the number of processors, will only let the program scale to a certain point, for example, up to 30 processors for this MP3D program on the Cerberus simulator.
An experimental metric of scalability prediction
Our execution pattern case studies indicate that the major sources of overhead on a network-based shared-memory multiprocessor architecture, such as synchronization locks/barriers, hot spots, cache invalidations, and remote search/access, can be quantitatively identified through the miss latency. Therefore, the miss latency is a primary factor for measuring the scalability of a parallel program on a network-based shared-memory architecture. The program miss latency is closely related to the program structure and the architecture used. Since a synchronization-free program in general has low network activity and minimal effects from the architecture, the computing miss latency (the latency caused exclusively by the computing of the program) can be used for comparing scalabilities among different architectures. For example, the computing miss latencies of the same program running on different architectures may be used as a factor to distinguish the scalabilities of the architectures: the higher the miss latency, the lower the scalability of the architecture. The program miss latency may also be used to determine and predict program scalability. By measuring the program execution times on an architecture with different numbers of processors, together with the program miss latencies, the upper bound miss latency for reaching the minimum execution time can be determined, as sketched below. This upper bound may then be used to justify changes to the program structure, such as increasing the input size of the program, for the purpose of scaling the program to a larger number of processors.
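A minimal sketch of this prediction procedure, assuming the execution times and program miss latencies have already been measured at several machine sizes (the arrays below hold placeholder values, not measurements from the paper):

    #include <stdio.h>

    /* Given execution times and program miss latencies measured at several
     * machine sizes, report the machine size with minimum execution time
     * and the miss latency observed there, which serves as the upper bound
     * latency for scalability prediction. */
    int main(void)
    {
        int    procs[]  = {   4,    8,   16,   32,    64 };
        double time_s[] = { 9.0,  5.1,  3.9,  4.4,   6.2 };  /* seconds */
        double lat_c[]  = { 20.0, 30.0, 42.0, 70.0, 120.0 }; /* cycles  */
        int m = sizeof procs / sizeof procs[0], best = 0;

        for (int i = 1; i < m; i++)
            if (time_s[i] < time_s[best]) best = i;  /* minimum time */

        printf("minimum time at %d processors; upper bound latency = %.0f cycles\n",
               procs[best], lat_c[best]);
        return 0;
    }
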
We propose an experimental metric, called the miss latency metric, for measuring and predicting a program's ability to effectively utilize an increasing number of processors. The proposed miss latency metric consists of the following steps for studying the scalability of a program on a shared-memory multiprocessor architecture: (1) measure the program execution times on different numbers of processors; (2) measure the corresponding program miss latencies; (3) determine the upper bound miss latency, the latency observed at the machine size where the execution time reaches its minimum; and (4) use this bound to predict the largest number of processors to which the program will scale. The measured and predicted scalability factors of the MP3D program on the Cerberus are listed in Table 5.

Table 6: The measured and predicted scalability factors for the three programs on the KSR-1.
Using the same experimental metric, we can predict the scalability of each of the three programs. The MP3D program would scale to at most 8 processors, with very limited execution time reduction. We expect both the Water and the Cholesky programs to scale to a large number of processors, provided the sizes of the problems are also increased on the KSR-1. Table 6 lists the measured and predicted scalability factors of the three programs on the KSR-1.
6 Measuring and evaluating the scalabilities by using the latency metric
In the previous section, we showed by experiment that various latencies are major factors affecting program performance and scalability. In this section, we measure and evaluate the scalabilities using the latency metric proposed in Section 3. The latency metric is mainly concerned with the average latency increment when both the problem size and the machine size are adjusted to keep the efficiency constant. Below, we present the measurement methodology of the latency metric, and report the scalability measurements and evaluation on the KSR-1.
6.1 Measuring methodology of the latency metric
Before measuring and evaluating the latency, we need to experimentally determine both the problem sizes (W and W') and the system sizes (N and N') for a given efficiency constant E. After that, the E-conserved latencies L(W, N) and L(W', N') can be either calculated or measured to determine the scalability. Figure 11 gives the basic testing process for determining the problem size for a given efficiency on a given number of processors.
Effectively and precisely determining the average latency L(W, N) is the key to the measurement and evaluation of the scalability. The following three methods can be used (a sketch combining the first method with the testing process of Figure 11 is given after the three methods).

Method 1. The average latency can be calculated from the measured sequential and parallel execution times, using (3.6) with t_c W = T_seq(W):

L(W, N) = T_para(W, N) - T_seq(W) / N,

and the corresponding efficiency, E = T_seq(W) / (N T_para(W, N)), may be used as a reference to adjust the problem and system sizes under an E-conserved condition. This method is simple, and does not require any special performance monitoring tools to trace program and architecture latency behavior. There are two disadvantages to this method. First, the sequential time T_seq(W) may not be obtainable for large application programs, due to the limited memory space and computing power of a single processor in a parallel system. Second, the average latency determined by the above formula is an approximate value, and may not be precise enough.
Method 2. The average latency data can be measured and collected by software instrumentation, which inserts trace code at important points such as synchronization, remote accesses and cache coherence operations in the program. The advantage is that the latency data can be obtained more precisely and flexibly; but if the overhead introduced by the instrumentation is not reasonably small, the program execution may slow down significantly.
Method 3. A hardware monitor can obtain the latency data precisely with low overhead. We used Pmon, a hardware monitor on the KSR-1, to collect the latency data.
The relationship between the efficiency and the parallel computing time can be determined by

E = 1 - L(W, N) / T_para(W, N),

which may be used as a reference for adjusting the problem and system sizes under an E-conserved condition in Methods 2 and 3.
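The following sketch combines Method 1 with the testing process of Figure 11: it searches for the problem size W that holds the efficiency near a target E, then evaluates the latency metric (3.7). The timing functions are stand-in analytical models (assumptions for illustration); in real use, t_seq and t_par would run and time the actual program.

    #include <stdio.h>

    /* Stand-in timing models (assumptions, not measurements): T_seq = tc*W
     * and T_para follows (3.6) with an assumed latency L(W,N) = a*N. */
    static const double tc = 1e-6, a = 2e-4;

    static double t_seq(double W)        { return tc * W; }
    static double t_par(double W, int N) { return tc * W / N + a * N; }

    /* Method 1: L(W,N) = T_para - T_seq/N and E = T_seq / (N * T_para). */
    static double latency(double W, int N)    { return t_par(W, N) - t_seq(W) / N; }
    static double efficiency(double W, int N) { return t_seq(W) / (N * t_par(W, N)); }

    /* The testing process of Figure 11: adjust W until the efficiency is
     * within tol of the target E (the E-conserved condition). */
    static double econserved_W(int N, double E, double tol)
    {
        double W = 1000.0;
        while (efficiency(W, N) < E - tol) W *= 1.05; /* too slow: more work */
        while (efficiency(W, N) > E + tol) W /= 1.05; /* overshoot: shrink   */
        return W;
    }

    int main(void)
    {
        double E = 0.25, tol = 0.01;
        int N = 8, N2 = 16;
        double W  = econserved_W(N,  E, tol);
        double W2 = econserved_W(N2, E, tol);
        /* The latency metric (3.7): scale(E,(N,N')) = L(W,N) / L(W',N'). */
        printf("W = %.0f, W' = %.0f, scale(%.2f,(%d,%d)) = %.3f\n",
               W, W2, E, N, N2, latency(W, N) / latency(W2, N2));
        return 0;
    }

With the assumed latency L(W, N) = a N, the sketch reports scale(0.25, (8, 16)) near 0.5, agreeing with the analytical example f(N) = c N given after (3.20).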
6.2 Measurement of the scalabilities using the latency metric
We measured and evaluated the scalabilities of three application programs, Quick Sort, Gauss Elimination and All Pairs Shortest Path, using the latency metric on the KSR-1. We show that the degree of program scalability depends on the program latency, which is affected by program structure, program locality and program task granularity. A cache/memory miss latency, abbreviated as miss latency, is the average time between when a remote cache/memory access (read/write) is requested and when the desired access operation completes. As described in the previous section, the miss latency covers the overheads of synchronization, cache coherence, remote data accesses and other events with heavy network activity. We used Pmon, the hardware monitor on the KSR-1, to trace the total number of access-miss operations and to calculate the average miss latency of a program run on different numbers of processors. We call this measured latency the program miss latency. The program latency measurements for the three application programs are plotted in Figures 12-14, which show that the program miss latency is a function of the number of processors used for computing; it also quantitatively describes the change in network delay for an application program as the number of processors is increased.
The scalability results are listed in Tables 7-12.
Scalability and program structures
We implemented two versions of Dijkstra's All Pairs Shortest Path (APSP) algorithm. The first, denoted APSP1, includes a parallel initialization part in the code; the second, denoted APSP2, does not. In order to keep the efficiency E at a constant 0.25, the problem size of the APSP1 program was adjusted from W = 12 on 2 processors up to W = 229 on 60 processors. For the same efficiency constant, the problem size of the APSP2 program was adjusted from W = 10 on 2 processors up to W = 259 on 60 processors.
The measured program latency curves plotted in Figure 12 indicate that the program with parallel initialization became effective and had lower average delays when more than 48 processors were used.
The latency scalability results measured on the KSR-1, given in Table 7 and Table 8, show that the APSP version with parallel initialization is more scalable than the one without it in most cases. The measurements give an example of how program structures can affect computing scalability.
Scalability and program locality
Two versions of the Gauss Elimination program, denoted GE1 and GE2, were implemented on the KSR-1; GE2 exploits more processor locality than GE1 does. The basic difference between the two programs is as follows. In an iteration of GE1, each processor is responsible for the elimination of a new row which was eliminated by another processor in the previous iteration, so the processor must first access the new row remotely and then do the elimination. In GE2, the elimination of a row is statically scheduled onto a fixed processor, so that the number of remote accesses is reduced and the locality of the program is enhanced. In order to keep the efficiency E at a constant 0.25, the problem size of the GE1 program was adjusted from W = 59 on 2 processors up to W = 3800 on 32 processors. For the same efficiency constant, the problem size of the GE2 program was adjusted from W = 55 on 2 processors up to W = 1970 on 32 processors and W = 3317 on 60 processors.
Unfortunately, GE1 could only be scaled up to 32 processors on the KSR-1. The measured program latency curves plotted in Figure 13 indicate that the program with better locality has significantly lower average network delays.
The scalabilities of GE1 and GE2 calculated from the measured program latencies are listed in Table 9 and Table 10, which show the effect of locality on computing scalability.
The effects of task size in dynamic scheduling
Two versions of the Quick Sort algorithm were implemented: QS1 with a task size of 5000 elements and QS2 with a task size of 2500 elements. In order to keep the efficiency E at a constant 0.25, the problem size of the QS1 program was adjusted from W = 4,866 on 2 processors up to W = 9,753,184 on 32 processors. For the same efficiency constant, the problem size of the QS2 program was adjusted from W = 3,483 on 2 processors up to W = 53,769,503 on 32 processors.
The measured program latency curves plotted in Figure 14 indicate that the program with the larger task size (QS1) had significantly lower average network delays.
Table 10: The scalability of GE2, keeping the efficiency at a constant 0.25.

Table 12: The scalability of QS with a task size of 2500 elements, keeping the efficiency at a constant 0.25.

The calculated scalabilities of the two programs based on the measured latencies are listed in Table 11 and Table 12. As we expected, QS1 is more scalable than QS2 because QS1 uses a larger task size, which reduces the network latency.
Comparing the scalabilities among the above three sets of algorithms, we obtain the following scalability relations on the KSR-1:

APSP1 > APSP2 > QS1 > QS2 > GE2 > GE1,

where '>' means 'more scalable than'.
7 Summary
We have presented the latency metric, an experimental method for measuring and evaluating program and architecture scalability, and established the analytical relationships among the latency metric, the isoefficiency function and the isospeed metric. Using the latency metric and other experimental methods, we conducted an intensive experimental study of the scalability of application programs on two network-based shared-memory multiprocessor systems: the Cerberus simulator and the KSR-1. In this study, we quantitatively identified and evaluated the major performance bottleneck sources for parallel program scalability: non-local cache/memory searching/access, synchronization locks and barriers, cache coherence overheads, hot spots, and the management of cache/memory localities. Current work includes developing a software tool that uses the latency metric for measuring and evaluating program and architecture scalability.
