In a distributed shared memory (DSM) multiprocessor, the processors cooperate in solving a parallel application by accessing the shared memory. The latency of a memory access depends on several factors including the distance to the nearest valid data copy, data sharing conditions, and traffic of other processors. To provide a better understanding of DSM performance and to support application tuning and compiler development for DSM systems, this paper extends microbenchmarking techniques to characterize the important aspects of a DSM system. We present an experiment-based methodology for characterizing the memory, communication, scheduling, and synchronization performance, and apply it to the Convex SPP1000. We present carefully designed microbenchmarks to characterize the performance of the local and remote memory, producer-consumer communication involving two or more processors, and the effects on performance when multiple processors contend for utilization of the distributed memory and the interconnection network.
Introduction
Distributed shared memory systems are parallel processors that use high-bandwidth, low-latency interconnection networks to connect powerful processing nodes that contain processors and memory. The physically distributed memory is accessible from a single global address space, thus providing a natural and convenient shared-memory programming model. Due to the increasing gap between processor speed and memory speed, DSM systems use one or more levels of caches that are kept consistent using one of many cache coherence protocols [1, 2]. The distributed memory, caches, and coherence protocols make DSM performance very complex, which slows the development of efficient applications. Developing efficient applications and compilers for a DSM multiprocessor requires a good understanding of the performance of its hardware and parallel software environment. Performance models are also useful for conducting quantitative comparisons among different multiprocessors and selecting the best available application implementation techniques.
In this paper, we present an experimental methodology and use it to characterize the memory, scheduling, and synchronization performance of the Convex SPP1000 [3]. We extend the microbenchmark techniques used in previous studies [4, 5, 6, 7, 8] by adapting them to deal with the important attributes of a DSM system. Like previous studies, this paper analyzes the cache, memory, and interconnection network performance. Moreover, special attention is given to characterizing the communication performance, the overheads in maintaining cache coherence, and the effect of contention for local and remote resources. This characterization is useful to programmers to estimate the components of the memory access time seen by their programs. Additionally, it is useful to architects to identify strengths and weaknesses in current systems.
The next section outlines our suggested methodology for characterizing DSM performance. Section 3 gives an overview of the SPP1000. Section 4 presents the characterization of local-memory performance. Shared memory and communication performance are treated in Section 5. Sections 6 and 7 treat the scheduling and synchronization overheads, respectively. Section 8 discusses some considerations for applying this methodology to other systems, and Section 9 offers some concluding remarks.
DSM Performance
The execution time depends on the computation, scheduling overhead, load balance, and synchronization overhead. In this paper, we focus on characterizing the memory access components of the computation time, and briefly consider scheduling and synchronization overheads.
In shared-memory programs, communication is implicit through load and store instructions to the shared memory during the computation time. We present experimental methods for measuring the memory access latency under varying conditions of access distance, sharing, and traffic. The distance is determined by the level of the memory hierarchy, relative to the processor, at which the memory access is satisfied. For example, in the SPP1000 a memory access can be satisfied by a hit in the processor cache, local memory, another processor's cache in the same node, remote shared memory, or a remote processor's cache.
Due to the various overhead factors in the cache coherence protocol, the memory access latency is affected by the way that the referenced data is shared and the sequence of accesses to it. We present experiments for characterizing the performance of basic shared data access patterns for two processors, and then extend these experiments to multiple processors. We then present experiments to characterize the effects on performance when multiple processors compete for utilization of the distributed memory and interconnection network.
The scheduling overhead is the parallel environment overhead in allocating and freeing processors for parallel tasks. The example in Figure 1 shows static scheduling overhead at the start and at the end of task execution. Scheduling overhead may also exist within the execution when dynamic allocation is supported.
The synchronization overhead is the overhead in performing synchronization operations, e.g., barrier synchronization, and acquiring a lock for a critical section [9]. The synchronization overhead does not include the wait time due to load imbalance, which can easily be treated separately [10, 11].
We have used simple Fortran programs to gather timing data for the three aspects of the DSM performance. Each program calls and times a kernel 100 times and, after subtracting timer overheads, reports the minimum, average, and standard deviation of its call times. The first call time is ignored since it is significantly affected by page faults and other cold start effects. The experiments are carried out during an exclusive reservation period where the one active user process is running one of the timing programs.
We have used load and store kernels [4] to characterize the memory access performance. The load kernel is a subroutine that loads double-precision (8-byte) elements of a one-dimensional array into the floating-point registers. In the store kernel, the load instructions are replaced with stores. The kernel generates one access per processor clock unless slowed by the memory hierarchy. The array size, w in bytes, and the access stride, s in double-precision elements, can be varied. One call time consists of the w/(8s) accesses in a kernel call. Double-precision elements are used to achieve a performance characterization that is appropriate for scientific applications that frequently have long regular access patterns to double-precision arrays. Nevertheless, the performance for accessing other element types, e.g., single-precision (4 bytes), can be derived from the double-precision performance. Computers usually take the same time to transfer memory lines of the various data types, but one line may accommodate a different number of data elements for a different data type.
In this paper, the average access time is calculated as the minimum call time divided by the accesses per call, w/(8s). The transfer rate is found by dividing the number of accessed bytes per call, w/s, by the minimum call time. Usually the highest observed transfer rate occurs at s = 1 element because all the transferred data are accessed. The bandwidth through an interconnecting link is calculated by dividing the number of bytes transferred through this link by the average call time.
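As an illustration of the arithmetic just described, the following Python sketch derives the three quantities from a measured call time. It is not the authors' timing program: the function names, the clock frequency parameter `clock_hz`, and the sample numbers in the usage note are assumptions introduced here.

```python
ELEMENT_BYTES = 8  # double-precision element size used by the kernels

def kernel_metrics(w_bytes, stride, min_call_s, clock_hz):
    """Per-processor metrics from the minimum call time of one kernel.

    w_bytes    -- array size w in bytes
    stride     -- access stride s in double-precision elements
    min_call_s -- minimum call time in seconds
    clock_hz   -- processor clock frequency (an assumed input)
    """
    accesses = w_bytes // (ELEMENT_BYTES * stride)    # w/(8s) accesses per call
    bytes_accessed = ELEMENT_BYTES * accesses         # w/s bytes actually accessed
    avg_access_cycles = min_call_s * clock_hz / accesses
    transfer_rate_mb_s = bytes_accessed / min_call_s / 1e6
    return accesses, avg_access_cycles, transfer_rate_mb_s

def link_bandwidth_mb_s(bytes_through_link, avg_call_s):
    """Link bandwidth uses the average call time, not the minimum."""
    return bytes_through_link / avg_call_s / 1e6
```

For example, with hypothetical values, `kernel_metrics(1048576, 1, 0.002, 100e6)` describes a 1-Mbyte array scanned at stride 1 in 2 ms on a processor assumed to run at 100 MHz.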
The minimum call time is used in calculating the average access time and the transfer rate for a processor in order to minimize extraneous effects. The average call time is used in calculating the bandwidth through a link in order to account for concurrent link utilization by multiple processors.
Convex SPP1000
The SPP1000 was introduced in 1994 as the first implementation of the Convex Exemplar Scalable Parallel Processing architecture. The SPP1000 consists of 1 to 16 processing nodes called hypernodes. some control devices and is connected to one CTI ring through the CTI interface.
Each processor has direct access to its instruction and data caches, which are direct-mapped and virtually addressed. The processor pair shares one CPU agent to communicate with the rest of the machine. The memory of each functional block has two physical banks that are configured into three logical sections: hypernode local, subcomplex global, and interconnect cache (IC).
The hypernode local section is used for thread-private and hypernode-private data. On the SPP1000, processes run on virtual machines called subcomplexes, which are arbitrary disjoint collections of processors. The subcomplex global section is used for shared data and might be interleaved across the hypernodes of the subcomplex. The IC is used for holding copies of shared data that are referenced by this hypernode's processors, but have home addresses in the global memory section of other hypernodes in the subcomplex.
The physical memory is 8-way interleaved across the memory banks of the four functional blocks of each hypernode. Consecutive 64-byte memory lines are assigned in round-robin fashion, first to the four even banks, then to the four odd banks. A processor cache line is 32 bytes wide, and thus holds half a memory line. When a load or store kernel's stride, s, is 1 element, all the banks are accessed. However, with s > 8 elements, some banks are skipped, e.g., 4 banks are accessed with s = 16, 2 banks with s = 32, and only 1 bank with s = 64.
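The bank counts quoted above follow directly from the line size and the interleaving factor. The Python sketch below reproduces them; it assumes only that consecutive 64-byte lines rotate over the 8 banks, and the function name is introduced here for illustration. The physical order of the banks (even banks first, then odd banks) is a permutation of the rotation and does not change how many distinct banks a stride touches.

```python
LINE_BYTES = 64      # memory line size stated in the text
NUM_BANKS = 8        # 8-way interleaving within a hypernode
ELEMENT_BYTES = 8    # double-precision element

def banks_touched(stride, n_accesses=1024):
    """Count the distinct banks reached by a strided scan.

    stride is given in double-precision elements, as in the kernels.
    """
    banks = set()
    for i in range(n_accesses):
        line = (i * stride * ELEMENT_BYTES) // LINE_BYTES  # memory line index
        banks.add(line % NUM_BANKS)                        # round-robin bank slot
    return len(banks)

# Strides 1 and 8 reach all 8 banks; 16, 32, and 64 reach 4, 2, and 1.
counts = {s: banks_touched(s) for s in (1, 8, 16, 32, 64)}
```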
The Convex coherent memory controller (CCMC) provides the interface between the memory and the rest of the machine. Each CCMC is responsible for 1/4 of the memory space that a processor can address. When a processor has a cache line miss, its agent generates a line request through the crossbar to one of the four CCMCs associated with the processor's hypernode. The CCMC uses the address bits that specify the home node to select one of its three memory sections to access a valid copy; the IC section is accessed for remote-home lines, and one of the other two sections is accessed for local-home lines. If the memory attached to this CCMC does not have a valid copy, the CCMC either contacts a local agent if any of the agent's processors has the valid copy in its cache, or contacts a remote CCMC for service through its CTI ring.
Each CTI ring is a pair of unidirectional links with a peak bandwidth of 600 Mbytes/sec for each link. The CTI supports an extended version of the Scalable Coherent Interface (SCI) standard [12]. The CTI supports shared memory accesses by providing one 64-byte line in response to a global shared memory access.
The coherence data is maintained in tags associated with the 64-byte memory lines. Each tag holds the local and global sharing state of the respective line. The local state part includes the local caching status and an 8-bit sharing vector for each of the two 32-byte halves of the line. The intra-node coherence protocol is a three-state protocol [13]. The global sharing state is arranged in a doubly linked list distributed directory rooted in the home node. The tag in the home node contains the global status and a pointer to the head of the sharing list. Each IC tag in other sharing nodes contains the caching status of the line and pointers to the previous and next nodes in the list.
The experiments of this paper were carried out on an SPP1000 system that has 4 hypernodes, each with 8 CPUs and 1024 Mbytes of main memory. The operating system is SPP-UX, version 3.1.134.
Local Memory Performance
Two experiments were used to characterize the SPP1000 processor data cache and local memory performance under private access conditions. In the first experiment, the load kernel is run on one processor. In the second experiment, the store kernel is used. Figure 4 shows the average load and store access times as a function of array size.
Each experiment was repeated for strides 1, 2, 4, ..., S, and 2S elements, where 2S is the first stride with the same time per access as the previous stride. Thus, one cache line contains S elements, and results are shown for strides 1 through S. In the SPP1000, the access time at s = 8 equals the access time at s = 4, confirming that the processor cache line size is 32 bytes. The figure shows three regions:
1. Hit region (0 < w ≤ 1024 Kbytes): the array size is smaller than the cache size, and every access is a hit taking T_H cycles. T_H increases to a slightly higher level for w > 480 Kbytes due to TLB misses.
2. Transition region (1024 < w < 2048 Kbytes): some of the accesses are hits and others are misses, resulting in an average access time of T_T(s, w) cycles. The width of this region is the cache size divided by the degree of the cache associativity (assuming LRU replacement for set-associative caches).
The average access time shown in Figure 4 is calculated from the minimum call time of 100-call runs. The minimum call time for the load kernel varies slightly from one run to another. All the information shown in Table 1 is derived only from these simple kernels and the resulting graphs. The cache load and store transfer rates are calculated from the minimum call time in the hit region. The memory load and store transfer rates are found from the minimum call time for s = 1 element in the miss region.
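The stride sweep used above to find the cache line size can be sketched in Python as follows. This is an illustrative sketch, not the authors' analysis code: the function name, the 2% tolerance, and the sample timings in the usage note are assumptions introduced here.

```python
def elements_per_line(times_by_stride, rel_tol=0.02):
    """Infer S, the number of elements per cache line, from a stride sweep.

    times_by_stride maps stride (1, 2, 4, ...) to the average access time
    measured in the miss region. S is the last stride before the time per
    access stops growing, i.e. 2S is the first stride whose time matches
    the previous one. Returns None if the times never level off.
    """
    strides = sorted(times_by_stride)
    for prev, cur in zip(strides, strides[1:]):
        t_prev = times_by_stride[prev]
        t_cur = times_by_stride[cur]
        if abs(t_cur - t_prev) <= rel_tol * t_prev:
            return prev
    return None
```

With hypothetical timings such as `{1: 14.0, 2: 28.0, 4: 55.0, 8: 55.2}` (cycles per access), the function returns 4 elements, which corresponds to a 32-byte line for 8-byte elements.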
The T_T(s, w) curve is a truncated hyperbola, namely T
Shared Memory Performance
In this section, we present the shared-memory performance evaluation methodology and results. Subsection 5.1 evaluates interconnect cache performance, 5.2 evaluates shared-memory performance when 2 processors interact in a producer-consumer access pattern, 5.3 evaluates the overhead of maintaining coherence when multiple processors are involved in shared-data access, and 5.4 evaluates overall performance when multiple processors are active in accessing local and remote memory.
Interconnect Cache Performance
The interconnect cache is a dedicated section of the hypernode memory. The IC size is configurable by the system administrator, and is selected to achieve the best performance for applications that are frequently executed.
The IC in each hypernode exploits locality of reference for the remote shared-memory data (shared data with a home memory location in some other hypernode). Whenever remote shared-memory data is referenced by a processor, if there is a miss in the processor's data cache followed by a miss in the hypernode's IC, a 64-byte line is retrieved over the CTI through its home hypernode. This line is stored in the IC, and the referenced 32-byte portion is stored in the processor's data cache. Hence, additional references to this line that miss in one of the data caches of this hypernode can be satisfied locally from the IC until this line is replaced in the IC or invalidated due to an update by a remote hypernode.
To evaluate the performance of the IC, we used an experiment similar to the one used for evaluating local-memory performance. The program is run on two processors from distinct hypernodes. The first processor allocates a local shared array and initializes it. The second processor accesses the array repetitively using the load kernel. Figure 6 shows the average load time of the second processor for a variety of strides as a function of the array size.
For array sizes up through 128 Mbytes, the array fits in a 128-Mbyte IC, and the load time is similar to the local-memory load time shown in Figure 4. For larger arrays, we enter a transition region that is 128 Mbytes wide, which indicates that the IC is direct mapped. For array sizes larger than 256 Mbytes, no part of the array remains in the IC between accesses to it in two successive kernel calls, so every access to a new IC line generates an IC miss that is satisfied from the remote hypernode.
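The link between a transition region as wide as the cache and direct mapping can be checked with a small simulation. The Python sketch below is a simplified model introduced here, not a description of the IC hardware: it counts hits for one repeated sequential pass over an array in a direct-mapped cache, with sizes expressed in lines and all names chosen for illustration.

```python
def steady_state_hit_fraction(array_lines, cache_lines):
    """Hit fraction of a repeated sequential scan in a direct-mapped cache.

    Sizes are in cache lines. The first sweep warms the cache; the hit
    count of the second sweep is the steady-state value for this pattern.
    """
    resident = [None] * cache_lines     # line currently held by each slot
    hits = 0
    for _sweep in range(2):
        hits = 0
        for line in range(array_lines):
            slot = line % cache_lines   # direct mapping: one slot per line
            if resident[slot] == line:
                hits += 1
            else:
                resident[slot] = line   # miss: replace the slot's contents
    return hits / array_lines
```

In this model an array equal to the cache size always hits, an array 1.5 times the cache size hits one third of the time, and an array of twice the cache size or more never hits, so the transition region spans exactly one cache size.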
As we go from s = 1 to s = 8 elements, the average access time increases due to the increase in the number of IC misses per access. For s ≥ 8, every access is a miss. Nevertheless, the time continues to grow up to s = 32, because fewer CTI rings are used to serve these
Producer-Consumer Performance
In this subsection, we evaluate the shared-memory performance when two processors interact in a producer-consumer access pattern on shared data. For this purpose, we use a program that has the following pseudo-code:
The time of Processor 1 is the read-after-write (RAW) access time, shown in Figure 8. When Processor 1 writes into the array instead of reading, we get a third access pattern with write-after-write (WAW) access time, shown in Figure 9. This is a less common access pattern, but may occur, for example, in a false-sharing situation where two processors write into two different variables that happen to be located in the same cache line.
For the fourth access pattern, read-after-read (RAR), each processor acquires a local copy of the data. Hence, we get access times similar to the local-memory access times.
For the first three access patterns, the access time depends on the access stride, the distance between the two processors, and whether the data is cached in the other processor's cache. Since the array fits in the data cache, it is cached whenever a processor accesses it. To measure the not-cached case, we add some code to flush the cache after the kernel calls. In general, the access time increases as the stride increases due to the increase in the number of misses per reference or the decrease in the number of CTI rings used. The access time when the two processors are from different hypernodes (Far) is from 2 to 10 times larger than the access time when the two processors are from the same hypernode (Near).
Although WAR time is highly sensitive to the distance between the two processors, it does not depend on whether the data is in the other processor's cache.
In RAW, a read access is done to the local memory (Near RAW) or the remote memory (Far RAW). When the memory has a valid copy, the read access is satisfied from the memory. Otherwise, the memory data is invalid, and the valid copy is in the cache of the other processor. In the latter case, the read access is satisfied from the other processor's cache with a higher latency.
The WAW access is similar to the RAW access and starts by reading the current copy with invalidation. Once the data is in the processor' s cache, it is updated. WAW is somewhat slower than RAW due to the ring remaining busy for invalidation after the current copy is provided to the new node, which may delay the beginning of the next write. This effect is most visible for stride 32 when the same ring is used for successive writes.
The WAR and RAW access times can be used to find the shared-memory point-to-point communication latency and transfer rate. The latency is the sum of WAR and RAW access times with s = 8 elements. The transfer rate is 8 bytes divided by the sum of WAR and RAW access times with s = 1. These derived parameters are shown in Table 2 . For the cached case, far communication has about 5 times the near communication latency with about one third the transfer rate. 
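The two derived parameters can be computed as in the Python sketch below. It only restates the definitions given above; the function names, the clock frequency parameter `clock_hz`, and the sample access times in the usage note are assumptions introduced here and are not values from Table 2.

```python
ELEMENT_BYTES = 8  # one double-precision element per producer-consumer pair

def comm_latency_cycles(war_s8_cycles, raw_s8_cycles):
    """Point-to-point latency: WAR plus RAW access time at s = 8 elements."""
    return war_s8_cycles + raw_s8_cycles

def comm_transfer_rate_mb_s(war_s1_cycles, raw_s1_cycles, clock_hz):
    """Transfer rate: 8 bytes per (WAR + RAW) access time at s = 1 element.

    clock_hz converts cycles to seconds and is an assumed input.
    """
    seconds_per_element = (war_s1_cycles + raw_s1_cycles) / clock_hz
    return ELEMENT_BYTES / seconds_per_element / 1e6
```

For instance, hypothetical WAR and RAW times of 10 cycles each at s = 1 on a processor assumed to run at 100 MHz give 8 bytes per 200 ns, i.e. about 40 Mbytes/sec.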
Coherence Overhead
In this subsection, we evaluate the shared-memory performance when two or more processors perform read and write accesses to shared data. The objective here is to evaluate the coherence overhead as a function of the number of processors, p, involved in shared-memory access. For this purpose, we use a program that has the following pseudo-code:
Processor 0 does w/(8s) write accesses per iteration, and each of the other processors does w/(8s) read accesses per iteration inside a critical section. Notice that no more than one processor is active in the critical section at any time. The time spent in doing these accesses is measured for every processor and the minimum times are divided by the number of accesses to get the average access times. The time of Processor 0 is the invalidation time, shown in Figure 10 as a function of p. The times for the other processors depend on the order in which they enter the critical section. Figure 11 shows the read times for the p - 1 other processors from experiments using 24 processors. In these experiments, the processors enter the critical section in order of their IDs.
The invalidation time increases with increasing stride for the same reasons described in Subsection 5. the processors span, and is the fastest within one hypernode (8 or fewer processors). In general, the invalidation time increases in steps: it remains almost constant when the new processor is from the same hypernode, and increases when the new processor is from a new hypernode. Invalidation within one hypernode remains constant because the SPP1000 utilizes the crossbar broadcast capability. The invalidation time with s = 8 jumps from 74 cycles for 8 processors to 575 cycles for 9 processors and, in contrast with other strides, it does not increase for three hypernodes.
The first processor reads from the writer's cache, causing the memory to be updated as a side effect. The second processor's read time is thus lower since it is satisfied from the local memory; this is also true for processors 3 through 7. The read time is higher for Processor 8 since the data is not in its hypernode and must be provided remotely. When processors 9 through 15 read, they find the data in their hypernode's IC and their read time is similar to that of processors 2 through 7. This sequence repeats for each hypernode.
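The read sequence described above can be summarized as a small classification of readers by processor ID. The Python sketch below is an illustration only; it assumes that processors are numbered consecutively with 8 per hypernode, that Processor 0 is the writer, and that readers enter the critical section in ID order, and the function name and labels are introduced here.

```python
CPUS_PER_HYPERNODE = 8  # hypernode size of the measured system

def read_source(pid):
    """Return where reader `pid` (1, 2, 3, ...) finds the valid copy."""
    if pid < 1:
        raise ValueError("Processor 0 is the writer; readers start at 1")
    if pid == 1:
        return "writer cache"        # first reader; memory is updated as a side effect
    if pid < CPUS_PER_HYPERNODE:
        return "local memory"        # same hypernode as the writer
    if pid % CPUS_PER_HYPERNODE == 0:
        return "remote hypernode"    # first reader of a new hypernode
    return "interconnect cache"      # line already fetched into this hypernode's IC
```

For example, `read_source(8)` and `read_source(16)` both report a remote fetch, while `read_source(12)` reports an IC hit.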
Concurrent Traffic Effects
In each experiment presented in the previous sections, only one processor at a time is actively running a load or store kernel. Hence, the reported results represent a best case where processor accesses are not hindered by any resource contention with other accesses. We have modified some of the previous experiments to have multiple active processors. These modifications allow characterizing the aggregate memory bandwidth through the crossbar and the CTI rings, and characterizing single processor memory performance degradation due to concurrent traffic. We present here the results of two representative experiments. Figure 12 shows the aggregate local-memory bandwidth for an experiment using the load kernel. In this 
Scheduling Overhead
The scheduling time is the time needed by the parallel environment to start and finish parallel tasks on p processors. This time may include the overhead of allocating processors for the parallel task, distributing the task's executable code, starting the task on the processors, and freeing the allocated processors at task completion. In this section, we present the overhead of two aspects of the SPP1000 scheduling: static scheduling and parallel-loop scheduling.
Static scheduling overhead (SSO) occurs when scheduling a number of processors that does not change during run time. It is incurred once per task and is significant for short tasks. To evaluate this overhead, we run a simple program on a varying number of processors where each processor simply prints its task ID. Measuring the execution wall clock time is a good approximation for SSO. Figure 14 shows the range and average of the SSO for 10 runs.
Using curve fitting, the SSO in seconds is roughly approximated by:
Compared with multicomputers, the SPP1000 has a relatively short SSO, e.g., SSO on the IBM SP2 is about one order of magnitude higher [14]. The SPP1000 advantage stems from having one operating system image, with central control that swiftly allocates and starts parallel tasks. Moreover, multiple processors can share the same executable code, thus reducing the overhead of distributing the executable code.
The Convex compilers also enable parallelizing loops. During program execution, Processor 0 is always active. When it reaches a parallel loop, it activates the other available processors to participate in the loop's work. The other processors go back to idle at the parallel-loop completion. The overhead of processor activation and deactivation for a parallel loop is the parallel-loop scheduling overhead (PLSO).
To evaluate this overhead, we time a parallel loop that has one call to a subroutine consisting simply of a return statement (Null Subroutine). The loop is run 1000 times and the average time for a varying number of processors is shown in Figure 15. The PLSO is approximately proportional to the number of processors and has sudden increases when the additional processor is from a new hypernode. The PLSO standard deviation increases as p increases and becomes about 100% of the average for p > 16. This high variance is the reason for the unsteady PLSO increase for p > 16. Using curve fitting, the PLSO in microseconds is roughly approximated by: PLSO(p) = 14.3p
Synchronization Time
In a shared-memory multiprocessor, a call to an explicit synchronization routine is often needed between code segments that access shared data. When a processor reads shared data that is modified by another processor, a barrier synchronization call before the read ensures that the other processor has completed its update. The SPP1000 has several barrier synchronization subroutines, of which WAIT BARRIER and CPS BARRIER are commonly used. We have evaluated the time to call these subroutines when all the processors enter the barrier simultaneously. This was implemented by making every processor call the subroutine for many iterations, and reporting the elapsed time per call. Figure 16 shows the average WAIT BARRIER synchronization time for a varying number of processors. CPS BARRIER has similar performance.
This time shoots up to more than 1500 microseconds for 8 processors, implying high contention on the synchronization variables. For 9 or more processors, the processors are spread over two or more hypernodes with less contention, but the synchronization time increases steadily. This time is approximated by:
T_sync(p) = 7.1 p^2, for p ≠ 8.
Clearly, an inefficient implementation of the barrier is used by this subroutine, yielding large synchronization overheads when more than 7 processors are synchronized.
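To make the quadratic growth of the barrier time concrete, the fitted approximation can be evaluated as in the following Python sketch. The function name is introduced here, and the result is taken to be in microseconds on the assumption that the fit uses the same unit as the 1500-microsecond figure quoted above.

```python
def t_sync_us(p):
    """Fitted barrier time, 7.1 * p^2, assumed to be in microseconds.

    The fit explicitly excludes p = 8, where the measured time is far
    higher than the curve.
    """
    if p == 8:
        raise ValueError("the fitted curve does not cover p = 8")
    return 7.1 * p ** 2

# Doubling the processor count quadruples the estimated barrier time.
ratio = t_sync_us(16) / t_sync_us(4)
```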
Considerations for Other Systems
In this section, we discuss some considerations for applying the microbenchmarks presented in this paper to other DSM systems. One important aspect is the design of load and store kernels to measure cache miss latencies. Since the PA-7100 blocks further cache misses when it has an outstanding miss, it does not overlap the latencies of its misses. Consequently, its load and store kernels are simply sequences of load or store instructions. However, modern processors, e.g., the PA-8000 [15] , can access their caches even when there are outstanding misses. Thus, the latencies of individual misses can be overlapped. To measure the load latency, the load kernel should force the processor to have only one outstanding access, which is achievable by the following load-use Fortran kernel:
The array used in this kernel is initialized as array(i) = i - stride for each of its N elements. Consequently, this kernel scans the array backwards and has only one outstanding load because the address of each load is a function of the previously loaded value.
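The dependent-load idea behind the load-use kernel can be modeled in Python as follows. This is a hypothetical model of the access order only, not the Fortran kernel itself and not a timing tool; the function names are introduced here, index 0 is left unused so that indices match the 1-based description, and the scan is assumed to start at element N.

```python
def build_load_use_array(n, stride):
    """Initialize array(i) = i - stride for i = 1..n (index 0 unused)."""
    return [0] + [i - stride for i in range(1, n + 1)]

def load_use_scan(array, n, stride):
    """Follow the chain of dependent loads and return the visited indices.

    Each loaded value is the index of the next load, so a processor can
    have only one such load outstanding at a time.
    """
    visited = []
    tmp = n                      # assumed starting point: the last element
    for _ in range(n // stride):
        visited.append(tmp)
        tmp = array[tmp]         # the next address depends on this load
    return visited
```

With n = 16 and stride = 4, the scan visits elements 16, 12, 8, and 4, i.e. it walks the array backwards one stride at a time.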
Designing a store kernel is trickier, because the store instruction does not return data that can be used to establish a data dependency that serializes the store stream. However, for strides larger than one element, the following store-load-use kernel is capable of measuring the store miss latency. This kernel measures the store miss latency in write-allocate caches as the latency from issuing the store to completing the allocation in the cache of the missed line. The store in this kernel may hit or miss, but the load always hits. The load simply reads the address for the next store, thereby establishing the data dependency needed to serialize the store accesses.
      array(N-1) = 0
      tmp = array(N)
      do i = 2, N/stride
         array(tmp-1) = 0
         tmp = array(tmp)
      end do
Consider the array used in this kernel as being divided into consecutive groups of stride elements each. The array is initialized such that the elements in each group have the same value and point to the last element in the previous group. This kernel also scans the array backwards, and the two elements accessed by one iteration's store and load instructions must be allocated in the same cache line. The store instruction in this kernel must precede the load (otherwise, in a multiprocessor system, the load may generate a load miss, and if the memory returns a shared line, a following store to that line would request exclusive ownership).
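The group-wise initialization and the resulting store sequence can be traced with the Python model below. It mirrors the store-load-use kernel shown above using 1-based indices (index 0 unused); the function names are introduced here, and N is assumed to be a multiple of the stride with stride at least 2.

```python
def build_store_load_use_array(n, stride):
    """Every element of a group points to the last element of the previous group."""
    array = [0] * (n + 1)                  # index 0 unused
    for i in range(1, n + 1):
        group = (i - 1) // stride          # 0-based group number
        array[i] = group * stride          # 0 for the first group
    return array

def store_load_use_scan(array, n, stride):
    """Run the kernel's stores and loads and return the stored indices."""
    stores = [n - 1]
    array[n - 1] = 0                       # array(N-1) = 0
    tmp = array[n]                         # tmp = array(N)
    for _ in range(2, n // stride + 1):    # do i = 2, N/stride
        array[tmp - 1] = 0                 # store next to the element loaded below
        stores.append(tmp - 1)
        tmp = array[tmp]                   # load supplies the next store address
    return stores
```

With n = 16 and stride = 4, the stores go to elements 15, 11, 7, and 3, while the loads read elements 16, 12, 8, and 4 of the same groups.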
In modern superscalar processors, the overhead of the load and use is minimized since these processors usually perform the load concurrently with the store and bypass the load's result for quick calculation of the new address. On the PA-8000, the load-use kernel and the store-load-use kernel each take three cycles per iteration in the hit region.
Processors that support prefetching make measuring the cache miss latencies even harder. Some processors support software prefetching by providing "non-binding" load instructions that can be added by the programmer or the compiler to bring the data to the cache ahead of its use. Non-binding loads do not return data to a register. Software prefetching can be disabled by instructing the compiler not to insert prefetch instructions. Some processors support hardware prefetching where the cache controller, based on the miss history, speculates which uncached lines will be referenced in the near future and automatically generates requests to prefetch these lines. For example, the PA-7200 processor [16] prefetches the next sequential cache line on a cache miss. Since these two kernels both scan the array backwards, the next (forward) sequential line is always in the cache, thus, no sequential prefetching is done.
In addition, the PA-7200 prefetches cache lines based on hints from the load/store-with-update instructions. These instructions are usually used in loops with constant strides through arrays. They perform loads or stores that update the base register that is used in calculating the current address. The modified base is typically used to find the address in the next iteration. When a load/store-with-update instruction generates a cache miss, the PA-7200 uses the modified address to find the stride and direction for prefetching. The above two kernels have indirect addressing and do not have load/store-with-update instructions; thus, they avoid this type of prefetching.
In this illustration, array is initialized to make the kernels scan the array linearly in the backward direction. However, it can be initialized differently to avoid hardware prefetching when other prefetching algorithms are used.
When measuring bandwidth, the kernels should be designed to achieve the highest possible bandwidth. This can be achieved when the kernel does simple, forward accessing of the array, allowing as much overlapping as possible, and exploiting hardware prefetching. Thus, the above two kernels are unsuitable for measuring bandwidth. Usually, modern optimizing compilers can achieve this target with the following load kernel:
      do i = stride, N, stride
         sum = sum + array(i)
      end do
The compiler should unroll this loop sufficiently to minimize the effect of the backward branch at the loop's end, and schedule the loop to mask compute time. In some processors, the existence of the add instruction may reduce the number of outstanding misses that the processor can have. For such processors, the add instructions can be deleted from the assembly file of the compiled kernel. A store kernel can similarly be designed by replacing the reduction in the above kernel by array(i) = 0.
In an SPP1000 system, the latency of a remote access depends on the number of nodes and does not depend on the distance of the remote node. Each access has a request path and a reply path, which collectively circle one ring; therefore, the latency is not affected by the location of the remote node. However, other interconnection topologies, like mesh and hypercube, have latencies that do depend on the remote node location. Consequently, the microbenchmarks that measure their communication performance would report performance figures that depend on the relative position of the processors.
To characterize such systems accurately, we need either the ability to control processor allocation or, at least, the ability to find the physical location of an allocated thread. When the system provides the ability to control processor allocation, we suggest characterizing such systems along two dimensions: first, the distance dimension, by repeating the producer-consumer communication experiment for various distances; second, the function dimension, by performing the experiments described in this paper with the best-case allocation, in which the processors used are as close as possible.
For systems that provide the ability to find the physical location of an allocated thread, we can achieve a similar two-dimensional characterization by carrying out experiments that request all the physical processors, but activate only a subset of these processors according to their location. This approach assumes that the system does not oversubscribe processors and does not migrate threads from one processor to another.
Finally, the synchronization barrier used in these microbenchmarks should be carefully selected. Some of these microbenchmarks use synchronization barriers to coordinate the accesses of multiple processors. For example, in the producer-consumer benchmark, when Processor 0 is in the store kernel, Processor 1 is waiting on the synchronization barrier. A processor waiting on a barrier should be waiting on a cached flag so that it does not generate memory traffic that affects the accesses of the other active processor or processors.
Conclusions
In this paper, we have presented an experimental methodology for measuring the memory and communication performance of a DSM multiprocessor. We illustrated this method by carrying out a case study of four aspects of Convex SPP1000 performance: local-memory and shared-memory accessing, scheduling, and synchronization.
In local-memory access, the SPP1000 processors depend on their large data caches to reduce the cache miss rates and to achieve a small effective access latency. The processor local-memory load bandwidth is only about 50 Mbytes/sec due to the small cache line (32 bytes), the large miss latency (55.4 cycles), and the limit of one outstanding miss. This bandwidth is, however, maintained when all the processors in a hypernode are active and only drops when the accesses are not well spread over all the memory banks.
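The relation between line size, miss latency, and the single outstanding miss can be checked with a short back-of-the-envelope calculation. The sketch below is illustrative only: the function name `miss_limited_bandwidth` is hypothetical, and the 100 MHz processor clock is an assumption that is not stated in this section.

```python
def miss_limited_bandwidth(line_bytes, miss_cycles, clock_hz, outstanding=1):
    """Upper bound in bytes/sec when at most `outstanding` misses can overlap."""
    miss_seconds = miss_cycles / clock_hz
    return outstanding * line_bytes / miss_seconds

CLOCK_HZ = 100e6  # assumption: 100 MHz processor clock (not given in this section)

# 32-byte cache line, 55.4-cycle local miss, one outstanding miss (from the text)
local_bound = miss_limited_bandwidth(32, 55.4, CLOCK_HZ)
print(f"{local_bound / 1e6:.1f} Mbytes/sec")
```

Under the assumed clock rate, the bound is just under 58 Mbytes/sec, which is in the same range as the measured load bandwidth of about 50 Mbytes/sec.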
The SPP1000's IC load miss time is 410 cycles, i.e., 7.4 times a miss satisfied from the local memory. This implies that there is a high overhead when the CCMC accesses remote data. Consequently, the achieved load transfer rate between a remote memory and a processor is 15 Mbytes/sec (only 2.5% of the peak transfer rate of one CTI ring). Furthermore, unlike the crossbar, the CTI rings cannot sustain 8 active processors without loss in processor bandwidth. This result is contrary to the naive assumption that 100/2.5 = 40 processors per ring (160 processors total) can run at 15 Mbytes/sec.
We have classified shared data accessing into four patterns and presented experiments for evaluating their access times. The RAW and WAR times are used to estimate the shared-memory point-to-point transfer rate, which is 23 Mbytes/sec locally and 5.0 Mbytes/sec remotely. Varying the RAW sharing degree in Section 5.3 enabled us to characterize the invalidation time as a function of p. The SPP1000 exploits its crossbar broadcast capability for fast invalidation within one hypernode, but suffers large overheads when invalidations cross hypernode boundaries.
The static scheduling overhead of the SPP1000 is directly proportional to the number of processors and is relatively small. The parallel-loop scheduling overhead (PLSO) is also proportional to the number of processors and takes 14.3 microseconds on average to schedule each additional processor. Since the PLSO is on the order of hundreds of microseconds for tens of processors, it might not be profitable to parallelize a short loop. For a serial loop that takes $T_0$ microseconds, parallelizing it with perfect load balance gives a parallel loop that takes $T_0/p + \mathrm{PLSO}(p)$ microseconds. Hence, parallelism may be profitable if $T_0 > 14.3p$ [...] barrier subroutines is inefficient and algorithms with better time complexities are available [17].
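The loop profitability estimate above can be explored numerically. The following is a minimal sketch, assuming a purely linear overhead model with no constant term, PLSO(p) = 14.3p microseconds; the function names are hypothetical and the model is a simplification of the measured overhead.

```python
PLSO_PER_PROC_US = 14.3  # average cost per additional processor, from the text

def plso(p):
    """Assumed linear model with no constant term: PLSO(p) = 14.3 * p microseconds."""
    return PLSO_PER_PROC_US * p

def parallel_time(t0_us, p):
    """Perfectly load-balanced parallel loop time: T0/p + PLSO(p)."""
    return t0_us / p + plso(p)

def is_profitable(t0_us, p):
    """True when the modeled parallel loop is faster than the serial loop."""
    return parallel_time(t0_us, p) < t0_us
```

Under this model, a 1000-microsecond loop on 8 processors takes about 239 microseconds, whereas a 100-microsecond loop takes about 127 microseconds on 8 processors and is therefore slower than its serial version.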
We suggest that the experiment-based methodology presented in this paper can be applied to other shared-memory multiprocessors and that the resulting characterization is useful for developing and tuning shared-memory applications and compilers. We have shown that the corresponding characterization of message-passing multicomputer communication performance can also be systematically carried out [14].
