On large-scale multiprocessors, access to common memory is one of the key performance-limiting factors. Shared-memory performance depends not only on the characteristics of the memory hierarchy itself, but also on the characteristics of the memory address streams and the interaction between the two. We present a technique for multiprocessor workload construction and a family of artificial kernels, called MAD-kernels, to systematically investigate the behavior of the memory hierarchy. The measured performance is independent of any particular application or algorithm. The proposed methodology is demonstrated on two commercial shared-memory systems.
Introduction
Access to the common memory is one of the key limiting factors in the performance of shared-memory multiprocessors. Large-scale multiprocessors can encounter significant problems related to access contention for the shared-memory modules. Because such contention is potentially the limiting performance bottleneck of shared-memory architectures, it is important to understand how these performance penalties depend on the various design parameters. Such understanding will make it possible to assess the performance of systems as they scale in size and use newer memory technologies. This is essential for evaluating the future of shared-memory systems. Traditionally, three main approaches have been used for the study of multiprocessor memory performance: analytical, simulation and experimental. All three approaches are necessary because each has its own advantages and limitations. The advantage of experimental analysis is, of course, that the performance of the real system is obtained as opposed to the performance of a model of the system. Moreover, interactions may be present on a real system that affect performance and are difficult to capture in a model.
Owing to the diversity of architectural approaches to multiprocessor design, the development of working models that can provide a measure of the "actual" performance of these machines under workloads of interest can be an extremely complex, if not impossible, problem. We believe that by experimentally evaluating machines according to a limited number of wisely selected workload patterns, it is possible to expose machine weaknesses and strengths along a desired performance dimension. Of particular interest to us here is the measurement of the impact on memory performance of the different design choices that were made for a multiprocessor. Hardware design parameters include the number of processors serviced by the common memory, the degree and granularity of memory interleaving, the data-coherence protocols used and the speed ratio of the processor to memory. Among software design parameters are the granularity of computation, the distribution of shared data to the different levels of the shared-memory hierarchy and the shared-data access patterns. In this paper, we formally present an experimental framework and a family of artificial workload kernels, called Memory Access Degradation (MAD) kernels, to investigate shared-memory performance as a function of selected design parameters. It should be emphasized that these kernels are not used to model the performance of any specific application.
In parallel computers with global shared memory, parallel memory modules must be used to provide sufficient bandwidth for the processors. Multiprocessor systems differ in their design as to where the global shared memory is physically located and how they provide hardware and software so that each processor sees a uniform address space in a hierarchy of local and global random access memory.
Memory conflicts may occur when two or more processors attempt to gain access to a shared resource along the processor-to-memory path simultaneously. Several factors contribute to the performance degradation of concurrent shared-data accesses: contention for the memory path, contention for a memory module, contention for a memory location, and interconnection network bandwidth consumed to maintain data coherency over the memory hierarchy. Our primary goal here is to introduce a flexible technique to build a variety of artificial workloads using the parameters that most influence shared-memory behavior, and to experimentally determine the sensitivity of the memory performance to such parameters. The major components of our proposed experimental methodology are: (i) the definition of a unit grain of computation in multiprocessor programs; (ii) a technique for characterizing the unit grain and a controlled methodology for changing the grain attributes to systematically create a wide range of workload patterns; (iii) a competitor workload model encompassing both static and interference characteristics of concurrent execution, to evaluate the performance degradation of a unit grain under a variety of homogenous and heterogenous workload conditions; and (iv) a family of empirical kernels (MAD-kernels) built using the grain characterization technique to study the efficiency of concurrent memory accesses.
The remainder of this paper is organized as follows. Section 2 contains the details of the proposed methodology. A description of the workload parameters and performance metrics used in our experiments is included in Section 3. Section 4 describes the target machines used in the experiments. The experimental measurements are presented in Section 5. Finally, Section 6 concludes the paper.
Performance Framework

Workload Characterization and Classification
System characterization (to distinguish it from benchmarking) is a set of experiments that isolate and measure the performance response of a system to controlled workload inputs. These responses describe the system and determine its performance. The accuracy of the system characterization depends closely on the type of workloads chosen for selective assessment and how well they represent the measurement objective.
Examples of multiprocessor performance evaluation using experimental techniques are discussed in [1, 2]. Work has also been done to gather data about the behavior of real workloads on shared-memory multiprocessors [3, 4] and to use it to analyze the behavior of cache-coherent shared-memory systems [5, 6]. For our system characterization experiments, we adopt an abstract model of parametric workload specification that allows the exploration of performance over a wide spectrum of assumptions about data sharing, locality of reference, and inter-process synchronization.
It has been shown [7] that the performance of a parallel system in the short term (during one iteration, for example) can also be used to model long-term performance. We model the computation in a single process (or thread of activity), which is part of the parallel workload, as a sequence of loop iterations that may be random or deterministic. Each such loop iteration represents a single grain of computation, called a unit grain and denoted as $G$. The sequence of iterations, therefore, represents a string of grains constituting the execution profile of a single processor in a parallel program. The unit grain is the fundamental level at which all performance measurements are taken. Each unit grain $G$ is further assumed to be composed of exactly three granules: shared-memory access, local computation and synchronization (Figure 1). A shared-memory granule, denoted as $g_m$, is concerned with accessing globally shared data needed for the computation. Most often, access to globally shared data within this granule would be in concurrent-read mode, since writes to shared data must be properly guarded within critical sections in order to preserve memory access consistency. In situations where concurrent writes are legitimate and consistency preserving, however, $g_m$ could include writes to shared data. A local computation granule, denoted as $g_c$, represents the portion of the execution grain that performs CPU-bound computation using only process-private data. We assume that any shared data needed for the computation is first retrieved into a process-private area (possibly internal registers or processor cache) before being used. A synchronization granule, denoted as $g_s$, represents inter-processor synchronization in the form of mutual exclusion (using locking semaphores) to access critical sections of code wherein updates to write-shared data are performed. It could also represent synchronization operations such as event post/wait for synchronous algorithms.
This granule imposes an ordering constraint on the otherwise concurrent execution of a multiprocessor application. Using this decomposition, the unit grain $G$ is defined to be $G = (g_m, g_c, g_s)$.
A special characterization called null granule characterization, and denoted by $g_i = \emptyset$, is reserved to indicate that granule $g_i$ is absent from the unit grain. Any component granule in the definition of $G$ can be null, reflected by the alternate bypass paths shown around each granule in Figure 1. We will characterize the unit grain $G$ by choosing an appropriate characterization for each of its component granules. The choice of attributes needed to characterize each granule depends upon what aspect of the multiprocessor system performance is under study and the level of abstraction at which the analysis is to be carried out.
Granule $g_c$ contains the meaningful computations performed by a task and hence represents the operations whose overall rate should be maximized. Hence, $g_c$ must always be present in the unit grain definition. Depending on whether the granules $g_m$ and $g_s$ are present, the range of workloads represented by this technique can be categorized into four broad classes according to the mode of concurrent access to shared data: embarrassing workloads ($g_m = \emptyset$, $g_s = \emptyset$), concurrent-access workloads ($g_m \neq \emptyset$, $g_s = \emptyset$), exclusive-access workloads ($g_m = \emptyset$, $g_s \neq \emptyset$), and dual-mode access workloads ($g_m \neq \emptyset$, $g_s \neq \emptyset$).
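The four-way classification above can be sketched in a few lines of Python. This is only an illustration: the `UnitGrain` class, its field names and the use of `None` for a null granule are hypothetical choices made here, not part of the MAD kernels.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class UnitGrain:
    """Hypothetical container for G = (g_m, g_c, g_s); None marks a null granule."""
    g_m: Optional[Tuple]  # shared-memory access granule attributes, or None
    g_c: Optional[Tuple]  # local computation granule attributes (must be present)
    g_s: Optional[Tuple]  # synchronization granule attributes, or None


def classify(grain: UnitGrain) -> str:
    """Return the workload class implied by which granules are non-null."""
    if grain.g_c is None:
        # The text requires the computation granule to be present.
        raise ValueError("g_c must always be present in the unit grain")
    has_m = grain.g_m is not None
    has_s = grain.g_s is not None
    if not has_m and not has_s:
        return "embarrassing"
    if has_m and not has_s:
        return "concurrent-access"
    if not has_m and has_s:
        return "exclusive-access"
    return "dual-mode access"
```

For example, `classify(UnitGrain(g_m=(0.0, 0, 1, 8), g_c=(100,), g_s=None))` falls in the concurrent-access class.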
Workloads designed according to each of the classes above can be used to measure either a system's performance along a particular dimension or the interactions between different performance dimensions. This provides a means of observing how different factors affecting performance interact. Based on these results, one can identify critical parameters and recognize performance bottlenecks.
Experimental Framework
The definition of the unit grain provides a unit of workload specification for the computational activity in a single process (or a single thread of control). Our objective is to measure not only the static characteristics of the execution of a specified workload but also the interference characteristics that result from the run-time interactions between concurrent processes in that workload. In order to fulfill this goal of observing the interference between a set of simultaneously executing homogenous or heterogenous grains under varying degrees of parallelism, we have established an experimental structure for our measurements (see Figure 2) that consists of: one test processor (called $P_0$), a variable number, $N$, of competitor processors (called $P_1, P_2, \ldots, P_N$), and a number, $M$, of data elements that are shared by the test and competitor processors.
The test processor $P_0$ executes a unit grain called the test grain and denoted by $G_t$. Each competitor processor executes an identical copy of a unit grain called the competitor grain and denoted by $G_c$. The number of competitor processors, $N$, can be varied to control the degree of parallelism and, hence, the extent of interference among the concurrent grains.
The structure of the shared data is assumed to be a one-dimensional array consisting of $M$ elements and distributed over the memory modules in the shared address space in some pre-determined fashion. This view of the shared data is justified by the fact that any higher dimensional data structure will ultimately be translated into a one-dimensional sequence of memory addresses for the purpose of storage.
We also make the following assumptions for all our experimental measurements:
(1) The number of concurrently competing processes in our framework is less than or equal to the maximum number of available processors $N_{max}$, i.e., $N + 1 \le N_{max}$.
(2) A process, once created and attached to a processor, remains stationary, i.e., process migration is not allowed (this is ensured by making the appropriate system call to bind a process to a given processor on the Symmetry; processes are bound by default on the TC2000).
(3) The execution of a process is nonpreemptive (this is made possible on the Symmetry by running the experiments on a dedicated system with no other load; on the TC2000, allocated processor clusters are dedicated to the process allocating them).
The first assumption ensures that all the processes in a given workload are simultaneously active on different processors, thus participating in shared resource contention and producing the worst-case runtime overheads. Throughout this paper, therefore, the terms process and processor are used interchangeably. The second assumption helps eliminate the context-switch overhead that would result from process migration. The third assumption precludes any unexpected program behavior due to unpredictable process preemptions. Further, all measurements are performed on a quiescent system, enabling us to ascribe reasons for the observed losses with greater confidence.
The set of input parameters to the experiment, $I$, can be written as $I = \{N, M, G_t, G_c\}$. Note that by setting $G_t = G_c$ (or alternatively, $G_t \neq G_c$), we can create a homogenous (or heterogenous) workload. Homogenous workloads are used to characterize the loss in processing efficiency ensuing from runtime overheads when multiple identical processes cooperate to achieve a common goal (as in SPMD style computations), whereas heterogenous workloads are used to observe the interference characteristics of unrelated processes (the test and competitor grains in this case).
With a suitable selection of attributes characterizing the unit grain, the workload model parameters in $I$ allow a range of workload behaviors to be represented. The input parameters are varied systematically one parameter at a time with the changing parameter, say X, denoted by X in the experiments. Further, the input parameters are assigned either constant values, to allow a study of deterministic execution behavior, or probabilistic values (drawn from a uniform distribution), thus allowing for stochastic behavior. The effective unit grain execution time $T_G(N)$ for a concurrent workload with $N$ competitor processes is recorded as the maximum of all grain execution times for a homogenous workload, and simply as the test grain execution time for a heterogenous workload.
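The rule for recording the effective unit grain time can be stated compactly. The following Python sketch assumes the per-grain times have already been measured; the function and argument names are illustrative only.

```python
def effective_grain_time(test_time, competitor_times, homogeneous):
    """Effective unit grain execution time T_G(N).

    test_time        -- measured execution time of the test grain on P_0
    competitor_times -- measured times of the N competitor grains (may be empty)
    homogeneous      -- True when the test and competitor grains are identical
    """
    if homogeneous:
        # Homogenous workload: the slowest of all N + 1 grains determines T_G(N).
        return max([test_time, *competitor_times])
    # Heterogenous workload: only the test grain's time is recorded.
    return test_time
```

With no competitors (`competitor_times` empty) both branches return the uncontested time of the test grain.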
The Workload Emulation Kernels
Once an appropriate characterization for the unit grain has been selected, we have a method of specifying different workloads of interest by assigning suitable values to the grain attributes and the input parameters. What is needed is an emulation program that uses the workload specification to mimic the execution behavior of an asynchronous program that would demonstrate the same characteristics, namely, memory reference and synchronization patterns. The Memory Access Degradation (MAD) and Synchronization Access Degradation (SAD) kernels are a family of such emulation programs. Their only purpose is to mimic the usage of shared resources of the specified workload while keeping intact the timing relationships between the different components of the computational structure.
Each kernel is written to use a set $I$ of input parameters and generate a set of performance measures of interest by executing the emulated workload in a controlled experiment. It should be emphasized that these kernels are different from standard benchmarks. The key attribute of these kernels is that they are programs that do not perform any useful computation, but rather model the computation, memory access and synchronization structure of a class of workloads of interest. They generate synthetic loads that are designed to stress a particular aspect of the target system. The usefulness of this approach lies in the fact that:
The measured performance is not tied to any specific application. The user can design selective workloads, using the workload characterization technique provided, to generate a system characterization of interest. Since the kernels are simple, the interpretation of the observed behavior in terms of the kernel structure is easy.
A collection of such kernels can be used to quantify and compare the performance of existing, new, or experimental architectures.
The length of each observation run, $N_{itr}$, and the size of an experiment sample, $N_{repeat}$, are selected (Figure 2) based on the resolution of the clock available on the target system, the overhead of the timing function and the overhead of the loop control statements. The choice of suitable values for these two control parameters is crucial to minimizing the experimental error and the confidence interval of the measured quantities [8].
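The role of the two control parameters can be illustrated with a small timing harness. This is a minimal sketch, not the kernels' actual measurement loop: it assumes the grain is an ordinary Python callable, uses `time.perf_counter` as the clock, and subtracts the cost of an empty loop of the same length as a rough correction for loop-control overhead.

```python
import time


def time_grain(grain, n_itr, n_repeat):
    """Return n_repeat samples of the average time per grain execution.

    grain    -- zero-argument callable standing in for one unit grain
    n_itr    -- iterations per observation run (amortizes clock resolution
                and the cost of calling the timer)
    n_repeat -- number of observation runs in the experiment sample
    """
    def empty():
        pass

    samples = []
    for _ in range(n_repeat):
        t0 = time.perf_counter()
        for _ in range(n_itr):
            empty()              # loop and call overhead only
        t1 = time.perf_counter()
        for _ in range(n_itr):
            grain()              # the workload being measured
        t2 = time.perf_counter()
        # Subtract the empty-loop time, then average over the run.
        samples.append(((t2 - t1) - (t1 - t0)) / n_itr)
    return samples
```

A larger `n_itr` reduces the relative error contributed by the clock, while a larger `n_repeat` tightens the confidence interval computed from the samples.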
The experimental measurement of concurrent memory access performance using the MAD kernels is presented in this paper. The use of the SAD kernels to evaluate synchronization and barrier performance can be found in [9].
MAD Workload Parameters
It is clear that the workload used to evaluate the memory performance can have a strong influence on the results. The domain of the parameter space for investigating the shared-memory performance is prohibitively large. The attributes selected for the unit grain should help probe the memory system systematically by creating diverse sets of memory address streams to determine its sensitivity to the different workload characteristics. These workloads should not only measure the sustained memory bandwidth under different memory demands, but also highlight potential bottlenecks.
Unit Grain Characterization
The major consideration in memory system design for multiprocessors is that the memory bandwidth must match the memory demand of the processors. The effectiveness of the memory design in meeting this goal depends not only on the organization of the memory hierarchy, but also on the distribution of the shared data in the hierarchy, the memory reference pattern of the program, and the locality of memory references. In addition to temporal locality and spatial locality of references, parallel computing also makes a new type of locality, called processor locality, desirable. To keep processor locality high, unnecessary interleaving of references by more than one processor to the same memory data should be avoided. The unit grain characterization selected for probing the memory system is summarized in Table 1. The shared-memory access granule $g_m$ is characterized by a set of four attributes: $g_m = (p, d, s, m)$. The first attribute, $p$, simply indicates the probability of a shared data reference being a write access. In other words, $p = 0$ implies that all accesses are reads, and $p = 1$ implies that all accesses are writes. As mentioned earlier, writes to shared data by multiple processors are typically performed within critical sections in a mutually exclusive fashion unless the concurrent writes are guaranteed to be consistency preserving.
In our experiments, processor $P_i$ accesses addresses given by the expression $(i \cdot d + k \cdot s) \bmod M$, for $k = 1, 2, \ldots$, where $i \cdot d$ determines the initial address and $s$ is the distance between two consecutive memory references emanating from the same processor. Note that the initial address is processor-dependent and that parameter $s$ represents the spatial distribution of the sequence of addresses. Depending on how the shared data elements are distributed over the memory hierarchy, using different access strides will cause the memory request transactions to traverse different components of the processor-to-memory interconnection.
Finally, the attribute m denotes the number of memory accesses to be performed within a single memory-access granule. The value of m determines the granularity of shared data access within a grain. The main purpose of changing this attribute is to control the density of memory requests, thus highlighting the interaction between request bursts and idle periods.
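The address stream of one memory-access granule can be generated directly from the attributes $d$, $s$ and $m$. The sketch below is an illustration in Python (the function name is hypothetical); it follows the expression given above, with $k$ running from 1 to $m$.

```python
def address_stream(i, d, s, m, M):
    """Indices of the shared array touched by processor P_i in one granule.

    i -- processor index (0 for the test processor)
    d -- initial distance between the starting points of adjacent processors
    s -- stride between consecutive references of the same processor
    m -- number of shared-memory accesses in the granule
    M -- number of elements in the shared array
    """
    return [(i * d + k * s) % M for k in range(1, m + 1)]
```

For instance, `address_stream(2, 3, 4, 5, 16)` yields `[10, 14, 2, 6, 10]`, and any call with `d = 0` and `s = 0` makes every processor hit element 0 repeatedly, which is the hot-spot configuration.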
Characterization of $g_c$:
Since all the computation within granule $g_c$ operates purely on processor-private data out of a private memory space (assumed to be available locally), by our definition, the computation granule does not alter the memory interference behavior of the shared data access stream as it is external to the processor. Its only influence is in setting the memory access rate and, hence, the temporal distribution of the shared data references. We have therefore characterized the computation granule $g_c$ by a single operation count attribute: $g_c = (c)$. The attribute $c$ represents the number of computational steps performed within a unit grain, and is expressed in terms of a "basic computation unit" (BCU).
The BCU chosen for granule $g_c$ is a long floating-point expression involving the loop-count variable and other process-local variables (indicative of the compute speed of the processor).
Characterization of $g_s$:
As only the shared-memory access performance is of interest here, the null characterization was chosen for the synchronization granule, i.e., $g_s = \emptyset$. The handling of a workload with $g_s \neq \emptyset$ by the MAD kernels is described in [10].
Performance Characterization Parameters
It has been recognized for years that the single parameter Mflop/s (megaflops) is inadequate to measure the performance of a multiprocessor system, because it takes no account of the communication, synchronization and resource contention overhead inherent in the parallel execution of multiple processes. More recently, a two-parameter $(r_\infty, s_{1/2})$ description has been used [11] to characterize the floating-point performance in MIMD computing that is based on measuring the importance of the overhead of synchronizing multiple instruction streams. The parameter $r_\infty$ denotes the asymptotic floating-point performance in Mflop/s, whereas $s_{1/2}$ indicates the amount of useful arithmetic that could have been done during the time taken for synchronization. In a similar spirit, a three-parameter $(r_\infty, n_{1/2}, s_{1/2})$ description of MIMD vector computers [12] has also been used that incorporates, in addition to the synchronization overhead $s_{1/2}$, the vector startup overhead in terms of $n_{1/2}$. However, the parameters used in these characterizations assume that the overheads are constant quantities, thus accounting for only the static overheads encountered. The variation of program performance with the number of processors and the associated dynamic overheads caused by run-time interactions between processes cannot be accurately captured by such static parameters alone.
We use a hierarchical performance model to describe the net performance of a concurrent program structure that encapsulates its static as well as interference behavior. The lower level in the hierarchy, the granule level, focuses attention on each component granule of the unit grain. The effect of the static distribution of work among the granules on computational performance is captured by the three static parameters $(R_\infty, f_{1/2}, c_{1/2})$ measured at this level. Measurements at the higher level, the grain level, quantify the overheads that result from run-time interactions between concurrent instruction streams as a function of the number of interfering processes. The influence of these overheads on overall performance is described by the two interference parameters (m(N), s(N)). The degradation of performance from this peak is determined by the amount of computational work performed per shared data reference, here measured by $f$, the computation granularity $c$, and the static parameters $f_{1/2}$ and $c_{1/2}$. The value of $f_{1/2}$ measures the memory bottleneck in terms of the amount of lost work that could have been done during the time of the shared data access. Similarly, $c_{1/2}$ measures the lost work due to synchronization. Hence, they signify the cost of shared data access and synchronization in terms that are meaningful to the programmer.
If the concurrent execution of processes, represented by unit grains, on different processors were ideal (i.e., no mutual interference), then the net computational rate achieved with $N$ competitor processors would be $(N + 1) R_\infty$. However, in practice, parallel execution of cooperating processes involves contention for shared resources in hardware (memory modules, interconnection network, etc.) and software (shared lock variables). The result is runtime overheads that are dynamic in nature and that degrade the asymptotic performance beyond the inefficiencies introduced by the static parameters $(R_\infty, f_{1/2}, c_{1/2})$. It is important to know the computational cost of these overheads, because this influences the way in which parallel algorithms are designed. The memory interference m(N) for a given workload varies with $N$, and depends upon the distribution of shared data objects over the memory hierarchy and the memory reference patterns.
Similarly, the lock interference s(N) also varies with $N$, and depends on the implementation of the locking primitives, the frequency of critical section access and the amount of computation performed between consecutive critical section operations.
Given that there are $c$ BCUs computed per unit grain and $\ell$ unit grains executed per processor, the effective BCU rate with $N$ competitor processes active, $R(N)$, can be written with some
The interference parameters m(N) and s(N) for a given workload can be obtained by experimental measurements at the grain level to determine the increase in the average execution time per unit grain $G$. In this paper, we only measure and quantify performance loss due to memory access conflicts.
Quantities Measured
We measure the static parameters $R_\infty$
A value of m(N) = 1 indicates that the concurrent memory access streams are independent of each other and do not encounter any conflicts at all. A value of m(N) $\ll$ 1 reflects significant conflicts with the competitor processes leading to extremely high access latencies.
It should be emphasized that the efficiency metric is a measure of the relative performance of a workload with $N$ competitors as compared to its performance with no competitors. Similarly, the interference metric is also a relative measure in that it presents the net contention overhead as a fraction of the uncontested unit grain execution time, i.e., the number of unit grains that could have been processed during the time lost due to overheads. Thus both metrics are scaled in terms of the uncontested unit grain time $T_G(0)$. The implication of this for two workloads with identical absolute contention performance (i.e., the same net overheads) is that the one with the larger amount of work per unit grain (i.e., larger $T_G(0)$) will be considered the more efficient of the two.
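The scaling argument can be made concrete with a short Python sketch. The two definitions used here are assumptions inferred from the description above, namely efficiency as $T_G(0)/T_G(N)$ and interference as $(T_G(N) - T_G(0))/T_G(0)$; the function names are illustrative.

```python
def efficiency(t_g0, t_gn):
    """Assumed form: uncontested grain time divided by contested grain time."""
    return t_g0 / t_gn


def interference(t_g0, t_gn):
    """Assumed form: contention overhead as a fraction of the uncontested time."""
    return (t_gn - t_g0) / t_g0


# Two workloads that lose the same absolute time (5 units) to contention.
small = efficiency(10.0, 15.0)    # little work per grain
large = efficiency(100.0, 105.0)  # much more work per grain
```

Under these definitions the larger grain is rated more efficient (`large > small`) even though both lose the same absolute time, and the two metrics are tied by efficiency = 1 / (1 + interference).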
Target Systems
We have used the framework described in Section 2 to measure the memory access performance of two multiprocessors. Table 2 summarizes their architectural features and the measured characteristic times $t_c$ and $t_m$ ($t_s = 0$ since $g_s = \emptyset$ in our experiments). Note that on the TC2000, $t_m$ depends on whether the access is local or remote to a processor/memory node. The Sequent Symmetry S81 [13] is a bus-based shared-memory multiprocessor that supports up to 30 processor nodes. Each processor node consists of an Intel 80386 processor equipped with a 64 KB two-way set-associative cache. All caches in the system are kept coherent by snooping on the bus. Cache coherence is enforced by using a write-invalidate copy-back caching policy. Multiple bus transactions are pipelined so that the bus throughput can be maximized.
The BBN TC2000 parallel processor [14] is a distributed shared-memory multiprocessor. Each processor node contains a Motorola 88100 RISC processor, 4 or 16 Mbytes of memory, 16 Kbytes each of instruction and data cache, and a switch interface. Each processor can access its own memory directly, and can access the memory of any node through a switching network consisting of 8x8 bidirectional crossbar switches in a $\log_8 N$-column interconnection, where $N$ is the number of processor nodes. A reply to a given request is returned along the same path. If collisions occur at a switch node, one transaction succeeds and all others are aborted, to be retried at a later time (in hardware) by the processors that initiated them. By default, shared data are not cached on the TC2000. A user can choose to selectively cache shared data and manage its coherency explicitly.
Experimental Results
In this section, we present excerpts from the findings of the MAD-kernel experiments.
Concurrent-Access Workloads
Concurrent-access workloads, with no lock-based synchronization within the unit grain (i.e., $g_s = \emptyset$), were designed and used to characterize the impact of concurrent memory reference patterns on shared-memory performance. The increased access latencies observed in this case are due only to access conflicts in hardware and the overhead of maintaining the consistency of replicated data over the memory hierarchy. Several variants of these workloads have been used to measure and compare performance on the Sequent Symmetry and the BBN TC2000 systems.
The shared data, with M elements, were allocated using the shmalloc() call on each machine.
On the Symmetry, the data elements are interleaved across the memory modules with an interleaving granularity of 32 bytes. On the TC2000, the shared data use shared, uncached memory. If the system is configured with interleaved memory, then the shared data is interleaved. However, since the current version of the nX operating system does not support interleaving, the shared data is scattered across the allocated cluster of processor/memory nodes instead.
We conducted experiments using a number of parameter families. Each family was designed to measure the effect of a particular grain attribute on the resultant contention and, hence, unit grain efficiency. The spectrum of input parameters included both homogenous and heterogenous settings. The heterogenous parameter families were particularly useful in revealing the interactions between concurrent read and write streams, especially on cache-based systems such as the Symmetry.
Homogenous Workloads
In these experiments, the attributes for the test and competitor grains were set to be identical, i.e., $G_t = G_c$. Thus, the resultant performance degradation when concurrent grains with identical execution behavior compete was measured.
Spatial Distribution
By manipulating the stride $s$ of shared-data access, and by choosing a value of $M$ large enough to cause a complete sweep of all the memory modules, the effectiveness of the interleaving of the main memory system is probed. Changing the value of $s$, in effect, creates different spatial distributions of the memory access stream generated by each process. In Figure 3, the efficiency of both read and write accesses is shown. The observed efficiency m(N) of a given workload provides a measure of the potential increase in the memory bandwidth for that workload by a factor of (N + 1) m(N). By examining the input parameters, it can be seen that all processors start their access from the shared-data element 0 (since $d = 0$) and perform subsequent accesses with identical strides.
The Symmetry scales well for $s = 0, 2$ for read accesses. However, for $s \ge 4$, every access to a shared-data element results in a cache miss (since the cache line length is 16 bytes), forcing a memory read transaction over the bus. The bus, therefore, begins to saturate at $N = 14$. For write access, a stride of 0 with initial distance 0 causes repeated writes to the same location by all processes, resulting in a hot spot. This situation is discussed a little later. For higher stride values, although the processors trace the same sequence of addresses, their accesses become staggered over time.
Hence, writes still occur in parallel. However, for stride values of s 4, each write accesses a new cache line and requires sending an invalidation as there is no temporal locality of access. Hence, the total cache-coherency tra c increases linearly with the number of processors N for all values of s 4. The same is also true for s = 2 except that the number of invalidations may reduce by half in the best case. Consequently, the increase in the grain execution time is linear in N with m (N) decaying as 1=(1 + kN) for some constant k. This is evident from the similar downward trends in m (N) curves for all s 2 in Figure 3(a) . The TC2000 scales well for both reads and writes for all strides except s = 0, which is again a hot-spot scenario discussed later.
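The $1/(1 + kN)$ decay can be explored numerically with a short sketch. The function names and the value of $k$ used below are illustrative assumptions; $k$ is not a measured quantity from these experiments.

```python
def eta_linear_traffic(N, k):
    """Efficiency when grain execution time grows linearly in N: 1 / (1 + k*N)."""
    return 1.0 / (1.0 + k * N)


def fit_k(N, eta):
    """Solve eta = 1 / (1 + k*N) for k from a single (N, eta) observation."""
    return (1.0 / eta - 1.0) / N


k = 0.1  # illustrative value only
for N in (1, 5, 10, 20):
    eta = eta_linear_traffic(N, k)
    # (N + 1) * eta is the bandwidth factor; under this model it approaches 1/k.
    print(N, round(eta, 3), round((N + 1) * eta, 3))
```

Under this model the bandwidth factor $(N + 1)/(1 + kN)$ tends to $1/k$ as $N$ grows, so adding processors eventually yields no further bandwidth.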
The static characterization $(R_\infty, f_{1/2})$ of the memory access performance for various stride values appears in Table 3. The $f_{1/2}$ parameter for all access strides is much higher for the BBN TC2000, pointing to a large disparity between the computation and shared-memory access speeds on that system. Another interpretation is that, for a given target rate of computation, a much larger computational granularity per shared-data access is necessary on the TC2000 than on the Symmetry. Also noticeable in Table 3 is that $f_{1/2}$ is relatively insensitive to the stride of access $s$ on the TC2000. This is a consequence of the absence of data caching, which forces a majority of the data accesses to go out over the network and incur the worst-case latency. On the other hand, the parameter $f_{1/2}$ on the Symmetry is lower for $s = 0, 1, 2, 3$ than for higher values of $s$, because some of the data accesses are satisfied by the cache for $s < 4$. For $s \ge 4$, every access results in a cache miss, as the cache line size is 16 bytes on the Symmetry.
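The dependence of the cache-miss fraction on stride can be illustrated with a small sketch. It assumes 4-byte data elements (so that a 16-byte line holds four of them), an unbounded cache, and no wraparound in the access stream; the function name is hypothetical.

```python
def miss_ratio(s, accesses=1024, line_elems=4):
    """Fraction of accesses that touch a cache line not seen before.

    Assumptions of this sketch: 4 elements per 16-byte line, an unbounded
    cache (no evictions), and a stream that never wraps around.
    """
    seen = set()
    misses = 0
    for i in range(accesses):
        line = (i * s) // line_elems  # cache line holding element i*s
        if line not in seen:
            seen.add(line)
            misses += 1
    return misses / accesses


for s in (0, 1, 2, 3, 4, 8):
    print(s, miss_ratio(s))
# s = 1, 2, 3 give 0.25, 0.5, 0.75; every s >= 4 gives 1.0
```

The value for $s = 2$ is half that for $s \ge 4$, which matches the best-case halving of invalidations noted for writes.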
Memory Hot Spot
The efficiency profiles generated for a setting of $s = 0$ and $d = 0$ in Figure 3 correspond to that of a memory hot spot. In these experiments, the processors contend not only for the global interconnection network, but also for a single shared-data item. The reads on the Symmetry cache the shared-data item on the first access and operate out of the cache on subsequent accesses, thus exhibiting no degradation. However, writes to a single location by multiple processors impact processing efficiency in two major ways: the writes must be executed sequentially, with no parallelism to be exploited; and the shared location bounces between the processor caches (ping-pong effect), thus requiring an invalidation and a cache-line fill with each write. The result is a dramatic drop in efficiency. This is apparent from the extremely low value of $\eta_m = 0.025$ with just 3 processors executing concurrently, i.e., $N = 2$.
In Eq. 5.1.1, $P$ is the number of processors (it is assumed that there are an equal number of memories), $r$ is the number of network packets emitted per processor per switch cycle ($0 \le r \le 1$), and $h$ is the fraction of memory references directed at the hot spot (i.e., each processor emits packets directed at the hot spot at a total rate of $rh$).
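The expression for Eq. 5.1.1 does not appear in the text above. A bound consistent with these symbol definitions, assumed here to be the well-known hot-spot saturation limit for multistage networks with $P$ processors and $P$ memories, is:

```latex
r \;\le\; \frac{1}{1 + h\,(P - 1)}
```

As a check on this assumed form, setting $h = 1$ and $P = N + 1$ gives $r \le 1/(N + 1)$, which agrees with the value of $r_{\max}$ quoted for the hot-spot workload.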
Using the unit grain attributes, the net memory request rate per processor is given by $r' = m / T_G(0)$. If $t_{sw}$ denotes the network switch cycle time, then the memory request rate per processor per switch cycle becomes $r = m\,t_{sw} / T_G(0)$. For the workload shown in Figure 3(b), $P = N + 1$ and all accesses are to the hot spot, making $h = 1$. The maximum per-processor request rate, therefore, is limited to $r_{\max} = 1/(N + 1)$ using Eq. 5.1.1. In other words, the following

The variation of the density of memory requests of each processor is accomplished by altering the number of computation steps performed within the computation granule $g_c$. This corresponds to a shared-memory access followed by a subsequent interval of $c$ units of delay with no memory access. Figure 5 shows the improvement in unit grain efficiency that is achieved as a consequence of increasing the length of $g_c$ on the Symmetry. The effect is particularly striking for write operations, since the intervening computational delay without any bus access reduces the bandwidth demand on the bus, preventing saturation. Further, the increased computational interval signifies more work that can be performed in parallel, thus favoring efficiency.
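The way a longer computation interval relieves bus saturation can be seen in a back-of-the-envelope model. This sketch is an assumption-laden simplification, not the measured behavior: it supposes a single shared bus, a fixed bus occupancy `t_bus` per access, and a hypothetical function name.

```python
def bus_limited_efficiency(N, c, t_bus=1.0):
    """Upper bound on unit grain efficiency under a single shared bus.

    Hypothetical model: each of the N + 1 processors holds the bus for
    t_bus per access and then computes for c time units without bus use.
    """
    processors = N + 1
    demand = processors * t_bus / (t_bus + c)  # aggregate bus utilisation requested
    return 1.0 if demand <= 1.0 else 1.0 / demand


for c in (0.0, 1.0, 3.0, 7.0):
    print(c, bus_limited_efficiency(N=3, c=c))
```

In this model the bound rises with $c$ until the aggregate demand drops below the bus capacity, after which the bus no longer limits efficiency.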
Size of Shared Data
By choosing small values for $M$, all memory references on the Symmetry can be kept in the cache. In contrast, a very large $M$ can cause the cache to be flushed on each pass through the shared data.
The TC2000, on the other hand, does not cache shared data. However, varying the shared-data size on the TC2000 revealed some interesting facts. The efficiency $\eta_m$ was observed to behave identically for values of $M$ from 1 through 4. Progressive improvement in $\eta_m$ was observed for each increment of 4 in the value of $M$ (Figure 6). This would imply that the scattering of shared data by the system across cluster memory modules was done in chunks of 4 elements (i.e., 16 bytes). Thus, going from $M = 4$ to $M = 16$ (and so forth) increases the number of memory modules for which the processors contend from 1 to 4 (and so forth), leading to a decrease in contention.
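The inferred chunked placement can be sketched as a simple mapping from element index to memory module. The chunk size of 4 elements comes from the observation above; the round-robin assignment, the module count of 16, and the function name are assumptions made only for illustration.

```python
def modules_touched(M, chunk=4, n_modules=16):
    """Number of distinct memory modules holding shared-data elements 0..M-1.

    Assumes chunks of `chunk` consecutive elements are assigned round-robin
    to `n_modules` modules; n_modules = 16 is a placeholder value.
    """
    return len({(i // chunk) % n_modules for i in range(M)})


for M in (1, 4, 5, 8, 16):
    print(M, modules_touched(M))
```

Under these assumptions $M = 1$ through $4$ all map to a single module, and $M = 16$ spreads the accesses over four modules.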
Random Memory Access
Most multiprocessor memory organizations use special techniques (such as memory interleaving and skewing) to maximize the performance of uniform memory-access patterns. But the performance of the memory hierarchy under conditions that do not display such uniformities in memory access is also of interest. So, we measured the memory bandwidth under random access conditions, expressed as Words Accessed Randomly Per Second (warps), to quantify this performance. This is done using a homogenous workload consisting of only memory-access granules $g_m$ and varying its stride attribute $s$ randomly. The results of these tests are presented in Figure 7. The read and write performance on the TC2000 are comparable and appear to scale reasonably with the number of processors. The read performance on the Symmetry scales (for the number of processors used in the experiment), helped by read caching, which prevents accesses to a cached location from going out on the bus. Writes, on the other hand, place a heavier bandwidth demand on the bus due to the additional coherency traffic generated, thus causing saturation at around 13 processors.
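A single-process sketch of a warps measurement is given below. It is only an illustration of the metric: the function name and parameters are hypothetical, it does not run concurrent competitor processes as the experiments do, and in Python the interpreter overhead dominates the timing, so the absolute numbers say little about the memory system.

```python
import random
import time
from array import array


def measure_warps(M=1 << 16, accesses=200_000, write=False, seed=0):
    """Words accessed randomly per second for one process (illustrative only)."""
    rng = random.Random(seed)
    data = array("i", [0]) * M
    # Pre-generate the random indices so that their cost is not timed.
    idx = [rng.randrange(M) for _ in range(accesses)]
    total = 0
    start = time.perf_counter()
    if write:
        for i in idx:
            data[i] = 1
    else:
        for i in idx:
            total += data[i]
    elapsed = time.perf_counter() - start
    return accesses / elapsed


print(f"read : {measure_warps():.3e} warps")
print(f"write: {measure_warps(write=True):.3e} warps")
```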
Heterogenous Workloads
Using a heterogenous workload, we have investigated the interactions that occur between concurrent read and write memory access streams. In particular, we demonstrate these interactions using the following two scenarios:
(a) Case 1: the test grain performs read (write) accesses to shared data with uniform stride, while the competitor grain performs write (read) accesses with random stride;
(b) Case 2: the test grain performs read (write) accesses to shared data with uniform stride, while the competitor grain performs write (read) accesses to a single shared (hot-spot) location.
The grain efficiencies for both these cases are shown in Figure 8.
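The two scenarios can be written down as explicit workload configurations. The sketch below is one hypothetical way to encode them; the `Grain` type, its field names, and the pattern labels are illustrative and are not the parameter names of the MAD kernels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grain:
    op: str       # "read" or "write"
    pattern: str  # "uniform", "random", or "hotspot"


def heterogeneous_cases(test_op):
    """Return (test grain, competitor grain) pairs for Case 1 and Case 2."""
    other = "write" if test_op == "read" else "read"
    test = Grain(test_op, "uniform")
    return {
        "case1": (test, Grain(other, "random")),
        "case2": (test, Grain(other, "hotspot")),
    }


for name, (g_t, g_c) in heterogeneous_cases("read").items():
    print(name, g_t, g_c)
```

Calling `heterogeneous_cases("write")` yields the parenthesized variants, in which the test grain writes and the competitors read.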
Performance on the Symmetry, when $G_t$ performs read accesses, steadily deteriorates. It is markedly worse for Case 2 due to the heavy coherency traffic (cache invalidation and cache-line refilling) generated by the competitor grains while repeatedly writing to one shared location. When $G_t$ executes write accesses on the Symmetry, Case 2 corresponds to the competitor processors operating out of their private caches, thus causing no bus traffic or memory contention. Hence, virtually no degradation is experienced by the test grain. The interference from the competitor grains on the TC2000 is fairly small in both cases, owing to the much higher bandwidth of the multistage network and the non-blocking switches used.
The improvement in execution efficiency of $G_t$ on the Symmetry, for Case 2 above, as a result of introducing computational delay is shown in Figure 9. The cache-invalidation traffic on the bus generated by the competitor grains in competing for the same location reaches quiescence during
Dual-Mode Access Workloads
Workloads consisting of concurrent accesses to shared data (granule $g_m \neq \emptyset$) as well as exclusive access to shared data within critical sections (granule $g_s \neq \emptyset$) can be used to characterize the combined degradation of performance resulting from memory and lock contention. The MAD kernels, for such dual-mode access workloads, measure the incremental overhead (and therefore incremental interference) resulting from the dynamic nature of pure memory access conflicts. The overheads arising from the locking semantics of critical-section access are excluded from the measured performance degradation by transforming the shared lock variable in $g_s$ into a private lock variable and replicating it into each processor's local memory during the execution of the MAD kernels. This leaves the memory contention behavior for shared-data accesses intact, but eliminates the performance losses due to lock contention (which depends upon the implementation of the locking primitives) and queuing delay for mutually exclusive critical-section access. The lock contention and queuing delay characteristics are measured by the SAD kernels and quantified by $\eta_s(N)$. The incremental interference characterization studies, including both memory and lock interference, for dual-mode access workloads are presented in [10].
Conclusions
The performance of the shared memory organization of a multiprocessor depends not only on the characteristics of the memory hierarchy itself, but also upon the characteristics of the memory address streams and the interaction between the two. The MAD kernels described in this paper provide an effective testbed for characterizing the shared memory performance for a variety of memory access workloads. These kernels were employed to measure and compare the performance of the Sequent Symmetry and the BBN TC2000 multiprocessors.
The MAD kernels can be used either independently, to perform a detailed evaluation of the sensitivity of a shared memory organization to various memory access parameters, or in conjunction with the SAD kernels [10], to isolate the incremental overhead contribution of memory access conflicts from the total performance loss experienced by an input workload. The MAD kernels have also been used at Oak Ridge National Laboratory to perform a preliminary investigation [16] of the memory access performance of the KSR1 multiprocessor from Kendall Square Research.
