The overheads in a parallel system that limit its scalability need to be identified and separated in order to enable parallel algorithm design and the development of parallel machines. Such overheads may be broadly classified into two components. The first one is intrinsic to the algorithm and arises due to factors such as the work-imbalance and the serial fraction. The second one is due to the interaction between the algorithm and the architecture and arises due to latency and contention in the network. A top-down approach to scalability study of shared memory parallel systems is proposed in this research. We define the notion of overhead functions associated with the different algorithmic and architectural characteristics to quantify the scalability of parallel systems; we develop a method for separating the algorithmic overhead into a serial component and a work-imbalance component; we also develop a method for isolating the overheads due to network latency and contention from the overall execution time of an application; we design and implement an execution-driven simulation platform that incorporates these methods for quantifying the overhead functions; and we use this simulator to study the scalability characteristics of five applications on shared memory platforms with different communication topologies.
Introduction
Scalability is a notion frequently used to signify the "goodness" of parallel systems 1. A good understanding of this notion may be used to: select the best architecture platform for an application domain, predict the performance of an application on a larger configuration of an existing architecture, determine inherent limitations in an application for exploiting parallelism, identify algorithmic and architectural bottlenecks in a parallel system, and glean insight on the interaction between an application and an architecture to understand the scalability of other application-architecture pairs. In this paper, we develop a framework for studying the interplay between applications and architectures to understand their implications for scalability. Since real-life applications set the standards for computing, it is meaningful to use such applications for studying the scalability of parallel systems. We call such an application-driven approach a top-down approach to scalability study. The main thrust of this approach is to identify the important algorithmic and architectural artifacts that impact the performance of a parallel system, understand the interaction between them, quantify the impact of these artifacts on the execution time of an application, and use these quantifications in studying the scalability of a parallel system.
The following are the main contributions of our work: we define the notion of overhead functions associated with the different algorithmic and architectural characteristics to quantify the scalability of parallel systems; we develop a method for separating the algorithmic overhead into a serial component and a work-imbalance component; we also develop a method for isolating the overheads due to network latency (the actual hardware transmission time in the network) and contention (the amount of time spent in the network waiting for a resource to become free) from the overall execution time of an application; we design and implement a simulation platform that incorporates these methods for quantifying the overhead functions; and we use this simulator to study the scalability of five applications on shared memory platforms with different communication topologies.
Several performance metrics such as speedup [2], scaled speedup [12], sizeup [30], experimentally determined serial fraction [14], and isoefficiency function [15] have been proposed for quantifying the scalability of parallel systems. While these metrics are extremely useful for tracking performance trends, they do not provide the information needed to understand the reason why an application does not scale well with an architecture. The overhead functions which we identify, separate, and quantify in this work help us toward this end. We are not aware of any other work that separates these overheads, and believe that such a separation is very important for understanding the interaction between applications and architectures. The growth of overhead functions will provide key insights for the scalability of a parallel system by suggesting application restructuring, as well as architectural enhancements.
1 The term, parallel system, is used to denote an application-architecture combination.
There have been several performance studies that have addressed issues such as latency, contention and synchronization. The scalability of synchronization primitives supported by the hardware [3, 19], the limits on interconnection network performance [1, 21], and the performance of scheduling policies [33, 16] are examples of such studies undertaken over the years. While such issues are extremely important, it is appropriate to put the impact of these factors into perspective by considering them in the context of overall application performance. There are studies that use real applications to address specific issues like the effect of sharing in parallel programs on the cache and bus performance [11] and the impact of synchronization and task granularity on parallel system performance [6]. Cypher et al. [10] identify the architectural requirements such as floating point operations, communications, and input/output for message-passing scientific applications. Rothberg et al. [24] conduct a similar study towards identifying the cache and memory size requirements for several applications. However, there have been very few attempts at quantifying the effects of algorithmic and architectural interactions in a parallel system.
The work we present in this paper is part of a larger project which aims at understanding the significant issues in the design of scalable parallel systems using the above-mentioned top-down approach. In [29] , we illustrated this approach for the scalability study of message-passing systems. In this paper, we conduct a similar study for shared memory systems. A companion paper [28] develops a framework using the overhead functions for studying machine abstractions and impact of locality on the performance of parallel systems.
The top-down approach and the overhead functions are elaborated in Section 2. Details of our simulation platform, SPASM (Simulator for Parallel Architectural Scalability Measurements), which implements these overhead functions are also discussed in this section. The characteristics of the five applications used in this study are discussed in Section 3, details of the three shared memory platforms are presented in Section 4, and the results of our study with their implications on scalability are summarized in Section 5. Concluding remarks are given in Section 6.
Top-Down Approach
In keeping with the RISC ideology in the evolution of sequential architectures, we would like to use real world applications in the performance evaluation of parallel machines. However, applications normally tend to contain large volumes of code that are not easily portable and a level of detail that is not very familiar Such abstractions of real applications that capture the main phases of the computation are called kernels.
One can go even lower than kernels by abstracting the main loops in the computation (like the Lawrence Livermore loops [18] ) and evaluating their performance. As one goes lower, the outcome of the evaluation becomes less realistic.
Even though an application may be abstracted by the kernels inside it, the sum of the times spent in the underlying kernels may not necessarily yield the time taken by the application. There is usually a cost involved in moving from one kernel to another such as the data movements and rearrangements in an application that are not part of the kernels that it is comprised of. For instance, an efficient implementation of a kernel may need to have the input data organized in a certain fashion which may not necessarily be the format of the output from the preceding kernel in the application. Despite its limitations, we believe that the scalability of an application with respect to an architecture can be captured by studying its kernels, since they represent the computationally intensive phases of an application. Therefore, we have used kernels in this study.
It is desirable to see a performance improvement (speedup) that is linear with the increase in the number of processors (as shown by the curve for linear behavior in Figure 1 ). With increasing number of processors, overheads in the parallel system increase (as shown by the curve for real execution in Figure 1 ) causing deviation from linear behavior. The overheads may even dominate the added computing power after a certain stage resulting in potential slow-downs. Parallel system overheads may be broadly classified into a purely algorithmic component (algorithmic overhead), and a component arising due to the interaction of the algorithm and the architecture (interaction overhead). The algorithmic overhead is due to the inherent serial part [2] and the work-imbalance in the algorithm, and is independent of the architectural characteristics.
Work imbalance can result from differing amounts of work done by the executing threads in a parallel phase. Isolating these two components of the algorithmic overhead would help in re-structuring the algorithm to improve its performance. Algorithmic overhead is the difference between the linear curve and that which would be obtained (the "ideal" curve in Figure 1) by executing the algorithm on an ideal machine such as the PRAM [32]. Such a machine idealizes the parallel architecture by assuming an infinite number of processors, and unit costs for communication and synchronization. A real execution could deviate significantly from the ideal execution due to overheads such as latency, contention, synchronization, scheduling and cache effects. These overheads are lumped together as the interaction overhead. To fully understand the scalability of a parallel system it is important to isolate the influence of each component of the interaction overhead on the overall performance. For instance, in an architecture with no contention overhead, the communication pattern of the application would dictate the latency overhead incurred by it. Thus the performance of an application (on an architecture devoid of network contention) may lie between the ideal curve and the real execution curve (see Figure 1).
The key elements of our top-down approach for studying the scalability of parallel systems are as follows:
experiment with real world applications;
identify parallel kernels that occur in these applications;
study the interaction of these kernels with architectural features to separate and quantify the overheads in the parallel system;
use these overheads as a way of predicting the scalability of parallel systems.
Implementing the Top-Down Approach
Scalability study of parallel systems is complex due to the several degrees of freedom that exist in them.
Experimentation, simulation, and analytical models are three techniques that have been commonly used in such studies. But each has its own limitations. We adopted the first technique in our earlier work by experimenting with frequently used parallel algorithms on shared memory [27] and message-passing [26] platforms. Experimentation is important and useful in scalability studies of existing architectures, but has certain limitations: first, the underlying hardware is fixed making it impossible to study the effect of changing individual architectural parameters; and second, it is difficult if not impossible to separate the effects of different architectural artifacts on the performance since we are constrained by the performance monitoring support provided by the parallel system. Further, monitoring program behavior via instrumentation can become intrusive yielding inaccurate results.
Analytical models have often been used to give gross estimates for the performance of large parallel systems. In general, such models tend to make simplistic assumptions about program behavior and architectural characteristics to make the analysis using the model tractable. These assumptions restrict their applicability for capturing complex interactions between algorithms and architectures. For instance, models developed in [17, 31, 9] are mainly applicable to algorithms with regular communication structures that can be predetermined before the execution of the algorithm. Madala and Sinclair [17] confine their studies to synchronous algorithms while [31] and [9] develop models for regular iterative algorithms. However, there exist several applications [24] with irregular data access, communication, and synchronization characteristics which cannot always be captured by such simple parameters. Further, an application may be structured to hide a particular overhead such as latency by overlapping computation with communication. It may be difficult to capture such dynamic program behavior using analytical models. Similarly, several other models make assumptions about architectural characteristics. For instance, the model developed in [20] ignores data inconsistency that can arise in a cache-based multiprocessor during the course of execution of an application and thus does not consider the coherence traffic on the network.
The main focus in our top-down approach is to quantify the overheads that arise in the interaction between the kernels and architecture and its impact on the overall execution of the application. It is not clear that these overheads can be easily modeled by a few parameters. Therefore, we use simulation for quantifying and separating the overheads.
All three techniques have a significant role to play in the scalability study of large parallel systems.
We already observed some of the limitations of experimentation and modeling. Simulation also has its limitations. For example, it may not always be possible to predict system scalability with simulation owing to resource (time and space) constraints in using architectural simulators. But we believe that our simulation technique can be viewed as complementing the analytical one for the following reasons: it does not have the same limitations as the latter in that the datapoints obtained using it are closer to reality; but owing to resource constraints it may not be possible to simulate large systems. Therefore, simulation can be used to obtain several datapoints for a parallel system, which can then be used as a feedback to refine the existing analytical models to predict the scalability of larger parallel systems. Further, our simulator can also be used to validate existing analytical models using real applications.
Our simulation platform (SPASM), to be presented in the next sub-section, provides an elegant set of mechanisms for quantifying the different overheads we discussed earlier. The algorithmic overhead is quantified by computing the time taken for execution of a given parallel program on an ideal machine such as the PRAM [32] and measuring its deviation from a linear speedup curve. Further, we separate this overhead into that due to the serial part (serial overhead) and that due to work imbalance (work-imbalance overhead).
As we mentioned earlier, the interaction overhead should be separated into its component parts. We currently it is interesting to separate these two artifacts. Such systems normally provide some synchronization support which may either be as simple as an atomic read-modify-write operation, or may provide special hardware for more complicated operations like barriers and queue-based locks. While the latter may save execution time for complicated synchronization operations, the former is more flexible for implementing a variety of such operations. For reasons of generality, we assume that only the test&set operation is supported by shared memory systems. We also assume that the memory module (at which the operation is performed), is intelligent enough to perform the necessary operation in unit time. With such an assumption, the only network overhead due to the synchronization operation (test&set) is a roundtrip message, and the overheads for such a message are accounted for in the latency and contention overhead functions described earlier.
The waiting time in a processor during synchronization operations is accounted for in the CPU time, which would manifest itself as an algorithmic (serial or work imbalance) overhead. Hence, for the rest of this paper, we confine ourselves to the only two aspects of the interaction overhead that are germane to this study, namely, latency and contention.
2 We do not distinguish between the terms process, processor, and thread, and use them synonymously in this paper.
Constant problem size (where the problem size remains unchanged as the number of processors is increased), memory constrained (where the problem size is scaled up linearly with the number of processors), and time constrained (where the problem size is scaled up to keep the execution time constant with increasing number of processors) are three well-accepted scaling models used in the study of parallel systems. Each model is appropriate depending on the nature of the study. Overhead functions can be used to study the growth of system overheads for any of these scaling strategies. In our simulation experiments, we limit ourselves to the constant problem size scaling model.
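The three scaling models can be illustrated with a small sketch. This is not part of SPASM; the function name `scaled_problem_size` and the caller-supplied `time_fn(n, p)` execution-time model are hypothetical, and the time-constrained branch assumes that `time_fn` increases with `n` and that adding processors does not slow down the base problem.

```python
def scaled_problem_size(model, n0, p, time_fn=None):
    """Return the problem size to run on p processors under a scaling model.

    model   -- "constant", "memory", or "time"
    n0      -- base problem size used on 1 processor (must be >= 1)
    time_fn -- hypothetical model time_fn(n, p) of execution time; only
               needed for "time" and assumed to be increasing in n
    """
    if n0 < 1:
        raise ValueError("n0 must be at least 1")
    if model == "constant":
        return n0                      # problem size unchanged
    if model == "memory":
        return n0 * p                  # scaled linearly with processors
    if model == "time":
        if time_fn is None:
            raise ValueError("time-constrained scaling needs time_fn")
        target = time_fn(n0, 1)        # time to hold constant
        lo = hi = n0
        # Double hi until the run on p processors exceeds the target time.
        while time_fn(hi, p) <= target:
            lo, hi = hi, hi * 2
        # Binary search for the largest n whose time stays within target.
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if time_fn(mid, p) <= target:
                lo = mid
            else:
                hi = mid
        return lo
    raise ValueError("unknown scaling model: " + model)
```

With a perfectly parallel toy model such as `time_fn = lambda n, p: n / p`, the time-constrained size coincides with the memory-constrained one; with overheads that grow with `p`, it would be smaller.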
SPASM
SPASM is an execution-driven simulator written in CSIM. As with other recent simulators [5, 7, 23] , the bulk of the instructions in the parallel program is executed at the speed of the native processor (SPARC in this study) and only the instructions that may potentially involve a network access are simulated. The reader is referred to [29] for a detailed description of the implementation of SPASM. The input parameters and output statistics provided by SPASM are given below.
Parameters
The system parameters that can be specified to SPASM are: the number of processors (p), the clock speed, the hardware setup time for transmission of a message, the hardware bandwidth, the software latency for transmission of a message and the sustained software bandwidth.
Metrics
SPASM provides a wide range of statistical information about the execution of the program. It gives the total time (simulated time), which is the maximum of the running times of the individual parallel processors. This is the time that would be taken by an execution of the parallel program on the target parallel machine.
Speedup using p processors is measured as the ratio of the total time on 1 processor to the total time on p processors.
Ideal time is the total time taken by a parallel program to execute on an ideal machine such as the PRAM. It includes the algorithmic overhead but does not include the interaction overhead. SPASM simulates an ideal machine to provide this metric. As we mentioned in Section 2, the difference between the linear time and the ideal time gives the algorithmic overhead.
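As a rough illustration of how these metrics relate, the following sketch derives the speedup and the two overhead classes from three measured times. The function name and dictionary keys are hypothetical, and the linear time is taken to be the 1-processor time divided by p, which is an assumption consistent with the linear speedup curve but not a formula stated by SPASM.

```python
def overhead_breakdown(t1, t_ideal, t_real, p):
    """Relate total time, ideal time, and linear time on p processors.

    t1      -- total time on 1 processor
    t_ideal -- ideal time on p processors (ideal machine such as the PRAM)
    t_real  -- total (simulated) time on p processors of the target machine
    """
    linear_time = t1 / p  # assumption: time under perfectly linear speedup
    return {
        "speedup": t1 / t_real,
        "ideal_speedup": t1 / t_ideal,
        # deviation of the ideal machine from linear behavior
        "algorithmic_overhead": t_ideal - linear_time,
        # deviation of the real execution from the ideal machine
        "interaction_overhead": t_real - t_ideal,
    }
```

For example, with `t1 = 100`, `t_ideal = 15`, `t_real = 20` and `p = 10`, the linear time is 10, so the algorithmic and interaction overheads are 5 time units each.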
SPASM quantifies both the latency overhead function f_L(p) and the contention overhead function f_C(p) seen by a processor as described in Section 2. This is done by time-stamping messages when they are sent. At the time a message is received, the time that the message would have taken in a contention-free environment is charged to the latency overhead function while the rest of the time is accounted for in the contention overhead function. Though not relevant to this study, it is worthwhile to mention that SPASM provides the latency and contention incurred by a message as well as the latency and contention that a processor may choose to see. Even though a message may incur a certain latency and contention, a processor may choose to hide all or part of it by overlapping computation with communication. Such a scenario may arise with a non-blocking message operation on a message-passing machine or with a prefetch operation on a shared memory machine. But for the rest of this paper (since we deal with blocking load/store shared memory operations), we assume that a processor sees all of the network latency and contention. This mode is used to differentiate such accesses from normal load/store accesses.
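The time-stamping scheme described above can be sketched as follows. This is a minimal illustration and not SPASM's code: `OverheadCounters` and `charge_message` are hypothetical names, and the contention-free transfer time is passed in by the caller because how the simulator derives it is not specified here.

```python
from dataclasses import dataclass


@dataclass
class OverheadCounters:
    latency: float = 0.0     # accumulated latency overhead of one processor
    contention: float = 0.0  # accumulated contention overhead of one processor


def charge_message(counters, send_time, receive_time, contention_free_time):
    """Charge one received message to the latency and contention counters.

    send_time            -- time stamp attached when the message was sent
    receive_time         -- simulated time at which it was received
    contention_free_time -- time the message would have taken with no
                            contention (assumed to be known to the caller)
    """
    elapsed = receive_time - send_time
    if elapsed < contention_free_time:
        raise ValueError("message arrived faster than the contention-free time")
    counters.latency += contention_free_time
    counters.contention += elapsed - contention_free_time
    return counters
```

A message stamped at time 10 and received at time 18, with a contention-free time of 5, adds 5 units to the latency counter and 3 units to the contention counter.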
The total time for a given application is the sum of the execution times for each of the above defined 
Algorithmic Characteristics
This section briefly describes the characteristics of five kernels used in this study. Three of them (EP, IS and CG) are from the NAS parallel benchmark suite [4] ; CHOLESKY is from the SPLASH benchmark suite [25] ; and FFT is the well-known Fast Fourier Transform algorithm. The characteristics include the data access pattern, the synchronization pattern, the communication pattern, the computation granularity (the amount of work done) and data granularity (the amount of data communicated) for each phase of the program. EP and FFT are well-structured kernels with regular communication patterns determinable at compile-time, with the difference that EP has a higher computation to communication ratio. IS also has a regular communication pattern, but in addition it uses locks for mutual exclusion during the execution. CG and CHOLESKY are different from the other kernels in that their communication patterns are not regular (both use sparse matrices) and cannot be determined at compile time. While a certain number of rows of the matrix in CG is assigned to a processor at compile time (static scheduling), CHOLESKY uses a dynamically maintained queue of runnable tasks.
EP
EP is the "Embarrassingly Parallel" kernel that generates pairs of Gaussian random deviates and tabulates floating point random numbers is generated which are then subject to a series of operations. The computation granularity of this section of the code is considerably large and is linear in the number of random numbers (the problem size) calculated. A data size of 64K pairs of random numbers has been chosen in this study.
The operation performed on a computed random number is completely independent of the other random numbers. The processor assigned to a random number can thus execute all the operations for that number without any external data. Hence the data granularity is meaningless for this phase of the program. Towards the end of this phase, a few global sums are calculated by using a logarithmic reduce operation. In step i of the reduction, a processor receives an integer from another which is a distance 2^i away and performs an addition of the received value with a local value. The data that it receives (data granularity) resides in a cache block in the other processor, along with the synchronization variable which indicates that the data is ready (synchronization is combined with data transfer to exploit spatial locality). Since only one processor writes into this variable, and the other spins on the value of the synchronization variable (the PGM SYNC mode described in Section 2.2), no locks are used. Every processor reads the global sum from the cache block of processor 0 when the last addition is complete. The computation granularity between these communication steps can lead to work imbalance since the number of participating processors halves after each step of the logarithmic reduction. However, since the computation is a simple addition, it does not cause any significant imbalance for this kernel. The amount of local computation in the initial computation phase overshadows the communication performed by a processor. Table 1 summarizes the characteristics of EP.
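The logarithmic reduce can be illustrated with a sequential sketch in which a list stands in for the per-processor local values. The pairing rule used here (in step i, processors whose index is a multiple of 2^(i+1) receive from the processor 2^i away) is an assumption chosen so that the sum ends up at processor 0; the kernel's actual pairing and its cache-block synchronization are not modeled.

```python
def logarithmic_reduce(values):
    """Sum one value per processor; returns (total at processor 0, steps).

    values must be non-empty; values[k] is the local value of processor k.
    """
    partial = list(values)
    p = len(partial)
    steps = 0
    distance = 1  # equals 2**steps
    while distance < p:
        # Assumed pairing: receivers are multiples of 2 * distance, so the
        # number of participating processors halves after each step.
        for receiver in range(0, p, 2 * distance):
            sender = receiver + distance
            if sender < p:
                partial[receiver] += partial[sender]
        distance *= 2
        steps += 1
    return partial[0], steps
```

For 8 processors the reduction completes in 3 steps, which is the logarithmic number of communication events per global sum.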
IS
IS is the "Integer Sort" kernel that uses bucket sort to rank a list of integers which is an important operation in "particle method" codes. A list of 64K integers with 2K buckets is chosen for this study.
An implementation of the algorithm is described in [22] and for each global bucket, subtracts the value found in the corresponding local bucket, updates the local bucket with this new value in the global bucket, and unlocks the bucket (phase 7). The memory allocation for the global buckets and its locks is done in such a way that a bucket and its corresponding lock fall in the same cache block and the rest of the cache block is unused. Synchronization is thus combined with data transfer and false sharing is avoided. The final list ranking phase (phase 8) is a completely local operation using the local buckets in each processor and is similar to phase 1 in its characteristics.
FFT
FFT is the one dimensional complex Fast Fourier Transform of N (64K for this study) points. N is a power of 2 and greater than or equal to the square of the number of processors p. calculations for the rows assigned to it in the resulting matrix (which are also the same rows in the sparse matrix that are local to the processor). But the calculation may need elements of the vector that are not local to a processor. Since the elements of the vector that are needed for the computation are dependent on the randomly generated sparse matrix, the communication pattern for this phase is random. 
Architectural Characteristics
Since uniprocessor architecture is getting standardized with the advent of RISC technology, we fix most of the processor characteristics by using a 33 MHz SPARC chip as the baseline for each processor in a parallel system. Such an assumption enables us to make a fair comparison of the relative merits of the interesting parallel architectural characteristics across different platforms. Input-output characteristics are beyond the purview of this study.
We use a transmission scheme similar to the one used on the Intel iPSC/860 [13]. A circuit is set up between the source and the destination, and the message is then sent in a single packet. Message sizes can vary up to 32 bytes. We assume that the switching time for setting up a circuit (in a contention-free scenario) is negligible.
The simulated shared memory hierarchy is CC-NUMA (Cache Coherent Non-Uniform Memory Access). Each node in the system has a sufficiently large piece of the globally shared memory such that for the applications considered, the data-set assigned to each processor fits entirely in its portion of shared memory. There is also a 2-way set-associative private cache (64KBytes with 32 byte blocks) at each node that is maintained sequentially consistent using an invalidation-based fully-mapped directory-based cache coherence scheme.
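The cache geometry implied by these parameters can be worked out directly. The sketch below assumes conventional modulo set indexing on the block number, which the text does not state; the names are illustrative only.

```python
CACHE_BYTES = 64 * 1024  # 64 KBytes private cache per node
BLOCK_BYTES = 32         # 32-byte blocks
WAYS = 2                 # 2-way set-associative

NUM_BLOCKS = CACHE_BYTES // BLOCK_BYTES  # 2048 blocks
NUM_SETS = NUM_BLOCKS // WAYS            # 1024 sets


def split_address(addr):
    """Return (tag, set_index, block_offset) for a byte address.

    Assumes the set index is the block number modulo the number of sets.
    """
    offset = addr % BLOCK_BYTES
    block_number = addr // BLOCK_BYTES
    return block_number // NUM_SETS, block_number % NUM_SETS, offset
```

Under this assumption, addresses that are 32 KBytes apart fall in the same set, so at most two such blocks can reside in the cache at once. A 32-byte block is also the unit in which the kernels co-locate a data item with its synchronization variable or lock.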
Performance Results
In this section, we present the results from our simulation experiments showing the growth of the overhead functions with respect to the number of processors and their impact on scalability. The simulator allows one to explore the effect of varying other system parameters such as link latency and processor speed on scalability. Since the main focus of this paper is an approach to scalability study, we have not dwelled on the scalability of parallel systems with respect to specific architectural artifacts to any great extent in this paper. We also briefly describe the impact of problem sizes on the system scalability for each kernel. In the following subsections, we show for each kernel the execution time, the latency, and the contention overhead graphs for the mesh platform. The first shows the total execution time, while the latter two show the communication overheads ignoring the computation time. In each of these graphs, we show the curves for the individual modes of execution applicable for a particular kernel. We also present for each kernel the latency and contention overhead curves on the three architecture platforms. The latency overhead in the NORMAL mode (i.e. due to ordinary data access) is determined by the memory reference pattern of the kernel and the network traffic due to cache line replacement. With a sufficiently large cache at each node, it is reasonable to assume that this latency overhead is only due to the kernel, and thus is expected to be independent of the network topology. Due to the vagaries of the synchronization accesses, it is conceivable that the corresponding latency overheads could differ across network platforms for the other modes. However, in our experiments we have not seen any significant deviation. As a result, the latency overhead curves for all the kernels look alike across network platforms.
On the other hand, it is to be expected that the contention overhead will increase as the connectivity in the network decreases. This is also confirmed for all the kernels.
EP
The communication time in this kernel is insignificant compared to the computation time. Figures 8 and 9 show the latency and contention, respectively, for the different modes of execution of EP. Despite the growth of these overheads, they are insignificant compared to the total execution time (which is dominated by the NORMAL mode), as can be seen in Figure 7.
Figures 10, 11 and 12 show the latency and contention overheads for the three hardware platforms.
Since the number of communication events for global sums is logarithmic in the number of processors (see Section 3), the latency overhead curve exhibits a logarithmic behavior. Although there is a potential for contention overhead for the fully connected network due to message queueing at a node, the observed contention is marginal (Figure 10). With less connectivity, the contention overhead even dominates the latency overhead (Figures 11 and 12) with increasing number of processors. For instance, the cross-over point between contention and latency occurs at around 25 processors for the cube (Figure 11) and at around 16 processors for the mesh (Figure 12). As can be seen from Table 6, the constants associated with the computation time are much larger than those associated with latency and contention. Therefore, we can conclude that EP is a perfectly scalable kernel on all three hardware platforms. Even for a relatively small problem size (64K in this case), it would take nearly 1000 processors for the overheads to start dominating. While an increase in the problem size would increase the coefficient associated with the computation time, it does not have any effect on the coefficients of the other overhead functions. Hence, the scalability of EP for a larger problem size can only get better.
IS
For this kernel, there is a significant deviation from the ideal curve for all three platforms (see Figure 3) .
The overheads may be analyzed by considering the different modes of execution. In this kernel, NORMAL and MUTEX are the only significant modes of execution (see Figure 13). The network accesses in the NORMAL mode are for ordinary data transfer, and the accesses in MUTEX are for synchronization. The latency and contention overheads incurred in the MUTEX mode are higher than in the NORMAL mode (see Figures 14 and 15). As a result of this, the total execution time in the MUTEX mode surpasses that in the NORMAL mode beyond a certain number of processors (see Figure 13), which also explains the dip in the speedup curve for mesh (see Figure 3). Figures 16, 17 and 18 show the latency and contention overheads for the three hardware platforms.
In IS, since every processor needs to access the local buckets of all other processors, and since the data is equally partitioned among the executing processors, the number of accesses to remote locations grows as (p - 1)/p. This explains the flattening of the latency overhead curve for all three network platforms as p increases. On the mesh network the contention overhead surpasses the latency overhead at around 18 processors. Parallelization of this kernel increases the amount of work to be done for a given problem size (see Section 3). This inherent algorithmic overhead causes a deviation of the ideal curve from the linear curve (see Figure 3). This is also confirmed in Table 7, where the computation time does not decrease linearly with the number of processors. This indicates the kernel is not scalable for small problem sizes. As can be seen from Table 7, the contention overhead is negligible and the latency overhead converges to a constant with a sufficiently large number of processors on a fully connected network. Thus for a fully connected network, the scalability of this kernel is expected to closely follow the ideal curve. For the cube and mesh platforms, the contention overhead grows logarithmically and linearly with the number of processors, respectively.
Therefore, the scalability of IS on these two platforms is likely to be worse than on the fully connected network. From the above observations, we can conclude that IS is not very scalable for the chosen problem size on the three hardware platforms. However, if the problem is scaled up, the coefficient associated with the computation time will increase, thus making IS more scalable.
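The flattening of the latency overhead for IS follows directly from the (p - 1)/p remote-access fraction: each doubling of p adds less to the fraction than the previous one. A minimal sketch, with the function name `remote_fraction` chosen here purely for illustration:

```python
def remote_fraction(p):
    """Fraction of bucket accesses that go to other processors: (p - 1) / p.

    Assumes the data is equally partitioned among p processors, as stated
    in the text for IS.
    """
    if p < 1:
        raise ValueError("p must be at least 1")
    return (p - 1) / p


# The fraction approaches 1 from below, so the curve flattens as p grows.
for p in (1, 2, 4, 8, 16, 32, 64):
    print(p, round(remote_fraction(p), 4))
```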
FFT
The algorithmic and interaction overheads for the FFT kernel are marginal. Thus the real execution curves for all three platforms as well as the ideal curve are close to the linear one, as shown in Figure 4. On the mesh network (Figure 24), the contention overhead surpasses the latency overhead at around 28 processors. Table 8 summarizes the overheads for FFT obtained by interpolating the data points from our simulation results. With marginal algorithmic overheads and a decreasing number of messages exchanged per processor (latency overhead), the contention overhead is the only artifact that can cause deviation from linear behavior.
But with skewed communication accesses, the contention overhead has also been minimized and begins to show only on the mesh network where it grows linearly (see Table 8 ). Thus we can conclude that the FFT kernel is scalable for the fully-connected and cube platforms. 
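One way to turn simulated data points into a growth description such as "logarithmic" or "linear" is to fit each candidate shape by least squares and keep the one with the smallest residual. The sketch below illustrates that idea only; it is an assumption and not necessarily the interpolation procedure used for the tables. The names `fit_line`, `SHAPES`, and `best_shape` are hypothetical, and the sample data is synthetic.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x.

    Returns (a, b, sse), where sse is the sum of squared residuals.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, sse


# Candidate growth shapes in the number of processors p.
SHAPES = {
    "logarithmic": lambda p: math.log2(p),
    "linear": lambda p: float(p),
}


def best_shape(procs, overhead):
    """Fit overhead = a + b * f(p) for each candidate f; return (name, fit)."""
    fits = {}
    for name, f in SHAPES.items():
        xs = [f(p) for p in procs]
        fits[name] = fit_line(xs, overhead)
    best = min(fits, key=lambda name: fits[name][2])
    return best, fits[best]


# Synthetic overhead samples (arbitrary units), NOT the paper's measurements.
procs = [2, 4, 8, 16, 32]
mesh_like = [0.5 + 0.25 * p for p in procs]
cube_like = [0.5 + 0.25 * math.log2(p) for p in procs]

print(best_shape(procs, mesh_like)[0])
print(best_shape(procs, cube_like)[0])
```

The fitted coefficient `b` plays the role of the coefficient on the growth term in such a summary, under the assumptions stated above.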
CG
Interaction overheads for CG (Figure 5) cause a larger deviation from ideal behavior than for EP, but the difference is not as pronounced as in IS. The NORMAL mode is the only dominant mode of execution, as depicted in Figure 25. The communication in the NORMAL mode for data accesses (the matrix-vector product) outweighs the overhead in accesses for synchronization variables during the BARRIER and PGM SYNC modes (Figures 26 and 27). But the communication is still insignificant compared to the overall execution time for the range of processors considered.
In this kernel, since the input matrix is sparse, the fewer the rows assigned to a processor the fewer will be the number of elements of the vector that may need to be accessed for the matrix-vector product.
Therefore, as the number of processors increases, the number of rows of the sparse matrix allocated to a processor decreases, thereby decreasing the likelihood of non-local memory references. Hence, the latency overhead decreases with an increase in the number of processors. The contention overhead increases from the fully-connected network (Figure 28) to the cube (Figure 29) and surpasses the latency overhead for the mesh (Figure 30) at around 17 processors. As can be seen from Table 9, the latency overhead decreases with increasing number of processors and the contention overhead is more pronounced. The contention overhead is negligible for the fully-connected network and grows linearly for both the cube and the mesh, with a larger coefficient for the mesh than for the cube. We can thus conclude that the CG kernel is scalable for the fully-connected network and becomes less scalable for networks with lower connectivity like the cube and the mesh. The NORMAL mode consists of the two program phases, matrix-vector product and vector-vector product (see Section 3). Since the NORMAL mode dominates the total execution time, the scalability of the matrix-vector product would determine the scalability of the kernel. Scaling the problem size increases the number of non-local memory accesses linearly while increasing the amount of local computation quadratically. Thus an increase in problem size is likely to enhance scalability of CG on all three network platforms.
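The problem-size argument for CG can be written as a ratio. As a sketch, assume a problem-size parameter n, a fixed number of processors, and hypothetical positive constants c_1 and c_2 for the communication and computation terms (these symbols are introduced here for illustration only):

```latex
\frac{T_{\mathrm{comm}}(n)}{T_{\mathrm{comp}}(n)}
  \approx \frac{c_1\, n}{c_2\, n^{2}}
  = \frac{c_1}{c_2\, n}
  \longrightarrow 0 \quad \text{as } n \to \infty
```

Under these assumptions the share of execution time spent on non-local accesses shrinks in proportion to 1/n, which is consistent with the expectation that larger problems scale better.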
CHOLESKY
The algorithmic overheads for CHOLESKY cause a significant deviation from linear behavior for the ideal curve, as shown in Figure 6. An examination of the execution times (Figure 31) shows that the bulk of the time is spent in the NORMAL mode, which performs the actual factorization. The communication overheads in the NORMAL mode for the data accesses of the sparse matrix outweigh those for accesses to synchronization variables (Figures 32 and 33). Thus the time spent in the MUTEX mode (which represents dynamic scheduling and accesses to critical sections) is insignificant compared to the NORMAL mode. Although the contention overhead in the NORMAL mode increases quite rapidly with the number of processors, the overall impact of communication on the execution time is insignificant (see Figure 31). The deviation of the ideal from the linear curve (Figure 6) indicates that the kernel is not very scalable for the chosen problem size due to the inherent algorithmic overhead, as in IS. As can be observed from Table 10, the latency decreases with increasing number of processors, and the scalability of the real execution would thus be dictated by the contention overhead. The contention on the fully-connected and cube networks is negligible, thus projecting speedup curves that closely follow the ideal speedup curve for these platforms.
On the other hand, the contention grows logarithmically on the mesh, making this platform less scalable.
With increasing problem sizes, the coefficient associated with the computation time in Table 10 is likely to grow faster than the coefficients associated with the communication overheads (verified by experimentation). Hence, an increase in problem size would enhance the scalability of this kernel on all hardware platforms.
Concluding Remarks
We used an execution-driven simulation platform to study the scalability characteristics of EP, IS, FFT, CG, and CHOLESKY on three shared memory platforms with fully-connected, cube, and mesh interconnection networks, respectively. The simulator allows for the separation of the algorithmic and interaction overheads in a parallel system. Separating the overheads provided us with some key insights into the algorithmic characteristics and architectural features that limit the scalability of these parallel systems.
Algorithmic overheads such as the additional work incurred in parallelization could be a limiting factor for scalability as observed in IS and CHOLESKY. In shared memory machines with private caches, as long as the applications are well-structured to exploit locality, the key determinant to scalability is network contention. This is particularly true for most commercial shared memory multiprocessors which have sufficiently large caches.
We have illustrated the usefulness as well as the feasibility of our top-down approach. This approach can be used to study the impact of other system parameters (such as link bandwidth and processor speed) on scalability and provide guidelines for application design as well as evaluate architectural design decisions.
Speedup Graphs 
