Abstract
Introduction
Distributed Shared Memory (DSM) systems are becoming popular in the high performance computing arena because they promise ease of programming due to a global address space and scalability to a large number of nodes. Although DSM systems facilitate programming, they can potentially introduce performance bottlenecks that require additional effort on the part of a user to discover and eliminate [9]. Non-Uniform Memory Access (NUMA) architectures can incur orders of magnitude greater latencies to access data that reside farther from the processor in the memory hierarchy [6]. Memory traffic generated by protocols that keep the caches coherent is another potential source of performance degradation. While the developers of compilation and parallelization tools for shared memory systems have addressed some of these problems, extensive user input is still required to fully benefit from these tools [1, 2, 7]. Understanding the sources of parallelism in a program and the potential overhead due to subtleties of a DSM architecture is essential for effectively using these systems.
Due to a growing disparity between processor and memory speeds, tool developers have been focusing on measurement-based tools to analyze memory performance. Several state-of-the-art microprocessors provide on-chip performance counters to facilitate these measurements [9]. However, most of the existing tools and techniques are limited to evaluating cache and memory performance for a single processor [8]. These tools typically do not directly address multiprocessor memory performance issues. There are examples of research prototype DSM systems that can support memory performance measurements across multiprocessor nodes [5]. Unfortunately, such tools are not yet widely available for commercial multiprocessors. We present a performance model that accounts for inherent parallelism in a program, which can result in potential speedup as well as overhead when that program is executed on a DSM system. This model can be used to analyze the efficacy of parallelization and quantitatively measure the overhead of parallelizing a program. Quantitative evaluation of this overhead provides an indirect measure of effective utilization of available memory subsystem performance.
In this paper, we present a performance model to characterize the execution of a compiler directives-based parallelized program. We subsequently apply this model to evaluate the performance of our parallelized version of the NAS Parallel Benchmarks (NPBs) on the SGI Origin2000, which is a commercial DSM system with a ccNUMA architecture. We used native tools to parallelize the sequential implementation of NPBs [4]. These tools include: Power Fortran Accelerator (PFA), which can automatically insert parallelization directives in sequential code and transform the loops to enhance their performance; Parallel Analyzer View (PAV), which can annotate the results of dependence analysis of PFA and present them graphically; and a Fortran77 compiler with the MP runtime library to compile and execute the parallelized code. In addition to using these tools, we inserted some directives by hand to assist the compiler and improve the performance.
We explain the directives-based parallelization paradigm in Section 2. A performance model and metrics to evaluate different aspects of a directives-based parallelized program are presented in Section 3. Section 4 reports a detailed measurement-based evaluation of the parallelized NAS benchmarks using the performance model and metrics of Section 3. We conclude with a discussion of our results in Section 5.
Compiler-Directed Parallelism
Compared to the process-level parallelism for message-passing programs, directives-based parallelism constitutes a finer-grained, loop-level parallelism. Figure 1 provides an example of this parallelism implemented through MIPS Fortran compiler directives for multiprocessing. The C$DOACROSS directive instructs the compiler to divide the outer loop iterations equally among the available processors. This is the default loop scheduling, which is implemented by the runtime system until specifically instructed otherwise by additional compiler directives.
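The equal division of outer loop iterations described above can be illustrated with a minimal Python sketch. It is not the MP runtime's implementation: the function name `simple_schedule` is hypothetical, and the way leftover iterations are handed to the first few threads is an assumption made only so that the example is complete.

```python
def simple_schedule(n_iterations, n_threads):
    """Split iterations 0..n_iterations-1 into contiguous, near-equal chunks.

    Assumption: when the division is not exact, the first `extra` threads
    each receive one additional iteration.
    """
    base, extra = divmod(n_iterations, n_threads)
    chunks = []
    start = 0
    for t in range(n_threads):
        size = base + (1 if t < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


if __name__ == "__main__":
    # Hypothetical case: 10 outer-loop iterations on 4 processors.
    for tid, chunk in enumerate(simple_schedule(10, 4)):
        print(f"thread {tid}: iterations {list(chunk)}")
```

Each thread receives one contiguous block of the outer loop, which is why the placement of the data touched by that block matters on a ccNUMA system.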
Directives-based parallelism is supported by the MP runtime library on Origin2000, which implements a fork-and-join paradigm of parallelism. A master thread initiates the program, creates multiple slave threads, schedules the iterations of parallelized loops on all the threads including itself, waits for the completion of a parallel loop by all the slave threads, and executes sequential portions of the program. Slave threads wait for work (i.e., for parts of parallel loops) during execution of a sequential portion of code by the master thread.
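The fork-and-join pattern can be sketched in Python as follows. This is a simplified illustration rather than the MP library's mechanism: the helper `parallel_loop` is hypothetical, and it creates fresh threads for every loop, whereas the runtime described above keeps its slave threads alive and waiting between parallel loops.

```python
import threading


def parallel_loop(body, n_iterations, n_threads):
    """Run body(i) for i in 0..n_iterations-1 using a fork-and-join pattern."""

    def run_chunk(tid):
        # Contiguous block of iterations owned by thread `tid`.
        lo = tid * n_iterations // n_threads
        hi = (tid + 1) * n_iterations // n_threads
        for i in range(lo, hi):
            body(i)

    # Fork: slave threads 1..n_threads-1 (created per loop here for brevity).
    slaves = [threading.Thread(target=run_chunk, args=(tid,))
              for tid in range(1, n_threads)]
    for s in slaves:
        s.start()
    run_chunk(0)          # the master executes its own share of the loop
    for s in slaves:      # join: the master waits for every slave to finish
        s.join()


if __name__ == "__main__":
    squares = [0] * 8
    # Sequential portion (master only) would run before and after this call.
    parallel_loop(lambda i: squares.__setitem__(i, i * i), 8, 3)
    print(squares)
```

The time the master spends in the final `join` loop corresponds to the synchronization wait that the performance model later folds into the overall parallelization overhead.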
Performance Model and Metrics
We first explain the performance model with respect to the DSM system architecture that we are focusing on. Subsequently, we define metrics to evaluate parallelization and scalability of the parallelized code.
Performance Model
Consider a sequential program consisting of N blocks, such that only one block is executed at any time. The sequential execution time of a program is denoted by $T_s$ and is calculated as:
$$T_s = \sum_{i=1}^{N} t_i, \qquad (1)$$
where $t_i$ is the execution time spent in the i-th block. We have to measure the aggregate time spent in every block of the code that substantially contributes toward the overall sequential execution time. Therefore, we define the sequential cost for executing the i-th block as a fraction:
$$c_i = \frac{t_i}{T_s}. \qquad (2)$$
When a program is executed in parallel using the fork-and-join paradigm, synchronization overhead is incurred by slave threads waiting for parallel work and by the master thread waiting for all the slave threads to finish executing a particular parallel loop. The execution time of a directives-based parallelized program is denoted by $T_p$ and is given by:
$$T_p = \sum_{i=1}^{N} \left( \mathit{tp}_i + \mathit{ts}_i \right) + t_o, \qquad (3)$$
where the (useful) execution time spent in the i-th block ($t_i$) is the sum of the time spent in parallelized loops of that block ($\mathit{tp}_i$) and in the remaining sequential code of that block ($\mathit{ts}_i$). Parallelization overhead is given for the entire program by $t_o$ because it is non-trivial to measure it for each individual parallelized block of the program using profiling. Quantitative calculation of parallelization overhead and other metrics is presented in the following subsection.
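As an illustration of the model, the following Python sketch computes the sequential time, the per-block sequential cost fractions, and the parallel execution time. It reads the model as: sequential time is the sum of the block times, and parallel time is the sum of the per-block parallel-loop and sequential-code times plus one program-wide overhead term. That reading of the parallel time is an assumption based on the description above, and all function names and timings are hypothetical.

```python
def sequential_time(block_times):
    """T_s: total time of a sequential run, summed over all blocks."""
    return sum(block_times)


def sequential_costs(block_times):
    """Fraction of T_s spent in each block."""
    total = sequential_time(block_times)
    return [t / total for t in block_times]


def parallel_time(tp, ts, t_o):
    """T_p: per-block parallel-loop time plus sequential-code time,
    plus one program-wide overhead term t_o (assumed form)."""
    return sum(p + s for p, s in zip(tp, ts)) + t_o


if __name__ == "__main__":
    # Hypothetical timings in seconds for a three-block program.
    t = [2.0, 6.0, 2.0]
    print("T_s   =", sequential_time(t))
    print("costs =", sequential_costs(t))
    # Hypothetical parallel run: loop time shrinks, sequential code does not.
    print("T_p   =", parallel_time(tp=[0.4, 1.4, 0.4], ts=[0.2, 0.4, 0.2], t_o=0.5))
```

The cost fractions identify which blocks are worth instrumenting: a block with a small fraction contributes little to the sequential time regardless of how well it parallelizes.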
Performance Metrics
Consider that a subroutine j in the program has K parallelized loops. Then we define the metric parallel coverage of subroutine j as:
$$PC_j = \frac{1}{T_s} \sum_{k=1}^{K} t_{j,k}, \qquad (4)$$
where $t_{j,k}$ is the sequential execution time of the k-th parallelized loop of subroutine j. Note that parallel coverage of a subroutine can be determined by profiling the execution of a sequential program. This technique is often used to determine the fraction of code that can be executed in parallel [3]. The total parallel coverage of a parallelized program is equal to the sum of parallel coverages of all subroutines in the program. If there are L subroutines in a program, then the parallel coverage of the entire program is calculated as:
$$PC = \sum_{j=1}^{L} PC_j. \qquad (5)$$
A value of PC close to 1.0 (or 100%, if expressed as a percentage) is ideal for a parallelized program because it indicates that almost no sequential code remains. In the absence of parallelization overhead, executing such a program on n processors should result in a speedup of n,
provided that all the processors are fully utilized during the entire execution. A higher value of this metric is desirable because it represents a better parallelization of sequential code. Amdahl's law, which is based on a fixed workload, can be used as a measure of scalability of the parallelized code under the fork-and-join execution model. According to Amdahl's law, the maximum possible speedup that can be obtained on an n-processor system is given by:
$$S = \frac{1}{a + \frac{1 - a}{n}}, \qquad (6)$$
where a is the fraction of the code that is serial.
Noting that parallel coverage PC = 1 - a, we can express the theoretical speedup according to Amdahl's law as:
$$S = \frac{1}{(1 - PC) + \frac{PC}{n}}. \qquad (7)$$
Using this definition of theoretical speedup, we can now calculate the combined value of parallelization overhead as:
$$t_o = T_p - \frac{T_s}{S}, \qquad (8)$$
where $T_p$ is the measured execution time on n processors. Parallel coverage and speedup metrics are defined by equations (5) and (7), respectively, for independent assessment of a directives-based parallelized program. In order to compare the performance of a directives-based parallelized program with the same program parallelized using a different technique, we use execution time as a metric. Additionally, equation (8) will be used for evaluating parallelization overhead for directives-based parallelized programs.
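The three metrics can be combined in a short Python sketch. It assumes that parallel coverage is the time spent in parallelized loops divided by the total sequential time, and that the overhead is the measured parallel time minus the time predicted by Amdahl's law; the function names, subroutine names, and timings are hypothetical and are not measurements from this study.

```python
def parallel_coverage(loop_times_by_subroutine, t_seq):
    """PC: fraction of sequential time spent in loops that were parallelized.

    `loop_times_by_subroutine` maps a subroutine name to the sequential
    times of its parallelized loops, as obtained from a profile.
    """
    covered = sum(sum(loops) for loops in loop_times_by_subroutine.values())
    return covered / t_seq


def theoretical_speedup(pc, n):
    """Amdahl's law written in terms of parallel coverage."""
    return 1.0 / ((1.0 - pc) + pc / n)


def parallelization_overhead(t_measured, t_seq, pc, n):
    """Measured parallel time minus the time predicted by Amdahl's law."""
    return t_measured - t_seq / theoretical_speedup(pc, n)


if __name__ == "__main__":
    # Hypothetical profile of a 100-second sequential run.
    profile = {"solve": [40.0, 30.0], "rhs": [25.0]}
    pc = parallel_coverage(profile, t_seq=100.0)
    print(f"PC = {pc:.2f}")
    print(f"theoretical speedup on 8 processors = {theoretical_speedup(pc, 8):.2f}")
    # Hypothetical measured time of 20 seconds on 8 processors.
    print(f"overhead = {parallelization_overhead(20.0, 100.0, pc, 8):.3f} s")
```

A negative result from `parallelization_overhead` means the measured run beat the Amdahl prediction, which usually points to an untuned sequential baseline rather than to a negative overhead.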
Performance Evaluation
Performance is evaluated from four perspectives: efficacy of parallelization process; scalability of parallelized programs; overhead due to parallelization; and performance comparison of directives-based parallelized programs against the hand-parallelized and optimized codes. The metrics discussed in Section 3.2 are used for this evaluation.
Analysis of Parallelization
Parallel coverage is defined in Section 3.2 as a metric to represent the efficacy of parallelization process. This metric was calculated for all NAS benchmarks parallelized using compiler directives for shared memory multiprocessing. For these calculations, the benchmarks are compiled with instrumentation to measure the time spent in each subroutine that contains parallel code blocks. We execute these programs on a single processor of Origin2000. Table 1 reports parallel coverage values for BT, FT, CG, and MG benchmarks using measurements.
Quantitative measurements for BT indicate that the code responsible for more than 99% of the entire execution time is parallelized. This level of parallelism was attained after iteratively analyzing the source code and discovering possibilities of parallelization through minor modifications in some loop nests. In contrast to BT, we relied on native SGI tools (PFA and PAV) to parallelize the CG and MG benchmarks. Furthermore, we had to manually perform inter-procedural analysis to parallelize a few important loops in FT.
The results shown in Table 1 suggest that 93%-99% of the code is parallelized. It should be noted that when a program is 100% parallelized, a linear speedup could be obtained provided that all the processors are equally utilized throughout the execution. This theoretical speedup will be used as a criterion to evaluate the actual performance of parallelized code in the following subsections.
Analysis of Scalability
[...] (8) in Section 3.2. The speedup is less than the ideal or theoretical values for BT and FT. However, CG and MG exhibit close to ideal speedup values. BT and FT are relatively larger programs consisting of several parallelized loops compared to CG and MG. Additionally, algorithms for BT and FT depend on a regular pattern of data accesses, which is not the case for CG and MG [4]. Lack of structured data accesses helps the loop-level parallelization paradigm by reducing parallelization overhead. Therefore, BT and FT are susceptible to overhead due to data locality as well as synchronization. Since these overheads are not significant for CG and MG due to their structure as well as their smaller number of parallelized loops, the speedup is close to ideal.
Based on the results of scalability measurements, it can be observed that speedup close to the ideal and theoretical values is attainable by parallelizing programs using the directives-based approach. However, deviations from the theoretical speedup should be expected for larger applications with regular data accesses. In those cases, careful data distribution becomes important to obtain high speedup values.
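To see how strongly the theoretical speedup depends on parallel coverage, the following Python sketch evaluates Amdahl's law at the two ends of the 93%-99% coverage range reported in the parallelization analysis. The processor counts are hypothetical and chosen only for illustration.

```python
def amdahl_speedup(pc, n):
    """Theoretical speedup for parallel coverage `pc` on `n` processors."""
    return 1.0 / ((1.0 - pc) + pc / n)


if __name__ == "__main__":
    for pc in (0.93, 0.99):
        row = ", ".join(f"n={n}: {amdahl_speedup(pc, n):.1f}" for n in (4, 16, 32))
        print(f"PC={pc:.2f} -> {row}")
```

Because the bound approaches 1/(1 - PC) as n grows, a program with 93% coverage cannot exceed a speedup of about 14 on any number of processors, whereas 99% coverage raises that limit to 100.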
Parallelization Overhead
Considering the architecture of a ccNUMA-based DSM system, parallelization overhead is an intricate function of the following factors: (1) synchronization time among threads during execution of a parallelized program; (2) number of parallel loops; (3) non-local memory accesses by each thread; and (4) resource contention due to multiple users. Although it is fairly simple to calculate overall parallelization overhead, it is not trivial to isolate the quantitative contribution of each of the above factors to this aggregate value.
Aggregate synchronization time can be measured using SGI's SpeedShop toolset, which can determine the time spent in synchronization primitives of the MP library. These measurement-based experiments were carried out for BT, FT, CG, and MG using a relatively small number of processors. The results of these experiments are reported in Table 2. Synchronization overhead for each case is obtained as a percentage of measured execution time.
Synchronization overhead was as high as 19% in some cases. The last column lists the total parallelization overhead obtained by subtracting the theoretical execution time from the measured execution time according to equation (8). In two cases, this calculation is not possible due to better than expected speedup of CG and MG, which is a consequence of untuned sequential versions of these programs as discussed in Section 4.2.
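The two overhead columns can be derived as in the following Python sketch, under stated assumptions: synchronization overhead is the time spent in synchronization primitives as a percentage of measured execution time, and total overhead is the measured time minus the theoretical time. The function `overhead_row` and all timings are hypothetical, and the sketch returns `None` for the total when the measured run is faster than the theoretical prediction.

```python
def overhead_row(t_sync, t_measured, t_theoretical):
    """Return (synchronization overhead in percent, total overhead in seconds).

    The total is None when the measured time is below the theoretical time,
    i.e. when the speedup was better than expected.
    """
    sync_pct = 100.0 * t_sync / t_measured
    total = t_measured - t_theoretical
    return sync_pct, (total if total >= 0 else None)


if __name__ == "__main__":
    # Hypothetical runs: (sync time, measured time, theoretical time) in seconds.
    runs = {"run A": (3.8, 20.0, 16.875), "run B": (1.0, 10.0, 12.0)}
    for name, (t_sync, t_meas, t_theo) in runs.items():
        sync_pct, total = overhead_row(t_sync, t_meas, t_theo)
        total_text = "n/a" if total is None else f"{total:.3f} s"
        print(f"{name}: sync {sync_pct:.1f}%, total overhead {total_text}")
```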
Although the measurements report up to 19% overhead due to synchronization, it is incorrect to assume that synchronization overhead is a result of parallel loop scheduling alone. Synchronization and data locality overhead are strongly correlated with each other. The time that a master thread spends waiting for slaves to finish executing a parallel loop could be due to a combination of two reasons: (1) time to synchronize multiple threads; and (2) load imbalance between master and some of the slave threads due to their non-local data accesses. If resource contention from other users is also considered, the problem of isolating one particular type of overhead becomes even more complex.
Calculation of aggregate parallelization overhead using the performance model of Section 3 provides useful [...]
Comparative Performance Analysis
Figure 3 presents the comparison between directives-based parallelized benchmarks and hand-parallelized, MPI-based versions of the same. In all of these cases, performance improves with the number of processors. For BT and FT, the MPI-based implementations perform slightly better than the shared-memory implementation due to data placement. Directives-based data distribution results in placing pages of arrays on multiple processors. Coarse granularity of data distribution starts becoming a bottleneck for a larger number of processors because all loop iterations that use a particular data element cannot be co-located at the same node. Therefore, as the number of processors increases, multiple processors access data from pages that they do not own locally, which adversely impacts the overall execution time. In contrast, a message-passing program is designed in a way that the programmer controls locality of every data element. As the number of processors increases, the amount of data owned by a processor reduces proportionately. This is a particularly favorable situation for a cache-based DSM system because larger proportions of local data can reside in caches to enhance memory system performance. We tuned BT's data locality for almost all of the parallelized loops to ensure that each loop iteration is scheduled at a processor that owns elements of an array accessed during those iterations. Consequently, the performance of BT is comparable to its hand-parallelized implementation. Performance of the two implementations of CG and MG is also comparable (see Figure 3(c) and (d)). In the case of CG and MG, data locality does not become a bottleneck due to the comparatively smaller size of the code with a smaller number of memory accesses. Therefore, performance remains comparable with the hand-parallelized implementations of CG and MG.
Discussion and Conclusions
We presented a performance model to characterize the performance of directives-based parallelized programs for an Origin2000 system. Using measurements, we quantitatively evaluated the fraction of code that was parallelized. Further evaluation indicated reasonable speedup as well as significant parallelization overhead. Based on extensive tuning of one parallelized program and some experiments presented in this paper, we conclude that non-local data accesses are the main source of parallelization overhead.
Recent performance evaluation studies have examined the effect of data locality on the performance of DSM systems. Anderson reports that programs parallelized with nearly 100% parallel coverage and executed on Stanford DASH (a ccNUMA DSM system) nevertheless exhibited significantly inferior speedup characteristics because of overhead [3]. Performance was improved by analyzing data distribution. In our case, we conclude that single processor cache performance is another key factor that can improve performance, in addition to appropriate data distribution.
Evaluation of parallelization overhead based on the performance model presented in this paper emphasizes the need for appropriate instrumentation of the multiprocessor memory subsystem. Such instrumentation is readily accessible to a user only for measurements limited to a single node. Without hardware- or software-based instrumentation of non-local memory accesses and cache-coherence traffic, direct measurement of data locality overhead is not possible. Some commercial tool developers realize this problem and are working on tools that furnish multiprocessor memory performance measurements.
