This paper studies the performance implications of architectural synchronization support for automatically parallelized numerical programs. As the basis for this work, we analyze the needs for synchronization in automatically parallelized numerical programs. These needs arise from task scheduling, iteration scheduling, barriers, and data dependence handling. We present synchronization algorithms for efficient execution of programs with nested parallel loops. Next, we identify how various forms of hardware synchronization support can be used to satisfy these software synchronization needs. The synchronization primitives studied are test&set, fetch&add, and exchange-byte operations. In addition, synchronization bus implementations of lock/unlock and fetch&add operations are considered. Lastly, we ran experiments to quantify the impact of various forms of architectural support on the performance of a bus-based shared-memory multiprocessor running automatically parallelized numerical programs. We found that supporting an atomic fetch&add primitive in shared memory is as effective as supporting lock/unlock operations with a synchronization bus. Both achieve substantial performance improvement over the cases where atomic test&set and exchange-byte operations are supported in shared memory.
Introduction
Automatically parallelized numerical programs represent an important class of parallel applications in high-performance multiprocessors. These programs are used to solve problems in many engineering and science disciplines such as Civil Engineering, Mechanical Engineering, Electrical
Engineering, Chemistry, Physics, and Life Sciences. In response to popular demand, parallelizing Fortran compilers have been developed for commercial and experimental multiprocessor systems to support these applications [1, 11, 21, 6, 10]. With maturing application and support software, the time has come to study the architecture support required to achieve high performance for these parallel programs.
Synchronization overhead has been recognized as an important source of performance degradation in the execution of parallel programs. Many hardware and software techniques have been proposed to reduce the synchronization cost in multiprocessor systems [12, 23, 22, 2, 13, 14, 15].
Instead of proposing new synchronization techniques, we address a simple question in this paper:
does architecture support for synchronization substantially affect the performance of automatically parallelized numerical programs?
To answer this question, we start by analyzing the needs for synchronization in parallelized Fortran programs in Section 2. Due to the mechanical nature of parallelizing compilers, parallelism is expressed in only a few structured forms. This parallel programming style allows us to systematically cover all the synchronization needs in automatically parallelized programs. Synchronization issues arise in task scheduling, iteration scheduling, barriers, and data dependence handling. A set of algorithms is presented which use generic lock/unlock and increment operations. We then identify how several hardware synchronization primitives can be used to implement these generic synchronization operations. These synchronization primitives are test&set, fetch&add, exchange-byte, and lock/unlock operations. Since these primitives differ in functionality, the algorithms for synchronization in parallel programs are implemented with varying efficiency. Section 3 describes the experimental procedure and the scope of our experiments. In Section 4, the issue of iteration scheduling overhead is addressed in the context of hardware synchronization support. We use an analytical model for the effect of iteration scheduling overhead and loop granularity on execution time. The model is then used to explain the differences in the iteration scheduling overhead of different synchronization primitives for a simulated shared-memory multiprocessor.
Synchronization needs of a parallel application depend on the numerical algorithms and the effectiveness of the parallelization process; therefore, the performance implications of architectural synchronization support can only be quantified through experimentation. Section 5 addresses the issues of granularity and lock locality in real applications. Using programs selected from the Perfect Club [4] benchmark set, we evaluate the impact of various forms of architectural support on the performance of a bus-based shared-memory multiprocessor architecture in Section 6. We conclude that architectural support for synchronization has a profound impact on the performance of the benchmark programs.
Background and Related Work
In this section, we first describe how parallelism is expressed in parallel Fortran programs. We then analyze the synchronization needs in the execution of these programs. Most importantly, we show how architectural support for synchronization can affect the implementation efficiency of scheduling and synchronization algorithms.
Parallel Fortran Programs
The application programs used in this study are selected from the Perfect Club benchmark set [4].
The Perfect Club is a collection of numerical programs for benchmarking supercomputers. The programs were written in Fortran. For our experiments, they were parallelized by the KAP/Cedar source-to-source parallelizer [17, 10], which generates a parallel Fortran dialect, Cedar Fortran. This process exploits parallelism at the loop level, which has been shown by Chen, Su, and Yew to capture
Synchronization Needs
In executing parallel Fortran programs, the needs for synchronization arise in four contexts: task scheduling, iteration scheduling, barrier synchronization, and Advance/Await. In this section, we discuss the nature of these synchronization needs.
Task scheduling is used to start the execution of a parallel loop on multiple processors. All processors that participate in the execution of a parallel loop, or task, must be informed that the loop is ready for execution. In this study, all experiments assume a task scheduling algorithm that uses a centralized task queue to assign tasks to processors. The processor which executes a DOALL or DOACROSS statement places the loop descriptor into the task queue. All idle processors acquire the loop descriptor from the task queue and start executing the loop iterations. The accesses to the task queue by the processors are mutually exclusive. A lock is used to enforce mutual exclusion.
A number of distributed task scheduling algorithms have been proposed in the past. Anderson, Lazowska, and Levy [3] compared the performance of several algorithms in the context of thread managers. Most distributed task scheduling algorithms rely on a large supply of parallel tasks to maintain load balance. Also, they usually assume that each task needs to be executed by only one processor. These are valid assumptions for thread managers because there are usually a large number of tasks (threads) in their application programs and each task represents a piece of sequential code. These assumptions are, however, not valid for the current generation of automatically parallelized Fortran programs, where parallelism is typically exploited at only one or two loop nest levels. Since all parallel iterations of a single loop nest level form a task, there is typically only a very small number of tasks in the task queue. Also, multiple processors need to acquire the same task so that they can work on different iterations of the task loop. This lack of task level parallelism makes it difficult to effectively use distributed task queues. Thus, while distributed task queues may become attractive when production parallelizing compilers can effectively exploit more advanced constructs of parallelism, such as nested parallel loops, the experiments reported in this paper assume a task scheduling algorithm based on a centralized task queue. Figures 3 and 4 show the task scheduling algorithms for the processor which executes a parallel DO statement and for the idle processors, respectively. The removal of the loop descriptor from the task queue is performed by the first processor entering the barrier associated with the loop.
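As a rough illustration of the centralized scheme described above, the following Python sketch models a task queue whose accesses are all guarded by one lock. It is not a reproduction of Figures 3 and 4. The names `LoopDescriptor`, `TaskQueue`, `post`, `peek`, and `remove` are hypothetical, and `threading.Lock` merely stands in for the generic lock/unlock operations.

```python
import threading
from dataclasses import dataclass


@dataclass
class LoopDescriptor:
    """Hypothetical descriptor for one parallel loop (one task)."""
    loop_id: int
    num_iterations: int


class TaskQueue:
    """Centralized task queue; every access is guarded by a single lock."""

    def __init__(self):
        self._lock = threading.Lock()   # stands in for generic lock/unlock
        self._tasks = []

    def post(self, desc):
        # Executed by the processor that reaches a DOALL/DOACROSS statement.
        with self._lock:
            self._tasks.append(desc)

    def peek(self):
        # Executed by idle processors: the descriptor stays in the queue so
        # that several processors can pick up the same loop.
        with self._lock:
            return self._tasks[0] if self._tasks else None

    def remove(self, desc):
        # Executed by the first processor that enters the loop's barrier.
        with self._lock:
            if desc in self._tasks:
                self._tasks.remove(desc)
                return True
            return False
```

Because the descriptor is only read by idle processors and removed once at the barrier, several processors can work on different iterations of the same loop, which is the property that distinguishes this use of a task queue from a thread manager.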
The implementation of the lock, unlock, and increment functions with different primitives is presented in the next section. By definition, lock and unlock operations are atomic.
Whenever underlined in an algorithm, the increment operation is also assumed to be atomic and can be implemented with a sequence of lock, read-increment-write, and unlock operations. However, we will show that the frequent use of atomic increment in parallel Fortran programs makes it necessary to implement atomic increment with efficient hardware support.
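The lock, read-increment-write, unlock sequence mentioned above can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments. `LockedCounter` and `run_workers` are hypothetical names, and a Python `threading.Lock` plays the role of the generic lock.

```python
import threading


class LockedCounter:
    """Shared counter incremented with lock / read-increment-write / unlock."""

    def __init__(self, start=0):
        self._lock = threading.Lock()
        self._value = start

    def increment(self):
        self._lock.acquire()        # lock
        old = self._value           # read
        self._value = old + 1       # increment and write
        self._lock.release()        # unlock
        return old                  # value before the increment


def run_workers(counter, num_workers, increments_each):
    """Have several threads increment the counter; return every value handed out."""
    seen = []
    seen_lock = threading.Lock()

    def worker():
        local = [counter.increment() for _ in range(increments_each)]
        with seen_lock:
            seen.extend(local)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(seen)
```

Every call hands out a distinct value, but each increment costs a full lock acquisition and release. A hardware fetch&add would return the old value and update the counter in a single atomic operation.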
During the execution of a parallel loop, each processor is assigned different iterations; this assignment is called iteration scheduling. We use the self-scheduling algorithm [20] to implement iteration scheduling. In this method, the self-scheduling code is embedded in the loop body. Each time a processor is ready to execute the next loop iteration, it executes this code to get a unique iteration number. The self-scheduling algorithm shown in Figure 5 is executed at the beginning
Two alternative dynamic iteration scheduling algorithms, chunk scheduling and guided self-scheduling (GSS), have been proposed to avoid the potential bottleneck of scheduling the iterations one at a time [19]. When the number of iterations in a parallel loop is much larger than the number of processors, these algorithms reduce the iteration scheduling overhead by assigning multiple iterations to each processor at a time. This increases the effective granularity of parallel loops. The issue of granularity and scheduling overhead is discussed in Section 4. Both of these algorithms are proposed for DOALL loops. In the presence of dependences across iterations, i.e., in DOACROSS loops, scheduling more than one iteration at a time may sequentialize the execution of a parallel loop. In Section 5, we present the program characteristics of our applications to show that the parallelism is mostly in the form of DOACROSS loops or DOALL loops with a small number of iterations. Therefore, our experimental evaluation of the architectural support assumes the self-scheduling algorithm rather than guided self-scheduling or chunk scheduling.
The barrier algorithm shown in Figure 6 specifies that the first processor to enter the barrier removes the completed loop from the task queue. With this barrier synchronization algorithm, the processors entering the barrier do not wait for the barrier exit signal before they start executing another parallel loop whose descriptor is in the task queue.
In contrast to the compile-time scheduling of the "fuzzy barrier" [14], this algorithm allows dynamic scheduling of loops to the
The combination of task scheduling, iteration self-scheduling, and non-blocking barrier synchronization algorithms presented in this section allows deadlock-free execution of nested parallel loops, with the restriction that DOACROSS loops appear only at the deepest nesting level [20].
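To make the difference between the dynamic iteration scheduling policies discussed in this section concrete, the sketch below lists how many iterations each successive scheduling operation hands out. The GSS rule used here, ceil(remaining / P), is the commonly cited form and is an assumption on our part, since the text does not spell it out. All function names are hypothetical.

```python
import math


def self_scheduling(n_iters):
    """One iteration per scheduling operation."""
    return [1] * n_iters


def chunk_scheduling(n_iters, chunk):
    """A fixed number of iterations per scheduling operation."""
    sizes = []
    remaining = n_iters
    while remaining > 0:
        take = min(chunk, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


def guided_self_scheduling(n_iters, procs):
    """Assumed GSS rule: each operation takes ceil(remaining / procs) iterations."""
    sizes = []
    remaining = n_iters
    while remaining > 0:
        take = math.ceil(remaining / procs)
        sizes.append(take)
        remaining -= take
    return sizes
```

For a loop with 8 iterations on 16 processors, the assumed GSS rule degenerates to one iteration per scheduling operation, the same as self-scheduling. This is in line with the observation that these policies help little when parallel loops have few iterations.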
The last type of synchronization, Advance/Await, is implemented by a vector for each synchronization point. In executing a DOACROSS loop, iteration i, waiting for iteration j to reach synchronization point synch_pt, busy waits on location V[synch_pt][j]. Upon reaching point synch_pt, iteration j sets location V[synch_pt][j]. This implementation, as shown in Figure 7, uses regular memory read and write operations and thus does not require atomic synchronization primitives. This implementation assumes a sequentially consistent memory system. In the case of weak ordering memory systems, an Await statement can be executed only after the previous memory write operations complete execution. For a multiprocessor with a software controlled cache coherency protocol, Cedar Fortran Advance/Await statements include the list of variables whose values should be written to/read from shared memory before/after their execution. The implementation details of these statements under weak ordering memory system models or software controlled cache coherency protocols are beyond the scope of this paper.
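The flag-vector scheme described above can be sketched in Python as a prefix sum written as a DOACROSS loop with a dependence distance of one iteration. This is only an illustration under stated assumptions: there is a single synchronization point, the names `run_doacross`, `await_`, and `advance` are hypothetical, and the sketch relies on CPython's interpreter lock to provide the ordering that the text assumes from a sequentially consistent memory system.

```python
import threading
import time


def run_doacross(values, num_workers=4):
    """Prefix sum as a DOACROSS loop with dependence distance one."""
    n = len(values)
    result = [0] * n
    done = [False] * n              # V[synch_pt][j] for a single synch point
    next_iter = [0]
    sched_lock = threading.Lock()   # guards the self-scheduling counter only

    def await_(j):
        while not done[j]:          # busy wait with an ordinary read
            time.sleep(0)           # yield so the producer thread can run

    def advance(j):
        done[j] = True              # ordinary write, no atomic primitive

    def worker():
        while True:
            with sched_lock:        # self-scheduling: grab the next iteration
                i = next_iter[0]
                next_iter[0] += 1
            if i >= n:
                return
            if i > 0:
                await_(i - 1)       # wait for the previous iteration
                result[i] = result[i - 1] + values[i]
            else:
                result[i] = values[i]
            advance(i)              # signal the next iteration

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result
```

Iterations are handed out in increasing order, so the iteration being awaited has always been assigned to some processor already and the busy wait terminates.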
Locks and Hardware Synchronization Primitives
In executing numeric parallel programs, locks are frequently used in synchronization and scheduling. There are several algorithms that implement locks in cache coherent multiprocessors using hardware synchronization primitives [2, 13]. Virtually all existing multiprocessor architectures provide some type of hardware support for atomic synchronization operations. In theory, any synchronization primitive can be used to satisfy the synchronization needs of a parallel program. In practice, different primitives may result in very different performance levels. For example, a queuing lock algorithm [2, 13] can be implemented efficiently with an exchange-byte or a fetch&add primitive, whereas a test&set implementation may be less efficient. In this section, we outline the lock algorithms that we choose for each hardware synchronization primitive examined in our experiments.
Exchange-byte. The exchange-byte version of the queuing lock algorithm is shown in Figure 8.
In this implementation, the exchange-byte primitive is used to construct a logical queue of processors that contend for a lock. The variable my_id is set at the start of the program so that its value for the ith processor is $2i$, where processors are numbered from 0 to $P-1$.
Test&set. Because of its limited functionality, test&set cannot be used to construct processor queues in a single atomic operation. Therefore, in this study, whenever the architecture offers only test&set, a plain test&test&set algorithm (see Figure 9) is used to implement all lock operations (footnote 2).
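A test&test&set lock of the kind referred to above can be sketched as follows. This is not a reproduction of Figure 9. The names `TasCell`, `TTSLock`, and `count_with_lock` are hypothetical, and because Python has no test&set instruction, a small internal lock emulates the atomicity that the hardware would provide.

```python
import threading
import time


class TasCell:
    """Memory word with an emulated atomic test&set (assumption: a Python
    lock stands in for the hardware's atomicity guarantee)."""

    def __init__(self):
        self._atomic = threading.Lock()
        self.value = 0

    def test_and_set(self):
        with self._atomic:
            old = self.value
            self.value = 1
            return old

    def clear(self):
        # A plain write in hardware; serialized here with the emulated primitive.
        with self._atomic:
            self.value = 0


class TTSLock:
    """Test&test&set spin lock built on the emulated primitive."""

    def __init__(self):
        self._cell = TasCell()

    def lock(self):
        while True:
            while self._cell.value == 1:        # "test": spin on a plain read
                time.sleep(0)
            if self._cell.test_and_set() == 0:  # "test&set": try to grab it
                return

    def unlock(self):
        self._cell.clear()


def count_with_lock(num_workers, increments_each):
    """Protect a non-atomic read-increment-write with the spin lock."""
    lock = TTSLock()
    total = [0]

    def worker():
        for _ in range(increments_each):
            lock.lock()
            total[0] = total[0] + 1
            lock.unlock()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]
```

The inner loop spins on an ordinary read, which on a cache coherent machine is served from the local cache. The atomic primitive is issued only when the lock appears free.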
Fetch&add. Due to the emphasis on atomic increment operations in iteration scheduling and barrier synchronization, supporting a fetch&add primitive in hardware can significantly decrease the need for lock accesses in these algorithms. When the fetch&add primitive is supported by a system, a fetch&add implementation of test&test&set algorithm can be used to support the lock accesses in task scheduling as well as a queuing lock algorithm. The performance implications of
Footnote 2: However, we would like to point out that in an environment where critical sections of algorithms involve many instructions and memory accesses, a test&set implementation of a queuing lock may enhance performance.
Each trace used in our simulations is a record of events that take place during the execution of a parallel program and detailed information about instructions executed between each pair of events.
In this study, traces are collected by instrumenting the source code of parallelized applications.
In a trace, each event is identified by its type and arguments, e.g., the synchronization point and the iteration number for an Await event. Each task piece is annotated with the number of dynamic instructions executed in the task piece and the dynamic count of shared memory accesses. These numbers are collected with the help of pixie, an instruction level instrumentation tool for the MIPS architecture [18]. Using a RISC processor model similar to the MIPS R2000, where instruction execution times are defined by the architecture, the time to execute instructions in the CPU and local cache can be calculated directly from the dynamic instruction count. On the other hand, the time to service the cache misses and the atomic accesses to the shared memory depends on the activities of other processors in the system. Therefore, a multiprocessor simulator is used to calculate the program execution time from a trace.
In order to assess the performance implications of synchronization primitives, a library of scheduling and synchronization routines as described in Section 2 is included in the simulator.
In the simulation model, the processor-memory interconnect is a split transaction (or decoupled access) bus, where a memory access requested by a processor only occupies the bus when its request and response are transmitted between the processor and the memory modules. The bus is made available to other memory accesses while the memory modules process the current accesses. When the memory modules have long access latency, the split transaction bus plus memory interleaving allows multiple accesses to be overlapped. In our experiments, we assume that shared memory
In our experiments, the atomic operations test&set, exchange-byte, and fetch&add are performed in the memory modules rather than through the cache coherence protocol. Whenever a memory location is accessed by one of these synchronization primitives, the location is invalidated from the caches. The read-modify-write operation specified by the primitive is then carried out by the controller of the memory module that contains the accessed location. Note that this memory location may be brought into cache later by normal memory accesses made to that location due to spin waiting. This combination of atomic operation implementation in memory modules, the cache coherence protocol, and the split transaction bus is similar to that of Encore Multimax 300 series multiprocessors [11]. In Section 5, we present the characteristics of our application programs that lead to the choice of performing the read-modify-write in memory modules rather than through the cache coherence protocol.
Without any memory or bus contention, a synchronization primitive takes one cycle to invalidate local cache, one cycle to transmit request via the memory bus, two memory module cycles to perform the read-modify-write operation, and one cycle to transmit response via the memory bus.
This translates into 9 and 43 cycles for our two memory module latencies, respectively. A memory access that misses in the cache takes one cycle to detect the miss, one cycle to transmit the cache refill request via the bus, one memory module cycle time to access the first word in the missing block, and four clock cycles to transmit the four words back to cache via the memory bus. This amounts to 9 and 26 cycles for our assumed memory module latencies. Note that the latency for executing synchronization primitives and refilling caches increases considerably in the presence of bus and memory contention. This effect is accounted for in our simulations on a cycle-by-cycle basis.
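The uncontended latencies quoted above follow directly from the component costs. The short sketch below reproduces the arithmetic for memory module cycle times of 3 and 20 processor cycles, the two values used in the experiments. The function names are hypothetical.

```python
def sync_primitive_cycles(mem_cycle):
    """Uncontended latency of a synchronization primitive, in processor cycles."""
    invalidate = 1            # invalidate the local cache copy
    request = 1               # send the request over the memory bus
    rmw = 2 * mem_cycle       # read-modify-write takes two memory module cycles
    response = 1              # return the response over the memory bus
    return invalidate + request + rmw + response


def cache_miss_cycles(mem_cycle, words_per_block=4):
    """Uncontended latency of a cache refill, in processor cycles."""
    detect = 1                  # detect the miss
    request = 1                 # send the refill request over the bus
    first_word = mem_cycle      # access the first word of the missing block
    transfer = words_per_block  # one bus cycle per word returned
    return detect + request + first_word + transfer
```

With 3-cycle modules both operations cost 9 cycles. With 20-cycle modules the synchronization primitive costs 43 cycles and the refill 26, since the read-modify-write occupies the module for two memory cycles.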
To evaluate the effectiveness of a synchronization bus, a single cycle access synchronization bus model is used. The synchronization bus provides single cycle lock/unlock operations on shared lock variables and single cycle fetch&add operations on shared counters. In the presence of conflicts, i.e., multiple requests in the same cycle, requests are served in round robin fashion. A summary of
In all the simulations, an invalidation based write-back cache coherence scheme is used. The shared memory traffic contributed by the application is modeled based on the measured instruction count and frequency of shared data accesses. Table 2 lists the assumptions used to simulate the memory traffic for the task pieces. We assume that 20% of the instructions executed are memory references. In addition, we measured that 6-8% of all instructions (approximately 35% of all memory references) are to shared data. We assume that references to shared data cause the majority of cache misses (80% shared data cache miss rate and 5% non-shared data cache miss rate) (footnote 3).
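As a quick check on these traffic assumptions, the sketch below combines them into an expected number of cache misses per instruction. It takes the stated rates as percentages (20, 35, 80, and 5 percent). The function name and parameter names are hypothetical.

```python
def misses_per_instruction(mem_ref_frac=0.20, shared_frac=0.35,
                           shared_miss=0.80, private_miss=0.05):
    """Expected cache misses per executed instruction under the stated rates."""
    shared_refs = mem_ref_frac * shared_frac            # shared refs per instruction
    private_refs = mem_ref_frac * (1.0 - shared_frac)   # non-shared refs per instruction
    return shared_refs * shared_miss + private_refs * private_miss
```

Under these assumptions, shared references make up 0.20 × 0.35 = 0.07 of all instructions, which falls inside the measured 6-8 percent range. The combined rate works out to 0.0625 misses per instruction, most of them caused by shared data.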
Analysis of iteration scheduling overhead
In the execution of a parallel loop, the effect of iteration scheduling overhead on performance depends on the number of processors, the total number of iterations, and the size of each iteration. In this section we first present the expressions for speedup in executing parallel loops where the loop iterations are large (coarse granularity) and where the loop iterations are small (fine granularity).
These expressions provide insight into how iteration scheduling overhead influences loop execution time, and will be used to analyze the simulation results later in this section. A more general treatment of program granularity and run-time overhead can be found in [16].
Consider a DOALL loop with $N$ iterations where each iteration takes $t_l$ time to execute without parallel processing overhead. For a given synchronization primitive and lock algorithm, let $t_{sch}$ be the time it takes for a processor to schedule an iteration. We will look at the impact of scheduling overhead for two cases. For the first case we assume that when a processor is scheduling an iteration, it is the only processor doing so.
For a given $P$ and $t_{sch}$, the necessary condition for this case is
$$t_l > (P-1)\,t_{sch}.$$
Using $t_l > (P-1)\,t_{sch}$,
$$\text{speedup} > \frac{P\,t_l}{\dfrac{t_l}{P-1} + t_l} = \frac{P\,(P-1)}{P} = P-1.$$
Therefore, when $t_l > (P-1)\,t_{sch}$, the speedup increases linearly with the number of processors; hence the execution time depends only on $P$ and the total amount of work in the loop, $N\,t_l$. Now let us consider the case where a processor completing the execution of an iteration always has to wait to schedule the next iteration because at least one other processor is scheduling an iteration at that time. The necessary condition for this case is
$$t_l < (P-1)\,t_{sch},$$
and the iteration scheduling overhead forms the critical path in determining the loop execution time. When iteration scheduling becomes the bottleneck, execution time is
$$t_P = N\,t_{sch} + t_l,$$
and for $N \gg P$,
$$t_P \approx N\,t_{sch}.$$
When the iteration scheduling algorithm is implemented with lock operations, scheduling an iteration involves transferring the ownership of the lock from one processor to the next, and reading and incrementing the shared counter. Therefore
$$t_{sch} = t_{\text{lock-transfer}} + t_{update}.$$
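The two regimes analyzed above can be turned into a small numerical sketch. The fine-grain expression, N t_sch + t_l, is the one given in this section. The coarse-grain expression, (N / P)(t_l + t_sch), is our assumption: it is consistent with the speedup bound of P − 1 derived here, but it is not stated explicitly in the text. Function and parameter names are hypothetical.

```python
def loop_time(n_iters, procs, t_l, t_sch):
    """Estimated execution time of a self-scheduled DOALL loop."""
    if t_l > (procs - 1) * t_sch:
        # Coarse granularity: scheduling operations do not collide (assumed form).
        return (n_iters / procs) * (t_l + t_sch)
    # Fine granularity: iteration scheduling is the critical path.
    return n_iters * t_sch + t_l


def speedup(n_iters, procs, t_l, t_sch):
    """Speedup over sequential execution without parallel processing overhead."""
    return (n_iters * t_l) / loop_time(n_iters, procs, t_l, t_sch)
```

For example, with 16 processors and a scheduling time of 10 cycles, iterations of 1000 cycles fall in the coarse-grain regime and the estimated speedup exceeds 15. Iterations of 50 cycles fall in the fine-grain regime, where the execution time is dominated by N t_sch and the speedup drops to about 5.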
In the remainder of this section we first look at how loop execution time varies with loop granularity. Then we quantify the iteration scheduling overhead $t_{sch}$ for different hardware synchronization primitives by simulating execution of a parallel loop with very fine granularity.
Granularity effects
The analysis above shows the existence of two different types of behavior of execution time for a parallel loop. Given a multiprocessor system, the parameters $P$ and $t_{sch}$ do not change from one loop to another. Keeping these parameters constant, the granularity of a loop, $t_l$, determines whether scheduling overhead is significant in overall execution time or not.
The architectural support for synchronization primitives influences the execution time of a parallel loop in two ways. On one hand, different values of $t_{sch}$ for different primitives result in different execution times when the loop iterations are small (i.e., fine granularity loops). On the other hand, $t_{sch}$ determines whether a loop is of fine or coarse granularity. In this section we present the simulation results on how loop execution time varies across different implementations of the iteration scheduling algorithm. Since $t_{sch}$ determines the execution time of fine granularity loops, we quantify how $t_{sch}$ changes with the synchronization primitives used and the number of processors in the system.
Scheduling overhead for fine grain loops
For fine grain loops, the loop execution time $T_P$ is approximately $N\,t_{sch}$. The change of execution time with respect to the granularity of a set of synthetic loops is shown in Figure 11 for the test&set primitive implementing the test&test&set algorithm. Each of the synthetic loops has a total of 220000 executed instructions. Therefore, the region where iteration size < 50 instructions corresponds to $N > 4400$ in these figures. The common observation from these figures is that when loop iterations are sufficiently small ($N$ is sufficiently large), the execution time increases linearly with $N$. Also, when extrapolated, the $T_P$ vs. $N$ lines go through the origin, which validates the linear
The synchronization bus model used in these simulations has single cycle access time for free locks and single cycle lock transfer time. Therefore the synchronization bus data shows the highest performance achievable by hardware support for lock accesses alone. In Section 6, the performance figures for a synchronization bus which also supports single cycle fetch&add operation are given.
Such a synchronization bus is capable of scheduling a loop iteration every clock cycle. Therefore its overall performance can be expected to be better than all the primitives analyzed in this section.
Synchronization Characteristics of Applications
In this section we report some synchronization characteristics of the application programs used in our experiments. These characteristics help to focus our experiments and to analyze the experimental results. Section 5.1 presents the granularity of the parallel loops in these application programs.
Section 5.2 deals with their lock access locality.
Parallelism characteristics of application programs
Experimental investigation of parallel processing requires realistic parallel programs. To support our experiments, we parallelized a set of programs from the Perfect Club benchmark set. KAP [17] was used as the primary parallelization tool. Using basic-block profiling (tcov), the frequently executed parts of the program were identified. If the parallelization of these parts was not satisfactory, the reasons were investigated. In some cases, the unsatisfactory parallelization results were simply due to KAP's limitations in manipulating loop structures, e.g., too many instructions in the loop body or too many levels of nesting. In these cases, the important loops were parallelized manually.
Among all the programs thus parallelized, four show a relatively high degree of parallelism, i.e., at least 60% of the computation was done in the parallel loops. These four programs are ADM, BDNA, DYFESM, and FLO52. ADM is a three-dimensional code which simulates pollutant concentration and deposition patterns in a lakeshore environment by solving a complete system of hydrodynamic equations. The BDNA code performs molecular dynamic simulations of biomolecules in water. The DYFESM code is a two-dimensional, dynamic, finite element code for the analysis of symmetric anisotropic structures. The FLO52 code analyzes the transonic inviscid flow past an airfoil by solving unsteady Euler equations. To perform experiments with these four programs, we inserted instrumentation code in the programs and collected their traces. An in-depth treatment of automatic parallelization and the available parallelism in the Perfect Club programs can be found in [7, 9].
Table 3 shows the available parallelism and granularity for the innermost parallel loops in the four automatically parallelized programs. In three of the four programs, FLO52, ADM, and DYFESM, the parallelism was exploited in the form of nested DOALL loops. For the BDNA program, the parallel loops were not nested and two thirds of the dynamic parallel loops were DOACROSS loops with dependence distances of one iteration.
For nested parallel loops, the number of iterations of outer loops does not differ from that of innermost parallel loops. Therefore, the number of iterations of parallel loops cannot be increased with techniques such as parallelizing outer loops or loop interchange. The small number of loop iterations suggests that chunk scheduling and guided self-scheduling cannot be used to improve performance significantly beyond self-scheduling. The small number of instructions in each iteration suggests that architectural support is needed to execute these programs efficiently.
Locality of lock accesses in synchronization algorithms
In our simulations, all four programs exhibited very low locality for lock accesses. When a processor acquires a lock, we consider it a lock hit if the processor is also the one that last released the lock.
Otherwise, the lock acquisition results in a lock miss. The measured lock hit rate for the four programs with four or more processors was less than 0.2. Such a low lock access locality can be explained by the dynamic behavior of scheduling and synchronization algorithms.
For each parallel loop, every processor acquires the task queue lock and barrier lock only once.
This results in a round-robin style of accesses to these locks. For each parallel loop, the loop counter lock used in the loop self-scheduling algorithm is accessed multiple times by each processor. However, a lock hit can occur only when the processor which most recently acquired an iteration finishes the execution of that iteration before the completion of all the previously scheduled iterations. Due to low variation in the size of iterations of a parallel loop, this scenario is unlikely.
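The lock hit rate defined in this section can be computed from the order in which processors acquire a lock. The sketch below is a minimal illustration with a hypothetical function name. It assumes each acquisition is followed by a release from the same processor, and it counts the very first acquisition as a miss.

```python
def lock_hit_rate(acquirers):
    """Fraction of acquisitions made by the processor that last released the lock.

    acquirers: processor ids in the order they acquired (and released) one lock.
    """
    hits = 0
    last = None
    for proc in acquirers:
        if proc == last:
            hits += 1
        last = proc
    return hits / len(acquirers) if acquirers else 0.0
```

A strictly round-robin access pattern, such as `[0, 1, 2, 3, 0, 1, 2, 3]`, gives a hit rate of zero, which illustrates why the task queue and barrier locks exhibit so little locality.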
In the experiments, because of the low lock hit rate, the atomic memory operations are implemented in shared memory. An implementation of atomic operations via the cache coherence protocol would result in excessive invalidation traffic, and would also increase the latency of atomic operations. On the other hand, algorithms like test&test&set require spinning on memory locations which are modified by atomic operations. Therefore all memory locations are cached with an invalidation based write-back cache coherence scheme. This simple scheme effectively uses the cache to eliminate excessive memory traffic due to spinning while efficiently executing atomic synchronization primitives in memory modules.
Experimental Results
In this section we present the performance implications of synchronization primitives on four application programs. The performance results are obtained by simulating a 16-processor system assuming centralized task scheduling, iteration self-scheduling, and linear non-blocking barrier synchronization. The system timing assumptions are the same as those summarized in Section 3.
To calculate the speedup, the execution time for the sequential version of a program without any parallel processing overhead is used as the basis.
Figures 13-16 present the speedup obtained in the execution of these programs together with three categories of parallel processing overhead: iteration scheduling, task scheduling, and idle time.
Each figure shows the results for one benchmark in two graphs, one for 3-cycle memory modules and the other for 20-cycle memory modules. The horizontal axis lists the combinations of architectural support and lock algorithms used in the experiments; these combinations are described in Table 4 (footnote 4).
The task scheduling overhead corresponds to the time the processors spent acquiring tasks from the task queue. The iteration scheduling overhead refers to the time the processors spent in the self-scheduling code to acquire iterations. The processor idle time is defined as the time spent by processors waiting for a task to be put into the empty task queue. According to this definition, a processor is idle only if the task queue is empty when the processor completes its previously assigned task. This provides a measure of available parallelism in the parallelized programs.
Note that the three overhead numbers in Figures 13-16 for each combination do not add up to 100%. The major part of the difference is the time that is actually spent in the execution of the application code. In addition, there are three more categories of overhead that are measured but not shown because they are usually too small to report. They are due to task queue insertion, barrier synchronization, and Advance/Await synchronization. The time it takes for processors to insert tasks into the task queue is less than 2% of the execution time for all experiments. For all four benchmarks, the barrier synchronization overhead is also measured to be less than 2% of the execution time. Of the four benchmarks, we encounter a significant number of DOACROSS loops
Footnote 4: The combination of the exchange-byte primitive with the test&test&set algorithm is not included because this case has the same performance as the test&set with test&test&set combination.
Table 4:
exchange-byte with queuing lock: The queuing lock algorithm is used to access the locks associated with shared counters in iteration scheduling and barrier synchronization. It is also used to access the task queue lock.
test&set with test&test&set: The test&test&set algorithm is used to access the locks associated with the shared counters in iteration scheduling and barrier synchronization algorithms. It is also used to access the task queue lock.
synch. bus: The synchronization bus provides the single cycle lock/unlock operations to access the locks associated with the shared counters in iteration scheduling and barrier synchronization algorithms. They are also used to access the task queue lock.
In Figures 13-16, the three experiments on the left side of each graph correspond to the cases where some form of fetch&add primitive is supported in hardware. For all four applications, when the fetch&add operation is not supported, the iteration scheduling overhead increased significantly.
This increase in overhead has a direct impact on the performance of the applications. Furthermore, the performance of the fetch&add primitive with the queuing lock algorithm (column 2) was at least as good as the performance of a synchronization bus supporting single cycle atomic lock accesses (column 6). This is true even when the memory module cycle time is 20 processor cycles, which implies a minimal latency of 43 cycles to execute fetch&add. Therefore, implementing the fetch&add primitive in memory modules is as effective as providing a synchronization bus that supports one cycle lock/unlock primitives.
For the BDNA program, task scheduling overhead is not significant in any of the experiments. As shown in Table 3, loops in BDNA have a large number of iterations and relatively large granularity. We would like to make two more points about the lock algorithms. We have three different implementations of lock accesses. They are the test&test&set algorithm (columns 1 and 4), the queuing lock algorithm (columns 2 and 5), and a synchronization bus implementation of lock operations (columns 3 and 6). The test&test&set algorithm differs from the queuing lock algorithm in the amount of bus contention it causes. On the other hand, the queuing lock algorithm is similar to a synchronization bus implementation of lock operations, except for a much higher lock access latency. As for ADM and DYFESM, lack of parallelism is also an important factor for the low speedup figures. This can be observed from the idle time of processors in Figures 15 and 16. Finally, the results presented here demonstrate that the architectural support for synchronization and the choice of lock algorithms significantly influence the performance of all four parallel application programs.
Concluding Remarks
In this paper, we analyze the performance implications of synchronization support for Fortran programs parallelized by a state-of-the-art compiler. In these programs, parallelism is exploited at the loop level, which requires task scheduling, iteration scheduling, barrier synchronization, and Advance/Await.
Using simulation, we show that the time to schedule an iteration varies significantly with the architectural synchronization support. The synchronization algorithms used in executing these programs depend heavily on shared counters. In accessing shared counters, we conclude that lock algorithms which reduce bus contention do enhance performance. For the applications we examined, due to the importance of shared counters, a fetch&add primitive implemented in memory modules
