This paper presents a new parallelization model, called coarse-grained thread pipelining, for exploiting speculative coarse-grained parallelism from general-purpose application programs in shared-memory multiprocessor systems. This parallelization model, which is based on the fine-grained thread pipelining model proposed for the superthreaded architecture [11, 12], allows concurrent execution of loop iterations in a pipelined fashion with run-time data-dependence checking and control speculation. The speculative execution combined with the run-time dependence checking allows the parallelization of a variety of program constructs that cannot be parallelized with existing run-time parallelization algorithms. The pipelined execution of loop iterations in this new technique results in lower parallelization overhead than in other existing techniques. We evaluated the performance of this new model using some real applications and a synthetic benchmark. These experiments show that programs with a sufficiently large grain size compared to the parallelization overhead obtain significant speedup using this model. The results from the synthetic benchmark provide a means for estimating the performance that can be obtained from application programs that will be parallelized with this model. The library routines developed for this thread pipelining model are also useful for evaluating the correctness of the codes generated by the superthreaded compiler and in debugging and verifying the simulator for the superthreaded processor.
Introduction
Shared-memory multiprocessor systems typically exploit the coarse-grained parallelism available in loops in order to improve execution time performance. Compile-time data-dependence analysis often cannot extract much of the available loop-level parallelism, however, due to the inherent limitations of compile-time information. In these cases, the necessary dependence analysis must be performed at run-time to be able to exploit whatever parallelism may be available. Several run-time parallelization schemes have been proposed that can improve the performance of application programs that would otherwise have to be executed sequentially [1, 8, 9, 16]. These approaches typically use an inspector phase to determine the dependences that actually exist at run-time, followed by an executor phase that actually performs the computation. Although the executor phase can often be run speculatively at the same time as the inspector phase, these approaches cannot be used to parallelize such general constructs as do-while loops. Instead, these previous approaches are limited to parallelizing only array-oriented constructs. They also must contend with the overhead of the inspector phase.
This work is supported in part by the National Science Foundation under grant Nos. MIP-9610379, MIP-9971666, CDA-9502979, and CDA-941405.
In this paper, we describe a new parallelization model, called coarse-grained thread pipelining [4], for exploiting coarse-grained parallelism from general-purpose application programs in shared-memory multiprocessor systems. This model allows concurrent execution of loop iterations in a pipelined fashion with run-time data-dependence checking and control speculation. Our coarse-grained model is based on the fine-grained thread pipelining model proposed for the superthreaded architecture [11, 12]. The superthreaded architecture uses a thread pipelining execution model in which threads are dynamically initiated and executed. We extend the fine-grained superthreaded model by implementing it in a set of software library routines to parallelize coarse-grained applications on off-the-shelf multiprocessor systems. With the pipelined execution of loop iterations, the run-time dependence analysis with this new approach results in lower parallelization overhead than in other existing run-time parallelization models. The coarse-grained thread pipelining model can handle run-time dependence analysis of any type of data structure, making it applicable to a wide variety of loop structures. Furthermore, speculative execution combined with run-time data-dependence checking allows the parallelization of traditionally sequential constructs, such as do-while loops.
The remainder of the paper is organized as follows. Section 2 describes the basic fine-grained thread pipelining execution model of the superthreaded architecture. Section 3 then presents our coarse-grained thread pipelining model, describing how it is implemented in software using library routines. Section 4 evaluates the performance of this implementation using some real application benchmarks and a synthetic benchmark program run on an SGI Challenge shared-memory multiprocessor. Section 5 discusses some related work for run-time parallelization and how our model differs from these previous schemes. Finally, Section 6 summarizes our results and conclusions.
Superthreaded Architecture
The superthreaded architecture [11, 12] exploits task-level parallelism using multiple threads of control. A superthreaded processor consists of a number of thread processing units that share an instruction and data cache. At run-time, the multiple thread processing units, each with its own program counter and instruction execution data path, can fetch and execute instructions from multiple program locations simultaneously. The basic architecture of a superthreaded processor is shown in Figure 1.
Thread Partitioning
The compiler for the superthreaded architecture statically partitions the control flow graph of a program into the individual threads. Each thread is run on a separate thread processing unit. The execution of a program starts from its entry thread. It can then fork a successor thread on another thread processing unit. This successor thread can in turn fork its own successor thread. This process continues until all thread processing units are busy. Among the multiple threads running on the superthreaded processor, the oldest thread in the sequential order is referred to as the head thread. All the other threads derived from the head thread are called successor threads. After the head thread completes its computation, it will retire and release its thread processing unit. Its successor then becomes the new head thread. The completion and retirement of the threads must follow the original sequential program order to ensure correct results.
In the superthreaded execution model, successor threads can be forked with or without control speculation. When a thread forks a successor without control speculation, it must ensure that all of the control dependences of the successor thread have already been satisfied. If a thread forks a successor thread with control speculation, however, it must subsequently verify all of the speculated control dependences. If any of the speculative dependences evaluate to false, the thread must abort its successor thread and all of the following threads.
Thread Pipelining Execution Model
The superthreaded architecture uses the thread pipelining model to overlap thread execution and to enforce data dependences between concurrently executing threads. In this model, thread initiation and data forwarding are performed through explicit thread management and communication instructions. The execution of a thread is partitioned into several stages, each of them performing a specific function. Figure 2 shows the pipelined execution of contiguous threads in a superthreaded processor. The function of each of the thread pipelining stages is described in the following sections.
Continuation Stage
After being initiated by its predecessor thread, each thread begins with the continuation stage. The major function of this stage is to compute the recurrence variables, such as loop index variables, needed to fork the next thread. All of these types of variables are called continuation variables. The values of the continuation variables must be forwarded to the next thread processing unit before the next thread can be activated. This stage ends with a fork instruction, which ensures serialization of the continuation stages of successive threads. This serialization is necessary because the computation of the continuation variables in the current thread is dependent on the results from the continuation stage of the previous thread.
Target-Store-Address-Generation (TSAG) stage
The Target-Store-Address-Generation stage computes the addresses of the write operations upon which concurrent threads may be data-dependent. These addresses are referred to as target store addresses. These store addresses are computed at run-time, and are forwarded to the memory buffers of successor threads using the allocate_ts instruction. To guarantee program correctness, a successor thread is not allowed to perform any load operation which can be data dependent on those store operations marked with an allocate_ts instruction until its predecessor thread has completed the TSAG stage and has forwarded all the target store addresses to its memory buffer. This ordering is enforced using two synchronization instructions.
The release_tsag_done instruction is placed at the end of the TSAG stage to send a tsag_done flag to its successor thread. A corresponding wait_tsag_done instruction is placed in the successor thread before its corresponding load operations. This mechanism effectively implements run-time data-dependence checking.
To allow more overlap between threads, the TSAG stage can be further partitioned into two parts. The first generates target store addresses that do not have any data dependences on earlier threads. This part does not need to wait for the tsag_done flag and so can complete quickly. The part that is data dependent on previous threads needs to wait for the tsag_done flag as before.
Computation Stage
The computation stage performs the main computation of a thread. If a thread executes a load operation whose address matches that of a target store entry in its memory buffer during the computation stage, the thread will either read the data from the entry if it is available, or it will wait until the data is forwarded to its memory buffer by an earlier thread. If the thread is computing the value of a target store, it needs to forward the data to the memory buffers of all its concurrent successor threads using a store_ts or release_ts instruction.
If a thread completes normally without being aborted by a predecessor thread, it will end with a stop instruction. After executing the stop instruction, the thread waits until it becomes the head thread and then performs its write-back stage. If a thread determines that a control speculation is incorrect, however, it kills all its successor threads using the abort_future instruction.
Write-back Stage
In the write-back stage, a thread writes all the data stored in its memory buffer to the main memory through the shared data cache. The write-backs are performed in the sequential program order to preserve non-speculative memory state. This ordering also eliminates all output and anti-dependences between threads. After completing the write-back stage, a thread is retired and the thread processing unit becomes idle until it is again scheduled with a new thread. The serialization of the write-back stage is achieved with two synchronization instructions. The wait_wb_done instruction causes the thread to wait on the wb_done flag. A corresponding release_wb_done instruction executed in the head thread sets the wb_done flag for its immediate successor thread.
3 Coarse-grained Thread Pipelining
The coarse-grained thread pipelining model developed in this paper extends the basic thread pipelining model to speculatively parallelize coarse-grained applications on conventional multiprocessor systems. While the original superthreaded architecture uses specially-designed hardware to support the thread pipelining model, this coarse-grained version of the thread pipelining model is implemented entirely in software on a standard shared-memory multiprocessor system. It uses the same four thread pipeline stages as in the fine-grained superthreaded model, i.e., the continuation, TSAG, computation and write-back stages. The functionality of the stages remains the same, but the implementation of the stages in this case is adapted to match the requirements of the shared-memory architecture.
In this coarse-grained thread pipelining model, thread management and communication between threads are done at the software level through calls to a number of specially-developed library routines. Our thread pipelining model executes concurrent threads on multiple processors. A unique process is created on each processor to act as a thread processing unit. These processes initiate threads for execution and each thread goes through the different pipeline stages, retiring when it completes all of the stages. At this point, the processor running the thread waits until a new thread is initiated. It then begins executing this new thread.
Mapping the Thread Pipeline Stages to Software
This section describes how the various thread pipeline stages are mapped on to an off-the-shelf shared-memory multiprocessor system. The subsequent section briefly describes the different library routines that are used to actually implement the stages and thereby parallelize an application program with this thread pipelining model.
Continuation Stage
As shown in Figure 3, each thread starts with the continuation stage. This stage serializes the threads in order of the thread number to ensure the correct computation of the continuation variables. Since the processes initiating the threads are all created simultaneously at the beginning of the program's execution, we need some mechanism to allow threads to wait at the entry point of a thread until a predecessor thread signals a successor thread to continue. We introduce the flag thread_active for each processor to indicate that the corresponding thread on that particular processor can proceed. Initially, all but the first thread's flags will be reset so that only the first thread enters the continuation stage. When the first thread completes this stage, it sets the thread_active flag for the next thread so that its successor thread can start. When this thread completes its continuation stage, it will then set its successor's flag. In this way, successive threads on different processors will be started sequentially. During this stage, each thread sets the variable thread_num in the processor on which it is running. This variable identifies the current thread running on the processor and is used to coordinate the write-back process when threads are aborted.
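The flag handoff described above can be illustrated with a minimal Python sketch. It is not the library implementation: the function name run_continuation_stages, the use of threading.Event as the per-processor flag, and the loop index standing in for the continuation variable are all assumptions made for illustration.

```python
import threading

def run_continuation_stages(num_procs, num_threads):
    """Return (thread_num, processor_id) pairs in continuation-stage order."""
    # One thread_active flag per processor; initially all are reset.
    thread_active = [threading.Event() for _ in range(num_procs)]
    next_thread = [0]   # hypothetical continuation variable: next loop index
    order = []

    def processor(pid):
        successor = (pid + 1) % num_procs
        while True:
            thread_active[pid].wait()        # wait at the thread entry point
            thread_active[pid].clear()
            thread_num = next_thread[0]      # identifies the thread on this processor
            if thread_num >= num_threads:
                thread_active[successor].set()   # let the other processors finish
                return
            next_thread[0] = thread_num + 1      # compute the continuation variable
            order.append((thread_num, pid))
            thread_active[successor].set()       # start the successor thread
            # The TSAG, computation and write-back stages would run here,
            # overlapping with the successor's continuation stage.

    workers = [threading.Thread(target=processor, args=(p,))
               for p in range(num_procs)]
    for w in workers:
        w.start()
    thread_active[0].set()                   # only the first thread may proceed
    for w in workers:
        w.join()
    return order
```

Because a single flag is set at any time, the continuation stages execute strictly in thread-number order even though the processes are all created up front.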
Target-Store-Address-Generation (TSAG) Stage
In this software thread pipelining model, threads that are data-dependent on previous concurrent threads must wait on a flag corresponding to the dependent data item. Because these values are allocated in shared memory, there is no need to actually forward the data values between the threads. Instead, when the flag is set, the waiting thread knows that the data is available and can be read from the shared memory.
The superthreaded architecture uses the memory address of each data access to determine at runtime whether a dependence exists or not. To enforce such a runtime data-dependence test for every data access in the software model is expensive both in terms of memory usage and execution-time overhead. Data-dependences that can be determined at compile-time can be enforced through synchronization flags. Thus, only memory accesses that cannot be statically analyzed for the existence of data-dependences are handled by the runtime approach in the software implementation.
The runtime data-dependence test uses three data structures. Each possibly dependent memory location has an entry in the address data structure that holds its address. The identifier of the thread accessing the data item is stored in the corresponding entry in the thread data structure. Finally, a flag indicating whether the data access is complete is stored in the corresponding flag data structure. During the TSAG stage, each thread sets the next available entry in the address, thread, and flag data structures for every data access that may result in a cross-iteration dependence for its successor threads. The flag entry is set to 0 indicating that the access has not occurred yet. A successor thread can continue computation involving a dependent data item only if the corresponding flag is set to 1. The TSAG stages are synchronized with a tsag_done flag using the wait_tsag_done and the release_tsag_done library calls, as in the original superthreaded model.
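As a rough illustration of the three parallel data structures, the following Python sketch keeps address, thread, and flag lists side by side. The class name DependenceTable and its methods allocate, release, and pending_writers are hypothetical; the actual library routines and their layout in shared memory are not reproduced here, and no synchronization is shown.

```python
class DependenceTable:
    """Parallel address / thread / flag lists for the run-time dependence test."""

    def __init__(self):
        self.address = []   # address of each possibly dependent location
        self.thread = []    # identifier of the thread that will access it
        self.flag = []      # 0 = access not yet done, 1 = access complete

    def allocate(self, addr, thread_id):
        # TSAG stage: claim the next available entry with the flag reset to 0.
        self.address.append(addr)
        self.thread.append(thread_id)
        self.flag.append(0)

    def release(self, addr, thread_id):
        # Computation stage: mark this thread's access to addr as complete.
        for i in range(len(self.address)):
            if self.address[i] == addr and self.thread[i] == thread_id:
                self.flag[i] = 1

    def pending_writers(self, addr, thread_id):
        # Earlier threads whose access to addr has not completed yet.
        return [t for a, t, f in zip(self.address, self.thread, self.flag)
                if a == addr and t < thread_id and f == 0]
```

A successor thread would proceed with an access only when pending_writers returns an empty list for that address, which corresponds to all matching flag entries being 1.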
Computation Stage
During this stage the threads perform the main computation of each thread's portion of the application program. The computation stages of successive threads are overlapped. The actual amount of overlap is determined by dependences on previous threads. Each thread first determines its data-dependences on previous threads by matching the memory addresses of its data accesses to those stored in the address data structure. If a match is found, the thread must wait on the corresponding flag entry before it can access the data. When a thread is done with a data item, it sets the corresponding flag to thereby allow successive threads to proceed. A thread must set the flag of a data item regardless of whether the data is actually accessed or not so that successor threads can proceed in case the instructions accessing the data are not executed.
Since successive threads can be forked with control speculation, a thread must check during the computation stage whether the control speculation is correct. If not, it must abort all successor threads by setting the ABORT_FLAG. Setting this flag signals all following threads that they must abort. Each thread checks the ABORT_FLAG just before entering the computation stage. If the thread finds that the flag has been set by a previous thread, it will bypass the computation stage. A thread that has already started its computation stage but needs to be aborted because a predecessor thread has set the ABORT_FLAG will either execute the instruction that causes the abort (for example the terminating condition in a while loop) and so exit the computation stage, or it will complete its computation stage and will then check for the abort signal before it begins its write-back stage. To maintain non-speculative memory states, threads that are forked with control speculation perform their writes in a private memory space. If a thread completes successfully, it copies the local updates to the actual memory locations in the shared memory during the write-back stage. Memory update during the write-back stage reduces the amount of parallelism among concurrent threads since write-backs are performed in the original sequential order. Threads forked without speculation, on the other hand, are allowed to perform their write-updates to actual memory locations to eliminate the write-back overhead. The overhead of write-backs for speculative threads could be reduced by privatization and reduction parallelization [8], although we do not evaluate the effectiveness of these techniques in this paper.
Write-back Stage
The write-back stages must be serialized to maintain the correct memory state for speculative execution. Our model uses a wb_done flag on each processor to force the threads on the corresponding processors to wait. After a thread performs its writes, it sets the wb_done flag of its immediate successor. The next thread then becomes the head thread. Thus, successive threads will perform their writes sequentially.
Those threads which are to be aborted, as determined in the computation stage, do not go through the write-back stage. Instead, the head thread restores the recurrence variables to the values with which the thread started so that new threads can begin from that point of execution. Threads that completed the computation stage without being aborted, which includes all threads preceding the first one aborted, complete the write-back phase. At the end of the write-back stage, each thread checks the ABORT_FLAG. If it is set, no more threads need to be initiated on this processor for the current execution phase. If the flag is reset, however, the corresponding processor begins executing from the continuation stage of the next thread in sequential order.
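The interaction between private writes, the abort signal, and the serialized write-back can be sketched as follows. This is an illustrative Python model under stated assumptions, not the library code: one OS thread is created per logical thread, a None item stands in for a failed control speculation (such as a loop exit condition), and the names run_speculative, abort_at, and delta are hypothetical.

```python
import threading

def run_speculative(items):
    """Sum items up to the first None; later threads run speculatively."""
    n = len(items)
    wb_done = [threading.Event() for _ in range(n + 1)]
    lock = threading.Lock()
    shared = {"total": 0, "abort_at": None, "committed": []}

    def logical_thread(i):
        private = {}                      # private space for speculative writes
        # Computation stage: either detect the exit condition or do the work.
        if items[i] is None:
            with lock:                    # record the earliest aborting thread
                a = shared["abort_at"]
                shared["abort_at"] = i if a is None else min(a, i)
        else:
            private["delta"] = items[i]
        # Write-back stage: wait until this thread becomes the head thread.
        wb_done[i].wait()
        with lock:
            a = shared["abort_at"]
        if a is None or i < a:            # no predecessor (or self) aborted
            shared["total"] += private["delta"]
            shared["committed"].append(i)
        wb_done[i + 1].set()              # successor becomes the head thread

    workers = [threading.Thread(target=logical_thread, args=(i,))
               for i in range(n)]
    for w in workers:
        w.start()
    wb_done[0].set()
    for w in workers:
        w.join()
    return shared["total"], shared["committed"]
```

By the time a thread becomes the head thread, all of its predecessors have finished their computation stages, so the abort information it reads for earlier threads is final. Work done by threads after the aborting one stays in their private space and is simply discarded.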
Library Routines for Thread Management and Communication
The implementation of the different thread pipeline stages requires mechanisms to allow communication between threads. The shared-memory architecture used for our software implementation of the thread pipelining model enables us to carry out the communications implicitly through flags and variables in the shared memory space. A thread that needs to communicate with another thread can simply update an appropriate flag or data item in the shared memory. The other threads then can read the values from the shared memory when needed. Table 1 lists the various library routines that implement the thread pipelining execution in software.
An Example Program
In this section we use the sequential code shown in Figure 4 to demonstrate how the thread management library routines can be used to parallelize a sequential code that would be impossible to parallelize with traditional compiler approaches. The code segment shown in Figure 4 is a do-while loop from the main procedure of the Unix word count utility program, wc. There is a loop-carried dependence caused by the update of the variable in_word. There are also output-dependences caused by the memory updates of the variables lines and words. Figure 5 shows the corresponding coarse-grained thread pipelining code. Each thread is assigned one loop iteration. In this code, P is the total number of processors in the system and id is the process id of the process running a thread. The threads are initiated speculatively since the loop termination condition cannot be determined ahead of time. The continuation stage begins with a call to wait_thread_active and terminates with a call to thread_init, which initiates the next thread. Each thread reads BUFFER_SIZE bytes from the input file into its local buffer, buf, thereby advancing the file pointer, fd, for the next thread. A synchronization flag, flag_in_word, is used to enforce the data dependence on the variable in_word. Since there are no run-time data-dependences, the TSAG stages do not involve any work. The TSAG stages are synchronized with the wait_tsag_done and release_tsag_done function calls.
Before a thread begins its computation stage, it invokes the routine check_abort to check if a predecessor thread has been aborted. If the function returns true, which indicates that the current thread needs to be aborted, the thread will bypass the computation stage and will wait for the start of its write-back stage. Otherwise, the thread will begin its computation stage. In this stage, each thread begins by checking the loop exit condition, which in this case occurs when the end of the file is reached. If the condition is true, the thread invokes abort_thread and then jumps out of the loop. Otherwise, the thread continues with the computation of the loop body. Since a thread cannot continue until the value of in_word from the previous iteration is available, the necessary waiting is enforced by the while loop busy-waiting on the flag_in_word variable. Once the flag is set to its corresponding processor's id, the thread continues with the computation. The thread reads the value of in_word into its local memory location my_in_word for its own use. It updates in_word, if necessary, and sets flag_in_word for its successor thread. The thread then proceeds with the rest of the computation.
Each thread updates the variables lines and words in its own local memory copies, called my_lines and my_words. The threads arrive at, and then execute, their write-back stages sequentially. This serial order is enforced through the wait_wb_done and release_wb_done calls. A thread that successfully completed its computation stage will perform its write-back stage, which in this example is to update the variables lines and words with the thread's local copies, and will then retire using a call to stop_thread. The conditional statement on the flag thread_abort at the beginning of the write-back stage determines whether a thread should perform its write-back or should instead abort. A thread that needs to be aborted forces its successor thread to abort by calling abort_next. All threads following the one that first aborted will also be aborted.
If an aborting thread is the head thread at the time of its write-back stage, i.e., the first thread to abort, it should restore the recurrence variable, which in this case is bytes_read. After the write-back stage, the ABORT_FLAG is checked to determine if new threads need to be initiated for the current code segment. If not, the processes will return and wait for the next phase of execution.
Figure 5: The code from the example in Figure 4 parallelized using the coarse-grained thread pipelining library routines.
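The structure of the parallelized word count loop can be approximated with the Python sketch below. It is a simplified model rather than a transcription of Figure 5: the input is split into buffers in advance, so the end-of-file control speculation and the abort path are omitted, one OS thread is used per loop iteration, and threading.Event objects replace the busy-wait flags. The function names count_chunk and pipelined_wc are assumptions.

```python
import threading

def count_chunk(buf, starts_in_word):
    """Count newlines and word starts in buf, given the incoming in_word state."""
    lines = buf.count("\n")
    words = 0
    in_word = starts_in_word
    for ch in buf:
        if ch.isspace():
            in_word = False
        elif not in_word:
            in_word = True
            words += 1
    return lines, words

def pipelined_wc(text, buffer_size):
    chunks = [text[i:i + buffer_size] for i in range(0, len(text), buffer_size)]
    n = len(chunks)
    flag_in_word = [threading.Event() for _ in range(n + 1)]
    wb_done = [threading.Event() for _ in range(n + 1)]
    shared = {"in_word": False, "lines": 0, "words": 0}

    def logical_thread(i):
        buf = chunks[i]
        flag_in_word[i].wait()                 # wait for the predecessor's in_word
        my_in_word = shared["in_word"]
        # The outgoing state depends only on the last character of the buffer,
        # so it can be forwarded before the counting work starts.
        shared["in_word"] = not buf[-1].isspace()
        flag_in_word[i + 1].set()
        my_lines, my_words = count_chunk(buf, my_in_word)   # overlapped work
        wb_done[i].wait()                      # write-back in sequential order
        shared["lines"] += my_lines
        shared["words"] += my_words
        wb_done[i + 1].set()

    workers = [threading.Thread(target=logical_thread, args=(i,)) for i in range(n)]
    for w in workers:
        w.start()
    flag_in_word[0].set()
    wb_done[0].set()
    for w in workers:
        w.join()
    return shared["lines"], shared["words"]
```

Forwarding in_word early is what allows the counting work of successive buffers to overlap; only the short handoff and the final updates of lines and words are serialized.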
Compilation Issues
Conversion of a sequential code into its coarse-grained thread pipelining parallel code can be performed automatically by a compiler. In fact, the compiler for the fine-grained superthreaded architecture [13, 14] could be adapted for this purpose. The superthreaded architecture relies on the compiler to extract thread-level parallelism and to generate the necessary superthreaded instructions. One of the major tasks of the compiler for the superthreaded model is to partition a sequential program into threads for concurrent execution. Another important task involves identifying target stores for run-time data-dependence checking. For these tasks, the superthreaded architecture uses some conventional parallelizing compiler techniques, such as function inlining, induction variable substitution, alias analysis, data-dependence analysis, variable privatization, loop unrolling and interchange, and so forth. The compiler can further enhance the execution overlap of concurrent threads by applying some advanced program transformation techniques specific to superthreaded processors. The superthreading-specific optimization techniques [13] are conversion of data speculation to control speculation, distributed heap memory management, use of critical sections for order-independent operations, and memory buffering in the main memory.
After partitioning a program into threads using the above compiler techniques, each thread is partitioned into multiple stages for thread pipelining. These same compiler techniques are applicable to the coarse-grained thread pipelining model. However, the analysis techniques need to be adjusted to generate threads with the coarser granularity appropriate for the shared-memory architecture.
The library routines developed for the coarse-grained thread pipelining model also can be used as a tool to evaluate the correctness of the superthreaded code that will be produced by the compiler for the superthreaded architecture. A simulator for the superthreaded architecture [3] executes the compiled code and is very useful in the performance evaluation of the architecture itself. However, using a simulator to execute a code is often very time consuming. Instead, if the purpose is to verify the correctness of the compiled code, the library routines for the coarse-grained model will execute the code much faster. The superthreaded compiler can be easily modified to generate code with the library routines inserted in appropriate places in the program. The library routines may also be helpful in debugging and verifying the simulator for the superthreaded processor.
4 Performance Evaluation of the Coarse-Grained Thread Pipelining Model
In this section, we evaluate the performance of the coarse-grained thread pipelining model. First, we examine the execution time overhead of this model. We then examine the performance of this model when applied to some real application programs. Finally, we use a parameterized synthetic benchmark program to evaluate the performance potential of this coarse-grained model considering both a range of parallelism granularities and the fraction of the program that is inherently sequential. We also use the synthetic benchmark to evaluate the effect of varying the amount of work within each thread. The performance data obtained using the synthetic loop can be used to estimate the performance of real applications that could be parallelized using this thread pipelining model. For the experiments, we used a 196 MHz IP25 SGI Challenge shared-memory multiprocessor system. The system consists of 8 MIPS R10000 processors with a MIPS R10010 FPU, 32 Kbytes of separate data and instruction caches, 2 Mbytes of secondary unified cache and 1024 Mbytes of 8-way interleaved shared memory.
Thread Pipelining Overhead
The implementation of the software thread pipelining model requires the creation of threads as processes executing on multiple processors. This process creation requires a system call. The thread pipelining stages are implemented through the library calls described in Section 3. This initial thread creation, plus the execution of the library routines, adds to the overall parallel execution time of an application program and can be considered as the parallelization overhead.
The initial thread creation overhead, $T_{cr}$, is incurred only once when the processes executing the threads are created. These processes are not terminated until the entire program terminates. Threads are created and aborted on these processes as needed. This initial thread creation overhead increases with the number of threads created. For the 4-processor SGI system the average initial thread creation overhead is 0.7456 seconds with a standard deviation of 0.014 seconds, while for the 8-processor system the average initial overhead is 0.8606 seconds with a standard deviation of 0.021 seconds.
Table 2: Overhead (in microseconds) associated with the different pipeline stages of the coarse-grained thread pipelining model as measured on the test system.
The library call overhead, $T_{lib}$, is incurred for each thread being executed because each thread calls these library routines as it goes through the different pipeline stages. Table 2 shows the overhead associated with each of the four pipeline stages. $T_{lib}$ is the sum of these four pipeline stage overheads. Note that the overhead for the computation stage is for the run-time data-dependence test, which can be performed concurrently by successive threads. This overhead increases with the number of dependent data accesses in each thread, as well as with the number of concurrently executing threads. The write-back stage overhead, $T_{wb}$, consists of a fixed overhead per thread plus an additional overhead due to write-updates. This additional write-update overhead increases with the number of bytes written back. The write-update time depends on whether the data is in the cache or has to be accessed from the main memory. The library call overhead observed by the concurrent threads is overlapped due to the pipelined execution of the threads. In a P-processor system, every set of P threads will incur an overhead of $T_{lib} + (P-1)\,T_{wb}$.
However, as shown in Figure 6, the write-back overhead will be overlapped with the $T_{lib}$ overhead of the next set of P threads, except for the last P threads. Thus, if there are a total of N parallel iterations to be executed in a P-processor system, the total library call overhead will be $(N/P)\,T_{lib} + (P-1)\,T_{wb}$ for that parallel loop.
Figure 6: Some of the library call overhead can be overlapped in a parallel loop due to the pipelined execution of threads.
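The total library call overhead expression for a parallel loop can be evaluated directly, as in the small Python helper below. The function name library_overhead is hypothetical, and the numbers in the usage note are made-up values chosen only to show the arithmetic; they are not the measurements from Table 2.

```python
def library_overhead(n_iters, n_procs, t_lib, t_wb):
    """Total library call overhead of one parallel loop.

    N/P sets of threads each pay t_lib, and only the last set of P threads
    exposes the (P - 1) serialized write-backs that cannot be overlapped.
    """
    return (n_iters / n_procs) * t_lib + (n_procs - 1) * t_wb
```

For example, with 800 iterations on 8 processors and illustrative costs of 100 and 10 time units for the per-thread library overhead and the write-back overhead, library_overhead(800, 8, 100.0, 10.0) evaluates to 100 * 100 + 7 * 10 = 10070 time units.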
Because of the overhead of this thread pipelining model, serial application benchmarks that are parallelized using this model must have sufficiently large granularity (i.e., execution time) for each parallelized portion of the code. The greater the amount of work involved in each thread of the resulting parallelized code, the smaller the effect of the thread creation and the library call overhead on the overall parallel execution time. To increase the amount of work per thread, one thread execution can include several loop iterations, for instance. Since the thread creation overhead is incurred only once during program execution, we can ignore the effect of this overhead for loops that are executed many times. For such loops, the total library call overhead will dominate and will thereby determine the actual performance gain.
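One simple way to raise the granularity, as suggested above, is to assign a block of consecutive iterations to each thread. The helper below is a hypothetical sketch of such a blocking scheme; the name chunk_iterations and the choice of contiguous blocks are assumptions, and other assignments are possible.

```python
def chunk_iterations(n_iters, iters_per_thread):
    """Group loop iterations into contiguous blocks, one block per thread."""
    return [range(start, min(start + iters_per_thread, n_iters))
            for start in range(0, n_iters, iters_per_thread)]
```

With 10 iterations and 4 iterations per thread, this yields three threads covering iterations 0-3, 4-7, and 8-9, so the per-thread library call overhead is paid three times instead of ten.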
Evaluating Performance With Real Application Programs
Test Programs
Four application programs were selected to be parallelized with this thread pipelining model. The first is the Gaussian Elimination program for solving an NxN system of linear equations. The second is the TRFD program from the Perfect Club Benchmarks [2], which simulates a two-electron integral transformation using a fourth-order tensor equation. The third is the MDG program from the Perfect Club Benchmarks [2], which performs a molecular dynamics simulation of flexible water molecules. The last is the Unix wordcount utility program, wc.
The Gaussian Elimination (Gauss) program consists of two major components: forward elimination, which typically takes about 98-99% of the overall program execution time, depending on the problem size, and back substitution, which takes the remaining small fraction of the total execution time. The forward elimination component of the code consists of a three-level loop. While the outermost loop is sequential, the two innermost loops can be fully parallelized. The back substitution portion has a loop-carried dependence since each iteration calculates the value of one variable and uses the values calculated in all the previous iterations. This cross-iteration dependence limits the parallelism available in this loop.
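The loop structure described above can be seen in a textbook sequential version of the algorithm. The following C sketch is not the benchmark code used in our experiments; the function names, the row-major array layout, and the absence of pivoting are all assumptions made to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

/* Forward elimination on an n x n row-major matrix a with right-hand side b.
 * This sketch does no pivoting, so every pivot must be nonzero. */
void forward_eliminate(double *a, double *b, size_t n)
{
    for (size_t k = 0; k + 1 < n; k++) {       /* sequential: step k needs step k-1 */
        assert(a[k * n + k] != 0.0);
        for (size_t i = k + 1; i < n; i++) {   /* rows are independent of each other */
            double factor = a[i * n + k] / a[k * n + k];
            for (size_t j = k; j < n; j++)     /* columns are independent as well */
                a[i * n + j] -= factor * a[k * n + j];
            b[i] -= factor * b[k];
        }
    }
}

/* Back substitution: iteration i uses every x[j] computed by earlier
 * iterations, which is the loop-carried dependence. */
void back_substitute(const double *a, const double *b, double *x, size_t n)
{
    for (size_t i = n; i-- > 0; ) {
        double sum = b[i];
        for (size_t j = i + 1; j < n; j++)
            sum -= a[i * n + j] * x[j];
        x[i] = sum / a[i * n + i];
    }
}
```

For the system 2x + y = 5, x + 3y = 10, calling forward_eliminate followed by back_substitute yields x = 1 and y = 3.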
The TRFD program consists of four major routines: initialize, first trans, interchange, and second trans. The program uses two NxN arrays, V and S, and an NxNxNxN array, SALL. The initialize routine initializes the V and the SALL arrays. The first trans routine performs two standard matrix multiplications and a simple matrix transpose operation. The interchange routine exchanges the first two dimensions of the SALL array with the last two dimensions. The second trans routine is similar to first trans except that not all of the elements of the S array are transformed. Each of these routines is completely parallelizable [5].
In the MDG program, we considered Loop 1000 of the INTERF subroutine for parallelization. Loop 1000 is a two-level loop that has loop-carried dependences due to reduction operations on shared data variables. The inner loop has data dependences on fewer data variables than the outer loop, but it also has a smaller grain size. The outer loop, on the other hand, has cross-iteration dependences on more data variables than the inner loop, but it has a much larger grain size. We generated two parallel codes to compare for Loop 1000 in INTERF, one with the inner loop parallelized and one with the outer loop parallelized. The execution of the INTERF routine takes about 88-90% of the total execution time.
The main computation of the wc program consists of a while loop that reads a block of bytes from the input file and counts the number of lines, words, and characters in that block. The loop continues until the end of the file is encountered. The loop has cross-iteration data dependences on three shared data variables.
Standard loop parallelization techniques cannot parallelize while loops since the number of iterations that will be executed is determined dynamically at run-time. The control speculation provided by the coarse-grained speculative thread pipelining model allows these types of loops to be easily parallelized, however.
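The body of one iteration of such a while loop can be sketched as follows. This is an illustration written for this discussion, not the source of the Unix wc utility: the names wc_counts and count_block are hypothetical, and the in_word field is extra state added by the sketch so that a word straddling two blocks is counted once.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

typedef struct {
    long lines, words, chars;  /* the three counters carried across iterations */
    int in_word;               /* sketch-only state: a word may straddle two blocks */
} wc_counts;

/* Body of one while-loop iteration: scan one block of the input. */
void count_block(const char *buf, size_t len, wc_counts *c)
{
    assert(buf != NULL && c != NULL);
    for (size_t i = 0; i < len; i++) {
        unsigned char ch = (unsigned char)buf[i];
        c->chars++;
        if (ch == '\n')
            c->lines++;
        if (isspace(ch)) {
            c->in_word = 0;
        } else if (!c->in_word) {
            c->in_word = 1;    /* first character of a new word */
            c->words++;
        }
    }
}
```

The enclosing loop would call count_block once per block until the read returns no more bytes, so the iteration count is known only when the end of the file is reached.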
Execution Times
To transform the serial code into the thread pipelined parallel code, we identified the portions of the code that could be parallelized and manually inserted the library routines in the appropriate places. We executed the parallel codes on the SGI shared-memory multiprocessor system. The resulting speedups were computed using the execution time of the original sequential programs on a single processor of the same SGI system as the basis.
Gauss was applied to a dense matrix of dimension NxN. The problem size, N, was varied from 100 to 1000 in increments of 100. To obtain an appropriate grain size for each thread, 8 iterations were combined to produce each thread's work. We show the speedup of the forward elimination and the back substitution portions of the code, excluding the initial thread creation overhead, in Figures 7(a) and 7(b), respectively. Figure 7(c) shows the overall speedup of the program including the initial overhead in the total parallel execution time. The forward elimination code, which is completely parallelizable as a DOALL loop, shows significant speedup when parallelized using the thread pipelining model. However, the back substitution code shows relatively poor speedup, with a maximum of only 1.5 with both 4 and 8 processors. The main reason for the poor speedup of this routine is the very small amount of work per thread. Also, because of the cross-iteration dependence, there is inherently less concurrency among the threads in this portion of the code. Since the major component of the total execution time is in the forward elimination portion of the code, however, the overall speedup is quite high. Furthermore, the overall speedup increases with the problem size, since larger problem sizes increase the amount of work per thread. This larger grain size reduces the effect of the initial thread creation overhead and the overhead of the library calls.
For TRFD, the speedup increases with the problem size, since larger problems again increase the work per thread. In general, the actual work per thread executed in TRFD is less than that of Gauss. As a result, TRFD shows lower speedups than Gauss.
The INTERF subroutine of the MDG program was executed on systems with 2 through 8 processors. For the inner loop parallelized code, we dynamically allocated the number of iterations per thread based on the outer loop iteration count. The inner loop in Loop 1000 is a triangular loop. Hence, if we assign a fixed number of iterations per thread, more and more processors will remain idle due to an insufficient number of iterations available towards the end of the outer loop execution. This will result in poor utilization of the processors. For the outer loop parallelized code, 8 iterations per thread produced a sufficient grain size. The cross-iteration dependence due to the reduction operation is handled through partial accumulation in local variables and then updating the actual data variables during the write-back stage.
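The partial accumulation described above can be sketched in C as follows. The function names partial_sum and write_back_sum are hypothetical and are not the library routines used in our implementation; the sketch also assumes that the write-back stages of successive threads run one at a time, in iteration order, so that the shared variable is updated without a race.

```c
#include <assert.h>
#include <stddef.h>

/* Computation stage (sketch): accumulate into a thread-private variable, so
 * no shared data is touched while iterations run concurrently. */
double partial_sum(const double *x, size_t begin, size_t end)
{
    assert(x != NULL && begin <= end);
    double local = 0.0;
    for (size_t i = begin; i < end; i++)
        local += x[i];
    return local;
}

/* Write-back stage (sketch): assumed to be executed by one thread at a time,
 * in iteration order, so a plain update of the shared total is safe. */
void write_back_sum(double *shared_total, double local)
{
    assert(shared_total != NULL);
    *shared_total += local;
}
```

Each thread thus touches the shared reduction variable once per thread rather than once per iteration, which keeps the number of dependent accesses checked at run-time small.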
Figure 9(a) shows the speedups of the INTERF subroutine for P = 2 to 8 processors, excluding the thread creation overhead. The speedups obtained by parallelizing the outer loop scale with the number of processors. The speedups are quite high considering the fact that the loop has some cross-iteration dependences. However, the inner loop parallelized code does not scale well with the number of processors. In fact, the speedup starts decreasing for P>6. As explained earlier, with an increasing number of processors, the grain size per thread decreases as the outer loop iteration count increases. Eventually, the grain size becomes too small with respect to the overheads and so produces lower speedups. Also, the speedups for the inner loop parallelization are lower than those for the outer loop parallelization, again due to its smaller grain size. Figure 9(b) shows the overall speedup of the MDG program, including the initial thread creation overhead in the total execution time. Since about 90% of the program is parallelizable, the overall speedup is quite high.
The wc program was executed using a 0.6 Mbyte input file while varying the input buffer size. The buffer size, which determines the number of bytes read from the input file for each iteration of the while loop, was varied from 1 Kbyte to 20 Kbytes. Each thread was assigned one iteration of the parallel loop. The speedup of the wc program with 2, 4, and 8 processors is shown in Figure 10. The speedup of this program increases as the buffer size is increased, until the performance saturates, since increasing the buffer size increases the granularity of each thread.
Evaluating Performance Using a Synthetic Benchmark
We next use a synthetic application benchmark program to evaluate the potential range of performance of the coarse-grained thread pipelining execution model. We use this program to consider the effects of three important factors on the performance of this model: 1) the effect of varying the total degree of parallelism; 2) the effect of dynamic random variations in the work per thread; and 3) the effect of varying the average grain size (i.e., the average workload per thread).
    for (i = 0; i < N; i++) {
        /* a dummy loop simulating useful work during */
        /* calculation of a dependent value */
        for (j = 0; j < Work1; j++);
        /* calculate dependent value */
        value[i] = C * value[i-1];
        /* a dummy loop simulating useful */
        /* independent work */
        for (k = 0; k < Work2; k++);
    }
Figure 11: Synthetic benchmark program used to evaluate the performance range of the coarse-grained thread pipelining model.
To evaluate the performance effects of these factors, we created the synthetic benchmark program shown in Figure 11. This program consists of an N-iteration loop. Each iteration calculates the value of a list element using the value of the preceding list element. The first loop, which is parameterized by Work1, simulates the time required to calculate the i-th list element. The second loop, which is parameterized by Work2, simulates useful work that uses the i-th list element and is independent of all other iterations. Hence, an iteration cannot proceed until it obtains the value from the previous iteration. If Work2 is 0 and Work1 is set to some finite value, each iteration of the loop has to wait for the previous iteration to complete before it can begin. Thus, the loop is completely sequential in this case. If we use a finite value for Work2 while setting Work1 to zero, the value from the previous iteration is available immediately and each iteration can proceed concurrently. Thus, in this case, the loop becomes a fully parallelizable loop.
The two parameters Work1 and Work2 determine the degree of parallelism within each iteration of the loop. The degree of parallelism for any application is defined to be the ratio of work performed in parallel to the total work performed. In this test program, the degree of parallelism is simply Work2/(Work1+Work2). The grain size of each of the N outer loop iterations is determined by the total work performed in each iteration, (Work1+Work2). For a given grain size, different degrees of parallelism can be achieved by varying the parameters Work1 and Work2 appropriately. For this performance evaluation, we used two different grain sizes while varying the degree of parallelism from 0 to 1 within each grain size. Higher degrees of parallelism allow more execution overlap between successive threads and, hence, better speedup. Also, the larger the grain size, the lower the impact of the thread overhead and, thus, the higher the resulting speedup.
To study the case where the amount of work per iteration varies as the program executes, we randomly varied the workload for each iteration using both a uniform distribution and an exponential distribution. We used a loop with a fixed workload per iteration as the base loop. For both distributions, we kept the average workload the same as the per-iteration workload of the base loop. For the uniform distribution, we used two different variances, 10% and 20%.
Figures 12 and 13 show the speedup curves for the synthetic loop when parallelized using the coarse-grained thread pipelining model and executed with 4 and 8 processors, respectively. The total loop workload has been normalized to the total library call overhead, that is, it is expressed as the ratio of the total sequential workload to the total library call overhead. We see that the higher the normalized workload, the better the parallel execution performance, provided that there is sufficient parallelism among the loop iterations. We used the same two workloads for both 4 and 8 processors.
However, since with more processors there is more overlap in the pipelined execution of threads, the total library call overhead is lower with 8 processors than with 4 processors. This results in a higher normalized workload for the 8-processor system compared to the 4-processor system (e.g., 16x vs. 8x for the same total workload of 44 milliseconds).
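The quantities used to parameterize the synthetic loop can be written down directly. The following C sketch is illustrative: degree_of_parallelism follows the definition Work2/(Work1+Work2) given above, while split_grain and normalized_workload are hypothetical helper names introduced here, and rounding Work2 to the nearest integer is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Degree of parallelism of one iteration: Work2 / (Work1 + Work2). */
double degree_of_parallelism(long work1, long work2)
{
    assert(work1 >= 0 && work2 >= 0 && work1 + work2 > 0);
    return (double)work2 / (double)(work1 + work2);
}

/* Hypothetical helper: split a fixed grain size (Work1 + Work2) so that the
 * iteration has approximately the requested degree of parallelism in [0, 1]. */
void split_grain(long grain, double dop, long *work1, long *work2)
{
    assert(work1 != NULL && work2 != NULL);
    assert(grain > 0 && dop >= 0.0 && dop <= 1.0);
    *work2 = (long)(dop * (double)grain + 0.5);  /* round to nearest */
    *work1 = grain - *work2;
}

/* Sequential workload expressed as a multiple of the library call overhead. */
double normalized_workload(double total_seq_work, double total_lib_overhead)
{
    assert(total_lib_overhead > 0.0);
    return total_seq_work / total_lib_overhead;
}
```

With this parameterization, a degree of parallelism of 0 gives the fully sequential loop (Work2 = 0) and a degree of 1 gives the fully parallel loop (Work1 = 0), matching the two extreme cases described above.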
These speedup curves show that the speedup increases both as the degree of parallelism is increased and as the average grain size is increased, as expected. The parallel execution performance is not affected by uniformly distributed variations in the execution times of loop iterations. However, when the workload per iteration varies according to an exponential distribution, the performance can degrade significantly as the degree of parallelism increases. This degradation occurs because the higher variation in the workload per thread causes greater load imbalance, which in turn lowers the average processor utilization.
Performance Prediction
We can analyze the speedup results of the real applications that were presented in Section 4.2 using the speedups from the synthetic benchmark. Let us consider the forward elimination loop of Gauss, which is nearly fully parallel and, thus, has a degree of parallelism of almost 1.0. The normalized sequential workload of this loop for a problem size of N=400 is 8.07 with respect to the total library call overhead for 4 processors. Mapping this ratio to the speedup plot of Figure 12(a) shows that we would expect to see a speedup of approximately 2.8 for this loop. In fact, the plot in Figure 7(a) shows the actual speedup to be 2.6 for 4 processors, so the prediction is within 7.7% of the measured value. Similarly, the normalized serial runtime for 8 processors is found to be 16.12. Mapping this value to the corresponding point on the plot in Figure 13(a), we expect a speedup of 3.8. The actual speedup from Figure 7(a) is 3.6.
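As a check on the arithmetic, the relative gap between the predicted and measured speedups quoted above can be worked out directly. The 7.7% figure is reproduced when the measured speedup is taken as the reference; that choice of reference is an assumption here, since the text does not state it explicitly.

```latex
\[
\frac{2.8 - 2.6}{2.6} \approx 0.077 \quad (P = 4),
\qquad
\frac{3.8 - 3.6}{3.6} \approx 0.056 \quad (P = 8).
\]
```

On the same basis, the 8-processor prediction is off by about 5.6%, slightly closer than the 4-processor case.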
The poor speedup of the back substitution code can also be explained using the synthetic benchmark results. The degree of parallelism for this loop is lower for the earlier iterations and increases as dependent values are computed and become available to later iterations. On average, each loop iteration in the back substitution code has a degree of parallelism of about 50%. From Figures 12 and 13, we find the speedup to be close to 1.5 for both the 4-processor and 8-processor systems when the degree of parallelism is 0.5. This agrees closely with the measured speedups plotted in Figure 7(b).
We conclude that we can use the results in Figures 12 and 13 to determine how much speedup we can expect from a real application program provided that we can determine the amount of concurrency present in the code. Additionally, we must calculate the normalized serial execution time with respect to the parallelization overhead of the thread pipelining model on the particular multiprocessor system that will be used.
Due to the speculative execution of threads, some overhead will be incurred when threads executing an incorrect speculation need to be aborted and restarted. If the threads that need to be aborted receive the abort signal from a predecessor thread before starting the computation stage, the overhead is negligible. However, once a thread begins its computation stage, it does not sense the abort signal until it completes the computation stage. In the worst case, all successor threads will be in their respective computation stages when a predecessor thread identifies an incorrect speculation and sets the ABORT FLAG. All these successor threads will complete their computation stages and will be ready to begin their write-back stages when they detect the abort command. Since the computation stages of successive threads are overlapped, the overhead due to incorrect speculation will be, in the worst case, the time required for the last thread to complete its computation stage plus the time it spends waiting to begin its write-back stage. If W is the work per thread done in the computation stage and P is the total number of threads active simultaneously, the worst-case overhead due to misspeculation will be $W + (P-1)\,T_{wb}$. Thus, in the worst case, the cost of misspeculation is determined by the work done within each thread and the total number of threads initiated.
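The worst-case bound above is simple enough to evaluate directly. The helper below is a minimal sketch; its name and parameters are hypothetical and it is not one of the library routines of the thread pipelining model.

```c
#include <assert.h>

/* Worst-case cost of one misspeculation: W + (P - 1) * T_wb, where W is the
 * computation-stage work per thread, P is the number of simultaneously
 * active threads, and T_wb is the write-back stage overhead. */
double worst_case_misspeculation(double work_per_thread, int active_threads,
                                 double t_wb)
{
    assert(active_threads > 0);
    assert(work_per_thread >= 0.0 && t_wb >= 0.0);
    return work_per_thread + (double)(active_threads - 1) * t_wb;
}
```

For example, with W = 100, P = 4, and T_wb = 2 (arbitrary time units), the bound is 100 + 3 * 2 = 106, so for coarse-grained threads the cost is dominated by the discarded computation W.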
Related Work
A number of run-time schemes [1, 8, 9, 10, 16] have been developed to exploit medium to coarse-grained loop-level parallelism in programs in which the parallelism cannot be detected at compile-time. In general, these schemes consist of an inspector stage followed by an executor stage. The inspector determines the dependence relations among data accesses across loop iterations that actually exist at run-time. The executor then uses the information provided by the inspector to execute the iterations in parallel in an order that preserves the dependence relations. The different algorithms proposed differ in the types of dependence patterns handled and the required system or architectural support.
One scheme, proposed by Zhu and Yew [16], consists of fully parallel inspector and executor stages. This scheme can handle any type of loop-carried dependence pattern. However, the inspector and the executor are tightly coupled. Hence, the inspector's results cannot be reused even if the dependence relations remain unchanged across different invocations of the loop, such as occurs when an inner loop of a nested loop structure is parallelized. The CYT algorithm [1] improves the previous scheme by completely separating the inspector and the executor. Thus, the CYT scheme allows the inspector's results to be reused across invocations of the loop. It also allows a partial overlap of the execution of dependent iterations.
Rauchwerger and Padua proposed another algorithm [8] for run-time parallelization of DOALL loops only. This algorithm can be applied in two modes. One is the inspector-executor mode where the inspector determines whether the loop is a DOALL loop. If it is, the executor runs the loop iterations in parallel. The other mode is the speculative mode where both the inspector's test and the loop iterations themselves are executed concurrently. If the inspector ultimately determines that the loop is not a DOALL loop, the entire loop is re-executed sequentially. Rauchwerger, Amato and Padua proposed an algorithm [9] that performs run-time parallelization of DOACROSS loops using the inspector-executor technique. This algorithm increases the amount of parallelism in a loop by using array privatization and "reduction parallelization".
Rauchwerger and Padua have proposed a general framework to parallelize do-while loops that do not have any cross-iteration data-dependences [10]. If the recurrence determining the loop termination is an induction or an associative recurrence, this approach transforms the original loop into two DOALL loops: one that evaluates the recurrence and the other that executes the actual loop body. For general recurrences, they proposed several techniques, including calculating recurrences in a pipelined manner, and computing the entire recurrence in each processor's local memory and then assigning to processor i the recurrence value k such that k = i mod nproc. The execution of iterations beyond the last valid iteration is undone by checkpointing the loop before the execution and by maintaining a timestamp (i.e., iteration number) of when a memory location is updated. Once the DOALL loop terminates and the last valid iteration is known, the values of memory locations written during the additional iterations are restored using the checkpointed values.
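The checkpoint-and-timestamp recovery described above can be illustrated with a small C sketch. This is our own simplified illustration, not code from [10]: the type spec_array and the functions spec_begin, spec_write, and spec_undo are hypothetical names, and the sketch assumes that each location is written by at most one iteration, which holds when the loop has no cross-iteration dependences.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    double *data;        /* live array updated by the speculative loop */
    double *checkpoint;  /* caller-provided space for the pre-loop copy */
    long   *stamp;       /* iteration that wrote each element, -1 if none */
    size_t  n;
} spec_array;

/* Checkpoint the array and clear all timestamps before the loop starts. */
void spec_begin(spec_array *s)
{
    memcpy(s->checkpoint, s->data, s->n * sizeof *s->data);
    for (size_t i = 0; i < s->n; i++)
        s->stamp[i] = -1;
}

/* Write performed by iteration `iter`; records the iteration number. */
void spec_write(spec_array *s, size_t idx, double v, long iter)
{
    assert(idx < s->n);
    s->data[idx] = v;
    s->stamp[idx] = iter;
}

/* Restore every location written by an iteration after the last valid one. */
void spec_undo(spec_array *s, long last_valid)
{
    for (size_t i = 0; i < s->n; i++)
        if (s->stamp[i] > last_valid)
            s->data[i] = s->checkpoint[i];
}
```

Note that the checkpoint copies the whole array up front, which is the kind of fixed cost that a scheme without a separate recovery pass avoids.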
A number of hardware-based schemes also have been proposed that use speculative execution to exploit loop-level parallelism on multiprocessor systems [6, 7, 15]. These schemes execute loop iterations in parallel, speculating that data-dependences do not exist. Data-dependence violations are detected in hardware using extensions to the cache coherence protocol. The scheme proposed by Zhang et al. [15] is a hardware extension of the Privatizing DOALL algorithm [8]. This scheme speculatively executes loop iterations as a DOALL loop on a distributed shared-memory multiprocessor. As soon as the hardware detects a dependence violation, the parallel execution is terminated, the program state is restored, and the loop is re-executed sequentially. The speculative execution models in [6, 7] can handle both DOALL and DOACROSS loops. These schemes can also parallelize do-while loops using control speculation. When the hardware detects a dependence violation, speculative threads are squashed and restarted. Scheduling and committing are done in order. Data written by speculative threads are kept in private buffers until the threads become non-speculative. Since these schemes use hardware to detect data-dependences and to recover from misspeculation, the run-time overhead is much lower compared to a software-based approach.
The thread pipelining model described in this paper performs parallelization of coarse-grained loops using run-time dependence analysis and control speculation. There is no separate inspector stage. Furthermore, the dependence checking and execution are overlapped in concurrent threads executing in different pipeline stages. Since our scheme does not require any additional overhead to execute the inspector stage, it is expected to perform better than the inspector-executor algorithms [1, 8, 9, 16]. Our thread pipelining model can execute speculatively, as can Rauchwerger and Padua's algorithm [8] and the hardware-based schemes [6, 7, 15].
This speculation results in better speedup if the speculation is usually correct. However, the speculative scheme in [8] is applicable to DOALL loops only. Unlike the hardware-based schemes [6, 7, 15], our speculative execution model is implemented entirely in software and, thus, does not require any additional support from the hardware.
The most significant difference between our approach and all the other software-based schemes is its broad applicability in parallelizing a variety of loop constructs with run-time data-dependence checking on a variety of data structures. For instance, the inspector-executor based schemes [1, 8, 9, 16] can parallelize for loops with run-time data-dependence on array data structures only. These schemes are not general enough to handle run-time dependence checking of general pointer structures. The run-time dependence checking in our scheme is applicable to any type of data structure, including arrays, linked lists, pointers, and so forth. Furthermore, the parallelization framework is general enough to handle both for and do-while loops as well as loops with complex if-then-else structures.
Conclusion
Compiler-based parallelization techniques are limited in what types of application programs they can parallelize due to the inherent incompleteness of compile-time information. This limitation has led to the development of run-time techniques for parallelizing codes that would otherwise have to be executed sequentially [1, 8, 9, 10, 16]. Most of these existing techniques are applicable only to well-structured array-based application programs, however. They cannot be used to parallelize program constructs with indeterminate termination conditions, such as do-while loops or loops with complex branch instructions.
In this paper, we have presented and evaluated a new parallelization model based on the superthreaded architecture [11, 12] that can exploit coarse-grained loop-level parallelism in shared-memory multiprocessor systems. This parallelization technique should provide better performance than the existing run-time parallelization schemes since it eliminates the overhead of the separate inspector-executor phases. Furthermore, its speculative thread execution with run-time data-dependence checking allows it to parallelize a variety of program constructs that cannot be parallelized with the other existing software-based run-time schemes.
We have evaluated this new thread pipelining model using some real application benchmarks. The test results show that programs with a sufficiently large grain size compared to the thread start-up overhead can obtain significant speedups. We have also evaluated our model using a synthetic benchmark program to demonstrate the performance potential for different degrees of parallelism in the application program and different grain sizes. The results obtained from the evaluation of the synthetic benchmark are useful for predicting the performance of application programs that will be parallelized with our thread pipelining model. We transformed the original sequential codes into the corresponding thread pipelined parallel code in our experiments by manually inserting library calls in the program source code. However, we have discussed how the compiler for the superthreaded architecture [13, 14] can be adapted to automatically generate the parallel code. In addition to being useful for parallelizing sequential programs, the library routines used in the implementation of the coarse-grained thread pipelining model could be quite useful in evaluating the correctness of the code generated by the superthreaded compiler and also in debugging and verifying the simulator for the superthreaded processor [3].
