The recent advent of multithreaded architectures holds many promises: the exploitation of intra-thread locality and the latency tolerance of multithreaded synchronization can result in more efficient processor utilization and higher scalability. The challenge for a code generation scheme is to make effective use of the underlying hardware by generating large threads with a large degree of internal locality without limiting the program level parallelism or increasing latency. Top-down code generation, where threads are created directly from the compiler's intermediate form, is effective at creating relatively large threads. However, having only a limited view of the code at any one time limits the quality of the threads generated. These top-down generated threads can therefore be optimized by global, bottom-up optimization techniques. In this paper, we introduce the Pebbles multithreaded model of computation and analyze a code generation scheme whereby top-down code generation is combined with bottom-up optimizations. We evaluate the effectiveness of this scheme in terms of overall performance and specific thread characteristics such as size, length, instruction level parallelism, number of inputs and synchronization costs.
Introduction
Multithreading has been proposed as a processor execution model for building large scale parallel machines. Multithreaded architectures are based on the execution of threads of sequential code which are asynchronously scheduled, driven by the availability of data. A thread will block or terminate upon issuing a remote memory reference or function call, and the processor then switches to another ready thread. This provides high processor utilization while masking the latencies of remote references and processor communications. In many respects, a multithreaded model can be seen as combining the advantages of both the von Neumann and dataflow models: efficient exploitation of instruction level locality from the former, and latency tolerance and efficient synchronization from the latter.
A challenge lies in generating code that can effectively utilize the resources of multithreaded machines. There is a strong relationship between the design of a multithreaded processor and the code generation strategies, especially since the multithreaded processor allows a wide array of design parameters such as the use of blocking or non-blocking threads, the hardware support for synchronization and matching, the use of register files, code and data caches, multiple functional units, direct feedback loops within a processing element and vector support. In a blocking thread model, a thread will block upon issuing a remote memory reference, a function call, or when performing a synchronization operation. This requires the processor architecture to handle suspension and resumption of threads along with saving and restoring their state. On the other hand, non-blocking threads, once started, run to completion. This requires support in the instruction set and the code generation scheme to generate threads that cannot block; for example, remote memory reads are turned into split-phase operations. Our approach is based on non-blocking threads.
For non-blocking threads, a deciding factor in the effectiveness of a multithreaded machine is the balance between the size of threads that a language and its compiler can provide and the thread length required by the hardware to hide latencies in thread switching and synchronization. Two approaches to thread generation have been proposed: the bottom-up method starts with a fine-grain dataflow graph and then coalesces instruction nodes into clusters (threads), whereas the top-down method generates threads directly from the compiler's intermediate data dependence graph form. The top-down design suffers from working on one section of code at a time, which limits the thread size. On the other hand, the bottom-up approach, with its need to be conservative, suffers from the lack of knowledge of program structures, thereby also limiting the thread size. Our code generation scheme combines the two approaches. Initially, threads are generated top-down and then these threads are optimized via a bottom-up method.
In this paper, we introduce our multithreaded execution model, called Pebbles, and the code generation scheme for this model with particular emphasis on the optimizations of threaded code. We measure various characteristics of the generated code before and after applying certain optimizations. The results indicate relatively large thread size (16.2 instructions) and internal parallelism (3.6). The optimizations achieve a much lower resource requirement (31% less communication traffic) and provide 20-30% run-time performance improvement compared to the top-down only thread generation scheme.
The organization of this paper is as follows. In Section 2, we describe the Pebbles multithreaded execution model. Related work, along with a comparison of Pebbles against other models, is explored in Section 3. In Section 4, we describe the top-down code generation scheme. The bottom-up optimization techniques are described in Section 5. Measurements of the various characteristics of threads and the performance evaluation are presented in Section 6. Concluding remarks follow in Section 7.
The Pebbles Multithreaded Model of Computation
In this section we describe the Pebbles multithreaded execution model and relevant architectural issues including the instruction set architecture.
Execution Model
Pebbles¹ is a multithreaded execution model based on dynamic dataflow scheduling where each actor, or node in the dataflow graph, represents a sequentially executing thread. A thread is a statically determined sequence of RISC-style instructions operating on registers. Threads are dynamically scheduled to execute upon the availability of data. Once a thread starts executing, it runs to completion without blocking and with a bounded execution time. By bounded execution time we mean that each instruction in a thread must have a fixed execution time; otherwise the thread would have to block. Instruction level locality is exploited within a thread. Register values do not live across threads.
Inputs to a thread comprise all the data values required to execute the thread to its completion. A thread is enabled to execute only when all the inputs to the thread are available. Multiple instances of a thread can be enabled at the same time and are distinguished from each other by a unique "color." The thread enabling condition is detected by the matching/synchronization mechanism, which matches inputs to a particular instance of a thread. Data values are carried by tokens. Each token consists of a tag, an input port number to the thread, and a data value. Data structures, such as arrays and records, are stored in a logically shared structure store. Results of thread execution are either written to the structure store or sent directly to their destination thread(s).

A Pebbles abstract machine consists of one or more processing nodes connected by a general, high speed interconnection network. The abstract logical structure of the model is represented in Figure 1. The local memory of each node consists of an Instruction Memory, which is read by the Execution Unit, and a Data Memory (or Frame Store), which is mainly accessed by the Synchronization Unit. The Ready Queue contains the continuations representing those threads that are ready to execute. A continuation consists of a pointer to the first instruction of the enabled thread and the context specifier that points to the data needed by the thread. Different contexts of the same thread may be enabled at any given time, either on the same node or on different nodes. The global, logically shared memory (Structure Memory) for data structure storage may be distributed either among the nodes or among dedicated memory modules arranged in a dancehall configuration. The MemUnit handles the structure memory requests.

¹ The name symbolizes the thread size: neither sand nor rocks.
For each instance of a thread, a fixed size storage area (called a framelet) is allocated in the Frame Store to hold the incoming inputs to that thread. When the first input of a thread activation (i.e. instance) arrives, the Synchronization Unit will allocate a framelet and set the count of the total number of inputs in the framelet. Each input token is stored in an appropriate slot within the framelet and the counter is decremented. When the count reaches zero, the thread is enabled to execute by making an entry in the Ready Queue. After the thread executes, the framelet is deallocated.
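The framelet protocol described above can be sketched in a few lines. This is an illustration only, not the Pebbles hardware design: the names `Token`, `SyncUnit`, `deliver`, and `run_next` are hypothetical, and the tag is modeled simply as a (thread, color) pair.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    thread: str    # destination thread (hypothetical: identified by name)
    color: int     # distinguishes instances of the same thread
    port: int      # input port number within the thread
    value: object  # the data value carried by the token


class SyncUnit:
    """Hypothetical model of framelet-based input matching."""

    def __init__(self, input_counts):
        self.input_counts = input_counts  # thread name -> number of inputs
        self.frame_store = {}             # (thread, color) -> [slots, remaining]
        self.ready_queue = deque()        # enabled activations, as (thread, color)

    def deliver(self, tok):
        key = (tok.thread, tok.color)
        if key not in self.frame_store:
            # First input of this activation: allocate a framelet and set its count.
            n = self.input_counts[tok.thread]
            self.frame_store[key] = [[None] * n, n]
        framelet = self.frame_store[key]
        framelet[0][tok.port] = tok.value  # store the input in its slot
        framelet[1] -= 1                   # decrement the counter
        if framelet[1] == 0:
            self.ready_queue.append(key)   # all inputs present: thread is enabled

    def run_next(self, bodies):
        """Execute the next ready activation, then deallocate its framelet."""
        key = self.ready_queue.popleft()
        slots, _ = self.frame_store[key]
        result = bodies[key[0]](*slots)
        del self.frame_store[key]
        return result
```

Two activations of the same thread with different colors occupy separate framelets, so their inputs can arrive interleaved without interfering with each other.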
MIDC Instruction Set
Pebbles programs are represented in a form of dataflow graphs called MIDC (Machine Independent Dataflow Code). Each node of the graph represents a thread of machine independent instructions. Edges represent data paths through which tokens travel. In addition to the nodes and edges, there are pragmas and other specifiers to encode information (e.g. program-level constructs) that may be helpful to the post-processors and program loaders. The MIDC syntax definitions are presented in Table I.
An MIDC program consists of a number of function definitions, one of which is called main and communicates with the outside world. A function consists of a header and a body. The function header consists of a Function Input Interface and a Function Output Interface, collectively referred to as the Function Interface. The connection between the function header and its body is shown in Figure 2. Call parameters are passed to the function input interface, which sends them to the function body. Return contexts are sent from the caller to the function output interface; unique contexts are provided for every function activation. The function output interface matches every context to the corresponding result value and creates dynamic arcs back to the caller.
The body of each function consists of a number of nodes. For each node, the node header provides a node-label, the number of registers used, the number of input ports, and the destinations of all the outputs. The node header is followed by a stream of instructions.

Arrays are represented by a descriptor of four fields, viz., a data pointer to the start of the array, the array lower bound, the shift value, and the array size. The shift value is used to distinguish among several concatenated logical arrays that are "built-in-place." Records are represented by a pointer to the first field of the record, shown in Figure 3-b. Unions are represented by a pointer to a tag followed by a value field (see Figure 3-c). Union elements may be either basic types, records, or arrays.
Related Work
There exists a growing number of hardware and software projects that are based on either blocking or non-blocking multithreaded execution. In blocking thread models, such as HEP [35], ... TAM [7, 32, 39] is a software model of multithreaded execution. It provides a compiler-based approach to latency tolerance and synchronization for the purpose of executing non-strict functional languages, such as Id, on conventional architectures. Thread sizes are on the order of 5 instructions per thread, and so efficient context switching is necessary, which is achieved by bypassing the operating system process scheduler and directly manipulating the program counter and stack pointer. In addition, TAM is based on the code-block model where each code-block represents a semantically distinguishable code segment such as a non-nested loop or function body [17]. A storage segment, called a frame, is allocated for each code-block instance or set of instances (as in k-bounded loops [8]). All the data values pertinent to a given code-block are stored in the corresponding frame; these include synchronization slots, temporary storage area, loop constants, etc.
The Tera [1] provides hardware support for thread switching, replicated processor states, and split-phase transactions, with the goal of masking long latency operations with the efficient execution and switching of threads. A pipelined processor interleaves threads. Through this interleaving, the system can guarantee that the results of arithmetic and conditional operations are available to the next instruction in the thread. The J-Machine [9] provides hardware support for inter-processor communication. The hardware supports several programming models, including data-parallel and dataflow.
Each iteration of a parallel loop instance in Pebbles can be concurrently active on different processors. All computation in Pebbles is strict, and therefore Pebbles does not require I-structure memory (or software emulation of it) as in TAM, Hybrid, Tera, or Monsoon. Although this reduces the potential exploitable parallelism, the hardware requirements are simpler and more generally accepted. Since each thread is relatively small (10 to 30 MIDC instructions), global (dynamic) scheduling and near perfect load balancing are achieved by a simple hashing of the tag (color and the thread pointer) in each token to specific nodes.
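As an illustration of how hashing a tag can spread thread activations over nodes, consider the sketch below. The hash function is an assumption made for this example (the text does not specify one), and `home_node` is a hypothetical name.

```python
def home_node(color, thread_ptr, num_nodes):
    """Map a tag (color, thread pointer) to a processing node.

    The multiplier 31 is an arbitrary choice for this sketch, not a value
    taken from the Pebbles design.
    """
    return (color * 31 + thread_ptr) % num_nodes


# Count how many of 1000 activations of one thread land on each of 4 nodes.
counts = [0] * 4
for color in range(1000):
    counts[home_node(color, thread_ptr=5, num_nodes=4)] += 1
```

Because every token of an activation carries the same tag, all inputs of that activation are routed to the same node, where they can be matched locally.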
Several researchers have worked on developing top-down approaches to generating threads from functional languages [31, 2, 12]. In most of these instances, threads block for remote memory access and are targeted for conventional distributed memory multi-processors. A non-strict macro-dataflow model is described ...

The direct bottom-up code generation is equivalent to a graph partitioning problem. Iannucci [16] describes the dependence sets method of graph partitioning as well as the conditions that must be satisfied for a correct and efficient partitioning of a dataflow graph into medium grains. The dependence sets algorithm is also used with several optimizations [32] to generate code for the Threaded Abstract Machine (TAM) [7] from Id90 [22]. Traub [37] describes an algorithm for generating sequential threads of instructions in a dataflow program. Larger threads are created through a combination of dependence and demand sets algorithms along with global analysis [36]. An execution model based on strongly connected blocks is described in [30] for the EM-4. Nodes that are strongly connected are executed sequentially on a single processor. Normal dataflow execution rules are used between strongly connected blocks.
In general, pure top-down schemes create larger threads than pure bottom-up code generation schemes [4].
Multithreaded Code Generation
In this section, we discuss the philosophy and details behind our top down code generation scheme.
Code Generation Philosophy
The goal for most code generation schemes for non-blocking threads, including ours, is to generate as large a thread as possible [16], on the premise that the thread is not going to be too large, due to several constraints imposed by the execution model. The construction of our threads is guided by the following objectives:
1. Minimize synchronization overhead.
2. Maximize intra-thread locality.
3. Assure non-blocking (and deadlock-free) threads.
4. Preserve functional and loop parallelism.
The first two objectives call for very large threads that maximize the locality within a thread and decrease the synchronization overhead. The thread size, however, is limited by the last two objectives. Due to the third objective, a thread will typically be much smaller than a loop or function body, as memory reads must be turned into split-phase transactions. In addition, even when the non-blocking and parallelism objectives are satisfied, blind efforts to increase the thread size can result in a decrease in overall performance [27]. Larger threads tend to have a larger number of inputs, which can result in a larger input latency. The resulting MIDC code should exploit many forms of parallelism, including parallel loops and functions.
Top Down Code Generation
In this section a compiler that generates MIDC code from Sisal programs is described. Figure 4 shows the various phases of the compilation process. Sisal [18] is a pure, first order, functional programming language with loops and arrays. Sisal programs are initially compiled into a functional, block-structured, acyclic, data dependence graph form IF1 [34], which closely follows the source code. The functional semantics of IF1 prohibits the expression of copy-avoiding optimizations. The frontend generates very simplistic IF1 graphs. Function inlining and other optimizations ...

IF2 [41], an extension of IF1, allows operations that explicitly allocate and manipulate memory in a machine independent way through the use of buffers. A buffer comprises a buffer pointer into a contiguous block of memory and an element descriptor that defines the constituent type. All scalar values are operated on by value and are therefore copied to wherever they are needed. On the other hand, all of the fanout edges of a structured type are assumed to reference the same buffer. IF2 edges are decorated with pragmas to indicate when an operation such as "update-in-place" can be done safely, which dramatically improves the run time performance of the system.

The top-down cluster generation process then transforms IF2 into MIDC. The first step in the IF2 to MIDC translation process is the node rewriting phase, which removes any hidden latencies represented in a single node. For instance, the AElement node in IF2 represents reading a value from an array, with an array descriptor and an index value being the inputs to this node. The data representation in MIDC represents the array as two different structures, the array descriptor and the array data. In order to read an array element, the array descriptor is first read to find the start address of the data elements and then the actual data value is read. This represents two different latencies in the operation.
In order to make the latencies explicit, the AElement node is rewritten as two distinct nodes, a non-standard IF2 node AReadDesc and the AElement node. All nodes that contain more than one latency are rewritten to make the latencies explicit.
At the end of the IF2 compilation phase, most but not all of the memory copy operations are eliminated. Fortunately, IF2 edges have pragmas that indicate whether or not the data is built in place. The relevant edges have to be checked to see whether the data is built in place. If the array is not built in place, code has to be introduced to perform the necessary copy operations. It is easiest to add the copy operations at this time rather than during code generation. When the copy code is introduced, the copy operations are recognized as being independent of each other.
In this process of rewriting different nodes, redundant nodes will be generated. The use of classical compiler optimizations, such as local common subexpression elimination, removes these redundant nodes. For instance, if there exist two AElement nodes that read elements from the same array, they will independently be rewritten as two AReadDesc and AElement nodes during the rewriting phase. Since the array being read is the same, the two AReadDesc nodes can be merged into one. In a similar vein, if the newly introduced nodes are within loop bodies and are identified as loop invariant, they can be easily moved out of the loop body. Thus, this code motion should improve the run-time performance.
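The rewriting step and the subsequent merging of duplicate descriptor reads can be sketched on a toy node list. The tuple encoding and the function names below are hypothetical and are not the IF2 representation; only the node names AElement and AReadDesc come from the text. The sketch assumes all operations are side-effect free, which is what makes the merge safe.

```python
def rewrite_aelement(nodes):
    """Split each AElement(array, index) into AReadDesc followed by AElement.

    nodes: list of (op, operand, ...) tuples.
    Returns a list of (result_name, op, operands) triples.
    """
    out = []
    for i, (op, *args) in enumerate(nodes):
        if op == "AElement":
            array, index = args
            desc = f"desc{i}"  # fresh name for the descriptor value
            out.append((desc, "AReadDesc", (array,)))
            out.append((f"v{i}", "AElement", (desc, index)))
        else:
            out.append((f"v{i}", op, tuple(args)))
    return out


def local_cse(nodes):
    """Merge nodes that apply the same operation to the same operands."""
    seen, rename, out = {}, {}, []
    for name, op, args in nodes:
        args = tuple(rename.get(a, a) for a in args)  # follow earlier merges
        key = (op, args)
        if key in seen:
            rename[name] = seen[key]  # reuse the earlier result
        else:
            seen[key] = name
            out.append((name, op, args))
    return out


rewritten = rewrite_aelement([("AElement", "A", "i"), ("AElement", "A", "j")])
optimized = local_cse(rewritten)
```

In this example the two reads of array A produce four nodes after rewriting and three after merging, since both element reads can share a single descriptor read.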
When parallel loops are considered, array gather operations can be performed within the loop body, thereby reducing the amount of code necessary to perform the synchronization and thus reducing the amount of sequential code. In this case, the array gather operation is therefore moved from the returns code to the loop body.
The second step in the IF2 to MIDC translation process is the graph analysis and partitioning phase. This phase breaks up the complex IF2 graphs so that threads can be generated. Initial values for reduction operators are generated in the appropriate threads. Threads terminate at control graph interfaces for loops and conditionals, and at nodes for which the execution time is not statically determinable, such as function calls and memory (arrays and structures) accesses. These nodes are called terminal nodes. More accurately, termination occurs at the use of the values returned from the function calls and memory reads. Therefore, multiple calls and memory reads may be initiated in a thread. Structure store reads are turned into split-phase reads, with the initiator and consumer residing in different threads. In the case of function calls and memory reads, the above step ensures that threads can execute deterministically, in keeping with our objectives. Terminal nodes are identified and the IF2 graphs are partitioned along this seam.
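The rule that a thread ends at the use of a split-phase result, rather than at its initiation, can be illustrated with a simplified partitioner. This sketch is not the compiler's algorithm: it assigns each instruction a thread level only, ignores control graph interfaces, and uses a hypothetical (result, op, operands) encoding with "read" and "call" standing in for terminal nodes.

```python
TERMINAL_OPS = {"read", "call"}  # ops whose completion time is not statically known


def partition(instrs):
    """Assign each instruction to a thread level.

    instrs: list of (result, op, operands) in dependence order.
    Returns {result: thread index}. A consumer of a terminal node's value is
    placed in a later thread than the node that initiated the operation.
    """
    producer = {res: op for res, op, _ in instrs}
    thread_of = {}
    for res, op, operands in instrs:
        t = 0
        for x in operands:
            if x not in thread_of:
                continue  # external input to the graph
            bump = 1 if producer[x] in TERMINAL_OPS else 0
            t = max(t, thread_of[x] + bump)
        thread_of[res] = t
    return thread_of


threads = partition([
    ("d", "read", ("A",)),      # split-phase read initiated in thread 0
    ("e", "read", ("B",)),      # a second read initiated in the same thread
    ("k", "add", ("i", "one")),
    ("x", "add", ("d", "e")),   # uses the read results: must start a new thread
    ("y", "mul", ("x", "k")),
])
```

Both reads are initiated in the same thread, and only their consumer is pushed into the next one, which matches the observation that multiple calls and memory reads may be initiated in a thread.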
The code motion optimization described above helps reduce the number of threads generated and executed. Consider the case of reading an array element A[i] in the body of a for loop with loop variable i. As described above, the array access requires reading the descriptor followed by reading the array element. In an unoptimized loop body, reading the descriptor would give rise to an additional memory latency, causing an extra thread. Code motion pulls the access of the descriptor out of the loop, which reduces the number of threads in the loop body. The read of the descriptor can be merged with the rest of the loop initialization thread.
Bottom-Up Multithreaded Code Optimization
MIDC represents low level intermediate code from which machine code for a target machine can be relatively easily generated. It also contains several pieces of data, including structural level information, represented via pragmas that are useful to the target machine code generator or postprocessors. One postprocessor that we designed performs machine independent optimizations. Even though an impressive set of optimizations is performed at the IF1 and IF2 levels [5], including function in-lining, loop transformations, CSE, and update-in-place, and the MIDC compiler also performs some optimizations, thread generation creates opportunities for further optimizations that were not visible before. We have applied these optimizations at both the intra-thread and inter-thread levels.
Local Optimizations. These are the traditional compiler optimizations whose main purpose is to reduce the number of instructions within a thread.
Local dead code elimination eliminates instructions whose results are not used.
Constant folding/copy propagation removes unnecessary data movement.
Redundant instruction elimination removes unnecessary duplication of work.
The need for local optimizations arises when threads are first generated (there are some dead code and copy propagation opportunities that arise after the initial thread generation), and after global optimizations are performed (e.g. after merging).
⁶ There is the possibility that the loop body is too small, making it expensive to exploit all possible parallelism. The application of two different techniques, slicing and chunking, helps reduce this problem. Slicing is a method by which parallelism is constrained by spreading a variable amount of work among a fixed number of workers. Chunking, on the other hand, spawns fixed amounts of work over a variable number of workers. In both techniques, the slice or chunk of work is performed sequentially. The analysis of these techniques is beyond the scope of this paper.
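As an illustration of the local dead code elimination step listed above, the sketch below walks a thread's instructions backwards and keeps only those whose results reach an output. The (result, op, operands) encoding and the function name are hypothetical, and instructions are assumed to have no side effects other than producing their result.

```python
def local_dce(instrs, outputs):
    """Remove instructions whose results are never used within the thread.

    instrs: list of (result, op, operands) in execution order.
    outputs: names of the values the thread sends out.
    Returns the live instructions, in their original order.
    """
    live = set(outputs)
    kept = []
    for res, op, operands in reversed(instrs):
        if res in live:
            kept.append((res, op, operands))
            live.update(operands)  # operands of a live instruction are live too
    kept.reverse()
    return kept


thread = [
    ("a", "add", ("x", "y")),
    ("b", "mul", ("x", "x")),  # result never used: dead
    ("c", "sub", ("a", "y")),
]
live_thread = local_dce(thread, outputs=["c"])
```

A single backward pass suffices here because instructions are in execution order, so every use of a value appears after its definition.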
Global Optimizations. The objectives of these optimizations are threefold:
1. Reduce the amount of data communication among threads.
2. Increase the thread size without limiting inter-thread parallelism.
3. Reduce the total number of threads and possibly the critical path lengths of programs.
These optimizations occur at the inter-thread level across the entire program graph. They consist of:
Global copy propagation/constant folding, which reduces unnecessary token traffic by bypassing intermediate threads. For example, if Thread A sends a value to Thread B which in turn passes it on to Thread C without using it, then the code is rewired so that Thread A sends the value directly to Thread C. Also, if Thread A sends a constant value to Thread B, then the code is changed so that Thread B uses the constant value directly without having Thread A send the value. This optimization should also reduce the critical path length of programs.
Merging attempts to form larger threads by combining two neighboring ones while preserving the semantics of strict execution. The merge up/merge down phases have been described in more detail elsewhere [32, 27]. In order to ensure that functional parallelism is preserved and bounded thread execution time is retained, merging is not performed across remote memory access operations, function call interfaces, or parallel loop bodies. However, merging is allowed to take place across branch boundaries. Figure 5 shows the two merge operations. These merging operations eliminate the tokens passed between the merged threads and the corresponding output instructions, and therefore reduce the synchronization cost.
Redundant arc elimination, for which opportunities typically arise after merging or copy propagation, when arcs carrying the same data can be eliminated. For example, if Thread A sends the same value to Thread B and Thread C, and Threads B and C are later merged, the value is sent twice to the merged thread and one of the two arcs can be removed.
Global dead code and dead edge elimination are typically applied after other global optimizations, which can leave arcs or even entire threads unused. Since these optimizations are inter-procedural, they can trace through all calls of a given function and eliminate unused arguments and return values and the related computations. For example, if no caller of a given function uses one of its results, then that unused result, along with any predecessor which is not itself used in deriving the other results, is eliminated. This process can extend past the callers' arguments and beyond.
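Two of the rewrites in this list, bypassing a forwarding thread and merging two threads, can be sketched on a thread graph held as a set of arcs. This is an illustration only: the (source, destination, value) encoding and the function names are hypothetical, each value is assumed to have a single producer, and the safety checks that decide whether a merge is legal are omitted.

```python
def bypass_forwarders(arcs, uses):
    """Rewire A -> B -> C into A -> C when B only forwards the value.

    arcs: set of (src, dst, value) triples.
    uses: set of (thread, value) pairs where the thread computes with the value.
    """
    arcs = set(arcs)
    changed = True
    while changed:
        changed = False
        for (a, b, v) in list(arcs):
            if (b, v) in uses:
                continue  # b really consumes v, so the arc must stay
            onward = [(s, d, w) for (s, d, w) in arcs if s == b and w == v]
            if onward:
                arcs.discard((a, b, v))
                for (_, c, _) in onward:
                    arcs.discard((b, c, v))
                    arcs.add((a, c, v))  # send directly from the producer
                changed = True
                break  # the arc set changed, so restart the scan
    return arcs


def merge_threads(arcs, b, c):
    """Merge threads b and c into one thread named 'b+c'."""
    merged = f"{b}+{c}"

    def rename(t):
        return merged if t in (b, c) else t

    out = set()
    for src, dst, value in arcs:
        src, dst = rename(src), rename(dst)
        if src == dst:
            continue  # token between the merged threads disappears
        out.add((src, dst, value))  # set semantics drop duplicate arcs
    return out


bypassed = bypass_forwarders(
    {("A", "B", "v"), ("B", "C", "v"), ("A", "B", "w")},
    uses={("B", "w"), ("C", "v")},
)
merged = merge_threads(
    {("A", "B", "v"), ("A", "C", "v"), ("B", "C", "t"), ("C", "D", "r")},
    "B", "C",
)
```

In the merge example, the two arcs that carried v from A to B and to C collapse into one arc to the merged thread, which is the redundant arc elimination case described in the text.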
These optimizations are performed bottom-up, applied to the entire program graph. The optimizer reads in the MIDC code and first builds a dataflow graph of threads. Each thread is then decoded into a dataflow graph of instructions. Hence, a two-level dataflow graph is built. At the same time, by decoding the pragmas, it builds a second, dual graph whose block structure corresponds to the program source structure with its constituent threads and instructions.
At this point we have the complete dataflow graph that a bottom-up compiler would generate but have retained the additional top-down structural information. The structural information graph is used to help determine when certain types of optimizations, such as merging, are safe or even advisable to do. When the optimized code is generated, each thread is generated from the mini-dataflow graphs. The global view thereby enables the various optimization steps to take place and allows "boundaries" to be redrawn so that threads are easily remolded. In addition, when generating threads, instruction scheduling is performed to exploit the instruction level parallelism. This is accomplished by ranking instructions according to dependencies such that dependent instructions are as far from each other as possible. The data and control flow representations allow this to be accomplished relatively easily. The optimization steps also make the following changes to the syntax:
The top down generator does not allow branches within a thread. Conditional outputs (OUTC) are used as gateways to branches. Since merge operations could introduce branches in the middle of a thread, the bottom-up optimizer introduces an if-then-else construct. In the process, conditional outputs are eliminated.
The code generator uses several operators that output results (e.g. PARAM, OUT, SAN, DST) that add descriptive power. These operators can either explicitly specify a color value or implicitly assume the incoming color. The operators have served their purpose at this point, and they are all replaced by a single OUT instruction in which the color value is given explicitly.
A simple Sisal source and its compiled, optimized MIDC code are shown in Figure 6. The code is taken from Purdue Benchmark 4, which inverts each nonzero element of an array and sums them. The top right of Figure 6 shows the thread level graph descriptions. Nodes 1 and 2 are the main function input and output interfaces, respectively. Node 3 reads the structure pointer and size information. Node 4 is the loop initializer/generator. Nodes 5, 6, and 7 comprise the loop body, with Node 5 handling the reduction.
Evaluation
In this section we evaluate the dynamic properties of our top-down code before and after applying various bottom-up optimizations. We have obtained these results by running codes on a multithreaded machine simulator. The following set of dynamic parameters are measured to evaluate the intra-thread characteristics:
S1: measures the average number of MIDC instructions executed in a thread.
[Figure 6: A Sisal program and its compiled, optimized MIDC code. In the thread-level graph, a label n[m] denotes thread number n with m instructions; the labels shown include 4[16], 6[4], 7[3], and 5[7].]
We also measure the following inter-thread, or program-wide, parameters:
Matches: measures the total number of matches performed.
Instructions: measures the total number of instructions executed.
Threads: measures the total number of threads executed.
CPL: measures the critical path length of programs in terms of threads.
The values of these inter-thread parameters are presented in normalized form with respect to the top-down generated code. In our evaluation we use the following optimization levels:
"No optimization" (NO) is the code generated by the MIDC code generator.
"Local optimization" (LO) performs the intra-thread only optimizations.
"Full optimization" (FO) performs both intra-thread and global optimizations.
Benchmarks
We use a suite of benchmark programs, all written in Sisal. The PURDUE benchmark [26] contains 16 programs used to benchmark parallel computer systems. The Lawrence Livermore LOOPS [19] consist of 24 loop kernels used in scientific programs. AMR is an unsplit integrator taken from an adaptive mesh refinement code at Livermore. BMK11A is a particle transport code. RICARD is a production code simulating elution patterns of proteins and ligands in a column of gel. HILBERT computes the condition number for Hilbert matrix coefficients using Linpack routines. LIFE is the game of life. SGA is a genetic algorithm that finds a minimum of a bowl-shaped function. SIMPLE is a Lagrangian 2-D hydrodynamics code that simulates the behavior of fluid in a sphere. The size of the nontrivial benchmarks ranges from 300 to over 2000 lines of Sisal code. LIFE and SGA form the non-scientific codes. The figures for the Purdue and Livermore Loops are each given as a weighted average of the values of their constituent programs.

The amount of parallelism available in each benchmark can be gleaned from Table II⁷. It shows the total number of loops and the number of parallel forall loops in the generated code⁸. It should be noted that not all the parallel loops are vectorizable. The figures show that a substantial fraction of loops are forall loops. This pattern is what we normally expect from well-written Sisal programs. In other words, it is expected that the programmer will use the parallel loop construct whenever possible and revert to the sequential loop construct if and only if the parallel loop construct cannot be utilized. The last column in Table II shows the number of threads executed per run in the fully optimized case.

Table III shows the intra-thread characteristics. There is a steady reduction in S1 as we go from no optimization to local optimizations. In going from local to full optimizations, there are noticeable differences in each benchmark, but no clear pattern. On the whole, the average thread sizes are similar. There are two opposing forces at work: the merge operations tend to increase the thread size while other global optimizations tend to reduce the total number of instructions executed.

⁷ In [33] we have reported that the number of vectorizable loops is approximately 40% of the number of parallel loops. The benchmarks in that paper are a subset of those in this paper.
⁸ The number of parallel loops in the generated code will differ from the Sisal source because additional parallel loops are inserted in the code to perform copying when needed, and because some loops are fused in the process of optimizing the intermediate code.
Intra-thread Results
The weighted distribution of S1 across all benchmarks is shown in Figure 7 for S1 ≤ 20.
It shows that nearly 30% of all threads executed have 5 instructions. Threads with S1 > 20 account for more than 10% of all threads.
Thread Parallelism. The figures for thread parallelism show no clear trend when going from no optimization to local optimization. S1 decreases in going from no optimization to local optimizations, so this implies that the critical path length (S∞) of each thread is, in general, reduced as well.
However, when going from local to full optimization, parallelism decreases. In this case, the merging operations play the bigger role in determining the average parallelism: they tend to increase S∞.
In fully optimized cases, the average parallelism ranges from about 2.4 to 5.5 with an average of 3.6. The distribution of parallelism is shown in Figure 8. It shows that 75% of all threads have a parallelism of less than 3, with about 6% having more than 6. The intra-thread parallelism is large enough to justify a superscalar processor implementation. A four-way issue processor, such as the PowerPC, should be sufficient for almost 90% of all threads executed.
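As an illustration of how such a parallelism figure can be obtained, the following minimal sketch computes the size, critical path length and their ratio for a single thread. It assumes that intra-thread parallelism is the ratio of thread size to critical path length and that every instruction has unit latency; the function name `thread_parallelism` and the six-instruction example thread are hypothetical and are not taken from the benchmarks.

```python
from functools import lru_cache


def thread_parallelism(deps):
    """Return (size, critical_path, parallelism) for one thread.

    `deps` maps each instruction name to the tuple of instructions it
    depends on. Unit latency per instruction is an assumption.
    """

    @lru_cache(maxsize=None)
    def depth(instr):
        # Length of the longest dependence chain ending at `instr`.
        return 1 + max((depth(p) for p in deps[instr]), default=0)

    size = len(deps)
    critical_path = max(depth(i) for i in deps)
    return size, critical_path, size / critical_path


# Hypothetical thread: three independent loads feeding a short chain.
example = {
    "load_a": (),
    "load_b": (),
    "load_c": (),
    "add": ("load_a", "load_b"),
    "mul": ("add", "load_c"),
    "store": ("mul",),
}

size, critical_path, parallelism = thread_parallelism(example)
print(size, critical_path, parallelism)
```

Under this definition, merging two threads whose instructions form a longer dependence chain raises the critical path length faster than the size, which lowers the ratio.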
Thread Inputs. For local-only optimizations, the number of inputs per thread does not change.
However, there is a significant reduction after global optimizations are performed: the average goes from 8.9 to 7.6 inputs, a drop of 17%. The average number of inputs varies widely between different benchmarks: HILBERT has about 3.8 inputs versus 15.3 in BMK11A. The distribution of the number of inputs per thread is shown in Figure 9. More than 40% of the threads executed have between 2 and 4 inputs. Threads with seven or fewer inputs account for more than 70% of all executed threads across all benchmarks. However, a significant 16% of threads have ten or more inputs.
Matches Per Instruction. For a given optimization level, the matches per instruction (MPI) are confined to a relatively narrow range for all benchmarks except RICARD. The MPI goes up when local optimizations are applied, since the number of instructions executed goes down while the number of inputs remains the same. From no optimization to full optimization, the MPI goes down by 9 to 21%, with an average decrease of 11%. This implies that the reduction in the number of matches is greater than the reduction in the number of instructions, resulting in a smaller MPI for the fully optimized code. Among the benchmarks, RICARD's MPI is only 0.33 versus SIMPLE's 0.54.
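The direction of these changes can be seen from a small calculation. The sketch below assumes that each thread input costs one match, so that MPI is the number of inputs divided by the number of instructions; the per-thread counts for the three optimization levels are hypothetical, while the last line uses the average inputs (7.6) and thread size (16.2) reported in this paper for the optimized code.

```python
def matches_per_instruction(matches, instructions):
    """Synchronization cost as matches per executed instruction."""
    return matches / instructions


# Hypothetical thread at three optimization levels (illustrative counts).
mpi_none = matches_per_instruction(matches=9, instructions=20)
# Local optimizations remove instructions but leave the inputs unchanged.
mpi_local = matches_per_instruction(matches=9, instructions=18)
# Global optimizations remove matches faster than instructions.
mpi_full = matches_per_instruction(matches=7, instructions=16)

# Reported averages for the optimized code: 7.6 inputs, 16.2 instructions.
mpi_reported = matches_per_instruction(7.6, 16.2)

print(mpi_none, mpi_local, mpi_full, round(mpi_reported, 2))
```

With these numbers the MPI rises from 0.45 to 0.5 under local optimizations and falls below its starting value once the inputs are reduced as well.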
Program-wide Characteristics
In Table IV, the program-wide characteristics of the benchmark programs are given. The values shown are normalized with respect to the unoptimized code.
Matches. As would be expected, the number of matches remains unchanged when only local optimizations are applied. When global optimizations are also applied, there is a significant reduction in the number of matches performed, typically about 31%. The reductions are remarkably similar for all the benchmarks except LIFE, which achieves only a 17% reduction.
Instructions. There is on the average a 4% reduction in the total number of instructions executed after local optimizations are performed. After applying full optimizations, the average reduction is about 23%. The reductions are most noticeable in SGA and smallest in LIFE.
Threads. The number of threads executed remains unchanged for locally optimized code. When full optimizations are applied, there is a 17% reduction in the number of threads on average. However, there are significant differences among the benchmarks, with RICARD and HILBERT having almost no reduction whereas SGA has a reduction of more than 35%.
Critical Path Length. The characteristics of critical path lengths of programs are very similar to the number of threads executed. The critical path lengths are measured in terms of the number of threads. The path lengths are reduced by about 18% on the average after the global optimizations.
While the CPL is reduced by 18%, the total number of threads is reduced by 17%: this implies that, on average, the program parallelism at the inter-thread level remains relatively the same.
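The claim follows from a one-line ratio. In the sketch below, inter-thread parallelism is taken to be the total number of threads divided by the critical path length in threads, which is an assumption about how the quantity is defined; both values are normalized to the unoptimized code using the average reductions quoted above.

```python
# Average reductions after full optimization, as fractions.
thread_reduction = 0.17
cpl_reduction = 0.18

# Normalized to the unoptimized code (threads = 1.0, CPL = 1.0).
threads_after = 1.0 - thread_reduction
cpl_after = 1.0 - cpl_reduction

# Assumed definition: inter-thread parallelism = threads / CPL.
relative_parallelism = threads_after / cpl_after
print(f"{relative_parallelism:.3f}")
```

The ratio is within about 1% of the unoptimized value, which is what "relatively the same" amounts to here.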
Run-time Performance
We assume that all structure store reads and writes are remote, and hence we do not consider the effect of data partitioning.
The following processor architectural configuration is used to obtain timing measurements: a 4-way issue superscalar CPU with the instruction latencies of the Motorola 88110 and an output network bandwidth of two tokens per cycle. The synchronization latencies are those of the EM-4: a pipelined synchronization unit with a throughput of one synchronization per cycle and a latency of three cycles on the first input. All inter-node communications take 50 CPU cycles in network transit time. Every structure memory read takes a minimum of two network transits (one to send the request and another to send the reply). Also, the size of the matching store is unlimited, and it can therefore handle any amount of parallelism. We obtained results for 1, 10, and an infinite number of processors. In each case, the run-time performance of both the locally and the fully optimized code is compared with that of the unoptimized code.
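For reference, the parameters above can be collected in a small configuration record. This is only a sketch: the class name `SimConfig` and its field names are hypothetical and do not correspond to the simulator's actual interface; the values are the ones stated in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimConfig:
    """Hypothetical container for the simulated machine parameters."""

    issue_width: int = 4                # 4-way issue superscalar CPU
    network_tokens_per_cycle: int = 2   # output network bandwidth
    sync_per_cycle: int = 1             # synchronization unit throughput
    sync_first_input_latency: int = 3   # cycles, on the first input
    network_transit_cycles: int = 50    # per inter-node communication

    def min_structure_read_cycles(self) -> int:
        # One transit for the request and one for the reply.
        return 2 * self.network_transit_cycles


config = SimConfig()
print(config.min_structure_read_cycles())
```

With a 50-cycle transit time, every structure memory read therefore costs at least 100 cycles, which is the latency the thread scheduler has to hide.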
As a side note, the n-processor speedup over 1-processor threaded code execution in our model is determined solely by the parallelism of the code: any thread can execute on any processor without additional overhead. For completeness, Table V shows the speedups for the unoptimized case. Because the reductions are currently implemented serially and because simulation limits the problem sizes (and thereby the parallelism) that can be run in reasonable time, the speedups shown are not large. However, as will be shown later, the effect of the optimizations does not depend significantly on the number of processors. We do not consider this kind of speedup further.
Table VI shows the percentage speedup over the unoptimized code. The improvement in speed is significant, with almost all of it occurring when global optimizations are applied; with local optimizations only, the speedup is almost insignificant. In general, with full optimizations, the speedup ranges from a low of 5% in LIFE up to 75% in BMK11A. In the case of LIFE, the reduction in the number of matches is much smaller than in the other benchmarks (as shown in Table IV) and is a limiting factor with a smaller number of processors. With a larger number of processors, the limitation is removed and the speedup increases to 25%. With a small number of processors, there is plenty of work for each processor to do, and therefore the speedup comes mainly from the reduction in the amount of work to be done. With a larger number of processors, the CPL plays the limiting role; therefore, the reduction in CPL gives rise to an even greater speedup. The limiting role of the CPL is especially apparent in RICARD, which has almost no reduction in the CPL when full optimizations are applied: here, the speedup drops with an infinite number of processors compared to 10 processors.
Next, the effectiveness of the various optimizations is measured. Since local optimizations alone do not give much speedup, we concentrate on global optimizations.
The following symbols are used for the different types of optimizations: FD is fuse down, FU is fuse up, DC is global dead code elimination, CP is global copy propagation/constant folding, and RA is redundant arcs elimination.
In Table VII, speedups due to the various optimizations are given for a 10-processor configuration. Some optimizations rely on other optimizations being performed afterwards; for example, copy propagation must be followed by global dead code elimination, which performs some cleanup operations. The results clearly show that the merging optimizations are the most effective. For BMK11A, global copy propagation/constant folding is extremely effective, which explains the sharp drop in the program critical path for this benchmark. In some cases, such as RICARD and HILBERT, performing one optimization creates more opportunities for other optimizations, resulting in a total speedup larger than the sum of the individual speedups. In other cases, performing one optimization can remove an opportunity for others; for example, a copy propagation that removes the only link between Thread A and Thread B prevents A and B from merging, since they are now unconnected.
Discussion and Summary
Thread Characteristics: Overall, the average thread size, 16.2, is relatively large compared to some of the values reported (3-5 instructions) for bottom-up thread generation techniques [32, 21]. In particular, the threads in BMK11A are much larger than in the other benchmarks.
The internal thread parallelism, 3.59, is large compared with our previous work on the evaluation of bottom-up Manchester cluster generation, which only achieved a parallelism of about 1.15-1.20, even considering that Manchester instructions are more CISC-like. This is mainly due to the larger thread sizes of MIDC.
We observe that the number of inputs required is relatively large, on the order of 4 to 10. This implies a need for handling this variable number of inputs efficiently.
Effect of Optimizations: In general, we observe that even though there is some improvement in the various measures when local-only optimizations are applied, the biggest improvement comes from global optimizations. Among global optimizations, the merging operations are particularly effective. The results of the bottom-up optimizations indicate that they have achieved our objectives in code generation:
The synchronization overhead (e.g., the number of matches) has been reduced significantly. The total number of matches has been reduced by 17-37% and the MPI has been reduced by about 10%, thus reducing the required synchronization bandwidth.
The number of inputs to a thread has been reduced by about 15%, thus reducing the input latency.
The internal parallelism has been reduced by about 10%, meaning that processor requirements are not as large.
While the statistics on intra-thread and program-level parameters are important for understanding the behavior of the thread generation and the effects of the various optimizations, the bottom line of any performance measure is the overall program execution time. In this respect, the local optimizations had an insignificant effect while the global optimizations resulted in a 20-30% reduction in execution time on average (footnote 9). With local-only optimizations, the critical path lengths of the programs (at the thread level) are unchanged and, because latencies can be hidden, slightly smaller threads do not speed up the execution much. Therefore, global bottom-up optimizations are essential in achieving better performance.
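The interplay between total work and critical path length seen in the run-time results can be illustrated with a simple bound. The sketch below is not the simulator used in this paper: it assumes the idealized model in which run time is at least the larger of work divided by the number of processors and the critical path length, and it assumes that percentage speedup means (old time / new time - 1) x 100. The work and CPL values are hypothetical, chosen to resemble a program whose work drops by 23% while its CPL is unchanged.

```python
import math


def ideal_time(work, critical_path, processors):
    """Idealized run time: bounded by work per processor or by the CPL."""
    return max(work / processors, critical_path)


def percent_speedup(before, after):
    """Assumed definition: (old time / new time - 1) * 100."""
    return (before / after - 1.0) * 100.0


# Hypothetical program, in arbitrary time units.
base_work, base_cpl = 1000.0, 50.0
opt_work, opt_cpl = 770.0, 50.0  # 23% less work, CPL unchanged

speedup_10 = percent_speedup(
    ideal_time(base_work, base_cpl, 10),
    ideal_time(opt_work, opt_cpl, 10),
)
speedup_inf = percent_speedup(
    ideal_time(base_work, base_cpl, math.inf),
    ideal_time(opt_work, opt_cpl, math.inf),
)
print(round(speedup_10, 1), speedup_inf)
```

In this model the work reduction pays off fully with 10 processors but not at all with an unbounded number, where only a shorter critical path can help.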
Conclusion
Multithreaded architectures promise to combine the advantages of the von Neumann and dataflow execution models. The proposed models and prototypes span the range from dataflow architectures with limited support for state to traditional microprocessors with some support for messaging and thread scheduling. Threaded code generation has followed two approaches: a top-down approach that generates threads directly from the compiler's intermediate data dependence graph form, and a bottom-up approach where instructions in a fine-grain dataflow graph are coalesced into larger grains.
In this paper we have introduced a non-blocking thread execution model called Pebbles, and described and evaluated a code generation scheme whereby threads are generated top-down and then optimized via a bottom-up method. The initial threaded code is generated from Sisal via its intermediate form IF2. The optimization techniques are both local and global. Local optimizations consist of traditional techniques such as dead code elimination and copy propagation. Global optimizations follow a bottom-up style; their most significant job is to merge multiple threads into one, which increases thread size while reducing the cost of synchronization.
The dynamic intra-thread measures of the optimized code indicate: (1) an average thread size of 16.2 instructions per thread, (2) an average parallelism within a thread of 3.6 instructions per cycle, (3) an average number of inputs per thread of 7.6, and (4) an average synchronization cost of 0.47 matches per instruction.
The effect of the optimizations at the program level includes reductions in the following parameters: 31% in the total number of matches, 23% in the total number of instructions executed, 17% in the total thread count and 18% in the average critical path of programs (measured in threads). The total execution time was decreased by 20-30%.
