ABSTRACT Advances in parallel computing have made it clear that the ability to express computations in a machine-independent manner and the ability to handle dynamic and irregular computations are two necessary features of future programming systems. In this paper, we describe the nested data-parallel model of programming, which has both these capabilities. We present a n i n termediate-level language called Vcode embodying the properties of the model. We discuss several representative irregular problems and show both their expression in the model and their performance on shared-memory multiprocessors. We also brie y discuss the compiler techniques necessary for achieving good performance on MIMD machines with this programming model.
1. Introduction. The hardware and software for parallel processing available today i s c haracterized by its diversity and bewildering variety o f choices. With newer and faster parallel machines arriving on the market at a tremendous pace, portability of software is of paramount importance. The variation of architectural features and system balance is much greater among parallel architectures than serial ones. This leads to a wider range of cost-performance tradeo s that portable software must handle. Unfortunately, most parallel programming models have been closely tied to speci c architectures, and do not port across multiple architectural classes or even among various architectures of the same class. This is in sharp contrast to 1 the world of sequential programming, where we n o w h a ve w ell-understood models for programming disparate machines in machine-independent highlevel languages and compiling these languages to produce e cient object code. The development of such techniques for parallel systems is complicated by additional concerns raised by parallelism: partitioning the computation among multiple parallel activities, and mapping data structures and parallel activities on the target machine to make good use of the machine resources.
Most parallel programming models used for MIMD machines have a process orientation, in which the program is viewed in terms of the interactions of a set of processes tasks, threads using mechanisms such as critical sections, locks, or channels. However, while proper management of data structures has a rst-order e ect on performance, most systems only provide ad hoc mechanisms for managing their layout. It is our view that such a process orientation promotes machine speci city and hinders the development o f portable parallel programming systems, and that a data-oriented approach to parallelism where we focus on the transformations to the data aggregates rather than how they are achieved by a set of interacting processes would be a better approach to portability. W e are not debating the usefulness of processes as a programming mechanism, merely their suitability as a userlevel abstraction. Processes are undoubtedly the proper abstraction at the operating system level, but the tasks of partitioning the computation into processes and scheduling them are tedious and error-prone. With a structured approach to parallel programming on the part of the user which i s universally bene cial, process creation and management can be adequately handled by a compiler. This allows the user to work with higher-level abstractions closer to the problem domain.
We do not, however, subscribe to the notion of automatic parallelization of dusty deck" sequential programs, as attempted by several groups 33, 8 , 40 . In principle, automatic parallelization would allow the use of existing code, and allow programmers with little or no experience in parallel programming to take advantage of parallel machines. In practice, however, it has been generally less successful than automatic vectorization, and there is a substantial interactive component to these systems. No compiler, however sophisticated, can divine the user's intentions and discover parallelism where there is none in the source. Such techniques are at best a stopgap measure. Good parallel algorithms are often very di erent from good serial ones, and cannot be derived by source transformations of the sequential source.
A language model that promotes portability m ust therefore be machineindependent and explicitly parallel, but not process-oriented. Further, it should express the maximal parallelism available in an algorithm. Such parallelism is usually found at a very ne grain. In such a setting, it is the user's job to express the parallelism in a machine-independent manner. The compiler is responsible for converting the parallelism into a form best suited for the target machine, creating and scheduling a process structure, and managing data layout. Data parallelism 23 meets all these criteria. We de ne a data-parallel model as one where the parallelism of an algorithm is expressed in terms of parallel operations on data aggregates, without reference to how these operations are realized on the target machine.
It should be emphasized that data parallelism is not a panacea. However, researchers have demonstrated that data-parallel languages are useful for applications as diverse as parsing 42 , DNA sequence comparison 29 , object recognition 45 , VLSI design 27 , and computer graphics 15 . Most of the parallel algorithms found in the literature for network and PRAM models 30 are either data-parallel or easily converted into such a form 19 . Furthermore, the semantic simplicity of data-parallel languages, and their ability to easily express large amounts of parallelism, makes them easy to understand, easy to debug, and scalable. For these reasons, it has been argued that for many applications data parallelism should be the paradigm of choice for all parallel machines, whether SIMD or MIMD 3, 38, 3 4 . Support for this argument is found in the design of many recent parallel languages, such a s A L 4 4 , Apply 21 , the Fortran 90 array extensions 2 , and Fortran D 20 , which are all data-parallel in nature.
The remainder of the paper is organized as follows. Section 2 describes the programming model and a language that embodies it. Section 3 surveys the compilation techniques necessary for achieving good performance with this model on MIMD multiprocessors. Section 4 discusses eight benchmark programs and presents performance results for them on the Encore Multimax, a bus-connected shared-memory machine. Finally, Section 5 presents conclusions and directions for future work.
2. Programming and Execution Models. This section introduces the language and execution models used in the paper. After a brief overview of data-parallel languages in Section 2.1, we i n troduce the source language, Vcode 4 , as our model of data parallelism in Section 2.3. Finally, i n Section 2.5, we i n troduce our execution model, which corresponds to how we expect to execute the translated programs on MIMD machines.
2.1. Overview of data-parallel languages. For the sake of completeness, this section contains a review of the essential features of dataparallel languages. For a more detailed treatment, the reader is referred to the tutorial paper by Sipelstein and Blelloch 41 .
A data-parallel language model consists of two parts: the aggregate data types it supports, and the operations it supports on the aggregate types. By implication, the operations on aggregate types must be amenable to parallel implementation for the language to qualify as data-parallel. We n o w discuss each of these aspects.
2.1.1. Aggregate data types. Examples of aggregate data types include vectors, multidimensional arrays, relations, sets, sequences, mappings, and lists. A notion of shape is invariably associated with aggregate types, in a manner reminiscent of the use of type systems in sequential programming languages. Shape is an attribute of aggregates that is used to specify ordering relations among the elements of an aggregate, to determine conformance between aggregate variables, and to determine the layout of aggregates in memory. Shapes may be supplied by the user, inferred at compile-time, or determined at runtime.
Aggregates can either be at or nested. A at aggregate is one whose elements are themselves not aggregates. A nested aggregate removes this restriction, allowing elements to be of aggregate types possibly nested again. A F ortran 90 array o f i n tegers is an example of a at aggregate. Note that a m ultidimensional array o f i n tegers is at, not nested, since the elements of the aggregate are scalars. As an example of a nested aggregate, consider a chapter of a book. The chapter can be viewed as an aggregate of sections; sections themselves are aggregates of paragraphs; paragraphs, of sentences; sentences, of words; and words, of characters. Nested aggregates are a powerful feature in data-parallel languages. We will return to them after considering operations on aggregate types.
Operations on aggregate types. Languages vary widely in
the operations they support on aggregate types. Most languages support information operations length, type and some kind of an apply-to-each form. The latter construct takes a function and applies it to each element of the aggregate. For non-monadic functions, the aggregates must be conforming and there must be a notion of correspondence between elements of the di erent aggregates. Apply-to-each forms can be implemented in a parallel fashion by e v aluating the multiple applications in parallel. The language semantics must specify how side-e ects are resolved. Other operations supported by languages in various forms are data movement operations permutations, pre x operations scans, and summing operations reductions.
Nested parallelism. The interaction of nested aggregates and
apply-to-each forms gives rise to a very powerful feature called nested p arallelism, which is the ability to apply parallel functions to multiple sets of data within a nested aggregate in parallel. Consider the application of an apply-to-each form to a nested aggregate. The application could result in multiple instances of the function that are active simultaneously. Also, the 4 items to which the function is being applied are themselves aggregates. If the function is parallel which it might w ell be, considering that it takes an aggregate as an argument, we then have t wo kinds of parallelism in action: the intra-function parallelism within each application, and the inter-function parallelism due to the multiple active instances of the function. Note that the language merely speci es the opportunity for multiple kinds of parallelism; it is up to the implementation to actually take advantage of these di erent forms of parallelism.
To illustrate the power of nested parallelism, consider parallel quicksort. At a n y stage of the sort, we h a ve an aggregate of subsequences that must be sorted individually. The function applied to each subsequence selects a pivot, and partitions the elements to form new sequences. This is the split operation discussed in Section 4.2, and has intra-function parallelism. Each application of split is independent of all other applications, giving rise to inter-function parallelism. The e ects of these di erent forms of parallelism can be seen in Figure 1 , where we plot the parallelism pro le of an idealized quicksort program for three implementation strategies: the rst employing only intra-function parallelism, the second employing only inter-function parallelism, and the third employing both. It is immediately obvious that it is insu cient to use one or the other of the strategies; both forms must be used simultaneously to exploit the full bene t of parallelism.
Blelloch and Sabot 6 i n troduce a attening technique to implement nested parallelism on a at parallel machine. This technique represents the nested aggregate using two data structures: a at aggregate describing the data values, and another structure describing the segmentation of the aggregate. This can then be mapped e ciently on a at parallel machine. This technique has been used in implementing Paralation Lisp 38 , and in translating Nesl 5 i n to Vcode. Cytron, Lipkis, and Schonberg 16 present another approach for implementing nested parallelism on a at parallel machine.
2.3. The language model: Vcode. Our language model, Vcode, was designed as a testbed for a systematic study of the compiler and implementation issues that arise in data-parallel languages. Accordingly, its design concentrates on data parallelism to the exclusion of other issues more commonly seen in language designs. We consider it to be an intermediate language rather than a high-level one because it lacks many features associated with high-level languages: multidimensional arrays, record types, user-de ned types, polymorphism, automatic coercion, richer control structures, block structure, and so on. It is a small, simple language that could ultimately serve as the core of a portable intermediate representation for high-level data-parallel languages, and be used to map them to a variety 5 The reason for developing an intermediate representation such a s Vcode is to enable compiler optimizations that are hard to perform on a high-level language: changing program granularity, aggregating program operations into larger loops, scheduling program operations, optimizing storage use, and reducing synchronization 9 . Program transformations such as the attening of nested parallelism are best done on the high-level language before it is converted to the intermediate representation.
The di culty of implementing nested data parallelism when nested function calls can interact with each other through side-e ects argues for an applicative side-e ect-free high-level language. A stack language matches the applicative nature of such a language. Hence the design decision to make Vcode a stack language. The Vcode compiler is careful to optimize storage use so that this feature does not cause a performance penalty. The decision to perform storage optimizations within the Vcode compiler makes other intermediate representations, such as three-address vector code, unsuitable for our purposes.
We c hose not to use any of the existing data-parallel languages as they either lacked features we w anted to study, o r w ere too cumbersome. For example, we could not use C* 37 o r F ortran 90 because they lack support for nested parallelism. Support for nested parallelism was considered critical, 6 The remainder are for control and memory management. The instruction set contains instructions for rearranging the elements of vectors, and instructions that operate on segmented v e ctors. The rearrangement instructions can be used for vector-based indexing in C*, Fortran 90, and APL, for the operation in CM-Lisp, and for the move operation in Paralation Lisp. The segmented instructions permit the implementation of the nested parallelism allowed in CM-Lisp, SETL, and Paralation Lisp. Vcode can dynamically create and destroy arbitrarily long vectors, as this is necessary for most data-parallel languages. Unlike machine-oriented languages, it provides no notion of processors, either physical or virtual. This allows the interpreter or compiler to choose appropriate strategies for process mapping and memory management depending on the target machine.
Vcode is conveniently de ned in terms of an abstract machine, the vector stack machine. This is a standard stack machine, except that each stack location contains an arbitrarily long vector of atomic values. Each vector has a length associated with it, and vectors in di erent locations can 8
have di erent lengths. Instructions operate on entire vectors at a time. For example, the + instruction pops the top two v ectors of equal length and appropriate type from the stack, adds corresponding elements together, and returns the result vector to the top of the stack.
Data types.
There are four basic types in Vcode: boolean vectors, integer vectors, oating-point v ectors, and segment descriptors. The rst three types correspond to data vectors. Floating-point v ectors are double precision. Boolean vectors may be packed or unpacked depending on implementation. The segment descriptor often abbreviated segdes" is a data structure that speci es a partitioning of one or more data vectors into segments. Segment descriptors are used in the implementation of nested parallelism. Various possible concrete representations are possible for this type, such a s a n i n teger vector containing the lengths of the segments, an integer vector containing the starting points of the segments, or a boolean vector set to T at segment starting points and to F elsewhere. The various representations are appropriate for di erent machines. In this paper we will use the lengths representation unless otherwise mentioned. Thus, the two vectors:
A = s y l l a b l e s S = 3 2 4 where A is the data vector and S the segment descriptor in lengths form would represent the following partition:
A 0 = s y l l a b l e s .
Instructions. Vcode instructions fall into two classes, vector
instructions and memory control instructions. The vector instructions operate on a xed numb e r o f v ectors from the top of the stack and return their result to the top of the stack. The memory control instructions are used for stack manipulations and for program control. Table 1 lists some of the Vcode instructions. A complete list may be found in 9, Appendix B . The vector instructions can be further divided into the following subclasses: Elementwise instructions: These are apply-to-each extensions of arithmetic and logical operations. There is also a select instruction that selects from one or the other of two data vectors based on a boolean vector of ags, which is an extension of the ternary conditional operation in the C programming language.
Scan parallel pre x and reduction instructions: These take a source vector and a segment descriptor and return a vector whose elements are the sum" with respect to some binary associative operator of all proper pre xes in the source vector, within each Scan and reduction +-scan, max-scan, or-reduce Permute permute, dpermute, bpermute, dist Segment descriptor manipulation lengths, make-segdes segment. Reduction instruction return the sum" of all the elements within each segment of the source vector. The operators available are +, and, or, max and min.
Permute instructions: In their simplest form, these rearrange the elements of a vector according to another vector of indices. A segment descriptor is also needed to describe the partitioning of the data and index vectors. The more complex instructions in this class allow indices to be selectively masked, and a default vector to be supplied for the result.
Segment descriptor instructions: These instructions create and manipulate segment descriptors and change between various concrete representations.
I O instructions: These instructions read and write vectors and segment descriptors. A Vcode program is a set of Vcode functions. Vcode functions are maps from stack pre xes to stack pre xes; the stack is used to pass parameters and return results. There are no global variables. Control ow within a function is directed by an if-then-else form and recursion. The if-then-else form has the restriction that both branches must leave the stack in the same state with respect to depth and data types. This is necessary for ensuring wellformedness of functions. All instructions except the if-then-else form are strict, i.e., all inputs must be available before the instruction can be executed. Synchronization is implicit and lock-step: an instruction must be completed before the next instruction can begin execution.
Rather than including all possible instructions in Vcode, w e c hose to go with a greatly reduced core instruction set, and to supply libraries of functions built out of these instructions. Instructions that are primitive and used often such as the ones in Table 1 were included in the core instruction set, while operations that could be built out of the core instructions were put in 10 Table 2 List of library functions in the current Vcode de nition. The implementor of Vcode on a new machine only needs to implement the small set of primitive instructions, and, optionally, optimized versions of the library functions for e ciency. In the course of working with Vcode, w e h a ve identi ed several functions that could bene t from such optimization. Some of these are machine-speci c, while others are sequences of core operations that could be reduced in strength for e ciency.
Operation Description
A list of such library functions is shown in Table 2. 2.4. Representations of functions. The stack model is clumsy to read and understand. However, it is convenient for an interpreter. We now i n troduce other representations of Vcode programs that do away with the stack altogether. The rst such representation is used in the compiler; the preferred representation of programs for this purpose is as computation graphs. As Vcode is applicative, the computation graph represents both data and control dependences.
Given a Vcode function represented as stack code, it is easy to generate the graph representation by simulating the e ect of the stack. The mapping from stack code to graphs is, however, many-to-one, as one can always add redundant stack operations to a given piece of stack code.
The other representation, which is useful for human consumption, is a Lisp-like syntax. This can be generated from a graph representation by a recursive descent code generator that traverses the computation graph from outputs to inputs. For a graph that is not a tree, it is more e cient to perform common subexpression elimination by identifying internal nodes that have fanout greater than one, and naming their output vectors. Code can then be generated for these vectors using a let* form before generating 11 code for the graph as explained above. This transformation is unique up to common subexpression elimination. In this paper, we will exclusively use this representation for computations e.g., in Appendix A.
2.5. The execution model: SPMD loop parallelism. As we w ould like run Vcode e ciently on a variety of parallel machines, we m ust select an execution model that can be parameterized to handle architectural variations typically found in such machines. These include details of memory organization, memory consistency, control structure, interconnection topology, and communication architecture. We w ould also like to determine what components of the compilers are common among the machines, so that we may structure a common compiler to take advantage of the similarities. The parallel execution model used in this paper is loop parallelism. It is based on data partitioning, i.e., each thread is responsible for a contiguous portion of the vector. Parallelism comes from the simultaneous execution of multiple iterations of a loop. The output of the compiler is single-program multiple-data SPMD loop code 25 suitable for execution on stock MIMD multiprocessors. Our de nition of SPMD code is as follows. All processors must execute the same body of code, which m a y h a ve data-dependent conditionals. The use of the processor number is limited to computations performed by the runtime system. Our adoption of loop parallelism as our model of parallel execution has been guided by the Fortran experience, which suggests that this form of parallelism is well-matched to the capabilities of MIMD multiprocessors. The model is su ciently high-level that it can be tuned for speci c machines to take i n to account their architectural characteristics.
While loop parallelism is a good execution model on most parallel machines, other models may be better suited to some machines. For instance, a macro-actor model 39 m a y be suitable for a machine such as the Jmachine 17 , and functional pipelines 28, 1 1 m a y b e w ell-suited to a machine such a s i W arp 7 . We c hose not to pursue such alternative execution models as loop parallelism is e ective o ver a wider range of parallel machines.
In choosing a parallel execution model, we need to examine the following issues:
1. Overhead: What is the overhead of initiating a parallel activity? 2. Granularity: How m uch computation does a parallel activity perform on average before participating in a communication or synchronization operation? 3. Homogeneity: Is the set of parallel activities homogeneous or heterogeneous, i.e., do they perform the same computation on di erent data sets, or do they perform completely unrelated computations? We n o w consider these issues for the parallel execution model we are using.
In the parallel loop model, the parallel activities are threads that process sets of loop iterations. To amortize the overhead of thread creation, some number of threads speci ed at runtime are spawned when the program begins execution, and remain active u n til the program terminates. This is in contrast to microtasking 14 on the CRAY, where threads can be created and destroyed at loop boundaries. Note that the number of threads is not known at compile time.
The granularity of activities depends on the ratio of loop iterations to threads. The only Vcode operations that have loop-carried dependences are scans and reductions, and these can be converted to remove such dependences from the loops. Thus, the loops are forall loops and do not require inter-iteration synchronization. Synchronization is only required at loop boundaries.
The SPMD code generated by the compiler is homogeneous; each thread executes the same code, although there may be branches based on data values. Rogers 36 has suggested that it is better to decide on the mapping of iterations to threads at compile time, and generate specialized code for each thread. However, this requires compile-time knowledge of data and processor sizes. This was not considered to be an acceptable alternative given the language model. Furthermore, the techniques used by Rogers would not work in the presence of arbitrary permutations whose communication patterns are not amenable to compile-time analysis, as would happen with the Vcode permutation primitives.
We generate one thread per physical processor in the system. This is typically the case for loop parallelism, as having multiple threads per processor can have a n a d v erse e ect on locality due to context switches, thread migration, and cache interference.
The segmentation of data vectors provides a natural notion of locality in the language model, as operations in di erent segments are independent. This carries across directly to the execution model. Locality considerations also suggest that a block of consecutive iterations should be assigned to the same thread rather than being interleaved across multiple threads.
Consistency with the sequential execution model is guaranteed in almost all cases by the semantics of the Vcode primitives. Scans and reductions present a problem; the parallel implementation is consistent with the sequential one if and only if the operator is associative. However, computer arithmetic is not associative, due to problems such a s o ver ow, under ow, roundo , and quantization e ects. This problem is faced by all compilers, and the usual solution is to provide a compiler directive that instructs the compiler to treat the operator as though it were associative. We take the position that such an action is taken by default. The only other instruction that causes a problem is RAND, since random number generators are usually not re-entrant. Once again, this problem is common to all parallel models, and various solutions have been suggested 32, 1 8 .
3. Compilation. In this section, we give a brief overview of the compiler, and discuss the alternative strategies available for allocating loop iterations to processors at runtime. For a fuller description of these issues, the reader is referred to 9, 1 0 . 14 3.1. The structure of the compiler. A high-level block diagram of the compiler is shown in Figure 3 . The input language of the compiler is Vcode, and the output of the compiler is C code with Mach cthread extensions 13 . We n o w brie y describe the various modules of the compiler. 3.2. Runtime models of parallel execution. Scalar code is executed by all processors, and any scalar results are replicated in all processors. Replication enables the processors to proceed independently and guarantees that the results will be consistent.
Since numerical values are not known for the number of processors P and the loop bounds, it is clear that the actual allocation of iterations to processors must be deferred until runtime. However, we do not want t o use self-scheduling 46 o r a n y of its variants for two reasons: they incur a runtime overhead, and loops containing scan instructions must be executed in a xed order. Similarly, i n terleaved allocation of iterations to processors i.e., processor p executes iterations p; p+P ; : : : is ruled out, as scans cannot be executed in this manner. Moreover, such an allocation has poor locality properties on machines with caches. Iterations must therefore be allocated in blocks to processors.
The most signi cant factor a ecting performance is the question of what kind of concurrency to exploit for segmented operations: inter-segment, intra-segment, or both. As the operations on di erent segments are independent, we can think of a variety of possible strategies for exploiting concurrency. These include:
Inter-segment: At a n y time, at most p segments are being processed, each one by a single processor. Some processors may b e idle.
Intra-segment: At a n y time, a single segment is being processed cooperatively by u p t o p processors.
Mixed: The set of segments is divided up equally among the set of processors, possibly with some segments being split up among multiple processors. The three strategies are illustrated in Figure 4 . The three strategies correspond to the three implementations of nested parallelism in Figure 1 .
The inter-segment strategy works best when there are a large number of evenly-sized segments. If there are fewer segments than processors, some of the processors will remain idle; if the segment sizes are not balanced, the completion time will be dominated by the size of the longest segment. The intra-segment strategy works best when there are a small number of long segments. If the segments are short compared to the number of processors, 16 some of the processors will remain idle. The mixed strategy combines the advantages of the other two strategies. In e ect, it attens out segment boundaries, enabling each segment to be processed by a share of the processor pool proportional to its size. Since this results in the best load balancing, we use this strategy for exploiting concurrency. The inter-segment strategy is an aligned-segment strategy, since segment boundaries are always aligned with processor boundaries. In contrast, the two other strategies are split-segment, since a segment can be split across multiple processors. Although split-segment strategies have better load balancing, they do have a hidden cost. In aligned-segment strategies, the individual segments can be processed as for a uniprocessor implementation, since a segment i s n e v er split across multiple processors. This means that scans require less work, and synchronization is not needed after permutation operations. Thus, depending on the data and the amount of processing, we can trade o load balancing and synchronization cost. The proper choice for minimizing completion time is to provide a runtime disambiguation mechanism to choose between an aligned-segment and a split-segment strategy.
We n o w present a simple cost model and a runtime test to select the better strategy. W e use the following notation. and the split-segment strategy is better if the inequality is reversed. The quantities k s , k a , d s and d a can be derived at compile-time given the cost of elementary operations. The value l can be maintained as a eld in the segment descriptor structure, and is known at runtime, as are n and P. The synchronization cost can be measured and stored in an array indexed by p in the runtime system. Caches complicate the equation by making the cost estimates inexact.
The e ect of aligned-and split-segment concurrency strategies are shown in Figure 5 for three benchmarks: sparse matrix-vector product, dynamic programming, and line drawing. The results do not indicate the clear superiority of either strategy. If there are a large number of small segments, the average load imbalance arising from the alignment of segments and processors is small and does not signi cantly a ect performance. The condition is violated in the line drawing example, where the split-segment strategy is clearly superior. 4. Algorithms. In this section, we discuss eight benchmark programs and evaluate their performance on the Encore Multimax, a bus-connected shared-memory multiprocessor. The benchmarks emphasize problems that are dynamic and irregular. The code for all the benchmark programs has been collected in Appendix A. A preprocessor translated this Lisp code into Vcode. W e present the following measurements for each benchmark:
1. The running time of compiler-generated sequential C code vs. the running time of handwritten sequential C code for the same problem for various problem sizes, running on a Sun-4 65. This gives a baseline comparison and provides an upper bound on obtainable speedup. 2. The speedup of compiler-generated cthreads code with respect to the handwritten sequential C code for the same problem for a xed problem size, running on the Encore Multimax. This demonstrates the feasibility of a parallel algorithm to the problem and the overheads incurred by switching to a parallel solution. 3. The speedup of the compiler-generated code as a function of problem size for a xed number of processors. This provides an estimate of the problem size where the parallel algorithm becomes cost-e ective.
Our de nition of speedup is as follows. Let A P be the parallel algorithm under test, A S a baseline serial algorithm for the same problem, n the problem size, and p the number of processors. Let Ta; n; p denote the running time of algorithm a on a problem of size n with p processors in a given operating environment. The term operating environment" covers both the hardware environment the processor architecture, the amount of memory, etc. and the software environment the operating system, the compiler, etc. in which a program executes. Then the speedup S is de ned as SA P ; A S ; n; p = TA S ; n; 1 TA P ; n; p For our experiments, A P will be the compiler-generated parallel code, and A S the handwritten serial code.
Note that only the problem is held xed in this equation. We place no constraints on the baseline sequential algorithm. Its data structures and algorithmic techniques can be very di erent from those used by the parallel algorithm. We c hoose to use this de nition as it is an honest measure of speedup.
4.1. Pack. The rst benchmark is the library function that packs a segmented vector of integer values based on a boolean vector of ags, maintaining relative order within segments. The packed data vector is returned. The size of the output vector depends on the number of TRUE elements in the ags vector. As the pack operation is heavily used in many other irregular programs, it is crucial to obtain good speedup on it. A na ve translation of the Vcode program shown in Appendix A would result in three loops, one for the reduction that computes the size of the output vector, and two more for the scan and permutation. However, the rst loop of the scan and the reduction loop do exactly the same work. Only the synchronization routines applied to the result are di erent. The compiler discovers this by exposing the microstructure of scans and reductions, and thus produces a translation of the benchmark using only two loops. The rst loop computes the size of the output vector, and the second loop permutes the values.
The handwritten serial code for this problem avoids determining the size of the output vector in advance by noting that it can be no more than the size of the input vector. It therefore allocates that much storage, counts the number of elements actually stored in the permutation loop, and xes up the length eld of the output vector after the permutation. Although this wastes space, it speeds up the serial code by a factor of two. This trick cannot be used in the parallel version, since each thread needs to know the start-o set of each segment in the output vector. The trick could be used to enhance the performance of the parallel algorithm if segments were not required to be contiguous and an aligned-segment strategy was used. However, such a scheme has other disadvantages, and is therefore not recommended.
The performance of the code is shown in Figure 6 . The output vector is 25 of the size of the input vector. The parallel code gets a speedup of four running on 12 processors. The speedup is limited by the extra work that must be done in the parallel version of the scan, and also by the loss of locality in the nal permutation. The multiple curves for each size are 20 for di erent n umbers of segments 1 and 8. The performance shows little dependence on the number of segments.
Split. The split operation take s a v ector of values and a vector
of boolean ags, packs the values corresponding to the FALSE ags to the bottom and those to the TRUE ags to the top, maintaining order within each part, and returns the resulting vector. This operation forms the core of sorting algorithms such as radix sort and stable quicksort 26, 4 7 . Size inference concludes that all operations can be put into a single cluster. However, the output of the plus-reduction causes an access con ict, resulting in the cluster being split into two epochs. The nal parallel code has two loops, as does the serial C code. The rst loop computes the number of FALSE ags; the second loop computes the permutation indices and performs the actual permutation.
The performance of the code is shown in Figure 7 . The maximum speedup achieved by the parallel code is about four on 12 processors. The compiler-generated uniprocessor code is about 1.5 times slower than the handwritten serial code. The speedup of the parallel code is limited by the extra work required in the parallel version of the scan. The multiple curves for each size correspond to di erent n umbers of segments 1 and 8. The speedup shows little dependence on the number of segments.
4.3. Sparse matrix-vector product. This benchmark computes the product of a sparse matrix and a dense vector, both containing doubleprecision oating-point v alues. We assume that there is no structure to the sparsity in the matrix. We use a segmented-vector representation of sparse matrices as follows: each r o w of the matrix is a segment; for each segment there are two v ectors, one giving the column positions of nonzero values and the other giving the values themselves; the rows are in ascending order and columns are in ascending order within rows. The matrix is fully speci ed by the vector of indices, the vector of values, and the segment descriptor. The parallel algorithm fetches values from the vector based on the index vector of the matrix, multiplies them with the matrix data values, and performs a segmented plus-reduction.
The performance of the benchmark is shown in Figure 8 than that. The algorithm used, called quickmedian, is similar to quicksort. The vector is split into two parts based on a pivot element, the rst part containing those elements that are smaller than the pivot, and the second part elements greater than the pivot. Let n be the size of the input vector, and l and h the sizes of the two subvectors. If k l , then the result is the k th element of the rst subvector, and we can exclude the second vector from further consideration. Dually, i f k n , h, then the result is the k , n + h th element of the second subvector. If both tests fail, then the pivot is the desired value. As the recursion only works on one of the two subvectors, the expected work of the algorithm is linear in the size of the vector.
The performance of the parallel code is shown in Figure 9 . The sequential algorithm mirrors the parallel algorithm explained above. The input data were random integers in the range 0 255, with k = n=2. The compilergenerated uniprocessor code runs competitively with the handwritten sequential code, which is not surprising considering they are encoding the same algorithm. Two factors limit the performance of the parallel code. Firstly, the two pack operations do not get perfect speedup. Secondly, xed overheads cause a drop in the performance as the data vectors get shorter. The net result is a speedup of about six on 12 processors.
Connected components. This benchmark nds the connected
components of an undirected graph. The algorithm used is a variant of the random mate algorithm 35 . The n vertices of the graph are numbered 0 through n , 1, and the edges are represented by the numbers of the head and tail nodes. The graph is represented with one vector enumerating the nodes the nodes vector and two more vectors containing the numbers of the head and tail nodes of the edges the head and tail vectors. The basic idea behind the algorithm is as follows. Each node of the graph tosses a coin, resulting in either a head or a tail value. Each edge examines the values received by its head and tail nodes. An edge is active if the nodes have di erent v alues. Each active edge permutes the number of its tail node into the nodes vector at the position given by the number of its head node. This permutation simulates a concurrent-write operation. Each edge then retrieves the node number from the updated nodes vector, from the position given by the number of its head node. If this value is equal to the node number of its tail node, then the tail node has been merged into the head node, and the edge can be deleted. The above computation is repeated after deleting all such edges. The computation terminates when the edge vectors become empty; the numbering of the nodes vector at this point constitutes a labeling of the connected components of the graph. The labeling is not unique as it depends on the outcome of the concurrent-write operations. 
26
The algorithm is probabilistic and does On lg n w ork on average for a graph with n vertices. The baseline serial algorithm uses the standard depth-rst search technique, with an edge-list representation of the graph. The performance of the code is shown in Figure 10 . Two-dimensional meshes were used as test cases, with 25 of the edges being actually present. The handwritten serial code runs almost twice as fast as the compilergenerated uniprocessor code. The parallel code su ers from the same problems as some of the other benchmarks in this section: the overhead of packing, and the e ect of small vectors in the later stages of the algorithm. Nevertheless, the best speedup achieved on 12 processors is a little under ve.
4.6. Dynamic programming. This problem uses dynamic programming to nd the minimum cost of multiplying n conforming matrices M 1 through M n 1, p. 68 . As the matrices are conforming, their sizes can be represented as a vector r, where matrix M i has r i,1 rows and r i columns.
The algorithm works as follows. A matrix C is used to represent costs of multiplying chains of matrices, with C ij being the minimum cost of multiplying matrices M i through M j ; 1 i j n. Then the cost can be expressed by the following recurrence:
C ij = 0 if i = j min ik j C ik + C k+1;j + r i,1 r k r j if j i .
The dynamic programming algorithm computes the values of the matrix C by diagonals. The algorithm consists of three nested loops, an outer serial loop iterating over diagonals, and two inner parallel loops computing the values on each diagonal. Since values are only de ned for the upper triangle, we could save storage by storing only the upper half of the matrix. However, this requires more complex address arithmetic and results in much slower code. The sequential algorithm therefore takes On 3 time and n 2 space, trading storage for speed. Since the parallel algorithm has to perform the address arithmetic explicitly, w e c hose to store the upper triangle by diagonals as this makes the address computation simpler.
The performance of the program is shown in Figure 11 . The compilergenerated uniprocessor code performs only slightly worse than the handwritten serial code. The parallel code obtains a speedup close to nine on 12 processors for a little over 200 matrices. The slight loss of performance is attributable to the overhead of explicit address computations. The input to the algorithm is a binary image corresponding to edge points. Each active point x; y in the image computes the lines that could pass through it. This is done by quantizing the space and computing the values for each v alue of the independent v ariable . A v ote is cast for each such ; pair and tallied in the hough array, a t wo-dimensional array indexed by and . The trigonometric functions are computed in a prepass and stored in a table. This avoids unnecessary recomputation of these quantities in the inner loop of the program. For an image of size n n, the quantization of -space is chosen to be arctan1=n. The ranges of and are ,n p 2 n p 2 and 0 . The Vcode algorithm rst packs the points that are turned on in the image, as only these points contribute to the output. The values are then computed for each point at each quantized value and the nal tallying of votes is done using a combine-permute.
In essence, the Hough transform is simply a two-dimensional histogram in ; space. Multiple images could be processed concurrently exploiting inter-segment parallelism.
The performance of the program is shown in Figure 12 . The input data was a 256 256 binary image, with 5 15 of the points being active i n various runs. For the lower graph, the image size was varied, keeping the edge concentration constant at 15. The compiler-generated uniprocessor program is about as e cient as the handwritten serial code. The parallel code gets a speedup close to eight on 12 processors. Choudhary and Ponnusamy 12 present a n umber of di erent handwritten parallel implementations of the Hough transform on the Encore Multimax. Our speedup numbers are comparable to theirs. 4.8. Line drawing. The nal benchmark is also taken from image processing. Given a set of lines in ; representation e.g., from the Hough transform example above, this program clips the lines to a given bounding box, determines the pixels that will be illuminated in the output image, and outputs this image. Determining the clipping points requires some simple elementwise computations using trigonometric functions, as follows. We assume that we are clipping the lines to the square de ned by the lines x = 0 , y = 0 , x = N, and y = N. W e rst compute the x-values at which a given line intersects the lines y = 0 and y = N. These points are given by the The corresponding y-coordinates can be calculated by substituting these xvalues into the line equation.
The pixels to be turned on are determined using a scan-based version of the standard digital di erential analyzer technique 3, pp. 50 51 . Brie y, this technique is as follows. Given a pair of endpoints x 1 ; y 1 and x 2 ; y 2 , the number of pixels to be turned on is given by L = maxjx 2 , x 1 j; jy 2 , y 1 j:
Let these L pixels be numbered 0 through L , 1 Multiple lines are drawn simply by making all operations segmented. A nal permutation turns on these pixels in the output image. This benchmark, like the connected components benchmark, uses concurrent-write in the permutation, as a single point m a y contribute to more than one line and therefore appear in the points-list multiple times.
The performance of the code is shown in Figure 13 . The input contained 16 lines to be drawn. The size of the output image was varied from 1010 to 1000 1000. The compiler-generated uniprocessor code is competitive with the handwritten serial code. The performance of the parallel code is limited to about nine on 12 processors. The probable cause of this is that a single thread may h a ve to update points in a wide range of the output image. This would reduce the e ectiveness of caches. 4.9. Benchmark performance. The performance of all the benchmarks discussed in this paper are summarized in Table 3 . We c haracterize each benchmark by t wo attributes, regularity and communication. Regularity roughly corresponds to the fraction of elementwise instructions in the program, while communication corresponds to the fraction of permutation instructions. For each benchmark, we present the asymptotic" speedup S 1 achieved on twelve processors and the problem size n 1=2 where half this speedup is obtained. In this section we discuss the reasons that limit the speedups of some of the benchmarks. We can identify four major reasons that limit the performance of the data-parallel programs. They are as follows.
1. Extra work in scans: Any parallel implementation of scans requires at least twice as much w ork as a serial implementation. While the parallel implementation is still optimal in the asymptotic sense, the loss of a factor of 2 is a signi cant loss of performance. This fundamental limitation of parallel algorithms for scans will be an issue in any data-parallel model that provides associative scans as primitive operations.
2. Limited operators for scans and reductions: Vcode only allows a xed set of scan and reduction operators. This restriction makes the formulation of certain benchmarks unnecessarily circuitous. In both cases, user-de ned scans and reductions would allow for much more e cient solutions. This de ciency in Vcode should be corrected by adding general scans and reductions. The addition of linear recurrence solvers as primitive operations is also desirable, as is the addition of a reverse-scan. 34 3. Small size e ect: Each parallel loop has some overhead associated with work distribution, synchronization, and loop overheads. For large loop bounds i.e., long vectors, these overheads are amortized over the work in the loop body. H o wever, the overheads can dominate for short vectors. This e ect is seen in benchmarks where the vectors get shorter in the later stages of the algorithm, reducing the overall speedup. This is a problem with all parallel loops; the solution to this is to make the work distribution routines more efcient, to do more useful work in each loop, and possibly to unroll loops to further reduce loop overheads.
4. Extra work in parallel algorithms: In several benchmarks, the parallel algorithm is very di erent from the serial algorithm and expends more e ort in solving the problem. This extra e ort may only involve constant factors or might actually be asymptotically worse. Another way of viewing the same problem is to note that the sequential algorithm is very e cient. The only real solution to this problem is the development of more realistic performance models for parallel algorithms.
5. Conclusions. We h a ve presented a data-parallel model suitable for expressing computations that are dynamic and irregular. We h a ve given several examples of programs of this type expressed in the model, and have presented performance results for these programs on a MIMD machine. The results demonstrate that very good performance can be obtained given the proper compiler technology. The benchmark programs span diverse application domains, and show that data parallelism is capable of handling much more than simple regular computations.
The work described in this paper can be extended in several ways. On the languages side, the possibilities include: adding record types; allowing user-de ned operators in scans and reductions; machine-independent parallel I O primitives; and integration with control parallelism. On the compilers side, work still remains in automatic mapping of irregular data structures for distributed-memory machines, alternative runtime models, and the design of a structured annotation system.
The combination of appropriate language models and powerful compiler technology can make portable parallel programming a reality. This paper has presented such a model and compiler and partially demonstrated the claim. However, much w ork still remains for a complete solution. The bene ts of portability strongly argue for continued research e orts in the eld.
