In the context of sequential computers, it is common practice to exploit temporal locality of reference through devices such as caches and virtual memory. In the context of multiprocessors, we believe that it is equally important to exploit spatial locality of reference. We are developing a system which, given a sequential program and its domain decomposition, performs process decomposition so as to enhance spatial locality of reference. We describe an application of this method: generating code from shared-memory programs for the (distributed-memory) Intel iPSC/2.
Introduction
Fundamental limits on the switching times and integration densities of devices constrain the computational speeds of single processors. To achieve the computation rates required for large problems such as PDE solutions, it is necessary to harness the power of multiprocessing.
*This research is supported by NSF grant CCR-8702668 and by grants from the Math Sciences Institute, Cornell and the GE Corporation.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. © 1989 ACM 0-89791-306-X/89/0006/0069 $1.50
The trend towards multiprocessing is very evident in the marketplace: the best-selling CRAY machine is the CRAY-XMP, which obtains its performance from four processors (each of which is only 30% faster than the original CRAY-1), while multiprocessors with less complex processing elements, such as the BBN Butterfly (based on the 68020 microprocessor) and the Intel Hypercube (based on the 80286 microprocessor), are becoming increasingly popular.
The major barrier to widespread acceptance of multiprocessors is the primitive state of parallelizing compilers. Most vendors have taken the easy way out and have simply added parallel constructs to an existing language like C or FORTRAN. The compiler provides little or no help in detecting parallelism or ensuring correct synchronization, leaving that entirely to the programmer. This places a severe burden on the programmer and opens the door to time-dependent bugs such as deadlocks and races between reads and writes, which are extremely difficult to track down. It is no exaggeration to say that compiler technology for multiprocessors is in the same sorry state that vectorizing compilers were in ten years ago. The CRAY-1 of the mid-70s sold well in spite of poor vectorizing software mainly because it could be used as a very fast scalar machine. Most multiprocessors, on the other hand, are being built from single-chip microprocessors which are no match for single-processor supercomputers. The only way such machines will gain acceptance is if compiler technology improves to a point where programmers can exploit a large number of processing elements without undue programming effort. This will be achieved when the programmer can write his application program using standard high-level control and data abstractions such as procedures, loops, and arrays, leaving it to the compiler and run-time system to worry about low-level details of process decomposition, synchronization, and load balancing.
We are implementing a compiler that can take a high-level program and decompose it into processes, introducing synchronization where necessary¹. Our approach to process decomposition is based on exploiting locality of reference. In the context of uniprocessors, it is commonplace to exploit temporal locality of reference through devices such as caches, virtual memory, and translation look-aside buffers.
We believe that it is just as important to exploit spatial locality of reference in the context of multiprocessors.
This concept can be explained with reference to the two ways in which multiprocessors are organized: as message-passing machines and as shared-memory machines.
In message-passing machines, like the Intel iPSC/2 and the NCUBE, each process has its own address space and processes must communicate by explicitly sending and receiving messages. Message-passing tends to be very slow: on the Intel iPSC/2, a zero-length message requires 250 μsecs to be packed and unpacked, while the time per hop is between 10 and 20 μsecs [8]. This is more than two orders of magnitude more expensive than the cost of reading or writing into local memory. Since the time per hop is a small fraction of the total time it takes to deliver a message, an appropriate abstraction of the memory hierarchy is a two-level hierarchy in which all non-local accesses are about two orders of magnitude more expensive than local accesses. Clearly, it is important to exploit spatial locality of reference in such machines: process decomposition must ensure that code and the data referenced by the code are packed into the same process as far as possible.
In shared-memory machines, such as the BBN Butterfly and the IBM RP3, there is a single, global address space that is shared by all processes. Inter-process communication is accomplished by reading and writing of memory locations. The single, shared address space is usually an illusion presented to the programmer by the operating system since most shared-memory systems are implemented as a number of processor-memory pairs interconnected through some network. The cost of accessing a non-local data item (i.e., across the network) is on the order of tens of cycles. Therefore, even in shared-memory machines, spatial locality of reference is extremely important for good performance².
¹ We have opted to make our run-time system responsible for load balancing since we have no particular insight into how this problem could be tackled by the compiler; the work characteristics of most programs that we have looked at (such as particle-in-the-cell codes) are difficult to predict at compile-time.
² The only exception to this is the Ultracomputer [2], in which all memory is equally far away from all processors. This uniformity is achieved by making all accesses equally expensive!
To exploit spatial locality of reference in multiprocessors, one can rely on either run-time mechanisms or compile-time analysis. One hardware solution (at least for shared-memory machines) is caching. The penalty of a non-local reference is paid only the first time a datum is accessed; subsequent accesses of the same datum are satisfied in cache. There are two problems with this approach. First, caches in multiprocessors open up the problem of maintaining cache coherency; at present, no general solution to this problem is known [9]. Second, the utility of caching in scientific programs that manipulate large arrays and matrices is debatable (CRAY machines do not have caches). For these reasons, we rejected hardware solutions in favour of compile-time analysis and restructuring of programs. The need for such analysis has been expressed most eloquently by Karp as follows: "... we see that data organization is the key to parallel algorithms even on shared memory systems. It will take some retraining to get programmers to plan their data first and their program flow later. The importance of data management is also a problem for people writing automatic parallelization compilers.... A new kind of analysis will have to match the data structures to the executable code in order to minimize memory traffic." [1]
These admonitions have not been heeded by most researchers working on automatic parallelization.
The most popular approach to automatic parallelization is the 'program-driven' approach, a typical example of which is the Camp system of Peir and Gajski [3]. Their strategy is to parallelize the program by distributing loop iterations among processors. Synchronization is required for loops with loop-carried dependencies and is implemented through complex bit-masks at every word of memory. A similar approach is being pursued in the CEDAR system at Illinois.
Most of these efforts discuss locality but do very little to exploit it. In contrast, exploiting locality of reference is the cornerstone of our approach. The intuitive idea is the following. The programmer writes and debugs his program in a high-level language using standard high-level abstractions such as loops and arrays. Once this is accomplished, he specifies the domain decomposition, that is, how data structures are to be distributed across the multiprocessor.
In most programs we have looked at (such as matrix algorithms, SIMPLE, and particle-in-the-cell), this is quite straightforward since the programmer thinks naturally in terms of decompositions by columns, rows, blocks, etc. Given this data decomposition, the compiler performs process decomposition by analyzing the program and specializing it to the data that resides at each processor. Thus, our approach to process decomposition is 'data-driven' rather than 'program-driven'.
An interesting facet of our technique is that it can be viewed formally as a novel kind of type inference for overloaded operators.
Currently, our compiler does not worry about the assignment of processes to processors; in fact, load balancing may cause a process to migrate among many different processors during its lifetime. This approach runs counter to more traditional approaches to programming distributed-memory machines that focus on mapping the topology of problems (rings, trees, etc.) to the topology of machines (hypercube, shuffle-exchange, etc.) so as to exploit nearest-neighbor communication [10,11]. Exploiting this kind of locality is important for architectures with multi-level memory hierarchies like the Intel iPSC/1, but not for our two-level hierarchy. While our system can be extended to include this kind of topological information, we believe that the performance advantages of dynamic load balancing outweigh any advantage gained from a static assignment of processes to processors.
The rest of the paper is organized as follows. In Section 2, we introduce the programming language and the machine model we use in this paper. The programming language is a functional language augmented with I-structures, an array construct borrowed from logic programming languages. The machine model is a simple message-passing model similar to that supported by the Intel Hypercube or the NCUBE. In this section, we also introduce the 'wavefront' problem which we use as a running example in this paper. In Section 3, we discuss our code generation strategy. The basic strategy is embodied in a simple but inefficient algorithm called run-time resolution. The code generated by this algorithm can be improved considerably by partial evaluation (evaluation at compile-time), and incorporating this optimization leads to an improved algorithm that we call compile-time resolution. We also remark on some connections between our techniques and standard code generation strategies for languages with overloaded operators. In Section 4, we discuss other optimizations that must be performed to generate good code. To reduce the overhead of message passing, it is preferable to combine messages together, thereby reducing message traffic. However, this must be done judiciously since combining messages can have an adverse effect on parallelism. The transformations in Section 4 attempt to strike a balance between these two concerns. We present experimental results that highlight the importance of these transformations.
We discuss some extensions in Section 5, related work in Section 6, and conclusions in Section 7.
Language and Machine Model
The programming language we use in this paper is Id Nouveau [4], which is a functional language with an array construct called I-structures. Our techniques work equally well for an imperative language such as FORTRAN. The machine model is a simple message-passing model like the one supported on the Intel Hypercube or the NCUBE. We chose to work with this model since we are implementing the system on the Intel iPSC/2.
Programming Language
Id Nouveau is a functional language augmented with I-structures, an array construct borrowed from logic programming languages. The rationale behind this integration of functional and logic programming constructs is to permit the programmer to define large arrays and matrices incrementally without incurring the copy overhead of functional arrays. We assume that the reader is familiar with functional languages; therefore, we will describe only I-structures. In a functional language, the allocation of storage for an array is inseparable from the definition of the array elements. This makes it difficult to write programs in which arrays must be defined incrementally.
I-structures get around this problem by separating the allocation of storage from the definition of the array elements. This is similar to imperative arrays; however, unlike an element of an imperative array, an I-structure element cannot be redefined once it has been given a value. Looked at another way, I-structures are 'write-once' arrays. We describe briefly the primitives for manipulating two-dimensional I-structures.
array(e1,e2): The expressions e1 and e2 must evaluate to positive integers. A two-dimensional array of that size is allocated.
A[i1,i2] = e: The expression e is evaluated and the resulting value is stored into A[i1,i2]. If A[i1,i2] has already been written into, a run-time error occurs.
A[i1,i2]: The contents of A[i1,i2] are returned. If A[i1,i2] is undefined, a run-time error occurs.
These primitive operations are 'overloaded' in the sense that they can be used to allocate, define and access arrays of any number of dimensions. For example, the expression array(e1) allocates a one-dimensional array of size e1, and so on.
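The write-once behaviour of these primitives can be sketched in Python. This is only an illustration, not Id Nouveau: the class name `IStructure`, the method names `store` and `fetch`, and the choice of 1-based subscripts are assumptions made for the example.

```python
class IStructure:
    """Write-once 2-D array: storage is allocated first, elements are defined later."""

    _UNDEFINED = object()  # sentinel for a slot that has not been written yet

    def __init__(self, rows, cols):
        # Models array(e1,e2): both sizes must be positive integers.
        if rows < 1 or cols < 1:
            raise ValueError("array sizes must be positive integers")
        self._rows, self._cols = rows, cols
        self._slots = [[self._UNDEFINED] * cols for _ in range(rows)]

    def _slot(self, i1, i2):
        # 1-based subscripts are an assumption of this sketch.
        if not (1 <= i1 <= self._rows and 1 <= i2 <= self._cols):
            raise IndexError("subscript out of range")
        return i1 - 1, i2 - 1

    def store(self, i1, i2, value):
        # Models A[i1,i2] = e: a second write to the same element is an error.
        r, c = self._slot(i1, i2)
        if self._slots[r][c] is not self._UNDEFINED:
            raise RuntimeError("element already defined")
        self._slots[r][c] = value

    def fetch(self, i1, i2):
        # Models A[i1,i2]: reading an undefined element is an error.
        r, c = self._slot(i1, i2)
        if self._slots[r][c] is self._UNDEFINED:
            raise RuntimeError("element is undefined")
        return self._slots[r][c]
```

For instance, after `A = IStructure(2, 3)` and `A.store(1, 2, 7)`, a call to `A.fetch(1, 2)` returns 7, while a second `A.store(1, 2, 8)` raises an error.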
For a complete description of Id Nouveau, we refer the reader to [5]. Figure 1 shows the Id Nouveau program for the Gauss-Seidel relaxation method applied to a grid in normal order. In this program, procedure init-boundary initializes the boundary of the array New. Interior elements in the new matrix are computed by averaging neighbors, two from the old matrix and two from the new. The code in italics specifies the domain decomposition and will be explained later.
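As a rough sequential sketch of the computation just described (in Python rather than Id Nouveau), one sweep might look as follows. Two points are assumptions of this sketch and are not taken from Figure 1: the boundary is simply copied from the old grid, and the two 'new' neighbors are taken to be the ones above and to the left, which is what a row-by-row traversal makes available.

```python
def gauss_seidel_sweep(old):
    """One Gauss-Seidel sweep over a square grid, visiting interior points row by row."""
    n = len(old)
    new = [[None] * n for _ in range(n)]

    # Stand-in for init-boundary (assumption): copy the boundary from the old grid.
    for k in range(n):
        new[0][k] = old[0][k]
        new[n - 1][k] = old[n - 1][k]
        new[k][0] = old[k][0]
        new[k][n - 1] = old[k][n - 1]

    # Each interior element averages two new neighbors (above, left)
    # and two old neighbors (below, right).
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            new[i][j] = (new[i - 1][j] + new[i][j - 1]
                         + old[i + 1][j] + old[i][j + 1]) / 4
    return new
```

Because `new[i][j]` depends on `new[i-1][j]` and `new[i][j-1]`, the elements cannot all be computed at once; this dependence pattern is what constrains the parallel versions discussed later.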
Machine Model
Although our techniques work equally well for shared-memory and message-passing machines, we will assume a message-passing model similar to that provided by the Intel iPSC or the NCUBE. There are n processors in the model, each of which executes one process³. Each process has its own address space. Processes communicate and synchronize by exchanging messages. The primitives for message-passing are the following:
send(v, Pi): The value v is sent by the process executing this command to process Pi. The sending process does not have to wait for the receiving process to get the message.
receive(X, Pi): The process executing this command waits until it receives a value from process Pi and then stores that value into variable X. While it is waiting, messages it may receive from processes other than Pi are ignored temporarily.
In block sends and receives, there is no limit on the size of the data being exchanged; packetization of long blocks of data and assembly of packets at the receiving end are the responsibility of the underlying implementation.
In most message-passing systems, block sends and block receives are more efficient than equivalent sequences of sends and receives. We will assume this in our model.
Domain Decomposition
To exploit spatial locality, a programmer decomposes the data in a manner appropriate for the problem and the architecture.
For example, given an architecture that contains a ring of size s, a good data organization for executing this version of Gauss-Seidel in parallel is to wrap the columns of the matrix around the ring as a dealer deals cards, one column to each processor in turn until all of the columns have been distributed. In general, column j is assigned to processor j mod s. In Figure 1, which is the Gauss-Seidel program mentioned earlier, the italicized portion of the program specifies the domain decomposition.
Examination of the dependencies between the elements of the New matrix reveals the 'wavefront' parallelism in this program: elements of the New matrix along a (major or minor) diagonal parallel to the diagonal from (1,N) to (N,1) can be computed simultaneously.
³ Strictly speaking, the iPSC permits multiple processes to execute on a processor but we can take that into account simply by increasing the number of processors in our model.
In general, domain decomposition is specified as follows. Variables may be mapped to either a single processor (a: Pi) or to all processors (a: ALL). This processor is said to own the variable. Array mappings consist of three functions:
Map: Given the indices of an array reference, Map computes the processor on which the element resides.
Local: Given the indices of a reference, Local computes the location of the reference in the processor on which it resides.
Alloc: Given the subscript ranges for the original array, Alloc allocates an appropriately sized local array.
The owner of an array element is the processor to which it is mapped. For example, the wrapped columns discussed earlier are defined as: 
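A minimal Python sketch of what the three functions could look like for wrapped columns is given here. It is an illustration under stated assumptions, not the paper's own definition: subscripts are taken to be 1-based, processors are numbered 0 to s-1, local storage is 0-based, and the function names are hypothetical.

```python
def map_wrapped(i, j, s):
    """Map: column j of the original array lives on processor j mod s."""
    return j % s


def local_wrapped(i, j, s):
    """Local: 0-based (row, column) position of element (i, j) inside its owner.

    The columns owned by one processor are s apart, so the k-th of them
    (counting from 0) is found with an integer division.
    """
    return (i - 1, (j - 1) // s)


def alloc_wrapped(n_rows, n_cols, s):
    """Alloc: local storage wide enough for ceil(n_cols / s) columns."""
    width = (n_cols + s - 1) // s
    return [[None] * width for _ in range(n_rows)]
```

With s = 4, for example, column 5 maps to processor 1 and column 8 maps to processor 0, and every element of an 8 x 10 array receives a distinct (owner, local position) pair that fits inside the storage returned by `alloc_wrapped(8, 10, 4)`.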
Code Generation
We first discuss run-time resolution, which is a simple but fairly inefficient implementation.
Next, we show how this code can be improved by partial evaluation at compile-time; the resulting code generation strategy is called compile-time resolution. We also point out some connections between this problem and the problem of code generation for languages with overloaded operators.
Run-time Resolution
Our first method, called run-time resolution, produces the same program for each processor. Three simple rules drive code generation:
• Every process examines each statement and determines its role (if any) in the execution of the statement by using the two rules below.
• The process that owns a variable or array element is responsible for computing its defining expression and recording its value.
• The process that owns a variable or array element must communicate its value to any process that needs its value.
For example, in the program of Figure 3a, the first statement will be executed by processor P1 since the identifier a is mapped onto processor P1. Similarly, the expression on the right-hand side of the third statement is computed by P3, with P1 and P2 participating only in the communication of the values of a and b respectively. Figure 3b shows the code generated by the run-time resolution strategy from the program of Figure 3a. This code is executed by all processors. mynode is a procedure that is executed by a processor to determine its own identity. coerce sends a value from the processor that owns it to the processor that needs it. These processors may be the same, in which case just a read is performed.
This style of code has been called Single Program Multiple Data (SPMD) code [1].
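The flavour of run-time resolution can be illustrated with a small Python simulation. Everything concrete here is an assumption made for the sketch: the three-statement program (a and b defined on P1 and P2, c = a + b on P3), the values 10 and 32, the use of threads and queues to stand in for processes and messages, and this particular signature for `coerce`.

```python
import queue
import threading

OWNER = {"a": 1, "b": 2, "c": 3}   # assumed mapping: a on P1, b on P2, c on P3
PROCS = (1, 2, 3)
# One FIFO channel per (sender, receiver) pair stands in for the network.
CHANNELS = {(s, d): queue.Queue() for s in PROCS for d in PROCS}


def coerce(me, value, src, dst):
    """Move a value from its owner src to the process dst that needs it."""
    if src == dst:
        return value if me == src else None    # same process: just a read
    if me == src:
        CHANNELS[(src, dst)].put(value)        # send does not wait
    elif me == dst:
        return CHANNELS[(src, dst)].get()      # receive waits for the value
    return None                                # uninvolved processes do nothing


def spmd_program(me, env):
    """The same code runs on every process; 'me' plays the role of mynode."""
    if me == OWNER["a"]:
        env["a"] = 10
    if me == OWNER["b"]:
        env["b"] = 32
    t1 = coerce(me, env.get("a"), OWNER["a"], OWNER["c"])
    t2 = coerce(me, env.get("b"), OWNER["b"], OWNER["c"])
    if me == OWNER["c"]:
        env["c"] = t1 + t2


def run_all():
    envs = {p: {} for p in PROCS}
    threads = [threading.Thread(target=spmd_program, args=(p, envs[p]))
               for p in PROCS]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return envs
```

Every process executes every ownership test, which is the source of the overhead that the next section sets out to remove.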
Compile-time Resolution
Run-time resolution inserts many extra lines of code into the program for each processor. In our example, none of the tests that follow the first coercion in the code will evaluate to true for processor P1. Techniques similar to those used to resolve overloading in conventional compilers can be used to generate less extraneous code. When compiling languages like Lisp, an overloaded operator like + is usually compiled into a case statement that tests the type of the arguments and dispatches to the appropriate type-specific addition routine. The naive code generated by this strategy can be improved considerably if the compiler knows the types of the arguments or the result (for example, through type declarations) since the case statement can be replaced by a dispatch to the relevant addition routine. This kind of code improvement through 'specialization' of generic code can be used profitably in our context as well. The code generated by run-time resolution is like generic code that can be specialized to each process by using the mapping information. This approach is called compile-time resolution.
When generating code for each processor using compile-time resolution, the compiler examines each statement to determine the processor's role in the evaluation of that statement. This is done in two stages. The compiler uses conventional abstract syntax trees as the internal representation of programs. In the first stage, the user's mapping information is propagated through the program's abstract syntax tree. In the second stage, this information is used to generate code. Each node of the abstract syntax tree has two attributes named evaluators and participants. The evaluators of a node in the abstract syntax tree is the set of processors that perform the operation defined by the node. The participants of a node, n, in the abstract syntax tree is the set of processors that must participate in the evaluation of some node in the subtree rooted at the node, i.e. the union of the evaluators of the nodes in the subtree rooted at n. For lack of space, we do not give the details of the determination of the evaluators and participants attributes; we refer the interested reader to the forthcoming thesis of one of the authors [18].
For the most part, these rules are quite straightforward; the only complication is that the set of participants is used to determine the evaluators for some types of nodes, such as conditionals: the union of the participants of the then-branch and else-branch defines the evaluators of the boolean test in a conditional expression. Figure 3c shows these sets for our simple example; the evaluators are enclosed in braces and the participants are enclosed in angle brackets. In this simple example, only processor names appear in the evaluators and participants. This is not necessarily the case since the mapping for an array reference will be an expression that may include program variables. The information collected in the propagation stage is used to generate code. Given a processor name and a tree node, the compiler tries to determine if the processor is a member of the evaluators of the node. Three outcomes are possible: true, false, and inconclusive. True means that the processor must perform the operation defined by the node. False means it need not. Inconclusive means that run-time resolution must be applied because the compiler cannot analyze the mappings sufficiently. This evaluation will require techniques such as subscript analysis that are commonly used in vectorizing compilers. The code generation phase produces code for each processor by walking the annotated abstract syntax tree while applying this evaluation scheme at each node. Figure 3d contains the code that this method generates for our simple example.
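The two attributes and the three-valued membership test can be sketched in Python for a tiny expression language. Since the propagation rules are not given in this paper, the rules used below are assumptions chosen to agree with the example in the text: a variable is evaluated by its owner, and an operator on the right-hand side of an assignment is evaluated by the owner of the assigned variable. The node classes and the convention of writing unresolved mappings as strings are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Var:
    name: str


@dataclass
class Add:
    left: object
    right: object


@dataclass
class Assign:
    target: str
    expr: object


def annotate(node, owner, target_owner=None):
    """Return (evaluators, participants) for a node, under the assumed rules."""
    if isinstance(node, Var):
        ev = {owner[node.name]}
        return ev, set(ev)
    if isinstance(node, Add):
        ev = {target_owner}
        _, left_parts = annotate(node.left, owner, target_owner)
        _, right_parts = annotate(node.right, owner, target_owner)
        # participants: union of the evaluators of all nodes in the subtree
        return ev, ev | left_parts | right_parts
    if isinstance(node, Assign):
        ev = {owner[node.target]}
        _, expr_parts = annotate(node.expr, owner, owner[node.target])
        return ev, ev | expr_parts
    raise TypeError("unknown node type")


def is_evaluator(proc, evaluators):
    """True or False when decidable; None when the answer is inconclusive.

    Integers are processor names known at compile time; strings stand for
    mapping expressions (such as "j mod s") that could not be resolved.
    """
    if proc in evaluators:
        return True
    if any(isinstance(e, str) for e in evaluators):
        return None   # fall back to a run-time test
    return False
```

For the statement c = a + b with a, b, and c owned by processors 1, 2, and 3, `annotate` yields evaluators {3} and participants {1, 2, 3}; a code generator would emit the addition only for processor 3 and the matching send or receive for the others.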
Returning to the Gauss-Seidel example discussed earlier, Figure 4 contains the code that would be generated by compile-time resolution for a non-boundary processor p. Since the goal of compile-time resolution is to have each processor participate in only those computations for which it has data, it is important to ensure that a processor executes only required loop iterations, rather than go through all iterations looking for work. To compute the required set of iterations for a given processor, we set the expressions in the evaluators equal to the processor name and solve for the loop variable.
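For the wrapped-column mapping, 'solving for the loop variable' amounts to solving j mod s = p for j within the loop bounds. A small Python sketch of this step follows; the function name and the inclusive bounds `lo` and `hi` are assumptions of the example, and p is assumed to satisfy 0 <= p < s.

```python
def owned_iterations(p, s, lo, hi):
    """Iterations j in [lo, hi] with j mod s == p, found without scanning every j."""
    # Smallest j >= lo that is congruent to p modulo s.
    first = lo + (p - lo) % s
    return range(first, hi + 1, s)
```

With s = 4 processors and a loop from 2 to 10, processor 1 executes only iterations 5 and 9 instead of testing ownership on all nine iterations.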
Optimizations
While Figure 4, the end result of compile-time resolution, resembles the handwritten program (Figure 2) to some extent, there are important differences in the treatment of messages in the two programs. By combining messages that have the same source and destination processors, the handwritten version attempts to cut down on the number of messages that must be sent; for example, values in the Old column are sent through a single send command, rather than one element at a time. This is useful because of the relatively high start-up cost of messages on the iPSC/2. On the other hand, combining messages may have an adverse impact on parallelism; for example, if the New column is passed only after all of its elements have been computed, there is no parallelism in the execution of the program. Thus, communication and computation have to be pipelined in order to achieve the best tradeoff between minimizing the number of messages and exploiting parallelism. The handwritten version achieves this by sending the new elements in blocks of size 8, a compromise between sending them one at a time and sending them all at once.
How important are these optimizations? To understand this issue, we first performed experiments to determine how the code generated using run-time and compile-time resolution compares with handwritten code. The run-time resolution programs for a 128 x 128 integer grid took 24 seconds no matter how many processors were involved.
Similarly, the compile-time programs took 15 seconds no matter how many processors were used. The lowest graph in Figure 5 shows the performance of the handwritten code. Compared to the handwritten version, the run-time resolution code performs rather poorly. This was to be expected because it exchanges many more messages than the handwritten code. The absence of speedup arises from the fact that there is no parallelism being exploited in this program: all the processes go through all the statements in the program. The compile-time implementation is more encouraging (as we expected) but it is still bad compared to the handwritten program. There are two reasons for this: first, it exchanges as many messages as the run-time resolution program (which is a lot more than the handwritten code does) and second, this program does not exploit any of the parallelism in the problem! To cut down on the number of messages, we need to use block sends and receives wherever possible. [...] case, the referenced array element) must be checked to ensure that there is no cycle of data dependencies within the loop. If no such cycle of dependencies exists, the read may be converted to a vector read. This in turn will be converted during code generation to block sends and receives or will be removed by copy elimination, if the process that owns the vector is the same as the process that needs it. The topmost line in Figure 5 shows the result of performing this optimization.
Even after this optimization is performed, there is no speed-up when the number of processors is increased. This came as a surprise to us initially, but closer examination of the code generated by compile-time resolution revealed the reason for this. In the code generated by compile-time resolution, the owner of a variable computes its value but does not immediately send that value to other processes that may need it; rather, the process computes onward to the point in the original computation where the value is used and then sends the value. In our program, this results in each new column being computed in its entirety by a process before that column is transmitted onwards to other processes. Only one process is active at a time! To remedy this, computation of the elements of the column and communication of these values should be overlapped. The present implementation is a combination of standard optimizations: loop distribution, loop jamming, and code motion [6], all of which have been modified to preserve properties of the communication patterns among the processes in addition to preserving data dependencies. For details, we refer the interested reader to [19]; the point we wish to make here is that this optimization, like the previous one, can be performed by a code transformation that is independent of the parameters of the actual machine. Not surprisingly, the most impressive gains in our experiments are demonstrated by Figure 5, which shows the improvements due to pipelining of computation and communication.
Pipelining the Gauss-Seidel program results in a program in which a value is sent as soon as it is computed. This leads to a lot of message traffic. By 'blocking' these values, we obtain the curve of Figure 5 which has the best performance; the block size is a compromise between decreasing the number of messages and exploiting parallelism. The optimal block size depends on both the machine-dependent cost of message overhead and the amount of computation between sends. We have not yet implemented this optimization in our compiler but we plan to use an estimate of the communication costs to determine the optimal block size. Given the block size, the code transformation that accomplishes [...] The resulting code achieves the goal of performing as well as handwritten code.
It is reasonable to wonder if the optimizations in this section are required only because of idiosyncrasies of the iPSC/2. The answer to this question is an emphatic no. Let us first consider other message-passing machines. The only 'machine-dependent' part of our transformations is in determining the optimal block size. Pipelining of computation and communication is necessary to exploit parallelism on any machine; this has nothing to do with the overhead of message passing. Vectorization and blocking are required because the overhead associated with a long message is less than the overhead of many small messages. This is likely to remain true for message-passing machines.
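The trade-off behind the block size can be made concrete with a toy cost model in Python. This is not the paper's experiment and does not reproduce the Figure 5 measurements: it assumes a simple chain of stages, each producing n values and forwarding them to the next stage in blocks, with a fixed compute cost per value and a fixed start-up cost per message. The function name, the parameters, and the numbers used below are all illustrative.

```python
def pipeline_time(n, stages, block, compute, startup):
    """Completion time of a chain of stages that forward n values in blocks.

    A stage can start on a block only after it has finished its own previous
    block and the previous stage has sent the matching block. Each block costs
    size * compute plus one message start-up (charged to every stage,
    including the last, to keep the model simple).
    """
    sizes = [min(block, n - k) for k in range(0, n, block)]
    prev = [0.0] * len(sizes)          # when each block leaves the previous stage
    for _ in range(stages):
        done, t = [], 0.0
        for m, size in enumerate(sizes):
            t = max(t, prev[m]) + size * compute + startup
            done.append(t)
        prev = done
    return prev[-1]
```

With n = 128, 8 stages, a compute cost of 10 and a start-up cost of 250 (arbitrary time units), this model gives 35100 for blocks of size 1, 6150 for blocks of size 16, and 12240 for a single block of size 128. Very small blocks pay the start-up cost too often, while very large blocks leave the later stages idle.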
In the context of shared-memory machines with a two-level memory hierarchy, these optimizations become a matter of choosing the appropriate granularity for synchronization between processes. In any situation in which one process is a producer of a sequence of values and another is a consumer of these values, there is need for synchronization.
The granularity of this synchronization affects parallelism and the cost of doing synchronization: if the processes synchronize on every value, parallelism may be enhanced but the cost of synchronization goes up, and vice versa. Thus, in shared-memory architectures as well, one needs to determine an optimal 'block' size.
We have successfully compiled several small programs for the iPSC/2 using our techniques and our results so far have been encouraging. Our current goal is to compile SIMPLE, a large scientific benchmark for this machine.
Related Work
Mehrotra and van Rosendale [13] at Purdue are translating Blaze, a functional language with a forall construct, into an extension of Blaze that includes constructs for explicit process creation, data storage layout, and interprocessor communication and synchronization. They use programmer supplied data decomposition information to schedule forall loops to exploit spatial locality. Recently, we have come to know that a group led by Kennedy and Zima at Rice University are studying similar techniques for compiling a version of FORTRAN 77 that includes annotations for specifying a data decomposition, for the Intel iPSC/2.
[la] describes a method quite similar to our run-time resolution. They also discuss how existing transformations may be used to improve their generated code. Our methods are equally applicable to FORTRAN.
Extensions
This section briefly presents several extensions we plan to incorporate into our system: accumulators, mapping polymorphism, higher order functions, and load balancing.
Accumulators
The techniques described in the previous sections are inadequate for 'accumulation' problems in which a commutative and associative reduction operation is applied to a set of values. A common example of such an operation is finding the maximum or the minimum of an array. If the variable used to compute the maximum is owned by a single process, then all of the array elements not mapped to that process must be sent to it through messages. In contrast, a handwritten solution would compute each process's local maximum and use a fan-in/fan-out scheme to compute the global maximum and broadcast that value to all the processes. It is difficult for a compiler to come up with this code starting from the sequential program, unless it can recognize accumulations in the code. Fortunately, Id Nouveau includes a generalization of I-structures called accumulators [16] for expressing such accumulation problems. We believe that a version of this construct can be used to generate good code for our model.
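The handwritten scheme mentioned above can be sketched sequentially in Python. The sketch only mimics the communication pattern: each list in `local_arrays` stands for the elements owned by one process, and each pairwise `max` in the fan-in loop stands for one message between two processes. The function name and this representation are assumptions, and every process is assumed to own at least one element.

```python
def global_max(local_arrays):
    """Local maxima, then a fan-in tree, then a fan-out of the result."""
    # Local phase: each process reduces its own elements without any messages.
    partial = [max(chunk) for chunk in local_arrays]

    # Fan-in: at each level, process p absorbs the partial result of p + stride.
    stride = 1
    while stride < len(partial):
        for p in range(0, len(partial) - stride, 2 * stride):
            partial[p] = max(partial[p], partial[p + stride])
        stride *= 2

    # Fan-out: process 0 holds the answer; every process receives a copy.
    return [partial[0]] * len(local_arrays)
```

Because max is commutative and associative, the order in which partial results are combined does not affect the answer, and the fan-in needs only about log2(P) message steps for P processes instead of one message per array element.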
Mapping Polymorphism
One disadvantage of the system described above is that a procedure must have a fixed mapping: if the programmer wants two different mappings for a procedure, he must make two copies of the code of the procedure. This is, of course, analogous to the situation in ordinary type specification: a PASCAL programmer who wants to sort both integer lists and floating-point lists must write two procedures with similar code but different type specifications. To get around this problem, we can introduce mapping polymorphism by permitting the abstraction of mapping specifications, in much the same way that abstracting types from procedures yields polymorphic type systems. A generalized type system of this sort was studied by Lucassen and Gifford [15]; we are studying the implications of such a system for code generation.
Higher Order Functions
Higher order functions are those that may take functions as arguments and return functions as results. Run-time resolution can handle higher-order functions without any modifications.
To permit good compile-time resolution, some restrictions must be placed on functions. All functions passed as an argument to a given function must have the same argument types and return type, the same argument mappings and return mapping, and the same participants function. Without these restrictions, compile-time analysis may produce code that is not much better than run-time resolution.
We do not as yet know if these restrictions are overly constraining.
Load Balancing
A good process decomposition places several processes on one processor to ensure that when one process needs to wait for a remote reference the processor running it will have work to do. In addition to hiding memory latency, having multiple processes on a node facilitates load balancing. Processes may be shuffled from overloaded to underloaded nodes without slowing their execution if the data associated with a process is moved along with the code. We would like to experiment with a simple load balancing scheme that moves a process and its data together. Preliminary work on this problem has been done by Fox et al [7] and we are studying their work to see if it can be adapted to our situation.
Conclusions
Good parallelizing compilers are essential for the widespread acceptance and use of parallel machines. The trend in coarse-grain architectures is toward machines that have non-uniform memory access times. An effective parallelizing compiler for such a machine must exploit spatial locality of reference. We have presented a method of code generation that is based on using domain decomposition specifications to exploit spatial locality.
