While the data flow execution model can potentially uncover all forms and levels of parallelism in a program, in its traditional fine-grain form it does not exploit any form of locality. Recent evidence indicates that the exploitation of locality in data flow programs could have a dramatic impact on performance. The current trend in the design of data flow processors suggests a synthesis of traditional non-strict fine grain instruction execution and a strict coarse grain execution in order to exploit locality. While an increase in instruction granularity will favor the exploitation of locality within a single execution thread, the resulting grain size may increase latency among execution threads. We define fine grain intra-thread locality as a dynamic measure of instruction level locality and quantify it using a set of numeric and non-numeric benchmarks. The results point to a very large degree of intra-thread locality and a remarkable uniformity and consistency of the distribution of thread locality across a wide variety of benchmarks. We also evaluate the resulting latency incurred through the partitioning of fine grain instructions into coarser grain threads. We define the concept of a cluster of fine grain instructions to quantify coarse grain input and output latencies. The results of our experiments offer compelling evidence that the inner loops of a significant number of numeric codes would benefit from coarse grain execution. Based on cluster execution times, more than 60% of the measured benchmarks favor a coarse grain execution. In 63% of the cases the input latency to the cluster is the same in coarse or fine grain execution modes. These results suggest that the effect of increased instruction granularity on latency is minimal for a high percentage of the measured codes, and in large part is offset by available intra-thread locality. Furthermore, simulation results indicate that strict or non-strict data structure access does not change the basic cluster characteristics.
Introduction
The past decade or so has seen a tremendous acceleration in the research and development of parallel processing systems. A large number of research prototypes as well as commercial systems have brought new insight into the suitability of various parallel execution models for a variety of scientific and engineering applications. Parallel execution models can be classified according to two key parameters: granularity and synchronization. Representatives of a large grain asynchronous execution model are: shared memory multiprocessors (e.g. Sequent, Encore, Alliant), message passing multicomputers (e.g. Intel iPSC, NCUBE) and tightly coupled vector processors (e.g. Cray X-MP). The fine grain synchronous execution model is represented by SIMD architectures (e.g. CM-2, DAP, MasPar) and by VLIW architectures (e.g. Multiflow, Cydrome). The data flow model fits in the category of fine grain asynchronous execution.
The advantages of the fine grain data flow model have been extensively documented in the literature. They can be summarized as hiding latency (the time between a request, such as a remote memory access, and its response) and providing an efficient mechanism for synchronization (partially ordering parallel events in time) [1]. Experience from the first generation of data flow machines, however, has shown that the classical fine-grain data flow model also suffers from serious drawbacks. By executing solely on a fine level of granularity, the model does not take advantage of any form of program or data locality [2]. To alleviate some of that overhead, the second generation of data flow architectures introduced modifications to the basic fine grain model that allow some form of coarser grain execution. While preliminary studies have shown significant improvements over the basic model, these new architectures are mostly incremental modifications of an original model, each addressing just one aspect of the tradeoff between locality and fine grain parallelism. A systematic study of architectural features supporting coarser grain execution in second generation data flow architectures is clearly needed.
This paper presents the results of a first attempt at quantifying the parameters that determine the tradeoff between locality, parallelism and latency in the execution of data flow programs. The results of these measurements are used to gain a better understanding of which architecture features can best be exploited in a hybrid data flow/von Neumann architecture. We define fine grain intra-thread locality, in a dyadic operation, as the inverse of the number of time steps one token has to wait for the other in the matching store, and quantify it using a set of numeric and non-numeric benchmarks. The results point to a very large degree of thread locality: for example, over 70% of the instructions have to wait less than 5 instruction execution steps for their input data. While an increase in instruction granularity will favor the exploitation of locality within a single execution thread, the resulting grain size may increase latency among execution threads. We therefore evaluate the latency incurred through the partitioning of fine grain instructions into coarser grain threads. The results of our experiments offer compelling evidence that the inner loops of a significant number of numeric codes would benefit from coarse grain execution. More than 60% of the measured benchmarks favor a coarse over a fine grain execution. In 63% of the cases a coarse grain execution does not incur any additional latency. These results suggest that the effect of increased instruction granularity on latency is minimal for a high percentage of the measured codes, and in large part is offset by available intra-thread locality.
The rest of this paper is organized as follows: the following section is a discussion of the evolution of data flow models as described in the literature and implemented on related research prototypes through the first and second generation. The issues related to the tradeoff between locality, parallelism and latency in the data flow model are discussed in Section 3. Section 4 presents the methodology and results of our quantitative evaluation of intra-thread locality and loop latency. Possible implementations that exploit locality in coarse grain data flow models are discussed in Section 5. These are: a data flow vector execution model aimed at exploiting data structure locality while preserving the benefits of the basic data flow model, and a bottom-up algorithm for the creation of coarse grains from a fine-grain graph that preserves deadlock freeness and allows the exploitation of instruction level parallelism and locality. Concluding remarks are offered in Section 6.
Background
First generation data flow computers are implemented in a strictly fine-grain fashion. Instructions are allowed up to two input and two output tokens. This both reduces the complexity of the local matching and instruction stores, and limits the size and execution time of each instruction. These effects are thought to be beneficial to the implementation of data flow computers. The cost of matching is reduced, and the design of a pipelined processor with multiple function units is made more feasible. Most notable among these machines are the MIT Tagged-Token Data Flow Architecture [3], the Manchester Machine [4] and the ETL Sigma-1 [5]. Most data flow machines provide hardware support for I-structure storage [6], where data locations are tagged with presence bits. The data is accessed in a split phase fashion where request and reply are two asynchronous events. This has the advantage of masking memory access latency and allows asynchronous reading and writing of the same location. The above implementations of the fine-grain data flow model have demonstrated a number of advantages:
Parallelism: by its very nature the model extracts all the available parallelism in a program.
Tolerance to Latency: latency is hidden by excess parallelism, split phase memory accessing and simple synchronization supported by the hardware.
Simple Code Generation: the straightforward semantics of operators and the hardware support for synchronization allow for simple code generation strategies.
Load Distribution and Scalability: the fine granularity and functionality of the operators and their large numbers make it possible to effectively employ simple load distribution strategies over a scalable multiprocessor data flow machine.
These first projects also uncovered a number of difficulties associated with the purely fine grain execution model.
Instruction Overhead: data flow machines execute a relatively large number of non-compute operators, such as tag manipulation instructions and branch instructions.
Matching Overhead: the matching store is a complex stage of the execution pipeline through which every data token must go. Because of the asynchronous nature of data flow execution, the generation rate of ready instructions by the matching store is not predictable. When all instructions are dyadic, the speed of the matching store must be twice that of the ALU to achieve 100% utilization. This is hard to attain when matching store and ALU are implemented in the same technology, e.g. on one chip.
Control of Parallelism and Resource Management: although the aggressive exposure of large degrees of parallelism is essential for high-speed computing, it introduces a resource management problem. When the available program parallelism exceeds the machine parallelism, tokens that are ready for execution will saturate the matching store and other data buffers. In the worst case this can result in deadlock [7].
Communication Overhead: fine granularity potentially increases the amount of data communication in the system.
Structure Store Overhead: because the execution model is fine grain and asynchronous, the structure store is designed to cope with asynchronous element-wise access patterns and is therefore unnecessarily costly in dealing with regular data structures such as vectors and matrices.
Solutions to several of the above problems have been proposed in the design of second generation data flow and multithreaded machines, such as Monsoon [8, ?] and *T [9], PRisc [10], Epsilon-2 [?], [12], the hybrid model proposed by Iannucci [13] and the multithreaded execution model TAM [14, 15, 16]. The major alterations to the basic fine grain data flow model are:
Increased granularity has been introduced to reduce communication and matching overhead and simplify resource management.
Simplified matching mechanisms have been implemented to eliminate the need for an associative search in matching.
These von Neumann/data flow hybrid machines are a departure from the original fine grain data flow approach. For example: the explicit token store in the MIT Monsoon machine simplifies the matching considerably and uses registers for intermediate results, eliminating the need for data flow synchronization; the iterative instructions in a later version of the Manchester data flow machine produce whole streams of tokens in one instruction execution, reducing matching overheads and shortening critical path lengths; the Sandia Epsilon-2 machine uses the repeat mechanism to dispatch a thread of instructions efficiently, and has von Neumann style operand memory and registers apart from the data flow matching store. Finally, the ETL EM-4 data flow machine executes the instructions of subgraphs, called strongly connected blocks, in a von Neumann fashion, while it schedules these strongly connected blocks in a data flow fashion.
The tag in hybrid data flow systems has become simpler; it consists of an activation name only. Matching is not implemented by hardware hashing as in the Manchester and the Sigma-1 machines, but by virtual addressing, where the activation name is used as a page address and the destination (instruction address) is either directly used as an offset in the page (as in the EM-4 design) or yields the offset indirectly (as in the Epsilon and Monsoon designs).
The nodes in the data flow graph are now von Neumann style threads of code. The various designs differ in the characteristics of a thread. A thread can either be blocked and later resumed in the midst of its execution (because of a remote memory reference for example), or it cannot block. In the latter case a remote memory reference would be the last instruction in a thread. The compiler is responsible for creating the threads in such a way that on the one hand the loss of parallelism is minimized and no load balancing problems occur (threads should not become arbitrarily large) and on the other hand communication is minimized (avoid matching and token exchange by creating relatively large grains). The size of the threads is influenced by the firing rule in the hybrid model of computation. A strict firing rule allows a thread to execute only when all its inputs are available, preventing threads from blocking but potentially increasing latency and decreasing thread size. Conversely, a non-strict firing rule allows a thread to execute when some of its inputs are available. In this case threads can become larger, but the architecture must cope with blocking threads, which may increase the complexity of synchronization.
These coarse grain features have brought significant improvements over the basic model. Each one of these features addresses one aspect of the tradeoff between locality and fine grain parallelism. Individually, their characteristics and nature are relatively well understood in a qualitative way. However, very little research has been done to quantify their combined effect on performance.
Locality and Latency in Data Flow Execution
The data flow and von Neumann models have often been compared and contrasted. The major point of difference is in their respective approaches to instruction scheduling. In the data flow model instructions are scheduled at run time based on the availability of data. Scheduling and synchronization of concurrent instructions is provided implicitly by the model. Using relatively simple code generation techniques, this model can exploit all the available parallelism at all levels. In the von Neumann model instructions are scheduled statically which, in the general case, complicates the extraction of parallelism from programs. By relying on reusable high speed registers and caches, the von Neumann model can exploit all forms of locality efficiently.
The objective of the hybrid von Neumann-data flow approach is to provide high-speed parallel computing by combining the positive aspects of both models: the efficient instruction level exploitation of locality of the von Neumann model and the tolerance of latency and efficient synchronization of the data flow model. A coarser granularity of execution opens the way to a wide spectrum of design options in the architecture of the processor (e.g. register files, code and data caches, multiple functional units, direct feedback loops within a processing element and vector instructions). These in turn have a strong impact on instruction set design, code generation and program partitioning strategies. The success of these hybrid data flow designs depends on careful quantitative analysis of program behavior given certain machine characteristics.
While the von Neumann model is sequential and synchronous and its locality is well understood, the data flow model is parallel and asynchronous and therefore leads us to re-examine the concept of locality in this hybrid execution model.
Three forms of locality can be identified in a hybrid multithreaded execution model.
Intra-thread locality is the equivalent of the von Neumann instruction level locality. It can be exploited using traditional features such as register files, cache memories, multiple functional units and instruction prefetch.
Inter-thread locality and latency within one function or loop body activation. In implementations such as the Monsoon Explicit Token Store [?], the Epsilon 2 [?] and the EM-4 machine, it is exploited using stack frames for local variables. Here we can see a tradeoff between locality and parallelism. Whereas in the Monsoon machine an activation runs on one processor, allowing an efficient use of stack frame storage, in the EM-4 machine an activation executes on a group of processors, allowing an exploitation of parallelism at the cost of additional storage. While increasing the size of grains has potential for exploiting locality, it can cause loss of parallelism and may increase the critical path length of the program. This is possible for example when a strict firing rule is applied and grains are executed sequentially.
Data structure locality, where the same instructions are applied to the elements of a regular data structure. This is analogous to the locality exploited by vector and array processors in the von Neumann model. Stream generating instructions, as described in [17], allow not only the exploitation of data structure locality, but can also speed up the spawning of tasks, which can have a dramatic effect on the overall parallelism of the program [18].
Quantitative Evaluation
In this section we describe the quantitative evaluation of intra-thread locality and inner loop input latency in data flow programs. The measurements were obtained by instrumenting the simulated execution of a fine-grain data flow graph. The results were then used to analytically evaluate the potential reduction in overhead and the speed-up a coarse-grain execution can provide. Section 4.1 describes the methodology and benchmarks used in the measurements. The evaluation of intra-thread locality is discussed in Section 4.2 and that of inner loop latency in Section 4.3.
Methodology and Benchmark Suites
In order to avoid biases introduced by program partitioning and allocation among processors, we will base our measurements of the fine-grain data flow execution on an ideal execution model. This model is characterized by:
the availability of an infinite number of processors,
all instructions have the same (unit) execution time,
the matching and communication overhead for tokens is zero, implying that no restriction is imposed on the out-degree of a node.
Therefore, at every time step all enabled instructions will execute concurrently. The set of instructions executed in a given time step will be referred to as a generation. A generation is therefore the equivalent of one instruction cycle. It follows from this model that the number of generations necessary to execute a program graph is the size of the critical path through that graph in number of instructions.
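The ideal model above can be illustrated with a minimal sketch. It assumes the program graph is given as a list of producer/consumer arcs; the function name `generations` and the five-node example graph are hypothetical and are not taken from the simulator used in the paper.

```python
from collections import defaultdict


def generations(nodes, edges):
    """Assign each instruction to a generation under the ideal model:
    unit execution time, unlimited processors, zero matching cost.
    `edges` is a list of (producer, consumer) pairs of a DAG."""
    preds = defaultdict(list)
    for src, dst in edges:
        preds[dst].append(src)

    gen = {}

    def visit(node):
        # An instruction fires one step after its latest predecessor.
        if node not in gen:
            gen[node] = 1 + max((visit(p) for p in preds[node]), default=0)
        return gen[node]

    for node in nodes:
        visit(node)
    return gen


# Hypothetical graph: c consumes a and b, e consumes c and d.
example_nodes = ["a", "b", "c", "d", "e"]
example_edges = [("a", "c"), ("b", "c"), ("c", "e"), ("d", "e")]
example_gen = generations(example_nodes, example_edges)

# The number of generations equals the critical path length (here 3).
print(max(example_gen.values()))
```

The recursion is adequate for small illustrative graphs; a trace of a full benchmark would call for an iterative topological traversal instead.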
It should be noted that by assuming an infinite number of processors, we measure the intra-thread locality intrinsic to the data flow graph itself. This measure is therefore an upper bound on the locality that would be experienced with a limited number of processors or functional units per processor, since the waiting time would be greater than or equal to the one measured here. These measurements are analogous to the parallelism profiles measurements that provide an upper bound on the amount of parallelism intrinsic
The benchmarks used in these measurements were compiled into Manchester Data Flow Machine code [19] from SISAL [20] source code. The following benchmarks were used in our analysis of thread locality: The first five benchmarks are clearly numeric code, the Purdue suite of benchmarks is mostly numeric but includes some non-numeric program fragments, and finally the last two are strictly non-numeric programs.
Sisal is a purely functional strict language, which means that arrays can (but do not have to) be completely defined before the array elements are read. Therefore Sisal allows both a strict and a non-strict array implementation. Unless explicitly stated, our programs are compiled into strict code and are automatically garbage collected. Non-strict code allows for more parallelism and has a lower instruction count, at least on data flow machines, but in this case garbage collection is not performed.
Intra-Thread Locality
The asynchronous nature of data flow execution tends to obscure the presence of program and data locality. There is, however, substantial empirical evidence that points to the existence of large amounts of locality that can be exploited to enhance performance. In this section we focus on intra-thread locality, which is the locality among the instructions constituting a thread. We will simply refer to it as thread locality.
Our objective in this section is to quantify thread locality in order to evaluate the potential benefit of using registers for temporary data storage and thereby reduce matching store overhead and token queue traffic. Data flow programs exhibiting a high degree of locality would benefit from the conglomeration of tightly coupled fine grain instructions into coarse grain, von Neumann style execution threads.
Measuring temporal data and instruction locality for a single thread of von Neumann code involves tracking the time between successive references to the same memory address (data and instruction references). This is normally done through data and instruction breakpoints. These values are sorted by increasing reference times to form a locality profile. The asynchronous nature of data flow execution complicates similar measurements on a data flow processor: the scheduling of instructions is not based on their lexical sequencing but on data availability.
Definition 1 Given a dyadic instruction, waiting time is defined as the time delay between the arrival of its first input data value and that of the second, assuming the availability of infinite resources.
Definition 2 Intra-thread locality is simply defined as the inverse of waiting time.
From these definitions, it follows that single-input instructions, and those cases where both tokens arrive at the same time step, have infinite locality.
Just as the delay between two successive memory references characterizes the temporal locality in a von Neumann thread, the delay between the arrival of the first and second operand of an instruction is a measure of the locality that exists between the nodes of a data flow graph. We will measure the waiting time of a token in the matching store as a number of generations.
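Definitions 1 and 2 can be illustrated with a short sketch. It assumes that a trace supplies, for every dyadic instruction, the generation in which each of its two operands arrived; the instruction names and arrival numbers below are hypothetical and are not measured data.

```python
def waiting_times(arrivals):
    """Waiting time (Definition 1) for each dyadic instruction.
    `arrivals` maps an instruction to the generations in which its
    two operands arrived."""
    return {instr: abs(a - b) for instr, (a, b) in arrivals.items()}


def locality(wait):
    """Intra-thread locality (Definition 2): the inverse of waiting time.
    Operands arriving in the same generation give infinite locality."""
    return float("inf") if wait == 0 else 1.0 / wait


def fraction_below(waits, threshold):
    """Fraction of matches whose waiting time is below `threshold`."""
    values = list(waits.values())
    return sum(1 for w in values if w < threshold) / len(values)


# Hypothetical trace of three dyadic instructions.
example_arrivals = {"add": (3, 3), "mul": (2, 6), "sub": (1, 13)}
example_waits = waiting_times(example_arrivals)

for instr, wait in example_waits.items():
    print(instr, wait, locality(wait))
print(fraction_below(example_waits, 5))
```

Evaluating `fraction_below` over a range of thresholds yields a cumulative waiting time distribution of the kind reported for the benchmarks.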
The proposed measure of locality is therefore a dynamic, machine independent measure that is an intrinsic characteristic of the program graph and its input data. This model assumes that every instruction executes as soon as all its operands are available. In a real machine this is not the case because of the limitation of available resources, communication delays and the partitioning and allocation of a program graph. However, our measure of locality is still valid, because:
It is a measure of the potential degree of locality to be exploited by any coarse-grain execution model.
In existing machines, such as the Monsoon, EM-4 and Epsilon 2, a thread gets executed eagerly as soon as it is enabled, thus its timing behavior is analogous to our execution model.
Similar locality measurements are reported in [26], where the waiting time in the matching store is measured for a single iteration of the Simple code on a 10x10 mesh executing on a simulated Tagged-Token Data Flow Architecture. The results are used to evaluate token store capacities for various loop unrolling strategies. Related research is reported in [27], where a high-speed register-cache memory design is proposed for a multi-threaded architecture. In this design registers are allocated at runtime and exploit the locality within super-actors.
Measurements of Intra-Thread Locality
The majority of time in numeric scientific programs is spent in inner loop executions (the 90/10 temporal locality rule [28]). To isolate locality measurements to those program segments that execute most frequently, the effects of program initialization and control bias were removed. The test suites Livermore and Purdue were averaged over all token generations and then weighted by the total number of matches per benchmark.
Single-input instructions are a special case in our measurements: for these instructions, intra-thread locality, as defined above, would be infinite because any single-input instruction can execute as soon as its predecessor instruction has completed. The percentages of single-input instructions in the benchmarks used are shown in Table 1. It is evident that these proportions are often dependent on the problem size. They, however, represent a significant proportion of all instructions executed, indicating that, at least in this instance, there is a substantial amount of intra-thread locality that can be exploited. We will not include single-input instructions in our measurements of intra-thread locality. This stems in part from our belief that in a hybrid coarse-grain data flow architecture model it would often be possible to merge single-input instructions with their preceding instruction in one schedulable macro-instruction. Figure 1 shows the distributions of waiting time, in histogram form, for the Livermore, Purdue and Simple benchmarks as a percentage of matched instructions (i.e. excluding single-input ones). Figure 2 shows the cumulative waiting time distributions for all the benchmarks considered.
Discussion of Results
The results of our simulations indicate that considerable intra-thread locality is available in fine-grain data flow programs. The following observations can be made based on the reported measurements:
The measurements tend to cluster in the lower values of the delay indicating a high degree of locality. For the Livermore Loops 42% of all two input instructions have waiting times less than 5 generations (Figure 1 ). If we include single input instructions, more than 70% of the instructions have waiting times less than 5 generations. In all benchmarks over 20% of all matches have a delay 2.
The simulation results are remarkably uniform across all benchmarks, as can be seen in Figure 2. The knee of all curves occurs at or before 10 generations and indicates the percentage of tokens that could be allocated in von Neumann style temporary storage such as registers or data cache. It also indicates that the expected lifetime of data in this storage is less than 10 instruction cycles.
The plots in Figure 2 also show that 20 to 55% of all matches will have a waiting time larger than 20 generations. The maximum waiting time across our benchmarks ranges from a few generations to over 10,000 generations. The distribution of waiting times larger than 20 is relatively uniform. This indicates that an inexpensive long term secondary storage combined with a high speed temporary store would be sufficient.
These results indicate that there is much to be gained by incorporating specific architectural and compiler features to exploit thread locality in a hybrid data flow machine. One obvious approach is the grouping of fine grain instructions into a single coarse grain instruction with multiple inputs and outputs. Coarse grain instructions would be beneficial in eliminating a large percentage of single input instructions and non-compute overhead instructions. In addition, increasing instruction granularity would reduce the load on the matching unit and the communication network.
Coarse grain instructions would allow thread locality to be exploited through the use of register banks, while data structure locality could be exploited through pipelining. In [17] it was shown that the introduction of iterative instructions (instructions that produce a sequence of outputs when presented with a single set of inputs) was beneficial in reducing program execution times on the Manchester Data Flow Processor. The partitioning of fine grain instructions into coarse grain instructions would be based on several complex tradeoffs such as: matching cycles saved, instruction input latency, the ratio of the number of operands to the instruction operation and the coarse grain instruction execution time. The matching unit itself should be altered to exploit locality through the use of a token cache. The cache should employ a direct matching scheme using a different segment for each token color and a directly computable offset within the segment based on the instruction address.
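The direct matching scheme suggested for the token cache can be sketched as follows. This is an illustration under stated assumptions, not a description of any existing matching unit: the class name `DirectMatchCache`, the fixed segment size, and the choice of offset (instruction address modulo the segment size) are all hypothetical, and collisions between instructions that map to the same offset are ignored.

```python
_EMPTY = object()  # marks a slot that holds no waiting token


class DirectMatchCache:
    """Token cache with one segment per token color and a directly
    computed offset inside the segment (no associative search)."""

    def __init__(self, segment_size):
        self.segment_size = segment_size
        self.segments = {}  # color -> list of slots

    def arrive(self, color, instr_addr, value):
        """Present a token for a dyadic instruction. Returns None if the
        token must wait for its partner, or the operand pair
        (first, second) once the partner is already waiting."""
        segment = self.segments.setdefault(
            color, [_EMPTY] * self.segment_size
        )
        # Assumed offset function: address modulo the segment size.
        offset = instr_addr % self.segment_size
        waiting = segment[offset]
        if waiting is _EMPTY:
            segment[offset] = value
            return None
        segment[offset] = _EMPTY  # free the slot once the match fires
        return (waiting, value)


cache = DirectMatchCache(segment_size=8)
print(cache.arrive("iteration-1", 5, 2.0))  # first operand waits
print(cache.arrive("iteration-2", 5, 9.0))  # other color, other segment
print(cache.arrive("iteration-1", 5, 3.0))  # partner arrives, match fires
```

Because the slot address is computed rather than searched for, the cost of a match is a single indexed access, which is the property the direct scheme is meant to provide.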
Inner Loop Input Latency
While an increase in instruction granularity will favor the exploitation of locality within a single execution thread, the resulting grain size may increase latency among execution threads, which might lead to a performance degradation. Our objective in this section is therefore to quantitatively evaluate the added latency cost that is incurred in a coarse-grain data flow execution model.
Definitions
In this section we define the execution models as well as the graph and timing parameters that will be used in the evaluation.
Definition 3 A cluster of instructions is a connected directed acyclic graph with only one output arc. In a cluster each node is a fine grain instruction and the output is always generated by the last executing node.
An example of a cluster is depicted in Figure 3. The reason behind a single output arc of a cluster is to allow only one fixed termination point in the execution of a cluster. This constraint is not necessary in a general coarse-grain execution. While a cluster could be any collection of fine grain instructions, in this paper we will focus on the body of loops as a specific type of cluster. Figure 5 shows the body of loop number 7 in the Livermore loops benchmark suite. Two parameters are associated with the graph of a cluster:
In the remainder of this section we will compare two execution models of a cluster:
Fine grain execution model: any instruction in the cluster will execute as soon as its inputs are available; instructions execute in parallel assuming infinite hardware parallelism. This model is essentially the ideal execution model defined in Section 4.1.
Coarse grain execution model: no instruction will execute until all the input tokens to the cluster have arrived. Instructions in the cluster are executed in a strictly sequential mode.
Next we define a number of timing parameters that characterize the execution of a cluster under either fine or coarse grain models.
$t_0$: the arrival time of the first input to the cluster,
$t_n$: the arrival time of the last token in the cluster,
$t_r$: the time at which the result is generated.
The relationships between these values, shown in Figure 4, are defined by:
$l$: the delay between $t_0$ and $t_n$ ($l = t_n - t_0$),
$d$: the delay between the arrival of the last input token ($t_n$) and the first result token ($t_r$) ($d = t_r - t_n$).
In general, for a cluster to be efficiently executed under a fine grain model, 1; S 1 d, and l l 0 , should hold. These conditions guarantee that the data flow circular pipeline stays full, hence the cost of token matching and structure store latency can be hidden by fast context switching. A cluster may favor a coarse grain execution strategy for l l 0 or d S 1 , provided the average cluster parallelism , is reasonably small and the available intra-thread locality is high.
Based on these parameters, we can derive the following properties for an ideal execution model (i.e. with zero matching costs):
1. $l = 0 \Rightarrow d = S_\infty$. By the definition of $d$, if no instruction in the cluster incurs any input latency, the cluster output delay will be equal to its critical path length.
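The timing parameters and the two execution models can be made concrete with a small sketch. It assumes unit instruction execution time and zero matching cost, as in the ideal model; the three-instruction cluster and the arrival times are hypothetical and do not reproduce the Livermore Loop 7 cluster or any measured value.

```python
def fine_grain_result_time(cluster, input_times, output):
    """Time at which `output` completes when every instruction fires
    one step after its last operand becomes available.
    `cluster` maps each instruction to its operands, which are either
    cluster inputs (keys of `input_times`) or other instructions."""
    finish = {}

    def ready(operand):
        if operand in input_times:
            return input_times[operand]
        if operand not in finish:
            finish[operand] = 1 + max(ready(op) for op in cluster[operand])
        return finish[operand]

    return ready(output)


def coarse_grain_result_time(cluster, input_times):
    """Result time when the cluster waits for all of its inputs and
    then executes its instructions one after the other."""
    return max(input_times.values()) + len(cluster)


def input_and_output_delay(input_times, result_time):
    """Return (l, d) with l = t_n - t_0 and d = t_r - t_n."""
    t_0 = min(input_times.values())
    t_n = max(input_times.values())
    return t_n - t_0, result_time - t_n


# Hypothetical cluster: two multiplies feeding one add.
cluster = {"m1": ["a", "b"], "m2": ["c", "d"], "s": ["m1", "m2"]}
together = {"a": 0, "b": 0, "c": 0, "d": 0}  # all inputs arrive at once
late_d = {"a": 0, "b": 0, "c": 0, "d": 5}    # one input arrives late

for times in (together, late_d):
    fine = fine_grain_result_time(cluster, times, "s")
    coarse = coarse_grain_result_time(cluster, times)
    print(input_and_output_delay(times, fine),
          input_and_output_delay(times, coarse))
```

In the first case $l = 0$ and the fine grain delay $d$ equals the critical path length of the cluster (two instructions), which agrees with property 1. In both cases the coarse grain delay is the number of instructions in the cluster, since execution is sequential once the last input has arrived.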
Latency Measurements
To quantify the cluster parameters $l$, $l'$ and $d$, the Manchester Data Flow Machine (MDFM) simulator [4] was used to trace the execution of inner loop clusters of some of the Livermore loops (1-10, 13-15) and the Purdue benchmarks (1-4, 7-14). The clusters were simply chosen as the inner loop in each benchmark. The cluster boundary was chosen such that the top level instruction in each cluster is an arithmetic instruction. In other words, all the inputs to the cluster are inputs to an arithmetic instruction within that cluster. To study the effects of both strict and non-strict structure store access on cluster latency, all benchmarks were compiled using both strict and non-strict/no garbage collection options (this aspect is discussed in Section 4.1).
As an example, the Sisal code for Livermore Loop 7 as well as the cluster of instructions of its inner loop are shown in Figure 5. The opcodes MLR and ADR are multiply reals and add reals respectively. With each input is also specified its arrival time, in generation numbers, obtained from the simulator after compiling the source code using strict structure store access and garbage collection enabled. For this loop Simulation results obtained for code generated using strict structures with garbage collection active are summarized in Table 2. The following observations can be made:
$l$ and $l'$ are the same for 63% of the simulation runs, implying the token arrival to token result delay, $d$, is very nearly equal to $S_\infty$ for these clusters. In these cases a coarse grain execution would not incur any additional latency over a fine grain one. Under these conditions there can be little pipelining of instruction execution, and therefore the overall execution time of the cluster is not affected by its execution mode.
The average value of the cluster parallelism is small: 1.3 for the Livermore Loops, and 1.1 for the Purdue benchmarks. This indicates that there is very little parallelism to exploit within a single iteration of these loops (this result has been reported in [29]). Because this parallelism is small, token matching and structure store latency cannot ]. Another alternative is a sequential non-blocking execution strategy, which offers great potential for exploiting intra-thread locality and also for reducing matching store costs. Of course, traditional compiler optimization techniques such as loop unrolling and software pipelining can be applied and would provide a tremendous increase in intra-cluster parallelism.
Strict and Non-Strict Structures
The Sisal to MDFM code generation system can be configured to generate code for both strict and non-strict access to the structure store. The structure store accesses data without the use of tags, implementing the I-structure paradigm through a deferred read mechanism [30]. The structure store consists of an allocation unit, structure memory, a deferred access queue, and a clearance unit. The allocation unit manages storage allotment, while the clearance unit performs garbage collection. Structure store access can be made asynchronously via the deferred access queue: when a read request is made for data not yet written, the request is held in the queue until the required data arrives. To compare the effects of strict and non-strict structure store access on cluster parameters, each benchmark was compiled to generate code for both paradigms. For strict structure store access, the compiler generates code to ensure strictness, i.e. no instruction can access an array element until the entire data structure is built. The effect of strictness on cluster inputs is to delay the arrival of tokens transporting data structure elements. Simulation results indicate that the basic cluster characteristics do not change. Table 3 shows the timing parameters for a non-strict execution of some Livermore loops. Comparing these with the values in Table 2, the main conclusions that can be drawn are:
The token arrival latencies $l_0$ and $l$ are in general longer for code compiled with strict structure store access than for code compiled with non-strict access. This result is to be expected, since no array element is available prior to completion of the whole array.
In most of the strict executions we have $l = l_0$, whereas this is not the case in the non-strict executions. Again this is to be expected, since strict array creation makes all array elements available at the same time.
From these initial results we conclude that a coarse grain execution is not hampered by a strict array implementation, whereas a non-strict array implementation enhances fine grain execution. These initial results, however, must be confirmed by larger benchmark programs.
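The deferred read mechanism of the structure store described above can be illustrated with a small Python sketch. It is a toy model under our own assumptions, not the MDFM implementation: the class name `StructureStore`, its methods, and the use of callbacks to stand in for result tokens are all hypothetical, and allocation and garbage collection are omitted.

```python
from collections import defaultdict


class StructureStore:
    """Toy write-once store with a deferred access queue (hypothetical API)."""

    def __init__(self, size):
        self._cells = [None] * size
        self._present = [False] * size       # one presence flag per element
        self._deferred = defaultdict(list)   # index -> consumers waiting on it

    def read(self, index, consumer):
        """Deliver the element to `consumer` now, or defer until it is written."""
        if self._present[index]:
            consumer(self._cells[index])
        else:
            self._deferred[index].append(consumer)

    def write(self, index, value):
        """Write an element once and release every read deferred on it."""
        if self._present[index]:
            raise ValueError(f"element {index} already written")
        self._cells[index] = value
        self._present[index] = True
        for consumer in self._deferred.pop(index, []):
            consumer(value)

    def pending_reads(self):
        """Number of read requests still held in the deferred queue."""
        return sum(len(queue) for queue in self._deferred.values())


store = StructureStore(4)
results = []
store.read(2, results.append)   # element not yet written: request is deferred
store.write(2, 3.5)             # write releases the deferred read
store.read(2, results.append)   # element present: read is served immediately
```

In this sketch a non-strict consumer can issue `read` before the producer has finished building the array, whereas strict code would issue its reads only after every element has been written.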
Analysis of Results
The results in the previous section describe the latency of input token arrival at cluster boundaries. In this section we analyze the effects of the input latency on the execution time of a cluster under the fine and coarse grain execution models. $T$ is the total delay in the cluster execution from the arrival of the first input to the availability of its output. $T_{min}$ is the lower bound on $T$ under the fine grain execution model and $T_{max}$ is the upper bound on $T$ under sequential coarse grain execution. The difference between the two can be viewed as a measure of the potential loss in performance between the coarse and fine grain models. Surprisingly, for the combined benchmarks, $T_{min} = T_{max}$ in 45% of the cases. The following analysis is based on a hypothetical dataflow machine, depicted in Figure 8, where the matching stage and the execution stage both have a unit time delay. The machine can operate in both coarse and fine grain modes as defined in Section 2. In both modes an instruction executes in one time step.
Let $T_f$ and $T_c$ represent the execution time in the fine and coarse grain modes, respectively. They can be derived as
$$T_f = l + 2d$$
$$T_c = l + 1 + S_1$$
For a cluster of instructions executed in fine grain mode, a latency $l_0$ is incurred until the first instruction packet can fire. At time $t_0 + l$ there will be $d$ instructions that still have to go through the two stages of the machine. In coarse grain mode the execution unit incurs a startup latency of $l$ until the matching unit collects all cluster inputs. The first instruction is ready to execute at time $l + 1$, with all other instructions requiring an additional $S_1$ cycles to produce the final result.
Let the speed-up achievable in the fine grain mode over the coarse grain one be defined by the ratio of the two execution times, $SP = T_c / T_f$. The values of $SP$ are shown in Table 2, indicating that 60% of the benchmarks would benefit from a coarse grain execution. The distribution of the speed-up $SP$ across the benchmarks is shown in Figure 7. All values of $SP$ lie between 0.5 and 1.2, implying that a coarse grain execution would outperform a fine grain one by at most a factor of 2, while the fine grain execution would perform at most 20% better, and that in only one case. These results are even more compelling considering that communication costs and resource conflicts have not been accounted for in the fine grain execution model.
A fine grain model of execution will perform well for small input latency $l$ and sufficiently large cluster parallelism. In this case a fine grain execution model can mask memory latency through fast and inexpensive synchronization. However, if cluster parallelism is small and input latency is high, the fine grain model performs poorly because there is insufficient useful work to mask input latency. The full cost of token matching is incurred because there are no instructions in the execution unit's ready queue. By exploiting intra-thread locality, a coarse grain model can efficiently execute clusters consisting of sequential code with few synchronization points.
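The two execution time expressions can be evaluated directly. The following Python sketch is illustrative only: the function names and the sample values of $l$, $d$, and $S_1$ are hypothetical and are not taken from Table 2, and the speed-up is assumed to be the ratio of the coarse grain to the fine grain execution time, which agrees with the reading that values below 1 favor coarse grain execution.

```python
def fine_grain_time(l, d):
    """T_f = l + 2d: d instructions each pass through two unit-delay stages."""
    return l + 2 * d


def coarse_grain_time(l, s1):
    """T_c = l + 1 + S_1: wait for all inputs, match once, then run S_1 cycles."""
    return l + 1 + s1


def speedup(l, d, s1):
    """Assumed definition: SP = T_c / T_f (SP < 1 favors coarse grain)."""
    return coarse_grain_time(l, s1) / fine_grain_time(l, d)


# Hypothetical cluster: last input arrives at l = 10, with d = S_1 = 5.
t_f = fine_grain_time(10, 5)       # 10 + 2 * 5 = 20
t_c = coarse_grain_time(10, 5)     # 10 + 1 + 5 = 16
sp = speedup(10, 5, 5)             # 16 / 20 = 0.8
```

For this made-up cluster the speed-up is below 1, so the coarse grain mode finishes first. Under this model the fine grain mode gains only when $d$ is small relative to $S_1$, that is, when the cluster has enough internal parallelism.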
Hybrid Models
In this section we briefly summarize some of the research projects currently being investigated by our team that address the issue of hybrid dataflow/von Neumann models.
V-Structures: A dataflow vector model
A V-Structure consists of a number of fixed size chunks of data elements [31]. Each data array consists of a number of chunks; each chunk is tagged with a presence bit and is accessed as a whole from the structure store. In the processor a chunk is operated upon by vector instructions using vector functional units and registers, very much like in vector machines. Unlike the Manchester Dataflow Machine iterative instructions [17], where structure elements are produced in a stream mode but consumed as scalars, in this model array elements are both produced and consumed in vector mode. The V-Structure model can therefore be seen as a hybrid strict/non-strict data structure, providing intra-chunk strictness and inter-chunk non-strictness to data structure access. It is also a hybrid between I-Structures [32] and traditional vectors.
Experimental simulation results employing this model indicate that an order of magnitude improvement in performance over conventional fine grain architectures is possible. Furthermore, at least an order of magnitude reduction in executed instructions results from coarse grain vector operations, leading to a significant reduction in matching store operations. These results also show that the execution time, in machine cycles, is minimal for certain values of the chunk size. We have developed an analytical model that derives the optimal values of the chunk size based on program and machine characteristics. This model corresponds well with our experimental results.
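The combination of intra-chunk strictness and inter-chunk non-strictness can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions: the class `VStructure`, its methods, and the helper `vector_add` are hypothetical names, and an absent chunk is reported as `None` where a real structure store would defer the read.

```python
class VStructure:
    """Toy array split into fixed size chunks, one presence bit per chunk."""

    def __init__(self, length, chunk_size):
        self.chunk_size = chunk_size
        n_chunks = -(-length // chunk_size)   # ceiling division
        self._chunks = [None] * n_chunks
        self._present = [False] * n_chunks

    def write_chunk(self, k, values):
        """Produce chunk k as a whole (intra-chunk strictness)."""
        if self._present[k]:
            raise ValueError(f"chunk {k} already written")
        if len(values) > self.chunk_size:
            raise ValueError("too many elements for one chunk")
        self._chunks[k] = list(values)
        self._present[k] = True

    def read_chunk(self, k):
        """Return chunk k as a whole, or None if its presence bit is clear."""
        if not self._present[k]:
            return None
        return list(self._chunks[k])


def vector_add(a, b):
    """Stand-in for a vector instruction operating on two whole chunks."""
    return [x + y for x, y in zip(a, b)]


v = VStructure(length=6, chunk_size=4)
v.write_chunk(0, [1.0, 2.0, 3.0, 4.0])
first = v.read_chunk(0)                 # available although chunk 1 is not
missing = v.read_chunk(1)               # None: chunk 1 not yet produced
doubled = vector_add(first, first)      # one vector operation on the chunk
```

A consumer can thus start on chunk 0 before chunk 1 exists (inter-chunk non-strictness), but never sees a partially filled chunk.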
Clusters: A coarse grain model
This work examines a cluster based model of coarse grain dataflow execution in which neighboring instructions are grouped into clusters that become the schedulable unit of execution [?]. The primary objective of this model is to reduce the runtime overhead incurred in fine grain dataflow execution by exploiting instruction level locality within a cluster, thereby reducing the cost of matching. Clusters implement a strict matching function with non-blocking execution, meaning that all inputs must be present before the execution of a cluster can start and that, once started, the execution does not block.
A bottom-up clustering algorithm has been developed that, starting with a fine grain dataflow graph, builds a cluster based graph that satisfies this execution model. This three phase algorithm attempts to avoid any potential deadlock, preserve any loop and function level parallelism, and maximize instruction level locality within a cluster.
The results from the simulated execution of these cluster graphs show a significant reduction in the number of matches per instruction executed: from 1.78 in the fine grain model to 0.96 in our cluster model. Furthermore, the results also show that the cluster based execution does not reduce the parallelism intrinsically available in the original fine grain dataflow graph. The average cluster across these programs has 5.0 nodes and 3.0 inputs, as opposed to the 1.78 inputs per single node in the fine grain model. However, the number of single node clusters is still relatively high: 30% of all clusters. We expect to reduce this figure in the future by using a more sophisticated algorithm.
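The strict matching, non-blocking execution rule for a cluster can be sketched as follows. This Python fragment is a minimal illustration under our own assumptions: the class `Cluster`, its instruction format of (destination, operation, sources) triples, and the sample cluster are hypothetical, and the clustering algorithm itself is not shown.

```python
import operator


class Cluster:
    """Toy cluster: fires only when all inputs are present, then runs to completion."""

    def __init__(self, inputs, instructions, output):
        self.inputs = set(inputs)
        self.instructions = instructions   # list of (dest, op, source names)
        self.output = output
        self._tokens = {}

    def receive(self, name, value):
        """Store one input token; return the result once the input set is complete."""
        if name not in self.inputs:
            raise KeyError(name)
        self._tokens[name] = value
        if len(self._tokens) < len(self.inputs):
            return None                    # strict matching: keep waiting
        return self._fire()

    def _fire(self):
        """Run every instruction in order using local values only (non-blocking)."""
        env = dict(self._tokens)
        self._tokens.clear()
        for dest, op, sources in self.instructions:
            env[dest] = op(*(env[s] for s in sources))
        return env[self.output]


# Hypothetical two instruction cluster computing x = a * b + c.
cluster = Cluster(
    inputs={"a", "b", "c"},
    instructions=[
        ("t", operator.mul, ("a", "b")),   # multiply reals
        ("x", operator.add, ("t", "c")),   # add reals
    ],
    output="x",
)
r1 = cluster.receive("a", 2.0)   # None: inputs b and c still missing
r2 = cluster.receive("b", 3.0)   # None: input c still missing
r3 = cluster.receive("c", 1.0)   # all inputs present: cluster fires
```

The intermediate value `t` never leaves the cluster, so in this sketch it needs no matching operation of its own; only the three cluster inputs are matched.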
Conclusion
The ability of the dataflow paradigm to exploit program parallelism at all levels and to resolve data dependencies at run time has been demonstrated by several research projects. The fine grain execution model, however, fails to exploit the inherent locality in programs and thereby introduces unnecessary run time overhead in the most expensive stage of the dataflow pipeline: the matching stage. Hybrid von Neumann/dataflow architectures that alleviate this run time overhead have been proposed as alternatives to the fine grain model. This paper addresses the issue of quantitatively evaluating the dynamic instruction level locality present in dataflow graphs and the added latency introduced by a non-blocking execution of coarse grain dataflow graphs. The experimental measurements evaluate thread locality on a dataflow machine simulator and are based on a set of numeric and non-numeric benchmarks.
We have defined and quantified intra-thread locality in dataflow execution. The results present compelling quantitative evidence that a high degree of intra-thread locality exists in dataflow programs. On the other hand, it has been shown that a substantial percentage of tokens have very large waiting times, indicating the need for long term secondary matching storage. The distribution of the locality appears to be quite consistent across a wide variety of benchmarks.
We have used the concept of a cluster of fine grain instructions to quantify input and output loop latencies under both coarse and fine grain execution models. The results show that in a large percentage of the cases (63%) there is no increase in latency in the coarse grain execution model. An analysis of the execution of these clusters on an idealized hypothetical dataflow machine shows that a coarse grain execution will outperform a fine grain one in over 60% of the cases. Where the fine grain model has a smaller execution time, the speed-up is less than 1.2.
Two aspects of hybrid models of dataflow execution have been summarized: the V-Structure model, which exploits data structure locality in the form of vectors, and the Cluster based execution model, which exploits instruction level locality.
