This paper presents a new design that implements the data-driven (i.e. dataflow) computation paradigm with intelligent memories. Also, a relevant prototype that employs FPGAs is presented for the support of intelligent memory structures. Instead of giving the CPU the privileged right to decide what instructions to fetch in each cycle (as is the case for control-flow CPUs), instructions in dataflow computers enter the execution unit on their own when they are ready to execute. This way, the application-knowledgeable algorithm, rather than the application-ignorant CPU, is in control. This approach could eventually result in outstanding performance and elimination of large numbers of redundant operations that plague current control-flow designs. Control-flow and dataflow machines are two extreme computation paradigms. In their pure form, the former machines follow an inherently sequential execution process while the latter are parallel in nature. The sequential nature of control-flow machines makes them relatively easy to implement compared to dataflow machines, which have to address a number of issues that are easily solved in the realm of the control-flow paradigm. Our dataflow design solves these issues at the intelligent memory level, separating the processor from dataflow maintenance tasks. It is shown that using intelligent memories with basic components similar to those of FPGAs produces a feasible approach. Expected improvements within the next few years in underlying intelligent memory and FPGA technologies will have the potential to make the effect of our approach even more dramatic.
multithreading, where a thread is left to run until completion. However, the compiler must make sure that all data is available to the thread before it is activated. For this reason, the compiler must identify instructions that can be implemented with split-phase operations. Such an instruction is a load from remote memory. Two distinct phases are used for its implementation. The load operation is actually initiated in the first phase (within the thread where the load instruction appears). The instruction that requires the returned value as input then resides in a different thread. This split-phase technique guarantees the completion of the first thread without extra memory-access delay.
EARTH (Efficient Architecture for Running THreads) is a multiprocessor that contains multithreaded nodes [8]. Each node contains a COTS RISC processor (called EU: Execution Unit) for executing threads sequentially and an ASIC synchronization unit (SU) that supports dataflow-like thread synchronizations, scheduling and remote memory requests. A ready queue contains the IDs of threads ready to execute and EU chooses a thread to run from this queue. EU executes the thread to completion and then chooses the next one from the ready queue.
EU's interface with the network and SU is implemented with an event queue that stores messages. SU manages the latter queue. The local memory shared by EU and SU in the local node is part of the global address space. A thread is activated when all its input data become available. SU is in charge of finding out that all of its data is available and sends its ID to the ready queue. A sync(hronization) signal is sent by the producer of a value to each of the corresponding consumers. The sync signal is directed to a specific sync slot. Three fields constitute the sync slot, namely reset count, sync count and thread ID. The reset count is the total number of sync signals required to activate the thread. The sync count is the current number of sync signals still needed to activate the thread (this count is decremented with each arriving sync signal). When it reaches zero, it is set back to its original value and the thread ID is placed in the ready queue. Frames are allocated dynamically using a heap structure. Each frame contains local variables and sync slots. The thread ID is actually a pair containing the starting address of the corresponding frame and a pointer to the first instruction in the thread. The code can explicitly interlink frames by passing frame pointers from one function to another. User instructions can access only EU, not SU. The implementation of EU with a COTS processor implies that its communication with SU is made via loads and stores to special addresses. However, multithreading does not implement dataflow at the instruction level and for the entire program. Also, multithreading and prefetching significantly increase the memory bandwidth requirements.
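The sync-slot mechanism described above can be illustrated with a minimal Python sketch. It is not EARTH code: the names `SyncSlot` and `deliver_sync`, the example addresses and the use of a plain queue for the ready queue are assumptions made only for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class SyncSlot:
    reset_count: int   # total sync signals required to activate the thread
    sync_count: int    # sync signals still needed before activation
    thread_id: tuple   # (frame address, pointer to first instruction)


def deliver_sync(slot: SyncSlot, ready_queue: deque) -> None:
    """Process one arriving sync signal for the given slot (hypothetical helper)."""
    slot.sync_count -= 1
    if slot.sync_count == 0:
        # Restore the counter and hand the thread ID to the ready queue.
        slot.sync_count = slot.reset_count
        ready_queue.append(slot.thread_id)


ready = deque()
slot = SyncSlot(reset_count=2, sync_count=2, thread_id=(0x100, 0x40))
deliver_sync(slot, ready)   # first producer signals; thread is not yet ready
deliver_sync(slot, ready)   # second producer signals; thread becomes ready
```

After the second signal the thread ID sits in the ready queue and the slot is re-armed with its reset count, so the same thread can be activated again later.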
The conceptual simplicity of dataflow computation struck a chord in many computer designers, but the lack of methods to efficiently implement dataflow has been a great challenge. As discussed earlier, the problem of implementing dataflow has been tackled numerous times at the processor or multi-processor levels. Dataflow computers in the past have been implemented by using a modified processor, often called the Processing Element (PE), which was composed of a processing unit along with memory to store partially active instructions and tags.
The PE would also contain a matching unit for incoming tokens/tags and was assigned the task of sending tags to, and receiving tags from, other PEs. Most of the inactive instructions would reside in some memory outside the PE.
Dataflow implementation at the intelligent memory level has not yet been attempted. Each dataflow instruction in this memory should have its own logic unit that determines when the instruction is ready to execute. This approach should incur less overhead than previous methods. Our objective is to design and implement a proof-of-concept dataflow computer based on intelligent memories that can be prototyped with current FPGAs. Our prototype should demonstrate the viability of our approach while future FPGAs will have the potential to better match the requirements of our solution.
II. MORE DATAFLOW COMPUTING CHALLENGES
It is difficult to implement on dataflow machines simple programming constructs that are taken for granted in von Neumann architectures, like conditionals, loops and modules (procedures and functions). Accommodating loops and conditionals requires nodes that implement controlled branching. For efficient implementation of loops, each iteration can be executed as a separate instance/copy of the reentrant subgraph (representing the loop code).
This code-copying method requires facilities to create a new instance of a subgraph and to direct tokens to the appropriate instance. A potentially more efficient way to implement code copying is to share the node descriptions between the different instances of a graph without confusing tokens that belong to separate instances. This is accomplished by attaching a tag to each token that identifies the instance of the node that it is directed to. These tagged-token architectures enable a node (instruction) if all its input arcs contain tokens with identical tags. This method increases concurrency but its implementation is not easy and involves considerable overhead. Problems with procedure calls are similar to those with reentrancy. The methods described above can still be applied. In code-copying architectures, a copy of the called procedure is made. In tagged-token architectures, a new tag area is allocated for each procedure call so that each invocation executes in its own context. Nested procedure calls, recursion and co-routines can, therefore, be implemented without any additional problems. However, it is required to direct the output tokens of the procedure to the proper calling location.
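The tagged-token enabling rule can be sketched in a few lines of Python. This is only an illustration of the general idea, not the mechanism of any particular machine: the class `TaggedNode`, the integer tags and the two-arc restriction are assumptions.

```python
from collections import defaultdict


class TaggedNode:
    """A dyadic node that fires once both input arcs hold tokens with the same tag."""

    def __init__(self, op):
        self.op = op
        self.waiting = defaultdict(dict)  # tag -> {arc index: value}

    def receive(self, tag, arc, value):
        """Store a token; return (tag, result) if the node becomes enabled, else None."""
        slots = self.waiting[tag]
        slots[arc] = value
        if len(slots) == 2:
            # Both arcs carry tokens of this instance: consume them and fire.
            del self.waiting[tag]
            return tag, self.op(slots[0], slots[1])
        return None


add = TaggedNode(lambda a, b: a + b)
first = add.receive(tag=1, arc=0, value=10)   # instance 1, left operand
second = add.receive(tag=2, arc=0, value=20)  # instance 2, left operand
third = add.receive(tag=1, arc=1, value=5)    # completes instance 1 only
```

Tokens of instance 2 stay parked until their partner arrives, which is how one node description can serve several concurrent instances without mixing their operands. The per-tag storage also hints at the matching overhead mentioned above.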
Machines that handle reentrancy by the lock or acknowledge method are called static. Those involving code copying or tagged tokens are called dynamic. Static machines are much simpler to implement than dynamic machines but for most algorithms their effective concurrency is lower. The earliest design at MIT had a two-stage structure with heterogeneous functional units, where each enabling unit was dedicated to one node [2]. It was later extended into a series of machines differing in the way they handled reentrancy and data structures. They ranged from the elementary Form I processor, which was static and could only handle elementary data, to the full-fledged Form IV processor, which had extensive structure facilities and could copy subgraphs on demand. Sandia National Laboratories designed and implemented around 1990 the εpsilon static dataflow computer [5]. It was designed as a scalable multiprocessor architecture, consisting of εpsilon processors and structure memory units connected with a packet-switched network. The whole design was implemented on a single board using COTS components.
III. OVERVIEW OF OUR DATAFLOW COMPUTER DESIGN
The basic structure of our dataflow computer is shown in Figure 1. It has three major components: the instruction queue (IQ), the data flow memory (DFM) and the processor pool (PP).
A. Data Flow Memory (DFM)
DFM is the heart of our design. It is a special kind of memory, where each memory location is an intelligent cell consisting of the instruction (memory cell) and processing element (PE) parts. The instruction part is in turn divided into seven components as shown in Figure 2, where dyadic operations are assumed: instruction opcode (OPCODE), source of the first operand (D1A), source of the second operand (D2A), source of the clause operand (CAD), operand to be obtained from the first source (OPD1), operand to be obtained from the second source (OPD2) and flags that control the behavior of the instruction (FLAGS). The little arrows within each cell in Figure 2 denote the direction of information flow between the cell and the PE. The instruction bus and the result bus are also shown along with the format in which the PE communicates with them.
The result of an instruction that finishes executing is immediately broadcast to all cells in DFM. Each result packet contains the result along with the address of the instruction in DFM that produced it. The PE in each cell is responsible for picking up broadcast result packets sent by the processor pool (PP). If the address in a broadcast result packet is equal to either D1A or D2A for an instruction in DFM, then the result is written into OPD1 or OPD2, respectively, and appropriate flags are set. When both OPD1 and OPD2 are available (determined by examining the appropriate flags) to the dyadic instruction, the PE sends an executable packet (composed of the instruction opcode, operands and cell address of the sending PE) to the bank arbitrator. It is important to point out here that the bus on which the result packets are broadcast is independent of the bus on which executable packets are relayed to the bank arbitrator. This is done to avoid congestion that would occur if only one bus were used. The CAD field is used to store the address of an instruction that sends the clause. A clause is a boolean value stored as a flag in the flags field; it acts as a permission for the instruction that needs it, i.e. an instruction will execute only if its clause bit is 1. The clause is useful in the construction and execution of conditional or looping program constructs, as discussed later. The Bus Controller (BC) in DFM controls the arbitration of the result packets sent by the processor pool to DFM on the Result Bus.
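The behaviour of one intelligent cell can be modelled in software as a rough sketch. The Python below is not the prototype's logic: the class `Cell`, the `ADD` mnemonic and the decision to treat the clause as already granted are assumptions made to keep the example small.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    address: int          # this cell's own address in DFM
    opcode: str           # hypothetical mnemonic
    d1a: int              # address of the instruction producing operand 1
    d2a: int              # address of the instruction producing operand 2
    opd1: int = 0
    opd2: int = 0
    d1_obtained: bool = False
    d2_obtained: bool = False
    clause: bool = True   # assumed already granted in this sketch

    def on_result(self, origin: int, value: int):
        """Handle one broadcast result packet (originating address, value).

        Returns an executable packet (opcode, opd1, opd2, cell address)
        when the instruction becomes ready, otherwise None.
        """
        matched = False
        if origin == self.d1a:
            self.opd1, self.d1_obtained, matched = value, True, True
        if origin == self.d2a:
            self.opd2, self.d2_obtained, matched = value, True, True
        if matched and self.d1_obtained and self.d2_obtained and self.clause:
            return (self.opcode, self.opd1, self.opd2, self.address)
        return None


cell = Cell(address=12, opcode="ADD", d1a=3, d2a=7)
ignored = cell.on_result(origin=5, value=99)   # no address match
partial = cell.on_result(origin=3, value=4)    # fills OPD1 only
packet = cell.on_result(origin=7, value=6)     # fills OPD2, cell is ready
```

The cell address travels with the executable packet so that the eventual result can be broadcast with the correct originating address.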
FIGURES 1, 2 GO HERE
B. Instruction Queue (IQ)
IQ is an intermediate buffer between DFM and PP. It is made up of the Bank Arbitrator (BA) and multiple banks of memory. The number of banks in IQ is equal to the number of processors in PP. Each processor is assigned one bank exclusively. This assures that all processors can access the memory at the same time; it also keeps the initial design simple. Ideally, a crossbar switch should be implemented so that a processor could access any bank. No processor is busy or idles all the time during the execution; this is because BA makes sure that the number of executable instructions allotted to each memory bank is the same, using a round robin scheme.
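The round-robin allotment performed by BA can be sketched as follows. This is a software illustration only: the class `BankArbitrator`, the list-based banks and the packet labels are assumptions, and bank capacity is ignored here.

```python
class BankArbitrator:
    """Distributes executable packets over the banks in round-robin order."""

    def __init__(self, num_banks):
        self.banks = [[] for _ in range(num_banks)]
        self.next_bank = 0

    def dispatch(self, packet):
        # Place the packet in the current bank, then advance to the next one.
        self.banks[self.next_bank].append(packet)
        self.next_bank = (self.next_bank + 1) % len(self.banks)


ba = BankArbitrator(num_banks=2)
for packet in ["i0", "i1", "i2", "i3", "i4"]:
    ba.dispatch(packet)
```

With this policy the number of packets held by any two banks never differs by more than one, which is what keeps the processors evenly loaded.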
C. Processor Pool (PP)
PP is a pool of execution units. Each processor sequentially executes ready instructions (executables) from its assigned memory bank. The instructions are executed in no particular order because all instructions in the memory bank are waiting to be executed. When a processor receives an instruction, it also gets the originating address (OA) of that instruction. After execution, the produced result along with OA is sent to the Bus Controller of DFM in the form of a result packet; this packet is then broadcast to all cells via the Result Bus. This design is suitable for small-scale dataflow computers, i.e. machines having up to about 32 processors in the processor pool. This restriction is foreseen for two reasons. First, the number of memory banks in IQ will increase linearly with the number of processors, which for a large number of processors may be unrealistic. Secondly, having a large number of processors may increase the number of broadcast messages, causing congestion on the bus.
IV. FIELD-PROGRAMMABLE GATE ARRAYS
The fastest performance is achieved when a design is implemented directly on silicon, with dedicated/specialized logic for all units. Since the major objective of our prototype was to prove the viability of our intelligent memory based implementation of dataflow, the performance of the prototype was to be of secondary importance. Hence, a device was needed that could easily be reconfigured if design flaws were detected. In addition, this device was required to have sufficient logic to accommodate the different components and also implement intelligent memory.
The ideal solution for prototyping the dataflow computer was to use Field Programmable Gate Arrays (FPGAs) that have recently achieved tremendous technological advances.
FPGAs are user-programmable devices, which are now widely accepted as an excellent technology for implementing and prototyping moderately large digital circuits. They offer a cost-effective solution for prototyping because they have a fast turnaround time (i.e. short design and development cycles). Since FPGAs can be reprogrammed an unlimited number of times, they can be used in innovative designs where hardware is changed dynamically and must be adapted to different user applications. Though dynamically changeable hardware is not a consideration in this project, this feature is particularly useful when prototyping, where the design is constantly being changed and updated. In terms of speed, most FPGAs are slower than Complex Programmable Logic Devices. However, rapid advances in FPGA technology are quickly closing the gap on speed and density. One prominent disadvantage of FPGA technology is that circuit propagation delays are dependent on the performance of the design implementation tools used, which is not a major handicap for this project.
The internal architecture of an FPGA consists of several uncommitted logic blocks in which the design is to be encoded. These logic blocks consist of several universal gates that can be programmed to operate like multiplexers, decoders, registers, random access memory (RAM) and many other digital logic devices. The logic blocks are connected through a set of programmable interconnects that implement buses and direct connections.
FPGAs have elaborate clocking schemes, and optimization methods (programs) can be used to obtain faster hardware with fewer logic blocks. Since logic blocks are independent, a single FPGA can be used to implement multiple units, all of which can work independently and in parallel (a key factor in the design of a dataflow computer).
The FPGAs used for our prototype are those made by Altera Corp. Three Altera devices were chosen to implement the three major components of our dataflow computer: DFM was implemented on a FLEX10KE, IQ and the bank arbitrator were implemented on an ACEX1K and PP was implemented on a MAX9000. The FLEX10KE was a good candidate for implementing DFM because each device provides up to 98,304 RAM bits that can be configured as dual-port memory. Also, each device contains sufficient logic, up to 200,000 gates depending on the device chosen, to implement the PEs in DFM. All these are connected together by a fast interconnect network that has predictable interconnect delays. The ACEX1K is a less powerful relative of the FLEX10KE. It has the same features as the FLEX10KE, except there is less of everything. The largest ACEX1K has 49,152 RAM bits and about 100,000 gates. Since the logic and memory requirements for IQ are less than those for DFM, this was a good choice. The MAX9000 is the smallest of the three devices used. It contains no memory bits, but has sufficient logic gates and flip-flops, up to 12,000 and 772, respectively, to implement PP.
A major factor in deciding to use Altera's FPGAs to develop the prototype was the availability of a free development kit from Altera called MAX+PLUS II BASELINE version 10. In addition, Altera has a university support program through which it is possible to get free manuals on how to use the software, a programming language reference and tutorial on AHDL (Altera Hardware Description Language), and sample boards for hardware development. In conclusion, Altera FPGAs provided a low-cost, highly configurable solution along with a simple environment to develop the dataflow prototype.
V. DFM DESCRIPTION
For our prototype, DFM is divided into the Queue Buffer (QB), DFM Cells and PEs, as shown in Figure 3. The internal structure of a DFM cell is shown in Figure 4. Each DFM memory cell is broken up into four sections, Cell Sections CS1 through CS4. Each cell section holds a piece of the instruction to be executed. Controlling each CS is a Logic Unit (LU). The LUs, LU1 through LU4, manage CS1 through CS4, respectively, and constitute a PE. The bits in CSs are grouped, so that minimum communication is needed between different LUs controlling different CSs. By having an LU controlling only one CS, work on CSs is done independently and in parallel, avoiding potential waits that would arise if an LU controlled more than one CS across the cell.
FIGURES 3, 4 GO HERE
A. Cell Section 1 (CS1)
CS1 is made up of nine fields. Each field is described below.
• Operand1 (OPD1): it holds the first operand, which is received from the instruction whose address matches the value in D1A (source of operand1). OPD1 is copied into CS1 after LU3 picks up a result packet and makes a match between the originating address (OA) in the packet and D1A.
• Operand2 (OPD2): it holds the second operand, which is received from the instruction whose address matches the value in D2A (source of operand2). OPD2 is copied into CS1 after LU4 picks up a result packet and makes a match between OA in the packet and D2A.
• Opcode (OP): the opcode of the instruction to be executed.
• Operand1 Obtained (D1O): a one-bit flag set by LU1 when LU1 receives OPD1 from LU3. This bit may be set at compile or run time. If it is set at compile time, then the operand is already available (immediate value).
• Operand2 Obtained (D2O): a one-bit flag set by LU1 when LU1 receives OPD2 from LU4. This bit may be set at compile or run time. If it is set at compile time, then the operand is already available (immediate value).
• Clause Answer (CAN): a one-bit flag which holds the boolean answer that LU2 picked up from the packet it received from the instruction whose OA matches the value in CAD. This bit is used during execution of conditional and loop constructs. It is set to 1 at compile time if a clause is not required by the instruction.
• Operand1 Reuse (D1U) and Operand2 Reuse (D2U): fields D1U, D2U and LP (described below) are used to implement loops. Often instructions inside a loop require values from outside the loop, but instructions from outside the loop execute and transmit their values only once (i.e. an instruction is fired only once). Hence, the instruction inside the loop will receive that value only once and will normally fire only once. To overcome this problem, the reuse bit is used. Setting this bit at compile time tells LU1 that the received value has to be reused, so LU1 leaves unchanged the D1O/D2O bit of a firing instruction whose corresponding D1U/D2U bit is set.
• Loop (LP): loops are difficult to manage in dataflow machines, but they also are the most commonly used constructs in programming. Usually the value of the loop-controlling variable is checked before a loop is entered (e.g. "while" and "for" loops). Thus, the first time the value of the variable is obtained from outside the loop and every subsequent time it is obtained from inside the loop. Hence, we need a primitive to obtain the same variable from two different sources. Also, dataflow machines are runaway machines; firing one instruction subsequently fires many instructions in different parts of the code. In particular, when a dataflow computer executes a loop it is highly probable that the machine may be simultaneously executing different iterations of the same loop. This would not be a problem if there were no dependencies between consecutive iterations, but would be a disaster if there were any. Some way to control the execution of the loop is needed.
To accomplish this control, the 2-bit flag LOOP (LP) is used to implement loop constructs. It can only be set at compile time with a value from 0 to 3. Only instructions that enclose a loop have their LP values greater than 0. All other instructions inside and outside the loop have LP=0. The LP values of 1 through 3 are not used to distinguish between different types of loop constructs but to control how a loop executes, as follows:
• LP = 1 is used to initiate a loop. The instruction which has its LP set to 1 is the SPecial instruction SP. It is not actually an instruction at all (it is never sent for execution). For a loop, there would be an SP instruction whose LP value would be 1; its D1A field would contain the OA from where the value of the loop-controlling variable is obtained the first time and D2A would contain the OA from where the value is obtained all other times. The "firing" of SP does not require the instruction to proceed to PP.
• An SP instruction is always followed by a conditional instruction in the program graph. The latter fires only if SP has received a value. The result of this conditional instruction is sent out as a clause to all instructions inside the loop. This instruction has LP=2, because if it is treated as a regular instruction (LP = 0), upon execution its CAN bit will be reset instantly and this instruction will not execute again (until its CAN bit is set). LP=2 instructs LU1 to leave the CAN bit intact after instruction firing.
• An instruction with LP=3 is used along with SP to avoid situations where the runaway effect will cause a problem due to data dependencies between consecutive loop iterations. With LP=3, a process similar to when LP=1 goes into action, except that instead of "firing" SP after the value of the variable is obtained from either D1A or D2A, SP will fire only when the value is obtained from both D1A and D2A. D1A is always the address of the SP instruction with LP=1 (beginning of loop) and D2A is always the address of the instruction which must be executed before the next iteration starts.
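The readiness conditions implied by the LP values can be summarized in a short Python sketch. It is one reading of the description above, not the prototype's LU1 logic: the function name `ready_to_fire` is hypothetical, and treating the CAN bit as a gate for every LP value is an assumption.

```python
def ready_to_fire(lp: int, d1o: bool, d2o: bool, can: bool) -> bool:
    """Return True if LU1 may fire (or forward) the instruction.

    lp  -- LOOP flag, 0 to 3
    d1o -- Operand1 Obtained flag
    d2o -- Operand2 Obtained flag
    can -- Clause Answer flag (assumed to gate all four LP values)
    """
    if lp not in (0, 1, 2, 3):
        raise ValueError("LP is a 2-bit flag")
    if not can:
        return False
    if lp == 1:
        # SP forwards a value as soon as it arrives from either source.
        return d1o or d2o
    # Regular instructions (LP=0), the loop conditional (LP=2) and the
    # LP=3 directive all wait for both values.
    return d1o and d2o


sp_ready = ready_to_fire(lp=1, d1o=True, d2o=False, can=True)
lp3_ready = ready_to_fire(lp=3, d1o=True, d2o=False, can=True)
```

The sketch covers only when firing is allowed. What happens to the flags afterwards (for example, LP=2 leaving the CAN bit intact) is not modelled.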
B. Cell Section 2 (CS2)
It is made up of two fields.
• Clause Address (CAD): it stores the address of the instruction that sends a clause. The instruction holding this field will execute only if its CAN field contains 1 (clause received). If the instruction does not need a clause, then its CAN field is set to 1 at compile time and the value in the CAD field is irrelevant.
• Clause Required (CR): a one-bit flag set to 1 at compile time if the instruction requires a clause.
C. Cell Section 3 (CS3)
• Operand1 Address (D1A): it holds the address of the instruction from which the value of the first operand is expected. If the first operand is an immediate value, then the address in this field is irrelevant.
• Operand1 Required (D1R): a one-bit flag set at compile time to let LU3 know if this instruction needs to receive the first operand. When OPD1 is an immediate value, this bit is set to 0.
Cell
D. Queue Buffer (QB)
It is a repository where result packets coming on the Result Bus are deposited if the results are coming faster than they can be absorbed. As LUs become free, QB sends queued result packets in the order that they were received.
VI. DETAILED IMPLEMENTATION
This section covers the actual implementation of the different components in the dataflow computer, and the design decisions that were made that did not completely conform to the design presented earlier. The reasons for these decisions are also discussed. Very detailed diagrams are included in the Appendix.
A. DFM Implementation
The structure of a single DFM cell was shown in Figure 4. It is 61 bits long and all addresses are 11 bits long, thus giving a total addressable memory of 2K words. All operands are 7 bits wide. This small operand size is not considered a limitation because our main purpose is to prove the viability of our design. While implementing DFM, some design decisions were made so that it would fit in a single Altera device. One in particular was the implementation of the PE, i.e. units LU1-LU4. Ideally, each intelligent cell should have its own set of LU1-LU4, but that required more logic than would fit in a single Altera device. Thus, each LU was assigned to a group of cells called a sub-block, as shown in Figure 5. Each LU controls a sub-block of 256 cells; LU1 controls a sub-block of 256 CS1s, LU2 controls a sub-block of 256 CS2s, and so on. The group of five sub-blocks formed by LU1-LU5 is called a block. There are eight blocks in this design, thus giving a total space of 2K memory words.
Notice that a new LU, namely LU5, was introduced. LU5 controls QB in a block and is not part of the PE, which is made up of LU1-LU4; it takes no part in the manipulation of instructions. safe to send an instruction for execution. The three most significant bits of an 11-bit address determine the block number. Address 255 was used to initiate execution by sending a positive clause on the Result Bus. Finally, the block was broken up into two parts because it could not fit in a single Altera device (see Appendix).
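The address split can be written out as a small Python sketch. The text states only that the three most significant bits select the block. Using the remaining eight bits as the cell offset within a 256-cell sub-block is an inference from the sizes given above, and the function name `split_address` is hypothetical.

```python
BLOCK_BITS = 3   # three most significant bits select one of eight blocks
CELL_BITS = 8    # assumed: remaining bits index one of 256 cells


def split_address(address: int):
    """Split an 11-bit DFM address into (block number, cell offset)."""
    if not 0 <= address < 1 << (BLOCK_BITS + CELL_BITS):
        raise ValueError("address must fit in 11 bits")
    block = address >> CELL_BITS
    offset = address & ((1 << CELL_BITS) - 1)
    return block, offset


start_cell = split_address(255)    # the address used to initiate execution
last_cell = split_address(2047)    # highest address in the 2K-word space
```

Under this assumed split, address 255 is the last cell of block 0 and address 2047 is the last cell of block 7.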
FIGURE 6 GOES HERE

A.2. Logic Units
An LU1 keeps tabs on which instructions have already been fired and those that need to refire (loop instructions).
LU234 process anything that is sent by LU5 in the format shown in Figure 7 . LU1 is also in charge of dispatching instructions for firing when all operands/clauses become available. It sends them on the Instruction Bus in the format shown in Figure 8 . LU1 processes operands and clauses sent by LU234 in the format of Figure 9 .
FIGURES 7, 8, 9 GO HERE
B. Instruction Queue (IQ)
The instruction queue is composed of the bank arbitrator (BA) and memory banks. A memory bank within IQ is shown in Figure 10 . Each bank has two pointers, the New Instruction Load Pointer (NILP) and Next Instruction Execute Pointer (NIEP), to implement a circular queue. NILP holds the address of the next available location where the bank arbitrator can insert an executable coming from DFM. NIEP holds the address of the next instruction that is ready to execute. The processor associated with the bank uses this address to fetch the next instruction. For the sake of simplicity, we implemented two processors. MC (memory controller) in each bank uses two signals to communicate with its processor. Instr_RD is used by a bank to indicate to its processor that a new instruction(s) is(are) available for execution. When an instruction has finished executing, the processor notifies its MC via Instr_done, upon which MC increments NIEP, if NIEP ≠ NILP. Each memory bank is implemented using a dual port memory, allowing BA to send instructions to it while its processor can read from it. BA puts instructions into each bank using a round robin scheme. Each bank has sixteen words in its circular queue. The number of memory words in each bank is not an optimized number, but an arbitrary number was chosen to build the prototype.
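The two-pointer circular queue of a memory bank can be modelled as a sketch in Python. Several details are assumptions because the text does not specify them: `Instr_RD` is taken to be high whenever NIEP differs from NILP, a full bank refuses new instructions (keeping one word free to tell full from empty), and the class and method names are hypothetical.

```python
class MemoryBank:
    SIZE = 16  # words per bank in the prototype

    def __init__(self):
        self.words = [None] * self.SIZE
        self.nilp = 0  # New Instruction Load Pointer
        self.niep = 0  # Next Instruction Execute Pointer

    def instr_rd(self) -> bool:
        """Assumed meaning: high while at least one instruction is queued."""
        return self.niep != self.nilp

    def load(self, packet) -> bool:
        """BA inserts an executable; returns False if the bank is full (assumed policy)."""
        nxt = (self.nilp + 1) % self.SIZE
        if nxt == self.niep:
            return False
        self.words[self.nilp] = packet
        self.nilp = nxt
        return True

    def fetch(self):
        """The processor reads the instruction at NIEP, or None if the bank is empty."""
        return self.words[self.niep] if self.instr_rd() else None

    def instr_done(self) -> None:
        """MC advances NIEP only if NIEP differs from NILP."""
        if self.niep != self.nilp:
            self.niep = (self.niep + 1) % self.SIZE


bank = MemoryBank()
bank.load("i0")
bank.load("i1")
first = bank.fetch()
bank.instr_done()
second = bank.fetch()
bank.instr_done()
```

In hardware the dual-port memory lets BA write at NILP while the processor reads at NIEP. The sequential calls here only mimic that ordering.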
FIGURE 10 GOES HERE
C. Processor Pool (PP)
Each processor is associated with only one memory bank in the memory pool, and vice versa. The processors are more like execution units than general-purpose processors. When a processor is ready for execution, it looks for the Instr_RD signal from its corresponding memory bank. When it sees the signal go high, it reads the instruction from its memory bank and executes it. An instruction is always executed with OPD1 on the left side and OPD2 on the right (e.g. OPD1 - OPD2, OPD1 / OPD2, check if OPD1 > OPD2, etc). The processor then sends a request to the Result Bus Controller in DFM, asking permission to send the result to DFM. Upon acknowledgement, the requesting processor sends the result over to DFM in the format shown in Figure 7 and sends an Instr_done signal to its bank, which in turn readies a new instruction for execution. Since all dataflow maintenance work is relegated to DFM, either processor is blindly executing instructions and sending signals, independently of the other. The instructions that the processor can currently execute are set to the bare minimum, just enough to make this dataflow machine work. The instructions and their corresponding opcodes are shown in Table I.
TABLE I GOES HERE
VII. PROGRAMMING THE DATAFLOW COMPUTER
All instructions are commonly used instructions, except for SP and LK. This computer has six compare instructions, which is unlike most machines that usually have four. The two compare instructions not usually found in other computers are CGE and CLE, because these two instructions can normally be made up by combining other compare instructions (e.g. CGT & CEQ for CGE). However, this is not suitable in our architecture; e.g. when a combination of instructions is used to build a conditional instruction to control a loop, each instruction within the loop should be capable of receiving two clauses and firing if either one is true. This ability to receive two clauses is not provided in this implementation because it would make the DFM design more complex. SP is not really an instruction because it is never sent to the processor pool for execution. Though its opcode is 0000, it is irrelevant. SP is a directive for LU1 as explained earlier, instructing it to forward all values it receives for SP to other cells in DFM. What constitutes an SP instruction is the value of LP in an instruction; if the value of LP is 1, LU1 realizes that this is a directive and treats it accordingly. Currently, SP is used exclusively as part of a loop construct. When LP=1, LU1 sends the operands it receives from either LU3 or LU4 out onto the Operand Bus, if the CAN bit of SP is 1. SP always has its D1R and D2R bits set, but unlike other instructions where LU1 will wait for both operands to arrive before dispatching them, LU1 dispatches the arriving operands immediately.
LK also is a directive that has its LP set to 3 (opcode irrelevant). While setting LP to 1 makes SP transmit any operand it receives (from either LU3 or LU4) onto the Operand Bus, setting LP to 3 causes LU1 to transmit OPD1 (the value it receives from LU3), but only after LK has received both operands and the CAN bit is set to one. LK is particularly useful in loop constructs where, before incrementing the control variable of a loop it may be necessary to check if all instructions within the loop have been executed. This is to avoid race conditions, which would start a new iteration before the completion of an earlier one. Concurrently executing multiple iterations of a loop is often required in parallel computing, but it is disastrous if there are dependencies between iterations.
Before we present an example program, it is important to learn to interpret the nodes of dataflow graphs. Figure 11 shows three primitives that are used to represent nodes. Figure 11 are available to this program and those that it is waiting for. Labeling the arcs with memory addresses does have the drawback that the flow graph cannot be drawn until the program has been written. Eliminating all memory address labels overcomes this problem but reduces the amount of information the graph conveys. Some output arcs of conditionals get converted into clause arcs for some nodes; this is allowed because the result of executing a conditional is either 0 or 1, so it can serve as a clause arc for a node. Table III shows a program segment that contains conditional statements, reentrant code and reusable variables. The code is presented in two formats: high-level pseudo-code and pseudo-assembly code.
The equivalent program in dataflow machine language is presented in Table IV. A dataflow language and compiler were not developed for our machine, so the code in Table IV is not source code that one would write and compile for this computer; it is the code as it resides in DFM. The flow graph for this code is shown in Figure 12.
TABLES II, III, IV & FIGURE 12 GO HERE
The two loops execute concurrently and the values being used by them are not interdependent. In a von Neumann machine, it would be quite difficult to concurrently execute the two loops because the variable 'x' is needed by both loops; while one (FOR) uses variable 'x', the other (WHILE) modifies 'x'. The FOR loop must finish executing before the WHILE loop starts. However, in the dataflow machine each loop is sent its own copy of 'x', thus allowing the loops to execute concurrently. The FOR loop is bounded by an LK instruction because there are dependencies between consecutive iterations of the loop.
Such dependencies do not exist between consecutive iterations of the WHILE loop, hence no LK instruction is needed to bind the loop. Writing a program for the dataflow machine is tricky and tedious because the programmer has to be aware of the data dependencies in the program. This issue can, however, be overcome using an intelligent compiler.
VIII. CONCLUSIONS AND REMARKS
A new hardware design was presented for dataflow computers. The incorporation of intelligent memories is at the core of the proposed design, so that processing can be separated from all other dataflow-related operations. We took advantage of recent advances in FPGAs to prototype our design. The logic capabilities of FPGAs allow the implementation of intelligent memories for a proof-of-concept approach. Our results prove that our approach is valid, and further technological improvements in FPGAs and/or intelligent memories may make our design even more attractive in a few years' time.
Higher primitives such as code-copying and tagged tokens were not implemented here; hence procedure invocation and indirect memory addressing are not currently possible. However, these features can be added to the current architecture by making some modifications to the design. One feature of this architecture eliminates a problem faced by past dataflow designs, namely the problem of data fan out.
In past designs, the data fan out of an instruction was limited, usually to two or three. Programs to be run on such machines therefore had to be written adhering to this restriction. This was a major drawback, and two schemes were developed to overcome it. The first works through hardware, as implemented in the εpsilon dataflow processor [5] with a repeat hardware unit that circulates the result value using a tagging scheme. The second uses specialized instructions, which hold the addresses of additional instructions (beyond the allowed limit) that need the result. The address of the special instruction is on the list of addresses to which the executing instruction sends its result. When the special instruction receives the result, it forwards this result to its own list of destination addresses. However, no such means need be employed in our architecture, since no instruction maintains a list of destination addresses to send the result to; instead, each instruction holds the addresses of the sources of its two operands. When a result packet is sent out, all instructions pick up the packet and compare the source addresses they hold with the originating address in the result packet.
All the instructions that make a positive match absorb the result (this eliminates the data fan out problem).
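The source-address matching described above can be sketched in software as follows. This is only an illustrative model, not the DFM hardware: the names `Instruction` and `broadcast` and the field names are hypothetical, and the sequential loop stands in for a comparison that every cell would perform on its own when the packet appears on the bus.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Instruction:
    """Hypothetical DFM cell: it stores where its operands come from, not where its result goes."""
    address: int
    src1: int                    # address of the instruction producing operand 1
    src2: int                    # address of the instruction producing operand 2
    opd1: Optional[int] = None
    opd2: Optional[int] = None


def broadcast(memory: List[Instruction], origin: int, value: int) -> List[int]:
    """Offer a result packet (origin, value) to every cell.

    Returns the addresses of the cells that absorbed the value.
    """
    absorbed = []
    for ins in memory:
        matched = False
        if ins.src1 == origin:   # operand 1 comes from the originating address
            ins.opd1 = value
            matched = True
        if ins.src2 == origin:   # operand 2 comes from the originating address
            ins.opd2 = value
            matched = True
        if matched:
            absorbed.append(ins.address)
    return absorbed
```

Because the producer never enumerates its consumers, the number of cells that can absorb one result is not bounded by any field in the producing instruction.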
Removing the processor from dataflow memory maintenance tasks has the advantage of simpler processors and the ability to easily replace simpler execution units with more powerful ones that have a compatible interface. Besides, dataflow in DFM is performed using only the flag bits, thus offering opcode independence.
Hence, a compatible language can be used to write the same program, or new opcodes can be added to the existing language, without affecting dataflow; of course, a compatible processor needs to be used to execute different or new instructions.
APPENDIX FIGURES A.1, A.2 and A.3 GO HERE
