Abstract: In this paper, the Scheduled Dataflow (SDF) architecture, a decoupled memory/execution, multithreaded architecture using nonblocking threads, is presented in detail and evaluated against a superscalar architecture. Recent focus in the field of new processor architectures is mainly on VLIW (e.g., IA-64), superscalar, and superspeculative designs. This trend allows for better performance, but at the expense of increased hardware complexity and, possibly, higher power expenditures resulting from dynamic instruction scheduling. Our research deviates from this trend by exploring a simpler, yet powerful execution paradigm that is based on dataflow and multithreading. A program is partitioned into nonblocking execution threads. In addition, all memory accesses are decoupled from the thread's execution. Data is preloaded into the thread's context (registers) and all results are poststored after the completion of the thread's execution. While multithreading and decoupling are possible with control-flow architectures, SDF makes it easier to coordinate the memory accesses and execution of a thread, as well as to eliminate unnecessary dependencies among instructions. We have compared the execution cycles required by programs on SDF with those required on SimpleScalar (a superscalar simulator), considering the essential aspects of these architectures in order to have a fair comparison. The results show that the SDF architecture can outperform the superscalar. SDF performance scales better with the number of functional units and allows for good exploitation of Thread Level Parallelism (TLP) and available chip area.
INTRODUCTION
The performance gap between processors and memory has widened in recent years and the trend appears to continue in the foreseeable future. In this paper, we present an architecture that addresses this problem and scales better than superscalar processors as the number of pipelines is increased. Our architecture is based on multithreading and dataflow concepts.
Multithreading has been touted as a solution to minimize the loss of CPU cycles by executing several instruction streams simultaneously. While there are several different approaches to multithreading, there is a consensus that multithreading, in general, achieves higher instruction issue rates on processors that contain multiple functional units (e.g., superscalar and VLIW) or multiple processing elements (i.e., chip multiprocessors) [11], [23], [24], [40], [42], [44]. Nevertheless, the search for a multithreaded model and implementation that achieves the best possible performance remains open. Recent efforts like the MP98 [15] show that attention to data dependencies and hardware support for forking multiple threads help increase performance.
We have found that the use of nonblocking, dataflow-based threads is appropriate for improving the performance of superscalar architectures. Dataflow ideas are often utilized in modern processor architectures. However, these architectures rely on conventional programming paradigms and perform runtime transformations of control-flow programs into dataflow programs, requiring complex hardware to detect data and control hazards and to reorder and issue multiple instructions.
Our architecture differs from other multithreaded architectures in two ways: 1) our programming paradigm is based on dataflow, which eliminates the need for runtime instruction scheduling, thus reducing the hardware complexity significantly, and 2) we completely decouple all memory accesses from the execution pipeline. The underlying dataflow and nonblocking models permit a clean separation of memory accesses from execution (which is very difficult to coordinate in other programming models). Data is preloaded into an enabled thread's register context prior to its scheduling on the execution pipeline. After a thread completes execution, the results are poststored from its registers into memory. The instruction set implements the dataflow computational model, while the execution engine relies on control-flow-like sequencing of instructions (hence the name Scheduled Dataflow). Unlike a superscalar, our architecture performs no dynamic Out-of-Order execution and thus eliminates the need for complex instruction issue and retiring hardware. These hardware savings could be utilized to include either more processing units on a chip or more register sets to increase the degree of multithreading (i.e., Thread Level Parallelism). Moreover, it has been reported that significant power is expended by instruction issue logic and that the power consumption increases quadratically with the size of the instruction issue width [27], [43]. Some researchers are exploring mechanisms to construct dependence graphs at runtime to guide which instructions should be examined for readiness (rather than examining all instructions in the issue window) [27]. In our architecture, we perform no dynamic instruction scheduling and do not select instructions for issue from a large window. Thus, our approach naturally obviates the need for runtime construction of dependence graphs.
We have translated several programs into our SDF instruction set. Using a cycle-level simulator developed at the University of Alabama at Huntsville (UAH), we have compared the execution performance of our architecture with that of a conventional superscalar architecture with multiple functional units and aggressive Out-of-Order instruction issue logic, as facilitated by the SimpleScalar tool set [10].
In Section 2, we will present research that is most closely related to ours. In Section 3, we will present our Scheduled Dataflow architecture in detail. Section 4 will discuss the methodology that we used in our evaluation and Section 5 will show our numerical results for real programs.
RELATED RESEARCH AND BACKGROUND

Decoupling Memory Accesses From Execution Pipeline
Decoupling memory accesses from the execution pipeline to overcome the ever-increasing processor-memory communication cost was first introduced in [34]. Since then, larger cache memories have been used to alleviate the memory latency problem. But the gap between processor speed and average memory access time is still a major limitation in achieving high performance. Increasing cache capacities, while consuming an increasingly large silicon area on processor chips, often results in diminishing returns. Decoupled architectures may again present a solution to leaping over the "memory wall." Decoupling ideas were recently used in a multithreaded architecture known as Rhamma [17]. Rhamma uses a conventional control-flow programming paradigm and blocking threads, hence requiring many more thread context switches than our nonblocking dataflow threads. Moreover, SDF groups all Load instructions together into "preload" and all Store instructions together into "poststore." An analytical comparison with the Rhamma architecture was presented in [21] and, on that basis, we found that SDF outperforms Rhamma.
Dataflow Model and Architectures
The dataflow model and architecture have been studied for more than two decades and held the promise of an elegant execution paradigm with the ability to exploit the inherent parallelism available in applications [4], [5], [12], [14], [28], [29], [30]. However, actual implementations of the model have failed to deliver the promised performance. Nevertheless, several features of the dataflow computational model have found their place in modern processor architectures and compiler technology (e.g., Static Single Assignment (SSA) [13], register renaming, dynamic scheduling and Out-of-Order instruction execution [18], I-structure-like synchronization [1], [6], nonblocking threads [8]). Most modern processors utilize complex hardware techniques to detect data and control hazards and to exploit dynamic parallelism, in order to bring the execution engine closer to an idealized dataflow engine. It is our contention that such complexities can be eliminated if a more suitable and direct implementation of the dataflow model can be found. Some of the limitations of the pure dataflow model that prevented its practical implementation include the following: 1) too fine-grained (instruction level) multithreading, 2) difficulty in exploiting memory hierarchies and registers, and 3) asynchronous triggering of instructions. Many researchers have addressed the first two limitations of dataflow architectures [12], [20], [30], [35], [37], [38], [39]. Our current architecture specifically addresses the third limitation. Some researchers have proposed hybrid designs in which dataflow scheduling is applied only at the thread level (i.e., macro-dataflow), while each thread is comprised of conventional control-flow instructions [16], [19], [31]. In such systems, the instructions within a thread do not retain functional properties and, hence, introduce Write-After-Write (WAW) and Write-After-Read (WAR) dependencies. This, in turn, requires complex hardware to perform dynamic instruction scheduling. In our system, the instructions within a thread retain the functional properties of the dataflow model and thus eliminate the need for such hardware. The results (or data) flow from instruction to instruction; the data destined for an instruction is stored in registers exclusively assigned to that instruction. Our deviation, in the Scheduled Dataflow (SDF) system, from pure dataflow is a deviation from data-driven asynchronous 1 execution (or token-driven execution) that is traditionally used for the implementation of "pure" dataflow processors.

1. It is often believed that dataflow means parallel execution. The dataflow model of computation only exposes the inherent parallelism; the parallelism can only be exploited if multiple functional units or processing elements are available. In the presence of a single processing element (or functional unit), dataflow instructions still execute sequentially, albeit asynchronously.
THE SCHEDULED DATAFLOW PROCESSOR
We will first show how it is possible to "schedule" dataflow instructions. Let us consider a simple dataflow graph, shown in Fig. 1, and the corresponding SDF code. Each node of the graph is translated into an SDF instruction. The two source operands (i.e., input data) destined for a dyadic SDF instruction are stored in a pair of registers specifically assigned to that instruction; a pair consists of even-odd registers; for example, RR2 refers to registers R2 and R3 within a specified thread context. Predecessor instructions store their data in either the left or right half of a register pair, as dictated by the data dependencies of the program. Unlike in conventional dataflow architectures, e.g., Monsoon [29] and the Tagged-Token Dataflow Architecture (TTDA) [7], in our architecture an instruction is not scheduled for execution immediately when its operands are matched (i.e., available). Instead, the operands are saved in the register pair associated with the instruction, and the enabled instruction is scheduled for execution based on compile-time ordering of the dataflow graph, eliminating the asynchronous execution implied by dataflow.
Assuming that the inputs A, B, X, and Y to the dataflow graph of Fig. 1 are available in R2, R3, R4, and R5, respectively (this is achieved during preload, as explained below), the five instructions shown in Fig. 2 will be scheduled for execution sequentially and perform the necessary computations, as indicated by the graph in Fig. 1. Note that a pair of registers is specified as the source operand of each instruction. For example, ADD RR2, R11, R13 adds the contents of registers R2 and R3 and stores the result in R11 and R13. Our instructions still retain the functional nature of dataflow: each instruction stores its results in the registers that are specifically associated with destination instructions. There are no Write-After-Read (WAR, or conceptually equivalent anti) and Write-After-Write (WAW, or equivalent output) dependencies among our instructions.
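Since Fig. 2 is not reproduced here, the following is a minimal sketch of what such an EP code block could look like. Only ADD RR2, R11, R13 and the result registers R14 and R15 are taken from the text; the SUB, MULT, and DIV operands and the final FORKSP are assumptions for illustration.

    ADD   RR2, R11, R13   ; A+B -> R11 and R13 (one half of each pair RR10, RR12)
    SUB   RR4, R10, R12   ; X-Y -> R10 and R12 (the other half of each pair; assumed)
    MULT  RR10, R14       ; (A+B)*(X-Y) -> R14, to be poststored
    DIV   RR12, R15       ; (X-Y)/(A+B) -> R15 (operand order assumed)
    FORKSP                ; computation done; hand the thread back to SP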
The code shown in Fig. 2 is for the Execution Pipeline (EP). Since our architecture is a decoupled multithreaded system, we use two separate units: Synchronization Pipeline (SP) and Execution Pipeline (EP). SP prepares enabled threads for execution on EP by preloading threads' context (i.e., registers) with data from the threads' Frame Memories (Frame Memory is a portion of memory allocated to a thread) and poststoring results from completed threads' registers in frame memories of destination threads.
To illustrate the preload concept, consider Fig. 1 and the SDF code shown in Fig. 2. Assume that the code block of Fig. 2 (viewed as a thread) receives the four inputs (A, B, X, Y) from other threads. Inputs to a thread are saved in the frame memory allocated for the thread when the thread is created, and a thread is enabled for execution only when it receives all of its inputs (as specified by its synchronization count). When enabled, a register context is allocated to the thread and the input data for the thread is "preloaded" from its frame memory into its registers.
Assuming that the inputs for the thread (A, B, X, and Y) are stored in its frame (pointed to by RFP) at offsets 2, 3, 4, and 5, the first four LOAD instructions of Fig. 3 (executed by SP) preload the thread's data into registers R2, R3, R4, and R5 of the register context allocated to the thread. After the preload, the thread is scheduled for execution on EP. The EP then uses only its registers during the execution of the thread body (Fig. 2). Consider that the results generated by MULT and DIV in our code example (i.e., R14 and R15) are needed by two other threads. The frame pointers and frame-offsets for the destination threads are made available to the current thread in register pairs R6|R7 and R8|R9 by the last four LOAD instructions of Fig. 3.
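A sketch of what the preload code of Fig. 3 might look like follows; the LOAD RFP|offset, Rdest syntax is inferred from the description above, and the frame offsets 6 through 9 for the destination-thread information, as well as the closing FORKEP, are assumptions.

    LOAD  RFP|2, R2   ; preload input A from the thread's frame into R2
    LOAD  RFP|3, R3   ; preload input B
    LOAD  RFP|4, R4   ; preload input X
    LOAD  RFP|5, R5   ; preload input Y
    LOAD  RFP|6, R6   ; frame pointer of the thread awaiting the MULT result
    LOAD  RFP|7, R7   ; offset within that destination thread's frame
    LOAD  RFP|8, R8   ; frame pointer of the thread awaiting the DIV result
    LOAD  RFP|9, R9   ; offset within that destination thread's frame
    FORKEP            ; preload complete; schedule the thread on EP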
The instructions shown in Fig. 4 transfer (or poststore) the results of the current thread (i.e., from MULT in R14 and DIV in R15) to the frames pointed to by R6 and R8, at the frame-offsets contained in R7 and R9. SP executes STORE instructions after a thread completes its execution on EP. As can be observed from this example, when a thread is created, it is necessary to provide the thread with the destination threads' frame pointers and offsets.
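Correspondingly, the poststore code of Fig. 4 might look as follows; the STORE syntax mirrors the assumed LOAD syntax above, and the final FFREE follows the frame-deallocation description in the Scheduling Unit subsection, so both are assumptions.

    STORE R14, R6|R7  ; poststore the MULT result into the frame pointed
                      ; to by R6, at the offset contained in R7
    STORE R15, R8|R9  ; poststore the DIV result likewise
    FFREE             ; deallocate this thread's frame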
Continuations
To better understand the implementation of the SDF architecture, we first need to focus on the dynamic scenario that can be generated at run time. We introduce the concept of a continuation: a continuation in our architecture is simply a four-value tuple, designated as <FP, IP, RS, SC>, where FP is the Frame Pointer (where thread input values are stored), IP is the Instruction Pointer (which points to the thread code), RS is a Register Set (a dynamically allocated register set), and SC is a Synchronization Count (the number of values needed to enable the thread). Each thread has an associated continuation. At a given time, a thread continuation can be in one of the following states, where "--" means that the value is either undefined or unnecessary:

1) WTC <FP, IP, --, SC>: a waiting continuation; the thread has been created but has not yet received all of its inputs;
2) PLC <FP, IP, RS, -->: a preload continuation; the thread is enabled, has been assigned a register set, and is ready for (or undergoing) preload on SP;
3) EXC <FP, IP, RS, -->: an execute continuation; the thread has been preloaded and is ready for (or undergoing) execution on EP;
4) PSC <FP, IP, RS, -->: a poststore continuation; the thread has completed execution and its results are ready to be poststored by SP.

A Scheduler Unit (SU) handles the management of continuations and processing resources. In our design, the SU is very simple and can be implemented in hardware using a PLA. We now describe the details of the main functional units in our architecture: the EP, the SP, and the SU.

Execution Pipeline (EP)

Fig. 6 shows the block diagram of the Execution Pipeline (EP). Remember that EP executes the computations of a thread using only registers. The instruction fetch unit behaves like a traditional fetch unit, relying on a program counter to fetch the next instruction. 2 We rely on compile-time analysis to produce the code for EP so that instructions can be executed in sequence, with the data for each instruction already available in its pair of source registers.

2. Since both EP and SP need to execute instructions, our instruction cache is assumed to be dual ported. Since instruction memory causes no coherency-related problems, it may be possible to utilize separate cache memories for EP and SP. This is not unlike most superscalar systems.

The instruction fetch unit fetches an instruction belonging to the current thread using the PC. The decode (and register fetch) unit decodes the instruction and obtains the pair of registers that contains (up to two) source operands for the instruction. The execute unit executes the instruction and sends the results to the write-back unit along with the destination register numbers. The write-back unit writes (up to) two values to the register file. As can be seen, the Execution Pipeline (EP) behaves much like a conventional pipeline while retaining the primary dataflow property: data flows from instruction to instruction. Moreover, the EP does not access data cache memory and, hence, incurs no pipeline stalls (or context switches) due to cache misses.

Synchronization Pipeline (SP)

Fig. 7 shows the organization of the Synchronization Pipeline (SP), which mainly deals with memory accesses; here, we deal with preload and poststore instructions. The pipeline consists of the following stages: The instruction fetch unit fetches an instruction belonging to the current thread using the PC. The decode unit decodes the instruction and fetches register operands (using a Register Set). The effective address unit computes the effective address for LOAD and STORE instructions; LOAD and STORE instructions only reference the Frame memories 3 of threads, using a frame pointer (FP) and an offset into the frame, both of which are contained in registers. The memory access unit completes LOAD and STORE instructions; pursuant to a poststore, the synchronization count of the destination thread is decremented. The write-back unit completes LOAD (preload) instructions by writing the loaded values into registers.

3. Following the traditional dataflow paradigm, we use I-structure memory for arrays and other structures.
Scheduling Unit (SU)
In our architecture, a thread is created using a FALLOC instruction. FALLOC allocates a frame (accessible via a Frame Pointer, FP) and initializes the frame by storing an Instruction Pointer (IP) for the thread and a Synchronization Count (SC), which indicates the number of inputs needed to enable the thread. FALLOC thus creates a WTC (<FP, IP, --, SC>), as sketched below.
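A hypothetical thread-creation sequence follows; the FALLOC operand order and the use of STORE instructions to supply the new thread's inputs are assumptions consistent with the description above and with the poststore example of Fig. 4.

    FALLOC thread_code, 2, R20  ; hypothetical form: allocate a frame for a
                                ; thread starting at thread_code with SC = 2;
                                ; the new frame pointer is returned in R20
    STORE  R14, R20|2           ; supply the new thread's first input
    STORE  R15, R20|3           ; second input; SC reaches zero and the
                                ; thread's WTC becomes eligible for preload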
In order to speed up frame allocation, fixed-size frames for threads are preallocated and a stack of indices pointing to the available frames is maintained. The Scheduling Unit carries out an allocation by popping an index from that stack. The SP pushes deallocated frames back when executing an FFREE instruction subsequent to the poststores of a completed thread. This policy permits fast thread creation and context switching.
When a thread completes its execution and "poststores" its results (performed by SP), the synchronization count of each awaiting (WTC) destination thread is decremented. The SU detects when a synchronization count becomes zero and then allocates a Register Set (RS) to that thread. The register sets are viewed as circular buffers for assigning (and deallocating) register contexts to enabled threads.
The thread's continuation becomes a PLC (<FP, IP, RS, -->) and it is scheduled on SP for preload: SP loads the thread's data from its frame memory into the allocated register context. Upon the completion of the preload, the thread continuation (now in state EXC) is handed off to the Execution Pipeline (EP) using a FORKEP instruction. After execution, a FORKSP instruction moves the thread back to SP for poststore.
FALLOC and FFREE take two cycles in our architecture; FORKEP and FORKSP take four cycles to complete. These numbers are based on the observations made for Sparcle [2], where a four-cycle context switch was implemented in hardware. Note that scheduling is at the thread level in our system, rather than at the instruction level as done in other multithreaded systems (e.g., Tera [3], SMT [41]), and thus requires simpler hardware.
The Scheduler Unit is also responsible for scheduling preload (PLC) and poststore (PSC) continuations on multiple SPs, and preloaded threads on multiple EPs, in superscalar implementations of our architecture (Section 5.2).
EVALUATION METHODOLOGY

We have evaluated our architecture based on the execution of generated code for actual programs using our instruction-level simulator. 4 The simulator used for this paper assumes a perfect cache (i.e., all memory accesses take one cycle). We have also developed a backend to Sisal [9], using MIDC as the intermediate language [32], [33], to generate code for our architecture, but without yet implementing any particular optimizations.

4. We will provide the simulator and our benchmark programs to any interested reader so that our experimental data can be verified.

Previously, we have reported comparisons of SDF with MIPS-like architectures in [22]. In this paper, we compare SDF with a superscalar architecture with multiple functional units and Out-of-Order instruction issue logic, as facilitated by the SimpleScalar tool set [10]. We also investigate the effect of parallelism (i.e., the number of enabled threads) and thread granularity (the average run-lengths of the execution threads on EP) on the performance of our architecture (Sections 5.3 and 5.4), and the performance gained by increasing the number of SPs and EPs (that is, Superscalar-SDF), compared with that of conventional superscalar processors containing multiple functional units (Section 5.2). The programs used for this study include Matrix Multiply, FFT, Fibonacci, and Zoom. We chose these applications since they exhibit different characteristics: Matrix Multiply can be written to exploit both thread-level and instruction-level parallelism; FFT exhibits higher degrees of thread-level parallelism with increasing data sizes; recursive Fibonacci exhibits very little parallelism (either instruction level or thread level); and Zoom (a code segment of a picture zooming application [36]) consists of three nested loops, with a substantial amount of instruction-level parallelism in the middle loop but only small degrees of thread-level parallelism.
EVALUATION OF THE DECOUPLED SCHEDULED DATAFLOW ARCHITECTURE
In the first experiment, we have compared the execution performance of SDF (with one SP and one EP) with that of a superscalar processor with one Integer ALU (one integer adder and an integer multiply/divide unit) and one Floating-Point ALU (one floating-point adder and a floating-point multiply/divide unit). This way, SDF and the superscalar have the same number of functional units. 5 For the superscalar, we show data for both In-Order and Out-of-Order instruction issue. In all systems, we have set all instructions to take one cycle and assume a perfect cache (all memory accesses are set to one cycle). Tables 2, 3, 4, and 5 compare SDF with the In-Order and Out-of-Order superscalar systems. We indicate in boldface the cases where the SDF cycle counts are lower than both the In-Order and Out-of-Order superscalar counts.

5. Actually, the superscalar system contains four functional units: one integer adder, one integer multiply/divide, one floating-point adder, and one floating-point multiply/divide. SDF has only two arithmetic units, one in SP and one in EP; there are no separate multiply/divide units.
For the Matrix Multiply program (Table 2) , we have forked 10 threads to execute concurrently on the SDF. The Out-of-Order Superscalar system consistently outperforms SDF, although SDF performs better than the In-Order superscalar.
We are not surprised by this result since the SimpleScalar tool set performs extensive optimizations and dynamic instruction scheduling. SDF performs no dynamic instruction scheduling, eliminating complex hardware (e.g., scoreboards or reservation stations [18]). Moreover, SimpleScalar utilizes branch prediction (the data shown uses bimodal prediction with 2,048 entries), while SDF, at present, uses no branch prediction. The Matrix Multiply program exhibits a large degree of instruction-level parallelism and good branch prediction is easy to achieve. Although the Matrix Multiply program can be written to exhibit greater thread-level parallelism, we have used a fixed number of threads (10) in this experiment. Later, we will show how thread-level parallelism can improve SDF performance (Section 5.3).
While executing FFT (Table 3), unlike for Matrix Multiply, SDF outperforms the Out-of-Order superscalar only for larger input sizes (shown in bold). This is explained by the available instruction-level and thread-level parallelism. For very small data sizes, the Out-of-Order superscalar performs better than all other systems by exploiting instruction-level parallelism; very little thread-level parallelism is available at such data sizes. However, for data sizes of 256 or larger, the available thread-level parallelism in SDF (and the overlapped execution of SP and EP) exceeds the available instruction-level parallelism, leading to better performance by SDF. This data is in line with the studies performed on Simultaneous Multithreading systems [25], [26], which indicate that high performance is achieved by using a combination of thread-level and instruction-level parallelism. Fig. 8 shows this more clearly: for larger data sizes, SDF performs better than the superscalar architectures.
The Recursive Fibonacci program (Table 4 ) exhibits very little parallelism (neither instruction level nor thread level). For very small data sizes, conventional superscalar systems appear to incur overheads in creating recursive function calls, while SDF creates very few threads and incurs smaller overhead. As the data size increases, SDF creates too many threads, yet there is very little thread level parallelism, leading to poor performance by SDF as compared to superscalar systems. This is again in line with the general observation that multithreaded architectures perform poorly for applications with little or no thread level parallelism (and for single threaded applications).
The Zoom program (Table 5) contains substantial amounts of sequential code in the middle loop. This code allows for the exploitation of instruction level parallelism. However, it limits the amount of thread level parallelism. Moreover, in SDF, newly created threads wait for preload (and poststore) operations, causing the SP to be overloaded.
As we will see later, SDF's performance improves when multiple SPs are used (see Tables 7, 8, 9, and 10).
Summarizing
The data thus far confirms that any multithreaded architecture requires substantial thread-level parallelism to achieve good performance, while superscalar architectures require greater instruction-level parallelism. We feel that our nonblocking model is better suited for decoupling memory accesses from the execution unit. The functional nature of our instructions eliminates the need for dynamic scheduling of instructions within a thread. Since our architecture uses two different types of pipelines (SP and EP), it is necessary to achieve a good balance of utilization between these two units. Our architecture incurs unavoidable overheads for creating threads (allocation of frames and of register contexts) and for transferring threads between SP and EP (FORKEP and FORKSP instructions). At present, data can only be exchanged between threads by storing it in threads' frames (memory). These memory accesses could be avoided by storing the results of a thread directly into another thread's register context. Our experiments show that Matrix Multiply needs 16 frames with 10 parallel threads (for the data shown in Table 2); for this application, we could have eliminated storing (and loading) thread data in memory by allocating all frames directly in register sets (by providing sufficient register sets in hardware). It is our contention that the hardware savings achieved by SDF (by eliminating dynamic instruction scheduling logic) can be used either to increase the number of register sets (thus supporting greater thread-level parallelism) or to add more SPs and EPs; either can improve the performance of SDF.
Execution Performance of SDF with Multiple SPs and EPs
In our next experiment, we have investigated the performance of SDF using multiple SPs and EPs and compared it with superscalar architectures using multiple Integer and Floating-Point units. We have utilized an equal number of functional units in our comparisons by setting the number of functional units in the superscalar (#Integer ALUs + #Floating-Point ALUs) 6 equal to the number of SPs and EPs (#SPs + #EPs). It is our contention that conventional superscalar systems do not scale well with an increasing number of functional units; the scalability is limited by the instruction fetch/decode window size and the Register Update Unit (RUU) size. SDF relies primarily on thread-level parallelism and the decoupling of memory accesses from execution, so SDF performance can scale better with a proper balance of workload among SPs and EPs. Tables 7, 8, 9, and 10 show the results for this series of experiments. In order to provide greater opportunities for dynamic instruction scheduling in the superscalar system, we have set the instruction fetch and decode window widths to 32 and the RUU size to 32 (Table 6). We have observed little change in performance (for the selected benchmarks) when the window width is increased beyond 32. We have also explored the impact of changing the RUU size: when the RUU is set to 64, the performance of the superscalar showed less than 5 percent improvement as compared to that with the RUU set to 32.

6. Again, each ALU in the superscalar contains separate adder and multiply/divide units. In SDF, each ALU is treated as a single unit performing all arithmetic operations.
In Table 7, we show the data for the Matrix Multiply program. As can be noted, when we add more SPs and EPs (correspondingly, more Integer and Floating-Point functional units in the superscalar), SDF outperforms the superscalar architecture (shown in bold in Table 7), even when compared to the complex Out-of-Order scheduling used by superscalar architectures.
SDF performance overtakes that of the Out-of-Order superscalar architecture with three SPs and three EPs (correspondingly, with three Integer and three FP ALUs in the superscalar system). It should also be noted that, for the superscalar architecture, the performance improvement with an increasing number of functional units scales poorly: the superscalar exhibits no improved performance beyond three Integer and three Floating-Point ALUs. For SDF, the performance is limited by the SPs; performance improves consistently as more SPs are added.
This can more easily be seen from Fig. 9 . The x-axis shows the number of functional units (#SP + #EP for SDF; #Integer ALUs + #FP ALUs for superscalar). The figure shows the execution times for matrix multiplication with a 150*150 data size.
The next table (Table 8) shows the results for FFT. In this case, SDF outperforms the Out-of-Order superscalar for data sizes greater than 256 for all machine configurations. Once again, SDF performance scales better with added SPs than that of the superscalar when more functional units are added. Fig. 10 shows the scalability of SDF for FFT (data size 256). Again, the x-axis shows the number of functional units (#SP + #EP for SDF; #Integer ALUs + #FP ALUs for the superscalar).
For Fibonacci (Table 9), as the number of SPs is increased, SDF compares more favorably with the Out-of-Order superscalar with a similar number of Integer units (as compared to the data in Table 4). As before, SDF performance scales better with more SPs and EPs than the superscalar does when more functional units are added (adding more FP ALUs in the superscalar shows no improvement, since Fibonacci does not use floating-point arithmetic). Fig. 11 shows the scalability of SDF for Fibonacci (data size 15) more clearly; the x-axis again shows the number of functional units (#SP + #EP for SDF; #Integer ALUs + #FP ALUs for the superscalar). Table 10 shows the data for the Zoom program. Once again, the performance of SDF scales better than that of the superscalar. With five SPs and four EPs, SDF outperforms the Out-of-Order superscalar system with five Integer and four FP ALUs (shown in bold in Table 10), even for this program.
Summarizing
The data for each of the benchmarks (Tables 7, 8, 9, and 10) is consistent with our contention that SDF, with multiple SPs and EPs, can be a viable alternative to superscalar architectures that utilize complex dynamic instruction scheduling logic. In fact, it would be fairer to compare an SDF with more functional units (#EPs + #SPs) than the superscalar has, because of the hardware savings: our SPs and EPs are no more complex than a traditional functional unit used in superscalar systems, and we eliminate the complex instruction issue, register renaming, and instruction retiring logic.
Scheduling of threads among the available SPs and EPs is performed at the thread level (instead of at the instruction level, as done in Tera and SMT).
Effect of Thread Level Parallelism on Execution Behavior
Here, we explore the performance benefits of increasing the thread-level parallelism (i.e., the number of concurrent threads) using one SP and one EP in the SDF architecture. We have used Matrix Multiply for this purpose, executing a 50*50 matrix multiply while varying the number of concurrent threads; each thread executed five (unrolled) loop iterations. In this data collection, we concentrated only on the innermost loop of Matrix Multiply, unlike the previous data, where we parallelized all three nested loops; see Fig. 13. As can be expected, increasing the degree of parallelism will not always decrease the number of cycles needed in a linear fashion. This is due to the saturation of the SP (which reaches more than 90 percent utilization with 10 threads). As shown previously (Table 7, Fig. 9), adding more SP and EP units (i.e., the Superscalar-SDF implementation) would allow us to utilize higher levels of thread parallelism. Although not presented in this paper, we have observed very similar behavior with other data sizes for Matrix Multiply and with the other benchmarks, Fibonacci, FFT, and Zoom.
Effect of Thread Granularity on Execution Behavior
In the next experiment with Matrix Multiply, we have held the number of threads at five and varied the thread granularity by varying the number of innermost loop iterations executed by each thread (i.e., degree of unrolling).
Once again, we have used one SP and one EP for this experiment and concentrated only on the innermost loop of Matrix Multiply. The data size for Fig. 14 is 50*50. Here, the thread granularity ranged from an average of 27 instructions (12 for SP and 15 for EP), with no loop unrolling, to 51 instructions (13 for SP and 39 for EP), when each thread executed 10 unrolled loop iterations. Once again, the execution performance improves (i.e., execution time decreases) as the threads become coarser; however, the improvement becomes less significant beyond a certain granularity. Similar behavior has been observed for larger data sizes and for the other benchmarks. We are exploring compiler optimizations that utilize static branch prediction to speculatively preload threads and thus increase thread run-lengths (i.e., granularities).
CONCLUSIONS AND FUTURE WORK
In this paper, we have presented a nonblocking multithreaded dataflow architecture that utilizes control-flow-like scheduling of instructions. Our architecture separates memory accesses from instruction execution. Using an instruction set simulator for our decoupled Scheduled Dataflow (SDF) architecture, we have compared the execution performance of SDF with that of a superscalar with multiple functional units and aggressive Out-of-Order instruction issue logic. When the thread-level parallelism is high, SDF substantially outperforms superscalar architectures (with multiple functional units) using In-Order instruction execution. SDF underperforms superscalar architectures with Out-of-Order execution when the instruction-level parallelism is high but the thread-level parallelism is low. Also, the SP in SDF can be a bottleneck, since threads can only be scheduled on EP after their preload operations; the performance can be improved by adding more SPs. We also observe that, when more functional units are added, the Out-of-Order execution of the superscalar architecture does not scale as well as SDF. Another factor that must be kept in mind while analyzing the data is that SDF uses no branch prediction (unlike the superscalar architectures). We have not yet optimized our instruction set or the compiler that generated the code for the benchmarks.
SDF reduces the complexity of the processor by eliminating the complex logic (e.g., scoreboards or reservation stations) needed for resolving data dependencies, register renaming, Out-of-Order instruction issue, and branch prediction. The silicon area thus saved may be used to include more register sets (and more registers per set) to improve thread-level parallelism and thread granularities, or to add more SPs and EPs. We are working to improve both the instruction set and the compiler to produce more efficient executions of programs. At present, SDF uses no branch prediction, although we plan to experiment with static branch prediction to speculatively preload data and increase the run-lengths of our threads. Using compiler optimizations, speculative execution, and branch prediction, we aim to increase the run-lengths of threads executing on EP.
While decoupled access/execute implementations are possible within the scope of conventional architectures, the multithreading model presents greater opportunities for exploiting the separation of memory accesses from the execution pipeline. We feel that, even among multithreaded alternatives, nonblocking models are better suited for decoupled execution. In our model, threads exchange data only through the frame memories of threads (array data is provided through I-structure memory). The use of frame memories for thread data permits a clean decoupling of memory accesses into preloads and poststores. This can lead to greater data locality and relatively low cache-miss rates.
Krishna M. Kavi received the BE (electrical) degree from the Indian Institute of Science and the MS and PhD degrees (computer science and engineering) from Southern Methodist University. He is currently a professor and eminent scholar of computer engineering at the University of Alabama at Huntsville (UAH). Prior to joining UAH, he was a professor of computer science and engineering at the University of Texas at Arlington. For two years (1993-1995), he was a program manager at the National Science Foundation, managing the operating systems, programming languages, and compilers programs in the CCR Division. He was an IEEE Computer Society (CS) Distinguished Visitor (1989-1991), an editor of the IEEE Transactions on Computers (1993-1997), and an editor of the Computer Society Press (1987-1991). His primary research interest lies in computer systems architecture, including dataflow and multithreaded systems, memory management, operating systems, and compiler optimization. His other research interests include formal specification of concurrent processing systems, performance modeling and evaluation, load balancing, and scheduling of parallel programs. He has published more than 125 technical papers on these topics. He is a senior member of the IEEE and a member of the ACM.
Roberto Giorgi received the MS degree in electronic engineering, summa cum laude, and the PhD degree in computer engineering, both from the University of Pisa, Italy. He is currently an assistant professor in the Department of Information Engineering, University of Siena, Italy. He was a research associate in the Department of Electrical and Computer Engineering, University of Alabama at Huntsville. His main academic interest is computer architecture, in particular, multithreaded and multiprocessor systems. He is exploring coherence protocols, compile-time optimizations, the behavior of user and system code, and architectural simulation for improving the performance of a wide range of applications, from desktops to embedded systems, web servers, and e-commerce servers. He took part in the ChARM project, in cooperation with VLSI Technology Inc., San Jose, California, developing part of the software used for performance evaluation of ARM-processor-based embedded systems with cache memory. He is a member of the IEEE, the IEEE Computer Society, and the ACM.
Joseph Arul received the BSc degree in mathematics in 1981 from Indore University, India, and the MS degree in computer science in 1994 from DePaul University, Chicago. From 1994-1995, he was a lecturer at Fu Jen Catholic University, Taiwan. Currently, he is a computer engineering PhD student at the University of Alabama at Huntsville (UAH). His current research interests are computer architecture, parallel and distributed computing, multithreaded programs, and compilers. He is a student member of the IEEE and the IEEE Computer Society and a member of the ACM.
