Beyond Dataflow
This paper presents some recent advanced dataflow architectures. While the dataflow concept offers the potential of high performance, the performance of an actual dataflow implementation can be restricted by a limited number of functional units, limited memory bandwidth, and the need to associatively match pending operations with available functional units. Since the early 1970s, there have been significant developments in both fundamental research and practical realizations of dataflow models of computation. In particular, there has been active research and development in multithreaded architectures that evolved from the dataflow model. Other techniques for combining control flow and dataflow have also emerged, such as coarse-grain dataflow, dataflow with complex machine operations, RISC dataflow, and micro dataflow. These developments have also influenced the design of high-performance superscalar processors in the “post-RISC” era.
A token caching waiting-matching unit for tagged-token dataflow computers
Computers using the tagged-token dataflow model are among the best candidates for delivering the extremely high levels of performance required in the future. Instruction scheduling in these computers is determined by associatively matching data-bearing tokens in a Waiting-Matching Unit (W-M unit). At the W-M unit, incoming tokens with matching contexts are forwarded to an instruction, while non-matching tokens are stored to await their matching partner. The requirements of the W-M unit are exacting. The necessary token storage capacity at each processing element (PE) is presently estimated to be 100,000 tokens. Since the most frequently executed arithmetic instructions require two operands, the bandwidth of the W-M unit must be approximately twice that of the ALU. These contradictory requirements of high storage capacity and high memory bandwidth have compromised the W-M units of previous dataflow computers, limiting their speed. However, tokens arriving at a PE exhibit strong temporal locality, which naturally suggests the use of a caching technique. Using a recently developed CAM memory structure as a base, a token caching scheme is described which allows rapid, fully associative token matching while supporting a large token storage capacity. The key to the caching scheme is a fast and compact articulated first-in, first-out content-addressable memory (AFCAM), which allows associative matching and garbage collection while maintaining temporal ordering. A new memory cell is developed as the basis for the AFCAM in an advanced CMOS (Complementary Metal Oxide Semiconductor) technology. The design of the cell is discussed, along with electrical simulation results verifying its operation and performance. Finally, the estimated system performance of a dataflow computer using the caching scheme is presented.
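A minimal sketch of the matching step described above, assuming a simple dictionary-keyed store rather than the AFCAM hardware the abstract proposes; the token fields (tag, dest, value) and the class name are hypothetical and chosen only for illustration:

```python
# Illustrative sketch (not the paper's AFCAM design): a waiting-matching
# store that pairs tokens by (activation tag, destination instruction).

class WaitingMatchingUnit:
    def __init__(self):
        self.pending = {}  # key -> stored token awaiting its partner

    def accept(self, token):
        """Return a matched operand pair if the partner is present,
        otherwise store the token and return None."""
        key = (token["tag"], token["dest"])
        partner = self.pending.pop(key, None)
        if partner is not None:
            return (partner, token)   # both operands available: fire the instruction
        self.pending[key] = token     # wait for the matching partner
        return None

# Example: two tokens bound for the same instruction in the same activation.
wm = WaitingMatchingUnit()
assert wm.accept({"tag": 7, "dest": "add_3", "value": 1.0}) is None
pair = wm.accept({"tag": 7, "dest": "add_3", "value": 2.0})
assert pair is not None
```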
Program allocation for hypercube based dataflow systems
The dataflow model of computation differs from the traditional control-flow model in that it does not use a program counter to sequence instructions in a program. Instead, the execution of instructions is based solely on the availability of their operands: an instruction is executed in a dataflow computer when all of its operands are available. This asynchronous nature of the dataflow model allows the exploitation of fine-grain parallelism inherent in programs. Although the dataflow model exploits parallelism, the problem of optimally allocating a program to processors is NP-complete. Therefore, one of the major issues facing designers of dataflow multiprocessors is the proper allocation of programs to processors.

The problem of program allocation lies in maximizing parallelism while minimizing interprocessor communication costs. This research culminates in a proposed method, the Balanced Layered Allocation Scheme, which uses heuristic rules to strike a balance between computation time and communication costs in dataflow multiprocessors. Specifically, the proposed scheme applies Critical Path and Longest Directed Path heuristics when allocating instructions to processors. Simulation studies indicate that the scheme is effective in reducing the overall execution time of a program by accounting for the effects of communication costs on computation times.
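As a rough illustration of critical-path-driven placement, the sketch below assigns the nodes of a small dataflow graph to processing elements, longest remaining path first, onto the least-loaded PE. This is an assumed, simplified heuristic for exposition only; it ignores communication costs and is not the Balanced Layered Allocation Scheme itself:

```python
# Illustrative sketch only: a simple critical-path allocation heuristic for a
# dataflow graph. Node names, costs, and the load-balancing rule are assumptions.

def critical_path_lengths(graph, cost):
    """Longest path (by summed node cost) from each node to any sink of the DAG."""
    length = {}
    def visit(node):
        if node not in length:
            succs = graph.get(node, [])
            length[node] = cost[node] + max((visit(s) for s in succs), default=0)
        return length[node]
    for node in cost:
        visit(node)
    return length

def allocate(graph, cost, num_pes):
    """Assign nodes to PEs: longest critical path first, least-loaded PE wins."""
    length = critical_path_lengths(graph, cost)
    load = [0] * num_pes
    placement = {}
    for node in sorted(cost, key=lambda n: -length[n]):
        pe = min(range(num_pes), key=lambda p: load[p])
        placement[node] = pe
        load[pe] += cost[node]
    return placement

# Example: a small diamond-shaped dataflow graph allocated to 2 PEs.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
cost = {"a": 1, "b": 3, "c": 2, "d": 1}
print(allocate(graph, cost, num_pes=2))
```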
Exploiting iteration-level parallelism in dataflow programs
The term "dataflow" generally encompasses three distinct aspects of computation - a data-driven model of computation, a functional/declarative programming language, and a special-purpose multiprocessor architecture. In this paper we decouple the language and architecture issues by demonstrating that declarative programming is a suitable vehicle for the programming of conventional distributed-memory multiprocessors.This is achieved by appling several transformations to the compiled declarative program to achieve iteration-level (rather than instruction-level) parallelism. The transformations first group individual instructions into sequential light-weight processes, and then insert primitives to: (1) cause array allocation to be distributed over multiple processors, (2) cause computation to follow the data distribution by inserting an index filtering mechanism into a given loop and spawning a copy of it on all PEs; the filter causes each instance of that loop to operate on a different subrange of the index variable.The underlying model of computation is a dataflow/von Neumann hybrid in that exection within a process is control-driven while the creation, blocking, and activation of processes is data-driven.The performance of this process-oriented dataflow system (PODS) is demonstrated using the hydrodynamics simulation benchmark called SIMPLE, where a 19-fold speedup on a 32-processor architecture has been achieved
The Dataflow Computational Model And Its Evolution
The dataflow computational model is an alternative to the von Neumann model. Its most significant aspects are that it is based on asynchronous instruction scheduling and that it exposes massive parallelism. This thesis is a review of the dataflow computational model, as well as of some hybrid models that lie between the pure dataflow model and the von Neumann model. Additionally, there are references to dataflow principles that have been or are being adopted by conventional machines, programming languages, and distributed computing systems.
Fine-grain parallelism on sequential processors
There seems to be a consensus that future Massively Parallel Architectures will consist of a number of nodes, or processors, interconnected by a high-speed network. A von Neumann style of processing within the nodes of such a multiprocessor system is limited by the constraints imposed by the control-flow execution model. Although the conventional control-flow model offers high performance on sequential code that exhibits good locality, switching between threads and synchronization among threads cause substantial overhead. Dataflow architectures, on the other hand, support rapid context switching and efficient synchronization, but require extensive hardware and do not use high-speed registers.

A number of architectures have been proposed to combine instruction-level context switching with sequential scheduling. One such architecture is the Threaded Abstract Machine (TAM), which supports fine-grain interleaving of multiple threads through an appropriate compilation strategy rather than through elaborate hardware. Experiments on TAM have already shown that it is possible to implement the dataflow execution model on conventional architectures and obtain reasonable performance. These studies also show a basic mismatch between the requirements of fine-grain parallelism and the underlying architecture, and that considerable improvement is possible through hardware support.

This thesis presents two design modifications to support fine-grain parallelism efficiently. First, a modification to the instruction set architecture is proposed to reduce the cost of scheduling and synchronization; the hardware changes are kept to a minimum so as not to disturb the functionality of a conventional RISC processor. Second, a separate coprocessor is used to handle messages, so that atomicity and message handling are provided efficiently without compromising per-processor performance or system integrity. Clock cycles per TAM instruction are used as the measure of the effectiveness of these changes.
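The fine-grain synchronization that TAM compiles in can be pictured with a small sketch: a per-frame synchronization counter enables a thread only after all of its inputs have arrived, and enabled threads are then run sequentially. The class, method names, and counter scheme below are assumptions for illustration, not TAM's actual instruction set:

```python
# Minimal sketch of TAM-style fine-grain synchronization: each frame carries
# synchronization counters, and a thread becomes runnable only when all of its
# inputs have arrived and decremented its counter to zero.

from collections import deque

class Frame:
    def __init__(self):
        self.counters = {}        # thread name -> remaining inputs
        self.ready = deque()      # threads enabled for sequential execution

    def declare(self, thread, num_inputs):
        self.counters[thread] = num_inputs

    def post(self, thread):
        """An input (e.g. a message or a producer thread) arrived for `thread`."""
        self.counters[thread] -= 1
        if self.counters[thread] == 0:
            self.ready.append(thread)   # enable the thread; no context switch yet

    def run(self):
        """Interleave enabled threads on a conventional sequential processor."""
        while self.ready:
            thread = self.ready.popleft()
            print(f"executing thread {thread}")

# Example: thread "t2" needs two inputs before it can run.
f = Frame()
f.declare("t2", 2)
f.post("t2")
f.post("t2")
f.run()
```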