25 research outputs found
Fifty Years of ISCA: A data-driven retrospective on key trends
Computer Architecture, broadly, involves optimizing hardware and software for
current and future processing systems. Although there are several other top
venues to publish Computer Architecture research, including ASPLOS, HPCA, and
MICRO, ISCA (the International Symposium on Computer Architecture) is one of
the oldest, longest running, and most prestigious venues for publishing
Computer Architecture research. Since 1973, except for 1975, ISCA has been
organized annually. Accordingly, this year will be the 50th year of ISCA. Thus,
we set out to analyze the past 50 years of ISCA to understand who and what has
been driving and innovating computing systems thus far. Our analysis identifies
several interesting trends that reflect how ISCA, and Computer Architecture in
general, has grown and evolved in the past 50 years, including minicomputers,
general-purpose uniprocessor CPUs, multiprocessor and multi-core CPUs,
general-purpose GPUs, and accelerators. Comment: 17 pages, 11 figures
Beyond Dataflow
This paper presents some recent advanced dataflow architectures. While the dataflow concept offers the potential of high performance, the performance of an actual dataflow implementation can be restricted by a limited number of functional units, limited memory bandwidth, and the need to associatively match pending operations with available functional units. Since the early 1970s, there have been significant developments in both fundamental research and practical realizations of dataflow models of computation. In particular, there has been active research and development in multithreaded architectures that evolved from the dataflow model. Other techniques for combining control-flow and dataflow have also emerged, such as coarse-grain dataflow, dataflow with complex machine operations, RISC dataflow, and micro dataflow. These developments have also had some impact on the conception of high-performance superscalar processors in the “post-RISC” era.
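As a rough illustration of the operand-matching step mentioned in this abstract, the following minimal sketch fires a node as soon as all of its input tokens are present. The graph, the node names, and the `run_dataflow` helper are hypothetical and not taken from the paper; a real dataflow machine performs this matching in hardware rather than by polling a queue.

```python
from collections import deque
import operator

# Hypothetical dataflow graph for (a + b) * (a - b).
# Each node maps to (operation, list of operand names).
GRAPH = {
    "prod": (operator.mul, ["sum", "diff"]),
    "sum": (operator.add, ["a", "b"]),
    "diff": (operator.sub, ["a", "b"]),
}

def run_dataflow(graph, inputs):
    """Fire each node once all of its operand tokens have arrived."""
    tokens = dict(inputs)      # values (tokens) produced so far
    pending = deque(graph)     # nodes that have not fired yet
    fired_order = []
    stalled = 0                # consecutive nodes found not ready
    while pending:
        if stalled == len(pending):
            raise ValueError("no node can fire: missing input or cyclic graph")
        name = pending.popleft()
        op, operands = graph[name]
        if all(src in tokens for src in operands):
            # All operands matched: the node fires and emits a token.
            tokens[name] = op(*(tokens[src] for src in operands))
            fired_order.append(name)
            stalled = 0
        else:
            # Operands still missing: keep the node waiting.
            pending.append(name)
            stalled += 1
    return tokens, fired_order

tokens, fired_order = run_dataflow(GRAPH, {"a": 5, "b": 3})
print(tokens["prod"], fired_order)
```

Note that "prod" is listed first in the graph but fires last: execution order is determined by token availability, not by program order.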
ASC: A stream compiler for computing with FPGAs
Published version
Looking to Parallel Algorithms for ILP and Decentralization
We introduce explicit multi-threading (XMT), a decentralized architecture
that exploits fine-grained SPMD-style programming; an SPMD program can
translate directly to MIPS assembly language using three additional
instruction primitives. The motivation for XMT is: (i) to define an
inherently decentralizable architecture, taking into account that the
performance of future integrated circuits will be dominated by wire costs,
(ii) to increase available instruction-level parallelism (ILP) by
leveraging expertise in the world of parallel algorithms, and (iii) to
reduce hardware complexity by alleviating the need to detect ILP at
run-time: if parallel algorithms can give us an overabundance of work to
do in the form of thread-level parallelism, one can extract
instruction-level parallelism with greatly simplified dependence-checking.
We show that implementations of such an architecture tend towards
decentralization and that, when global communication is necessary, overall
performance is relatively insensitive to large on-chip delays. We compare
the performance of the design to more traditional parallel architectures
and to a high-performance superscalar implementation, but the intent is
merely to illustrate the performance behavior of the organization and to
stimulate debate on the viability of introducing SPMD to the single-chip
processor domain. We cannot offer at this stage hard comparisons with
well-researched models of execution.
When programming for the SPMD model, the total number of operations that
the processor has to perform is often slightly higher. To counter this, we
have observed that the length of the critical path through the dynamic
execution graph is smaller than in the serial domain, and the amount of
ILP is correspondingly larger. Fine-grained SPMD programming connects with
a broad knowledge base in parallel algorithms and scales down to provide
good performance relative to high-performance superscalar designs even
with small input sizes and small numbers of functional units.
Keywords: Fine-grained SPMD, parallel algorithms, spawn-join, prefix-sum,
instruction-level parallelism, decentralized architecture.
(Also cross-referenced as UMIACS-TR-98-40)
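The trade-off described in this abstract, more total operations in exchange for a shorter critical path, can be sketched with a prefix sum. The example below is an assumption-laden illustration, not XMT code: it simulates a simple step-doubling scan in which all additions within a round are independent, and counts additions (work) and rounds (critical path length). This simple variant does noticeably more work than the "slightly higher" overhead the abstract reports; work-efficient scans narrow that gap. Both function names are hypothetical, and both functions assume a non-empty input.

```python
def serial_prefix_sum(xs):
    """Serial scan: each addition depends on the previous one."""
    out = [xs[0]]
    adds = 0
    for x in xs[1:]:
        out.append(out[-1] + x)
        adds += 1
    # Work and critical path length are equal in the serial case.
    return out, adds, adds

def parallel_prefix_sum(xs):
    """Step-doubling scan, simulated round by round."""
    vals = list(xs)
    n = len(vals)
    adds = 0
    rounds = 0
    stride = 1
    while stride < n:
        nxt = vals[:]
        # Every addition in this loop reads only values from the previous
        # round, so all of them could run at the same time.
        for i in range(stride, n):
            nxt[i] = vals[i] + vals[i - stride]
            adds += 1
        vals = nxt
        rounds += 1
        stride *= 2
    return vals, adds, rounds

data = list(range(1, 9))
print(serial_prefix_sum(data))
print(parallel_prefix_sum(data))
```

For eight inputs the serial version performs 7 dependent additions, while the simulated parallel version performs 17 additions in only 3 rounds.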
The Dataflow Computational Model And Its Evolution
The dataflow computational model is an alternative to the von Neumann model. Its most significant aspects are that it is based on asynchronous instruction scheduling and that it exposes massive parallelism. This thesis is a review of the dataflow computational model, as well as of some hybrid models that lie between the pure dataflow model and the von Neumann model. Additionally, there are some references to dataflow principles that have been or are being adopted by conventional machines, programming languages, and distributed computing systems.
DEMAND-DRIVEN EXECUTION USING FUTURE GATED SINGLE ASSIGNMENT FORM
This dissertation discusses a novel, previously unexplored execution model called Demand-Driven Execution (DDE), which executes programs starting from the outputs of the program and progressing towards its inputs. This approach is significantly different from prior demand-driven reduction machines, as it can execute a program written in an imperative language using the demand-driven paradigm while extracting both instruction-level and data-level parallelism. The execution model relies on an executable Single Assignment Form, which serves both as the internal representation of the compiler and as the Instruction Set Architecture (ISA) of the machine. This work develops the instruction set architecture, the programming language pragmatics, and the microarchitecture for the demand-driven execution paradigm.
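The general idea of starting from an output and working back towards the inputs can be sketched as follows. This is only a toy illustration of demand-driven evaluation over a small single-assignment graph; it is not the Future Gated Single Assignment form or the ISA developed in the dissertation, and the graph, the node names, and the `demand` helper are all hypothetical.

```python
import operator

# Hypothetical single-assignment program: each name is defined exactly once.
GRAPH = {
    "out": (operator.add, ["x", "y"]),
    "unused": (operator.mul, ["x", "x"]),  # never demanded by "out"
}

def demand(node, graph, inputs, cache, trace):
    """Evaluate `node` by recursively demanding the values it depends on."""
    if node in cache:
        return cache[node]
    if node in inputs:
        value = inputs[node]
    else:
        op, operands = graph[node]
        value = op(*(demand(src, graph, inputs, cache, trace) for src in operands))
    cache[node] = value   # each value is computed at most once
    trace.append(node)    # record the order in which values became available
    return value

trace = []
result = demand("out", GRAPH, {"x": 2, "y": 3}, {}, trace)
print(result, trace)
```

Because evaluation is driven by the requested output, the "unused" node is never computed.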
Fine-grain parallelism on sequential processors
There seems to be a consensus that future Massively Parallel Architectures
will consist of a number of nodes, or processors, interconnected by a high-speed network.
Von Neumann style processing within a node of a multiprocessor system
has its performance limited by the constraints imposed by the control-flow execution
model. Although the conventional control-flow model offers high performance on
sequential execution that exhibits good locality, switching between threads and synchronization
among threads cause substantial overhead. On the other hand, dataflow
architectures support rapid context switching and efficient synchronization but require
extensive hardware and do not use high-speed registers.
There have been a number of architectures proposed to combine the instruction-level
context switching capability with sequential scheduling. One such architecture
is the Threaded Abstract Machine (TAM), which supports fine-grain interleaving of multiple
threads by an appropriate compilation strategy rather than through elaborate hardware.
Experiments on TAM have already shown that it is possible to implement the dataflow
execution model on conventional architectures and obtain reasonable performance.
These studies also show a basic mismatch between the requirements for fine-grain
parallelism and the underlying architecture, and that considerable improvement is possible through hardware support.
This thesis presents two design modifications to efficiently support fine-grain parallelism. First, a modification to the instruction set architecture is proposed to reduce the cost of scheduling and synchronization. The hardware modifications are kept to a minimum so as not to disturb the functionality of a conventional RISC processor. Second, a separate coprocessor is used to handle messages. Atomicity and message handling are supported efficiently, without compromising per-processor performance or system integrity. Clock cycles per TAM instruction is used as the measure of the effectiveness of these changes.