DEMAND-DRIVEN EXECUTION USING FUTURE GATED SINGLE ASSIGNMENT FORM
This dissertation presents a novel, previously unexplored execution model called Demand-Driven Execution (DDE), which executes a program starting from its outputs and progressing towards its inputs. The approach differs significantly from prior demand-driven reduction machines in that it can execute a program written in an imperative language under the demand-driven paradigm while extracting both instruction- and data-level parallelism. The execution model relies on an executable Single Assignment Form that serves both as the compiler's internal representation and as the Instruction Set Architecture (ISA) of the machine. This work develops the instruction set architecture, the programming language pragmatics, and the microarchitecture for the demand-driven execution paradigm.
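The core idea of demand-driven execution can be illustrated with a minimal sketch (the class and method names here are illustrative, not the dissertation's ISA): evaluation starts at the program's output node and recursively demands the operands it needs, so only instructions that actually contribute to the output ever run, and each single-assignment node is computed at most once.

```python
# Hypothetical sketch of demand-driven evaluation over a single-assignment
# graph: execution begins at the output and propagates demands toward the
# inputs, the reverse of conventional control-flow sequencing.

class Node:
    """A single-assignment instruction: an operator plus operand nodes."""
    def __init__(self, op, *operands):
        self.op = op
        self.operands = operands
        self.value = None          # written at most once (single assignment)
        self.computed = False

    def demand(self):
        """Demand this node's value, recursively demanding operands first."""
        if not self.computed:
            args = [n.demand() for n in self.operands]
            self.value = self.op(*args)
            self.computed = True
        return self.value

# Tiny graph for (a + b) * (a - b) with a = 6, b = 2.
a = Node(lambda: 6)
b = Node(lambda: 2)
add = Node(lambda x, y: x + y, a, b)
sub = Node(lambda x, y: x - y, a, b)
out = Node(lambda x, y: x * y, add, sub)

print(out.demand())  # demand propagates from the output toward the inputs
```

Note that `a` and `b` are demanded twice but evaluated once, which is the property the single-assignment form guarantees.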
Implementation of a general purpose dataflow multiprocessor
Thesis (Ph.D.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1988. GRSN 409671. Includes bibliographical references (leaves 151-155). By Gregory Michael Papadopoulos.
Hardware design of task superscalar architecture
Exploiting concurrency to achieve greater performance is an important and difficult challenge for current high-performance systems. Although the theory is simple, the complexity of traditional parallel programming models in most cases prevents the programmer from harvesting that performance.
Several partitioning granularities have been proposed to better exploit concurrency. In this context, dynamic software task-management systems, such as task-based dataflow programming models, leverage dataflow principles to improve task-level parallelism and overcome the limitations of static task-management systems. These models implicitly schedule computation and data, and use tasks instead of instructions as the basic unit of work, thereby relieving the programmer of explicitly managing parallelism. While these programming models share conceptual similarities with well-known out-of-order superscalar pipelines (e.g., dynamic data-dependency analysis and dataflow scheduling), they rely on software-based dependency analysis, which is inherently slow and limits their scalability when tasks are fine-grained and numerous.
This problem grows with the number of available cores. To keep all cores busy and accelerate overall application performance, the application must be partitioned into more, and smaller, tasks. Task scheduling (i.e., the creation and management of the execution of tasks) in software introduces overheads, and so becomes increasingly inefficient as the number of cores grows. In contrast, a hardware scheduling solution can achieve greater speed-ups, since a hardware task scheduler requires fewer cycles than its software counterpart to dispatch a task.
The Task Superscalar is a hybrid dataflow/von Neumann architecture that exploits the task-level parallelism of a program. It combines the effectiveness of out-of-order processors with the task abstraction, thereby providing a unified management layer for CMPs that effectively employs processors as functional units. The Task Superscalar had previously been implemented only in software, with limited parallelism and high memory consumption due to the nature of that implementation.
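The software-based dependency analysis these runtimes perform can be sketched as follows (an illustrative model, not the Task Superscalar pipeline itself): each task declares the objects it reads and writes, the runtime derives producer-consumer edges from the last writer of each object, and a task is dispatched once all of its producers have finished.

```python
# Sketch of dynamic, software-side task dependency analysis, the mechanism
# the thesis moves into hardware. All names here are assumed for
# illustration.

from collections import defaultdict

class TaskScheduler:
    def __init__(self):
        self.last_writer = {}              # object -> task that last wrote it
        self.pending = defaultdict(int)    # task -> unmet dependency count
        self.consumers = defaultdict(list) # producer task -> dependent tasks
        self.tasks = {}
        self.ready = []

    def submit(self, tid, fn, reads=(), writes=()):
        self.tasks[tid] = fn
        deps = {self.last_writer[o] for o in reads if o in self.last_writer}
        for o in writes:
            self.last_writer[o] = tid      # this task becomes the producer
        for producer in deps:
            self.consumers[producer].append(tid)
        self.pending[tid] = len(deps)
        if not deps:
            self.ready.append(tid)

    def run(self):
        order = []
        while self.ready:
            tid = self.ready.pop(0)
            self.tasks[tid]()              # "functional unit" executes task
            order.append(tid)
            for dep in self.consumers[tid]:
                self.pending[dep] -= 1
                if self.pending[dep] == 0:
                    self.ready.append(dep)
        return order

sched = TaskScheduler()
data = {}
sched.submit("init", lambda: data.update(x=1), writes=["x"])
sched.submit("inc",  lambda: data.update(x=data["x"] + 1),
             reads=["x"], writes=["x"])
sched.submit("use",  lambda: data.update(y=data["x"] * 10), reads=["x"])
print(sched.run())  # tasks execute in dependency order
```

Every `submit` call walks the read and write sets in software; this per-task bookkeeping is the overhead that motivates a hardware implementation.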
In this thesis, a hardware Task Superscalar architecture is designed to be integrated into a future high-performance computer with the ability to exploit fine-grained task parallelism. The main contributions of this thesis are: (1) a design of the operational flow of the Task Superscalar architecture, adapted and improved for hardware implementation; (2) an HDL prototype for latency exploration; (3) a full cycle-accurate simulator of the hardware Task Superscalar (based on the previously obtained latencies); (4) a full design-space exploration of the Task Superscalar component configuration (number and size) for systems with different numbers of processing elements (cores); (5) a comparison with a software implementation of a real task-based programming-model runtime using real benchmarks; and (6) an exploration of the hardware resource usage of the selected configurations.
Stream Objects : dynamically-segmented scalable media over the Internet
Thesis (M.Eng.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1996. Includes bibliographical references (p. 90). By Steven Niemczyk.
Fine-grain parallelism on sequential processors
There seems to be a consensus that future massively parallel architectures will consist of a number of nodes, or processors, interconnected by a high-speed network. Using a von Neumann style of processing within the node of a multiprocessor system limits performance through the constraints imposed by the control-flow execution model. Although the conventional control-flow model offers high performance on sequential execution that exhibits good locality, switching between threads and synchronization among threads cause substantial overhead. Dataflow architectures, on the other hand, support rapid context switching and efficient synchronization, but require extensive hardware and do not use high-speed registers.
A number of architectures have been proposed to combine instruction-level context-switching capability with sequential scheduling. One such architecture is the Threaded Abstract Machine (TAM), which supports fine-grain interleaving of multiple threads through an appropriate compilation strategy rather than through elaborate hardware. Experiments on TAM have already shown that it is possible to implement the dataflow execution model on conventional architectures and obtain reasonable performance. These studies also reveal a basic mismatch between the requirements of fine-grain parallelism and the underlying architecture, and show that considerable improvement is possible through hardware support.
This thesis presents two design modifications to support fine-grain parallelism efficiently. First, a modification to the instruction set architecture is proposed to reduce the cost of scheduling and synchronization; the hardware changes are kept to a minimum so as not to disturb the functionality of a conventional RISC processor. Second, a separate coprocessor is used to handle messages, so that atomicity and message handling are managed efficiently without compromising per-processor performance or system integrity. Clock cycles per TAM instruction are used as the measure of the effectiveness of these changes.
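The compiler-managed synchronization at the heart of TAM-style fine-grain threading can be sketched roughly as follows (class and method names are assumed for illustration, not TAM's actual ISA): each thread carries an entry count equal to the number of inputs it waits for, every arriving input decrements that count, and the thread is placed on the scheduling queue only when the count reaches zero.

```python
# Rough sketch of entry-count synchronization for fine-grain threads,
# assuming a single activation frame with a simple continuation queue.

class Thread:
    def __init__(self, name, entry_count, body):
        self.name = name
        self.count = entry_count   # synchronization counter
        self.body = body

class Frame:
    """An activation frame holding a queue of enabled threads."""
    def __init__(self):
        self.queue = []            # enabled threads run to completion

    def post(self, thread):
        """Signal one input; enable the thread when all inputs arrived."""
        thread.count -= 1
        if thread.count == 0:
            self.queue.append(thread)

    def run(self):
        log = []
        while self.queue:
            t = self.queue.pop(0)
            t.body()
            log.append(t.name)
        return log

frame = Frame()
results = {}
consumer = Thread("consumer", 2,
                  lambda: results.update(s=results["a"] + results["b"]))

def make_producer(name, key, value):
    def body():
        results[key] = value       # deliver one operand
        frame.post(consumer)       # decrement the consumer's entry count
    return Thread(name, 1, body)

frame.post(make_producer("prod_a", "a", 3))
frame.post(make_producer("prod_b", "b", 4))
print(frame.run())                 # consumer runs only after both producers
```

The thesis's point is that `post`, a decrement-and-test, is exactly the operation cheap enough to absorb into the instruction set rather than execute as a software sequence.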
Program allocation for hypercube based dataflow systems
The dataflow model of computation differs from the traditional control-flow model in that it does not use a program counter to sequence the instructions of a program. Instead, the execution of instructions is driven solely by the availability of their operands: an instruction executes in a dataflow computer as soon as all of its operands are available. This asynchronous nature of the dataflow model allows the fine-grain parallelism inherent in programs to be exploited.
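The firing rule described above can be shown in a minimal sketch (the `Instruction` class and its slot-based interface are assumptions for illustration): each instruction is a graph node with operand slots, and it fires as soon as every slot is filled, regardless of the order in which tokens arrive.

```python
# Minimal sketch of the dataflow firing rule: no program counter, only
# token arrival, decides when an instruction executes.

class Instruction:
    def __init__(self, op, n_operands, targets=()):
        self.op = op
        self.slots = [None] * n_operands
        self.targets = targets     # (instruction, slot) pairs fed by result

    def receive(self, slot, token, fired):
        self.slots[slot] = token
        if all(v is not None for v in self.slots):   # firing rule
            result = self.op(*self.slots)
            fired.append(result)
            for inst, s in self.targets:             # forward result token
                inst.receive(s, result, fired)

# Graph for (2 + 3) * 4; tokens may arrive in any order.
mul = Instruction(lambda x, y: x * y, 2)
add = Instruction(lambda x, y: x + y, 2, targets=[(mul, 0)])

fired = []
mul.receive(1, 4, fired)     # the constant 4 arrives first; nothing fires
add.receive(0, 2, fired)
add.receive(1, 3, fired)     # add fires, its result token makes mul fire
print(fired)                 # results in firing order
```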
Although the dataflow model of computation exposes parallelism, the problem of optimally allocating a program to processors is NP-complete. One of the major issues facing designers of dataflow multiprocessors is therefore the proper allocation of programs to processors.
The program allocation problem lies in maximizing parallelism while minimizing interprocessor communication costs. This research culminates in a proposed method, the Balanced Layered Allocation Scheme, which uses heuristic rules to strike a balance between computation time and communication costs in dataflow multiprocessors. Specifically, the proposed scheme applies Critical Path and Longest Directed Path heuristics when allocating instructions to processors. Simulation studies indicate that the scheme effectively reduces the overall execution time of a program by accounting for the effects of communication costs on computation times.
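The Critical Path heuristic the scheme builds on can be sketched briefly (this is an illustration of the general heuristic, not the Balanced Layered Allocation Scheme itself, and the function names are assumed): compute, for each node of the dataflow graph, the length of the longest path from that node to a sink, then give scheduling priority to nodes on longer paths.

```python
# Longest-path-to-sink lengths over a dataflow DAG, the quantity the
# Critical Path heuristic uses to prioritize instructions for allocation.

def critical_path_lengths(succ, cost):
    """succ: node -> list of successors; cost: node -> execution time."""
    memo = {}
    def length(n):
        if n not in memo:
            memo[n] = cost[n] + max((length(s) for s in succ[n]), default=0)
        return memo[n]
    for n in succ:
        length(n)
    return memo

# Small dataflow DAG: a feeds b and c, and both feed d.
succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
cost = {"a": 1, "b": 3, "c": 1, "d": 1}
lengths = critical_path_lengths(succ, cost)
priority = sorted(succ, key=lambda n: -lengths[n])
print(lengths, priority)   # "a" and "b" lie on the critical path
```

A full allocator would additionally weigh interprocessor communication costs against these path lengths, which is the balance the proposed scheme aims for.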
High Performance Architecture using Speculative Threads and Dynamic Memory Management Hardware
With advances in very-large-scale integration (VLSI) technology, hundreds of billions of transistors can be packed into a single chip. Given this increased hardware budget, how to take advantage of the available hardware resources has become an important research area. Some researchers have shifted from the control-flow von Neumann architecture back to dataflow architecture in order to explore scalable designs leading to multi-core systems with several hundred processing elements. In this dissertation, I address how the performance of modern processing systems can be improved while reducing hardware complexity and energy consumption. My research tackles both central processing unit (CPU) performance and memory-subsystem performance. More specifically, I describe the design of an innovative decoupled multithreaded architecture that can be used in multi-core processor implementations. I also address how memory-management functions can be off-loaded from processing pipelines to further improve system performance and eliminate the cache pollution caused by runtime management functions.