63 research outputs found

    E-PUR: An Energy-Efficient Processing Unit for Recurrent Neural Networks

    Full text link
    Recurrent Neural Networks (RNNs) are a key technology for emerging applications such as automatic speech recognition, machine translation or image description. Long Short Term Memory (LSTM) networks are the most successful RNN implementation, as they can learn long term dependencies to achieve high accuracy. Unfortunately, the recurrent nature of LSTM networks significantly constrains the amount of parallelism and, hence, multicore CPUs and many-core GPUs exhibit poor efficiency for RNN inference. In this paper, we present E-PUR, an energy-efficient processing unit tailored to the requirements of LSTM computation. The main goal of E-PUR is to support large recurrent neural networks for low-power mobile devices. E-PUR provides an efficient hardware implementation of LSTM networks that is flexible to support diverse applications. One of its main novelties is a technique that we call Maximizing Weight Locality (MWL), which improves the temporal locality of the memory accesses for fetching the synaptic weights, reducing the memory requirements by a large extent. Our experimental results show that E-PUR achieves real-time performance for different LSTM networks, while reducing energy consumption by orders of magnitude with respect to general-purpose processors and GPUs, and it requires a very small chip area. Compared to a modern mobile SoC, an NVIDIA Tegra X1, E-PUR provides an average energy reduction of 92x

    Portable performance on heterogeneous architectures

    Get PDF
    Trends in both consumer and high performance computing are bringing not only more cores, but also increased heterogeneity among the computational resources within a single machine. In many machines, one of the greatest computational resources is now their graphics coprocessors (GPUs), not just their primary CPUs. But GPU programming and memory models differ dramatically from conventional CPUs, and the relative performance characteristics of the different processors vary widely between machines. Different processors within a system often perform best with different algorithms and memory usage patterns, and achieving the best overall performance may require mapping portions of programs across all types of resources in the machine. To address the problem of efficiently programming machines with increasingly heterogeneous computational resources, we propose a programming model in which the best mapping of programs to processors and memories is determined empirically. Programs define choices in how their individual algorithms may work, and the compiler generates further choices in how they can map to CPU and GPU processors and memory systems. These choices are given to an empirical autotuning framework that allows the space of possible implementations to be searched at installation time. The rich choice space allows the autotuner to construct poly-algorithms that combine many different algorithmic techniques, using both the CPU and the GPU, to obtain better performance than any one technique alone. Experimental results show that algorithmic changes, and the varied use of both CPUs and GPUs, are necessary to obtain up to a 16.5x speedup over using a single program configuration for all architectures.United States. Dept. of Energy (Award DE-SC0005288)United States. Defense Advanced Research Projects Agency (Award HR0011-10-9-0009)National Science Foundation (U.S.) (Award CCF-0632997

    A RECONFIGURABLE AND EXTENSIBLE EXPLORATION PLATFORM FOR FUTURE HETEROGENEOUS SYSTEMS

    Get PDF
    Accelerator-based -or heterogeneous- computing has become increasingly important in a variety of scenarios, ranging from High-Performance Computing (HPC) to embedded systems. While most solutions use sometimes custom-made components, most of today’s systems rely on commodity highend CPUs and/or GPU devices, which deliver adequate performance while ensuring programmability, productivity, and application portability. Unfortunately, pure general-purpose hardware is affected by inherently limited power-efficiency, that is, low GFLOPS-per-Watt, now considered as a primary metric. The many-core model and architectural customization can play here a key role, as they enable unprecedented levels of power-efficiency compared to CPUs/GPUs. However, such paradigms are still immature and deeper exploration is indispensable. This dissertation investigates customizability and proposes novel solutions for heterogeneous architectures, focusing on mechanisms related to coherence and network-on-chip (NoC). First, the work presents a non-coherent scratchpad memory with a configurable bank remapping system to reduce bank conflicts. The experimental results show the benefits of both using a customizable hardware bank remapping function and non-coherent memories for some types of algorithms. Next, we demonstrate how a distributed synchronization master better suits many-cores than standard centralized solutions. This solution, inspired by the directory-based coherence mechanism, supports concurrent synchronizations without relying on memory transactions. The results collected for different NoC sizes provided indications about the area overheads incurred by our solution and demonstrated the benefits of using a dedicated hardware synchronization support. Finally, this dissertation proposes an advanced coherence subsystem, based on the sparse directory approach, with a selective coherence maintenance system which allows coherence to be deactivated for blocks that do not require it. Experimental results show that the use of a hybrid coherent and non-coherent architectural mechanism along with an extended coherence protocol can enhance performance. The above results were all collected by means of a modular and customizable heterogeneous many-core system developed to support the exploration of power-efficient high-performance computing architectures. The system is based on a NoC and a customizable GPU-like accelerator core, as well as a reconfigurable coherence subsystem, ensuring application-specific configuration capabilities. All the explored solutions were evaluated on this real heterogeneous system, which comes along with the above methodological results as part of the contribution in this dissertation. In fact, as a key benefit, the experimental platform enables users to integrate novel hardware/software solutions on a full-system scale, whereas existing platforms do not always support a comprehensive heterogeneous architecture exploration

    Castell: a heterogeneous cmp architecture scalable to hundreds of processors

    Get PDF
    Technology improvements and power constrains have taken multicore architectures to dominate microprocessor designs over uniprocessors. At the same time, accelerator based architectures have shown that heterogeneous multicores are very efficient and can provide high throughput for parallel applications, but with a high-programming effort. We propose Castell a scalable chip multiprocessor architecture that can be programmed as uniprocessors, and provides the high throughput of accelerator-based architectures. Castell relies on task-based programming models that simplify software development. These models use a runtime system that dynamically finds, schedules, and adds hardware-specific features to parallel tasks. One of these features is DMA transfers to overlap computation and data movement, which is known as double buffering. This feature allows applications on Castell to tolerate large memory latencies and lets us design the memory system focusing on memory bandwidth. In addition to provide programmability and the design of the memory system, we have used a hierarchical NoC and added a synchronization module. The NoC design distributes memory traffic efficiently to allow the architecture to scale. The synchronization module is a consequence of the large performance degradation of application for large synchronization latencies. Castell is mainly an architecture framework that enables the definition of domain-specific implementations, fine-tuned to a particular problem or application. So far, Castell has been successfully used to propose heterogeneous multicore architectures for scientific kernels, video decoding (using H.264), and protein sequence alignment (using Smith-Waterman and clustalW). It has also been used to explore a number of architecture optimizations such as enhanced DMA controllers, and architecture support for task-based programming models. ii

    Parallel and Distributed Computing

    Get PDF
    The 14 chapters presented in this book cover a wide variety of representative works ranging from hardware design to application development. Particularly, the topics that are addressed are programmable and reconfigurable devices and systems, dependability of GPUs (General Purpose Units), network topologies, cache coherence protocols, resource allocation, scheduling algorithms, peertopeer networks, largescale network simulation, and parallel routines and algorithms. In this way, the articles included in this book constitute an excellent reference for engineers and researchers who have particular interests in each of these topics in parallel and distributed computing

    A Study of the Potential of Locality-Aware Thread Scheduling for GPUs

    Full text link

    A novel access pattern-based multi-core memory architecture

    Get PDF
    Increasingly High-Performance Computing (HPC) applications run on heterogeneous multi-core platforms. The basic reason of the growing popularity of these architectures is their low power consumption, and high throughput oriented nature. However, this throughput imposes a requirement on the data to be supplied in a high throughput manner for the multi-core system. This results in the necessity of an efficient management of on-chip and off-chip memory data transfers, which is a significant challenge. Complex regular and irregular memory data transfer patterns are becoming widely dominant for a range of application domains including the scientific, image and signal processing. Data accesses can be arranged in independent patterns that an efficient memory management can exploit. The software based approaches using general purpose caches and on-chip memories are beneficial to some extent. However, the task of efficient data management for the throughput oriented devices could be improved by providing hardware mechanisms that exploit the knowledge of access patterns in memory management and scheduling of accesses for a heterogeneous multi-core architecture. The focus of this thesis is to present architectural explorations for a novel access pattern-based multi-core memory architecture. In general, the thesis covers four main aspects of memory system in this research. These aspects can be categorized as: i) Uni-core Memory System for Regular Data Pattern. ii) Multi-core Memory System for Regular Data Pattern. iii) Uni-core Memory System for Irregular Data Pattern. and iv) Multi-core Memory System for Irregular Data Pattern.Les aplicacions de computació d'alt rendiment (HPC) s'executen cada vegada més en plataformes heterogènies de múltiples nuclis. El motiu bàsic de la creixent popularitat d'aquestes arquitectures és el seu baix consum i la seva natura orientada a alt throughput. No obstant, aquest thoughput imposa el requeriment de que les dades es proporcionin al sistema també amb alt throughput. Això resulta en la necessitat de gestionar eficientment les trasferències de memòria (dins i fora del chip), un repte significatiu. Els patrons de transferències de memòria regulars però complexos així com els irregulars són cada vegada més dominants per a diversos dominis d'aplicacions, incloent el científic i el processat d'imagte i senyals. Aquests accessos a dades poden ser organitzats en patrons independents que un gestor de memòria eficient pot explotar. Els mètodes basats en programari emprant memòries cau de propòsit general i memòries al chip són beneficioses fins a cert punt. No obstant, la tasca de gestionar eficientment les transferències de dades per a dispositius orientats a throughput pot ser millorada oferint mecanismes hardware que explotin el coneixement dels patrons d'accés de les aplicacions, així com la planificació dels accessos a una arquitectura de múltiples nuclis. Aquesta tesis està enfocada a explorar una arquitectura de memòria novedosa per a processadors de múltiples nuclis, basada en els patrons d'accés. En general, la recerca de la tesis cobreix quatres aspectes principals del sistema de memòria. Aquests aspectes són: i) sistema de memòria per a un únic nucli amb patrons regulars, ii) sistema de memòria per a múltiples nuclis amb patrons regulars, iii) sistema de memòria per a un únic nucli amb patrons irregulars, iv) sistema de memòria per a múltiples nuclis amb patrons irregulars

    Runtime-assisted optimizations in the on-chip memory hierarchy

    Get PDF
    Following Moore's Law, the number of transistors on chip has been increasing exponentially, which has led to the increasing complexity of modern processors. As a result, the efficient programming of such systems has become more difficult. Many programming models have been developed to answer this issue. Of particular interest are task-based programming models that employ simple annotations to define parallel work in an application. The information available at the level of the runtime systems associated with these programming models offers great potential for improving hardware design. Moreover, due to technological limitations, Moore's Law is predicted to eventually come to an end, so novel paradigms are necessary to maintain the current performance improvement trends. The main goal of this thesis is to exploit the knowledge about a parallel application available at the runtime system level to improve the design of the on-chip memory hierarchy. The coupling of the runtime system and the microprocessor enables a better hardware design without hurting the programmability. The first contribution is a set of insertion policies for shared last-level caches that exploit information about tasks and task data dependencies. The intuition behind this proposal revolves around the observation that parallel threads exhibit different memory access patterns. Even within the same thread, accesses to different variables often follow distinct patterns. The proposed policies insert cache lines into different logical positions depending on the dependency type and task type to which the corresponding memory request belongs. The second proposal optimizes the execution of reductions, defined as a programming pattern that combines input data to form the resulting reduction variable. This is achieved with a runtime-assisted technique for performing reductions in the processor's cache hierarchy. The proposal's goal is to be a universally applicable solution regardless of the reduction variable type, size and access pattern. On the software level, the programming model is extended to let a programmer specify the reduction variables for tasks, as well as the desired cache level where a certain reduction will be performed. The source-to-source compiler and the runtime system are extended to translate and forward this information to the underlying hardware. On the hardware level, private and shared caches are equipped with functional units and the accompanying logic to perform reductions at the cache level. This design avoids unnecessary data movements to the core and back as the data is operated at the place where it resides. The third contribution is a runtime-assisted prioritization scheme for memory requests inside the on-chip memory hierarchy. The proposal is based on the notion of a critical path in the context of parallel codes and a known fact that accelerating critical tasks reduces the execution time of the whole application. In the context of this work, task criticality is observed at a level of a task type as it enables simple annotation by the programmer. The acceleration of critical tasks is achieved by the prioritization of corresponding memory requests in the microprocessor.Siguiendo la ley de Moore, el número de transistores en los chips ha crecido exponencialmente, lo que ha comportado una mayor complejidad en los procesadores modernos y, como resultado, de la dificultad de la programación eficiente de estos sistemas. Se han desarrollado muchos modelos de programación para resolver este problema; un ejemplo particular son los modelos de programación basados en tareas, que emplean anotaciones sencillas para definir los Trabajos paralelos de una aplicación. La información de que disponen los sistemas en tiempo de ejecución (runtime systems) asociada con estos modelos de programación ofrece un enorme potencial para la mejora del diseño del hardware. Por otro lado, las limitaciones tecnológicas hacen que la ley de Moore pueda dejar de cumplirse próximamente, por lo que se necesitan paradigmas nuevos para mantener las tendencias actuales de mejora de rendimiento. El objetivo principal de esta tesis es aprovechar el conocimiento de las aplicaciones paral·leles de que dispone el runtime system para mejorar el diseño de la jerarquía de memoria del chip. El acoplamiento del runtime system junto con el microprocesador permite realizar mejores diseños hardware sin afectar Negativamente en la programabilidad de dichos sistemas. La primera contribución de esta tesis consiste en un conjunto de políticas de inserción para las memorias caché compartidas de último nivel que aprovecha la información de las tareas y las dependencias de datos entre estas. La intuición tras esta propuesta se basa en la observación de que los hilos de ejecución paralelos muestran distintos patrones de acceso a memoria e, incluso dentro del mismo hilo, los accesos a diferentes variables a menudo siguen patrones distintos. Las políticas que se proponen insertan líneas de caché en posiciones lógicas diferentes en función de los tipos de dependencia y tarea a los que corresponde la petición de memoria. La segunda propuesta optimiza la ejecución de las reducciones, que se definen como un patrón de programación que combina datos de entrada para conseguir la variable de reducción como resultado. Esto se consigue mediante una técnica asistida por el runtime system para la realización de reducciones en la jerarquía de la caché del procesador, con el objetivo de ser una solución aplicable de forma universal sin depender del tipo de la variable de la reducción, su tamaño o el patrón de acceso. A nivel de software, el modelo de programación se extiende para que el programador especifique las variables de reducción de las tareas, así como el nivel de caché escogido para que se realice una determinada reducción. El compilador fuente a Fuente (compilador source-to-source) y el runtime ssytem se modifican para que traduzcan y pasen esta información al hardware subyacente, evitando así movimientos de datos innecesarios hacia y desde el núcleo del procesador, al realizarse la operación donde se encuentran los datos de la misma. La tercera contribución proporciona un esquema de priorización asistido por el runtime system para peticiones de memoria dentro de la jerarquía de memoria del chip. La propuesta se basa en la noción de camino crítico en el contexto de los códigos paralelos y en el hecho conocido de que acelerar tareas críticas reduce el tiempo de ejecución de la aplicación completa. En el contexto de este trabajo, la criticidad de las tareas se considera a nivel del tipo de tarea ya que permite que el programador las indique mediante anotaciones sencillas. La aceleración de las tareas críticas se consigue priorizando las correspondientes peticiones de memoria en el microprocesador.Seguint la llei de Moore, el nombre de transistors que contenen els xips ha patit un creixement exponencial, fet que ha provocat un augment de la complexitat dels processadors moderns i, per tant, de la dificultat de la programació eficient d’aquests sistemes. Per intentar solucionar-ho, s’han desenvolupat diversos models de programació; un exemple particular en són els models basats en tasques, que fan servir anotacions senzilles per definir treballs paral·lels dins d’una aplicació. La informació que hi ha al nivell dels sistemes en temps d’execució (runtime systems) associada amb aquests models de programació ofereix un gran potencial a l’hora de millorar el disseny del maquinari. D’altra banda, les limitacions tecnològiques fan que la llei de Moore pugui deixar de complir-se properament, per la qual cosa calen nous paradigmes per mantenir les tendències actuals en la millora de rendiment. L’objectiu principal d’aquesta tesi és aprofitar els coneixements que el runtime System té d’una aplicació paral·lela per millorar el disseny de la jerarquia de memòria dins el xip. L’acoblament del runtime system i el microprocessador permet millorar el disseny del maquinari sense malmetre la programabilitat d’aquests sistemes. La primera contribució d’aquesta tesi consisteix en un conjunt de polítiques d’inserció a les memòries cau (cache memories) compartides d’últim nivell que aprofita informació sobre tasques i les dependències de dades entre aquestes. La intuïció que hi ha al darrere d’aquesta proposta es basa en el fet que els fils d’execució paral·lels mostren diferents patrons d’accés a la memòria; fins i tot dins el mateix fil, els accessos a variables diferents sovint segueixen patrons diferents. Les polítiques que s’hi proposen insereixen línies de la memòria cau a diferents ubicacions lògiques en funció dels tipus de dependència i de tasca als quals correspon la petició de memòria. La segona proposta optimitza l’execució de les reduccions, que es defineixen com un patró de programació que combina dades d’entrada per aconseguir la variable de reducció com a resultat. Això s’aconsegueix mitjançant una tècnica assistida pel runtime system per dur a terme reduccions en la jerarquia de la memòria cau del processador, amb l’objectiu que la proposta sigui aplicable de manera universal, sense dependre del tipus de la variable a la qual es realitza la reducció, la seva mida o el patró d’accés. A nivell de programari, es realitza una extensió del model de programació per facilitar que el programador especifiqui les variables de les reduccions que usaran les tasques, així com el nivell de memòria cau desitjat on s’hauria de realitzar una certa reducció. El compilador font a font (compilador source-to-source) i el runtime system s’amplien per traduir i passar aquesta informació al maquinari subjacent. A nivell de maquinari, les memòries cau privades i compartides s’equipen amb unitats funcionals i la lògica corresponent per poder dur a terme les reduccions a la pròpia memòria cau, evitant així moviments de dades innecessaris entre el nucli del processador i la jerarquia de memòria. La tercera contribució proporciona un esquema de priorització assistit pel runtime System per peticions de memòria dins de la jerarquia de memòria del xip. La proposta es basa en la noció de camí crític en el context dels codis paral·lels i en el fet conegut que l’acceleració de les tasques que formen part del camí crític redueix el temps d’execució de l’aplicació sencera. En el context d’aquest treball, la criticitat de les tasques s’observa al nivell del seu tipus ja que permet que el programador les indiqui mitjançant anotacions senzilles. L’acceleració de les tasques crítiques s’aconsegueix prioritzant les corresponents peticions de memòria dins el microprocessador

    Optimizations and Cost Models for multi-core architectures: an approach based on parallel paradigms

    Get PDF
    The trend in modern microprocessor architectures is clear: multi-core chips are here to stay, and researchers expect multiprocessors with 128 to 1024 cores on a chip in some years. Yet the software community is slowly taking the path towards parallel programming: while some works target multi-cores, these are usually inherited from the previous tools for SMP architectures, and rarely exploit specific characteristics of multi-cores. But most important, current tools have no facilities to guarantee performance or portability among architectures. Our research group was one of the first to propose the structured parallel programming approach to solve the problem of performance portability and predictability. This has been successfully demonstrated years ago for distributed and shared memory multiprocessors, and we strongly believe that the same should be applied to multi-core architectures. The main problem with performance portability is that optimizations are effective only under specific conditions, making them dependent on both the specific program and the target architecture. For this reason in current parallel programming (in general, but especially with multi-cores) optimizations usually follows a try-and-decide approach: each one must be implemented and tested on the specific parallel program to understand its benefits. If we want to make a step forward and really achieve some form of performance portability, we require some kind of prediction of the expected performance of a program. The concept of performance modeling is quite old in the world of parallel programming; yet, in the last years, this kind of research saw small improvements: cost models to describe multi-cores are missing, mainly because of the increasing complexity of microarchitectures and the poor knowledge of specific implementation details of current processors. In the first part of this thesis we prove that the way of performance modeling is still feasible, by studying the Tilera TilePro64. The high number of cores on-chip in this processor (64) required the use of several innovative solutions, such as a complex interconnection network and the use of multiple memory interfaces per chip. For these features the TilePro64 can be considered an insight of what to expect in future multi-core processors. The availability of a cycle-accurate simulator and an extensive documentation allowed us to model the architecture, and in particular its memory subsystem, at the accuracy level required to compare optimizations In the second part, focused on optimizations, we cover one of the most important issue of multi-core architectures: the memory subsystem. In this area multi-core strongly differs in their structure w.r.t off-chip parallel architectures, both SMP and NUMA, thus opening new opportunities. In detail, we investigate the problem of data distribution over the memory controllers in several commercial multi-cores, and the efficient use of the cache coherency mechanisms offered by the TilePro64 processor. Finally, by using the performance model, we study different implementations, derived from the previous optimizations, of a simple test-case application. We are able to predict the best version using only profiled data from a sequential execution. The accuracy of the model has been verified by experimentally comparing the implementations on the real architecture, giving results within 1 − 2% of accuracy
    • …
    corecore