Search CORE

37 research outputs found

Piattaforme multicore e integrazione tri-dimensionale: analisi architetturale e ottimizzazione

Author: Burgio Paolo <1981>
Publication venue: Alma Mater Studiorum - Università di Bologna
Publication date: 18/12/2013
Field of study

Modern embedded systems embrace many-core shared-memory designs. Due to constrained power and area budgets, most of them feature software-managed scratchpad memories instead of data caches to increase the data locality. It is therefore programmers’ responsibility to explicitly manage the memory transfers, and this make programming these platform cumbersome. Moreover, complex modern applications must be adequately parallelized before they can the parallel potential of the platform into actual performance. To support this, programming languages were proposed, which work at a high level of abstraction, and rely on a runtime whose cost hinders performance, especially in embedded systems, where resources and power budget are constrained. This dissertation explores the applicability of the shared-memory paradigm on modern many-core systems, focusing on the ease-of-programming. It focuses on OpenMP, the de-facto standard for shared memory programming. In a first part, the cost of algorithms for synchronization and data partitioning are analyzed, and they are adapted to modern embedded many-cores. Then, the original design of an OpenMP runtime library is presented, which supports complex forms of parallelism such as multi-level and irregular parallelism. In the second part of the thesis, the focus is on heterogeneous systems, where hardware accelerators are coupled to (many-)cores to implement key functional kernels with orders-of-magnitude of speedup and energy efficiency compared to the “pure software” version. However, three main issues rise, namely i) platform design complexity, ii) architectural scalability and iii) programmability. To tackle them, a template for a generic hardware processing unit (HWPU) is proposed, which share the memory banks with cores, and the template for a scalable architecture is shown, which integrates them through the shared-memory system. Then, a full software stack and toolchain are developed to support platform design and to let programmers exploiting the accelerators of the platform. The OpenMP frontend is extended to interact with it.I sistemi integrati moderni sono architetture many-core, in cui spesso lo spazio di memoria è condiviso fra i processori. Per ridurre i consumi, molte di queste architetture sostituiscono le cache dati con memorie scratchpad gestite in software, per massimizzarne la località alle CPU e aumentare le performance. Questo significa che i dati devono essere spostati manualmente da parte del programmatore. Inoltre, tradurre in perfomance l’enorme parallelismo potenziale delle piattaforme many-core non è semplice. Per supportare la programmazione, diversi programming model sono stati proposti, e siccome lavorano ad un alto livello di astrazione, sfruttano delle librerie di runtime che forniscono servizi di base quali sincronizzazione, allocazione della memoria, threading. Queste librerie hanno un costo, che nei sistemi integrati è troppo elevato e ostacola il raggiungimento delle piene performance. Questa tesi analizza come un programming model ad alto livello di astrazione – OpenMP – possa essere efficientemente supportato, se il suo stack software viene adattato per sfruttare al meglio la piattaforma sottostante. In una prima parte, studio diversi meccanismi di sincronizzazione e comunicazione fra thread paralleli, portati sulle piattaforme many-core. In seguito, li utilizzo per scrivere un runtime di supporto a OpenMP che sia il più possibile efficente e “leggero” e che supporti paradigmi di parallelismo multi-livello e irregolare, spesso presenti nelle applicazioni moderne. Una seconda parte della tesi esplora le architetture eterogenee, ossia con acceleratori hardware. Queste architetture soffrono di problematiche sia i) per il processo di design della piattaforma, che ii) di scalabilità della piattaforma stessa (aumento del numero degli acceleratori e dei processori), che iii) di programmabilità. La tesi propone delle soluzioni a tutti e tre i problemi. Il linguaggio di programmazione usato è OpenMP, sia per la sua grande espressività a livello semantico, sia perché è lo standard de-facto per programmare sistemi a memoria condivisa

A Survey on Hardware-aware and Heterogeneous Computing on Multicore Processors and Accelerators

Author: Buchty Rainer
Heuveline Vincent
Karl Wolfgang
Weiß Jan-Philipp
Publication venue: Karlsruher Institut für Technologie
Publication date: 01/01/2009
Field of study

Neuraghe: Exploiting CPU-FPGA synergies for efficient and flexible CNN inference acceleration on zynQ SoCs

Author: Benini L.
Brian M.
Capotondi A.
Conti F.
Deriu G.
Meloni P.
Raffo L.
Rossi D.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 04/12/2017
Field of study

Deep convolutional neural networks (CNNs) obtain outstanding results in tasks that require human-level understanding of data, like image or speech recognition. However, their computational load is significant, motivating the development of CNN-specialized accelerators. This work presents NEURAghe, a flexible and efficient hardware/software solution for the acceleration of CNNs on Zynq SoCs. NEURAghe leverages the synergistic usage of Zynq ARM cores and of a powerful and flexible Convolution-Specific Processor deployed on the reconfigurable logic. The Convolution-Specific Processor embeds both a convolution engine and a programmable soft core, releasing the ARM processors from most of the supervision duties and allowing the accelerator to be controlled by software at an ultra-fine granularity. This methodology opens the way for cooperative heterogeneous computing: While the accelerator takes care of the bulk of the CNN workload, the ARM cores can seamlessly execute hard-to-accelerate parts of the computational graph, taking advantage of the NEON vector engines to further speed up computation. Through the companion NeuDNN SW stack, NEURAghe supports end-to-end CNN-based classification with a peak performance of 169GOps/s and an energy efficiency of 17GOps/W. Thanks to our heterogeneous computing model, our platform improves upon the state-of-the-art, achieving a frame rate of 5.5 frames per second (fps) on the end-to-end execution of VGG-16 and 6.6fps on ResNet-18

arXiv.org e-Print Archive

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Hardware runtime management for task-based programming models

Author: Tan Xubin
Publication venue: Universitat Politècnica de Catalunya
Publication date: 26/11/2018
Field of study

Task-based programming models allow programmers to express applications as a collection of tasks with dependences. They are simple to use and greatly improve programmability by using software runtimes to exploit task parallelism and heterogeneity over multi-core, many-core and heterogeneous platforms. In these programming models, the runtimes guarantee correct execution order by managing tasks using task-dependence graphs (TDGs). These runtimes are powerful enough to provide high performance with coarse-grained tasks although they impose overheads on the application execution to maintain all the information they need to do their work. However, as the current trend in processor architectures keeps including more cores and heterogeneity (in fact complexity) in the systems, coarse-grained parallelism is not enough to feed all the underlying resources. Instead, fine-grained tasks are preferable as they are able to expose higher parallelism in applications but the overheads introduced by the software runtimes under these conditions prevent an efficient exploitation of fine-grained parallelism. The two most critical runtime overheads are task dependence graph management and task scheduling to heterogeneous systems. We propose a hardware architecture Picos, consisting of a hardware task dependence manager including nested task support, and a heterogeneous task scheduler, to accelerate the critical runtime functions for task-based programming models. With Picos, we aim at extending the benefit of these programming models into exploiting fine-grained task parallelism and heterogeneity. As a proof-of-concept, Three prototypes of Picos have been designed in VHDL and implemented in a System-on-chip platform consisting of regular ARM SMP cores and an integrated FPGA. They also have been analyzed with real benchmarks with OmpSs running and Linux on the platform. The first prototype is a hardware task dependence manager, which has been implemented in a Xilinx Zynq 7000 series SoCs. It is connected to a 2-core ARM Cortex A9 processor, with bare-metal OS integration. With 24 simulated workers, and running real task-dependence analysis in Picos, it scales up to 21x speedup. The second prototype Picos++ extended Picos with an exciting new feature for nested task support in hardware. To the best of our knowledge, this is the first time that such a feature has been support fully in hardware task dependence managers. This prototype is fully integrated in not only hardware, but also with a State-of-the-Art parallel programming model, and with Linux. The third prototype includes both a hardware task dependence manager and a heterogeneous task scheduler. The heterogeneous task scheduler receives ready tasks from the task-dependence manager and then schedule them to hardware execution units that have the estimated earliest finish time. It is implemented in a Xilinx Zynq Ultrascale+ MPSoC chip. In a system with 4 threads and up to 15 HW accelerators, it achieves up to 16.2x speedup for real benchmarks, and saves up to 90% of energy.Los modelos de programación basados en tareas permiten a los programadores expresar las aplicaciones como una colección de tareas con dependencias entre ellas. Dichos modelos son simples de usar y mejoran enormemente la programabilidad. Para ello se valen del uso de una runtime que en tiempo de ejecución ayuda a explotar el paralelismo de las tareas cuando se ejecutan en plataformas multi-cores, many-cores y heterogéneas. En estos modelos de programación los runtimes garantizan la ejecución de las tareas en el orden correcto mediante el uso de gráficos de dependencias entre tareas (TDG). Actualmente, los runtimes son lo suficientemente potentes para proporcionar un alto rendimiento con tareas de granularidad gruesa a pesar de que para mantener toda la información que necesitan para hacer su trabajo, introducen un sobrecoste importante en la ejecución de las aplicaciones. El problema viene dado por la tendencia actual en arquitectura de computadores a seguir incluyendo más núcleos y heterogeneidad (de hecho, complejidad) en los sistemas de procesado con lo que el paralelismo de granularidad gruesa no es suficiente para alimentar todos los recursos. En estos entornos complejos las tareas de granularidad fina son preferibles ya que son capaces de exponer un mayor paralelismo de las aplicaciones. Sin embargo, con tareas de granularidad fina, los sobrecostes introducidos por los runtimes software son mayores debido a la necesidad de manejar muchas más tareas más rápido. En general los mayores sobrecostes introducidos por los runtimes son: la administración de los grafos de dependencias que relacionan las tareas y la gestión de las tareas en sistemas heterogéneos. Proponemos una arquitectura hardware, llamada Picos, que consiste en un administrador de dependencias entre tareas incluyendo soporte para tareas anidadas y planificación de tareas heterogéneas. La función principal de dicha arquitectura es acelerar las funciones críticas de los runtimes para modelos de programación basados en tareas. Con Picos, se pretende extender el beneficio de estos modelos de programación para explotar el paralelismo y la heterogeneidad ejecutando tareas de granularidad fina. Como prueba de concepto, tres prototipos de Picos han sido diseñado en VHDL e implementado en una plataforma System-on-chip que consta de varios núcleos ARM integrados junto con una FPGA, y ademas analizados con ejecuciones reales con OmpSs y con Linux. El primer prototipo es un administrador hardware de tareas con dependencias, que se ha implementado en un SoC Xilinx Zynq serie 7000. Está conectado a un procesador ARM Cortex A9 de 2 núcleos, e integrado con el SO. Con 24 núcleos simulados y realizando el análisis real de las dependencias entre tareas en Picos, obtiene hasta un 21x sobre las mismas ejecuciones usando el entorno software. El segundo prototipo, Picos++, amplió Picos incorporando el soporte para la gestión de tareas anidadas en hardware. Hasta donde llega nuestro conocimiento, esta es la primera vez que dicha característica ha sido propuesta y/o incorporada en un administrador hardware de dependencias entre tareas. El segundo prototipo está completamente integrado en el sistema, no solo en hardware, sino también con el modelo de programación paralelo y con el sistema operativo. El tercer prototipo, incluye un administrador y planificador de tareas heterogéneas. El planificador de tareas heterogéneas recibe dichas tareas listas del administrador de dependencias entre tareas y las programa en la unidad de ejecución de hardware adecuada que tenga el tiempo de finalización estimado más corto. Este prototipo se ha implementado en un chip MPSoC Xilinx Zynq Ultrascale+. En dicho sistema con cuatro núcleos ARM y hasta 15 aceleradores HW funcionales, logra una aceleración de hasta 16.2x, y ahorra hasta el 90% de la energía con respecto al software

Hardware runtime management for task-based programming models

Author: Tan Xubin
Publication venue: Universitat Politècnica de Catalunya
Publication date: 01/01/2018
Field of study

A Dataflow Framework For Developing Flexible Embedded Accelerators A Computer Vision Case Study.

Author: Stoutchinin Arthur <1967>
Publication venue: Alma Mater Studiorum - Università di Bologna
Publication date: 08/04/2019
Field of study

The focus of this dissertation is the design and the implementation of a computing platform which can accelerate data processing in the embedded computation domain. We focus on a heterogeneous computing platform, whose hardware implementation can approach the power and area efficiency of specialized designs, while remaining flexible across the application domain. The multi-core architectures require parallel programming, which is widely-regarded as more challenging than sequential programming. Although shared memory parallel programs may be fairly easy to write (using OpenMP, for example), they are quite hard to optimize; providing embedded application developers with optimizing tools and programming frameworks is a challenge. The heterogeneous specialized elements make the problem even more difficult. Dataflow is a parallel computation model that relies exclusively on message passing, and that has some advantages over parallel programming tools in wide use today: simplicity, graphical representation, and determinism. Dataflow model is also a good match to streaming applications, such as audio, video and image processing, which operate on large sequences of data and are characterized by abundant parallelism and regular memory access patterns. Dataflow model of computation has gained acceptance in simulation and signal-processing communities. This thesis evaluates the applicability of the dataflow model for implementing domain-specific embedded accelerators for streaming applications

Parallel architectures and runtime systems co-design for task-based programming models

Author: Castillo Villar Emilio
Publication venue: Universitat Politècnica de Catalunya
Publication date: 01/01/2019
Field of study

The increasing parallelism levels in modern computing systems has extolled the need for a holistic vision when designing multiprocessor architectures taking in account the needs of the programming models and applications. Nowadays, system design consists of several layers on top of each other from the architecture up to the application software. Although this design allows to do a separation of concerns where it is possible to independently change layers due to a well-known interface between them, it is hampering future systems design as the Law of Moore reaches to an end. Current performance improvements on computer architecture are driven by the shrinkage of the transistor channel width, allowing faster and more power efficient chips to be made. However, technology is reaching physical limitations were the transistor size will not be able to be reduced furthermore and requires a change of paradigm in systems design. This thesis proposes to break this layered design, and advocates for a system where the architecture and the programming model runtime system are able to exchange information towards a common goal, improve performance and reduce power consumption. By making the architecture aware of runtime information such as a Task Dependency Graph (TDG) in the case of dataflow task-based programming models, it is possible to improve power consumption by exploiting the critical path of the graph. Moreover, the architecture can provide hardware support to create such a graph in order to reduce the runtime overheads and making possible the execution of fine-grained tasks to increase the available parallelism. Finally, the current status of inter-node communication primitives can be exposed to the runtime system in order to perform a more efficient communication scheduling, and also creates new opportunities of computation and communication overlap that were not possible before. An evaluation of the proposals introduced in this thesis is provided and a methodology to simulate and characterize the application behavior is also presented.El aumento del paralelismo proporcionado por los sistemas de cómputo modernos ha provocado la necesidad de una visión holística en el diseño de arquitecturas multiprocesador que tome en cuenta las necesidades de los modelos de programación y las aplicaciones. Hoy en día el diseño de los computadores consiste en diferentes capas de abstracción con una interfaz bien definida entre ellas. Las limitaciones de esta aproximación junto con el fin de la ley de Moore limitan el potencial de los futuros computadores. La mayoría de las mejoras actuales en el diseño de los computadores provienen fundamentalmente de la reducción del tamaño del canal del transistor, lo cual permite chips más rápidos y con un consumo eficiente sin apenas cambios fundamentales en el diseño de la arquitectura. Sin embargo, la tecnología actual está alcanzando limitaciones físicas donde no será posible reducir el tamaño de los transistores motivando así un cambio de paradigma en la construcción de los computadores. Esta tesis propone romper este diseño en capas y abogar por un sistema donde la arquitectura y el sistema de tiempo de ejecución del modelo de programación sean capaces de intercambiar información para alcanzar una meta común: La mejora del rendimiento y la reducción del consumo energético. Haciendo que la arquitectura sea consciente de la información disponible en el modelo de programación, como puede ser el grafo de dependencias entre tareas en los modelos de programación dataflow, es posible reducir el consumo energético explotando el camino critico del grafo. Además, la arquitectura puede proveer de soporte hardware para crear este grafo con el objetivo de reducir el overhead de construir este grado cuando la granularidad de las tareas es demasiado fina. Finalmente, el estado de las comunicaciones entre nodos puede ser expuesto al sistema de tiempo de ejecución para realizar una mejor planificación de las comunicaciones y creando nuevas oportunidades de solapamiento entre cómputo y comunicación que no eran posibles anteriormente. Esta tesis aporta una evaluación de todas estas propuestas, así como una metodología para simular y caracterizar el comportamiento de las aplicacionesPostprint (published version

Many-Core Architectures: Hardware-Software Optimization and Modeling Techniques

Author: Pinto Christian <1986>
Publication venue: Alma Mater Studiorum - Università di Bologna
Publication date: 19/05/2015
Field of study

During the last few decades an unprecedented technological growth has been at the center of the embedded systems design paramount, with Moore’s Law being the leading factor of this trend. Today in fact an ever increasing number of cores can be integrated on the same die, marking the transition from state-of-the-art multi-core chips to the new many-core design paradigm. Despite the extraordinarily high computing power, the complexity of many-core chips opens the door to several challenges. As a result of the increased silicon density of modern Systems-on-a-Chip (SoC), the design space exploration needed to find the best design has exploded and hardware designers are in fact facing the problem of a huge design space. Virtual Platforms have always been used to enable hardware-software co-design, but today they are facing with the huge complexity of both hardware and software systems. In this thesis two different research works on Virtual Platforms are presented: the first one is intended for the hardware developer, to easily allow complex cycle accurate simulations of many-core SoCs. The second work exploits the parallel computing power of off-the-shelf General Purpose Graphics Processing Units (GPGPUs), with the goal of an increased simulation speed. The term Virtualization can be used in the context of many-core systems not only to refer to the aforementioned hardware emulation tools (Virtual Platforms), but also for two other main purposes: 1) to help the programmer to achieve the maximum possible performance of an application, by hiding the complexity of the underlying hardware. 2) to efficiently exploit the high parallel hardware of many-core chips in environments with multiple active Virtual Machines. This thesis is focused on virtualization techniques with the goal to mitigate, and overtake when possible, some of the challenges introduced by the many-core design paradigm

Optimization Techniques for Parallel Programming of Embedded Many-Core Computing Platforms

Author: Tagliavini Giuseppe <1980>
Publication venue: Alma Mater Studiorum - Università di Bologna
Publication date: 08/05/2017
Field of study

Nowadays many-core computing platforms are widely adopted as a viable solution to accelerate compute-intensive workloads at different scales, from low-cost devices to HPC nodes. It is well established that heterogeneous platforms including a general-purpose host processor and a parallel programmable accelerator have the potential to dramatically increase the peak performance/Watt of computing architectures. However the adoption of these platforms further complicates application development, whereas it is widely acknowledged that software development is a critical activity for the platform design. The introduction of parallel architectures raises the need for programming paradigms capable of effectively leveraging an increasing number of processors, from two to thousands. In this scenario the study of optimization techniques to program parallel accelerators is paramount for two main objectives: first, improving performance and energy efficiency of the platform, which are key metrics for both embedded and HPC systems; second, enforcing software engineering practices with the aim to guarantee code quality and reduce software costs. This thesis presents a set of techniques that have been studied and designed to achieve these objectives overcoming the current state-of-the-art. As a first contribution, we discuss the use of OpenMP tasking as a general-purpose programming model to support the execution of diverse workloads, and we introduce a set of runtime-level techniques to support fine-grain tasks on high-end many-core accelerators (devices with a power consumption greater than 10W). Then we focus our attention on embedded computer vision (CV), with the aim to show how to achieve best performance by exploiting the characteristics of a specific application domain. To further reduce the power consumption of parallel accelerators beyond the current technological limits, we describe an approach based on the principles of approximate computing, which implies modification to the program semantics and proper hardware support at the architectural level

Heterogeneous Architectures For Parallel Acceleration

Author: Conti Francesco <1988>
Publication venue: Alma Mater Studiorum - Università di Bologna
Publication date: 09/06/2016
Field of study

To enable a new generation of digital computing applications, the greatest challenge is to provide a better level of energy efficiency (intended as the performance that a system can provide within a certain power budget) without giving up a systems's flexibility. This constraint applies to digital system across all scales, starting from ultra-low power implanted devices up to datacenters for high-performance computing and for the "cloud". In this thesis, we show that architectural heterogeneity is the key to provide this efficiency and to respond to many of the challenges of tomorrow's computer architecture - and at the same time we show methodologies to introduce it with little or no loss in terms of flexibility. In particular, we show that heterogeneity can be employed to tackle the "walls" that impede further development of new computing applications: the utilization wall, i.e. the impossibility to keep all transistors on in deeply integrated chips, and the "data deluge", i.e. the amount of data to be processed that is scaling up much faster than the computing performance and efficiency. We introduce a methodology to improve heterogeneous design exploration of tightly coupled clusters; moreover we propose a fractal heterogeneity architecture that is a parallel accelerator for low-power sensor nodes, and is itself internally heterogeneous thanks to an heterogeneous coprocessor for brain-inspired computing. This platform, which is silicon-proven, can lead to more than 100x improvement in terms of energy efficiency with respect to typical computing nodes used within the same domain, enabling the application of complex algorithms, vastly more performance-hungry than the current state-of-the-art in the ULP computing domain