161 research outputs found

    Efficiently and Transparently Maintaining High SIMD Occupancy in the Presence of Wavefront Irregularity

    Get PDF
    Demand is increasing for high throughput processing of irregular streaming applications; examples of such applications from scientific and engineering domains include biological sequence alignment, network packet filtering, automated face detection, and big graph algorithms. With wide SIMD, lightweight threads, and low-cost thread-context switching, wide-SIMD architectures such as GPUs allow considerable flexibility in the way application work is assigned to threads. However, irregular applications are challenging to map efficiently onto wide SIMD because data-dependent filtering or replication of items creates an unpredictable data wavefront of items ready for further processing. Straightforward implementations of irregular applications on a wide-SIMD architecture are prone to load imbalance and reduced occupancy, while more sophisticated implementations require advanced use of parallel GPU operations to redistribute work efficiently among threads. This dissertation will present strategies for addressing the performance challenges of wavefront- irregular applications on wide-SIMD architectures. These strategies are embodied in a developer framework called Mercator that (1) allows developers to map irregular applications onto GPUs ac- cording to the streaming paradigm while abstracting from low-level data movement and (2) includes generalized techniques for transparently overcoming the obstacles to high throughput presented by wavefront-irregular applications on a GPU. Mercator forms the centerpiece of this dissertation, and we present its motivation, performance model, implementation, and extensions in this work

    Task-Based Performance Portability in HPC: Maximising long-term investments in a fast evolving, complex and heterogeneous HPC landscape

    Get PDF
    White paperInternational audienceAs HPC hardware continues to evolve and diversify and workloads become more dynamic and complex, applications need to be expressed in a way that facilitates high performance across a range of hardware and situations. The main application code should be platform-independent, malleable and asynchronous with an open, clean, stable and dependable interface between the higher levels of the application, library or programming model and the kernels and software layers tuned for the machine. The platform-independent part should avoid direct references to specific resources and their availability, and instead provide the information needed to optimise behaviour.This paper summarises how task abstraction, which first appeared in the 1990s and is already mainstream in HPC, should be the basis for a composable and dynamic performance-portable interface. It outlines the innovations that are required in the programming model and runtime layers, and highlights the need for a greater degree of trust among application developers in the ability of the underlying software layers to extract full performance. These steps will help realise the vision for performance portability across current and future architectures and problems

    Learning for Optimization with Virtual Savant

    Get PDF
    Optimization problems arising in multiple fields of study demand efficient algorithms that can exploit modern parallel computing platforms. The remarkable development of machine learning offers an opportunity to incorporate learning into optimization algorithms to efficiently solve large and complex problems. This thesis explores Virtual Savant, a paradigm that combines machine learning and parallel computing to solve optimization problems. Virtual Savant is inspired in the Savant Syndrome, a mental condition where patients excel at a specific ability far above the average. In analogy to the Savant Syndrome, Virtual Savant extracts patterns from previously-solved instances to learn how to solve a given optimization problem in a massively-parallel fashion. In this thesis, Virtual Savant is applied to three optimization problems related to software engineering, task scheduling, and public transportation. The efficacy of Virtual Savant is evaluated in different computing platforms and the experimental results are compared against exact and approximate solutions for both synthetic and realistic instances of the studied problems. Results show that Virtual Savant can find accurate solutions, effectively scale in the problem dimension, and take advantage of the availability of multiple computing resources.Los problemas de optimización que surgen en múltiples campos de estudio demandan algoritmos eficientes que puedan explotar las plataformas modernas de computación paralela. El notable desarrollo del aprendizaje automático ofrece la oportunidad de incorporar el aprendizaje en algoritmos de optimización para resolver problemas complejos y de grandes dimensiones de manera eficiente. Esta tesis explora Savant Virtual, un paradigma que combina aprendizaje automático y computación paralela para resolver problemas de optimización. Savant Virtual está inspirado en el Sı́ndrome de Savant, una condición mental en la que los pacientes se destacan en una habilidad especı́fica muy por encima del promedio. En analogı́a con el sı́ndrome de Savant, Savant Virtual extrae patrones de instancias previamente resueltas para aprender a resolver un determinado problema de optimización de forma masivamente paralela. En esta tesis, Savant Virtual se aplica a tres problemas de optimización relacionados con la ingenierı́a de software, la planificación de tareas y el transporte público. La eficacia de Savant Virtual se evalúa en diferentes plataformas informáticas y los resultados se comparan con soluciones exactas y aproximadas para instancias tanto sintéticas como realistas de los problemas estudiados. Los resultados muestran que Savant Virtual puede encontrar soluciones precisas, escalar eficazmente en la dimensión del problema y aprovechar la disponibilidad de múltiples recursos de cómputo.Fundación Carolina Agencia Nacional de Investigación e Innovación (ANII, Uruguay) Universidad de Cádiz Universidad de la Repúblic

    GPU Usage for Parallel Functions and Contacts in Modelica

    Get PDF
    This thesis investigates two ways of incorporating GPUs in Modelica. The first by automatically generating GPU code for Modelica functions, and the second by using GPU accelerated external code for a contact handling package. Automatic parallelization of functions is desired, as it can potentially accelerate large simulations significantly. Special patterns of nested for-loops in Modelica code are recognized and translated into CUDA kernel functions. Inline integration allows a broader spectrum of models to take advantage of the parallelization, by reducing CPU-GPU transfers. The prototype has been tested and achieved a speed-up factor of up to five compared to the CPU. The contact handling package is capable of handling both complex contact behavior between arbitrarily shaped bodies and large DEM-like simulations, something which Modelica is currently lacking. Attempts to accelerate the package with GPUs were made, with partial success for the broad phase. The package uses Morton encoding for the broad phase, and the narrow phase is based on CSG intersection with BSP trees. Contact response is calculated using a volume dependent method, taking friction, damping and multiple contact points into account. The capability of the package was demonstrated by the fact that both complex contact behavior such as the inversion of the Tippe Top toy and tens of thousands of colliding spheres could be simulated.One of the key components in our modern society is the ability to simulate. By simulations, the industry can design new cars, phones, aircrafts etc., without having to go through prototype after prototype, allowing us the cheap and high-tech products most of us rely on in our day-to-day life. However, good as simulations are today there are still severe limitations on what can be simulated. Two of the largest limiting factors are how hard simulations are to design and how long they take to run. The first of these problems are tackled by the Modelica programming language, which is designed for easy set-up of simulations. We have attacked both these problems by using GPUs to speed-up Modelica simulations more than 5 times, and extending the capability of Modelica to handle collisions between objects

    Towards hardware as a reconfigurable, elastic, and specialized service

    Get PDF
    As modern Data Center workloads become increasingly complex, constrained, and critical, mainstream CPU-centric computing has had ever more difficulty in keeping pace. Future data centers are moving towards a more fluid and heterogeneous model, with computation and communication no longer localized to commodity CPUs and routers. Next generation data-centric Data Centers will compute everywhere, whether data is stationary (e.g. in memory) or on the move (e.g. in network). While deploying FPGAs in NICS, as co-processors, in the router, and in Bump-in-the-Wire configurations is a step towards implementing the data-centric model, it is only part of the overall solution. The other part is actually leveraging this reconfigurable hardware. For this to happen, two problems must be addressed: code generation and deployment generation. By code generation we mean transforming abstract representations of an algorithm into equivalent hardware. Deployment generation refers to the runtime support needed to facilitate the execution of this hardware on an FPGA. Efforts at creating supporting tools in these two areas have thus far provided limited benefits. This is because the efforts are limited in one or more of the following ways: They i) do not provide fundamental solutions to a number of challenges, which makes them useful only to a limited group of (mostly) hardware developers, ii) are constrained in their scope, or iii) are ad hoc, i.e., specific to a single usage context, FPGA vendor, or Data Center configuration. Moreover, efforts in these areas have largely been mutually exclusive, which results in incompatibility across development layers; this requires wrappers to be designed to make interfaces compatible. As a result there is significant complexity and effort required to code and deploy efficient custom hardware for FPGAs; effort that may be orders-of-magnitude greater than for analogous software environments. The goal of this dissertation is to create a framework that enables reconfigurable logic in Data Centers to be targeted with the same level of effort as for a single CPU core. The underlying mechanism to this is a framework, which we refer to as Hardware as a Reconfigurable, Elastic and Specialized Service, or HaaRNESS. In this dissertation, we address two of the core challenges of HaaRNESS: reducing the complexity of code generation by constraining High Level Synthesis (HLS) toolflows, and replacing ad hoc models of deployment generation by generalizing and formalizing what is needed for a hardware Operating System. These parts are unified by the back-end of HLS toolflows which link generated compute pipelines with the operating system, and provide appropriate APIs, wrappers, and software runtimes. The contributions of this dissertation are the following: i) an empirically guided set of systematic transformations for generating high quality HLS code; ii) a framework for instrumenting HLS compiler to identify and remove optimization blockers; iii) a framework for RTL simulation and IP generation of HLS kernels for rapid turnaround; and iv) a framework for generalization and formalization of hardware operating systems to address the {\it ad hoc}'ness of existing deployment generation and ensure uniform structure and APIs

    Designing a scalable dynamic load -balancing algorithm for pipelined single program multiple data applications on a non-dedicated heterogeneous network of workstations

    Get PDF
    Dynamic load balancing strategies have been shown to be the most critical part of an efficient implementation of various applications on large distributed computing systems. The need for dynamic load balancing strategies increases when the underlying hardware is a non-dedicated heterogeneous network of workstations (HNOW). This research focuses on the single program multiple data (SPMD) programming model as it has been extensively used in parallel programming for its simplicity and scalability in terms of computational power and memory size.;This dissertation formally defines and addresses the problem of designing a scalable dynamic load-balancing algorithm for pipelined SPMD applications on non-dedicated HNOW. During this process, the HNOW parameters, SPMD application characteristics, and load-balancing performance parameters are identified.;The dissertation presents a taxonomy that categorizes general load balancing algorithms and a methodology that facilitates creating new algorithms that can harness the HNOW computing power and still preserve the scalability of the SPMD application.;The dissertation devises a new algorithm, DLAH (Dynamic Load-balancing Algorithm for HNOW). DLAH is based on a modified diffusion technique, which incorporates the HNOW parameters. Analytical performance bound for the worst-case scenario of the diffusion technique has been derived.;The dissertation develops and utilizes an HNOW simulation model to conduct extensive simulations. These simulations were used to validate DLAH and compare its performance to related dynamic algorithms. The simulations results show that DLAH algorithm is scalable and performs well for both homogeneous and heterogeneous networks. Detailed sensitivity analysis was conducted to study the effects of key parameters on performance

    Piattaforme multicore e integrazione tri-dimensionale: analisi architetturale e ottimizzazione

    Get PDF
    Modern embedded systems embrace many-core shared-memory designs. Due to constrained power and area budgets, most of them feature software-managed scratchpad memories instead of data caches to increase the data locality. It is therefore programmers’ responsibility to explicitly manage the memory transfers, and this make programming these platform cumbersome. Moreover, complex modern applications must be adequately parallelized before they can the parallel potential of the platform into actual performance. To support this, programming languages were proposed, which work at a high level of abstraction, and rely on a runtime whose cost hinders performance, especially in embedded systems, where resources and power budget are constrained. This dissertation explores the applicability of the shared-memory paradigm on modern many-core systems, focusing on the ease-of-programming. It focuses on OpenMP, the de-facto standard for shared memory programming. In a first part, the cost of algorithms for synchronization and data partitioning are analyzed, and they are adapted to modern embedded many-cores. Then, the original design of an OpenMP runtime library is presented, which supports complex forms of parallelism such as multi-level and irregular parallelism. In the second part of the thesis, the focus is on heterogeneous systems, where hardware accelerators are coupled to (many-)cores to implement key functional kernels with orders-of-magnitude of speedup and energy efficiency compared to the “pure software” version. However, three main issues rise, namely i) platform design complexity, ii) architectural scalability and iii) programmability. To tackle them, a template for a generic hardware processing unit (HWPU) is proposed, which share the memory banks with cores, and the template for a scalable architecture is shown, which integrates them through the shared-memory system. Then, a full software stack and toolchain are developed to support platform design and to let programmers exploiting the accelerators of the platform. The OpenMP frontend is extended to interact with it.I sistemi integrati moderni sono architetture many-core, in cui spesso lo spazio di memoria è condiviso fra i processori. Per ridurre i consumi, molte di queste architetture sostituiscono le cache dati con memorie scratchpad gestite in software, per massimizzarne la località alle CPU e aumentare le performance. Questo significa che i dati devono essere spostati manualmente da parte del programmatore. Inoltre, tradurre in perfomance l’enorme parallelismo potenziale delle piattaforme many-core non è semplice. Per supportare la programmazione, diversi programming model sono stati proposti, e siccome lavorano ad un alto livello di astrazione, sfruttano delle librerie di runtime che forniscono servizi di base quali sincronizzazione, allocazione della memoria, threading. Queste librerie hanno un costo, che nei sistemi integrati è troppo elevato e ostacola il raggiungimento delle piene performance. Questa tesi analizza come un programming model ad alto livello di astrazione – OpenMP – possa essere efficientemente supportato, se il suo stack software viene adattato per sfruttare al meglio la piattaforma sottostante. In una prima parte, studio diversi meccanismi di sincronizzazione e comunicazione fra thread paralleli, portati sulle piattaforme many-core. In seguito, li utilizzo per scrivere un runtime di supporto a OpenMP che sia il più possibile efficente e “leggero” e che supporti paradigmi di parallelismo multi-livello e irregolare, spesso presenti nelle applicazioni moderne. Una seconda parte della tesi esplora le architetture eterogenee, ossia con acceleratori hardware. Queste architetture soffrono di problematiche sia i) per il processo di design della piattaforma, che ii) di scalabilità della piattaforma stessa (aumento del numero degli acceleratori e dei processori), che iii) di programmabilità. La tesi propone delle soluzioni a tutti e tre i problemi. Il linguaggio di programmazione usato è OpenMP, sia per la sua grande espressività a livello semantico, sia perché è lo standard de-facto per programmare sistemi a memoria condivisa

    Enabling task parallelism for many-core architectures

    Get PDF

    Type Oriented Parallel Programming

    Get PDF
    Context: Parallel computing is an important field within the sciences. With the emergence of multi, and soon many, core CPUs this is moving more and more into the domain of general computing. HPC programmers want performance, but at the moment this comes at a cost; parallel languages are either efficient or conceptually simple, but not both. Aim: To develop and evaluate a novel programming paradigm which will address the problem of parallel programming and allow for languages which are both conceptually simple and efficient. Method: A type-based approach, which allows the programmer to control all aspects of parallelism by the use and combination of types has been developed. As a vehicle to present and analyze this new paradigm a parallel language, Mesham, and associated compilation tools have also been created. By using types to express parallelism the programmer can exercise efficient, flexible control in a high level abstract model yet with a sufficiently rich amount of information in the source code upon which the compiler can perform static analysis and optimization. Results: A number of case studies have been implemented in Mesham. Official benchmarks have been performed which demonstrate the paradigm allows one to write code which is comparable, in terms of performance, with existing high performance solutions. Sections of the parallel simulation package, Gadget-2, have been ported into Mesham, where substantial code simplifications have been made. Conclusions: The results obtained indicate that the type-based approach does satisfy the aim of the research described in this thesis. By using this new paradigm the programmer has been able to write parallel code which is both simple and efficient
    corecore