347 research outputs found

    HeTM: Transactional Memory for Heterogeneous Systems

    Full text link
    Modern heterogeneous computing architectures, which couple multi-core CPUs with discrete many-core GPUs (or other specialized hardware accelerators), enable unprecedented peak performance and energy efficiency levels. Unfortunately, though, developing applications that can take full advantage of the potential of heterogeneous systems is a notoriously hard task. This work takes a step towards reducing the complexity of programming heterogeneous systems by introducing the abstraction of Heterogeneous Transactional Memory (HeTM). HeTM provides programmers with the illusion of a single memory region, shared among the CPUs and the (discrete) GPU(s) of a heterogeneous system, with support for atomic transactions. Besides introducing the abstract semantics and programming model of HeTM, we present the design and evaluation of a concrete implementation of the proposed abstraction, which we named Speculative HeTM (SHeTM). SHeTM makes use of a novel design that leverages on speculative techniques and aims at hiding the inherently large communication latency between CPUs and discrete GPUs and at minimizing inter-device synchronization overhead. SHeTM is based on a modular and extensible design that allows for easily integrating alternative TM implementations on the CPU's and GPU's sides, which allows the flexibility to adopt, on either side, the TM implementation (e.g., in hardware or software) that best fits the applications' workload and the architectural characteristics of the processing unit. We demonstrate the efficiency of the SHeTM via an extensive quantitative study based both on synthetic benchmarks and on a porting of a popular object caching system.Comment: The current work was accepted in the 28th International Conference on Parallel Architectures and Compilation Techniques (PACT'19

    Transactional memory on heterogeneous architectures

    Get PDF
    Tesis Leida el 9 de Marzo de 2018.Si observamos las necesidades computacionales de hoy, y tratamos de predecir las necesidades del mañana, podemos concluir que el procesamiento heterogéneo estará presente en muchos dispositivos y aplicaciones. El motivo es lógico: algoritmos diferentes y datos de naturaleza diferente encajan mejor en unos dispositivos de cómputo que en otros. Pongamos como ejemplo una tecnología de vanguardia como son los vehículos inteligentes. En este tipo de aplicaciones la computación heterogénea no es una opción, sino un requisito. En este tipo de vehículos se recolectan y analizan imágenes, tarea para la cual los procesadores gráficos (GPUs) son muy eficientes. Muchos de estos vehículos utilizan algoritmos sencillos, pero con grandes requerimientos de tiempo real, que deben implementarse directamente en hardware utilizando FPGAs. Y, por supuesto, los procesadores multinúcleo tienen un papel fundamental en estos sistemas, tanto organizando el trabajo de otros coprocesadores como ejecutando tareas en las que ningún otro procesador es más eficiente. No obstante, los procesadores tampoco siguen siendo dispositivos homogéneos. Los diferentes núcleos de un procesador pueden ofrecer diferentes características en términos de potencia y consumo energético que se adapten a las necesidades de cómputo de la aplicación. Programar este conjunto de dispositivos es una tarea compleja, especialmente en su sincronización. Habitualmente, esta sincronización se basa en operaciones atómicas, ejecución y terminación de kernels, barreras y señales. Con estas primitivas de sincronización básicas se pueden construir otras estructuras más complejas. Sin embargo, la programación de estos mecanismos es tediosa y propensa a fallos. La memoria transaccional (TM por sus siglas en inglés) se ha propuesto como un mecanismo avanzado a la vez que simple para garantizar la exclusión mutua

    A Safety-First Approach to Memory Models.

    Full text link
    Sequential consistency (SC) is arguably the most intuitive behavior for a shared-memory multithreaded program. It is widely accepted that language-level SC could significantly improve programmability of a multiprocessor system. However, efficiently supporting end-to-end SC remains a challenge as it requires that both compiler and hardware optimizations preserve SC semantics. Current concurrent languages support a relaxed memory model that requires programmers to explicitly annotate all memory accesses that can participate in a data-race ("unsafe" accesses). This requirement allows compiler and hardware to aggressively optimize unannotated accesses, which are assumed to be data-race-free ("safe" accesses), while still preserving SC semantics. However, unannotated data races are easy for programmers to accidentally introduce and are difficult to detect, and in such cases the safety and correctness of programs are significantly compromised. This dissertation argues instead for a safety-first approach, whereby every memory operation is treated as potentially unsafe by the compiler and hardware unless it is proven otherwise. The first solution, DRFx memory model, allows many common compiler and hardware optimizations (potentially SC-violating) on unsafe accesses and uses a runtime support to detect potential SC violations arising from reordering of unsafe accesses. On detecting a potential SC violation, execution is halted before the safety property is compromised. The second solution takes a different approach and preserves SC in both compiler and hardware. Both SC-preserving compiler and hardware are also built on the safety-first approach. All memory accesses are treated as potentially unsafe by the compiler and hardware. SC-preserving hardware relies on different static and dynamic techniques to identify safe accesses. Our results indicate that supporting SC at the language level is not expensive in terms of performance and hardware complexity. The dissertation also explores an extension of this safety-first approach for data-parallel accelerators such as Graphics Processing Units (GPUs). Significant microarchitectural differences between CPU and GPU require rethinking of efficient solutions for preserving SC in GPUs. The proposed solution based on our SC-preserving approach performs nearly on par with the baseline GPU that implements a data-race-free-0 memory model.PhDComputer Science and EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/120794/1/ansingh_1.pd

    A semantically aware transactional concurrency control for GPGPU computing

    Get PDF
    PhD ThesisThe increased parallel nature of the GPU a ords an opportunity for the exploration of multi-thread computing. With the availability of GPU has recently expanded into the area of general purpose program- ming, a concurrency control is required to exploit parallelism as well as maintaining the correctness of program. Transactional memory, which is a generalised solution for concurrent con ict, meanwhile allow application programmers to develop concurrent code in a more intu- itive manner, is superior to pessimistic concurrency control for general use. The most important component in any transactional memory technique is the policy to solve con icts on shared data, namely the contention management policy. The work presented in this thesis aims to develop a software trans- actional memory model which can solve both concurrent con ict and semantic con ict at the same time for the GPU. The technique di ers from existing CPU approaches on account of the di erent essential ex- ecution paths and hardware basis, plus much more parallel resources are available on the GPU. We demonstrate that both concurrent con- icts and semantic con icts can be resolved in a particular contention management policy for the GPU, with a di erent application of locks and priorities from the CPU. The basic problem and a software transactional memory solution idea is proposed rst. An implementation is then presented based on the execution mode of this model. After that, we extend this system to re- solve semantic con ict at the same time. Results are provided nally, which compare the performance of our solution with an established GPU software transactional memory and a famous CPU transactional memory, at varying levels of concurrent and semantic con icts

    Memoria Transaccional Software en Procesadores CPU+GPU Heterogéneos

    Get PDF
    En los procesadores multi-núcleo, la memoria transaccional (TM) ha aparecido como una alternativa prometedora a las técnicas basadas en cerrojos para garantizar exclusión mutua y está siendo incluida como parte de procesadores comerciales. De igual forma, dado que las GPUs se están convirtiendo en el acelerador más popular de la actualidad, los fabricantes están integrándolas dentro del mismo chip, creando las llamadas APUs (Accelerated Processing Units). Sin embargo, la sincronización entre CPU y GPU aún se lleva a cabo con mecanismos muy simples basados en operaciones atómicas y señales. Por tanto, es responsabilidad de los programadores implementar técnicas más avanzadas de exclusión mútua. Las técnicas basadas en TM aún no han sido explotadas en este tipo de procesadores y, por tanto, es importante hacer propuestas de sincronización avanzadas. En este artículo proponemos una librería de TM software enfocada a su uso en procesadores APU. El objetivo es que las transacciones puedan ejecutarse tanto en CPU como en GPU simultáneamente y que se permita la sincronización en forma de exclusión mutua entre ambos dispositivos. Nuestra propuesta, llamada APUTM, se enfoca en minimizar la comunicación entre la CPU y la GPU de los metadatos requeridos para manejar TM. La evaluación de esta propuesta muestra que, utilizando este mecanismo de sincronización, es posible mejorar el tiempo de ejecución de las aplicaciones secuenciales con un reducido esfuerzo en la programación

    Efficient Transactional-Memory-based Implementation of Morph Algorithms on GPU

    Get PDF
    General Purpose GPUs (GPGPUs) are ideal platforms for parallel execution of applications with regular shared memory access patterns. However, majority of real world multithreaded applications require access to shared memory with irregular patterns. The morph algorithms, which arise in many real world applications, change their graph data structures in unpredictable ways, thus, leading to irregular access patterns to shared data. Such irregularity makes morph algorithms more challenging to be implemented on GPUs which favor regularity. The Borouvka’s algorithm for calculating Minimum Spanning Forest (MSF), and multilevel graph partitioning are two examples of morph algorithms with varied levels of expressed parallelism. In this work we show that a transactional-memory-based design and implementation of the morph algorithms on GPUs can handle some of the challenges arising due to irregularities such as complexity of code and overhead of synchronization. First, we identify the major phases of the algorithm which requires synchronization of the shared data. If the algorithm exhibits certain algebraic properties (e.g., monotonicity, idempotency, associativity), we can use lock-free synchronizations for performance; otherwise we utilize a Software Transactional Memory (STM) based synchronization method. Experimental results show that our GPU-based implementation of Borouvka’s algorithm outperforms both the fastest sequential implementation and the existing STM-based implementation on multicore CPUs when tested on large-scale graphs with diverse densities. Moreover, to show the applicability of our approach to other morph algorithms, we do a pen-and-paper implementation and complexity analysis of multilevel graph partitioning

    FLAME GPU 2: a framework for flexible and performant agent based simulation on GPUs

    Get PDF
    Agent based modelling (ABM) offers a powerful abstraction for scientific study in a broad range of domains. The use of agent based simulators encourages good software engineering design such as separation of concerns, that is, the uncoupling of the model description from its implementation detail. A major limitation in current approaches to ABM simulation is that of the trade off between simulator flexibility and performance. It is common that highly optimised simulations, such as those which target graphics processing units (GPU) hardware, are implemented as standalone software. This work presents a software framework (FLAME GPU 2) which balances flexibility with performance for general purpose ABM. Methods for ensuring high computational efficacy are demonstrated by, minimising data movement, and ensuring high device utilisation by exploiting opportunities for concurrent code execution within a model and through the use of ensembles of simulations. A novel hierarchical sub-modelling approach is also presented which can be used to model certain types of recursive behaviours. This feature is shown to be essential in providing a mechanism to resolve competition for resources between agents within a parallel environment which would otherwise introduce race conditions. To understand the performance characteristics of the software, a benchmark model with millions of agents is used to explore the use of simulation ensembles and to parametrically investigate concurrent code execution within a model. Performance speedups are demonstrated of 3.5 and 10 respectively over a baseline GPU implementation. Our hierarchical sub-modelling approach is used to demonstrate the implementation of a recursive algorithm to resolve competition of agent movement which occurs as a result of agent desire to simultaneously occupy discrete areas high in a ‘resource’. The algorithm is used to implement a classical socio-economics model, Sugarscape, with populations of up to 16M agents
    • …
    corecore