595 research outputs found

    Accelerating sequential programs using FastFlow and self-offloading

    Full text link
    FastFlow is a programming environment specifically targeting cache-coherent shared-memory multi-cores. FastFlow is implemented as a stack of C++ template libraries built on top of lock-free (fence-free) synchronization mechanisms. In this paper we present a further evolution of FastFlow enabling programmers to offload part of their workload on a dynamically created software accelerator running on unused CPUs. The offloaded function can be easily derived from pre-existing sequential code. We emphasize in particular the effective trade-off between human productivity and execution efficiency of the approach.Comment: 17 pages + cove

    Simulation of 1+1 dimensional surface growth and lattices gases using GPUs

    Get PDF
    Restricted solid on solid surface growth models can be mapped onto binary lattice gases. We show that efficient simulation algorithms can be realized on GPUs either by CUDA or by OpenCL programming. We consider a deposition/evaporation model following Kardar-Parisi-Zhang growth in 1+1 dimensions related to the Asymmetric Simple Exclusion Process and show that for sizes, that fit into the shared memory of GPUs one can achieve the maximum parallelization speedup ~ x100 for a Quadro FX 5800 graphics card with respect to a single CPU of 2.67 GHz). This permits us to study the effect of quenched columnar disorder, requiring extremely long simulation times. We compare the CUDA realization with an OpenCL implementation designed for processor clusters via MPI. A two-lane traffic model with randomized turning points is also realized and the dynamical behavior has been investigated.Comment: 20 pages 12 figures, 1 table, to appear in Comp. Phys. Com

    Coprocessor integration for real-time event processing in particle physics detectors

    Get PDF
    Els experiments de física d’altes energies actuals disposen d’acceleradors amb més energía, sensors més precisos i formes més flexibles de recopilar les dades. Aquesta ràpida evolució requereix de més capacitat de càlcul; els processadors massivament paral·lels, com ara les targes acceleradores gràfiques, ens posen a l’abast aquesta major capacitat de càlcul a un cost sensiblement inferior a les CPUs tradicionals. L’ús d’aquest tipus de processadors requereix, però, de nous algoritmes i nous enfocaments de l’organització de les dades que són difícils d’integrar en els programaris actuals. En aquest treball s’exploren els problemes derivats de l’ús d’algoritmes paral·lels en els entorns de programari existents, orientats a CPUs, i es proposa una solució, en forma de servei, que comunica amb els diversos pipelines que processen els esdeveniments procedents de les col·lisions de partícules, recull les dades en lots i els envia als algoritmes corrent sobre els processadors massivament paral·lels. Aquest servei s’integra en Gaudí - l’entorn de software de dos dels quatre experiments principals del Gran Col·lisionador d’Hadrons. S’examina el sobrecost que el servei afegeix als algoritmes paral·lels. S’estudia un cas d´ùs del servei per fer una reconstrucció paral·lela de les traces detectades en el VELO Pixel, el subdetector encarregat de la detecció de vèrtex en l’upgrade de LHCb. Per aquest cas, s’observen les característiques del rendiment en funció de la mida dels lots de dades. Finalment, les conclusions en posen en el context dels requeriments del sistema de trigger de LHCb.La física de altas energías dispone actualmente de aceleradores con energías mayores, sensores más precisos y métodos de recopilación de datos más flexibles que nunca. Su rápido progreso necesita aún más potencia de cálculo; el hardware masivamente paralelo, como las unidades de procesamiento gráfico, nos brinda esta potencia a un coste mucho más bajo que las CPUs tradicionales. Sin embargo, para usar eficientemente este hardware necesitamos algoritmos nuevos y nuevos enfoques de organización de datos difíciles de integrarse con el software existente. En este trabajo, se investiga cómo se pueden usar estos algoritmos paralelos en las infraestructuras de software ya existentes y que están orientadas a CPUs. Se propone una solución en forma de un servicio que comunica con los diversos pipelines que procesan los eventos de las correspondientes colisiones de particulas, reúne los datos en lotes y se los entrega a los algoritmos paralelos acelerados por hardware. Este servicio se integra con Gaudí — la infraestructura del entorno de software que usan dos de los cuatro gran experimentos del Gran Colisionador de Hadrones. Se examinan los costes añadidos por el servicio en los algoritmos paralelos. Se estudia un caso de uso del servicio para ejecutar un algoritmo paralelo para el VELO Pixel (el subdetector encargado de la localización de vértices en el upgrade del experimento LHCb) y se estudian las características de rendimiento de los distintos tamaños de lotes de datos. Finalmente, las conclusiones se contextualizan dentro la perspectiva de los requerimientos para el sistema de trigger de LHCb.High-energy physics experiments today have higher energies, more accurate sensors, and more flexible means of data collection than ever before. Their rapid progress requires ever more computational power; and massively parallel hardware, such as graphics cards, holds the promise to provide this power at a much lower cost than traditional CPUs. Yet, using this hardware requires new algorithms and new approaches to organizing data that can be difficult to integrate with existing software. In this work, I explore the problem of using parallel algorithms within existing CPU-orientated frameworks and propose a compromise between the different trade-offs. The solution is a service that communicates with multiple event-processing pipelines, gathers data into batches, and submits them to hardware-accelerated parallel algorithms. I integrate this service with Gaudi — a framework underlying the software environments of two of the four major experiments at the Large Hadron Collider. I examine the overhead the service adds to parallel algorithms. I perform a case study of using the service to run a parallel track reconstruction algorithm for the LHCb experiment's prospective VELO Pixel subdetector and look at the performance characteristics of using different data batch sizes. Finally, I put the findings into perspective within the context of the LHCb trigger's requirements

    Autonomic behavioural framework for structural parallelism over heterogeneous multi-core systems.

    Get PDF
    With the continuous advancement in hardware technologies, significant research has been devoted to design and develop high-level parallel programming models that allow programmers to exploit the latest developments in heterogeneous multi-core/many-core architectures. Structural programming paradigms propose a viable solution for e ciently programming modern heterogeneous multi-core architectures equipped with one or more programmable Graphics Processing Units (GPUs). Applying structured programming paradigms, it is possible to subdivide a system into building blocks (modules, skids or components) that can be independently created and then used in di erent systems to derive multiple functionalities. Exploiting such systematic divisions, it is possible to address extra-functional features such as application performance, portability and resource utilisations from the component level in heterogeneous multi-core architecture. While the computing function of a building block can vary for di erent applications, the behaviour (semantic) of the block remains intact. Therefore, by understanding the behaviour of building blocks and their structural compositions in parallel patterns, the process of constructing and coordinating a structured application can be automated. In this thesis we have proposed Structural Composition and Interaction Protocol (SKIP) as a systematic methodology to exploit the structural programming paradigm (Building block approach in this case) for constructing a structured application and extracting/injecting information from/to the structured application. Using SKIP methodology, we have designed and developed Performance Enhancement Infrastructure (PEI) as a SKIP compliant autonomic behavioural framework to automatically coordinate structured parallel applications based on the extracted extra-functional properties related to the parallel computation patterns. We have used 15 di erent PEI-based applications (from large scale applications with heavy input workload that take hours to execute to small-scale applications which take seconds to execute) to evaluate PEI in terms of overhead and performance improvements. The experiments have been carried out on 3 di erent Heterogeneous (CPU/GPU) multi-core architectures (including one cluster machine with 4 symmetric nodes with one GPU per node and 2 single machines with one GPU per machine). Our results demonstrate that with less than 3% overhead, we can achieve up to one order of magnitude speed-up when using PEI for enhancing application performance

    Bigger Buffer k-d Trees on Multi-Many-Core Systems

    Get PDF
    A buffer k-d tree is a k-d tree variant for massively-parallel nearest neighbor search. While providing valuable speed-ups on modern many-core devices in case both a large number of reference and query points are given, buffer k-d trees are limited by the amount of points that can fit on a single device. In this work, we show how to modify the original data structure and the associated workflow to make the overall approach capable of dealing with massive data sets. We further provide a simple yet efficient way of using multiple devices given in a single workstation. The applicability of the modified framework is demonstrated in the context of astronomy, a field that is faced with huge amounts of data
    • …
    corecore