Search CORE

595 research outputs found

Accelerating sequential programs using FastFlow and self-offloading

Author: Aldinucci Marco
Danelutto Marco
Kilpatrick Peter
Meneghin Massimiliano
Torquati Massimo
Publication venue
Publication date: 12/02/2010
Field of study

FastFlow is a programming environment specifically targeting cache-coherent shared-memory multi-cores. FastFlow is implemented as a stack of C++ template libraries built on top of lock-free (fence-free) synchronization mechanisms. In this paper we present a further evolution of FastFlow enabling programmers to offload part of their workload on a dynamically created software accelerator running on unused CPUs. The offloaded function can be easily derived from pre-existing sequential code. We emphasize in particular the effective trade-off between human productivity and execution efficiency of the approach.Comment: 17 pages + cove

arXiv.org e-Print Archive

UnipiEprints

Simulation of 1+1 dimensional surface growth and lattices gases using GPUs

Author: Amir
Barabási
Barma
Bernaschi
Castro
Chowdhury
Chowdhury
Edwards
Facsko
Family
Forster
Gergely Ódor
Gross
Géza Ódor
Halpin-Healy
Harris
Henrik Schulz
Hinrichsen
Hwa
Janssen
Juhász
Juhász
Kardar
Kardar
Katz
Klumpp
Krug
Krug
Ligget
Meakin
Máté Ferenc Nagy
Plischke
Preis
Prähofer
Sasamoto
Sasamoto
Schittmann
Schliwa
van Beijeren
van Meel
Weigel
Ódor
Ódor
Ódor
Ódor
Ódor
Publication venue: 'Elsevier BV'
Publication date: 01/01/2011
Field of study

Restricted solid on solid surface growth models can be mapped onto binary lattice gases. We show that efficient simulation algorithms can be realized on GPUs either by CUDA or by OpenCL programming. We consider a deposition/evaporation model following Kardar-Parisi-Zhang growth in 1+1 dimensions related to the Asymmetric Simple Exclusion Process and show that for sizes, that fit into the shared memory of GPUs one can achieve the maximum parallelization speedup ~ x100 for a Quadro FX 5800 graphics card with respect to a single CPU of 2.67 GHz). This permits us to study the effect of quenched columnar disorder, requiring extremely long simulation times. We compare the CUDA realization with an OpenCL implementation designed for processor clusters via MPI. A two-lane traffic model with randomized turning points is also realized and the dynamical behavior has been investigated.Comment: 20 pages 12 figures, 1 table, to appear in Comp. Phys. Com

arXiv.org e-Print Archive

Crossref

ELTE Digital Institutional Repository (EDIT)

Coprocessor integration for real-time event processing in particle physics detectors

Author: Badalov Alexey Pavlovich
Publication venue: Blanquerna - Universitat Ramon Llull
Publication date: 01/01/2016
Field of study

Els experiments de física d’altes energies actuals disposen d’acceleradors amb més energía, sensors més precisos i formes més flexibles de recopilar les dades. Aquesta ràpida evolució requereix de més capacitat de càlcul; els processadors massivament paral·lels, com ara les targes acceleradores gràfiques, ens posen a l’abast aquesta major capacitat de càlcul a un cost sensiblement inferior a les CPUs tradicionals. L’ús d’aquest tipus de processadors requereix, però, de nous algoritmes i nous enfocaments de l’organització de les dades que són difícils d’integrar en els programaris actuals. En aquest treball s’exploren els problemes derivats de l’ús d’algoritmes paral·lels en els entorns de programari existents, orientats a CPUs, i es proposa una solució, en forma de servei, que comunica amb els diversos pipelines que processen els esdeveniments procedents de les col·lisions de partícules, recull les dades en lots i els envia als algoritmes corrent sobre els processadors massivament paral·lels. Aquest servei s’integra en Gaudí - l’entorn de software de dos dels quatre experiments principals del Gran Col·lisionador d’Hadrons. S’examina el sobrecost que el servei afegeix als algoritmes paral·lels. S’estudia un cas d´ùs del servei per fer una reconstrucció paral·lela de les traces detectades en el VELO Pixel, el subdetector encarregat de la detecció de vèrtex en l’upgrade de LHCb. Per aquest cas, s’observen les característiques del rendiment en funció de la mida dels lots de dades. Finalment, les conclusions en posen en el context dels requeriments del sistema de trigger de LHCb.La física de altas energías dispone actualmente de aceleradores con energías mayores, sensores más precisos y métodos de recopilación de datos más flexibles que nunca. Su rápido progreso necesita aún más potencia de cálculo; el hardware masivamente paralelo, como las unidades de procesamiento gráfico, nos brinda esta potencia a un coste mucho más bajo que las CPUs tradicionales. Sin embargo, para usar eficientemente este hardware necesitamos algoritmos nuevos y nuevos enfoques de organización de datos difíciles de integrarse con el software existente. En este trabajo, se investiga cómo se pueden usar estos algoritmos paralelos en las infraestructuras de software ya existentes y que están orientadas a CPUs. Se propone una solución en forma de un servicio que comunica con los diversos pipelines que procesan los eventos de las correspondientes colisiones de particulas, reúne los datos en lotes y se los entrega a los algoritmos paralelos acelerados por hardware. Este servicio se integra con Gaudí — la infraestructura del entorno de software que usan dos de los cuatro gran experimentos del Gran Colisionador de Hadrones. Se examinan los costes añadidos por el servicio en los algoritmos paralelos. Se estudia un caso de uso del servicio para ejecutar un algoritmo paralelo para el VELO Pixel (el subdetector encargado de la localización de vértices en el upgrade del experimento LHCb) y se estudian las características de rendimiento de los distintos tamaños de lotes de datos. Finalmente, las conclusiones se contextualizan dentro la perspectiva de los requerimientos para el sistema de trigger de LHCb.High-energy physics experiments today have higher energies, more accurate sensors, and more flexible means of data collection than ever before. Their rapid progress requires ever more computational power; and massively parallel hardware, such as graphics cards, holds the promise to provide this power at a much lower cost than traditional CPUs. Yet, using this hardware requires new algorithms and new approaches to organizing data that can be difficult to integrate with existing software. In this work, I explore the problem of using parallel algorithms within existing CPU-orientated frameworks and propose a compromise between the different trade-offs. The solution is a service that communicates with multiple event-processing pipelines, gathers data into batches, and submits them to hardware-accelerated parallel algorithms. I integrate this service with Gaudi — a framework underlying the software environments of two of the four major experiments at the Large Hadron Collider. I examine the overhead the service adds to parallel algorithms. I perform a case study of using the service to run a parallel track reconstruction algorithm for the LHCb experiment's prospective VELO Pixel subdetector and look at the performance characteristics of using different data batch sizes. Finally, I put the findings into perspective within the context of the LHCb trigger's requirements

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Tesis Doctorals en Xarxa

Autonomic behavioural framework for structural parallelism over heterogeneous multi-core systems.

Author: Goli Mehdi
Publication venue
Publication date: 31/05/2015
Field of study

With the continuous advancement in hardware technologies, significant research has been devoted to design and develop high-level parallel programming models that allow programmers to exploit the latest developments in heterogeneous multi-core/many-core architectures. Structural programming paradigms propose a viable solution for e ciently programming modern heterogeneous multi-core architectures equipped with one or more programmable Graphics Processing Units (GPUs). Applying structured programming paradigms, it is possible to subdivide a system into building blocks (modules, skids or components) that can be independently created and then used in di erent systems to derive multiple functionalities. Exploiting such systematic divisions, it is possible to address extra-functional features such as application performance, portability and resource utilisations from the component level in heterogeneous multi-core architecture. While the computing function of a building block can vary for di erent applications, the behaviour (semantic) of the block remains intact. Therefore, by understanding the behaviour of building blocks and their structural compositions in parallel patterns, the process of constructing and coordinating a structured application can be automated. In this thesis we have proposed Structural Composition and Interaction Protocol (SKIP) as a systematic methodology to exploit the structural programming paradigm (Building block approach in this case) for constructing a structured application and extracting/injecting information from/to the structured application. Using SKIP methodology, we have designed and developed Performance Enhancement Infrastructure (PEI) as a SKIP compliant autonomic behavioural framework to automatically coordinate structured parallel applications based on the extracted extra-functional properties related to the parallel computation patterns. We have used 15 di erent PEI-based applications (from large scale applications with heavy input workload that take hours to execute to small-scale applications which take seconds to execute) to evaluate PEI in terms of overhead and performance improvements. The experiments have been carried out on 3 di erent Heterogeneous (CPU/GPU) multi-core architectures (including one cluster machine with 4 symmetric nodes with one GPU per node and 2 single machines with one GPU per machine). Our results demonstrate that with less than 3% overhead, we can achieve up to one order of magnitude speed-up when using PEI for enhancing application performance

Open Access Institutional Repository at Robert Gordon University

Bigger Buffer k-d Trees on Multi-Many-Core Systems

Author: AA Mahabal
J Bentley
J Friedman
K Polsterer
M Blum
M Scarpino
N Nakasato
PN Tan
Publication venue: 'Springer Fachmedien Wiesbaden GmbH'
Publication date: 01/01/2019
Field of study

A buffer k-d tree is a k-d tree variant for massively-parallel nearest neighbor search. While providing valuable speed-ups on modern many-core devices in case both a large number of reference and query points are given, buffer k-d trees are limited by the amount of points that can fit on a single device. In this work, we show how to modify the original data structure and the associated workflow to make the overall approach capable of dealing with massive data sets. We further provide a simple yet efficient way of using multiple devices given in a single workstation. The applicability of the modified framework is demonstrated in the context of astronomy, a field that is faced with huge amounts of data

Crossref

Copenhagen University Research Information System

Caltech Authors

Radboud Repository