    GeantV: Results from the prototype of concurrent vector particle transport simulation in HEP

    Full detector simulation was among the largest CPU consumers in all CERN experiment software stacks for the first two runs of the Large Hadron Collider (LHC). In the early 2010s, projections were that simulation demands would scale linearly with the luminosity increase, compensated only partially by an increase in computing resources. Extending fast simulation approaches to more use cases, covering a larger fraction of the simulation budget, is only part of the solution because of intrinsic precision limitations. The remainder requires speeding up the simulation software by several factors, which is out of reach using simple optimizations on the current code base. In this context, the GeantV R&D project was launched, aiming to redesign the legacy particle transport codes so that they benefit from fine-grained parallelism features such as vectorization, as well as from increased code and data locality. This paper presents extensively the results and achievements of this R&D, along with the conclusions and lessons learnt from the beta prototype. Comment: 34 pages, 26 figures, 24 tables

    A framework for scientific computing with GPUs

    Dissertation submitted for the degree of Master in Informatics Engineering. Commodity hardware nowadays includes not only many-core CPUs but also Graphics Processing Units (GPUs), whose highly data-parallel computational capabilities have been growing at an exponential rate. This computational power can be used for purposes other than graphics-oriented applications, such as the processor-intensive algorithms found in scientific computing. This thesis proposes a framework capable of distributing computational jobs over a network of CPUs and GPUs alike. The source code for each job is an OpenCL kernel, and is thus universal and independent of the specific architecture and CPU/GPU type on which it will be executed. This approach frees the software developer from the burden of maintaining customized revisions of the same application for each type of processor or hardware, at the cost of a possibly sub-optimal but still very efficient solution. The proposed run-time scales up as more powerful computing resources become available, with no need to recompile the application. Experiments showed that, although the performance improvement clearly depends on the nature of the problem and how it is coded, speedups in a distributed system containing both GPUs and multi-core CPUs can reach two orders of magnitude. Centro de Informática e Tecnologias da Informação (CITI) and Fundação para a Ciência e Tecnologia (FCT/MCTES), research projects PTDC/EIA/74325/2006, PTDC/EIA-EIA/108963/2008, PTDC/EIA-EIA/102579/2008, and PTDC/EIA-EIA/113613/200

    Efficiently and Transparently Maintaining High SIMD Occupancy in the Presence of Wavefront Irregularity

    Demand is increasing for high throughput processing of irregular streaming applications; examples of such applications from scientific and engineering domains include biological sequence alignment, network packet filtering, automated face detection, and big graph algorithms. With wide SIMD, lightweight threads, and low-cost thread-context switching, wide-SIMD architectures such as GPUs allow considerable flexibility in the way application work is assigned to threads. However, irregular applications are challenging to map efficiently onto wide SIMD because data-dependent filtering or replication of items creates an unpredictable data wavefront of items ready for further processing. Straightforward implementations of irregular applications on a wide-SIMD architecture are prone to load imbalance and reduced occupancy, while more sophisticated implementations require advanced use of parallel GPU operations to redistribute work efficiently among threads. This dissertation will present strategies for addressing the performance challenges of wavefront-irregular applications on wide-SIMD architectures. These strategies are embodied in a developer framework called Mercator that (1) allows developers to map irregular applications onto GPUs according to the streaming paradigm while abstracting from low-level data movement and (2) includes generalized techniques for transparently overcoming the obstacles to high throughput presented by wavefront-irregular applications on a GPU. Mercator forms the centerpiece of this dissertation, and we present its motivation, performance model, implementation, and extensions in this work.