From Physics Model to Results: An Optimizing Framework for Cross-Architecture Code Generation
Starting from a high-level problem description in terms of partial
differential equations using abstract tensor notation, the Chemora framework
discretizes, optimizes, and generates complete high performance codes for a
wide range of compute architectures. Chemora extends the capabilities of
Cactus, facilitating the usage of large-scale CPU/GPU systems in an efficient
manner for complex applications, without low-level code tuning. Chemora
achieves parallelism through MPI and multi-threading, combining OpenMP and
CUDA. Optimizations include high-level code transformations, efficient loop
traversal strategies, dynamically selected data and instruction cache usage
strategies, and JIT compilation of GPU code tailored to the problem
characteristics. The discretization is based on higher-order finite differences
on multi-block domains. Chemora's capabilities are demonstrated by simulations
of black hole collisions. This problem provides an acid test of the framework,
as the Einstein equations contain hundreds of variables and thousands of terms.
Comment: 18 pages, 4 figures, accepted for publication in Scientific
Programming
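The abstract above mentions higher-order finite differences. As a rough illustration only, the sketch below applies the standard fourth-order central stencil for a first derivative on a periodic one-dimensional grid. The function name `fd4_derivative` and the periodic boundary treatment are assumptions made for this example; it is not code from Chemora or Cactus, and it says nothing about multi-block domains.

```python
import numpy as np

def fd4_derivative(f, h):
    """Fourth-order central difference of samples f on a periodic grid.

    Hypothetical helper for illustration; uses the textbook stencil
    (-f[i+2] + 8 f[i+1] - 8 f[i-1] + f[i-2]) / (12 h).
    """
    f = np.asarray(f, dtype=float)
    # np.roll(f, -k)[i] equals f[i + k] with periodic wrap-around.
    return (-np.roll(f, -2) + 8.0 * np.roll(f, -1)
            - 8.0 * np.roll(f, 1) + np.roll(f, 2)) / (12.0 * h)

# Usage: differentiate sin(x) on [0, 2*pi) and compare with cos(x).
n = 64
h = 2.0 * np.pi / n
x = np.arange(n) * h
max_error = np.max(np.abs(fd4_derivative(np.sin(x), h) - np.cos(x)))
```

With this grid spacing the error should be small and should shrink roughly as h^4 when the grid is refined, which is the sense in which the stencil is "higher-order".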
Parallel Implementations of Cellular Automata for Traffic Models
The Biham-Middleton-Levine (BML) traffic model is a simple two-dimensional,
discrete Cellular Automaton (CA) that has been used to study self-organization
and phase transitions arising in traffic flows. From the computational point of
view, the BML model exhibits the usual features of discrete CA, where the states
of the automaton are updated according to simple rules that depend on the state
of each cell and its neighbors. In this paper we study the impact of various
optimizations for speeding up CA computations by using the BML model as a case
study. In particular, we describe and analyze the impact of several parallel
implementations that rely on CPU features, such as multiple cores or SIMD
instructions, and on GPUs. Experimental evaluation provides quantitative
measures of the payoff of each technique in terms of speedup with respect to a
plain serial implementation. Our findings show that the performance gap between
CPU and GPU implementations of the BML traffic model can be reduced by clever
exploitation of all CPU features.
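For readers unfamiliar with the BML model, the following is a minimal serial sketch of one update step, assuming the usual formulation: a periodic grid, east-moving and south-moving cars, and a car that advances only if its target cell was empty at the start of the half-step. The cell encoding and the names `bml_half_step` and `bml_step` are choices made for this example, and none of the paper's parallel or SIMD optimizations are reproduced here.

```python
import numpy as np

EMPTY, RED, BLUE = 0, 1, 2  # assumed encoding: RED moves east, BLUE moves south

def bml_half_step(grid, color, axis):
    """Advance every car of `color` by one cell along `axis` if the target is empty."""
    # target[i, j] is the cell each position would move into (periodic wrap).
    target = np.roll(grid, -1, axis=axis)
    movers = (grid == color) & (target == EMPTY)
    new_grid = grid.copy()
    new_grid[movers] = EMPTY                          # vacate the source cells
    new_grid[np.roll(movers, 1, axis=axis)] = color   # occupy the destinations
    return new_grid

def bml_step(grid):
    """One full step: all red cars move east, then all blue cars move south."""
    grid = bml_half_step(grid, RED, axis=1)
    return bml_half_step(grid, BLUE, axis=0)

# Usage on a tiny 3x3 grid.
start = np.array([[1, 0, 0],
                  [2, 1, 0],
                  [0, 0, 0]])
after = bml_step(start)
```

Because each half-step reads only the old state and writes a new array, every cell can be updated independently, which is the property that makes this kind of automaton a natural fit for multi-core, SIMD, and GPU execution.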
Improved Parallel Rabin-Karp Algorithm Using Compute Unified Device Architecture
String matching algorithms are among the most widely used algorithms
in computer science. Improving the efficiency of the underlying string
matching algorithm greatly increases the efficiency of any application that
relies on it. In recent years, graphics processing units have emerged as
highly parallel processors that outperform the best central processing
units in scientific computation power. Combining recent advances in
graphics processing units with string matching algorithms makes it possible to
speed up the string matching process. In this paper we propose a modified
parallel version of the Rabin-Karp algorithm using a graphics processing unit.
The results of the CPU and parallel GPU implementations are compared to
evaluate the effect of varying the number of threads, cores, file size, and
pattern size.
Comment: Information and Communication Technology for Intelligent Systems
(ICTIS 2017)
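As background, here is a minimal serial sketch of the classic Rabin-Karp algorithm with a rolling hash. It is a plain CPU baseline written for illustration, not the modified parallel version the paper proposes; the function name `rabin_karp` and the `base` and `mod` defaults are assumptions chosen for this example.

```python
def rabin_karp(text, pattern, base=256, mod=1_000_003):
    """Return all start indices where `pattern` occurs in `text`."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # Weight of the leading character in an m-character window.
    high = pow(base, m - 1, mod)
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    matches = []
    for s in range(n - m + 1):
        # Compare characters only when hashes agree, to rule out collisions.
        if p_hash == t_hash and text[s:s + m] == pattern:
            matches.append(s)
        if s < n - m:
            # Roll the window: drop text[s], append text[s + m].
            t_hash = ((t_hash - ord(text[s]) * high) * base
                      + ord(text[s + m])) % mod
    return matches

# Usage
hits = rabin_karp("abracadabra", "abra")
```

A parallel variant could, for example, split the text into overlapping chunks and run the same window comparison in each chunk independently, but how the paper partitions the work is not stated in the abstract.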
A hierarchic task-based programming model for distributed heterogeneous computing
Distributed computing platforms are evolving into heterogeneous ecosystems in which Clusters, Grids and Clouds introduce into their computing nodes processors with different core architectures, accelerators (e.g. GPUs, FPGAs), as well as different memories and storage devices in order to achieve better performance with lower energy consumption. As a consequence of this heterogeneity, programming applications for these distributed heterogeneous platforms becomes a complex task. In addition to the complexity of developing an application for distributed platforms, developers must now also deal with the complexity of the different computing devices inside each node. In this article, we present a programming model that aims to facilitate the development and execution of applications in current and future distributed heterogeneous parallel architectures. This programming model is based on the hierarchical composition of the COMP Superscalar and Omp Superscalar programming models, which allow developers to implement infrastructure-agnostic applications. The underlying runtime enables applications to adapt to the infrastructure without the need to maintain different versions of the code. Our programming model proposal has been evaluated on real platforms in terms of heterogeneous resource usage, performance and adaptation.
This work has been supported by the European Commission through the Horizon 2020 Research and Innovation program under contract 687584 (TANGO project), by the Spanish Government under contract TIN2015-65316 and grant SEV-2015-0493 (Severo Ochoa Program), and by Generalitat de Catalunya under contracts 2014-SGR-1051 and 2014-SGR-1272.
Peer Reviewed. Postprint (author's final draft)
Parallelising wavefront applications on general-purpose GPU devices
Pipelined wavefront applications form a large portion of the high performance scientific computing workloads at supercomputing centres. This paper investigates the viability of graphics processing units (GPUs) for the acceleration of these codes, using NVIDIA's Compute Unified Device Architecture (CUDA). We identify the optimisations suitable for this new architecture and quantify the characteristics of those wavefront codes that are likely to experience speedups.
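To make the wavefront dependency pattern concrete, the sketch below sweeps a two-dimensional grid in which each interior cell depends on its upper and left neighbours. Cells on the same anti-diagonal (constant i + j) have no dependencies on each other, so the inner loop is the part that could be executed in parallel. The recurrence used here (summing the two neighbours) and the name `wavefront_sweep` are placeholders chosen for this example, not the computation of any code studied in the paper.

```python
def wavefront_sweep(rows, cols):
    """Fill a grid by anti-diagonals; cell (i, j) needs (i-1, j) and (i, j-1)."""
    # Boundary row and column are initialised to 1 (an arbitrary choice).
    grid = [[1] * cols for _ in range(rows)]
    # d = i + j indexes the anti-diagonal; interior cells start at d = 2.
    for d in range(2, rows + cols - 1):
        i_lo = max(1, d - (cols - 1))
        i_hi = min(rows - 1, d - 1)
        # Every cell on this diagonal is independent of the others,
        # so this loop is the candidate for parallel execution.
        for i in range(i_lo, i_hi + 1):
            j = d - i
            grid[i][j] = grid[i - 1][j] + grid[i][j - 1]
    return grid

# Usage
result = wavefront_sweep(3, 3)
```

The amount of available parallelism grows and then shrinks as the diagonal moves across the grid, which is one reason the achievable speedup depends on the problem size.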
SU(2) Lattice Gauge Theory Simulations on Fermi GPUs
In this work we explore the performance of CUDA in quenched lattice SU(2)
simulations. CUDA, NVIDIA Compute Unified Device Architecture, is a hardware
and software architecture developed by NVIDIA for computing on the GPU. We
present an analysis and performance comparison between the GPU and CPU in
single and double precision. Analyses with multiple GPUs and two different
architectures (G200 and Fermi architectures) are also presented. In order to
obtain a high performance, the code must be optimized for the GPU architecture,
i.e., an implementation that exploits the memory hierarchy of the CUDA
programming model.
We produce codes for the Monte Carlo generation of SU(2) lattice gauge
configurations, for the mean plaquette, for the Polyakov Loop at finite T and
for the Wilson loop. We also present results for the potential using many
configurations () without smearing and almost configurations
with APE smearing. With two Fermi GPUs we have achieved an excellent
performance of the speed over one CPU, in single precision, around
110 Gflops/s. We also find that, using the Fermi architecture, double precision
computations for the static quark-antiquark potential are not much slower (less
than slower) than single precision computations.
Comment: 20 pages, 11 figures, 3 tables, accepted in Journal of Computational
Physics