Density Estimations for Approximate Query Processing on SIMD Architectures
Approximate query processing (AQP) is an interesting alternative to exact
query processing. It is a tool for dealing with huge data volumes where
response time is more important than perfect accuracy (this is typically the
case during the initial phase of data exploration). There are many techniques
for AQP; one of them is based on probability density functions (PDFs). PDFs are
typically calculated using nonparametric data-driven methods. One of the most
popular nonparametric methods is the kernel density estimator (KDE). However, a
very serious drawback of using KDEs is the large number of calculations
required to compute them. The shape of the final density function is very
sensitive to an entity called the bandwidth or smoothing parameter. Calculating
its optimal value is not a trivial task and in general is very time consuming.
In this paper we investigate the possibility of utilizing two SIMD
architectures, SSE CPU extensions and NVIDIA's CUDA architecture, to accelerate
finding the bandwidth. Our experiments show orders-of-magnitude improvements
over a simple sequential implementation of the classical algorithms used for
that task.
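The core computational pattern is easy to sketch. Below is a minimal
pure-Python illustration of a Gaussian KDE together with Silverman's
rule-of-thumb bandwidth. This is an illustrative baseline, not the paper's
SSE/CUDA implementation; the data-driven bandwidth selectors the paper
accelerates (e.g. cross-validation) are far costlier than the rule-of-thumb
step shown here.

```python
import math
import statistics

def silverman_bandwidth(sample):
    """Rule-of-thumb bandwidth for a Gaussian kernel (Silverman)."""
    n = len(sample)
    return 1.06 * statistics.stdev(sample) * n ** (-1 / 5)

def kde(sample, x, h):
    """Gaussian kernel density estimate at point x with bandwidth h."""
    n = len(sample)
    norm = 1.0 / (n * h * math.sqrt(2 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in sample)

sample = [1.0, 1.2, 0.8, 1.1, 0.9, 2.0, 2.1, 1.9]
h = silverman_bandwidth(sample)   # the step accelerated in the paper uses
                                  # far costlier data-driven selectors
density_near = kde(sample, 1.0, h)
density_far = kde(sample, 10.0, h)
```

Each evaluation of `kde` touches every sample point, which is exactly the
O(n) inner loop that maps well onto SIMD lanes.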
Data-Parallel Hashing Techniques for GPU Architectures
Hash tables are one of the most fundamental data structures for effectively
storing and accessing sparse data, with widespread usage in domains ranging
from computer graphics to machine learning. This study surveys the
state-of-the-art research on data-parallel hashing techniques for emerging
massively-parallel, many-core GPU architectures. Key factors affecting the
performance of different hashing schemes are discovered and used to suggest
best practices and pinpoint areas for further research.
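As an illustration of the kind of scheme such surveys cover, here is a
minimal sequential sketch of an open-addressing table with linear probing, a
layout often favoured on GPUs because an insert needs only one atomic
compare-and-swap per probed slot. The class name and API are ours, not from
the survey.

```python
EMPTY = None  # sentinel for an unused slot

class LinearProbingTable:
    """Open-addressing hash table with linear probing, sketched
    sequentially; on a GPU the marked line becomes an atomic CAS."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.keys = [EMPTY] * capacity
        self.vals = [None] * capacity

    def insert(self, key, val):
        slot = hash(key) % self.capacity
        for i in range(self.capacity):
            j = (slot + i) % self.capacity
            if self.keys[j] is EMPTY or self.keys[j] == key:  # CAS point
                self.keys[j], self.vals[j] = key, val
                return True
        return False  # table full

    def lookup(self, key):
        slot = hash(key) % self.capacity
        for i in range(self.capacity):
            j = (slot + i) % self.capacity
            if self.keys[j] is EMPTY:
                return None
            if self.keys[j] == key:
                return self.vals[j]
        return None

table = LinearProbingTable(8)
table.insert("a", 1)
table.insert("b", 2)
```

Linear probing's contiguous probe sequence is also what gives coalesced
memory access on GPU hardware, one of the performance factors such surveys
analyse.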
Dynamic autotuning of adaptive fast multipole methods on hybrid multicore CPU & GPU systems
We discuss an implementation of adaptive fast multipole methods targeting
hybrid multicore CPU- and GPU-systems. From previous experiences with the
computational profile of our version of the fast multipole algorithm, suitable
parts are off-loaded to the GPU, while the remaining parts are threaded and
executed concurrently by the CPU. The parameters defining the algorithm affect
the performance, and by measuring this effect we are able to dynamically
balance the algorithm towards optimal performance. Our setup uses the dynamic
nature of the computations and is therefore general in character.
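The measure-and-rebalance idea can be sketched in a few lines: time a kernel
over a set of candidate parameter values and keep the fastest. This is a
simplified stand-in for the paper's dynamic autotuner; the function names and
the toy kernel are illustrative.

```python
import time

def autotune(kernel, candidates, workload):
    """Run the kernel once per candidate parameter value and keep the
    value with the smallest measured runtime."""
    best_param, best_time = None, float("inf")
    for p in candidates:
        start = time.perf_counter()
        kernel(workload, p)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_param, best_time = p, elapsed
    return best_param

def chunked_sum(data, chunk):
    """Toy kernel whose speed depends on a tunable chunk size."""
    total = 0
    for i in range(0, len(data), chunk):
        total += sum(data[i:i + chunk])
    return total

best = autotune(chunked_sum, [16, 256, 4096], list(range(20000)))
```

A production autotuner would re-measure periodically, since the optimal
parameters drift as the adaptive computation evolves.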
Data Protection: Combining Fragmentation, Encryption, and Dispersion, a final report
Hardening data protection using multiple methods rather than 'just'
encryption is of paramount importance when considering continuous and powerful
attacks in order to observe, steal, alter, or even destroy private and
confidential information. Our purpose is to look at cost-effective data
protection by way of combining fragmentation, encryption, and dispersion over
several physical machines. This involves deriving general schemes to protect
data everywhere throughout a network of machines where they are being
processed, transmitted, and stored during their entire life cycle. This is
being enabled by a number of parallel and distributed architectures using
various set of cores or machines ranging from General Purpose GPUs to multiple
clouds. In this report, we first present a general and conceptual description
of what should be a fragmentation, encryption, and dispersion system (FEDS)
including a number of high level requirements such systems ought to meet. Then,
we focus on two kinds of fragmentation. First, a selective separation of
information into two fragments: a public one and a private one. We describe a
family of processes and address not only the question of performance but also
the questions of memory occupation, integrity or quality of the restitution of
the information, and of course we conclude with an analysis of the level of
security provided by our algorithms. Then, we analyze works first on general
dispersion systems operating bit-wise, without data-structure considerations;
second, on fragmentation of information where data are defined along an
object-oriented data structure or along a record structure to be stored in a
relational database.
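To make the fragmentation-plus-dispersion idea concrete, here is a minimal
bit-wise sketch based on XOR secret sharing: the data is split into n
fragments, any n-1 of which are statistically random on their own, so all n
must be gathered to restore the original. This is an illustrative scheme,
not the algorithms analysed in the report.

```python
import os

def fragment(data, n):
    """Split data into n fragments: n-1 random pads plus the XOR
    residue of data with all pads. Each fragment can then be
    dispersed to a different physical machine."""
    pads = [os.urandom(len(data)) for _ in range(n - 1)]
    residue = bytes(data)
    for pad in pads:
        residue = bytes(a ^ b for a, b in zip(residue, pad))
    return pads + [residue]

def restore(fragments):
    """XOR all fragments back together to recover the data."""
    out = fragments[0]
    for frag in fragments[1:]:
        out = bytes(a ^ b for a, b in zip(out, frag))
    return out

fragments = fragment(b"confidential record", 3)  # disperse over 3 machines
```

The per-byte XOR loop is embarrassingly parallel, which is why such
schemes map naturally onto the GPGPU and multi-cloud architectures the
report mentions.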
Performance report and optimized implementations of Weather & Climate dwarfs on multi-node systems
This document is one of the deliverable reports created for the ESCAPE
project. ESCAPE stands for Energy-efficient Scalable Algorithms for Weather
Prediction at Exascale. The project develops world-class, extreme-scale
computing capabilities for European operational numerical weather prediction
and future climate models. This is done by identifying Weather & Climate dwarfs,
which are key patterns in terms of computation and communication (in the spirit
of the Berkeley dwarfs). These dwarfs are then optimised for different hardware
architectures (single and multi-node) and alternative algorithms are explored.
Performance portability is addressed through the use of domain specific
languages.
Here we summarize the work performed on optimizations of the dwarfs focusing
on CPU multi-nodes and multi-GPUs. We limit ourselves to a subset of the dwarf
configurations chosen by the consortium. Intra-node optimizations of the dwarfs
and energy-specific optimizations have been described in Deliverable D3.3. To
cover the important algorithmic motifs we picked dwarfs related to the
dynamical core as well as column physics. Specifically, we focused on the
formulation relevant to spectral codes like ECMWF's IFS code.
The main findings of this report are: (a) up to 30% performance gain with
CPU-based multi-node systems compared to the optimized versions of the dwarfs
from task 3.3 (see D3.3), (b) up to 10x performance gain on multiple GPUs from
optimizations to keep data resident on the GPU and enable fast inter-GPU
communication mechanisms, and (c) multi-GPU systems which feature a
high-bandwidth all-to-all interconnect topology with NVLink/NVSwitch hardware
are particularly well suited to the algorithms.
GHOST: Building blocks for high performance sparse linear algebra on heterogeneous systems
While many of the architectural details of future exascale-class high
performance computer systems are still a matter of intense research, there
appears to be a general consensus that they will be strongly heterogeneous,
featuring "standard" as well as "accelerated" resources. Today, such resources
are available as multicore processors, graphics processing units (GPUs), and
other accelerators such as the Intel Xeon Phi. Any software infrastructure that
claims usefulness for such environments must be able to meet their inherent
challenges: massive multi-level parallelism, topology, asynchronicity, and
abstraction. The "General, Hybrid, and Optimized Sparse Toolkit" (GHOST) is a
collection of building blocks that targets algorithms dealing with sparse
matrix representations on current and future large-scale systems. It implements
the "MPI+X" paradigm, has a pure C interface, and provides hybrid-parallel
numerical kernels, intelligent resource management, and truly heterogeneous
parallelism for multicore CPUs, Nvidia GPUs, and the Intel Xeon Phi. We
describe the details of its design with respect to the challenges posed by
modern heterogeneous supercomputers and recent algorithmic developments.
Implementation details which are indispensable for achieving high efficiency
are pointed out and their necessity is justified by performance measurements or
predictions based on performance models. The library code and several
applications are available as open source. We also provide instructions on how
to make use of GHOST in existing software packages, together with a case study
which demonstrates the applicability and performance of GHOST as a component
within a larger software stack.
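As a point of reference for the kind of kernel such a toolkit provides, here
is a minimal sequential sparse matrix-vector product over the common CSR
format; GHOST's actual hybrid-parallel kernels and storage formats are far
more elaborate, and this sketch is ours, not GHOST code.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """y = A @ x with A stored in compressed sparse row (CSR) form:
    values holds the nonzeros row by row, col_idx their columns, and
    row_ptr[i]:row_ptr[i+1] delimits row i."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):  # each row is independent -> parallelizable
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y

# A = [[4, 1], [0, 2]] in CSR form
values, col_idx, row_ptr = [4.0, 1.0, 2.0], [0, 1, 1], [0, 2, 3]
y = spmv_csr(values, col_idx, row_ptr, [1.0, 1.0])
```

The row-independence of the outer loop is what "MPI+X" toolkits exploit:
rows are distributed over MPI ranks and the per-rank loop is handed to
threads or an accelerator.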
Acceleration of Computational Geometry Algorithms for High Performance Computing Based Geo-Spatial Big Data Analysis
Geo-spatial computing and data analysis is the branch of computer science that deals with real-world location-based data. Computational geometry algorithms process geometry/shapes and are one of the pillars of geo-spatial computing. Real-world map and location-based data can be huge in size, and the data structures used to process them extremely big, leading to huge computational costs. Furthermore, geo-spatial datasets are growing on all V's (Volume, Variety, Value, etc.) and are becoming larger and more complex to process, in turn demanding more computational resources. High performance computing is a way to break the problem down so that it can run in parallel on big computers with massive processing power, reducing the computing time and delivering the same results much faster.
This dissertation explores different techniques to accelerate computational geometry algorithms and geo-spatial computing, such as using many-core graphics processing units (GPU), multi-core central processing units (CPU), multi-node setups with the Message Passing Interface (MPI), cache optimizations, memory and communication optimizations, load balancing, algorithmic modifications, directive-based parallelization with OpenMP or OpenACC, and vectorization with compiler intrinsics (AVX). This dissertation has applied at least one of the mentioned techniques to the following problems. A novel method to parallelize plane-sweep-based geometric intersection for the GPU with directives is presented. Parallelization of plane-sweep-based Voronoi construction, and parallelization of segment tree construction, segment tree queries, and segment-tree-based operations are presented. Spatial autocorrelation and the computation of Getis-Ord hotspots are also presented. Acceleration performance and speedup results are presented in each corresponding chapter.
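The geometric primitive at the heart of plane-sweep intersection is the
orientation test, from which a pairwise segment-crossing check follows. A
minimal sketch (illustrative, not the dissertation's GPU code):

```python
def orient(p, q, r):
    """Sign of the cross product (q - p) x (r - p):
    > 0 if r lies left of line pq, < 0 if right, 0 if collinear."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])

def segments_intersect(a, b, c, d):
    """True when segments ab and cd properly cross: each segment's
    endpoints must straddle the other segment's supporting line."""
    d1, d2 = orient(c, d, a), orient(c, d, b)
    d3, d4 = orient(a, b, c), orient(a, b, d)
    return d1 * d2 < 0 and d3 * d4 < 0
```

A plane sweep keeps an ordered structure of active segments and calls this
test only on sweep-line neighbours, cutting the naive O(n^2) pairing down;
that neighbour bookkeeping is the hard part to parallelize.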
CUDACLAW: A high-performance programmable GPU framework for the solution of hyperbolic PDEs
We present cudaclaw, a CUDA-based high performance data-parallel framework
for the solution of multidimensional hyperbolic partial differential equation
(PDE) systems, equations describing wave motion. cudaclaw allows computational
scientists to solve such systems on GPUs without being burdened by the need to
write CUDA code, worry about thread and block details, data layout, and data
movement between the different levels of the memory hierarchy. The user defines
the set of PDEs to be solved via a CUDA-independent serial Riemann solver and
the framework takes care of orchestrating the computations and data transfers
to maximize arithmetic throughput. cudaclaw treats the different spatial
dimensions separately to allow suitable block sizes and dimensions to be used
in the different directions, and includes a number of optimizations to minimize
access to global memory.
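The finite-volume update that such a framework orchestrates can be
illustrated on the simplest hyperbolic PDE, 1-D linear advection with an
upwind flux. This sequential sketch is ours, not cudaclaw code; the
framework's value is running many such cell updates in parallel across GPU
threads.

```python
def advect_upwind(u, a, dx, dt, steps):
    """First-order upwind finite-volume update for u_t + a*u_x = 0
    (a > 0) with periodic boundaries (u[i-1] wraps via Python's
    negative indexing at i = 0)."""
    nu = a * dt / dx  # CFL number; stable for nu <= 1
    for _ in range(steps):
        u = [u[i] - nu * (u[i] - u[i - 1]) for i in range(len(u))]
    return u

u = advect_upwind([0.0, 0.0, 1.0, 0.0, 0.0], a=1.0, dx=1.0, dt=0.5, steps=4)
```

Each cell update reads only its left neighbour, so a GPU thread per cell
with a halo exchange suffices; dimensional splitting, as in cudaclaw,
applies this 1-D update direction by direction.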
Analysis of heterogeneous computing approaches to simulating heat transfer in heterogeneous material
The simulation of heat flow through heterogeneous material is important for
the design of structural and electronic components. Classical analytical
solutions to the heat equation PDE are not known for many such domains, even
those having simple geometries. The finite element method can provide
approximations to a weak form continuum solution, with increasing accuracy as
the number of degrees of freedom in the model increases. This comes at a cost
of increased memory usage and computation time; even when taking advantage of
sparse matrix techniques for the finite element system matrix. We summarize
recent approaches in solving problems in structural mechanics and steady state
heat conduction which do not require the explicit assembly of any system
matrices, and adapt them to a method for solving the time-dependent flow of
heat. These approaches are highly parallelizable, and can be performed on
graphics processing units (GPUs). Furthermore, they lend themselves to the
simulation of heterogeneous material, with a minimum of added complexity. We
present the mathematical framework of assembly-free FEM approaches, through
which we summarize the benefits of GPU computation. We discuss our
implementation using the OpenCL computing framework, and show how it is further
adapted for use on multiple GPUs. We compare the performance of single- and
dual-GPU implementations of our method with previous GPU computing strategies from
the literature and a CPU sparse matrix approach. The utility of the novel
method is demonstrated through the solution of a real-world coefficient inverse
problem that requires thousands of transient heat flow simulations, each of
which involves solving a 1 million degree of freedom linear system over
hundreds of time steps.
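The assembly-free idea, applying the operator directly instead of forming a
system matrix, can be illustrated with a one-dimensional finite-difference
stencil for the heat equation. This is a deliberately simplified stand-in
for the paper's FEM operator, with our own function names.

```python
def heat_step(u, alpha, dx, dt):
    """One explicit Euler step of u_t = alpha*u_xx via a 3-point
    stencil, applied operator-style: no system matrix is ever
    assembled. Endpoints are held fixed (Dirichlet)."""
    r = alpha * dt / dx ** 2  # stable for r <= 0.5
    new = u[:]
    for i in range(1, len(u) - 1):
        new[i] = u[i] + r * (u[i - 1] - 2 * u[i] + u[i + 1])
    return new

u = heat_step([0.0, 0.0, 1.0, 0.0, 0.0], alpha=1.0, dx=1.0, dt=0.25)
```

Because each node touches only its neighbours and no global matrix is
stored, memory use stays proportional to the solution vector, which is what
makes million-degree-of-freedom transient runs feasible on GPUs.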
A GPGPU based program to solve the TDSE in intense laser fields through the finite difference approach
We present a general-purpose computing on graphics processing units (GPGPU)
based computational program and framework for the electronic dynamics of atomic
systems under intense laser fields. We present our results using the case of
hydrogen; however, the code is trivially extensible to tackle problems within
the single-active-electron (SAE) approximation. Building on our previous work,
we introduce the first available GPGPU-based implementation of the Taylor,
Runge-Kutta, and Lanczos based methods created with strong-field ab-initio
simulations specifically in mind: CLTDSE. The code makes use of finite
difference methods and the OpenCL framework for GPU acceleration. The specific
example system used is the classic test system, hydrogen. After introducing the
standard theory and the specific quantities which are calculated, the code,
including installation and usage, is discussed in depth. This is followed by
some examples and a short benchmark between an 8-hardware-thread (i.e. logical
core) Intel Xeon CPU and an AMD 6970 GPU, where the parallel algorithm runs 10
times faster on the GPU than on the CPU.
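The Taylor propagator mentioned above is straightforward to sketch: the
wavefunction is advanced by a truncated series
psi(t+dt) = sum_k (-i*H*dt)^k / k! psi(t), with H applied matrix-free
through a finite-difference stencil. This pure-Python illustration (free
particle, hbar = m = 1, zero boundary conditions) is ours, not the CLTDSE
OpenCL code.

```python
import math

def apply_H(psi, dx):
    """Free-particle Hamiltonian -(1/2) d^2/dx^2 via a 3-point
    finite-difference stencil, with zero values past the grid edges."""
    n = len(psi)
    out = [0j] * n
    for i in range(n):
        left = psi[i - 1] if i > 0 else 0j
        right = psi[i + 1] if i < n - 1 else 0j
        out[i] = -(left - 2 * psi[i] + right) / (2 * dx * dx)
    return out

def taylor_step(psi, dx, dt, order=4):
    """Advance the TDSE one step with the truncated Taylor series,
    built term by term: term_k = (-1j*dt/k) * H(term_{k-1})."""
    result = list(psi)
    term = list(psi)
    for k in range(1, order + 1):
        term = [(-1j * dt / k) * h for h in apply_H(term, dx)]
        result = [r + t for r, t in zip(result, term)]
    return result

n, dx, dt = 64, 0.2, 0.002
psi0 = [complex(math.exp(-((i - n // 2) * dx) ** 2)) for i in range(n)]
psi1 = taylor_step(psi0, dx, dt)
```

Applying H is a stencil sweep with one thread per grid point on a GPU,
which is why this propagator family maps so well onto OpenCL.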