12,116 research outputs found
Reproducibility, accuracy and performance of the Feltor code and library on parallel computer architectures
Feltor is a modular and free scientific software package. It allows
developing platform independent code that runs on a variety of parallel
computer architectures ranging from laptop CPUs to multi-GPU distributed memory
systems. Feltor consists of both a numerical library and a collection of
application codes built on top of the library. Its main target are two- and
three-dimensional drift- and gyro-fluid simulations with discontinuous Galerkin
methods as the main numerical discretization technique. We observe that
numerical simulations of a recently developed gyro-fluid model produce
non-deterministic results in parallel computations. First, we show how we
restore accuracy and bitwise reproducibility algorithmically and
programmatically. In particular, we adopt an implementation of the exactly
rounded dot product based on long accumulators, which avoids accuracy losses
especially in parallel applications. However, reproducibility and accuracy
alone fail to indicate correct simulation behaviour. In fact, in the physical
model slightly different initial conditions lead to vastly different end
states. This behaviour translates to its numerical representation. Pointwise
convergence, even in principle, becomes impossible for long simulation times.
In a second part, we explore important performance tuning considerations. We
identify latency and memory bandwidth as the main performance indicators of our
routines. Based on these, we propose a parallel performance model that predicts
the execution time of algorithms implemented in Feltor and test our model on a
selection of parallel hardware architectures. We are able to predict the
execution time with a relative error of less than 25% for problem sizes between
0.1 and 1000 MB. Finally, we find that the product of latency and bandwidth
gives a minimum array size per compute node to achieve a scaling efficiency
above 50% (both strong and weak)
Parallel Integer Polynomial Multiplication
We propose a new algorithm for multiplying dense polynomials with integer
coefficients in a parallel fashion, targeting multi-core processor
architectures. Complexity estimates and experimental comparisons demonstrate
the advantages of this new approach
Marathon: An open source software library for the analysis of Markov-Chain Monte Carlo algorithms
In this paper, we consider the Markov-Chain Monte Carlo (MCMC) approach for
random sampling of combinatorial objects. The running time of such an algorithm
depends on the total mixing time of the underlying Markov chain and is unknown
in general. For some Markov chains, upper bounds on this total mixing time
exist but are too large to be applicable in practice. We try to answer the
question, whether the total mixing time is close to its upper bounds, or if
there is a significant gap between them. In doing so, we present the software
library marathon which is designed to support the analysis of MCMC based
sampling algorithms. The main application of this library is to compute
properties of so-called state graphs which represent the structure of Markov
chains. We use marathon to investigate the quality of several bounding methods
on four well-known Markov chains for sampling perfect matchings and bipartite
graph realizations. In a set of experiments, we compute the total mixing time
and several of its bounds for a large number of input instances. We find that
the upper bound gained by the famous canonical path method is several
magnitudes larger than the total mixing time and deteriorates with growing
input size. In contrast, the spectral bound is found to be a precise
approximation of the total mixing time
A Survey on Homomorphic Encryption Schemes: Theory and Implementation
Legacy encryption systems depend on sharing a key (public or private) among
the peers involved in exchanging an encrypted message. However, this approach
poses privacy concerns. Especially with popular cloud services, the control
over the privacy of the sensitive data is lost. Even when the keys are not
shared, the encrypted material is shared with a third party that does not
necessarily need to access the content. Moreover, untrusted servers, providers,
and cloud operators can keep identifying elements of users long after users end
the relationship with the services. Indeed, Homomorphic Encryption (HE), a
special kind of encryption scheme, can address these concerns as it allows any
third party to operate on the encrypted data without decrypting it in advance.
Although this extremely useful feature of the HE scheme has been known for over
30 years, the first plausible and achievable Fully Homomorphic Encryption (FHE)
scheme, which allows any computable function to perform on the encrypted data,
was introduced by Craig Gentry in 2009. Even though this was a major
achievement, different implementations so far demonstrated that FHE still needs
to be improved significantly to be practical on every platform. First, we
present the basics of HE and the details of the well-known Partially
Homomorphic Encryption (PHE) and Somewhat Homomorphic Encryption (SWHE), which
are important pillars of achieving FHE. Then, the main FHE families, which have
become the base for the other follow-up FHE schemes are presented. Furthermore,
the implementations and recent improvements in Gentry-type FHE schemes are also
surveyed. Finally, further research directions are discussed. This survey is
intended to give a clear knowledge and foundation to researchers and
practitioners interested in knowing, applying, as well as extending the state
of the art HE, PHE, SWHE, and FHE systems.Comment: - Updated. (October 6, 2017) - This paper is an early draft of the
survey that is being submitted to ACM CSUR and has been uploaded to arXiv for
feedback from stakeholder
GHOST: Building blocks for high performance sparse linear algebra on heterogeneous systems
While many of the architectural details of future exascale-class high
performance computer systems are still a matter of intense research, there
appears to be a general consensus that they will be strongly heterogeneous,
featuring "standard" as well as "accelerated" resources. Today, such resources
are available as multicore processors, graphics processing units (GPUs), and
other accelerators such as the Intel Xeon Phi. Any software infrastructure that
claims usefulness for such environments must be able to meet their inherent
challenges: massive multi-level parallelism, topology, asynchronicity, and
abstraction. The "General, Hybrid, and Optimized Sparse Toolkit" (GHOST) is a
collection of building blocks that targets algorithms dealing with sparse
matrix representations on current and future large-scale systems. It implements
the "MPI+X" paradigm, has a pure C interface, and provides hybrid-parallel
numerical kernels, intelligent resource management, and truly heterogeneous
parallelism for multicore CPUs, Nvidia GPUs, and the Intel Xeon Phi. We
describe the details of its design with respect to the challenges posed by
modern heterogeneous supercomputers and recent algorithmic developments.
Implementation details which are indispensable for achieving high efficiency
are pointed out and their necessity is justified by performance measurements or
predictions based on performance models. The library code and several
applications are available as open source. We also provide instructions on how
to make use of GHOST in existing software packages, together with a case study
which demonstrates the applicability and performance of GHOST as a component
within a larger software stack.Comment: 32 pages, 11 figure
A domain-specific language and matrix-free stencil code for investigating electronic properties of Dirac and topological materials
We introduce PVSC-DTM (Parallel Vectorized Stencil Code for Dirac and
Topological Materials), a library and code generator based on a domain-specific
language tailored to implement the specific stencil-like algorithms that can
describe Dirac and topological materials such as graphene and topological
insulators in a matrix-free way. The generated hybrid-parallel (MPI+OpenMP)
code is fully vectorized using Single Instruction Multiple Data (SIMD)
extensions. It is significantly faster than matrix-based approaches on the node
level and performs in accordance with the roofline model. We demonstrate the
chip-level performance and distributed-memory scalability of basic building
blocks such as sparse matrix-(multiple-) vector multiplication on modern
multicore CPUs. As an application example, we use the PVSC-DTM scheme to (i)
explore the scattering of a Dirac wave on an array of gate-defined quantum
dots, to (ii) calculate a bunch of interior eigenvalues for strong topological
insulators, and to (iii) discuss the photoemission spectra of a disordered Weyl
semimetal.Comment: 16 pages, 2 tables, 11 figure
Synthesis and Optimization of Reversible Circuits - A Survey
Reversible logic circuits have been historically motivated by theoretical
research in low-power electronics as well as practical improvement of
bit-manipulation transforms in cryptography and computer graphics. Recently,
reversible circuits have attracted interest as components of quantum
algorithms, as well as in photonic and nano-computing technologies where some
switching devices offer no signal gain. Research in generating reversible logic
distinguishes between circuit synthesis, post-synthesis optimization, and
technology mapping. In this survey, we review algorithmic paradigms ---
search-based, cycle-based, transformation-based, and BDD-based --- as well as
specific algorithms for reversible synthesis, both exact and heuristic. We
conclude the survey by outlining key open challenges in synthesis of reversible
and quantum logic, as well as most common misconceptions.Comment: 34 pages, 15 figures, 2 table
Towards Comprehensive Parametric Code Generation Targeting Graphics Processing Units in Support of Scientific Computation
The most popular multithreaded languages based on the fork-join concurrency model (CIlkPlus, OpenMP) are currently being extended to support other forms of parallelism (vectorization, pipelining and single-instruction-multiple-data (SIMD)). In the SIMD case, the objective is to execute the corresponding code on a many-core device, like a GPGPU, for which the CUDA language is a natural choice. Since the programming concepts of CilkPlus and OpenMP are very different from those of CUDA, it is desirable to automatically generate optimized CUDA-like code from CilkPlus or OpenMP.
In this thesis, we propose an accelerator model for annotated C/C++ code together with an implementation that allows the automatic generation of CUDA code. One of the key features of this CUDA code generator is that it supports the generation of CUDA kernel code where program parameters (like number of threads per block) and machine parameters (like shared memory size) are treated as unknown symbols. Hence, these parameters need not to be known at code-generation-time: machine parameters and program parameters can be respectively determined when the generated code is installed on the target machine. In addition, we show how these parametric CUDA programs can be optimized at compile-time in the form of a case discussion, where cases depend on the values of machine parameters (e.g. hardware resource limits) and program parameters (e.g. dimension sizes of thread-blocks). This generation of parametric CUDA kernels requires to deal with non-linear polynomial expressions during the dependence analysis and tiling phase. To achieve these algebraic calculations, we take advantage of techniques from computer algebra, in particular in the RegularChains library of Maple. Various illustrative examples are provided together with performance evaluation
- …