Computing hypergeometric functions rigorously
We present an efficient implementation of hypergeometric functions in
arbitrary-precision interval arithmetic. The functions $_0F_1$, $_1F_1$, $_2F_1$,
and $_2F_0$ (or the Kummer $U$-function) are supported for
unrestricted complex parameters and argument, and by extension, we cover
exponential and trigonometric integrals, error functions, Fresnel integrals,
incomplete gamma and beta functions, Bessel functions, Airy functions, Legendre
functions, Jacobi polynomials, complete elliptic integrals, and other special
functions. The output can be used directly for interval computations or to
generate provably correct floating-point approximations in any format.
Performance is competitive with earlier arbitrary-precision software, and
sometimes orders of magnitude faster. We also partially cover the generalized
hypergeometric function $_pF_q$ and computation of high-order parameter
derivatives.
Comment: v2: corrected example in section 3.1; corrected timing data for case
E-G in section 8.5 (table 6, figure 2); adjusted paper siz
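To illustrate what a rigorous enclosure means here, the following minimal sketch sums the $_1F_1$ Taylor series in exact rational arithmetic and attaches a geometric bound on the truncated tail. It is not the algorithm of the paper or of any library: the function name `hyp1f1_enclosure` is hypothetical, and the bound assumes rational inputs with a > 0 and b > 0.

```python
import math
from fractions import Fraction


def hyp1f1_enclosure(a, b, z, n_terms):
    """Return (mid, rad) with |1F1(a; b; z) - mid| <= rad.

    Hypothetical helper for illustration only. Assumes a > 0 and b > 0,
    so that (a + k) / (b + k) <= max(1, a / b) for every k >= 0.
    """
    a, b, z = Fraction(a), Fraction(b), Fraction(z)
    if a <= 0 or b <= 0:
        raise ValueError("this sketch assumes a > 0 and b > 0")

    # For k >= n_terms, consecutive terms satisfy |t_{k+1} / t_k| <= r.
    r = max(Fraction(1), a / b) * abs(z) / (n_terms + 1)
    if r >= 1:
        raise ValueError("n_terms too small for the geometric tail bound")

    term = Fraction(1)      # t_0
    partial = Fraction(0)   # exact sum of t_0 .. t_{n_terms - 1}
    for k in range(n_terms):
        partial += term
        term = term * (a + k) * z / ((b + k) * (k + 1))

    # term now equals t_{n_terms}; the tail is bounded by a geometric series.
    radius = abs(term) / (1 - r)
    return partial, radius


# 1F1(1; 1; z) = exp(z), which gives an easy sanity check.
mid, rad = hyp1f1_enclosure(1, 1, Fraction(1, 2), 12)
print(float(mid), float(rad), math.exp(0.5))
```

Because the partial sum is exact, the only error is the tail, so the pair (mid, rad) is a provable enclosure under the stated assumptions. A production implementation would use ball arithmetic at finite precision instead of exact fractions.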
Computing the Lambert W function in arbitrary-precision complex interval arithmetic
We describe an algorithm to evaluate all the complex branches of the Lambert
W function with rigorous error bounds in interval arithmetic, which has been
implemented in the Arb library. The classic 1996 paper on the Lambert W
function by Corless et al. provides a thorough but partly heuristic numerical
analysis which needs to be complemented with some explicit inequalities and
practical observations about managing precision and branch cuts.
Comment: 16 pages, 4 figure
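As a rough illustration of evaluating and then checking a Lambert W value, the sketch below runs Halley's iteration for the principal branch on real x >= 0 and then tests a small bracket using the monotonicity of w*exp(w). It is not the algorithm described above and is not rigorous, because ordinary floating-point `exp` is not rounded outward; the names `lambert_w0` and `bracket_w0` are hypothetical.

```python
import math


def lambert_w0(x, tol=1e-15, max_iter=50):
    """Approximate the principal branch W_0(x) for real x >= 0 (float sketch)."""
    if x < 0:
        raise ValueError("this sketch only handles x >= 0")
    w = math.log1p(x)  # starting guess; log(1 + x) >= W_0(x) for x >= 0
    for _ in range(max_iter):
        ew = math.exp(w)
        f = w * ew - x
        # Halley step for f(w) = w * exp(w) - x
        step = f / (ew * (w + 1.0) - (w + 2.0) * f / (2.0 * w + 2.0))
        w -= step
        if abs(step) <= tol * max(1.0, abs(w)):
            break
    return w


def bracket_w0(x, w, rel=1e-12):
    """Return (a, b) around w if a*exp(a) <= x <= b*exp(b), else None.

    w * exp(w) is increasing for w >= -1, so a passing check brackets W_0(x),
    up to the floating-point rounding that this sketch does not control.
    """
    delta = rel * max(1.0, abs(w))
    a, b = w - delta, w + delta
    if a * math.exp(a) <= x <= b * math.exp(b):
        return a, b
    return None


w = lambert_w0(1.0)
print(w, bracket_w0(1.0, w))
```

A rigorous version would replace the float comparisons with interval evaluations of w*exp(w) and would also have to handle the other branches and the branch cuts.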
Reproducibility, accuracy and performance of the Feltor code and library on parallel computer architectures
Feltor is a modular and free scientific software package. It allows
developing platform independent code that runs on a variety of parallel
computer architectures ranging from laptop CPUs to multi-GPU distributed memory
systems. Feltor consists of both a numerical library and a collection of
application codes built on top of the library. Its main targets are two- and
three-dimensional drift- and gyro-fluid simulations with discontinuous Galerkin
methods as the main numerical discretization technique. We observe that
numerical simulations of a recently developed gyro-fluid model produce
non-deterministic results in parallel computations. First, we show how we
restore accuracy and bitwise reproducibility algorithmically and
programmatically. In particular, we adopt an implementation of the exactly
rounded dot product based on long accumulators, which avoids accuracy losses
especially in parallel applications. However, reproducibility and accuracy
alone fail to indicate correct simulation behaviour. In fact, in the physical
model slightly different initial conditions lead to vastly different end
states. This behaviour translates to its numerical representation. Pointwise
convergence, even in principle, becomes impossible for long simulation times.
In a second part, we explore important performance tuning considerations. We
identify latency and memory bandwidth as the main performance indicators of our
routines. Based on these, we propose a parallel performance model that predicts
the execution time of algorithms implemented in Feltor and test our model on a
selection of parallel hardware architectures. We are able to predict the
execution time with a relative error of less than 25% for problem sizes between
0.1 and 1000 MB. Finally, we find that the product of latency and bandwidth
gives a minimum array size per compute node to achieve a scaling efficiency
above 50% (both strong and weak).
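The link between the latency-bandwidth product and the 50% efficiency threshold can be sketched with a simple additive cost model. The model below (time = latency + size / bandwidth) is an assumption made for illustration, not the model fitted in the paper, and the function names and hardware numbers are hypothetical.

```python
def predicted_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Assumed additive model: fixed latency plus bandwidth-limited transfer."""
    return latency_s + size_bytes / bandwidth_bytes_per_s


def efficiency(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Fraction of the predicted time spent moving data rather than waiting."""
    ideal = size_bytes / bandwidth_bytes_per_s
    return ideal / predicted_time(size_bytes, latency_s, bandwidth_bytes_per_s)


def min_size_for_half_efficiency(latency_s, bandwidth_bytes_per_s):
    """Array size at which latency and transfer time are equal."""
    return latency_s * bandwidth_bytes_per_s


# Made-up hardware numbers, chosen only to make the arithmetic easy to follow.
latency = 1e-5     # seconds
bandwidth = 1e11   # bytes per second
size_min = min_size_for_half_efficiency(latency, bandwidth)
print(size_min, efficiency(size_min, latency, bandwidth))
```

In this model the efficiency is S / (S + latency * bandwidth), so it reaches one half exactly when the array size S equals the latency-bandwidth product and rises toward one for larger arrays.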
Fourier-based schemes with modified Green operator for computing the electrical response of heterogeneous media with accurate local fields
A modified Green operator is proposed as an improvement of Fourier-based
numerical schemes commonly used for computing the electrical or thermal
response of heterogeneous media. Contrary to other methods, the number of
iterations necessary to achieve convergence tends to a finite value when the
contrast of properties between the phases becomes infinite. Furthermore, it is
shown that the method produces much more accurate local fields inside
highly-conducting and quasi-insulating phases, as well as in the vicinity of
the phase interfaces. These good properties stem from the discretization of
Green's function, which is consistent with the pixel grid while retaining the
local nature of the operator that acts on the polarization field. Finally, a
fast implementation of the "direct scheme" of Moulinec et al. (1994) that
allows for parsimonious memory use is proposed.
Comment: v2: `postprint' document (a few remaining typos in the published
version herein corrected in red; results unchanged
Hierarchical Parallelisation of Functional Renormalisation Group Calculations -- hp-fRG
The functional renormalisation group (fRG) has evolved into a versatile tool
in condensed matter theory for studying important aspects of correlated
electron systems. Practical applications of the method often involve a high
numerical effort, motivating the question of to what extent High Performance Computing
(HPC) can benefit the approach. In this work we report on a multi-level
parallelisation of the underlying computational machinery and show that this
can speed up the code by several orders of magnitude. This in turn can extend
the applicability of the method to otherwise inaccessible cases. We exploit
three levels of parallelisation: Distributed computing by means of Message
Passing (MPI), shared-memory computing using OpenMP, and vectorisation by means
of SIMD units (single-instruction-multiple-data). Results are provided for two
distinct High Performance Computing (HPC) platforms, namely the IBM-based
BlueGene/Q system JUQUEEN and an Intel Sandy-Bridge-based development cluster.
We discuss how certain issues and obstacles were overcome in the course of
adapting the code. Most importantly, we conclude that this vast improvement can
actually be accomplished by introducing only moderate changes to the code, such
that this strategy may serve as a guideline for other researchers to likewise
improve the efficiency of their codes.
QCD simulations with staggered fermions on GPUs
We report on our implementation of the RHMC algorithm for the simulation of
lattice QCD with two staggered flavors on Graphics Processing Units, using the
NVIDIA CUDA programming language. The main feature of our code is that the GPU
is not used just as an accelerator, but instead the whole Molecular Dynamics
trajectory is performed on it. After pointing out the main bottlenecks and how
to circumvent them, we discuss the performance obtained. We present some
preliminary results regarding OpenCL and multiGPU extensions of our code and
discuss future perspectives.
Comment: 22 pages, 14 eps figures, final version to be published in Computer
Physics Communication
Interest rate models with Markov chains
An IoT Endpoint System-on-Chip for Secure and Energy-Efficient Near-Sensor Analytics
Near-sensor data analytics is a promising direction for IoT endpoints, as it
minimizes energy spent on communication and reduces network load, but it also
poses security concerns, as valuable data is stored or sent over the network at
various stages of the analytics pipeline. Using encryption to protect sensitive
data at the boundary of the on-chip analytics engine is a way to address data
security issues. To cope with the combined workload of analytics and encryption
in a tight power envelope, we propose Fulmine, a System-on-Chip based on a
tightly-coupled multi-core cluster augmented with specialized blocks for
compute-intensive data processing and encryption functions, supporting software
programmability for regular computing tasks. The Fulmine SoC, fabricated in
65nm technology, consumes less than 20mW on average at 0.8V achieving an
efficiency of up to 70pJ/B in encryption, 50pJ/px in convolution, or up to
25MIPS/mW in software. As a strong argument for real-life flexible application
of our platform, we show experimental results for three secure analytics use
cases: secure autonomous aerial surveillance with a state-of-the-art deep CNN
consuming 3.16pJ per equivalent RISC op; local CNN-based face detection with
secured remote recognition in 5.74pJ/op; and seizure detection with encrypted
data collection from EEG within 12.7pJ/op.
Comment: 15 pages, 12 figures, accepted for publication to the IEEE
Transactions on Circuits and Systems - I: Regular Paper