8,695 research outputs found
Parallel Algorithms for Summing Floating-Point Numbers
The problem of exactly summing n floating-point numbers is a fundamental
problem that has many applications in large-scale simulations and computational
geometry. Unfortunately, due to the round-off error in standard floating-point
operations, this problem becomes very challenging. Moreover, all existing
solutions rely on sequential algorithms which cannot scale to the huge datasets
that need to be processed.
In this paper, we provide several efficient parallel algorithms for summing n
floating point numbers, so as to produce a faithfully rounded floating-point
representation of the sum. We present algorithms in PRAM, external-memory, and
MapReduce models, and we also provide an experimental analysis of our MapReduce
algorithms, due to their simplicity and practical efficiency.Comment: Conference version appears in SPAA 201
A hierarchically blocked Jacobi SVD algorithm for single and multiple graphics processing units
We present a hierarchically blocked one-sided Jacobi algorithm for the
singular value decomposition (SVD), targeting both single and multiple graphics
processing units (GPUs). The blocking structure reflects the levels of GPU's
memory hierarchy. The algorithm may outperform MAGMA's dgesvd, while retaining
high relative accuracy. To this end, we developed a family of parallel pivot
strategies on GPU's shared address space, but applicable also to inter-GPU
communication. Unlike common hybrid approaches, our algorithm in a single GPU
setting needs a CPU for the controlling purposes only, while utilizing GPU's
resources to the fullest extent permitted by the hardware. When required by the
problem size, the algorithm, in principle, scales to an arbitrary number of GPU
nodes. The scalability is demonstrated by more than twofold speedup for
sufficiently large matrices on a Tesla S2050 system with four GPUs vs. a single
Fermi card.Comment: Accepted for publication in SIAM Journal on Scientific Computin
Efficient implementation of the Hardy-Ramanujan-Rademacher formula
We describe how the Hardy-Ramanujan-Rademacher formula can be implemented to
allow the partition function to be computed with softly optimal
complexity and very little overhead. A new implementation
based on these techniques achieves speedups in excess of a factor 500 over
previously published software and has been used by the author to calculate
, an exponent twice as large as in previously reported
computations.
We also investigate performance for multi-evaluation of , where our
implementation of the Hardy-Ramanujan-Rademacher formula becomes superior to
power series methods on far denser sets of indices than previous
implementations. As an application, we determine over 22 billion new
congruences for the partition function, extending Weaver's tabulation of 76,065
congruences.Comment: updated version containing an unconditional complexity proof;
accepted for publication in LMS Journal of Computation and Mathematic
Zeroes of partial sums of the zeta-function
This article considers the positive integers for which has zeroes in the half-plane . Building on
earlier results, we show that there are no zeroes for and for
. For all other there are infinitely many zeroes.Comment: 5 Pages - Final Version will appear in LMS JC
Secure -ish Nearest Neighbors Classifier
In machine learning, classifiers are used to predict a class of a given query
based on an existing (classified) database. Given a database S of n
d-dimensional points and a d-dimensional query q, the k-nearest neighbors (kNN)
classifier assigns q with the majority class of its k nearest neighbors in S.
In the secure version of kNN, S and q are owned by two different parties that
do not want to share their data. Unfortunately, all known solutions for secure
kNN either require a large communication complexity between the parties, or are
very inefficient to run.
In this work we present a classifier based on kNN, that can be implemented
efficiently with homomorphic encryption (HE). The efficiency of our classifier
comes from a relaxation we make on kNN, where we allow it to consider kappa
nearest neighbors for kappa ~ k with some probability. We therefore call our
classifier k-ish Nearest Neighbors (k-ish NN).
The success probability of our solution depends on the distribution of the
distances from q to S and increase as its statistical distance to Gaussian
decrease.
To implement our classifier we introduce the concept of double-blinded
coin-toss. In a doubly-blinded coin-toss the success probability as well as the
output of the toss are encrypted. We use this coin-toss to efficiently
approximate the average and variance of the distances from q to S. We believe
these two techniques may be of independent interest.
When implemented with HE, the k-ish NN has a circuit depth that is
independent of n, therefore making it scalable. We also implemented our
classifier in an open source library based on HELib and tested it on a breast
tumor database. The accuracy of our classifier (F_1 score) were 98\% and
classification took less than 3 hours compared to (estimated) weeks in current
HE implementations
Optimistic Parallelization of Floating-Point Accumulation
Floating-point arithmetic is notoriously non-associative due to the limited precision representation which demands intermediate values be rounded to fit in the available precision. The resulting cyclic dependency in floating-point accumulation inhibits parallelization of the computation, including efficient use of pipelining. In practice, however, we observe that floating-point operations are "mostly" associative. This observation can be exploited to parallelize floating-point accumulation using a form of optimistic concurrency. In this scheme, we first compute an optimistic associative approximation to the sum and then relax the computation by iteratively propagating errors until the correct sum is obtained. We map this computation to a network of 16 statically-scheduled, pipelined, double-precision floating-point adders on the Virtex-4 LX160 (-12) device where each floating-point adder runs at 296 MHz and has a pipeline depth of 10. On this 16 PE design, we demonstrate an average speedup of 6× with randomly generated data and 3-7× with summations extracted from Conjugate Gradient benchmarks
Computing floating-point logarithms with fixed-point operations
International audienceElementary functions from the mathematical library input and output floating-point numbers. However it is possible to implement them purely using integer/fixed-point arithmetic. This option was not attractive between 1985 and 2005, because mainstream processor hardware supported 64-bit floating-point, but only 32-bit integers. Besides, conversions between floating-point and integer were costly. This has changed in recent years, in particular with the generalization of native 64-bit integer support. The purpose of this article is therefore to reevaluate the relevance of computing floating-point functions in fixed-point. For this, several variants of the double-precision logarithm function are implemented and evaluated. Formulating the problem as a fixed-point one is easy after the range has been (classically) reduced. Then, 64-bit integers provide slightly more accuracy than 53-bit mantissa, which helps speed up the evaluation. Finally, multi-word arithmetic, critical for accurate implementations, is much faster in fixed-point, and natively supported by recent compilers. Novel techniques of argument reduction and rounding test are introduced in this context. Thanks to all this, a purely integer implementation of the correctly rounded double-precision logarithm outperforms the previous state of the art, with the worst-case execution time reduced by a factor 5. This work also introduces variants of the logarithm that input a floating-point number and output the result in fixed-point. These are shown to be both more accurate and more efficient than the traditional floating-point functions for some applications
- …