Generalising the Fast Reciprocal Square Root Algorithm
The Fast Reciprocal Square Root Algorithm is a well-established approximation
technique consisting of two stages: first, a coarse approximation is obtained
by manipulating the bit pattern of the floating point argument using integer
instructions, and second, the coarse result is refined through one or more
steps, traditionally using Newtonian iteration but alternatively using improved
expressions with carefully chosen numerical constants found by other authors.
The algorithm was widely used before microprocessors carried built-in hardware
support for computing reciprocal square roots. At the time of writing, however,
there is in general no hardware acceleration for computing other fixed
fractional powers. This paper generalises the algorithm to cater to all
rational powers, and to support any polynomial degree(s) in the refinement
step(s), and under the assumption of unlimited floating point precision
provides a procedure which automatically constructs provably optimal constants
in all of these cases. It is also shown that, under certain assumptions, the
use of monic refinement polynomials yields results which are much better placed
with respect to the cost/accuracy tradeoff than those obtained using general
polynomials. Further extensions are also analysed, and several new best
approximations are given.
Comment: 19 pages, 8 figures
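The two-stage structure described above — a bit-level coarse guess followed by Newton refinement — can be illustrated with the classic single-precision reciprocal square root. This is a minimal Python sketch of that well-known special case, not the generalised procedure or the optimal constants constructed in the paper:

```python
import struct

def fast_rsqrt(x):
    # Stage 1: reinterpret the float's bits as an integer and apply the
    # classic magic-constant manipulation to get a coarse approximation.
    i = struct.unpack('<I', struct.pack('<f', x))[0]
    i = 0x5F3759DF - (i >> 1)
    y = struct.unpack('<f', struct.pack('<I', i))[0]
    # Stage 2: one Newton refinement step for y ~ 1/sqrt(x).
    y = y * (1.5 - 0.5 * x * y * y)
    return y
```

With one refinement step the relative error is below about 0.2%; further Newton steps (or the paper's optimised refinement polynomials) reduce it much more.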
A Many-Core Overlay for High-Performance Embedded Computing on FPGAs
In this work, we propose a configurable many-core overlay for
high-performance embedded computing. The size of internal memory, supported
operations and number of ports can be configured independently for each core of
the overlay. The overlay was evaluated with matrix multiplication, LU
decomposition and Fast-Fourier Transform (FFT) on a ZYNQ-7020 FPGA platform.
The results show that using a system-level many-core overlay avoids complex
hardware design and still provides good performance results.
Comment: Presented at the First International Workshop on FPGAs for Software
Programmers (FSP 2014) (arXiv:1408.4423)
Parallel Algorithm for Solving Kepler's Equation on Graphics Processing Units: Application to Analysis of Doppler Exoplanet Searches
[Abridged] We present the results of a highly parallel Kepler equation solver
using the Graphics Processing Unit (GPU) on a commercial NVIDIA GeForce GTX 280
and the "Compute Unified Device Architecture" programming environment. We apply
this to evaluate a goodness-of-fit statistic (e.g., chi^2) for Doppler
observations of stars potentially harboring multiple planetary companions
(assuming negligible planet-planet interactions). We tested multiple
implementations using single precision, double precision, pairs of single
precision, and mixed precision arithmetic. We find that the vast majority of
computations can be performed using single precision arithmetic, with selective
use of compensated summation for increased precision. However, standard single
precision is not adequate for calculating the mean anomaly from the time of
observation and orbital period when evaluating the goodness-of-fit for real
planetary systems and observational data sets. Using all double precision, our
GPU code outperforms a similar code using a modern CPU by a factor of over 60.
Using mixed-precision, our GPU code provides a speed-up factor of over 600,
when evaluating N_sys > 1024 model planetary systems, each containing N_pl = 4
planets and assuming N_obs = 256 observations of each system. We conclude that
modern GPUs also offer a powerful tool for repeatedly evaluating Kepler's
equation and a goodness-of-fit statistic for orbital models when presented with
a large parameter space.
Comment: 19 pages, to appear in New Astronomy
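The kernel at the heart of this work is the solution of Kepler's equation, E - e*sin(E) = M, for the eccentric anomaly E. A minimal serial sketch using Newton's method (one of several iteration schemes such a GPU solver might use; the starting guess below is a common convention, not necessarily the paper's):

```python
import math

def solve_kepler(M, e, tol=1e-12, max_iter=50):
    # Newton's method for E - e*sin(E) = M, elliptical orbits (0 <= e < 1).
    E = M if e < 0.8 else math.pi  # conventional starting guess
    for _ in range(max_iter):
        f = E - e * math.sin(E) - M
        if abs(f) < tol:
            break
        E -= f / (1.0 - e * math.cos(E))  # Newton update
    return E
```

On a GPU, many thousands of such independent solves (one per observation per model system) run in parallel, which is where the reported speed-ups come from.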
High-Speed Function Approximation using a Minimax Quadratic Interpolator
A table-based method for high-speed function approximation in single-precision floating-point format is presented in this paper. Our focus is the approximation of reciprocal, square root, square root reciprocal, exponentials, logarithms, trigonometric functions, powering (with a fixed exponent p), and special functions. The algorithm presented here combines table look-up, an enhanced minimax quadratic approximation, and an efficient evaluation of the second-degree polynomial (using a specialized squaring unit, redundant arithmetic, and multioperand addition). The execution times and area costs of an architecture implementing our method are estimated, showing the achievement of the fast execution times of linear approximation methods and the reduced area requirements of other second-degree interpolation algorithms. Moreover, the use of an enhanced minimax approximation which, through an iterative process, takes into account the effect of rounding the polynomial coefficients to a finite size allows for a further reduction in the size of the look-up tables to be used, making our method very suitable for the implementation of an elementary function generator in state-of-the-art DSPs or graphics processing units (GPUs).
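The scheme above stores one set of quadratic coefficients per sub-interval and evaluates a degree-2 polynomial at look-up time. A minimal Python sketch for 1/x on [1, 2), using a Taylor expansion about each interval midpoint as a stand-in for the paper's enhanced minimax fit (the function, interval count, and fitting method here are illustrative assumptions):

```python
def make_recip_table(n=16):
    # One (a0, a1, a2) coefficient triple per sub-interval of [1, 2).
    # A midpoint Taylor expansion of 1/x stands in for the minimax fit.
    table = []
    for k in range(n):
        c = 1.0 + (k + 0.5) / n  # interval midpoint
        table.append((1 / c, -1 / c**2, 1 / c**3))
    return table

def recip_approx(x, table):
    # Table look-up indexed by the leading fraction bits of x,
    # then Horner evaluation of the second-degree polynomial.
    n = len(table)
    k = int((x - 1.0) * n)
    c = 1.0 + (k + 0.5) / n
    a0, a1, a2 = table[k]
    d = x - c
    return a0 + d * (a1 + d * a2)
```

With 16 intervals the worst-case error of this sketch is a few times 10^-5; the paper's minimax coefficients, rounded with the iterative process described, achieve tighter bounds with smaller tables.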
Fast Compensated Algorithms for the Reciprocal Square Root, the Reciprocal Hypotenuse, and Givens Rotations
The reciprocal square root is an important computation for which many very
sophisticated algorithms exist (see for example \cite{863046,863031} and the
references therein). In this paper we develop a simple differential
compensation (much like those developed in \cite{borges}) that can be used to
improve the accuracy of a naive calculation. The approach relies on the use of
the fused multiply-add (FMA) which is widely available in hardware on a variety
of modern computer architectures. We then demonstrate how to combine this
approach with a somewhat inaccurate but fast square root free method for
estimating the reciprocal square root to get a method that is both fast (in
computing environments with a slow square root) and, experimentally, highly
accurate. Finally, we show how this same approach can be extended to the
reciprocal hypotenuse calculation and, most importantly, to the construction of
Givens rotations.
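The differential-compensation idea can be sketched as follows: given a possibly inaccurate estimate r of 1/sqrt(x), form the residual 1 - x*r*r and apply a correction. This is an illustrative Python version; in the paper's setting the residual would be computed with an FMA to avoid cancellation, which plain Python arithmetic does not guarantee:

```python
def compensated_rsqrt_step(x, r):
    # One differential-compensation step for r ~ 1/sqrt(x).
    # d = 1 - x*r*r is the residual; computing it with a hardware FMA
    # (as in the paper) avoids the cancellation that a naive product
    # sequence can suffer.
    d = 1.0 - x * r * r
    return r + 0.5 * r * d
```

Because the step is effectively a Newton iteration driven by the residual, an estimate with relative error eps comes out with error of order eps^2, which is why a fast but crude initial method suffices.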
Highly parallel sparse Cholesky factorization
Several fine-grained parallel algorithms were developed and compared to compute the Cholesky factorization of a sparse matrix. The experimental implementations are on the Connection Machine, a distributed-memory SIMD machine whose programming model conceptually supplies one processor per data element. In contrast to special-purpose algorithms in which the matrix structure conforms to the connection structure of the machine, the focus is on matrices with arbitrary sparsity structure. The most promising algorithm is one whose inner loop performs several dense factorizations simultaneously on a 2-D grid of processors. Virtually any massively parallel dense factorization algorithm can be used as the key subroutine. The sparse code attains execution rates comparable to those of the dense subroutine. Although at present architectural limitations prevent the dense factorization from realizing its potential efficiency, it is concluded that a regular data-parallel architecture can be used efficiently to solve arbitrarily structured sparse problems. A performance model is also presented and is used to analyze the algorithms.
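The dense factorization that serves as the key subroutine above is the standard Cholesky decomposition A = L L^T. A minimal serial Python sketch of that dense kernel (the sparse algorithm applies many such factorizations simultaneously across the processor grid):

```python
def cholesky(A):
    # Lower-triangular L with A = L @ L^T, for symmetric positive
    # definite A given as a list of row lists.
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: subtract the squared entries already computed.
        s = A[j][j] - sum(L[j][k] ** 2 for k in range(j))
        L[j][j] = s ** 0.5
        # Entries below the diagonal in column j.
        for i in range(j + 1, n):
            L[i][j] = (A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))) / L[j][j]
    return L
```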
Direct N-body Kernels for Multicore Platforms
We present an inter-architectural comparison of single- and double-precision direct n-body implementations on modern multicore platforms, including those based on the Intel Nehalem and AMD Barcelona systems, the Sony-Toshiba-IBM PowerXCell/8i processor, and NVIDIA Tesla C870 and C1060 GPU systems. We compare our implementations across platforms on a variety of proxy measures, including performance, coding complexity, and energy efficiency.
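The kernel being compared across architectures is the O(N^2) direct-summation force (or acceleration) computation. A minimal Python sketch, with gravitational constant G = 1 and a softening parameter to avoid singularities (both common conventions in such kernels, assumed here rather than taken from the paper):

```python
def nbody_accel(pos, mass, eps=1e-3):
    # Direct O(N^2) gravitational accelerations (G = 1) with Plummer
    # softening eps; pos is a list of [x, y, z], mass a list of scalars.
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            dx = [pos[j][k] - pos[i][k] for k in range(3)]
            r2 = dx[0] ** 2 + dx[1] ** 2 + dx[2] ** 2 + eps ** 2
            inv_r3 = r2 ** -1.5
            for k in range(3):
                acc[i][k] += mass[j] * dx[k] * inv_r3
    return acc
```

The inner loop is a long stream of independent fused multiply-adds and one reciprocal square root per pair, which is exactly the arithmetic mix that distinguishes the platforms compared above.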