Spherical harmonic transform with GPUs
We describe an algorithm for computing an inverse spherical harmonic
transform suitable for graphics processing units (GPUs). We use CUDA and base our
implementation on a Fortran90 routine included in a publicly available parallel
package, S2HAT. We focus our attention on the two major sequential steps
involved in the transform computation, retaining the efficient parallel
framework of the original code. We detail optimization techniques used to
enhance the performance of the CUDA-based code and contrast them with those
implemented in the Fortran90 version. We also present performance comparisons
of a single CPU plus GPU unit with the S2HAT code running on either one or
four processors. In particular, we find that use of the latest generation of GPUs,
such as NVIDIA GF100 (Fermi), can accelerate the spherical harmonic transforms
by as much as 18 times with respect to S2HAT executed on one core, and by as
much as 5.5 times with respect to S2HAT on 4 cores, with the overall performance
being limited by the Fast Fourier transforms. The work presented here has been
performed in the context of Cosmic Microwave Background simulations and
analysis. However, we expect that the developed software will be of more
general interest and applicability.
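As a rough illustration of the two sequential steps mentioned in the abstract above, the following minimal sketch computes an inverse transform of a real field on a set of rings in pure Python. It is not the S2HAT or CUDA implementation: the function names `legendre_column` and `inverse_sht`, the dictionary layout of the coefficients, and the equally spaced azimuthal samples are all assumptions made for this example, and a direct sum over m stands in for the Fast Fourier transform used in practice.

```python
import cmath
import math


def legendre_column(m, lmax, theta):
    """Normalised associated Legendre values lambda_lm(theta), l = m..lmax.

    Uses the standard three-term recurrence in l, started from lambda_mm.
    """
    x, s = math.cos(theta), math.sin(theta)
    lam_mm = 1.0 / math.sqrt(4.0 * math.pi)
    for k in range(1, m + 1):
        lam_mm *= -math.sqrt((2 * k + 1) / (2 * k)) * s
    lam = {m: lam_mm}
    prev2, prev1 = 0.0, lam_mm
    for l in range(m + 1, lmax + 1):
        a = math.sqrt((4 * l * l - 1) / (l * l - m * m))
        if l > m + 1:
            b = math.sqrt(((l - 1) ** 2 - m * m) / (4 * (l - 1) ** 2 - 1))
        else:
            b = 0.0  # no lambda_{m-1,m} term on the first step
        cur = a * (x * prev1 - b * prev2)
        lam[l] = cur
        prev2, prev1 = prev1, cur
    return lam


def inverse_sht(alm, lmax, thetas, nphi):
    """Synthesise a real map from coefficients alm[(l, m)] with m >= 0.

    Returns one list of nphi samples per ring colatitude in `thetas`.
    """
    rings = []
    for theta in thetas:
        # Step 1: for every m, sum over l of a_lm * lambda_lm(theta).
        delta = []
        for m in range(lmax + 1):
            lam = legendre_column(m, lmax, theta)
            delta.append(sum(alm.get((l, m), 0.0) * lam[l]
                             for l in range(m, lmax + 1)))
        # Step 2: sum over m along the ring (a direct sum here, an FFT in practice).
        ring = []
        for j in range(nphi):
            phi = 2.0 * math.pi * j / nphi
            value = complex(delta[0]).real
            for m in range(1, lmax + 1):
                # Negative m is folded in through the reality condition.
                value += 2.0 * (delta[m] * cmath.exp(1j * m * phi)).real
            ring.append(value)
        rings.append(ring)
    return rings
```

In this sketch a monopole coefficient of sqrt(4*pi) yields a map equal to one everywhere, which is a convenient sanity check. The two steps are independent across rings, which is what makes the per-ring work a natural candidate for parallel execution.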
A Note on Spherical Needlets
Compared with the traditional spherical harmonics, the spherical needlets are
a new generation of spherical wavelets that possess several attractive
properties. Their double localization in both spatial and frequency domains
empowers them to easily and sparsely represent functions with small spatial
scale features. This paper is divided into two parts. First, it reviews the
spherical harmonics and discusses their limitations in representing functions
with small spatial scale features. To overcome the limitations, it introduces
the spherical needlets and their attractive properties. In the second part of
the paper, a Matlab package for the spherical needlets is presented. The
properties of the spherical needlets are demonstrated by several examples using
the package.
Comment: 12 pages, 7 figures, technical report
Using hybrid GPU/CPU kernel splitting to accelerate spherical convolutions
We present a general method for accelerating by more than an order of
magnitude the convolution of pixelated functions on the sphere with a
radially-symmetric kernel. Our method splits the kernel into a compact
real-space component and a compact spherical harmonic space component. These
components can then be convolved in parallel using an inexpensive commodity GPU
and a CPU. We provide models for the computational cost of both real-space and
Fourier space convolutions and an estimate for the approximation error. Using
these models we can determine the optimum split that minimizes the wall clock
time for the convolution while satisfying the desired error bounds. We apply
this technique to the problem of simulating a cosmic microwave background (CMB)
anisotropy sky map at the resolution typical of the high resolution maps
produced by the Planck mission. For the main Planck CMB science channels we
achieve a speedup of over a factor of ten, assuming an acceptable fractional
rms error of order 1.e-5 in the power spectrum of the output map.
Comment: 9 pages, 11 figures, 1 table, accepted by Astronomy & Computing with
minor revisions. arXiv admin note: substantial text overlap with
arXiv:1211.355
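The split selection described in the abstract above can be sketched as a small constrained search. The abstract does not give the actual cost or error models, so `best_split` takes them as caller-supplied functions; the function name, its arguments, and the toy models in the usage lines are hypothetical and chosen only to make the example run.

```python
def best_split(candidates, cost_real, cost_harmonic, error, tolerance):
    """Return (split, wall_time) minimising wall-clock time under an error bound.

    The two components run in parallel, so the wall-clock time of a split is
    taken to be the slower of the two costs. Returns None if no candidate
    satisfies the tolerance.
    """
    best = None
    for split in candidates:
        if error(split) > tolerance:
            continue  # this split violates the requested error bound
        wall_time = max(cost_real(split), cost_harmonic(split))
        if best is None or wall_time < best[1]:
            best = (split, wall_time)
    return best


# Toy usage with made-up models (not taken from the paper):
toy_result = best_split(
    candidates=[1, 2, 3, 4, 5],
    cost_real=lambda r: r * r,           # real-space cost grows with the split
    cost_harmonic=lambda r: 20.0 / r,    # harmonic-space cost shrinks with it
    error=lambda r: 10.0 ** (-2 * r),    # error falls as the split grows
    tolerance=1e-5,
)
```

With these toy models the splits 1 and 2 fail the error bound, and among the remaining candidates the smallest one gives the shortest wall-clock time.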
Estimating the tensor-to-scalar ratio and the effect of residual foreground contamination
We consider future balloon-borne and ground-based suborbital experiments
designed to search for inflationary gravitational waves, and investigate the
impact of residual foregrounds that remain in the estimated cosmic microwave
background maps. This is achieved by propagating foreground modelling
uncertainties from the component separation, under the assumption of a
spatially uniform foreground frequency scaling, through to the power spectrum
estimates, and up to measurement of the tensor-to-scalar ratio in the parameter
estimation step. We characterize the error covariance due to subtracted
foregrounds, and find it to be subdominant compared to instrumental noise and
sample variance in our simulated data analysis. We model the unsubtracted
residual foreground contribution using a two-parameter power law and show that
marginalization over these foreground parameters is effective in accounting for
a bias due to excess foreground power at low multipoles. We conclude that, at least
in the suborbital experimental setups we have simulated, foreground errors may
be modeled and propagated up to parameter estimation with only a slight
degradation of the target sensitivity of these experiments derived neglecting
the presence of the foregrounds.
Comment: 19 pages, 12 figures, accepted for publication in JCAP
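A two-parameter power law of the kind mentioned in the abstract above could look like the sketch below. The abstract does not state the parametrisation, so the choice of an amplitude and a spectral index, the pivot multipole `ell_pivot`, and the function name are assumptions made for illustration only.

```python
def residual_foreground_power(ell, amplitude, index, ell_pivot=80.0):
    """Hypothetical residual foreground spectrum: amplitude * (ell / ell_pivot) ** index.

    `amplitude` is the power at the (assumed) pivot multipole and `index` is
    the power-law slope; a negative index concentrates power at low ell.
    """
    return amplitude * (ell / ell_pivot) ** index
```

In a parameter-estimation step, a term like this would be added to the model power spectrum and its two parameters marginalised over alongside the tensor-to-scalar ratio.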
Parallel Spherical Harmonic Transforms on heterogeneous architectures (GPUs/multi-core CPUs)
Spherical Harmonic Transforms (SHT) are at the heart of many scientific and
practical applications ranging from climate modelling to cosmological
observations. In many of these areas new, cutting-edge science goals have been
recently proposed requiring simulations and analyses of experimental or
observational data at very high resolutions and of unprecedented volumes. Both
these aspects pose a formidable challenge for the currently existing
implementations of the transforms.
This paper describes parallel algorithms for computing SHT with two variants
of intra-node parallelism appropriate for novel supercomputer architectures:
multi-core processors and Graphics Processing Units (GPUs). It also discusses
their performance, alone and embedded within a top-level, MPI-based
parallelisation layer ported from the S2HAT library, in terms of their
accuracy, overall efficiency and scalability. We show that our inverse SHT run
on GeForce 400 Series GPUs equipped with the latest CUDA architecture ("Fermi")
outperforms the state-of-the-art implementation for a multi-core processor
executed on a current Intel Core i7-2600K. Furthermore, we show that an
MPI/CUDA version of the inverse transform run on a cluster of 128 Nvidia Tesla
S1070 is as much as 3 times faster than the hybrid MPI/OpenMP version executed
on the same number of quad-core Intel Nehalem processors for problem sizes
motivated by our target applications. Performance of the direct transforms is,
however, found to be at best comparable in these cases. We discuss in detail
the algorithmic solutions devised for major steps involved in the transforms
calculation, emphasising those with a major impact on their overall
performance, and elucidate the sources of the dichotomy between the direct and
the inverse operations.