6 research outputs found

    Spherical harmonic transform with GPUs

    We describe an algorithm for computing an inverse spherical harmonic transform suitable for graphics processing units (GPUs). We use CUDA and base our implementation on a Fortran90 routine included in a publicly available parallel package, S2HAT. We focus our attention on the two major sequential steps involved in the transform computation, retaining the efficient parallel framework of the original code. We detail optimization techniques used to enhance the performance of the CUDA-based code and contrast them with those implemented in the Fortran90 version. We also present performance comparisons of a single CPU plus GPU unit with the S2HAT code running on either a single processor or 4 processors. In particular, we find that use of the latest generation of GPUs, such as NVIDIA GF100 (Fermi), can accelerate the spherical harmonic transforms by as much as 18 times with respect to S2HAT executed on one core, and by as much as 5.5 times with respect to S2HAT on 4 cores, with the overall performance being limited by the fast Fourier transforms. The work presented here has been performed in the context of Cosmic Microwave Background simulations and analysis. However, we expect that the developed software will be of more general interest and applicability.

    A Note on Spherical Needlets

    Spherical needlets are a new generation of spherical wavelets that possess several attractive properties compared with traditional spherical harmonics. Their double localization in both the spatial and frequency domains allows them to represent functions with small spatial scale features easily and sparsely. This paper is divided into two parts. The first part reviews the spherical harmonics and discusses their limitations in representing functions with small spatial scale features. To overcome these limitations, it introduces the spherical needlets and their attractive properties. The second part presents a Matlab package for the spherical needlets. The properties of the spherical needlets are demonstrated by several examples using the package. Comment: 12 pages, 7 figures, technical report

    Using hybrid GPU/CPU kernel splitting to accelerate spherical convolutions

    We present a general method for accelerating, by more than an order of magnitude, the convolution of pixelated functions on the sphere with a radially symmetric kernel. Our method splits the kernel into a compact real-space component and a compact spherical-harmonic-space component. These components can then be convolved in parallel using an inexpensive commodity GPU and a CPU. We provide models for the computational cost of both real-space and Fourier-space convolutions and an estimate for the approximation error. Using these models, we can determine the optimum split that minimizes the wall clock time for the convolution while satisfying the desired error bounds. We apply this technique to the problem of simulating a cosmic microwave background (CMB) anisotropy sky map at the resolution typical of the high-resolution maps produced by the Planck mission. For the main Planck CMB science channels we achieve a speedup of over a factor of ten, assuming an acceptable fractional rms error of order 1e-5 in the power spectrum of the output map. Comment: 9 pages, 11 figures, 1 table, accepted by Astronomy & Computing w/ minor revisions. arXiv admin note: substantial text overlap with arXiv:1211.355

    Estimating the tensor-to-scalar ratio and the effect of residual foreground contamination

    We consider future balloon-borne and ground-based suborbital experiments designed to search for inflationary gravitational waves, and investigate the impact of residual foregrounds that remain in the estimated cosmic microwave background maps. This is achieved by propagating foreground modelling uncertainties from the component separation, under the assumption of a spatially uniform foreground frequency scaling, through to the power spectrum estimates, and up to the measurement of the tensor-to-scalar ratio in the parameter estimation step. We characterize the error covariance due to subtracted foregrounds, and find it to be subdominant compared to instrumental noise and sample variance in our simulated data analysis. We model the unsubtracted residual foreground contribution using a two-parameter power law and show that marginalization over these foreground parameters is effective in accounting for a bias due to excess foreground power at low \ell. We conclude that, at least in the suborbital experimental setups we have simulated, foreground errors may be modelled and propagated up to parameter estimation with only a slight degradation of the target sensitivity that these experiments would have if foregrounds were neglected. Comment: 19 pages, 12 figures, accepted for publication in JCA

    Parallel Spherical Harmonic Transforms on heterogeneous architectures (GPUs/multi-core CPUs)

    Spherical Harmonic Transforms (SHT) are at the heart of many scientific and practical applications ranging from climate modelling to cosmological observations. In many of these areas, new cutting-edge science goals have recently been proposed that require simulations and analyses of experimental or observational data at very high resolutions and of unprecedented volumes. Both these aspects pose a formidable challenge for the currently existing implementations of the transforms. This paper describes parallel algorithms for computing SHT with two variants of intra-node parallelism appropriate for novel supercomputer architectures: multi-core processors and Graphics Processing Units (GPUs). It also discusses their performance, alone and embedded within a top-level, MPI-based parallelisation layer ported from the S2HAT library, in terms of their accuracy, overall efficiency and scalability. We show that our inverse SHT run on GeForce 400 Series GPUs equipped with the latest CUDA architecture ("Fermi") outperforms the state-of-the-art implementation for a multi-core processor executed on a current Intel Core i7-2600K. Furthermore, we show that an MPI/CUDA version of the inverse transform run on a cluster of 128 Nvidia Tesla S1070 is as much as 3 times faster than the hybrid MPI/OpenMP version executed on the same number of quad-core Intel Nehalem processors for problem sizes motivated by our target applications. Performance of the direct transforms is, however, found to be at best comparable in these cases. We discuss in detail the algorithmic solutions devised for the major steps involved in the transform calculation, emphasising those with a major impact on overall performance, and elucidate the sources of the dichotomy between the direct and the inverse operations.