Search CORE

4,897 research outputs found

Galactos: Computing the Anisotropic 3-Point Correlation Function for 2 Billion Galaxies

Author: Austin Brian
Bard Deborah
Deslippe Jack
Dubey Pradeep
Eisenstein Daniel J
Friesen Brian
Patwary Md. Mostofa Ali
Prabhat
Satish Nadathur
Slepian Zachary
Sundaram Narayanan
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2017
Field of study

The nature of dark energy and the complete theory of gravity are two central questions currently facing cosmology. A vital tool for addressing them is the 3-point correlation function (3PCF), which probes deviations from a spatially random distribution of galaxies. However, the 3PCF's formidable computational expense has prevented its application to astronomical surveys comprising millions to billions of galaxies. We present Galactos, a high-performance implementation of a novel, O(N^2) algorithm that uses a load-balanced k-d tree and spherical harmonic expansions to compute the anisotropic 3PCF. Our implementation is optimized for the Intel Xeon Phi architecture, exploiting SIMD parallelism, instruction and thread concurrency, and significant L1 and L2 cache reuse, reaching 39% of peak performance on a single node. Galactos scales to the full Cori system, achieving 9.8PF (peak) and 5.06PF (sustained) across 9636 nodes, making the 3PCF easily computable for all galaxies in the observable universe.Comment: 11 pages, 7 figures, accepted to SuperComputing 201

arXiv.org e-Print Archive

eScholarship - University of California

Machine Learning Based Auto-tuning for Enhanced OpenCL Performance Portability

Author: Elster Anne C.
Falch Thomas L.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 02/06/2015
Field of study

Heterogeneous computing, which combines devices with different architectures, is rising in popularity, and promises increased performance combined with reduced energy consumption. OpenCL has been proposed as a standard for programing such systems, and offers functional portability. It does, however, suffer from poor performance portability, code tuned for one device must be re-tuned to achieve good performance on another device. In this paper, we use machine learning-based auto-tuning to address this problem. Benchmarks are run on a random subset of the entire tuning parameter configuration space, and the results are used to build an artificial neural network based model. The model can then be used to find interesting parts of the parameter space for further search. We evaluate our method with different benchmarks, on several devices, including an Intel i7 3770 CPU, an Nvidia K40 GPU and an AMD Radeon HD 7970 GPU. Our model achieves a mean relative error as low as 6.1%, and is able to find configurations as little as 1.3% worse than the global minimum.Comment: This is a pre-print version an article to be published in the Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). For personal use onl

arXiv.org e-Print Archive