
    Analysing Astronomy Algorithms for GPUs and Beyond

    Astronomy depends on ever-increasing computing power. Processor clock rates have plateaued, and increased performance now comes in the form of additional processor cores on a single chip. This poses significant challenges to the astronomy software community. Graphics Processing Units (GPUs), now capable of general-purpose computation, exemplify both the difficult learning curve and the significant speedups exhibited by massively-parallel hardware architectures. We present a generalised approach to tackling this paradigm shift, based on the analysis of algorithms. We describe a small collection of foundation algorithms relevant to astronomy and explain how they may be used to ease the transition to massively-parallel computing architectures. We demonstrate the effectiveness of our approach by applying it to four well-known astronomy problems: Hogbom CLEAN, inverse ray-shooting for gravitational lensing, pulsar dedispersion and volume rendering. Algorithms with well-defined memory access patterns and high arithmetic intensity stand to receive the greatest performance boost from massively-parallel architectures, while those that involve a significant amount of decision-making may struggle to take advantage of the available processing power.
    Comment: 10 pages, 3 figures, accepted for publication in MNRAS
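    As a rough, hypothetical illustration of the arithmetic-intensity argument above (not code from the paper; the function name, reuse assumption, and parameter values are invented for the example), the flops-per-byte ratio of brute-force pulsar dedispersion can be estimated as follows:

```python
# Hypothetical back-of-the-envelope estimate of arithmetic intensity
# (flops per byte of memory traffic) for brute-force pulsar dedispersion.
# Illustrative only; not taken from the paper.

def dedispersion_intensity(n_channels: int, n_dms: int, bytes_per_sample: int = 4) -> float:
    """Flops-per-byte for summing n_channels samples for each of n_dms
    trial dispersion measures, per output time sample, assuming each
    input sample is loaded once and reused across all DM trials."""
    flops = n_channels * n_dms                             # one add per channel per DM trial
    bytes_moved = (n_channels + n_dms) * bytes_per_sample  # input reads plus output writes
    return flops / bytes_moved

# Many DM trials reuse each loaded sample, raising the intensity and
# pushing the kernel from memory-bound towards compute-bound:
print(dedispersion_intensity(n_channels=1024, n_dms=500))  # about 84 flops/byte
```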

    A benchmark study on mantle convection in a 3-D spherical shell using CitcomS

    As high-performance computing facilities and sophisticated modeling software become available, modeling mantle convection in a three-dimensional (3-D) spherical shell geometry with realistic physical parameters and processes becomes increasingly feasible. However, there is still a lack of comprehensive benchmark studies for 3-D spherical mantle convection. Here we present benchmark and test calculations using a finite element code CitcomS for 3-D spherical convection. Two classes of model calculations are presented: Stokes' flow, and thermal and thermochemical convection. For Stokes' flow, response functions of characteristic flow velocity, topography, and geoid at the surface and core-mantle boundary (CMB) at different spherical harmonic degrees are computed using CitcomS and are compared with those from analytic solutions using a propagator matrix method. For thermal and thermochemical convection, 24 cases are computed with different model parameters, including Rayleigh number (7 × 10^3 or 10^5) and viscosity contrast due to temperature dependence (1 to 10^7). For each case, time-averaged quantities at the steady state are computed, including surface and CMB Nusselt numbers, RMS velocity, averaged temperature, and the maximum and minimum flow velocity and temperature at mid-mantle depth, together with their standard deviations. For the thermochemical convection cases, in addition to the outputs for thermal convection, we also quantified the entrainment of an initially dense component of the convection and the relative errors in conserving its volume. For nine thermal convection cases that have small viscosity variations and for which previously published results were available, we find that the CitcomS results are mostly consistent with those previously published, with less than 1% relative difference in globally averaged quantities, including Nusselt numbers and RMS velocities. For the other 15 cases, with either strongly temperature-dependent viscosity or thermochemical convection, no previous calculations are available for comparison, but these 15 test calculations from CitcomS are useful for future code development and comparison. We also present results on the parallel efficiency of CitcomS, showing that the code achieves 57% efficiency with 3072 cores on the Texas Advanced Computing Center's parallel supercomputer Ranger.
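    As a quick sanity check of the quoted parallel-efficiency figure (an illustration with hypothetical timings, not data from the paper), strong-scaling efficiency is simply the observed speedup divided by the increase in core count:

```python
# Strong-scaling parallel efficiency: speedup relative to a reference run,
# divided by the corresponding increase in core count. The timings below are
# hypothetical; only the resulting ~57% matches the figure quoted above.

def parallel_efficiency(t_ref: float, n_ref: int, t_n: float, n: int) -> float:
    """Efficiency of a run on n cores relative to a reference run on n_ref cores."""
    speedup = t_ref / t_n
    return speedup / (n / n_ref)

# E.g. a hypothetical 1000 s reference run on 96 cores vs 54.8 s on 3072 cores:
print(parallel_efficiency(t_ref=1000.0, n_ref=96, t_n=54.8, n=3072))  # ~0.57
```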

    DRC²: Dynamically Reconfigurable Computing Circuit based on Memory Architecture

    This paper presents a novel energy-efficient and Dynamically Reconfigurable Computing Circuit (DRC²) concept based on memory architecture for data-intensive (imaging, …) and secure (cryptography, …) applications. The proposed computing circuit is based on a 10-Transistor (10T) 3-Port SRAM bitcell array driven by peripheral circuitry enabling all the basic operations traditionally performed by an ALU. As a result, logic and arithmetic operations can be executed entirely within the memory unit, leading to a significant reduction in the power consumed by data transfer between memories and computing units. Moreover, the proposed computing circuit can perform extremely parallel operations, enabling the processing of large volumes of data. A test case based on an image-processing application and using the saturating increment function is analytically modeled to compare the conventional and DRC²-based approaches. It is demonstrated that the DRC²-based approach reduces the number of clock cycles by up to 2x. Finally, potential applications and the changes that must be considered at different design levels are discussed.
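    To make the test case concrete, here is a minimal software sketch of a saturating increment (the DRC² performs such operations inside the memory array's peripheral logic, not in software): it clamps at the maximum representable value instead of wrapping around to zero.

```python
# Minimal software model of a saturating increment on an unsigned value.
# A sketch for illustration; the paper implements the operation in SRAM
# peripheral circuitry, not in Python.

def saturating_increment(value: int, bits: int = 8) -> int:
    """Increment an unsigned integer, clamping at 2**bits - 1 instead of
    wrapping around to zero on overflow."""
    max_val = (1 << bits) - 1
    return value + 1 if value < max_val else max_val

assert saturating_increment(100) == 101
assert saturating_increment(255) == 255   # saturates rather than wrapping to 0
```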

    HyperPRAW: architecture-aware hypergraph restreaming partition to improve performance of parallel applications running on high performance computing systems

    High Performance Computing (HPC) demand is on the rise, particularly for large distributed computing. HPC systems have, by design, very heterogeneous architectures, both in computation and in communication bandwidth, resulting in wide variations in the cost of communication between compute units. If large distributed applications are to take full advantage of HPC, these physical communication capabilities must be taken into consideration when allocating workload. Hypergraphs are well suited to modelling the total volume of communication in parallel and distributed applications. To the best of our knowledge, there are no hypergraph partitioning algorithms to date that are architecture-aware. We propose a novel restreaming hypergraph partitioning algorithm (HyperPRAW) that exploits peer-to-peer physical bandwidth profiling data to improve the performance of distributed applications on HPC systems. Our results show that not only is the quality of the partitions achieved by our algorithm comparable with state-of-the-art multilevel partitioning, but the runtime of a synthetic benchmark is significantly reduced across the 10 hypergraph models tested, with speedup factors of up to 14x.
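    To make the restreaming idea concrete, here is a deliberately simplified sketch of a bandwidth-aware streaming partitioner (the data structures, greedy cost, and pass count are assumptions, and HyperPRAW's load balancing is omitted; this is not the authors' implementation). Each vertex is greedily placed where the bandwidth-weighted cost of talking to its hyperedge neighbours is smallest, and repeated passes refine the assignment:

```python
# Simplified sketch of bandwidth-aware (re)streaming hypergraph partitioning.
# Not the HyperPRAW implementation: representations and the greedy cost are
# illustrative, and load balancing is omitted for brevity.

def stream_partition(vertices, hyperedges, cost, n_parts, passes=3):
    """vertices: iterable of vertex ids; hyperedges: vertex -> list of
    hyperedges (each a set of vertices); cost[i][j]: communication cost
    between parts i and j (e.g. 1 / measured peer-to-peer bandwidth)."""
    assign = {v: v % n_parts for v in vertices}   # arbitrary initial placement
    for _ in range(passes):                       # restreaming: refine over several passes
        for v in vertices:
            def penalty(p):
                # cost of part p communicating with parts holding v's neighbours
                return sum(cost[p][assign[u]]
                           for e in hyperedges[v] for u in e if u != v)
            assign[v] = min(range(n_parts), key=penalty)
    return assign

# Toy run: one hyperedge joining four vertices, uniform costs. Without a
# balance term every vertex collapses onto a single part, which is why real
# partitioners add a load-balancing penalty to the greedy cost.
verts = [0, 1, 2, 3]
edge = {0, 1, 2, 3}
print(stream_partition(verts, {v: [edge] for v in verts},
                       [[0, 1], [1, 0]], n_parts=2))   # {0: 1, 1: 1, 2: 1, 3: 1}
```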

    Preconditioning the 2D Helmholtz equation with polarized traces

    We present a domain decomposition solver for the 2D Helmholtz equation, with a special choice of integral transmission condition that involves polarizing the waves into one-way components. This refinement of the transmission condition is the key to combining local direct solves into an efficient iterative scheme, which can then be deployed in a high-performance computing environment. The method involves an expensive, but embarrassingly parallel, precomputation of local Green's functions, and a fast online computation of layer potentials in partitioned low-rank form. The online part has sequential complexity that scales sublinearly with respect to the number of volume unknowns, even in the high-frequency regime. The favorable complexity scaling continues to hold in the context of low-order finite difference schemes for standard community models such as BP and Marmousi2, where convergence occurs in 5 to 10 GMRES iterations.
    Funding: TOTAL (Firm); United States Air Force Office of Scientific Research; United States Office of Naval Research; National Science Foundation (U.S.)
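    For context, the heterogeneous Helmholtz problem being preconditioned is standardly written as follows (background notation, not quoted from the paper):

```latex
% Standard form of the 2D heterogeneous Helmholtz equation (background):
% m(x) = 1/c(x)^2 is the squared slowness, \omega the angular frequency,
% f the source, and \Omega the computational domain.
\[
  \left( \Delta + \omega^{2} m(\mathbf{x}) \right) u(\mathbf{x}) = f(\mathbf{x}),
  \qquad \mathbf{x} \in \Omega \subset \mathbb{R}^{2},
\]
% with absorbing boundary conditions on \partial\Omega. The polarized-trace
% idea is to split the interface traces of u between subdomain layers into
% up-going and down-going (one-way) components.
```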

    SFC-based Communication Metadata Encoding for Adaptive Mesh Refinement

    This volume of the series “Advances in Parallel Computing” contains the proceedings of the International Conference on Parallel Programming, ParCo 2013, held from 10 to 13 September 2013 in Garching, Germany. The conference was hosted by the Technische Universität München (Department of Informatics) and the Leibniz Supercomputing Centre.
    The present paper studies two adaptive mesh refinement (AMR) codes whose grids rely on recursive subdivision in combination with space-filling curves (SFCs). A non-overlapping domain decomposition based upon these SFCs yields several well-known advantageous properties with respect to communication demands, balancing, and partition connectivity. However, administering the metadata, i.e. tracking which partitions exchange data and in what cardinality, is nontrivial due to the SFC's fractal meandering and the dynamic adaptivity. We introduce an analysed tree grammar for the metadata that restricts it, without loss of information, hierarchically along the subdivision tree and applies run-length encoding. As a result, the metadata's memory footprint is very small, and it can be computed and maintained on the fly even for permanently changing grids. It facilitates a fork-join pattern for shared data parallelism, and it facilitates replicated data parallelism that tackles latency and bandwidth constraints through communication in the background, while reducing memory requirements by avoiding adjacency information stored per element. We demonstrate this with shared- and distributed-memory parallelized domain decompositions.
    This work was supported by the German Research Foundation (DFG) as part of the Transregional Collaborative Research Centre “Invasive Computing” (SFB/TR 89). It is partially based on work supported by Award No. UK-c0020, made by the King Abdullah University of Science and Technology (KAUST).
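    As a toy illustration of the run-length-encoding step (a flat representation assumed for the example; the paper's analysed tree grammar works hierarchically along the subdivision tree), partition IDs of cells enumerated along the space-filling curve compress into (id, run length) pairs, because SFC-connected partitions yield long runs:

```python
# Toy illustration of run-length encoding of communication metadata: cells
# enumerated along a space-filling curve mostly keep the same partition id
# for long stretches, so (id, run_length) pairs stay very compact. This is
# a flat sketch, not the paper's hierarchical tree grammar.
from itertools import groupby

def rle(partition_ids):
    """Compress a sequence of per-cell partition ids into (id, run_length) pairs."""
    return [(pid, sum(1 for _ in run)) for pid, run in groupby(partition_ids)]

cells = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2]   # cell owners in SFC order
print(rle(cells))   # [(0, 4), (1, 3), (2, 5)]: 3 runs instead of 12 entries
```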

    Dynamic Smagorinsky Modeled Large-Eddy Simulations of Turbulence Using Tetrahedral Meshes

    Eddy-resolving numerical computations of turbulent flows are emerging as viable alternatives to Reynolds-Averaged Navier-Stokes (RANS) calculations for flows with an intrinsically steady mean state, owing to advances in large-scale parallel computing. In these computations, medium to large turbulent eddies are resolved by the numerics, while the smaller or subgrid scales are either modeled or handled by the inherent numerical dissipation. To advance the state of the art of unstructured-mesh turbulence simulation capabilities, large-eddy simulations (LES) using the dynamic Smagorinsky model (DSM) on tetrahedral meshes are carried out with the space-time conservation element, solution element (CESE) method. In contrast to what has been reported in the literature, the present implementation of the dynamic model allows for active backscattering without any ad hoc limiting of the eddy viscosity calculated from the subgrid-scale model. For the benchmark problems involving compressible isotropic turbulence decay as well as shock/turbulent boundary layer interaction, no numerical instability associated with kinetic-energy growth is observed, and the backscattering portion accounts for about 38-40% of the simulation domain by volume. A slip-wall model in conjunction with the implemented DSM is used to simulate a relatively high Reynolds number, Mach 2.85 turbulent boundary layer over a 30° ramp with several tetrahedral meshes and a wall-normal spacing of either y+ = 10 or y+ = 20. The computed mean wall-pressure distribution, separation-region size, mean velocity profiles, and Reynolds stresses agree reasonably well with experimental data.
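    For background (standard formulation, not quoted from the paper), the Smagorinsky subgrid-scale model sets the eddy viscosity from the resolved strain rate; in the dynamic variant the coefficient is computed on the fly, and negative values correspond to the backscatter the abstract reports retaining:

```latex
% Standard Smagorinsky eddy viscosity (background, not quoted from the paper):
% \bar{S}_{ij} is the resolved strain-rate tensor, \Delta the filter width,
% and C_s the Smagorinsky coefficient, computed dynamically via the Germano
% identity in the DSM. Negative C_s^2 corresponds to backscatter.
\[
  \nu_t = (C_s \Delta)^2 \, \lvert \bar{S} \rvert,
  \qquad
  \lvert \bar{S} \rvert = \sqrt{2\, \bar{S}_{ij} \bar{S}_{ij}},
  \qquad
  \bar{S}_{ij} = \tfrac{1}{2} \left( \partial_j \bar{u}_i + \partial_i \bar{u}_j \right).
\]
```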