
    An investigation of the performance portability of OpenCL

    This paper reports on the development of an MPI/OpenCL implementation of LU, an application-level benchmark from the NAS Parallel Benchmark Suite. An account of the design decisions addressed during the development of this code is presented, demonstrating the importance of memory arrangement and work-item/work-group distribution strategies when applications are deployed on different device types. The resulting platform-agnostic, single-source application is benchmarked on a number of different architectures, and is shown to be 1.3–1.5× slower than native FORTRAN 77 or CUDA implementations on a single node and 1.3–3.1× slower on multiple nodes. We also explore the potential performance gains of OpenCL's device fissioning capability, demonstrating up to a 3× speed-up over our original OpenCL implementation.
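    To make "memory arrangement" and "work-group distribution" concrete, the sketch below contrasts two index mappings for a grid with several values per point, and rounds a global work size up to a multiple of the work-group size (OpenCL 1.x requires the global size to be divisible by the local size). The helper names and the choice of five components per point are illustrative assumptions, not the paper's code. With array-of-structures (AoS), neighbouring work-items reading the same component touch addresses strided by the component count; with structure-of-arrays (SoA) they read consecutive addresses, which GPUs generally coalesce better, whereas a CPU work-item walking all components of one point may favour AoS.

```python
# Hypothetical helpers illustrating layout and launch-size decisions;
# not taken from the paper's MPI/OpenCL LU code.

def aos_index(cell: int, comp: int, ncomp: int) -> int:
    """Array-of-structures: all ncomp values of one cell are contiguous."""
    return cell * ncomp + comp


def soa_index(cell: int, comp: int, ncells: int) -> int:
    """Structure-of-arrays: one component for every cell is contiguous."""
    return comp * ncells + cell


def padded_global_size(n: int, local_size: int) -> int:
    """Round n work-items up to the next multiple of the work-group size."""
    return ((n + local_size - 1) // local_size) * local_size


ncells, ncomp = 8, 5  # assumed: 5 components per grid point
# Addresses touched by work-items 0..3 when each reads component 0 of its cell:
print([aos_index(c, 0, ncomp) for c in range(4)])   # [0, 5, 10, 15] (strided)
print([soa_index(c, 0, ncells) for c in range(4)])  # [0, 1, 2, 3] (unit stride)
print(padded_global_size(1000, 64))                 # 1024
```

    Padded work-items beyond the real problem size would need a bounds check inside the kernel so they do no work.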

    Study of laser deposited thin films Final report, 4 May 1967 - 4 May 1968

    Feasibility of laser-deposited metal films for mirror production.

    WMTrace : a lightweight memory allocation tracker and analysis framework

    The diverging gap between processor and memory performance has been a well-discussed aspect of computer architecture literature for some years. The use of multi-core processor designs has, however, brought new problems to the design of memory architectures: increased core density without matched improvement in memory capacity is reducing the available memory per parallel process. Multiple cores accessing memory simultaneously degrades performance as a result of resource contention for memory channels and physical DIMMs. These issues combine to ensure that memory remains an ongoing challenge in the design of parallel algorithms which scale. In this paper we present WMTrace, a lightweight tool to trace and analyse memory allocation events in parallel applications. This tool is able to dynamically link to pre-existing application binaries, requiring no source code modification or recompilation. A post-execution analysis stage enables in-depth analysis of traces, allowing memory allocations to be analysed by time, size or function. The second half of this paper features a case study in which we apply WMTrace to five parallel scientific applications and benchmarks, demonstrating its effectiveness at recording high-water mark memory consumption as well as memory use per function over time. An in-depth analysis is provided for an unstructured mesh benchmark which reveals significant memory allocation imbalance across its participating processes.
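    A post-execution pass of the kind described can derive both a high-water mark and per-function memory use from a time-ordered list of allocation events. The sketch below assumes a hypothetical trace record (`AllocEvent`) holding a timestamp, the operation, the pointer, the requested size and the allocating function; WMTrace's actual trace format and interception mechanism are not described here and may differ. Because `free()` receives only a pointer, the analysis must remember each live block's size and owner to attribute the release correctly.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class AllocEvent:
    """One record from a hypothetical allocation trace."""
    time: float
    op: str          # "malloc" or "free"
    ptr: int         # address returned by malloc / passed to free
    size: int = 0    # bytes requested; unused for "free"
    func: str = ""   # function that made the allocation


def analyse(events):
    """Return (high-water mark in bytes, live bytes per allocating function)."""
    live = {}                    # ptr -> (size, func) for outstanding blocks
    per_func = defaultdict(int)  # func -> bytes it currently holds
    current = high_water = 0
    for ev in sorted(events, key=lambda e: e.time):
        if ev.op == "malloc":
            live[ev.ptr] = (ev.size, ev.func)
            current += ev.size
            per_func[ev.func] += ev.size
        elif ev.op == "free" and ev.ptr in live:
            # Attribute the release to whichever function allocated the block.
            size, func = live.pop(ev.ptr)
            current -= size
            per_func[func] -= size
        high_water = max(high_water, current)
    return high_water, dict(per_func)


trace = [
    AllocEvent(0.0, "malloc", 0x10, 100, "setup"),
    AllocEvent(1.0, "malloc", 0x20, 50, "solve"),
    AllocEvent(2.0, "free", 0x10),
    AllocEvent(3.0, "malloc", 0x30, 80, "solve"),
]
hwm, by_func = analyse(trace)
print(hwm, by_func)  # 150 {'setup': 0, 'solve': 130}
```

    Running the same pass once per MPI rank and comparing the resulting high-water marks is one simple way to expose the kind of cross-process allocation imbalance the abstract reports.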

    New low-mass members of the Octans stellar association and an updated 30-40 Myr lithium age

    The Octans association is one of several young stellar moving groups recently discovered in the Solar neighbourhood, and hence a valuable laboratory for studies of stellar, circumstellar disc and planetary evolution. However, a lack of low-mass members or any members with trigonometric parallaxes means the age, distance and space motion of the group are poorly constrained. To better determine its membership and age, we present the first spectroscopic survey for new K and M-type Octans members, resulting in the discovery of 29 UV-bright K5-M4 stars with kinematics, photometry and distances consistent with existing members. Nine new members possess strong Li I absorption, which allows us to estimate a lithium age of 30-40 Myr, similar to that of the Tucana-Horologium association and bracketed by the firm lithium depletion boundary ages of the Beta Pictoris (20 Myr) and Argus/IC 2391 (50 Myr) associations. Several stars also show hints in our medium-resolution spectra of fast rotation or spectroscopic binarity. More so than other nearby associations, Octans is much larger than its age and internal velocity dispersion imply. It may be the dispersing remnant of a sparse, extended structure which includes some younger members of the foreground Octans-Near association recently proposed by Zuckerman and collaborators. Comment: Accepted for publication in MNRAS (16 pages, 5 tables).

    Developing performance-portable molecular dynamics kernels in OpenCL

    This paper investigates the development of a molecular dynamics code that is highly portable between architectures. Using OpenCL, we develop an implementation of Sandia's miniMD benchmark that achieves good levels of performance across a wide range of hardware: CPUs, discrete GPUs and integrated GPUs. We demonstrate that the performance bottlenecks of miniMD's short-range force calculation kernel are the same across these architectures, and detail a number of platform-agnostic optimisations that improve its performance by at least 2× on all hardware considered. Our complete code is shown to be 1.7× faster than the original miniMD, and at most 2× slower than implementations individually hand-tuned for a specific architecture.
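    For readers unfamiliar with the short-range force kernel, the plain-Python sketch below evaluates a truncated Lennard-Jones pair force, U(r) = 4ε[(σ/r)^12 - (σ/r)^6], over per-atom neighbour lists. It assumes the Lennard-Jones potential used by miniMD's default force calculation and a full neighbour list in which each pair appears in both atoms' lists, so the outer loop body (one "work-item" per atom) writes only its own force entry and needs no atomics. The function name and argument layout are hypothetical; this is not the paper's OpenCL kernel.

```python
def lj_forces(pos, neighbours, cutoff, epsilon=1.0, sigma=1.0):
    """Truncated Lennard-Jones forces using a full neighbour list.

    pos[i] is an (x, y, z) tuple; neighbours[i] lists every j near atom i,
    so each pair (i, j) appears twice, once in each atom's list.
    """
    cutsq = cutoff * cutoff
    forces = []
    for i, (xi, yi, zi) in enumerate(pos):  # one independent "work-item" per atom
        fx = fy = fz = 0.0
        for j in neighbours[i]:
            xj, yj, zj = pos[j]
            dx, dy, dz = xi - xj, yi - yj, zi - zj
            rsq = dx * dx + dy * dy + dz * dz
            if rsq < cutsq:
                sr2 = sigma * sigma / rsq
                sr6 = sr2 * sr2 * sr2
                # F(r)/r with F = -dU/dr, so multiplying by (dx, dy, dz)
                # gives the force vector acting on atom i.
                fpair = 48.0 * epsilon * sr6 * (sr6 - 0.5) / rsq
                fx += dx * fpair
                fy += dy * fpair
                fz += dz * fpair
        forces.append((fx, fy, fz))  # only atom i's entry is written
    return forces


pos = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
print(lj_forces(pos, [[1], [0]], cutoff=2.5))
# [(-24.0, 0.0, 0.0), (24.0, 0.0, 0.0)]
```

    A half neighbour list would halve the pair evaluations but make two work-items write to the same atom's force, which is why the full-list form is the usual starting point for GPU ports.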