1,772 research outputs found

    Developing performance-portable molecular dynamics kernels in OpenCL

    This paper investigates the development of a molecular dynamics code that is highly portable between architectures. Using OpenCL, we develop an implementation of Sandia’s miniMD benchmark that achieves good levels of performance across a wide range of hardware: CPUs, discrete GPUs and integrated GPUs. We demonstrate that the performance bottlenecks of miniMD’s short-range force calculation kernel are the same across these architectures, and detail a number of platform-agnostic optimisations that improve its performance by at least 2x on all hardware considered. Our complete code is shown to be 1.7x faster than the original miniMD, and at most 2x slower than implementations individually hand-tuned for a specific architecture.
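
    A typical short-range force kernel of the kind the abstract refers to computes a cut-off Lennard-Jones interaction over per-atom neighbour lists. The plain-C sketch below is an illustrative reconstruction of that computation only; the function name, parameters and data layout are assumptions rather than miniMD's actual kernel. It is this loop structure that an OpenCL port maps onto work-items and then optimises.

    ```c
    /* Illustrative cut-off Lennard-Jones force loop over neighbour lists
     * (a sketch of the computation, not miniMD's actual kernel).        */
    #include <stddef.h>

    void lj_forces(size_t n_atoms,
                   const double (*pos)[3],   /* atom positions              */
                   double (*force)[3],       /* accumulated forces (output) */
                   const int *neigh,         /* flattened neighbour lists   */
                   const int *num_neigh,     /* neighbour count per atom    */
                   int max_neigh,
                   double cutoff_sq,         /* squared cut-off radius      */
                   double epsilon, double sigma)
    {
        double sigma6 = sigma * sigma * sigma;
        sigma6 *= sigma6;                                /* sigma^6 */
        for (size_t i = 0; i < n_atoms; ++i) {
            double fx = 0.0, fy = 0.0, fz = 0.0;
            for (int k = 0; k < num_neigh[i]; ++k) {
                int j = neigh[i * (size_t)max_neigh + k];
                double dx = pos[i][0] - pos[j][0];
                double dy = pos[i][1] - pos[j][1];
                double dz = pos[i][2] - pos[j][2];
                double r2 = dx * dx + dy * dy + dz * dz;
                if (r2 < cutoff_sq) {                    /* short-range cut-off */
                    double inv_r2 = 1.0 / r2;
                    double sr6 = sigma6 * inv_r2 * inv_r2 * inv_r2;  /* (sigma/r)^6 */
                    double f = 48.0 * epsilon * sr6 * (sr6 - 0.5) * inv_r2;
                    fx += f * dx; fy += f * dy; fz += f * dz;
                }
            }
            force[i][0] += fx; force[i][1] += fy; force[i][2] += fz;
        }
    }
    ```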

    The Formation and Stability of a Microbial Community

    New communities form regularly in nature, as many species rush to colonise a freshly formed island, pool, or microbiome, but it is unclear what rules govern the arrangement of these founders into a smaller, stable community, or whether the process is predictable. I simultaneously inoculated a master mix of bacterial colonisers into 45 identical environments, and allowed them to compete and evolve for around three months. By the end of the experiment, the species compositions of these communities had split into two broad groups, defined mostly by the mutual exclusivity of two Pseudomonas species, which may reflect ecological equivalence between the two species. Due to this functional similarity, I propose that community formation may be predictable at an ecological level, if not a taxonomic level. I also explored one of the communities formed in this experiment in further detail, investigating the maintenance of its diversity and stability. The community was fairly stable, as every species was able to persist even when it began at a much lower population size than its competitors, and no diversity was lost after 4 weeks of culture. I grew the species from this community in monoculture, as well as in every possible pair, triplet, and quartet, to fully assess the network of interactions, and found evidence for many significant higher-order interactions, which have been shown to have a stabilising effect in theoretical models.

    Impurity Lattice and Sublattice Location by Electron Channeling

    A new formulation is presented for the use of crystallographic orientation effects in electron scattering to determine impurity lattice location. The development of electron channeling techniques is reviewed and compared to high energy ion channeling and to the Borrmann effect in x-ray diffraction. The advantages of axial over planar geometry are discussed. Delocalization effects are more serious for quantitative analysis than has generally been believed. The new formulation applies to any crystal lattice and quantitatively includes delocalization effects via c-factors, which have been experimentally determined for diamond structure semiconductors. For sublattice site location this formulation removes the two major approximations of the original ALCHEMI formulation, namely that all the inner shell excitations are perfectly localized, and that all of the impurity atoms occupy distinct crystallographic sites. As an example, we study the location of small perfectly coherent Sb precipitates within the Si lattice.

    WMTrace: a lightweight memory allocation tracker and analysis framework

    The diverging gap between processor and memory performance has been a well-discussed aspect of computer architecture literature for some years. The use of multi-core processor designs has, however, brought new problems to the design of memory architectures - increased core density without matched improvement in memory capacity is reducing the available memory per parallel process. Multiple cores accessing memory simultaneously degrades performance as a result of resource contention for memory channels and physical DIMMs. These issues combine to ensure that memory remains an ongoing challenge in the design of parallel algorithms that scale. In this paper we present WMTrace, a lightweight tool to trace and analyse memory allocation events in parallel applications. This tool is able to dynamically link to pre-existing application binaries, requiring no source code modification or recompilation. A post-execution analysis stage enables in-depth analysis of traces to be performed, allowing memory allocations to be analysed by time, size or function. The second half of this paper features a case study in which we apply WMTrace to five parallel scientific applications and benchmarks, demonstrating its effectiveness at recording high-water-mark memory consumption as well as per-function memory use over time. An in-depth analysis is provided for an unstructured mesh benchmark, which reveals significant memory allocation imbalance across its participating processes.
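
    The abstract states that WMTrace attaches to unmodified binaries at run time; on Linux, one common mechanism for this style of tracing is an LD_PRELOAD shim that interposes on the allocator. The C sketch below illustrates that general technique only and is not WMTrace's actual implementation.

    ```c
    /* Illustrative LD_PRELOAD allocation tracer (not WMTrace's own code).
     * Build:  gcc -shared -fPIC -o tracer.so tracer.c -ldl
     * Run:    LD_PRELOAD=./tracer.so ./application                        */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <stdio.h>
    #include <stdlib.h>

    static void *(*real_malloc)(size_t) = NULL;
    static void  (*real_free)(void *)   = NULL;
    static __thread int in_hook = 0;          /* avoid re-entrant logging */

    static void init_hooks(void)
    {
        real_malloc = (void *(*)(size_t))dlsym(RTLD_NEXT, "malloc");
        real_free   = (void (*)(void *))dlsym(RTLD_NEXT, "free");
    }

    void *malloc(size_t size)
    {
        if (!real_malloc) init_hooks();
        void *p = real_malloc(size);
        if (!in_hook) {                       /* record size and address of each allocation */
            in_hook = 1;
            fprintf(stderr, "alloc %zu bytes at %p\n", size, p);
            in_hook = 0;
        }
        return p;
    }

    void free(void *ptr)
    {
        if (!real_free) init_hooks();
        if (!in_hook && ptr) {
            in_hook = 1;
            fprintf(stderr, "free  %p\n", ptr);
            in_hook = 0;
        }
        real_free(ptr);
    }
    ```

    A production tool of the kind the paper describes would write compact per-process trace records (with timestamps) for later post-execution analysis rather than printing to stderr, but the interposition idea is the same.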

    Evaluating the performance of legacy applications on emerging parallel architectures

    The gap between a supercomputer's theoretical maximum (“peak”) floating-point performance and that actually achieved by applications has grown wider over time. Today, a typical scientific application achieves only 5-20% of any given machine's peak processing capability, and this gap leaves room for significant improvements in execution times. This problem is most pronounced for modern “accelerator” architectures: collections of hundreds of simple, low-clocked cores capable of executing the same instruction on dozens of pieces of data simultaneously. This is a significant change from the low number of high-clocked cores found in traditional CPUs, and effective utilisation of accelerators typically requires extensive code and algorithmic changes. In many cases, the best way in which to map a parallel workload to these new architectures is unclear. The principal focus of the work presented in this thesis is the evaluation of emerging parallel architectures (specifically, modern CPUs, GPUs and Intel MIC) for two benchmark codes, the LU benchmark from the NAS Parallel Benchmark Suite and Sandia's miniMD benchmark, which exhibit complex parallel behaviours that are representative of many scientific applications. Using combinations of low-level intrinsic functions, OpenMP, CUDA and MPI, we demonstrate performance improvements of up to 7x for these workloads. We also detail a code development methodology that permits application developers to target multiple architecture types without maintaining completely separate implementations for each platform. Using OpenCL, we develop performance-portable implementations of the LU and miniMD benchmarks that are faster than the original codes, and at most 2x slower than versions highly tuned for particular hardware. Finally, we demonstrate the importance of evaluating architectures at scale (as opposed to on single nodes) through performance modelling techniques, highlighting the problems associated with strong scaling on emerging accelerator architectures.

    Parallelising wavefront applications on general-purpose GPU devices

    Pipelined wavefront applications form a large portion of the high performance scientific computing workloads at supercomputing centres. This paper investigates the viability of graphics processing units (GPUs) for the acceleration of these codes, using NVIDIA's Compute Unified Device Architecture (CUDA). We identify the optimisations suitable for this new architecture and quantify the characteristics of those wavefront codes that are likely to experience speedups.
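
    The defining feature of this class of code is that each grid cell needs results from its already-swept neighbours, so only cells lying on the same diagonal (hyperplane) are independent. The generic C sketch below (not taken from the paper) shows the diagonal-ordered loop structure; on a GPU the inner loop is what gets mapped onto CUDA threads, which is why the available parallelism grows and shrinks as the sweep progresses.

    ```c
    /* Generic 2D wavefront sweep: cell (i,j) depends on (i-1,j) and (i,j-1).
     * Iterating over diagonals d = i + j makes all cells on one diagonal
     * independent of each other.  Illustrative only; not the paper's code. */
    void wavefront_sweep(int nx, int ny, double *grid /* nx * ny, row-major */)
    {
        for (int d = 2; d <= nx + ny - 2; ++d) {              /* sequential over diagonals */
            int i_lo = (d - (ny - 1) > 1) ? d - (ny - 1) : 1; /* clip diagonal to the grid */
            int i_hi = (d - 1 < nx - 1) ? d - 1 : nx - 1;
            for (int i = i_lo; i <= i_hi; ++i) {              /* independent: parallelise here */
                int j = d - i;                                /* 1 <= j <= ny-1 by construction */
                grid[i * ny + j] = 0.5 * (grid[(i - 1) * ny + j] + grid[i * ny + (j - 1)]);
            }
        }
    }
    ```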

    Experiences with porting and modelling wavefront algorithms on many-core architectures

    We are currently investigating the viability of many-core architectures for the acceleration of wavefront applications and this report focuses on graphics processing units (GPUs) in particular. To this end, we have implemented NASA’s LU benchmark – a real-world production-grade application – on GPUs employing NVIDIA’s Compute Unified Device Architecture (CUDA). This GPU implementation of the benchmark has been used to investigate the performance of a selection of GPUs, ranging from workstation-grade commodity GPUs to the HPC “Tesla” and “Fermi” GPUs. We have also compared the performance of the GPU solution at scale to that of traditional high performance computing (HPC) clusters based on a range of multi-core CPUs from a number of major vendors, including Intel (Nehalem), AMD (Opteron) and IBM (PowerPC). In previous work we have developed a predictive “plug-and-play” performance model of this class of application running on such clusters, in which CPUs communicate via the Message Passing Interface (MPI). By extending this model to also capture the performance behaviour of GPUs, we are able to: (1) comment on the effects that architectural changes will have on the performance of single-GPU solutions, and (2) make projections regarding the performance of multi-GPU solutions at larger scale.
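
    The abstract does not reproduce the model itself, but analytic models of pipelined wavefront codes generally share one structure: on a grid of P_x by P_y processors the sweep must fill (and later drain) a pipeline, so the runtime is roughly the number of wavefront steps multiplied by a per-step compute-plus-communication cost. The expression below is a generic sketch of that structure, stated as an assumption; it is not the authors' specific “plug-and-play” model.

    ```latex
    % Generic pipelined-wavefront cost sketch (illustrative; not the paper's exact model).
    % P_x, P_y : processor grid dimensions; n_k : number of blocks the local domain is
    % swept in; T_comp, T_comm : per-block compute and boundary-exchange costs.
    T_{\text{sweep}} \;\approx\; \left( P_x + P_y - 2 + n_k \right)
                                 \left( T_{\text{comp}} + T_{\text{comm}} \right)
    ```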

    Supercurrent through grain boundaries in the presence of strong correlations

    Strong correlations are known to severely reduce the mobility of charge carriers near half-filling and thus have an important influence on the current-carrying properties of grain boundaries in the high-$T_c$ cuprates. In this work we present an extension of the Gutzwiller projection approach to treat electronic correlations below as well as above half-filling consistently. We apply this method to investigate the critical current through grain boundaries with a wide range of misalignment angles for electron- and hole-doped systems. For the latter, excellent agreement with experimental data is found. We further provide a detailed comparison to an analogous weak-coupling evaluation. Comment: 4 pages, 3 figures.
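
    As background (standard results from the Gutzwiller approximation literature, not details taken from this paper): the projection that suppresses double occupancy is commonly replaced by doping-dependent renormalisation factors for the kinetic and spin-exchange terms, which for hole doping δ (the deviation from half-filling) are usually written as below. The paper's contribution is an extension that treats fillings below and above half-filling on the same footing.

    ```latex
    % Standard Gutzwiller renormalisation factors for hole doping \delta
    % (background only; the paper generalises beyond this hole-doped form).
    g_t = \frac{2\delta}{1+\delta}, \qquad g_s = \frac{4}{(1+\delta)^2}
    ```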