61 research outputs found

Density Functional Theory calculation on many-cores hybrid CPU-GPU architectures

Author: Alexey Neelov
Jean-François Méhaut
Luigi Genovese
Matthieu Ospici
Stefan Goedecker
Thierry Deutsch
Publication date: 01/01/2009

The implementation of a full electronic structure calculation code on a hybrid parallel architecture with Graphics Processing Units (GPUs) is presented. Our implementation builds on a GNU-GPL code based on Daubechies wavelets. It shows very good performance, systematic convergence properties and excellent efficiency on parallel computers. Our GPU-based acceleration fully preserves all these properties. In particular, the code is able to run on many cores which may or may not have a GPU associated. It is thus able to run on parallel and massively parallel hybrid environments, including those with a non-homogeneous CPU/GPU ratio. With double precision calculations, we may achieve considerable speedup, between a factor of 20 for some operations and a factor of 6 for the whole DFT code. Comment: 14 pages, 8 figures

arXiv.org e-Print Archive

CiteSeerX

Crossref

Hal - Université Grenoble Alpes

INRIA a CCSD electronic archive server

edoc

HAL-CEA

Automated parallel application creation and execution tool for clusters

Author: McAvaney Christopher
Publication venue: Deakin University, Faculty of Science and Technology, School of Information Technology
Publication date: 01/01/2003

This research investigated an automated approach to rewriting traditional sequential computer programs as parallel programs for networked computers. A tool was designed and developed for generating parallel programs automatically and for executing these parallel programs on a network of computers. Performance is maximized by utilising all idle resources.

Deakin Research Online

JIT-based cost models for adaptive parallelism

Author: Morton John Magnus
Publication date: 01/01/2018

Parallel programming is extremely challenging. Worse yet, parallel architectures evolve quickly, and parallel programs must often be refactored for each new architecture. It is highly desirable to provide performance portability, so programs developed on one architecture can deliver good performance on other architectures. This thesis is part of the AJITPar project that investigates a novel approach for achieving performance portability by the development of suitable cost models to inform scheduling decisions with dynamic information about computational and communication costs on the target architecture. The main artifact of the AJITPar project is the Adaptive Skeleton Library (ASL) that provides a distributed-memory master-worker implementation of a set of Algorithmic Skeletons, i.e. programming patterns that abstract away the low-level intricacies of parallelism. After JIT warm-up, ASL uses a computational cost model applied to JIT trace information from the Pycket compiler, a tracing JIT implementation of the Racket language, to transform the skeletons. The execution time of an ASL task is primarily determined by computation and communication costs. The Pycket compiler is extended to enable runtime access to JIT traces, both the sequences of instructions and frequency of execution. Crucially for dynamic adaptation, these are obtained with minimal overhead. A low-cost, dynamic computation cost model for estimating the runtime of JIT-compiled Pycket programs, Γ, is developed and validated. This is believed to be the first such model. The design explores the challenges of estimating execution time from JIT trace instructions and presents three increasingly sophisticated cost models. The cost model predicts execution time based on the PyPy JIT instructions present in compiled JIT traces.
The final abstract cost model applies weightings for 5 different classes of trace instructions and also proposes a method for aggregating the cost models for single traces into a cost model for an entire program. Execution time is measured, and traces generated are recorded, from a suite of 41 benchmarks. Linear regression is used to determine the weightings for the abstract cost model from this data. The final cost model reveals that allocation operations count most for execution time, followed by guards and numeric operations. The suitability of Γ for predicting the effect of ASL program transformations is investigated. The real utility of Γ is not in absolute predictions of execution times for different programs, but in predicting the effects of applying program transformations on parallel programs. A linear relationship between the actual computational cost for a task and that predicted by Γ is demonstrated for five benchmarks on two architectures. A series of increasingly accurate low-cost, dynamic cost models for estimating the communication costs of ASL programs, K, is developed and validated. Predicting the optimum task size in ASL relies not only on computational cost predictions, but also on predictions of the overhead of communicating tasks to worker nodes and results back to the master. The design and iterative development of a cost model which predicts the serialisation, deserialisation, and network send times of spawning a task in ASL is presented. Linear regression of communication timings is used to determine the appropriate weighting parameters for each. K is shown to be valid for predicting other, arbitrary data structures by demonstrating an additive property of the model. The model K is validated by showing a linear relationship between the combined predicted costs of the simple types in instances of aggregated data structures and the measured communication time. This validation is performed on five benchmarks on two platforms.
Finally, a low-cost dynamic cost model, T, that predicts a good ASL task size by combining information from the computation and communication cost models (Γ and K) is developed and validated. The key insight in the design of this model is to balance the communication cost on the master node with the computational and communication costs on the worker nodes. The predictive power of T is tested using six benchmarks, and it is shown to predict the optimal task size more accurately, reducing total program runtimes when compared with the default ASL prototype.
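
The regression-fitted cost model described in this abstract can be illustrated with a small sketch. The one below fits one weight per instruction class by ordinary least squares and then predicts the cost of a trace as a weighted sum of its instruction counts. Everything concrete here is an assumption made for illustration: the class names, the counts, the "true" weights used to generate synthetic timings, and the function names `solve`, `fit_weights` and `predict_cost`. None of it is taken from the thesis, and the thesis's actual model Γ may be structured differently.

```python
# Illustrative sketch only: class names, counts and weights are invented.
CLASSES = ["alloc", "guard", "numeric", "array", "other"]


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_weights(counts, times):
    """Least-squares weights from the normal equations (X^T X) w = X^T y."""
    k = len(counts[0])
    xtx = [[sum(row[i] * row[j] for row in counts) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * t for row, t in zip(counts, times)) for i in range(k)]
    return solve(xtx, xty)


def predict_cost(weights, count_row):
    """Predicted cost of one trace: weighted sum of its instruction counts."""
    return sum(w * c for w, c in zip(weights, count_row))


# Hypothetical per-benchmark instruction counts, one column per class.
COUNTS = [
    [50, 5, 10, 3, 2],
    [4, 60, 8, 5, 6],
    [6, 9, 90, 7, 3],
    [2, 4, 5, 40, 8],
    [3, 6, 7, 4, 70],
    [10, 12, 20, 8, 9],
]
# Synthetic "measured" times generated from known weights, so the fit
# can be checked against them.
TRUE_WEIGHTS = [5.0, 2.0, 1.0, 1.5, 0.5]
TIMES = [predict_cost(TRUE_WEIGHTS, row) for row in COUNTS]

weights = fit_weights(COUNTS, TIMES)
for name, w in zip(CLASSES, weights):
    print(f"{name}: {w:.3f}")
```

With real measurements the timings would contain noise, so the fitted weights would only approximate the underlying costs, and a numerically more robust solver than the normal equations would usually be preferable.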

Glasgow Theses Service

Work-stealing prefix scan: Addressing load imbalance in large-scale image registration

Author: Berkels Benjamin
Bientinesi Paolo
Copik Marcin
Grosser Tobias
Hoefler Torsten
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2022

Parallelism patterns (e.g., map or reduce) have proven to be effective tools for parallelizing high-performance applications. In this article, we study the recursive registration of a series of electron microscopy images, a time-consuming and imbalanced computation necessary for nano-scale microscopy analysis. We show that by translating the image registration into a specific instance of the prefix scan, we can convert this seemingly sequential problem into a parallel computation that scales to over a thousand cores. We analyze a variety of scan algorithms that behave similarly for common low-compute operators and propose a novel work-stealing procedure for a hierarchical prefix scan. Our evaluation shows that by identifying a suitable and well-optimized prefix scan algorithm, we reduce time-to-solution on a series of 4,096 images spanning ten seconds of microscopy acquisition from over 10 hours to less than 3 minutes (using 1024 Intel Haswell cores), enabling derivation of material properties at nanoscale for long microscopy image series. ISSN:1045-9219 ISSN:1558-2183 ISSN:2161-988
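
To make the idea of recasting a chained computation as a prefix scan concrete, here is a minimal sketch of a two-level (blocked) inclusive scan over an associative but non-commutative operator. It is not the work-stealing algorithm of the article: the operator is a stand-in (composition of 1-D affine maps x -> a*x + b instead of image registrations), the blocks are processed in an ordinary loop rather than in parallel, and the names `compose`, `sequential_scan` and `blocked_scan` are hypothetical.

```python
def compose(f, g):
    """Compose affine maps given as (a, b): apply f first, then g."""
    fa, fb = f
    ga, gb = g
    return (ga * fa, ga * fb + gb)


def sequential_scan(items, op):
    """Inclusive prefix scan: out[i] = items[0] op items[1] op ... op items[i]."""
    out = []
    acc = None
    for x in items:
        acc = x if acc is None else op(acc, x)
        out.append(acc)
    return out


def blocked_scan(items, op, block_size):
    """Two-level scan; phases 1 and 3 are independent per block."""
    blocks = [items[i:i + block_size] for i in range(0, len(items), block_size)]
    # Phase 1: local scan inside each block (parallelisable across blocks).
    local = [sequential_scan(block, op) for block in blocks]
    # Phase 2: scan over the block totals (short, sequential here).
    totals = sequential_scan([scanned[-1] for scanned in local], op)
    # Phase 3: prepend the combined total of all preceding blocks.
    result = []
    for i, scanned in enumerate(local):
        if i == 0:
            result.extend(scanned)
        else:
            result.extend(op(totals[i - 1], value) for value in scanned)
    return result


# Hypothetical chain of transformations, one per step in the series.
steps = [(2, 1), (1, 3), (3, -1), (1, 1), (2, 0), (1, -2), (2, 5)]
print(blocked_scan(steps, compose, block_size=3))
```

Associativity of the operator is what makes the blocked version agree with the sequential one; commutativity is not required, which matters because composing transformations is order-dependent.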

Repository for Publications and Research Data

Edinburgh Research Explorer

Publikationsserver der RWTH Aachen University

Guided Automatic Binary Parallelisation

Author: ZHOU RUOYU
Publication venue: University of Cambridge
Publication date: 06/04/2018

For decades, the software industry has amassed a vast repository of pre-compiled libraries and executables which are still valuable and actively in use. However, for a significant fraction of these binaries, most of the source code is absent or is written in old languages, making it practically impossible to recompile them for new generations of hardware. As the number of cores in chip multi-processors (CMPs) continues to scale, the performance of this legacy software becomes increasingly sub-optimal. Rewriting new optimised and parallel software would be a time-consuming and expensive task. Without source code, existing automatic performance enhancing and parallelisation techniques are not applicable for legacy software or parts of new applications linked with legacy libraries. In this dissertation, three tools are presented to address the challenge of optimising legacy binaries. The first, GBR (Guided Binary Recompilation), is a tool that recompiles stripped application binaries without the need for the source code or relocation information. GBR performs static binary analysis to determine how recompilation should be undertaken, and produces a domain-specific hint program. This hint program is loaded and interpreted by the GBR dynamic runtime, which is built on top of the open-source dynamic binary translator, DynamoRIO. In this manner, complicated recompilation of the target binary is carried out to achieve optimised execution on a real system. The problem of limited dataflow and type information is addressed through cooperation between the hint program and JIT optimisation. The utility of GBR is demonstrated by software prefetch and vectorisation optimisations that achieve performance improvements compared to the original native execution. The second tool is called BEEP (Binary Emulator for Estimating Parallelism), an extension to GBR for binary instrumentation.
BEEP is used to identify potential thread-level parallelism through static binary analysis and binary instrumentation. BEEP performs preliminary static analysis on binaries and encodes all statically-undecided questions into a hint program. The hint program is interpreted by GBR so that on-demand binary instrumentation code is inserted to answer the questions from runtime information. BEEP incorporates a few parallel cost models to evaluate identified parallelism under different parallelisation paradigms. The third tool is named GABP (Guided Automatic Binary Parallelisation), an extension to GBR for parallelisation. GABP focuses on loops from sequential application binaries and automatically extracts thread-level parallelism from them on-the-fly, under the direction of the hint program, for efficient parallel execution. It employs a range of runtime schemes, such as thread-level speculation and synchronisation, to handle runtime data dependences. GABP achieves a geometric mean speedup of 1.91x on binaries from SPEC CPU2006 on a real x86-64 eight-core system compared to native sequential execution. Performance is obtained for SPEC CPU2006 executables compiled from a variety of source languages and by different compilers. St John's Benefactor Scholarship; ARM Sponsorship

Apollo (Cambridge)

Parallel computing 2011, ParCo 2011: book of abstracts

Author: D'Hollander Erik
Publication venue: Academia Press Scientific Publishers
Publication date: 01/01/2011

This book contains the abstracts of the presentations at the conference Parallel Computing 2011, 30 August - 2 September 2011, Ghent, Belgium.

Ghent University Academic Bibliography

Using program behaviour to exploit heterogeneous multi-core processors

Author: McIlroy Ross
Publication date: 01/01/2010

Multi-core CPU architectures have become prevalent in recent years. A number of multi-core CPUs consist of not only multiple processing cores, but multiple different types of processing cores, each with different capabilities and specialisations. These heterogeneous multi-core architectures (HMAs) can deliver exceptional performance; however, they are notoriously difficult to program effectively. This dissertation investigates the feasibility of ameliorating many of the difficulties encountered in application development on HMA processors, by employing a behaviour aware runtime system. This runtime system provides applications with the illusion of executing on a homogeneous architecture, by presenting a homogeneous virtual machine interface. The runtime system uses knowledge of a program's execution behaviour, gained through explicit code annotations, static analysis or runtime monitoring, to inform its resource allocation and scheduling decisions, such that the application makes best use of the HMA's heterogeneous processing cores. The goal of this runtime system is to enable non-specialist application developers to write applications that can exploit an HMA, without the developer requiring in-depth knowledge of the HMA's design. This dissertation describes the development of a Java runtime system, called Hera-JVM, aimed at investigating this premise. Hera-JVM supports the execution of unmodified Java applications on both processing core types of the heterogeneous IBM Cell processor. An application's threads of execution can be transparently migrated between the Cell's different core types by Hera-JVM, without requiring the application's involvement. A number of real-world Java benchmarks are executed across both of the Cell's core types, to evaluate the efficacy of abstracting a heterogeneous architecture behind a homogeneous virtual machine. 
By characterising the performance of each of the Cell processor's core types under different program behaviours, a set of influential program behaviour characteristics is uncovered. A set of code annotations is presented, which enable program code to be tagged with these behaviour characteristics, enabling a runtime system to track a program's behaviour throughout its execution. This information is fed into a cost function, which Hera-JVM uses to automatically estimate whether the executing program's threads of execution would benefit from being migrated to a different core type, given their current behaviour characteristics. The use of history, hysteresis and trend tracking by this cost function is explored as a means of increasing its stability and limiting detrimental thread migrations. The effectiveness of a number of different migration strategies is also investigated under real-world Java benchmarks, with the most effective found to be a strategy that can target code, such that a thread is migrated whenever it executes this code. This dissertation also investigates the use of runtime monitoring to enable a runtime system to automatically infer a program's behaviour characteristics, without the need for explicit code annotations. A lightweight runtime behaviour monitoring system is developed, and its effectiveness at choosing the most appropriate core type on which to execute a set of real-world Java benchmarks is examined. Combining explicit behaviour characteristic annotations with those characteristics which are monitored at runtime is also explored. Finally, an initial investigation is performed into the use of behaviour characteristics to improve application performance under a different type of heterogeneous architecture, specifically, a non-uniform memory access (NUMA) architecture. Thread teams are proposed as a method of automatically clustering communicating threads onto the same NUMA node, thereby reducing data access overheads.
Evaluation of this approach shows that it is effective at improving application performance, if the application's threads can be partitioned across the available NUMA nodes of a system. The findings of this work demonstrate that a runtime system with a homogeneous virtual machine interface can reduce the challenge of application development for HMA processors, whilst still being able to exploit such a processor by taking program behaviour into account.
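
The role of history and hysteresis in stabilising a migration decision, as mentioned in this abstract, can be sketched as follows. This is not Hera-JVM's cost function: the class `MigrationPolicy`, the core labels, the behaviour score in [0, 1], the smoothing factor and both thresholds are hypothetical choices made purely for illustration. History is modelled as an exponentially weighted moving average, and hysteresis as two separate thresholds, so that a score hovering near the middle does not cause repeated migrations.

```python
class MigrationPolicy:
    """Decide which core type a thread should run on (illustrative only)."""

    def __init__(self, alpha=0.5, high=0.7, low=0.3):
        self.alpha = alpha      # weight of the newest observation
        self.high = high        # migrate to the specialised core above this
        self.low = low          # migrate back to the general core below this
        self.smoothed = 0.0     # history: moving average of past scores
        self.core = "general"   # hypothetical label for the current core type

    def update(self, score):
        """Feed one behaviour score in [0, 1]; return the chosen core type.

        A higher score means the thread's current behaviour is assumed to
        favour the specialised core type.
        """
        self.smoothed = self.alpha * score + (1 - self.alpha) * self.smoothed
        if self.core == "general" and self.smoothed > self.high:
            self.core = "specialised"
        elif self.core == "specialised" and self.smoothed < self.low:
            self.core = "general"
        return self.core


policy = MigrationPolicy()
# A score that oscillates around the midpoint, then a sustained change.
for score in [0.6, 0.4, 0.6, 0.4, 1.0, 0.4, 0.6, 0.0]:
    print(score, policy.update(score))
```

The gap between the two thresholds is the design choice that matters: widening it reduces the number of migrations at the price of reacting more slowly to a genuine change in behaviour.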

Glasgow Theses Service

CiteSeerX

OpenGrey Repository

An automated OpenCL FPGA compilation framework targeting a configurable, VLIW chip multiprocessor

Author: Samuel J. Parker
Publication date: 01/01/2015

Modern systems-on-chip augment their baseline CPU with coprocessors and accelerators to increase overall computational capacity and power efficiency, and thus have evolved into heterogeneous systems. Several languages have been developed to enable this paradigm shift, including CUDA and OpenCL. This thesis discusses a unified compilation environment to enable heterogeneous system design through the use of OpenCL and a customised VLIW chip multiprocessor (CMP) architecture, known as the LE1. An LLVM compilation framework was researched and a prototype developed to enable the execution of OpenCL applications on the LE1 CPU. The framework fully automates the compilation flow and supports work-item coalescing to better utilise the CPU cores and alleviate the effects of thread divergence. This thesis discusses in detail both the software stack and target hardware architecture and evaluates the scalability of the proposed framework on a highly precise cycle-accurate simulator. This is achieved through the execution of 12 benchmarks across 240 different machine configurations, as well as further results utilising an incomplete development branch of the compiler. It is shown that the problems generally scale well with the LE1 architecture up to eight cores, at which point the memory system becomes a serious bottleneck. Results demonstrate superlinear performance on certain benchmarks (x9 for the bitonic sort benchmark with 8 dual-issue cores) with further improvements from compiler optimisations (x14 for bitonic with the same configuration).

Loughborough University Institutional Repository