Search CORE

940 research outputs found

Optimistic Parallelism on GPUs

Author: E Ayguadé
J Nickolls
JE Stone
L Dagum
S Liu
S Wienke
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 05/03/2015
Field of study

Abstract. We present speculative parallelization techniques that can exploit parallelism in loops even in the presence of dynamic irregulari-ties that may give rise to cross-iteration dependences. The execution of a speculatively parallelized loop consists of five phases: scheduling, com-putation, misspeculation check, result committing, and misspeculation recovery. While the first two phases enable exploitation of data paral-lelism, the latter three phases represent overhead costs of using specu-lation. We perform misspeculation check on the GPU to minimize its cost. We perform result committing and misspeculation recovery on the CPU to reduce the result copying and recovery overhead. The scheduling policies are designed to reduce the misspeculation rate. Our program-ming model provides API for programmers to give hints about potential misspeculations to reduce their detection cost. Our experiments yielded speedups of 3.62x-13.76x on an nVidia Tesla C1060 hosted in an Intel(R) Xeon(R) E5540 machine.

CiteSeerX

Crossref

A Fast and Scalable Graph Coloring Algorithm for Multi-core and Many-core Architectures

Author: D Chakrabarti
H Cougny De
MM Strout
MR Garey
MT Jones
ÜV Çatalyürek
Publication venue
Publication date: 18/05/2015
Field of study

Irregular computations on unstructured data are an important class of problems for parallel programming. Graph coloring is often an important preprocessing step, e.g. as a way to perform dependency analysis for safe parallel execution. The total run time of a coloring algorithm adds to the overall parallel overhead of the application whereas the number of colors used determines the amount of exposed parallelism. A fast and scalable coloring algorithm using as few colors as possible is vital for the overall parallel performance and scalability of many irregular applications that depend upon runtime dependency analysis. Catalyurek et al. have proposed a graph coloring algorithm which relies on speculative, local assignment of colors. In this paper we present an improved version which runs even more optimistically with less thread synchronization and reduced number of conflicts compared to Catalyurek et al.'s algorithm. We show that the new technique scales better on multi-core and many-core systems and performs up to 1.5x faster than its predecessor on graphs with high-degree vertices, while keeping the number of colors at the same near-optimal levels.Comment: To appear in the proceedings of Euro Par 201

arXiv.org e-Print Archive

Crossref

Spiral - Imperial College Digital Repository

Optimising Simulation Data Structures for the Xeon Phi

Author: Chimeh Mozhgan K.
Cockshott Paul
Publication venue
Publication date: 01/01/2016
Field of study

In this paper, we propose a lock-free architecture to accelerate logic gate circuit simulation using SIMD multi-core machines. We evaluate its performance on different test circuits simulated on the Intel Xeon Phi and 2 other machines. Comparisons are presented of this software/hardware combination with reported performances of GPU and other multi-core simulation platforms. Comparisons are also given between the lock free architecture and a leading commercial simulator running on the same Intel hardware

Enlighten

Mixing multi-core CPUs and GPUs for scientific simulation software

Author: Hawick K.A.
Leist A.
Playne D.P.
Publication venue: 'Massey University'
Publication date: 01/01/2010
Field of study

Recent technological and economic developments have led to widespread availability of multi-core CPUs and specialist accelerator processors such as graphical processing units (GPUs). The accelerated computational performance possible from these devices can be very high for some applications paradigms. Software languages and systems such as NVIDIA's CUDA and Khronos consortium's open compute language (OpenCL) support a number of individual parallel application programming paradigms. To scale up the performance of some complex systems simulations, a hybrid of multi-core CPUs for coarse-grained parallelism and very many core GPUs for data parallelism is necessary. We describe our use of hybrid applica- tions using threading approaches and multi-core CPUs to control independent GPU devices. We present speed-up data and discuss multi-threading software issues for the applications level programmer and o er some suggested areas for language development and integration between coarse-grained and ne-grained multi-thread systems. We discuss results from three common simulation algorithmic areas including: partial di erential equations; graph cluster metric calculations and random number generation. We report on programming experiences and selected performance for these algorithms on: single and multiple GPUs; multi-core CPUs; a CellBE; and using OpenCL. We discuss programmer usability issues and the outlook and trends in multi-core programming for scienti c applications developers

Massey Research Online

Parallel Simulations for Analysing Portfolios of Catastrophic Event Risk

Author: Bahl Aman
Baltzer Oliver
Rau-Chaplin Andrew
Varghese Blesson
Publication venue
Publication date: 01/01/2012
Field of study

At the heart of the analytical pipeline of a modern quantitative insurance/reinsurance company is a stochastic simulation technique for portfolio risk analysis and pricing process referred to as Aggregate Analysis. Support for the computation of risk measures including Probable Maximum Loss (PML) and the Tail Value at Risk (TVAR) for a variety of types of complex property catastrophe insurance contracts including Cat eXcess of Loss (XL), or Per-Occurrence XL, and Aggregate XL, and contracts that combine these measures is obtained in Aggregate Analysis. In this paper, we explore parallel methods for aggregate risk analysis. A parallel aggregate risk analysis algorithm and an engine based on the algorithm is proposed. This engine is implemented in C and OpenMP for multi-core CPUs and in C and CUDA for many-core GPUs. Performance analysis of the algorithm indicates that GPUs offer an alternative HPC solution for aggregate risk analysis that is cost effective. The optimised algorithm on the GPU performs a 1 million trial aggregate simulation with 1000 catastrophic events per trial on a typical exposure set and contract structure in just over 20 seconds which is approximately 15x times faster than the sequential counterpart. This can sufficiently support the real-time pricing scenario in which an underwriter analyses different contractual terms and pricing while discussing a deal with a client over the phone.Comment: Proceedings of the Workshop at the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2012, 8 page

arXiv.org e-Print Archive

CiteSeerX

Crossref

HeTM: Transactional Memory for Heterogeneous Systems

Author: Castro Daniel
Ilic Aleksandar
Khan Amin M.
Romano Paolo
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 02/09/2019
Field of study

Modern heterogeneous computing architectures, which couple multi-core CPUs with discrete many-core GPUs (or other specialized hardware accelerators), enable unprecedented peak performance and energy efficiency levels. Unfortunately, though, developing applications that can take full advantage of the potential of heterogeneous systems is a notoriously hard task. This work takes a step towards reducing the complexity of programming heterogeneous systems by introducing the abstraction of Heterogeneous Transactional Memory (HeTM). HeTM provides programmers with the illusion of a single memory region, shared among the CPUs and the (discrete) GPU(s) of a heterogeneous system, with support for atomic transactions. Besides introducing the abstract semantics and programming model of HeTM, we present the design and evaluation of a concrete implementation of the proposed abstraction, which we named Speculative HeTM (SHeTM). SHeTM makes use of a novel design that leverages on speculative techniques and aims at hiding the inherently large communication latency between CPUs and discrete GPUs and at minimizing inter-device synchronization overhead. SHeTM is based on a modular and extensible design that allows for easily integrating alternative TM implementations on the CPU's and GPU's sides, which allows the flexibility to adopt, on either side, the TM implementation (e.g., in hardware or software) that best fits the applications' workload and the architectural characteristics of the processing unit. We demonstrate the efficiency of the SHeTM via an extensive quantitative study based both on synthetic benchmarks and on a porting of a popular object caching system.Comment: The current work was accepted in the 28th International Conference on Parallel Architectures and Compilation Techniques (PACT'19

arXiv.org e-Print Archive

Crossref