
    Astrophysical Supercomputing with GPUs: Critical Decisions for Early Adopters

    General purpose computing on graphics processing units (GPGPU) is dramatically changing the landscape of high performance computing in astronomy. In this paper, we identify and investigate several key decision areas, with the goal of simplifying the early adoption of GPGPU in astronomy. We consider the merits of OpenCL as an open standard that reduces the risks associated with coding in a native, vendor-specific programming environment, and present a GPU programming philosophy based on brute force solutions. We assert that effective use of new GPU-based supercomputing facilities will require a change in approach from astronomers. This will likely include improved programming training, an increased need for software development best practice through the use of profiling and related optimisation tools, and a greater reliance on third-party code libraries. As with any new technology, those willing to take the risks and make the investment of time and effort to become early adopters of GPGPU in astronomy stand to reap great benefits. Comment: 13 pages, 5 figures, accepted for publication in PAS
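
    As an illustration only (not taken from the paper), the brute-force philosophy mentioned above might translate into a minimal CUDA kernel like the following, in which each thread performs a direct O(N) sum for one particle, giving a simple, regular O(N^2) method overall; the kernel name, data layout and softened-potential example are hypothetical.

        // Hypothetical brute-force kernel: one thread per target particle,
        // direct summation over all N sources (O(N^2) in total).
        __global__ void pairwise_potential(const float3 *pos, const float *mass,
                                           float *phi, int n, float eps2)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i >= n) return;
            float3 pi = pos[i];
            float acc = 0.0f;
            for (int j = 0; j < n; ++j) {                       // simple, regular loop
                float dx = pos[j].x - pi.x;
                float dy = pos[j].y - pi.y;
                float dz = pos[j].z - pi.z;
                float r2 = dx * dx + dy * dy + dz * dz + eps2;  // softening avoids division by zero
                acc += mass[j] * rsqrtf(r2);
            }
            phi[i] = -acc;                                      // softened potential at particle i
        }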

    Speeding Up Particle Filter Algorithm for Tracking Multiple Targets Using CUDA Programming

    This thesis proposes a parallelization method to speed up the computational runtime of a particle filter algorithm for tracking multiple targets. CUDA programming is used to execute the original implementation of the particle filter algorithm on a GPU. The thesis provides a detailed discussion of the relevant background, followed by a presentation of the changes to the code architecture. The CUDA-based implementation is then illustrated and discussed, followed by a discussion and comparison of the results obtained from a series of tests. The thesis first introduces and describes the basic particle filter, with detailed illustrations of each step of the original implementation, which runs sequentially on the CPU. Background on parallel programming technologies, such as GPGPU and CUDA programming, is then provided. A new CUDA-based design of the particle filter algorithm is proposed to speed up the original CPU implementation, and the CUDA-based implementation is explained in detail. Finally, the thesis presents test results for both the CPU and CUDA implementations as a comparison. The experiments indicate that the CUDA implementation can achieve a speedup of up to 7.5x over the original implementation, and further results and comparisons confirm that it is significantly faster than the CPU version. Furthermore, the CUDA version still has considerable room for future optimization to improve its performance.
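
    As a sketch only (the thesis itself is the authoritative source), the weight-update step of a CUDA particle filter with one thread per particle might look like the following; the scalar state, the Gaussian measurement model and all names are assumptions made for illustration.

        // One thread per particle: weight particle i by a Gaussian likelihood
        // of the scalar measurement z given the particle's predicted state.
        __global__ void weight_update(const float *state,   // predicted particle states (n)
                                      float *weight,        // unnormalised weights (n)
                                      float z,              // current measurement
                                      float sigma,          // measurement noise std-dev
                                      int n)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i >= n) return;
            float innov = z - state[i];                      // innovation for particle i
            weight[i] = expf(-0.5f * innov * innov / (sigma * sigma));
        }

    Weight normalisation and resampling would typically follow on the device, for example with a parallel reduction and a prefix sum over the weights.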

    On Designing Multicore-Aware Simulators for Systems Biology Endowed with OnLine Statistics

    The paper discusses enabling methodologies for the design of a fully parallel, online, interactive tool to support bioinformatics scientists. In particular, the features of these methodologies, supported by the FastFlow parallel programming framework, are shown on a simulation tool that performs the modeling, tuning, and sensitivity analysis of stochastic biological models. A stochastic simulation needs thousands of independent simulation trajectories, which turn into big data that should be analysed with statistical and data mining tools. In the proposed approach the two stages are pipelined so that the simulation stage streams the partial results of all simulation trajectories to the analysis stage, which immediately produces a partial result. The simulation-analysis workflow is validated for performance and for the effectiveness of the online analysis in capturing the behavior of biological systems, on a multicore platform and on representative proof-of-concept biological systems. The exploited methodologies include pattern-based parallel programming and data streaming, which provide software designers with key features such as performance portability and efficient in-memory (big) data management and movement. Two paradigmatic classes of biological systems, exhibiting multistable and oscillatory behavior respectively, are used as a testbed.
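
    FastFlow supplies the pipeline pattern described above; as a rough host-side stand-in (plain standard-library threads rather than FastFlow's API), the streaming of simulation results into an online-statistics stage might be sketched as follows, with a toy random walk in place of the stochastic simulator. All names and the statistics computed are illustrative assumptions.

        #include <condition_variable>
        #include <cstdio>
        #include <mutex>
        #include <queue>
        #include <random>
        #include <thread>

        // Stage 1 streams one result per trajectory into a shared queue;
        // stage 2 consumes them and maintains Welford's online mean/variance.
        struct Sample { double value; bool last; };

        std::queue<Sample> channel;
        std::mutex m;
        std::condition_variable cv;

        void simulate(int trajectories, int steps) {              // producer stage
            std::mt19937 gen(42);
            std::normal_distribution<double> noise(0.0, 1.0);
            for (int t = 0; t < trajectories; ++t) {
                double x = 0.0;
                for (int s = 0; s < steps; ++s) x += noise(gen);  // toy trajectory
                std::lock_guard<std::mutex> lk(m);
                channel.push({x, t == trajectories - 1});         // stream partial result
                cv.notify_one();
            }
        }

        void analyse() {                                          // consumer stage
            long n = 0; double mean = 0.0, m2 = 0.0;              // online statistics
            for (;;) {
                std::unique_lock<std::mutex> lk(m);
                cv.wait(lk, [] { return !channel.empty(); });
                Sample s = channel.front(); channel.pop();
                lk.unlock();
                ++n;
                double d = s.value - mean;
                mean += d / n;
                m2 += d * (s.value - mean);
                if (s.last) break;
            }
            std::printf("n=%ld mean=%f var=%f\n", n, mean, n > 1 ? m2 / (n - 1) : 0.0);
        }

        int main() {
            std::thread producer(simulate, 10000, 100);
            std::thread consumer(analyse);
            producer.join();
            consumer.join();
            return 0;
        }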

    Parallel Sequential Monte Carlo for Efficient Density Combination: The DeCo MATLAB Toolbox

    This paper presents the Matlab package DeCo (Density Combination), which is based on the paper by Billio et al. (2013), where a constructive Bayesian approach is presented for combining predictive densities originating from different models or other sources of information. The combination weights are time-varying and may depend on past predictive forecasting performance and other learning mechanisms. The core algorithm is the function DeCo, which applies banks of parallel Sequential Monte Carlo algorithms to filter the time-varying combination weights. The DeCo procedure has been implemented both for standard CPU computing and for Graphics Processing Unit (GPU) parallel computing. For the GPU implementation we use the Matlab Parallel Computing Toolbox and show how to use general purpose GPU computing almost effortlessly. The GPU implementation achieves a speedup in execution time of up to seventy times compared to a standard Matlab implementation on a multicore CPU. We demonstrate the use of the package and the computational gain of the GPU version through simulation experiments and an empirical application.
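
    As a purely schematic illustration of the particle-weighting step inside such a bank of Sequential Monte Carlo filters (the toolbox itself is written in Matlab with the Parallel Computing Toolbox), a GPU kernel might combine the model predictive densities under each particle's weight vector as follows; the data layout and all names are assumptions.

        // One thread per particle: form the convex combination of the K model
        // predictive densities for the current observation under this particle's
        // weight vector, and store the particle's incremental log-weight.
        __global__ void combine_and_weight(const float *w,     // particle weights, n*K row-major
                                           const float *pred,  // model predictive densities, K
                                           float *logw,        // incremental log-weights, n
                                           int n, int K)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i >= n) return;
            float combined = 0.0f;
            for (int k = 0; k < K; ++k)
                combined += w[i * K + k] * pred[k];             // combined predictive density
            logw[i] = logf(combined);
        }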

    Acceleration of parasitic multistatic radar system using GPGPU

    This dissertation details the implementation of a PMR (Parasitic Multistatic Radar) signal processing chain on a GPGPU (General Purpose Graphics Processing Unit) platform. The primary objective of the project is to accelerate the signal processing chain without compromising the efficiency of the algorithms, and to show that GPGPUs are a promising platform for parasitic radar signal processing.

    Memory Subsystem Optimization Techniques for Modern High-Performance General-Purpose Processors

    General-purpose processors propel the advances and innovations that are the subject of humanity's many endeavors. Catering to this demand, chip-multiprocessors (CMPs) and general-purpose graphics processing units (GPGPUs) have seen many high-performance innovations in their architectures. With these advances, the memory subsystem has become the performance- and energy-limiting aspect of CMPs and GPGPUs alike. This dissertation identifies and mitigates the key performance and energy-efficiency bottlenecks in the memory subsystem of general-purpose processors via novel, practical, microarchitecture and system-architecture solutions. Addressing the important Last Level Cache (LLC) management problem in CMPs, I observe that LLC management decisions made in isolation, as in prior proposals, often lead to sub-optimal system performance. I demonstrate that in order to maximize system performance, it is essential to manage the LLCs while being cognizant of their interaction with the system main memory. I propose ReMAP, which reduces the net memory access cost by evicting cache lines that either have no reuse or have low memory access cost. ReMAP improves the performance of the CMP system by as much as 13%, and by an average of 6.5%. Rather than the LLC, the L1 data cache has a pronounced impact on GPGPU performance by acting as the bandwidth filter for the rest of the memory subsystem. Prior work has shown that the severely constrained data cache capacity in GPGPUs leads to sub-optimal performance. In this thesis, I propose two novel techniques that address the GPGPU data cache capacity problem. I propose ID-Cache, which performs effective cache bypassing and cache line size selection to improve cache capacity utilization. Next, I propose LATTE-CC, which considers the GPU's latency tolerance feature and adaptively compresses the data stored in the data cache, thereby increasing its effective capacity. ID-Cache and LATTE-CC are shown to achieve 71% and 19.2% speedup, respectively, over a wide variety of GPGPU applications. Complementing the aforementioned microarchitecture techniques, I identify the need for system-architecture innovations to sustain the performance scalability of GPGPUs in the face of a slowing Moore's Law. I propose a novel GPU architecture called the Multi-Chip-Module GPU (MCM-GPU) that integrates multiple GPU modules to form a single logical GPU. With intelligent memory subsystem optimizations tailored for MCM-GPUs, it can achieve within 7% of the performance of a similar but hypothetical monolithic-die GPU. Taking a step further, I present an in-depth study of the energy-efficiency characteristics of future MCM-GPUs. I demonstrate that the inherent non-uniform memory access side-effects form the key energy-efficiency bottleneck in the future. In summary, this thesis offers key insights into the performance and energy-efficiency bottlenecks in CMPs and GPGPUs, which can guide future architects towards developing high-performance and energy-efficient general-purpose processors. Doctoral Dissertation, Computer Science, 201

    q-State Potts model metastability study using optimized GPU-based Monte Carlo algorithms

    We implemented a GPU-based parallel code to perform Monte Carlo simulations of the two-dimensional q-state Potts model. The algorithm is based on a checkerboard update scheme and assigns independent random number generators to each thread. The implementation allows us to simulate systems of up to ~10^9 spins with an average time per spin flip of 0.147 ns on the fastest GPU card tested, representing a speedup of up to 155x compared with an optimized serial code running on a high-end CPU. The possibility of performing high-speed simulations at large enough system sizes allowed us to provide positive numerical evidence for the existence of metastability in very large systems, based on Binder's criterion, namely, on the existence or not of specific heat singularities at spinodal temperatures different from the transition temperature. Comment: 30 pages, 7 figures. Accepted in Computer Physics Communications. Code available at: http://www.famaf.unc.edu.ar/grupos/GPGPU/Potts/CUDAPotts.htm
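
    A minimal sketch of the checkerboard scheme described above, with one cuRAND generator per thread, might look like the following; the published code linked in the comment is the authoritative version, the names and layout here are illustrative, and the linear lattice size L is assumed even.

        #include <curand_kernel.h>

        // Metropolis update of one checkerboard sublattice of the q-state Potts
        // model; 'parity' selects which sublattice is updated in this call, and
        // each thread owns one site and one cuRAND state.
        __global__ void potts_checkerboard(int *spin, curandState *rng,
                                           int L, int q, float beta, int parity)
        {
            int idx = blockIdx.x * blockDim.x + threadIdx.x;    // index within the sublattice
            if (idx >= L * L / 2) return;

            int row = (2 * idx) / L;                            // map to lattice coordinates
            int col = (2 * idx) % L + ((row + parity) & 1);
            int site = row * L + col;

            curandState local = rng[idx];
            int s_old = spin[site];
            int s_new = (s_old + 1 + (int)(curand_uniform(&local) * (q - 1))) % q;  // propose a different state

            int up    = ((row + L - 1) % L) * L + col;          // periodic boundaries
            int down  = ((row + 1) % L) * L + col;
            int left  = row * L + (col + L - 1) % L;
            int right = row * L + (col + 1) % L;

            int e_old = (spin[up] == s_old) + (spin[down] == s_old) +
                        (spin[left] == s_old) + (spin[right] == s_old);
            int e_new = (spin[up] == s_new) + (spin[down] == s_new) +
                        (spin[left] == s_new) + (spin[right] == s_new);
            float dE = (float)(e_old - e_new);                  // energy change in units of J

            if (dE <= 0.0f || curand_uniform(&local) < expf(-beta * dE))
                spin[site] = s_new;
            rng[idx] = local;
        }

    Host code would initialise the rng states once with curand_init and alternate launches with parity 0 and 1 to complete a full lattice sweep.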

    Air pollution modelling using a graphics processing unit with CUDA

    The Graphics Processing Unit (GPU) is a powerful tool for parallel computing. In the past years the performance and capabilities of GPUs have increased, and the Compute Unified Device Architecture (CUDA) - a parallel computing architecture - has been developed by NVIDIA to harness this performance for general purpose computations. Here we show for the first time a possible application of GPUs to environmental studies serving as a basis for decision-making strategies. A stochastic Lagrangian particle model has been developed in CUDA to estimate the transport and transformation of radionuclides from a single point source during an accidental release. Our results show that the parallel implementation achieves typical acceleration values on the order of 80-120 times compared to a single-threaded CPU implementation on a 2.33 GHz desktop computer. Only very small differences were found between the results of the GPU and CPU simulations, and these are comparable with the effect of stochastic transport phenomena in the atmosphere. The relatively high speedup, with no additional cost to maintain this parallel architecture, could lead to wide use of GPUs for diverse environmental applications in the near future. Comment: 5 figures
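
    As an illustrative sketch only (the paper's transport and transformation model is more elaborate), a CUDA kernel advancing one Lagrangian particle per thread, with a mean-wind drift plus a Gaussian random displacement standing in for turbulent diffusion, might look like this; the names, the uniform wind and the simple diffusion term are assumptions.

        #include <curand_kernel.h>

        // One thread per particle: drift with the mean wind over time step dt and
        // add an isotropic Gaussian turbulent displacement, reflecting at the ground.
        __global__ void advect_particles(float3 *pos, curandState *rng,
                                         float3 wind, float sigma, float dt, int n)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i >= n) return;
            curandState local = rng[i];
            float s = sigma * sqrtf(dt);                    // std-dev of the random displacement
            pos[i].x += wind.x * dt + s * curand_normal(&local);
            pos[i].y += wind.y * dt + s * curand_normal(&local);
            pos[i].z += wind.z * dt + s * curand_normal(&local);
            if (pos[i].z < 0.0f) pos[i].z = -pos[i].z;      // reflect particles at the ground
            rng[i] = local;
        }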

    Distributed Verification of Rare Properties using Importance Splitting Observers

    Rare properties remain a challenge for statistical model checking (SMC) due to the quadratic scaling of variance with rarity. We address this with a variance reduction framework based on lightweight importance splitting observers. These expose the model-property automaton and allow the construction of score functions for high-performance algorithms. The confidence intervals defined for importance splitting make it appealing for SMC, but optimising its performance in the standard way makes distribution inefficient. We show how it is possible to achieve equivalently good results in less time by distributing simpler algorithms. We first explore the challenges posed by importance splitting and present an algorithm optimised for distribution. We then define a specific bounded-time logic that is compiled into memory-efficient observers to monitor executions. Finally, we demonstrate our framework on a number of challenging case studies.
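
    As a toy, host-side illustration of fixed-level importance splitting (not the paper's observers, score functions or distribution scheme), the following sketch estimates the probability that a biased random walk reaches a high level before absorption at zero, multiplying the per-level conditional probabilities; the model, the levels and all names are assumptions.

        #include <cstdio>
        #include <random>
        #include <vector>

        // Fixed-level splitting: at each level, extend every trace until it either
        // reaches the next level or is absorbed at zero, record the fraction that
        // succeeded, and resample the successful traces back up to N. Because the
        // toy model is Markov and the score is the current height, resampling the
        // reached states is enough; a real implementation resamples trace prefixes.
        int main() {
            const int N = 1000, M = 20;
            std::mt19937 gen(1);
            std::bernoulli_distribution up(0.4);             // walk is biased against the rare event

            std::vector<int> state(N, 1);                    // all traces start at height 1
            double estimate = 1.0;
            for (int level = 2; level <= M; ++level) {
                std::vector<int> reached;
                for (int s : state) {
                    int x = s;
                    while (x > 0 && x < level) x += up(gen) ? 1 : -1;
                    if (x == level) reached.push_back(x);
                }
                if (reached.empty()) { estimate = 0.0; break; }
                estimate *= (double)reached.size() / N;      // conditional probability of this level
                std::uniform_int_distribution<size_t> pick(0, reached.size() - 1);
                for (int i = 0; i < N; ++i) state[i] = reached[pick(gen)];
            }
            std::printf("estimated rare-event probability: %g\n", estimate);
            return 0;
        }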