103 research outputs found

Exposing errors related to weak memory in GPU applications

Author: Alastair F. Donaldson
Alcantara D. A. F.
Alglave J.
Bardsley E.
Chiang W.
Collier W. W.
Coplin J.
Feng W.
Hangal S.
Hwu W.-m. W.
Joshi S.
Lê N. M.
Sanders J.
Tyler Sorensen
Tzeng S.
Xiao S.
Yuki T.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 20/01/2016

© 2016 ACM. We present the systematic design of a testing environment that uses stressing and fuzzing to reveal errors in GPU applications that arise due to weak memory effects. We evaluate our approach on seven GPUs spanning three NVIDIA architectures, across ten CUDA applications that use fine-grained concurrency. Our results show that applications that rarely or never exhibit errors related to weak memory when executed natively can readily exhibit these errors when executed in our testing environment. Our testing environment also provides a means to help identify the root causes of such errors, and automatically suggests how to insert fences that harden an application against weak memory bugs. To understand the cost of GPU fences, we benchmark applications with fences provided by the hardening strategy as well as a more conservative, sound fencing strategy.

Crossref

Spiral - Imperial College Digital Repository
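
The kind of weak memory error the abstract above refers to can be illustrated with a toy message-passing litmus test. This is a minimal sketch, not the paper's testing environment: the encoding, the function names and the "reordering" model are assumptions made for illustration, and real GPU memory models are considerably more subtle. A fence is modeled simply as forbidding reordering within a thread.

```python
# Toy message-passing litmus test (hypothetical encoding, for illustration only).
# Thread 0 writes the data, then the flag; thread 1 reads the flag, then the data.
T0 = [("store", "x", 1), ("store", "flag", 1)]
T1 = [("load", "flag", "r1"), ("load", "x", "r2")]


def interleavings(a, b):
    """Yield every merge of sequences a and b that preserves each one's order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def run(trace):
    """Execute one global order of operations and return (r1, r2)."""
    mem = {"x": 0, "flag": 0}
    regs = {}
    for op, loc, arg in trace:
        if op == "store":
            mem[loc] = arg
        else:
            regs[arg] = mem[loc]
    return regs["r1"], regs["r2"]


def outcomes(allow_reordering):
    """Collect all (r1, r2) results.

    allow_reordering=False models fenced threads (program order is kept);
    allow_reordering=True also lets each thread's two accesses swap places,
    a crude stand-in for weak memory effects.
    """
    orders0 = [T0, T0[::-1]] if allow_reordering else [T0]
    orders1 = [T1, T1[::-1]] if allow_reordering else [T1]
    return {
        run(trace)
        for o0 in orders0
        for o1 in orders1
        for trace in interleavings(o0, o1)
    }


fenced = outcomes(allow_reordering=False)
weak = outcomes(allow_reordering=True)
print("fenced:", sorted(fenced))
print("weak:  ", sorted(weak))
```

In this model the outcome `(1, 0)`, where the flag is seen as set but the data is still stale, appears only when reordering is allowed, which is the kind of behaviour a fence is meant to rule out.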

¿Son las GPUs dispositivos eficientes energéticamente? (Are GPUs energy-efficient devices?)

Author: De Giusti Laura Cristina
Naiouf Marcelo
Pi Puig Martín
Publication date: 01/10/2018

With energy consumption emerging as one of the biggest issues in the development of HPC (High Performance Computing) applications, detailed power-related research has become a priority. In recent years, GPU coprocessors have been increasingly used to accelerate many of these expensive systems, even though they embed millions of transistors on their chips, which considerably increases power requirements. This paper analyzes a set of applications from the Rodinia benchmark suite in terms of CPU and GPU performance and energy consumption. Specifically, it compares single-threaded and multi-threaded CPU versions with GPU implementations, and characterizes the execution time, true instant power and average energy consumption to test the idea that GPUs are power-hungry computing devices.
Facultad de Informátic

Servicio de Difusión de la Creación Intelectual
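
The relationship between execution time, instantaneous power and energy that the abstract above relies on can be sketched as follows. This is a minimal illustration, assuming power is sampled at a fixed interval; the function names and the sample values are invented for the example and are not measurements from the paper.

```python
def energy_joules(power_watts, dt_seconds):
    """Integrate evenly spaced power samples with the trapezoidal rule."""
    if len(power_watts) < 2:
        return 0.0
    inner = sum(power_watts[1:-1])
    return dt_seconds * (0.5 * (power_watts[0] + power_watts[-1]) + inner)


def average_power_watts(power_watts, dt_seconds):
    """Average power is the total energy divided by the elapsed time."""
    elapsed = dt_seconds * (len(power_watts) - 1)
    return energy_joules(power_watts, dt_seconds) / elapsed if elapsed else 0.0


# Hypothetical traces: a CPU run drawing 60 W for 10 s, a GPU run drawing 150 W for 2 s.
cpu_trace = [60.0] * 11   # 11 samples at 1 s spacing span 10 s
gpu_trace = [150.0] * 3   # 3 samples at 1 s spacing span 2 s

print("CPU energy (J):", energy_joules(cpu_trace, 1.0))
print("GPU energy (J):", energy_joules(gpu_trace, 1.0))
```

With these made-up numbers the GPU run draws more power at any instant yet consumes less energy overall because it finishes sooner, which is the distinction between power and energy that such a comparison has to keep separate.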

GPU Performance and Power Consumption Analysis: A DCT based denoising application

Author: De Giusti Armando Eduardo
De Giusti Laura Cristina
Naiouf Marcelo
Pi Puig Martín
Publication date: 01/10/2017

It is known that energy and power consumption are becoming serious metrics in the design of high performance workstations because of heat dissipation problems. In recent years, GPU accelerators have been integrated into many of these expensive systems, even though they embed more and more transistors on their chips, producing a quick increase in power consumption requirements. This paper analyzes an image processing application, in particular a Discrete Cosine Transform denoising algorithm, in terms of CPU and GPU performance and energy consumption. Specifically, we compare single-threaded and multithreaded CPU versions with a GPU version, and characterize the execution time, true instant power and average energy consumption to deflate the idea that GPUs are non-green computing devices.
XVIII Workshop de Procesamiento Distribuido y Paralelo (WPDP). Red de Universidades con Carreras en Informática (RedUNCI

Servicio de Difusión de la Creación Intelectual

Fast algorithm for real-time rings reconstruction

Author: Ammendola R.
Bauce Matteo
Biagioni A.
Capuani S.
Chiozzi Stefano
Cotta Ramusino Angelo
Di Domenico Giovanni
Fantechi R.
Fiorini Massimiliano
Giagu S.
Gianoli Alberto
Graverini E.
Lamanna Gianluca
Lonardo A.
Messina A.
Neri Ilaria
Palombo Marco
Pantaleo F.
Paolucci P.S.
Piandani R.
Pontisso L.
Rescigno M.
Simula F.
Sozzi Marco
Vicini P.
Publication venue: Verlag Deutsches Elektronen-Synchrotron
Publication date: 01/01/2015

The GAP project is dedicated to studying the application of GPUs in several contexts in which real-time response is important for decision making. The definition of real-time depends on the application under study, ranging from response times of the order of μs up to several hours for very compute-intensive tasks. During this conference we presented our work on low-level triggers [1] [2] and high-level triggers [3] in high energy physics experiments, and specific applications for nuclear magnetic resonance (NMR) [4] [5] and cone-beam CT [6]. Apart from the study of dedicated solutions to decrease the latency due to data transport and preparation, the computing algorithms play an essential role in any GPU application. In this contribution, we show an original algorithm developed for trigger applications, to accelerate ring reconstruction in RICH detectors when seeds for reconstruction are not available from external trackers.

DESY Publication Database

DESY

Archivio istituzionale della ricerca - Università di Ferrara

Archivio della ricerca- Università di Roma La Sapienza

CERN Document Server

Deep learning in edge: evaluation of models and frameworks in ARM architecture

Author: Zanchetta Breno Fanchiotti
Publication date: 01/01/2022

The boom and popularization of edge devices have molded its market due to stiff competition that provides better functionalities at low energy costs. The ARM architecture has been unanimously unopposed in the huge market segment of smartphones and still makes a presence beyond that: in drones, surveillance systems, cars, and robots. Also, it has been used successfully for the development of solutions for chains that supply food, fuel, and other services. Up until recently, ARM did not show much promise for high-level computation, i.e., thanks to its limited RISC instruction set, it was considered power efficient but weak in performance compared to x86 architecture. However, most recent advancements in ARM architecture pivoted that inflection point up thanks to the introduction of embedded GPUs with DMA into LPDDR memory boards. Since this development in boards such as NVIDIA TK1, NVIDIA Jetson TX1, and NVIDIA TX2, perhaps it finally became feasible to study and perform more challenging parallel and distributed workloads directly on a RISC-based architecture. On the other hand, the novelty of this technology poses a fundamental question of whether these boards are gaining a meaningful ratio between processing power and power consumption over conventional architectures or if they are bound to have reached their limitations. This work explores the Parallel Processing of Deep Learning on embedded GPUs of NVIDIA Jetson TX2 to evaluate the question above comprehensively. Thus, it uses 4 ARM boards, with 2 Deep Learning frameworks, 7 CNN models, and one medium-sized dataset combined into six board settings to conduct experiments. The experiments were conducted under similar environments, all built from the source. Altogether, the experiments ran for a total of 4,804 hours and revealed a slight advantage for MxNet on GPU-reliant training and a PyTorch overall advantage in total execution time and power, but especially for CPU-only executions. The experiments also showed that the NVIDIA Jetson TX2 already makes feasible some complex workloads directly on its SoC.

Lume 5.8

2D Reconstruction of Small Intestine's Interior Wall

Author: Attar Rahman
Wang Zhihua
Xie Xiang
Yue Shigang
Publication date: 15/03/2018

Examining and interpreting a large number of wireless endoscopic images from the gastrointestinal tract is a tiresome task for physicians. A practical solution is to automatically construct a two-dimensional representation of the gastrointestinal tract for easy inspection. However, little has been done on wireless endoscopic image stitching, let alone systematic investigation. The proposed new wireless endoscopic image stitching method consists of two main steps to improve the accuracy and efficiency of image registration. First, the keypoints are extracted by the Principal Component Analysis and Scale Invariant Feature Transform (PCA-SIFT) algorithm and refined with Maximum Likelihood Estimation SAmple Consensus (MLESAC) outlier removal to find the most reliable keypoints. Second, the optimal transformation parameters obtained from the first step are fed to the Normalised Mutual Information (NMI) algorithm as an initial solution. With a modified Marquardt-Levenberg search strategy in a multiscale framework, the NMI can find the optimal transformation parameters in the shortest time. The proposed methodology has been tested on two different datasets: one with real wireless endoscopic images and another with images obtained from Micro-Ball (a new wireless cubic endoscopy system with six image sensors). The results have demonstrated the accuracy and robustness of the proposed methodology both visually and quantitatively.
Comment: Journal draf

arXiv.org e-Print Archive

Southampton (e-Prints Soton)
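
The Normalised Mutual Information score mentioned in the abstract above can be sketched on two small intensity sequences. This is a minimal illustration, not the authors' implementation: it assumes the common definition NMI = (H(A) + H(B)) / H(A, B), treats each image as a flat list of discrete intensities, and the function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(counts, total):
    """Shannon entropy in bits of a histogram given as raw counts."""
    return -sum((c / total) * log2(c / total) for c in counts if c)


def normalised_mutual_information(a, b):
    """NMI of two equal-length intensity sequences, as (H(A) + H(B)) / H(A, B).

    Assumption: when the joint entropy is zero (both inputs constant) the
    ratio is undefined, and this sketch returns 1.0 by convention.
    """
    if len(a) != len(b):
        raise ValueError("inputs must have the same number of pixels")
    n = len(a)
    h_a = entropy(Counter(a).values(), n)
    h_b = entropy(Counter(b).values(), n)
    h_ab = entropy(Counter(zip(a, b)).values(), n)
    return (h_a + h_b) / h_ab if h_ab else 1.0


# Two tiny hypothetical "images": one perfectly aligned pair, one unrelated pair.
reference = [0, 0, 1, 1]
aligned = [0, 0, 1, 1]
unrelated = [0, 1, 0, 1]

print(normalised_mutual_information(reference, aligned))
print(normalised_mutual_information(reference, unrelated))
```

Under this definition the score is higher when the overlapping intensities are more predictable from one another, which is why a registration search can use it as the quantity to maximise over transformation parameters.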

Spectrograms of ship wakes: identifying linear and nonlinear wave signals

Author: McCue Scott W.
Moroney Timothy J.
Pethiyagoda Ravindra
Publication venue: 'Cambridge University Press (CUP)'
Publication date: 27/10/2016

A spectrogram is a useful way of using short-time discrete Fourier transforms to visualise surface height measurements taken of ship wakes in real world conditions. For a steadily moving ship that leaves behind small-amplitude waves, the spectrogram is known to have two clear linear components, a sliding-frequency mode caused by the divergent waves and a constant-frequency mode for the transverse waves. However, recent observations of high speed ferry data have identified additional components of the spectrograms that are not yet explained. We use computer simulations of linear and nonlinear ship wave patterns and apply time-frequency analysis to generate spectrograms for an idealised ship. We clarify the role of the linear dispersion relation and ship speed on the two linear components. We use a simple weakly nonlinear theory to identify higher order effects in a spectrogram and, while the high speed ferry data is very noisy, we propose that certain additional features in the experimental data are caused by nonlinearity. Finally, we provide a possible explanation for a further discrepancy between the high speed ferry spectrograms and linear theory by accounting for ship acceleration.
Comment: 21 pages, 10 figures, submitte

arXiv.org e-Print Archive

Queensland University of Technology ePrints Archive
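
The short-time discrete Fourier transform behind the spectrograms described in the abstract above can be sketched in a few lines. This is a minimal illustration using only the standard library, not the authors' code: the Hann window, the window and hop sizes, and the synthetic constant-frequency test signal are all assumptions chosen for the example.

```python
import cmath
import math


def spectrogram(signal, window, hop):
    """Return one row of DFT magnitudes per frame, for bins 0..window//2.

    Each frame is tapered with a Hann window before the transform.
    """
    hann = [0.5 * (1.0 - math.cos(2.0 * math.pi * n / (window - 1)))
            for n in range(window)]
    rows = []
    for start in range(0, len(signal) - window + 1, hop):
        frame = [signal[start + n] * hann[n] for n in range(window)]
        row = [
            abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / window)
                    for n in range(window)))
            for k in range(window // 2 + 1)
        ]
        rows.append(row)
    return rows


def peak_bins(rows):
    """Index of the strongest frequency bin in each frame."""
    return [max(range(len(row)), key=row.__getitem__) for row in rows]


# Synthetic constant-frequency signal: 8 cycles per 64-sample window.
WINDOW, HOP = 64, 32
tone = [math.sin(2.0 * math.pi * 8 * n / WINDOW) for n in range(256)]

rows = spectrogram(tone, WINDOW, HOP)
print(peak_bins(rows))
```

A constant-frequency mode shows up as the same peak bin in every frame, whereas a sliding-frequency mode would show the peak bin drifting from frame to frame.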