Search CORE

784 research outputs found

TensorFlow Doing HPC

Author: Bulatov Yaroslav
Chien Steven W. D.
Laure Erwin
Markidis Stefano
Olshevsky Vyacheslav
Vetter Jeffrey S.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 11/03/2019
Field of study

TensorFlow is a popular emerging open-source programming framework supporting the execution of distributed applications on heterogeneous hardware. While TensorFlow has been initially designed for developing Machine Learning (ML) applications, in fact TensorFlow aims at supporting the development of a much broader range of application kinds that are outside the ML domain and can possibly include HPC applications. However, very few experiments have been conducted to evaluate TensorFlow performance when running HPC workloads on supercomputers. This work addresses this lack by designing four traditional HPC benchmark applications: STREAM, matrix-matrix multiply, Conjugate Gradient (CG) solver and Fast Fourier Transform (FFT). We analyze their performance on two supercomputers with accelerators and evaluate the potential of TensorFlow for developing HPC applications. Our tests show that TensorFlow can fully take advantage of high performance networks and accelerators on supercomputers. Running our TensorFlow STREAM benchmark, we obtain over 50% of theoretical communication bandwidth on our testing platform. We find an approximately 2x, 1.7x and 1.8x performance improvement when increasing the number of GPUs from two to four in the matrix-matrix multiply, CG and FFT applications respectively. All our performance results demonstrate that TensorFlow has high potential of emerging also as HPC programming framework for heterogeneous supercomputers.Comment: Accepted for publication at The Ninth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES'19

arXiv.org e-Print Archive

Crossref

Enabling a High Throughput Real Time Data Pipeline for a Large Radio Telescope Array with GPUs

Author: Bracewell
Corey
D.A. Mitchell
DeBoer
Deboer
Furlanetto
H. Pfister
Hey
Högbom
K. Dale
L.J. Greenhill
Lonsdale
M.A. Clark
Mitchell
Mitchell
Momjian
R.B. Wayth
R.G. Edgar
S.M. Ord
Salah
Schwab
Terzian
Thompson
Thompson
Varbanescu
Wayth
Publication venue: 'Elsevier BV'
Publication date: 01/01/2010
Field of study

The Murchison Widefield Array (MWA) is a next-generation radio telescope currently under construction in the remote Western Australia Outback. Raw data will be generated continuously at 5GiB/s, grouped into 8s cadences. This high throughput motivates the development of on-site, real time processing and reduction in preference to archiving, transport and off-line processing. Each batch of 8s data must be completely reduced before the next batch arrives. Maintaining real time operation will require a sustained performance of around 2.5TFLOP/s (including convolutions, FFTs, interpolations and matrix multiplications). We describe a scalable heterogeneous computing pipeline implementation, exploiting both the high computing density and FLOP-per-Watt ratio of modern GPUs. The architecture is highly parallel within and across nodes, with all major processing elements performed by GPUs. Necessary scatter-gather operations along the pipeline are loosely synchronized between the nodes hosting the GPUs. The MWA will be a frontier scientific instrument and a pathfinder for planned peta- and exascale facilities.Comment: Version accepted by Comp. Phys. Com

arXiv.org e-Print Archive

Crossref

Harvard University - DASH

espace@Curtin

Exploiting graphic processing units parallelism to improve intelligent data acquisition system performance in JET's correlation reflectometer

Author: Arcas Castro Guillermo de
Barrera Lopez de Turiso Eduardo
Fonseca A.
López Navarro Juan Manuel
Murari Andrea
Nieto J.
Ruiz González Mariano
Vega Jesús
Publication venue: E.U.I.T. Telecomunicación (UPM)
Publication date: 01/04/2011
Field of study

The performance of intelligent data acquisition systems relies heavily on their processing capabilities and local bus bandwidth, especially in applications with high sample rates or high number of channels. This is the case of the self adaptive sampling rate data acquisition system installed as a pilot experiment in KG8B correlation reflectometer at JET. The system, which is based on the ITMS platform, continuously adapts the sample rate during the acquisition depending on the signal bandwidth. In order to do so it must transfer acquired data to a memory buffer in the host processor and run heavy computational algorithms for each data block. The processing capabilities of the host CPU and the bandwidth of the PXI bus limit the maximum sample rate that can be achieved, therefore limiting the maximum bandwidth of the phenomena that can be studied. Graphic processing units (GPU) are becoming an alternative for speeding up compute intensive kernels of scientific, imaging and simulation applications. However, integrating this technology into data acquisition systems is not a straight forward step, not to mention exploiting their parallelism efficiently. This paper discusses the use of GPUs with new high speed data bus interfaces to improve the performance of the self adaptive sampling rate data acquisition system installed on JET. Integration issues are discussed and performance evaluations are presente

Archivo Digital UPM

A Multi-GPU Programming Library for Real-Time Applications

Author: Schaetz Sebastian
Uecker Martin
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2012
Field of study

We present MGPU, a C++ programming library targeted at single-node multi-GPU systems. Such systems combine disproportionate floating point performance with high data locality and are thus well suited to implement real-time algorithms. We describe the library design, programming interface and implementation details in light of this specific problem domain. The core concepts of this work are a novel kind of container abstraction and MPI-like communication methods for intra-system communication. We further demonstrate how MGPU is used as a framework for porting existing GPU libraries to multi-device architectures. Putting our library to the test, we accelerate an iterative non-linear image reconstruction algorithm for real-time magnetic resonance imaging using multiple GPUs. We achieve a speed-up of about 1.7 using 2 GPUs and reach a final speed-up of 2.1 with 4 GPUs. These promising results lead us to conclude that multi-GPU systems are a viable solution for real-time MRI reconstruction as well as signal-processing applications in general.Comment: 15 pages, 10 figure

arXiv.org e-Print Archive

MPG.PuRe