Search CORE

26,350 research outputs found

Multi-Tenant Virtual GPUs for Optimising Performance of a Financial Risk Application

Author: Prades Javier
Reano Carlos
Silla Federico
Varghese Blesson
Publication venue
Publication date: 14/06/2016
Field of study

Graphics Processing Units (GPUs) are becoming popular accelerators in modern High-Performance Computing (HPC) clusters. Installing GPUs on each node of the cluster is not efficient resulting in high costs and power consumption as well as underutilisation of the accelerator. The research reported in this paper is motivated towards the use of few physical GPUs by providing cluster nodes access to remote GPUs on-demand for a financial risk application. We hypothesise that sharing GPUs between several nodes, referred to as multi-tenancy, reduces the execution time and energy consumed by an application. Two data transfer modes between the CPU and the GPUs, namely concurrent and sequential, are explored. The key result from the experiments is that multi-tenancy with few physical GPUs using sequential data transfers lowers the execution time and the energy consumed, thereby improving the overall performance of the application.Comment: Accepted to the Journal of Parallel and Distributed Computing (JPDC), 10 June 201

arXiv.org e-Print Archive

Queen's University Belfast Research Portal

Crossref

RiuNet

A parallel nearly implicit time-stepping scheme

Author: Botchev M.A.
Vorst H.A. van der
Publication venue: Elsevier
Publication date: 01/01/2001
Field of study

Across-the-space parallelism still remains the most mature, convenient and natural way to parallelize large scale problems. One of the major problems here is that implicit time stepping is often difficult to parallelize due to the structure of the system. Approximate implicit schemes have been suggested to circumvent the problem. These schemes have attractive stability properties and they are also very well parallelizable.\ud The purpose of this article is to give an overall assessment of the parallelism of the method

Elsevier - Publisher Connector

Crossref

CWI's Institutional Repository

University of Twente Research Information

On Characterizing the Data Movement Complexity of Computational DAGs for Parallel Execution

Author: Elango Venmugil
Pouchet Louis-Noël
Ramanujam J.
Rastello Fabrice
Sadayappan P.
Publication venue
Publication date: 01/01/2014
Field of study

Technology trends are making the cost of data movement increasingly dominant, both in terms of energy and time, over the cost of performing arithmetic operations in computer systems. The fundamental ratio of aggregate data movement bandwidth to the total computational power (also referred to the machine balance parameter) in parallel computer systems is decreasing. It is there- fore of considerable importance to characterize the inherent data movement requirements of parallel algorithms, so that the minimal architectural balance parameters required to support it on future systems can be well understood. In this paper, we develop an extension of the well-known red-blue pebble game to develop lower bounds on the data movement complexity for the parallel execution of computational directed acyclic graphs (CDAGs) on parallel systems. We model multi-node multi-core parallel systems, with the total physical memory distributed across the nodes (that are connected through some interconnection network) and in a multi-level shared cache hierarchy for processors within a node. We also develop new techniques for lower bound characterization of non-homogeneous CDAGs. We demonstrate the use of the methodology by analyzing the CDAGs of several numerical algorithms, to develop lower bounds on data movement for their parallel execution

arXiv.org e-Print Archive

CiteSeerX

Crossref

Hal - Université Grenoble Alpes

INRIA a CCSD electronic archive server

HAL-Rennes 1

Bounding Cache Miss Costs of Multithreaded Computations Under General Schedulers

Author: Cole Richard
Ramachandran Vijaya
Publication venue
Publication date: 28/09/2017
Field of study

We analyze the caching overhead incurred by a class of multithreaded algorithms when scheduled by an arbitrary scheduler. We obtain bounds that match or improve upon the well-known

O(Q+S \cdot (M/B))

caching cost for the randomized work stealing (RWS) scheduler, where

S

is the number of steals,

Q

is the sequential caching cost, and

M

and

B

are the cache size and block (or cache line) size respectively.Comment: Extended abstract in Proceedings of ACM Symp. on Parallel Alg. and Architectures (SPAA) 2017, pp. 339-350. This revision has a few small updates including a missing citation and the replacement of some big Oh terms with precise constant

arXiv.org e-Print Archive

Crossref