11,582 research outputs found
An Efficient Network API for in-Kernel Applications in Clusters
International audienceRunning parallel applications on clusters with high-speed local networks requires fast communication between computing nodes but also low latency and high bandwidth file access. However, the application programming interfaces of high-speed local networks were designed for MPI communication and do not always meet the requirements of other applications like distributed file systems. In this paper, we explore several solutions to improve the use of high-speed network for in-kernel applications. Distributed file systems implemented on top of the GM interface of Myrinet are first examined to demonstrate how hard it is to get an efficient interaction between such applications and the network. Then, we propose solutions to simplify and improve this interaction and integrate them into the kernel interface of the new Myrinet. Performance comparisons between MX and GM, and their usage in both a distributed file system and a zero-copy protocol show nice improvements. Moreover, we are able to improve the performance of the flexible kernel API we designed in MX that allows to remove some intermediate copy
Multi-Tenant Virtual GPUs for Optimising Performance of a Financial Risk Application
Graphics Processing Units (GPUs) are becoming popular accelerators in modern
High-Performance Computing (HPC) clusters. Installing GPUs on each node of the
cluster is not efficient resulting in high costs and power consumption as well
as underutilisation of the accelerator. The research reported in this paper is
motivated towards the use of few physical GPUs by providing cluster nodes
access to remote GPUs on-demand for a financial risk application. We
hypothesise that sharing GPUs between several nodes, referred to as
multi-tenancy, reduces the execution time and energy consumed by an
application. Two data transfer modes between the CPU and the GPUs, namely
concurrent and sequential, are explored. The key result from the experiments is
that multi-tenancy with few physical GPUs using sequential data transfers
lowers the execution time and the energy consumed, thereby improving the
overall performance of the application.Comment: Accepted to the Journal of Parallel and Distributed Computing (JPDC),
10 June 201
Checkpointing as a Service in Heterogeneous Cloud Environments
A non-invasive, cloud-agnostic approach is demonstrated for extending
existing cloud platforms to include checkpoint-restart capability. Most cloud
platforms currently rely on each application to provide its own fault
tolerance. A uniform mechanism within the cloud itself serves two purposes: (a)
direct support for long-running jobs, which would otherwise require a custom
fault-tolerant mechanism for each application; and (b) the administrative
capability to manage an over-subscribed cloud by temporarily swapping out jobs
when higher priority jobs arrive. An advantage of this uniform approach is that
it also supports parallel and distributed computations, over both TCP and
InfiniBand, thus allowing traditional HPC applications to take advantage of an
existing cloud infrastructure. Additionally, an integrated health-monitoring
mechanism detects when long-running jobs either fail or incur exceptionally low
performance, perhaps due to resource starvation, and proactively suspends the
job. The cloud-agnostic feature is demonstrated by applying the implementation
to two very different cloud platforms: Snooze and OpenStack. The use of a
cloud-agnostic architecture also enables, for the first time, migration of
applications from one cloud platform to another.Comment: 20 pages, 11 figures, appears in CCGrid, 201
AliEnFS - a Linux File System for the AliEn Grid Services
Among the services offered by the AliEn (ALICE Environment
http://alien.cern.ch) Grid framework there is a virtual file catalogue to allow
transparent access to distributed data-sets using various file transfer
protocols. (AliEn File System) integrates the AliEn file catalogue as
a new file system type into the Linux kernel using LUFS, a hybrid user space
file system framework (Open Source http://lufs.sourceforge.net). LUFS uses a
special kernel interface level called VFS (Virtual File System Switch) to
communicate via a generalised file system interface to the AliEn file system
daemon. The AliEn framework is used for authentication, catalogue browsing,
file registration and read/write transfer operations. A C++ API implements the
generic file system operations. The goal of AliEnFS is to allow users easy
interactive access to a worldwide distributed virtual file system using
familiar shell commands (f.e. cp,ls,rm ...) The paper discusses general aspects
of Grid File Systems, the AliEn implementation and present and future
developments for the AliEn Grid File System.Comment: 9 pages, 12 figure
- âŠ