Multi-Tenant Virtual GPUs for Optimising Performance of a Financial Risk Application
Graphics Processing Units (GPUs) are becoming popular accelerators in modern
High-Performance Computing (HPC) clusters. Installing GPUs on each node of the
cluster is inefficient, resulting in high costs and power consumption as well
as underutilisation of the accelerator. The research reported in this paper is
motivated towards the use of few physical GPUs by providing cluster nodes
access to remote GPUs on-demand for a financial risk application. We
hypothesise that sharing GPUs between several nodes, referred to as
multi-tenancy, reduces the execution time and energy consumed by an
application. Two data transfer modes between the CPU and the GPUs, namely
concurrent and sequential, are explored. The key result from the experiments is
that multi-tenancy with few physical GPUs using sequential data transfers
lowers the execution time and the energy consumed, thereby improving the
overall performance of the application.
Comment: Accepted to the Journal of Parallel and Distributed Computing (JPDC),
10 June 201
The Glasgow raspberry pi cloud: a scale model for cloud computing infrastructures
Data Centers (DCs) used to support Cloud services often consist of tens of thousands of networked machines under a single roof. The significant capital outlay required to replicate such infrastructures constitutes a major obstacle to practical implementation and evaluation of research in this domain. Currently, most research into Cloud computing relies on either limited software simulation or the use of testbed environments with a handful of machines. The recent introduction of the Raspberry Pi, a low-cost, low-power single-board computer, has made the construction of miniature Cloud DCs more affordable. In this paper, we present the Glasgow Raspberry Pi Cloud (PiCloud), a scale model of a DC composed of clusters of Raspberry Pi devices. The PiCloud emulates every layer of a Cloud stack, ranging from resource virtualisation to network behaviour, providing a full-featured Cloud Computing research and educational environment.
Saber: window-based hybrid stream processing for heterogeneous architectures
Modern servers have become heterogeneous, often combining multicore CPUs with many-core GPGPUs. Such heterogeneous architectures have the potential to improve the performance of data-intensive stream processing applications, but they are not supported by current relational stream processing engines. For an engine to exploit a heterogeneous architecture, it must execute streaming SQL queries with sufficient data-parallelism to fully utilise all available heterogeneous processors, and decide how to use each in the most effective way. It must do this while respecting the semantics of streaming SQL queries, in particular with regard to window handling. We describe SABER, a hybrid high-performance relational stream processing engine for CPUs and GPGPUs. SABER executes window-based streaming SQL queries in a data-parallel fashion using all available CPU and GPGPU cores. Instead of statically assigning query operators to heterogeneous processors, SABER employs a new adaptive heterogeneous lookahead scheduling strategy, which increases the share of queries executing on the processor that yields the highest performance. To hide data movement costs, SABER pipelines the transfer of stream data between different memory types and the CPU/GPGPU. Our experimental comparison against state-of-the-art engines shows that SABER increases processing throughput while maintaining low latency for a wide range of streaming SQL queries with small and large window sizes.
Intra-node Memory Safe GPU Co-Scheduling
GPUs in High-Performance Computing systems remain under-utilised due to the unavailability of schedulers that can safely schedule multiple applications to share the same GPU. The research reported in this paper is motivated to improve the utilisation of GPUs by proposing a framework, which we refer to as schedGPU, to facilitate intra-node GPU co-scheduling such that a GPU can be safely shared among multiple applications by taking memory constraints into account. Two approaches, namely a client-server and a shared memory approach, are explored. The shared memory approach is more suitable due to its lower overheads compared to the client-server approach. Four policies are proposed in schedGPU to handle applications that are waiting to access the GPU, two of which account for priorities. The feasibility of schedGPU is validated on three real-world applications. The key observation is that a performance gain is achieved. For single applications, a gain of over 10 times, as measured by GPU utilisation and GPU memory utilisation, is obtained. For workloads comprising multiple applications, a speed-up of up to 5x in the total execution time is noted. Moreover, the average GPU utilisation and average GPU memory utilisation are increased by 5 and 12 times, respectively.
This work was funded by Generalitat Valenciana under grant PROMETEO/2017/77.
Reaño González, C.; Silla Jiménez, F.; Nikolopoulos, DS.; Varghese, B. (2018). Intra-node Memory Safe GPU Co-Scheduling. IEEE Transactions on Parallel and Distributed Systems. 29(5):1089-1102. https://doi.org/10.1109/TPDS.2017.2784428
Parallel Reinforcement Learning Simulation for Visual Quadrotor Navigation
Reinforcement learning (RL) is an agent-based approach for teaching robots to
navigate within the physical world. Gathering data for RL is known to be a
laborious task, and real-world experiments can be risky. Simulators facilitate
the collection of training data in a quicker and more cost-effective manner.
However, RL frequently requires a significant number of simulation steps for an
agent to become skilful at simple tasks. This is a prevalent issue within the
field of RL-based visual quadrotor navigation where state dimensions are
typically very large and dynamic models are complex. Furthermore, rendering
images and obtaining physical properties of the agent can be computationally
expensive. To solve this, we present a simulation framework, built on AirSim,
which provides efficient parallel training. Building on this framework, Ape-X
is modified to incorporate decentralised training of AirSim environments to
make use of numerous networked computers. Through experiments we were able to
achieve a reduction in training time from 3.9 hours to 11 minutes using the
aforementioned framework and a total of 74 agents and two networked computers.
Further details, including a GitHub repo and videos about our project,
PRL4AirSim, can be found at https://sites.google.com/view/prl4airsim/home
Comment: This work has been submitted to the IEEE International Conference on
Robotics and Automation (ICRA) for possible publication. Copyright may be
transferred without notice, after which this version may no longer be
accessible.
Capillary Refill using Augmented Reality
Master's thesis in Computer science
The opportunities within augmented reality are growing. Augmented reality is a combination of the real and the virtual world in real time, and large companies like Microsoft and Google are now investing heavily in the technology.
This thesis presents a solution for simulating a medical test called capillary refill by using augmented reality. The simulation is performed with an augmented reality headset called HoloLens. The HoloLens recognises a marker attached to an artificial hand. The marker is used to detect and keep track of the position and orientation of the hand. A virtual 3D hand is then rendered over the marker on the artificial hand. Inside the artificial hand there is a pressure sensor that is used to detect when users apply pressure to the index finger. The finger on the virtual 3D model then changes nail colour on user interaction, thereby simulating capillary refill.
Auto-tuning Distributed Stream Processing Systems using Reinforcement Learning
Fine-tuning distributed systems is considered to be a craft, relying
on intuition and experience. This becomes even more challenging when the
systems need to react in near real time, as streaming engines have to do to
maintain pre-agreed service quality metrics. In this article, we present an
automated approach that builds on a combination of supervised and reinforcement
learning methods to recommend the most appropriate lever configurations based
on previous load. With this, streaming engines can be automatically tuned
without requiring a human to determine the right way and proper time to deploy
them. This opens the door to new configurations that are not being applied
today since the complexity of managing these systems has surpassed the
abilities of human experts. We show how reinforcement learning systems can find
substantially better configurations in less time than their human counterparts
and adapt to changing workloads.
Memory-Optimised Parallel Processing of Hi-C Data
This paper presents the optimisation efforts on the creation of a graph-based mapping representation of gene adjacency. The method is based on the Hi-C process, starting from Next Generation Sequencing data, and it analyses a huge amount of static data in order to produce maps for one or more genes. Straightforward parallelisation of this scheme does not yield acceptable performance on multicore architectures since the scalability is rather limited due to the memory-bound nature of the problem. This work focuses on the memory optimisations that can be applied to the graph construction algorithm and its (complex) data structures to derive a cache-oblivious algorithm and eventually to improve the memory bandwidth utilisation. As a running example we use NuChart-II, a tool for annotation and statistical analysis of Hi-C data that creates a gene-centric neighborhood graph. The proposed approach, which is exemplified for Hi-C, addresses several common issues in the parallelisation of memory-bound algorithms for multicore. Results show that the proposed approach is able to increase the parallel speedup from 7x to 22x (on a 32-core platform). Finally, the proposed C++ implementation outperforms the first R NuChart prototype, with which it was not possible to complete the graph generation because of strong memory-saturation problems.
Molecular docking with Raccoon2 on clouds: extending desktop applications with cloud computing
Molecular docking is a computer simulation that predicts the binding affinity between two molecules, a ligand and a receptor. Large-scale docking simulations, using one receptor and many ligands, are known as structure-based virtual screening. Often used in drug discovery, virtual screening can be very computationally demanding. This is why user-friendly domain-specific web or desktop applications that enable running simulations on powerful computing infrastructures have been created. Cloud computing provides on-demand availability, pay-per-use pricing, and great scalability, which can improve the performance and efficiency of scientific applications. This paper investigates how domain-specific desktop applications can be extended to run scientific simulations on various clouds. A generic approach based on scientific workflows is proposed, and a proof of concept is implemented using the Raccoon2 desktop application for virtual screening, WS-PGRADE workflows, and gUSE services with the CloudBroker platform. The presented analysis illustrates that this approach of extending a domain-specific desktop application can run workflows on different types of clouds, and indeed makes use of the on-demand scalability provided by cloud computing. It also facilitates the execution of virtual screening simulations by life scientists without requiring them to abandon their favourite desktop environment, and provides them with resources without major capital investment.