453 research outputs found
Accelerating sequential programs using FastFlow and self-offloading
FastFlow is a programming environment specifically targeting cache-coherent
shared-memory multi-cores. FastFlow is implemented as a stack of C++ template
libraries built on top of lock-free (fence-free) synchronization mechanisms. In
this paper we present a further evolution of FastFlow enabling programmers to
offload part of their workload on a dynamically created software accelerator
running on unused CPUs. The offloaded function can be easily derived from
pre-existing sequential code. We emphasize in particular the effective
trade-off between human productivity and execution efficiency of the approach.Comment: 17 pages + cove
Tackling Exascale Software Challenges in Molecular Dynamics Simulations with GROMACS
GROMACS is a widely used package for biomolecular simulation, and over the
last two decades it has evolved from small-scale efficiency to advanced
heterogeneous acceleration and multi-level parallelism targeting some of the
largest supercomputers in the world. Here, we describe some of the ways we have
been able to realize this through the use of parallelization on all levels,
combined with a constant focus on absolute performance. Release 4.6 of GROMACS
uses SIMD acceleration on a wide range of architectures, GPU offloading
acceleration, and both OpenMP and MPI parallelism within and between nodes,
respectively. The recent work on acceleration made it necessary to revisit the
fundamental algorithms of molecular simulation, including the concept of
neighborsearching, and we discuss the present and future challenges we see for
exascale simulation - in particular a very fine-grained task parallelism. We
also discuss the software management, code peer review and continuous
integration testing required for a project of this complexity.Comment: EASC 2014 conference proceedin
OpenCL Actors - Adding Data Parallelism to Actor-based Programming with CAF
The actor model of computation has been designed for a seamless support of
concurrency and distribution. However, it remains unspecific about data
parallel program flows, while available processing power of modern many core
hardware such as graphics processing units (GPUs) or coprocessors increases the
relevance of data parallelism for general-purpose computation.
In this work, we introduce OpenCL-enabled actors to the C++ Actor Framework
(CAF). This offers a high level interface for accessing any OpenCL device
without leaving the actor paradigm. The new type of actor is integrated into
the runtime environment of CAF and gives rise to transparent message passing in
distributed systems on heterogeneous hardware. Following the actor logic in
CAF, OpenCL kernels can be composed while encapsulated in C++ actors, hence
operate in a multi-stage fashion on data resident at the GPU. Developers are
thus enabled to build complex data parallel programs from primitives without
leaving the actor paradigm, nor sacrificing performance. Our evaluations on
commodity GPUs, an Nvidia TESLA, and an Intel PHI reveal the expected linear
scaling behavior when offloading larger workloads. For sub-second duties, the
efficiency of offloading was found to largely differ between devices. Moreover,
our findings indicate a negligible overhead over programming with the native
OpenCL API.Comment: 28 page
O2ATH: An OpenMP Offloading Toolkit for the Sunway Heterogeneous Manycore Platform
The next generation Sunway supercomputer employs the SW26010pro processor,
which features a specialized on-chip heterogeneous architecture. Applications
with significant hotspots can benefit from the great computation capacity
improvement of Sunway many-core architectures by carefully making intensive
manual many-core parallelization efforts. However, some legacy projects with
large codebases, such as CESM, ROMS and WRF, contain numerous lines of code and
do not have significant hotspots. The cost of manually porting such
applications to the Sunway architecture is almost unaffordable. To overcome
such a challenge, we have developed a toolkit named O2ATH. O2ATH forwards GNU
OpenMP runtime library calls to Sunway's Athread library, which greatly
simplifies the parallelization work on the Sunway architecture.O2ATH enables
users to write both MPE and CPE code in a single file, and parallelization can
be achieved by utilizing OpenMP directives and attributes. In practice, O2ATH
has helped us to port two large projects, CESM and ROMS, to the CPEs of the
next generation Sunway supercomputers via the OpenMP offload method. In the
experiments, kernel speedups range from 3 to 15 times, resulting in 3 to 6
times whole application speedups.Furthermore, O2ATH requires significantly
fewer code modifications compared to manually crafting CPE functions.This
indicates that O2ATH can greatly enhance development efficiency when porting or
optimizing large software projects on Sunway supercomputers.Comment: 15 pages, 6 figures, 5 tables
GPU Computing to Improve Game Engine Performance
Although the graphics processing unit (GPU) was originally designed to accelerate the image creation for output to display, today's general purpose GPU (GPGPU) computing offers unprecedented performance by offloading computing-intensive portions of the application to the GPGPU, while running the remainder of the code on the central processing unit (CPU). The highly parallel structure of a many core GPGPU can process large blocks of data faster using multithreaded concurrent processing. A game engine has many "components" and multithreading can be used to implement their parallelism. However, effective implementation of multithreading in a multicore processor has challenges, such as data and task parallelism. In this paper, we investigate the impact of using a GPGPU with a CPU to design high-performance game engines. First, we implement a separable convolution filter (heavily used in image processing) with the GPGPU. Then, we implement a multiobject interactive game console in an eight-core workstation using a multithreaded asynchronous model (MAM), a multithreaded synchronous model (MSM), and an MSM with data parallelism (MSMDP). According to the experimental results, speedup of about 61x and 5x is achieved due to GPGPU and MSMDP implementation, respectively. Therefore, GPGPU-assisted parallel computing has the potential to improve multithreaded game engine performance
Small cell cloud proof of concept implementation and monitoring schemes analysis
Cloud Computing has grown exponentially in popularity in the last few years, becoming a key technology for both personal and enterprise applications due to the numerous benefits it offers. On the other hand, Small Cell technology is considered by many to be the solution to the challenges that are expected to arise caused by the continuously increasing number of interconnected mobile devices.
This project presents a basic design and a proof of concept implementation of a Small Cell Cloud, a current research field on mobile communications that aims to leverage the capabilities offered by the parallel and distributed computation of Cloud Computing to enhance Small Cells functionality.
The purpose of the described Small Cell Cloud is to allow application offloading of mobile devices to Small Cells, allowing the execution of more resource demanding applications at the same time energy consumption is reduced in those devices.
Furthermore, a detailed analysis on different Small Cell monitoring schemes is carried out, comparing the achieved performance with each of them in terms of data reliability and generated network traffic.
Finally, based on the proof of concept implementation and a series of stress performance test, conclusions on the viability of the proposed Small Cell Cloud design and the most appropriate monitoring scheme are presented. Guidelines for future research work are also provided, considering the work developed in this project as a first step towards a new mobile technology.IngenierĂa de TelecomunicaciĂł
Thirty Meter Telescope (TMT) Narrow Field Infrared Adaptive Optics System (NFIRAOS) real-time controller preliminary architecture
The Narrow Field Infrared Adaptive Optics System (NFIRAOS) is the first light Adaptive Optics (AO) system for the Thirty Meter Telescope (TMT). A critical component of NFIRAOS is the Real-Time Controller (RTC) subsystem which provides real-time wavefront correction by processing wavefront information to compute Deformable Mirror (DM) and Tip/Tilt Stage (TTS) commands. The National Research Council of Canada - Herzberg (NRC-H), in conjunction with TMT, has developed a preliminary design for the NFIRAOS RTC. The preliminary architecture for the RTC is comprised of several Linux-based servers. These servers are assigned various roles including: the High-Order Processing (HOP) servers, the Wavefront Corrector Controller (WCC) server, the Telemetry Engineering Display (TED) server, the Persistent Telemetry Storage (PTS) server, and additional testing and spare servers. There are up to six HOP servers that accept high-order wavefront pixels, and perform parallelized pixel processing and wavefront reconstruction to produce wavefront corrector error vectors. The WCC server performs low-order mode processing, and synchronizes and aggregates the high-order wavefront corrector error vectors from the HOP servers to generate wavefront corrector commands. The Telemetry Engineering Display (TED) server is the RTC interface to TMT and other subsystems. The TED server receives all external commands and dispatches them to the rest of the RTC servers and is responsible for aggregating several offloading and telemetry values that are reported to other subsystems within NFIRAOS and TMT. The TED server also provides the engineering GUIs and real-time displays. The Persistent Telemetry Storage (PTS) server contains fault tolerant data storage that receives and stores telemetry data, including data for Point-Spread Function Reconstruction (PSFR)
Software-only diverse redundancy on GPUs for autonomous driving platforms
Autonomous driving (AD) builds upon high-performance computing platforms including (1) general purpose CPUs as well as (2) specific accelerators, being GPUs one of the main representatives. Microcontrollers have reached ASIL-D compliance by implementing diverse redundancy with lockstep execution. However, ASIL-D compliant GPUs rely on either fully redundant lockstep GPUs (i.e. 2 GPUs), which doubles hardware costs, or fully redundant systems with a GPU and another accelerator, which virtually doubles design and validation/verification (V&V) costs. In this paper we analyze the degree of diversity achieved when implementing redundancy on a single GPU, showing that diverse redundancy is not achieved in many cases, and propose software strategies that guarantee achieving diverse redundancy for any kernel on systems using commercial off-the-shelf (COTS) GPUs, thus showing how to achieve ASIL-D compliance on a single COTS GPU in controlled scenarios.This work has been partially supported by the Spanish Ministry of Economy and Competitiveness (MINECO) under grant TIN2015-65316-P and the HiPEAC Network of Excellence. Jaume Abella has been partially supported by the MINECO under Ramon y Cajal postdoctoral fellowship number RYC2013-14717Peer ReviewedPostprint (author's final draft
- …