453 research outputs found

    Accelerating sequential programs using FastFlow and self-offloading

    Full text link
    FastFlow is a programming environment specifically targeting cache-coherent shared-memory multi-cores. FastFlow is implemented as a stack of C++ template libraries built on top of lock-free (fence-free) synchronization mechanisms. In this paper we present a further evolution of FastFlow enabling programmers to offload part of their workload onto a dynamically created software accelerator running on unused CPUs. The offloaded function can be easily derived from pre-existing sequential code. We emphasize in particular the effective trade-off between human productivity and execution efficiency of the approach. Comment: 17 pages + cove
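
    A minimal sketch of the self-offloading pattern the abstract describes, written with plain C++11 threads rather than the actual FastFlow classes (all names here are illustrative): a loop that used to run sequentially offloads its body to a software accelerator built from worker threads on otherwise idle cores, and collects the results afterwards.

```cpp
// Minimal sketch of the self-offloading idea (NOT the FastFlow API): a
// software "accelerator" made of worker threads on spare cores accepts
// offloaded tasks and hands results back through futures.
#include <condition_variable>
#include <functional>
#include <future>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class SoftwareAccelerator {
public:
    explicit SoftwareAccelerator(unsigned workers) {
        for (unsigned i = 0; i < workers; ++i)
            pool_.emplace_back([this] { run(); });
    }
    ~SoftwareAccelerator() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_all();
        for (auto& t : pool_) t.join();
    }
    // Offload one task; the caller collects the result later via the future.
    std::future<double> offload(std::function<double()> task) {
        auto p = std::make_shared<std::promise<double>>();
        std::future<double> fut = p->get_future();
        {
            std::lock_guard<std::mutex> lk(m_);
            q_.push([p, task] { p->set_value(task()); });
        }
        cv_.notify_one();
        return fut;
    }
private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return done_ || !q_.empty(); });
                if (done_ && q_.empty()) return;
                job = std::move(q_.front());
                q_.pop();
            }
            job();
        }
    }
    std::vector<std::thread> pool_;
    std::queue<std::function<void()>> q_;
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
};

int main() {
    unsigned hc = std::thread::hardware_concurrency();
    SoftwareAccelerator acc(hc > 1 ? hc - 1 : 1);   // run on the "unused" cores
    std::vector<std::future<double>> results;
    for (int i = 1; i <= 8; ++i)                    // formerly a sequential loop
        results.push_back(acc.offload([i] { return i * i * 0.5; }));
    for (auto& r : results) std::cout << r.get() << '\n';
}
```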

    Tackling Exascale Software Challenges in Molecular Dynamics Simulations with GROMACS

    Full text link
    GROMACS is a widely used package for biomolecular simulation, and over the last two decades it has evolved from small-scale efficiency to advanced heterogeneous acceleration and multi-level parallelism targeting some of the largest supercomputers in the world. Here, we describe some of the ways we have been able to realize this through the use of parallelization on all levels, combined with a constant focus on absolute performance. Release 4.6 of GROMACS uses SIMD acceleration on a wide range of architectures, GPU offloading acceleration, and both OpenMP and MPI parallelism within and between nodes, respectively. The recent work on acceleration made it necessary to revisit the fundamental algorithms of molecular simulation, including the concept of neighbor searching, and we discuss the present and future challenges we see for exascale simulation, in particular very fine-grained task parallelism. We also discuss the software management, code peer review and continuous integration testing required for a project of this complexity. Comment: EASC 2014 conference proceedin
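
    A hedged illustration of the multi-level parallelism mentioned above, not GROMACS source code: MPI ranks divide the particle range between nodes, while an OpenMP loop divides each rank's share between cores; the particle count and force expression are placeholders.

```cpp
// Hybrid MPI + OpenMP sketch (illustrative only): MPI between nodes,
// OpenMP threads within each node, partial sums combined at the end.
#include <mpi.h>
#include <omp.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const int n_total = 1000000;             // hypothetical particle count
    const int n_local = n_total / nranks;    // this rank's slice (remainder ignored)
    std::vector<double> force(n_local, 0.0);

    double local_energy = 0.0;
    #pragma omp parallel for reduction(+:local_energy)
    for (int i = 0; i < n_local; ++i) {
        force[i] = 1.0 / (i + 1.0);          // placeholder for the force kernel
        local_energy += force[i];
    }

    double energy = 0.0;                     // combine partial sums across nodes
    MPI_Reduce(&local_energy, &energy, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) std::printf("total energy (toy): %f\n", energy);
    MPI_Finalize();
}
```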

    OpenCL Actors - Adding Data Parallelism to Actor-based Programming with CAF

    Full text link
    The actor model of computation has been designed for a seamless support of concurrency and distribution. However, it remains unspecific about data parallel program flows, while the available processing power of modern many-core hardware such as graphics processing units (GPUs) or coprocessors increases the relevance of data parallelism for general-purpose computation. In this work, we introduce OpenCL-enabled actors to the C++ Actor Framework (CAF). This offers a high-level interface for accessing any OpenCL device without leaving the actor paradigm. The new type of actor is integrated into the runtime environment of CAF and gives rise to transparent message passing in distributed systems on heterogeneous hardware. Following the actor logic in CAF, OpenCL kernels can be composed while encapsulated in C++ actors, and hence operate in a multi-stage fashion on data resident on the GPU. Developers are thus enabled to build complex data-parallel programs from primitives without leaving the actor paradigm or sacrificing performance. Our evaluations on commodity GPUs, an Nvidia TESLA, and an Intel PHI reveal the expected linear scaling behavior when offloading larger workloads. For sub-second duties, the efficiency of offloading was found to differ largely between devices. Moreover, our findings indicate a negligible overhead over programming with the native OpenCL API. Comment: 28 page
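
    A conceptual sketch of composing kernel stages behind message passing; it does not use the CAF or OpenCL APIs, and the StageActor type is invented for illustration. In CAF's OpenCL actors, each stage would be an OpenCL kernel and the intermediate buffers would stay resident on the device between stages.

```cpp
// Conceptual sketch (not the CAF API): an "actor" owns one processing stage
// and forwards its output to the next actor, mirroring how OpenCL kernels
// wrapped in actors can be composed into multi-stage pipelines.
#include <functional>
#include <iostream>
#include <memory>
#include <vector>

struct StageActor {
    std::function<std::vector<float>(std::vector<float>)> kernel; // stand-in for an OpenCL kernel
    std::shared_ptr<StageActor> next;                              // downstream actor, if any

    // "Message handler": run the stage, then pass the result on.
    void receive(std::vector<float> data) const {
        auto out = kernel(std::move(data));
        if (next) next->receive(std::move(out));
        else      std::cout << "pipeline done, " << out.size() << " elements\n";
    }
};

int main() {
    auto scale  = std::make_shared<StageActor>();
    auto offset = std::make_shared<StageActor>();
    scale->kernel  = [](std::vector<float> v) { for (auto& x : v) x *= 2.f; return v; };
    scale->next    = offset;
    offset->kernel = [](std::vector<float> v) { for (auto& x : v) x += 1.f; return v; };

    scale->receive(std::vector<float>(1024, 1.f));  // send the first message
}
```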

    O2ATH: An OpenMP Offloading Toolkit for the Sunway Heterogeneous Manycore Platform

    Full text link
    The next-generation Sunway supercomputer employs the SW26010pro processor, which features a specialized on-chip heterogeneous architecture. Applications with significant hotspots can benefit greatly from the computational capacity of the Sunway many-core architecture, at the cost of careful and intensive manual many-core parallelization effort. However, some legacy projects with large codebases, such as CESM, ROMS and WRF, contain numerous lines of code and do not have significant hotspots. The cost of manually porting such applications to the Sunway architecture is almost unaffordable. To overcome this challenge, we have developed a toolkit named O2ATH. O2ATH forwards GNU OpenMP runtime library calls to Sunway's Athread library, which greatly simplifies parallelization work on the Sunway architecture. O2ATH enables users to write both MPE and CPE code in a single file, and parallelization can be achieved by utilizing OpenMP directives and attributes. In practice, O2ATH has helped us port two large projects, CESM and ROMS, to the CPEs of the next-generation Sunway supercomputers via the OpenMP offload method. In the experiments, kernel speedups range from 3 to 15 times, resulting in whole-application speedups of 3 to 6 times. Furthermore, O2ATH requires significantly fewer code modifications compared to manually crafting CPE functions. This indicates that O2ATH can greatly enhance development efficiency when porting or optimizing large software projects on Sunway supercomputers. Comment: 15 pages, 6 figures, 5 tables
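
    For context, a generic OpenMP offload region of the kind such a toolkit translates; this is plain OpenMP 4.5+, not O2ATH-specific code, and the loop is a made-up hotspot: the directive moves the loop to the accelerator cores (the CPEs on Sunway) while the surrounding code stays on the management core (MPE).

```cpp
// Generic OpenMP offload sketch: map the arrays to the device, run the
// hotspot loop there, and copy the result array back to the host.
#include <cstdio>
#include <vector>

int main() {
    const int n = 1 << 20;
    std::vector<double> a(n, 1.0), b(n, 2.0), c(n, 0.0);
    double* pa = a.data();
    double* pb = b.data();
    double* pc = c.data();

    #pragma omp target teams distribute parallel for \
        map(to: pa[0:n], pb[0:n]) map(from: pc[0:n])
    for (int i = 0; i < n; ++i)
        pc[i] = 3.0 * pa[i] + pb[i];

    std::printf("c[0] = %f\n", pc[0]);   // expect 5.0
}
```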

    GPU Computing to Improve Game Engine Performance

    Get PDF
    Although the graphics processing unit (GPU) was originally designed to accelerate image creation for output to a display, today's general-purpose GPU (GPGPU) computing offers unprecedented performance by offloading computing-intensive portions of an application to the GPGPU while running the remainder of the code on the central processing unit (CPU). The highly parallel structure of a many-core GPGPU can process large blocks of data faster using multithreaded concurrent processing. A game engine has many "components", and multithreading can be used to implement their parallelism. However, effective implementation of multithreading on a multicore processor has challenges, such as data and task parallelism. In this paper, we investigate the impact of using a GPGPU together with a CPU to design high-performance game engines. First, we implement a separable convolution filter (heavily used in image processing) with the GPGPU. Then, we implement a multi-object interactive game console on an eight-core workstation using a multithreaded asynchronous model (MAM), a multithreaded synchronous model (MSM), and an MSM with data parallelism (MSMDP). According to the experimental results, speedups of about 61x and 5x are achieved by the GPGPU and MSMDP implementations, respectively. Therefore, GPGPU-assisted parallel computing has the potential to improve multithreaded game engine performance.
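
    A minimal CPU-side sketch of the separable convolution idea (illustrative only; the paper offloads it to the GPGPU): the 2-D filter is split into a horizontal 1-D pass followed by a vertical 1-D pass, which reduces the work per pixel from k*k to 2k multiplications and makes each pass easy to parallelize.

```cpp
// Separable convolution sketch: one horizontal 1-D pass, then one vertical
// 1-D pass, with clamping at the image border.
#include <algorithm>
#include <cstdio>
#include <vector>

using Image = std::vector<std::vector<float>>;

Image separable_blur(const Image& in, const std::vector<float>& k) {
    const int h = (int)in.size(), w = (int)in[0].size(), r = (int)k.size() / 2;
    Image tmp(h, std::vector<float>(w, 0.f)), out = tmp;

    for (int y = 0; y < h; ++y)                  // horizontal pass
        for (int x = 0; x < w; ++x)
            for (int t = -r; t <= r; ++t) {
                int xx = std::min(std::max(x + t, 0), w - 1);   // clamp at the border
                tmp[y][x] += k[t + r] * in[y][xx];
            }

    for (int y = 0; y < h; ++y)                  // vertical pass
        for (int x = 0; x < w; ++x)
            for (int t = -r; t <= r; ++t) {
                int yy = std::min(std::max(y + t, 0), h - 1);
                out[y][x] += k[t + r] * tmp[yy][x];
            }
    return out;
}

int main() {
    Image img(64, std::vector<float>(64, 1.f));          // toy 64x64 image
    std::vector<float> kernel = {0.25f, 0.5f, 0.25f};    // 1-D binomial kernel
    Image blurred = separable_blur(img, kernel);
    std::printf("blurred[0][0] = %f\n", blurred[0][0]);  // constant image stays 1.0
}
```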

    Small cell cloud proof of concept implementation and monitoring schemes analysis

    Get PDF
    Cloud Computing has grown exponentially in popularity in the last few years, becoming a key technology for both personal and enterprise applications due to the numerous benefits it offers. On the other hand, Small Cell technology is considered by many to be the solution to the challenges expected to arise from the continuously increasing number of interconnected mobile devices. This project presents a basic design and a proof-of-concept implementation of a Small Cell Cloud, a current research field in mobile communications that aims to leverage the parallel and distributed computation of Cloud Computing to enhance Small Cell functionality. The purpose of the described Small Cell Cloud is to allow application offloading from mobile devices to Small Cells, enabling the execution of more resource-demanding applications while reducing energy consumption on those devices. Furthermore, a detailed analysis of different Small Cell monitoring schemes is carried out, comparing the performance achieved with each of them in terms of data reliability and generated network traffic. Finally, based on the proof-of-concept implementation and a series of stress performance tests, conclusions on the viability of the proposed Small Cell Cloud design and the most appropriate monitoring scheme are presented. Guidelines for future research work are also provided, considering the work developed in this project as a first step towards a new mobile technology.
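
    A toy comparison of two monitoring schemes of the kind analyzed above, with hypothetical parameters and not the project's actual code: periodic polling sends a report every fixed interval, while event-driven reporting only sends one when the observed load drifts past a threshold, trading monitoring traffic against the freshness of the data at the controller.

```cpp
// Toy model of small-cell monitoring traffic: periodic polling vs.
// threshold-triggered (event-driven) reporting over one simulated hour.
#include <cmath>
#include <cstdio>
#include <cstdlib>

int main() {
    const int seconds = 3600;
    const int poll_interval = 10;      // hypothetical: poll every 10 s
    const double threshold = 0.15;     // hypothetical: report on a 15% change

    int poll_msgs = 0, event_msgs = 0;
    double load = 0.5, last_reported = load;
    std::srand(42);
    for (int t = 0; t < seconds; ++t) {
        load += (std::rand() % 100 - 50) / 1000.0;        // random walk of cell load
        if (load < 0.0) load = 0.0;
        if (load > 1.0) load = 1.0;
        if (t % poll_interval == 0) ++poll_msgs;          // scheme 1: periodic
        if (std::fabs(load - last_reported) > threshold) {// scheme 2: event-driven
            ++event_msgs;
            last_reported = load;
        }
    }
    std::printf("periodic: %d msgs, event-driven: %d msgs\n", poll_msgs, event_msgs);
}
```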

    Thirty Meter Telescope (TMT) Narrow Field Infrared Adaptive Optics System (NFIRAOS) real-time controller preliminary architecture

    Get PDF
    The Narrow Field Infrared Adaptive Optics System (NFIRAOS) is the first-light Adaptive Optics (AO) system for the Thirty Meter Telescope (TMT). A critical component of NFIRAOS is the Real-Time Controller (RTC) subsystem, which provides real-time wavefront correction by processing wavefront information to compute Deformable Mirror (DM) and Tip/Tilt Stage (TTS) commands. The National Research Council of Canada - Herzberg (NRC-H), in conjunction with TMT, has developed a preliminary design for the NFIRAOS RTC. The preliminary architecture for the RTC comprises several Linux-based servers. These servers are assigned various roles, including: the High-Order Processing (HOP) servers, the Wavefront Corrector Controller (WCC) server, the Telemetry Engineering Display (TED) server, the Persistent Telemetry Storage (PTS) server, and additional testing and spare servers. There are up to six HOP servers that accept high-order wavefront pixels and perform parallelized pixel processing and wavefront reconstruction to produce wavefront corrector error vectors. The WCC server performs low-order mode processing, and synchronizes and aggregates the high-order wavefront corrector error vectors from the HOP servers to generate wavefront corrector commands. The TED server is the RTC interface to TMT and other subsystems; it receives all external commands and dispatches them to the rest of the RTC servers, and is responsible for aggregating several offloading and telemetry values that are reported to other subsystems within NFIRAOS and TMT. The TED server also provides the engineering GUIs and real-time displays. The PTS server contains fault-tolerant data storage that receives and stores telemetry data, including data for Point-Spread Function Reconstruction (PSFR).
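
    An illustrative sketch (not the NFIRAOS RTC software) of the HOP/WCC split described above: several workers stand in for the HOP servers, each reconstructing a wavefront-corrector error vector from its share of the pixels, and the aggregation step stands in for the WCC server, which waits for all of them and combines the vectors into DM commands.

```cpp
// Illustrative only: parallel "HOP" workers produce error vectors; the
// "WCC" step synchronizes on all of them and aggregates into one command.
#include <future>
#include <iostream>
#include <vector>

std::vector<double> hop_worker(int id, int n_actuators) {
    // Placeholder for pixel processing + wavefront reconstruction.
    return std::vector<double>(n_actuators, 0.001 * id);
}

int main() {
    const int n_hop = 6, n_actuators = 4096;          // up to six HOP servers
    std::vector<std::future<std::vector<double>>> parts;
    for (int i = 0; i < n_hop; ++i)
        parts.push_back(std::async(std::launch::async, hop_worker, i, n_actuators));

    std::vector<double> dm_command(n_actuators, 0.0);  // "WCC": synchronize + aggregate
    for (auto& p : parts) {
        auto v = p.get();                              // blocks until this worker is done
        for (int j = 0; j < n_actuators; ++j) dm_command[j] += v[j];
    }
    std::cout << "first DM command value: " << dm_command[0] << '\n';
}
```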

    Software-only diverse redundancy on GPUs for autonomous driving platforms

    Get PDF
    Autonomous driving (AD) builds upon high-performance computing platforms including (1) general-purpose CPUs as well as (2) specific accelerators, with GPUs being one of the main representatives. Microcontrollers have reached ASIL-D compliance by implementing diverse redundancy with lockstep execution. However, ASIL-D compliant GPUs rely on either fully redundant lockstep GPUs (i.e. 2 GPUs), which doubles hardware costs, or fully redundant systems with a GPU and another accelerator, which virtually doubles design and validation/verification (V&V) costs. In this paper we analyze the degree of diversity achieved when implementing redundancy on a single GPU, showing that diverse redundancy is not achieved in many cases, and propose software strategies that guarantee diverse redundancy for any kernel on systems using commercial off-the-shelf (COTS) GPUs, thus showing how to achieve ASIL-D compliance on a single COTS GPU in controlled scenarios. This work has been partially supported by the Spanish Ministry of Economy and Competitiveness (MINECO) under grant TIN2015-65316-P and the HiPEAC Network of Excellence. Jaume Abella has been partially supported by the MINECO under Ramon y Cajal postdoctoral fellowship number RYC2013-14717.
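
    A conceptual sketch of software-only diverse redundancy, not the paper's implementation: the same computation is run twice under deliberately different work partitionings (standing in for two redundant GPU kernel launches scheduled onto disjoint hardware resources), and the two outputs are compared so that a fault affecting either copy becomes observable.

```cpp
// Conceptual sketch: two redundant runs of the same per-element "kernel"
// with different chunk schedules, followed by an output comparison.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <numeric>
#include <vector>

std::vector<int64_t> run_copy(const std::vector<int64_t>& in, bool reversed_chunks) {
    const size_t chunk = 1024;
    std::vector<int64_t> out(in.size());
    std::vector<size_t> starts;
    for (size_t s = 0; s < in.size(); s += chunk) starts.push_back(s);
    if (reversed_chunks) std::reverse(starts.begin(), starts.end());  // diverse schedule
    for (size_t s : starts)
        for (size_t i = s; i < std::min(s + chunk, in.size()); ++i)
            out[i] = 3 * in[i] + 1;                                   // per-element work
    return out;
}

int main() {
    std::vector<int64_t> data(1 << 16);
    std::iota(data.begin(), data.end(), 0);

    auto a = run_copy(data, false);   // redundant copy 1
    auto b = run_copy(data, true);    // redundant copy 2, different schedule

    std::cout << (a == b ? "outputs match: no fault detected\n"
                         : "mismatch: fault detected\n");
}
```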