
    Hera-JVM: abstracting processor heterogeneity behind a virtual machine

    Heterogeneous multi-core processors, such as the Cell processor, can deliver exceptional performance; however, they are notoriously difficult to program effectively. We present Hera-JVM, a runtime system which hides a processor’s heterogeneity behind a homogeneous virtual machine interface. Preliminary results of three benchmarks running under Hera-JVM are presented. These results suggest a set of application behaviour characteristics that the runtime system should take into account when placing threads on different core types.
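    For illustration only (this is not Hera-JVM's mechanism, which lives inside the JVM rather than in user code), the general idea of steering a thread toward a core type based on an observed behaviour characteristic can be sketched in C with Pthreads affinity; the core numbering and the "memory intensity" metric below are assumptions.

```c
/* Illustrative sketch: place a worker thread on a "fast" or "simple" core
 * based on a crude behaviour metric. Core layout and metric are hypothetical;
 * Hera-JVM makes this decision inside the virtual machine, not via affinity. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

enum core_type { FAST_CORE = 0, SIMPLE_CORE = 4 };   /* assumed core layout */

static void *worker(void *arg) {
    (void)arg;
    /* ... application work ... */
    return NULL;
}

/* Pin a thread to a core type chosen from an observed behaviour characteristic,
 * e.g. the ratio of memory stalls to compute (a stand-in for the paper's metrics). */
static int place_thread(pthread_t tid, double memory_intensity) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(memory_intensity > 0.5 ? SIMPLE_CORE : FAST_CORE, &set);
    return pthread_setaffinity_np(tid, sizeof(set), &set);
}

int main(void) {
    pthread_t tid;
    pthread_create(&tid, NULL, worker, NULL);
    if (place_thread(tid, 0.7) != 0)
        fprintf(stderr, "could not set thread affinity\n");
    pthread_join(tid, NULL);
    return 0;
}
```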

    FPGA-Based CNN Inference Accelerator Synthesized from Multi-Threaded C Software

    A deep-learning inference accelerator is synthesized from a C-language software program parallelized with Pthreads. The software implementation uses the well-known producer/consumer model with parallel threads interconnected by FIFO queues. The LegUp high-level synthesis (HLS) tool synthesizes threads into parallel FPGA hardware, translating software parallelism into spatial parallelism. A complete system is generated where convolution, pooling and padding are realized in the synthesized accelerator, with remaining tasks executing on an embedded ARM processor. The accelerator incorporates reduced precision and a novel approach for zero-weight-skipping in convolution. On a mid-sized Intel Arria 10 SoC FPGA, peak performance on VGG-16 is 138 effective GOPS.
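    The producer/consumer structure described above is the standard Pthreads pattern of pipeline stages linked by bounded FIFO queues. The sketch below is not the paper's code or LegUp's API; the queue depth, element type, and stage roles are assumptions.

```c
/* Minimal producer/consumer pipeline in Pthreads (illustrative; sizes and types assumed). */
#include <pthread.h>
#include <stdio.h>

#define FIFO_DEPTH 16

typedef struct {
    int buf[FIFO_DEPTH];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_full, not_empty;
} fifo_t;

static fifo_t q = { .lock = PTHREAD_MUTEX_INITIALIZER,
                    .not_full = PTHREAD_COND_INITIALIZER,
                    .not_empty = PTHREAD_COND_INITIALIZER };

static void fifo_push(fifo_t *f, int v) {
    pthread_mutex_lock(&f->lock);
    while (f->count == FIFO_DEPTH)                 /* block while the queue is full */
        pthread_cond_wait(&f->not_full, &f->lock);
    f->buf[f->tail] = v;
    f->tail = (f->tail + 1) % FIFO_DEPTH;
    f->count++;
    pthread_cond_signal(&f->not_empty);
    pthread_mutex_unlock(&f->lock);
}

static int fifo_pop(fifo_t *f) {
    pthread_mutex_lock(&f->lock);
    while (f->count == 0)                          /* block while the queue is empty */
        pthread_cond_wait(&f->not_empty, &f->lock);
    int v = f->buf[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    pthread_cond_signal(&f->not_full);
    pthread_mutex_unlock(&f->lock);
    return v;
}

/* Producer stage: e.g. feeds input feature-map values downstream. */
static void *producer(void *arg) {
    (void)arg;
    for (int i = 0; i < 64; i++)
        fifo_push(&q, i);
    fifo_push(&q, -1);                             /* sentinel: end of stream */
    return NULL;
}

/* Consumer stage: e.g. a convolution worker reading from the FIFO. */
static void *consumer(void *arg) {
    (void)arg;
    for (int v; (v = fifo_pop(&q)) != -1; )
        printf("got %d\n", v);
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```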

    A load-sharing architecture for high performance optimistic simulations on multi-core machines

    In Parallel Discrete Event Simulation (PDES), the simulation model is partitioned into a set of distinct Logical Processes (LPs) which are allowed to concurrently execute simulation events. In this work we present an innovative approach to load-sharing on multi-core/multiprocessor machines, targeted at the optimistic PDES paradigm, where LPs are speculatively allowed to process simulation events with no preventive verification of causal consistency, and actual consistency violations (if any) are recovered via rollback techniques. In our approach, each simulation kernel instance, in charge of hosting and executing a specific set of LPs, runs a set of worker threads, which can be dynamically activated/deactivated on the basis of a distributed algorithm. The latter relies in turn on an analytical model that provides indications on how to reassign processor/core usage across the kernels in order to handle the simulation workload as efficiently as possible. We also present a real implementation of our load-sharing architecture within the ROme OpTimistic Simulator (ROOT-Sim), an open-source, C-based simulation platform implemented according to the PDES paradigm and the optimistic synchronization approach. Experimental results assessing the validity of our proposal are also presented.
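    As a rough illustration of the shape of such a kernel instance (not ROOT-Sim's actual algorithm or analytical model), the sketch below grows and shrinks the set of active worker threads according to a load estimate; the metric, threshold, and names are assumptions.

```c
/* Illustrative sketch: a kernel instance that activates/deactivates worker
 * threads according to a load estimate. The load metric and thresholds are
 * assumptions; ROOT-Sim's real policy comes from its analytical model. */
#include <pthread.h>
#include <stdatomic.h>
#include <unistd.h>

#define MAX_WORKERS 8

static atomic_int active_workers = 2;      /* how many workers may process events */
static atomic_int stop_flag = 0;

/* Placeholder for speculative processing of one LP event. */
static void process_next_event(int id) { (void)id; usleep(1000); }

static void *worker(void *arg) {
    int id = (int)(long)arg;
    while (!atomic_load(&stop_flag)) {
        if (id < atomic_load(&active_workers))
            process_next_event(id);        /* active: optimistic event processing */
        else
            usleep(10000);                 /* deactivated: yield the core */
    }
    return NULL;
}

/* Called periodically with a measured load estimate (events/sec, rollback rate, ...). */
static void rebalance(double load_estimate) {
    int target = load_estimate > 0.75 ? MAX_WORKERS : MAX_WORKERS / 2;   /* assumed rule */
    atomic_store(&active_workers, target);
}

int main(void) {
    pthread_t t[MAX_WORKERS];
    for (long i = 0; i < MAX_WORKERS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    rebalance(0.9);                        /* pretend the model reported a high load */
    sleep(1);
    atomic_store(&stop_flag, 1);
    for (int i = 0; i < MAX_WORKERS; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```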

    Mixing multi-core CPUs and GPUs for scientific simulation software

    Recent technological and economic developments have led to widespread availability of multi-core CPUs and specialist accelerator processors such as graphical processing units (GPUs). The accelerated computational performance possible from these devices can be very high for some applications paradigms. Software languages and systems such as NVIDIA's CUDA and Khronos consortium's open compute language (OpenCL) support a number of individual parallel application programming paradigms. To scale up the performance of some complex systems simulations, a hybrid of multi-core CPUs for coarse-grained parallelism and very many core GPUs for data parallelism is necessary. We describe our use of hybrid applications using threading approaches and multi-core CPUs to control independent GPU devices. We present speed-up data and discuss multi-threading software issues for the applications level programmer and offer some suggested areas for language development and integration between coarse-grained and fine-grained multi-thread systems. We discuss results from three common simulation algorithmic areas including: partial differential equations; graph cluster metric calculations and random number generation. We report on programming experiences and selected performance for these algorithms on: single and multiple GPUs; multi-core CPUs; a CellBE; and using OpenCL. We discuss programmer usability issues and the outlook and trends in multi-core programming for scientific applications developers.
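    The common shape of such hybrid codes is one host thread per GPU device. A minimal sketch of that pattern, using OpenMP threads and the CUDA runtime API, is shown below; run_simulation_slice() is a hypothetical stand-in for the application's per-device kernel launches, not a function from the paper.

```c
/* Sketch of the "one host thread per GPU" pattern: coarse-grained parallelism
 * across CPU threads, fine-grained data parallelism inside each GPU. */
#include <omp.h>
#include <stdio.h>
#include <cuda_runtime.h>

/* Hypothetical stand-in for this thread's share of the simulation work. */
static void run_simulation_slice(int device, int num_devices) {
    printf("device %d of %d: launch kernels here\n", device, num_devices);
}

int main(void) {
    int num_devices = 0;
    cudaGetDeviceCount(&num_devices);
    if (num_devices < 1) {
        fprintf(stderr, "no CUDA devices found\n");
        return 1;
    }

    /* One CPU thread drives each independent GPU device. */
    #pragma omp parallel num_threads(num_devices)
    {
        int dev = omp_get_thread_num();
        cudaSetDevice(dev);                       /* bind this host thread to its GPU */
        run_simulation_slice(dev, num_devices);   /* data-parallel work on the device */
        cudaDeviceSynchronize();
    }
    return 0;
}
```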

    Parallelization of a Six Degree of Freedom Entry Vehicle Trajectory Simulation Using OpenMP and OpenACC

    The art and science of writing parallelized software, using methods such as Open Multi-Processing (OpenMP) and Open Accelerators (OpenACC), is dominated by computer scientists. Engineers and non-computer scientists looking to apply these techniques to their project applications face a steep learning curve, especially when looking to adapt their original single threaded software to run multi-threaded on graphics processing units (GPUs). There are significant changes in mindset that must occur, such as how to manage memory, the organization of instructions, and the use of if statements (also known as branching). The purpose of this work is twofold: 1) to demonstrate the applicability of parallelized coding methodologies, OpenMP and OpenACC, to tasks outside of the typical large scale matrix mathematics; and 2) to discuss, from an engineer's perspective, the lessons learned from parallelizing software using these computer science techniques. This work applies OpenMP, on both multi-core central processing units (CPUs) and the Intel Xeon Phi 7210, and OpenACC on GPUs. These parallelization techniques are used to tackle the simulation of thousands of entry vehicle trajectories through the integration of six degree of freedom (DoF) equations of motion (EoM). The forces and moments acting on the entry vehicle, and used by the EoM, are estimated using multiple models of varying levels of complexity. Several benchmark comparisons are made on the execution of six DoF trajectory simulation: single thread Intel Xeon E5-2670 CPU, multi-thread CPU using OpenMP, multi-thread Xeon Phi 7210 using OpenMP, and multi-thread NVIDIA Tesla K40 GPU using OpenACC. These benchmarks are run on the Pleiades Supercomputer Cluster at the National Aeronautics and Space Administration (NASA) Ames Research Center (ARC), and a Xeon Phi 7210 node at NASA Langley Research Center (LaRC).
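    Since each trajectory is independent, this kind of workload parallelizes over the outer loop. The sketch below shows the general pattern with an OpenMP parallel-for (an OpenACC build would use an "acc parallel loop" pragma in the same place); the simplified point-mass state and Euler step are illustrative stand-ins for the paper's full 6-DoF equations of motion and force models.

```c
/* Illustrative only: parallelizing many independent trajectory integrations.
 * The 3-DoF point-mass state and Euler step are simplifications; the work
 * described above integrates full 6-DoF equations of motion. */
#include <stdio.h>

#define N_TRAJ  10000
#define N_STEPS 5000

typedef struct { double pos[3], vel[3]; } state_t;   /* simplified state */

static void step(state_t *s, double dt) {            /* placeholder integrator */
    for (int k = 0; k < 3; k++) {
        s->pos[k] += s->vel[k] * dt;
        s->vel[k] += (k == 2 ? -9.81 : 0.0) * dt;    /* gravity only */
    }
}

int main(void) {
    static state_t traj[N_TRAJ];
    for (int i = 0; i < N_TRAJ; i++)
        traj[i].vel[0] = 100.0 + i * 0.01;           /* vary initial conditions */

    /* Trajectories are independent, so the outer loop parallelizes cleanly.
     * OpenMP version (CPU / Xeon Phi); an OpenACC build would instead use
     * "#pragma acc parallel loop" here. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N_TRAJ; i++)
        for (int t = 0; t < N_STEPS; t++)
            step(&traj[i], 0.01);

    printf("traj[0] final x = %f\n", traj[0].pos[0]);
    return 0;
}
```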

    Development of a real-time full-field range imaging system

    This article describes the development of a full-field range imaging system employing a high frequency amplitude modulated light source and image sensor. Depth images are produced at video frame rates in which each pixel in the image represents distance from the sensor to objects in the scene. The various hardware subsystems are described, as are the details of the firmware and software implementation for processing the images in real-time. The system is flexible in that precision can be traded off for decreased acquisition time. Results are reported to illustrate this versatility for both high-speed (reduced precision) and high-precision operating modes.
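    Amplitude-modulated range imagers of this kind commonly recover distance from the phase shift of the returned modulation envelope, estimated per pixel from four samples taken 90 degrees apart. The sketch below shows that standard calculation as an illustration; the 40 MHz modulation frequency and sample values are example assumptions, not figures from the article.

```c
/* Common four-phase-step range calculation for amplitude-modulated (AMCW)
 * range imaging; the modulation frequency is an example value only. */
#include <math.h>
#include <stdio.h>

#define C_LIGHT 299792458.0          /* speed of light, m/s */
#define F_MOD   40.0e6               /* assumed modulation frequency, Hz */

/* a0..a3: pixel intensity sampled at 0, 90, 180, 270 degrees of modulation phase. */
static double range_from_samples(double a0, double a1, double a2, double a3) {
    double phase = atan2(a3 - a1, a0 - a2);           /* phase of returned envelope */
    if (phase < 0.0) phase += 2.0 * M_PI;             /* wrap into [0, 2*pi) */
    return (C_LIGHT * phase) / (4.0 * M_PI * F_MOD);  /* half the round-trip path */
}

int main(void) {
    /* Example pixel values: a half-cycle phase shift, about 1.87 m at 40 MHz. */
    printf("range = %.3f m\n", range_from_samples(1.0, 2.0, 3.0, 2.0));
    return 0;
}
```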