66,374 research outputs found

    Overview of Swallow --- A Scalable 480-core System for Investigating the Performance and Energy Efficiency of Many-core Applications and Operating Systems

    Full text link
    We present Swallow, a scalable many-core architecture, with a current configuration of 480 x 32-bit processors. Swallow is an open-source architecture, designed from the ground up to deliver scalable increases in usable computational power to allow experimentation with many-core applications and the operating systems that support them. Scalability is enabled by the creation of a tile-able system with a low-latency interconnect, featuring an attractive communication-to-computation ratio and the use of a distributed memory configuration. We analyse the energy and computational and communication performances of Swallow. The system provides 240GIPS with each core consuming 71--193mW, dependent on workload. Power consumption per instruction is lower than almost all systems of comparable scale. We also show how the use of a distributed operating system (nOS) allows the easy creation of scalable software to exploit Swallow's potential. Finally, we show two use case studies: modelling neurons and the overlay of shared memory on a distributed memory system.Comment: An open source release of the Swallow system design and code will follow and references to these will be added at a later dat

    Actors: The Ideal Abstraction for Programming Kernel-Based Concurrency

    Get PDF
    GPU and multicore hardware architectures are commonly used in many different application areas to accelerate problem solutions relative to single CPU architectures. The typical approach to accessing these hardware architectures requires embedding logic into the programming language used to construct the application; the two primary forms of embedding are: calls to API routines to access the concurrent functionality, or pragmas providing concurrency hints to a language compiler such that particular blocks of code are targeted to the concurrent functionality. The former approach is verbose and semantically bankrupt, while the success of the latter approach is restricted to simple, static uses of the functionality. Actor-based applications are constructed from independent, encapsulated actors that interact through strongly-typed channels. This paper presents a first attempt at using actors to program kernels targeted at such concurrent hardware. Besides the glove-like fit of a kernel to the actor abstraction, quantitative code analysis shows that actor-based kernels are always significantly simpler than API-based coding, and generally simpler than pragma-based coding. Additionally, performance measurements show that the overheads of actor-based kernels are commensurate to API-based kernels, and range from equivalent to vastly improved for pragma-based annotations, both for sample and real-world applications

    A fine-grain time-sharing Time Warp system

    Get PDF
    Although Parallel Discrete Event Simulation (PDES) platforms relying on the Time Warp (optimistic) synchronization protocol already allow for exploiting parallelism, several techniques have been proposed to further favor performance. Among them we can mention optimized approaches for state restore, as well as techniques for load balancing or (dynamically) controlling the speculation degree, the latter being specifically targeted at reducing the incidence of causality errors leading to waste of computation. However, in state of the art Time Warp systems, events’ processing is not preemptable, which may prevent the possibility to promptly react to the injection of higher priority (say lower timestamp) events. Delaying the processing of these events may, in turn, give rise to higher incidence of incorrect speculation. In this article we present the design and realization of a fine-grain time-sharing Time Warp system, to be run on multi-core Linux machines, which makes systematic use of event preemption in order to dynamically reassign the CPU to higher priority events/tasks. Our proposal is based on a truly dual mode execution, application vs platform, which includes a timer-interrupt based support for bringing control back to platform mode for possible CPU reassignment according to very fine grain periods. The latter facility is offered by an ad-hoc timer-interrupt management module for Linux, which we release, together with the overall time-sharing support, within the open source ROOT-Sim platform. An experimental assessment based on the classical PHOLD benchmark and two real world models is presented, which shows how our proposal effectively leads to the reduction of the incidence of causality errors, as compared to traditional Time Warp, especially when running with higher degrees of parallelism

    Design and Implementation of MPICH2 over InfiniBand with RDMA Support

    Full text link
    For several years, MPI has been the de facto standard for writing parallel applications. One of the most popular MPI implementations is MPICH. Its successor, MPICH2, features a completely new design that provides more performance and flexibility. To ensure portability, it has a hierarchical structure based on which porting can be done at different levels. In this paper, we present our experiences designing and implementing MPICH2 over InfiniBand. Because of its high performance and open standard, InfiniBand is gaining popularity in the area of high-performance computing. Our study focuses on optimizing the performance of MPI-1 functions in MPICH2. One of our objectives is to exploit Remote Direct Memory Access (RDMA) in Infiniband to achieve high performance. We have based our design on the RDMA Channel interface provided by MPICH2, which encapsulates architecture-dependent communication functionalities into a very small set of functions. Starting with a basic design, we apply different optimizations and also propose a zero-copy-based design. We characterize the impact of our optimizations and designs using microbenchmarks. We have also performed an application-level evaluation using the NAS Parallel Benchmarks. Our optimized MPICH2 implementation achieves 7.6 ÎĽ\mus latency and 857 MB/s bandwidth, which are close to the raw performance of the underlying InfiniBand layer. Our study shows that the RDMA Channel interface in MPICH2 provides a simple, yet powerful, abstraction that enables implementations with high performance by exploiting RDMA operations in InfiniBand. To the best of our knowledge, this is the first high-performance design and implementation of MPICH2 on InfiniBand using RDMA support.Comment: 12 pages, 17 figure

    Transparent multi-core speculative parallelization of DES models with event and cross-state dependencies

    Get PDF
    In this article we tackle transparent parallelization of Discrete Event Simulation (DES) models to be run on top of multi-core machines according to speculative schemes. The innovation in our proposal lies in that we consider a more general programming and execution model, compared to the one targeted by state of the art PDES platforms, where the boundaries of the state portion accessible while processing an event at a specific simulation object do not limit access to the actual object state, or to shared global variables. Rather, the simulation object is allowed to access (and alter) the state of any other object, thus causing what we term cross-state dependency. We note that this model exactly complies with typical (easy to manage) sequential-style DES programming, where a (dynamically-allocated) state portion of object A can be accessed by object B in either read or write mode (or both) by, e.g., passing a pointer to B as the payload of a scheduled simulation event. However, while read/write memory accesses performed in the sequential run are always guaranteed to observe (and to give rise to) a consistent snapshot of the state of the simulation model, consistency is not automatically guaranteed in case of parallelization and concurrent execution of simulation objects with cross-state dependencies. We cope with such a consistency issue, and its application-transparent support, in the context of parallel and optimistic executions. This is achieved by introducing an advanced memory management architecture, able to efficiently detect read/write accesses by concurrent objects to whichever object state in an application transparent manner, together with advanced synchronization mechanisms providing the advantage of exploiting parallelism in the underlying multi-core architecture while transparently handling both cross-state and traditional event-based dependencies. Our proposal targets Linux and has been integrated with the ROOT-Sim open source optimistic simulation platform, although its design principles, and most parts of the developed software, are of general relevance. Copyright 2014 ACM

    Accelerating sequential programs using FastFlow and self-offloading

    Full text link
    FastFlow is a programming environment specifically targeting cache-coherent shared-memory multi-cores. FastFlow is implemented as a stack of C++ template libraries built on top of lock-free (fence-free) synchronization mechanisms. In this paper we present a further evolution of FastFlow enabling programmers to offload part of their workload on a dynamically created software accelerator running on unused CPUs. The offloaded function can be easily derived from pre-existing sequential code. We emphasize in particular the effective trade-off between human productivity and execution efficiency of the approach.Comment: 17 pages + cove

    A GPU based real-time software correlation system for the Murchison Widefield Array prototype

    Full text link
    Modern graphics processing units (GPUs) are inexpensive commodity hardware that offer Tflop/s theoretical computing capacity. GPUs are well suited to many compute-intensive tasks including digital signal processing. We describe the implementation and performance of a GPU-based digital correlator for radio astronomy. The correlator is implemented using the NVIDIA CUDA development environment. We evaluate three design options on two generations of NVIDIA hardware. The different designs utilize the internal registers, shared memory and multiprocessors in different ways. We find that optimal performance is achieved with the design that minimizes global memory reads on recent generations of hardware. The GPU-based correlator outperforms a single-threaded CPU equivalent by a factor of 60 for a 32 antenna array, and runs on commodity PC hardware. The extra compute capability provided by the GPU maximises the correlation capability of a PC while retaining the fast development time associated with using standard hardware, networking and programming languages. In this way, a GPU-based correlation system represents a middle ground in design space between high performance, custom built hardware and pure CPU-based software correlation. The correlator was deployed at the Murchison Widefield Array 32 antenna prototype system where it ran in real-time for extended periods. We briefly describe the data capture, streaming and correlation system for the prototype array.Comment: 11 pages, to appear in PAS
    • …
    corecore