
    The Parallel Persistent Memory Model

    We consider a parallel computational model that consists of $P$ processors, each with a fast local ephemeral memory of limited size, and sharing a large persistent memory. The model allows for each processor to fault with bounded probability, and possibly restart. On faulting all processor state and local ephemeral memory are lost, but the persistent memory remains. This model is motivated by upcoming non-volatile memories that are as fast as existing random access memory, are accessible at the granularity of cache lines, and have the capability of surviving power outages. It is further motivated by the observation that in large parallel systems, failure of processors and their caches is not unusual. Within the model we develop a framework for developing locality efficient parallel algorithms that are resilient to failures. There are several challenges, including the need to recover from failures, the desire to do this in an asynchronous setting (i.e., not blocking other processors when one fails), and the need for synchronization primitives that are robust to failures. We describe approaches to solve these challenges based on breaking computations into what we call capsules, which have certain properties, and developing a work-stealing scheduler that functions properly within the context of failures. The scheduler guarantees a time bound of $O(W/P_A + D(P/P_A) \lceil\log_{1/f} W\rceil)$ in expectation, where $W$ and $D$ are the work and depth of the computation (in the absence of failures), $P_A$ is the average number of processors available during the computation, and $f \le 1/2$ is the probability that a capsule fails. Within the model and using the proposed methods, we develop efficient algorithms for parallel sorting and other primitives. Comment: This paper is the full version of a paper at SPAA 2018 with the same name.
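    To get a feel for the stated time bound, the expression inside the $O(\cdot)$ can be evaluated for concrete parameters. The sketch below is only an illustration: the function names `capsule_retry_factor` and `expected_time_bound` are hypothetical, and the constant factor hidden by the asymptotic notation is ignored, so the result is a relative figure, not a running-time prediction.

```python
def capsule_retry_factor(work, fail_prob):
    """Return ceil(log_{1/f} W) as the smallest k with W * f**k <= 1.

    Repeated multiplication is used instead of math.log so that exact
    powers (e.g. W = 2**20, f = 1/2) are not pushed over by rounding.
    """
    if not 0 < fail_prob <= 0.5:
        raise ValueError("the abstract assumes 0 < f <= 1/2")
    if work < 1:
        raise ValueError("work must be at least 1")
    k, remaining = 0, float(work)
    while remaining > 1:
        remaining *= fail_prob
        k += 1
    return k


def expected_time_bound(work, depth, procs, avg_procs, fail_prob):
    """Evaluate W/P_A + D * (P/P_A) * ceil(log_{1/f} W), constants ignored."""
    retry = capsule_retry_factor(work, fail_prob)
    return work / avg_procs + depth * (procs / avg_procs) * retry


if __name__ == "__main__":
    # Assumed example values: W = 2**20, D = 20, P = 16, P_A = 8, f = 1/2.
    print(expected_time_bound(2**20, 20, 16, 8, 0.5))
```

    With these assumed values the work term $W/P_A$ dominates; the depth term grows as fewer of the $P$ processors are available on average or as the capsule failure probability $f$ approaches $1/2$.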

    Load sharing for optimistic parallel simulations on multicore machines

    Parallel Discrete Event Simulation (PDES) is based on partitioning the simulation model into distinct Logical Processes (LPs), each modeling a portion of the entire system, which are allowed to execute simulation events concurrently. This makes it possible to exploit parallel computing architectures to speed up model execution and to make very large models tractable. In this article we address the optimistic approach to PDES, where LPs are allowed to process their events concurrently in a speculative fashion, and rollback/recovery techniques are used to guarantee state consistency in case of causality violations along the speculative execution path. In particular, we present an innovative load sharing approach aimed at optimizing resource usage for fruitful simulation work when running an optimistic PDES environment on top of multi-processor/multi-core machines. Beyond providing the load sharing model, we also define a load-sharing-oriented architectural scheme, based on a symmetric multi-threaded organization of the simulation platform. Finally, we present a real implementation of the load sharing architecture within the open source ROme OpTimistic Simulator (ROOT-Sim) package. Experimental data assessing both the viability and the effectiveness of our proposal are presented as well. Copyright is held by author/owner(s).
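    The speculative processing and rollback described above can be illustrated with a toy logical process. This is a minimal sketch of the general optimistic idea, not ROOT-Sim's API or the paper's load sharing scheme: the class `LogicalProcess`, its `receive` method, and the state-update rule are all hypothetical, and real optimistic simulators also have to cancel messages sent by undone events, which is omitted here.

```python
class LogicalProcess:
    """Toy LP: processes events as they arrive and rolls back on stragglers."""

    def __init__(self):
        self.state = 0
        self.lvt = 0.0          # local virtual time
        self.processed = []     # (timestamp, value, state_before_event)
        self.undone_events = 0  # how many events had to be re-executed

    @staticmethod
    def _apply(state, value):
        # Deliberately order-sensitive so that causality violations matter.
        return state * 2 + value

    def receive(self, timestamp, value):
        # Rollback: undo every processed event that is later than the new one.
        redo = []
        while self.processed and self.processed[-1][0] > timestamp:
            old_ts, old_value, state_before = self.processed.pop()
            self.state = state_before
            redo.append((old_ts, old_value))
            self.undone_events += 1
        # Process the new event, then re-execute the undone ones in order.
        for ts, val in [(timestamp, value)] + redo[::-1]:
            self.processed.append((ts, val, self.state))
            self.state = self._apply(self.state, val)
            self.lvt = ts


if __name__ == "__main__":
    lp = LogicalProcess()
    for ts, val in [(1.0, 1), (3.0, 3), (2.0, 2)]:  # (2.0, 2) is a straggler
        lp.receive(ts, val)
    print(lp.state, lp.lvt, lp.undone_events)
```

    The final state matches what a strictly timestamp-ordered execution would produce, at the cost of one re-executed event; that wasted work is the kind of overhead that resource-usage optimizations in optimistic PDES try to limit.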

    A C-DAG task model for scheduling complex real-time tasks on heterogeneous platforms: preemption matters

    Recent commercial hardware platforms for embedded real-time systems feature heterogeneous processing units and computing accelerators on the same System-on-Chip. When designing complex real-time applications for such architectures, the designer needs to make a number of difficult choices: on which processor should a certain task be implemented? Should a component be implemented in parallel or sequentially? These choices may have a great impact on feasibility, as differences in the processors' internal architectures affect the tasks' execution time and preemption cost. To help the designer explore the wide space of design choices and tune the scheduling parameters, in this paper we propose a novel real-time application model, called C-DAG, specifically conceived for heterogeneous platforms. A C-DAG allows specifying alternative implementations of the same component of an application for different processing engines, to be selected off-line, as well as conditional branches that model if-then-else statements, to be selected at run-time. We also propose a schedulability analysis for the C-DAG model and a heuristic allocation algorithm so that all deadlines are respected. Our analysis takes into account the cost of preempting a task, which can be non-negligible on certain processors. We demonstrate the effectiveness of our approach on a large set of synthetic experiments by comparing with state-of-the-art algorithms in the literature.

    Simulating a small turboshaft engine in real-time multiprocessor simulator (RTMPS) environment

    A Real-Time Multiprocessor Simulator (RTMPS) has been developed at NASA Lewis Research Center. The RTMPS uses parallel microprocessors to achieve computing speeds needed for real-time engine simulation. This report describes the use of the RTMPS system to simulate a small turboshaft engine. The process of programming the engine equations and distributing them over one, two, and four processors is discussed. Steady-state and transient results from the RTMPS simulation are compared with results from a main-frame-based simulation. Processor execution times and the associated execution time savings for the two and four processor cases are presented using actual data obtained from the RTMPS system. Included is a discussion of why the minimum achievable calculation time for the turboshaft engine model was attained using four processors. Finally, future enhancements to the RTMPS system are discussed including the development of a generalized partitioning algorithm to automatically distribute the system equations among the processors in optimum fashion

    From Dyson to Hopfield: Processing on hierarchical networks

    We consider statistical-mechanical models for spin systems built on hierarchical structures, which provide a simple example of a non-mean-field framework. We show that the decay of the coupling with spin distance can give rise to peculiar features and to phase diagrams much richer than their mean-field counterparts. In particular, we consider the Dyson model, mimicking ferromagnetism in lattices, and we prove the existence of a number of metastable states, beyond the ordered state, which become stable in the thermodynamic limit. Such a feature is retained when the hierarchical structure is coupled with the Hebb rule for learning, hence mimicking the modular architecture of neurons, and gives rise to an associative network able to perform both as a serial processor and as a parallel processor, depending crucially on the external stimuli and on the rate of interaction decay with distance; however, those emergent multitasking features reduce the network capacity with respect to the mean-field counterpart. The analysis is accomplished through statistical mechanics, graph theory, the signal-to-noise technique, and numerical simulations, all in full agreement. Our results shed light on the biological complexity shown by real networks, and suggest future directions for understanding more realistic models.

    Macroservers: An Execution Model for DRAM Processor-In-Memory Arrays

    The emergence of semiconductor fabrication technology allowing a tight coupling between high-density DRAM and CMOS logic on the same chip has led to the important new class of Processor-In-Memory (PIM) architectures. Newer developments provide powerful parallel processing capabilities on the chip, exploiting the facility to load wide words in single memory accesses and supporting complex address manipulations in the memory. Furthermore, large arrays of PIMs can be arranged into a massively parallel architecture. In this report, we describe an object-based programming model based on the notion of a macroserver. Macroservers encapsulate a set of variables and methods; threads, spawned by the activation of methods, operate asynchronously on the variables' state space. Data distributions provide a mechanism for mapping large data structures across the memory region of a macroserver, while work distributions allow explicit control of bindings between threads and data. Both data and work distributions are first-class objects of the model, supporting the dynamic management of data and threads in memory. This offers the flexibility required for fully exploiting the processing power and memory bandwidth of a PIM array, in particular for irregular and adaptive applications. Thread synchronization is based on atomic methods, condition variables, and futures. A special type of lightweight macroserver allows the formulation of flexible scheduling strategies for the access to resources, using a monitor-like mechanism.

    UPPAAL in practice: quantitative verification of a RapidIO network

    Packet-switched networks are widely used for interconnecting distributed computing platforms. RapidIO (Rapid Input/Output) is an industry standard for packet-switched networks that interconnect multiple processor boards. Key performance metrics for these platforms include average-case and worst-case packet transfer latencies. We focus on verifying such quantitative properties for a RapidIO-based multi-processor platform that executes a motion control application. A performance model is available in the Parallel Object-Oriented Specification Language (POOSL) that provides simulation-based estimates. However, because the application is time-critical, the exact worst-case latency must be determined. A model checking approach that transforms the POOSL model into an UPPAAL model was proposed in our previous work, but that approach only works for fairly small systems. We extend the transformation approach with various heuristics to reduce the underlying state space, thereby providing an effective approximation approach that scales to industrial problems of reasonable complexity.

    Self-stabilizing wormhole routing

    Parallel and distributed systems are composed of individual processors that communicate with one another by exchanging messages through communication links. When the sender and the receiver of a message are not direct neighbors, intermediate processors must cooperate to ensure proper routing. Wormhole routing, in which messages are sent in small fragments called flits, is the most common approach in parallel architectures. We assume that each processor contains a single fixed-size flit buffer for each incoming link. A processor must forward the flit in a given link buffer to another processor before receiving another flit on that link. This permits messages to wind through the entire network from source to destination, resembling a worm. Wormhole routing is a lightweight and efficient method of routing messages between parallel processors. Our purpose is to modify existing wormhole routing algorithms in familiar topologies to make them self-stabilizing. Self-stabilization is a technique that guarantees tolerance to transient faults (e.g. memory corruption or communication hazard) for a given protocol. Transient faults would typically place the network in an illegitimate state, while self-stabilization guarantees that the network recovers correct behavior in finite time, without the need for human intervention. Self-stabilization also guarantees the safety property, meaning that once the network is in a legitimate state, it will remain there until another fault occurs. This paper presents self-stabilizing network algorithms in the wormhole routing model, using the unidirectional ring and the two-dimensional mesh topologies. We chose the ring topology to illustrate the numerous difficulties of self-stabilization in a wormhole routing environment, even in one of the simplest network topologies. We then extend the results of the ring topology to a more complex two-dimensional mesh network.
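    The buffer constraint in this abstract (one flit buffer per incoming link, which must be emptied before the next flit can arrive) can be sketched on a unidirectional ring. The code below is a fault-free toy simulation only; it does not implement the paper's self-stabilizing algorithms. The function name `simulate_ring`, the synchronous step model, and the assumption that the source and destination differ are all choices made for this illustration.

```python
def simulate_ring(num_nodes, src, dst, num_flits):
    """Send one message of num_flits flits from src to dst on a one-way ring.

    Each node i owns a single buffer for its only incoming link. A flit
    moves from node i to node (i + 1) % num_nodes only if that node's
    buffer is empty. Returns (steps_taken, flits_in_delivery_order).
    """
    if src == dst:
        raise ValueError("this sketch assumes src != dst")
    buffers = [None] * num_nodes      # buffers[i]: flit waiting at node i
    next_to_inject = 0
    delivered = []
    steps = 0
    while len(delivered) < num_flits:
        steps += 1
        # Sweep upstream starting at the destination, so a buffer freed in
        # this step can be refilled by the flit right behind it (pipelining).
        for offset in range(num_nodes):
            node = (dst - offset) % num_nodes
            flit = buffers[node]
            if flit is None:
                continue
            if node == dst:
                delivered.append(flit)      # consume at the destination
                buffers[node] = None
            else:
                nxt = (node + 1) % num_nodes
                if buffers[nxt] is None:    # forward only into an empty buffer
                    buffers[nxt] = flit
                    buffers[node] = None
        # The source injects the next flit into its successor's buffer.
        first_hop = (src + 1) % num_nodes
        if next_to_inject < num_flits and buffers[first_hop] is None:
            buffers[first_hop] = next_to_inject
            next_to_inject += 1
    return steps, delivered


if __name__ == "__main__":
    print(simulate_ring(num_nodes=8, src=1, dst=5, num_flits=3))
```

    In this fault-free setting the flits arrive in order and stay contiguous along the path, which is the worm-like behavior the abstract describes; a transient fault that corrupts a buffer would break that invariant, which is the situation self-stabilization is meant to repair.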