166,220 research outputs found
The Parallel Persistent Memory Model
We consider a parallel computational model that consists of processors,
each with a fast local ephemeral memory of limited size, and sharing a large
persistent memory. The model allows for each processor to fault with bounded
probability, and possibly restart. On faulting all processor state and local
ephemeral memory are lost, but the persistent memory remains. This model is
motivated by upcoming non-volatile memories that are as fast as existing random
access memory, are accessible at the granularity of cache lines, and have the
capability of surviving power outages. It is further motivated by the
observation that in large parallel systems, failure of processors and their
caches is not unusual.
Within the model we develop a framework for developing locality efficient
parallel algorithms that are resilient to failures. There are several
challenges, including the need to recover from failures, the desire to do this
in an asynchronous setting (i.e., not blocking other processors when one
fails), and the need for synchronization primitives that are robust to
failures. We describe approaches to solve these challenges based on breaking
computations into what we call capsules, which have certain properties, and
developing a work-stealing scheduler that functions properly within the context
of failures. The scheduler guarantees a time bound of in expectation, where and are the work and
depth of the computation (in the absence of failures), is the average
number of processors available during the computation, and is the
probability that a capsule fails. Within the model and using the proposed
methods, we develop efficient algorithms for parallel sorting and other
primitives.Comment: This paper is the full version of a paper at SPAA 2018 with the same
nam
Load sharing for optimistic parallel simulations on multicore machines
Parallel Discrete Event Simulation (PDES) is based on the partitioning of the simulation model into distinct Logical Processes (LPs), each one modeling a portion of the entire system, which are allowed to execute simulation events concurrently. This allows exploiting parallel computing architectures to speedup model execution, and to make very large models tractable. In this article we cope with the optimistic approach to PDES, where LPs are allowed to concurrently process their events in a speculative fashion, and rollback/ recovery techniques are used to guarantee state consistency in case of causality violations along the speculative execution path. Particularly, we present an innovative load sharing approach targeted at optimizing resource usage for fruitful simulation work when running an optimistic PDES environment on top of multi-processor/multi-core machines. Beyond providing the load sharing model, we also define a load sharing oriented architectural scheme, based on a symmetric multi-threaded organization of the simulation platform. Finally, we present a real implementation of the load sharing architecture within the open source ROme OpTimistic Simulator (ROOT-Sim) package. Experimental data for an assessment of both viability and effectiveness of our proposal are presented as well. Copyright is held by author/owner(s)
A C-DAG task model for scheduling complex real-time tasks on heterogeneous platforms: preemption matters
Recent commercial hardware platforms for embedded real-time systems feature
heterogeneous processing units and computing accelerators on the same
System-on-Chip. When designing complex real-time application for such
architectures, the designer needs to make a number of difficult choices: on
which processor should a certain task be implemented? Should a component be
implemented in parallel or sequentially? These choices may have a great impact
on feasibility, as the difference in the processor internal architectures
impact on the tasks' execution time and preemption cost. To help the designer
explore the wide space of design choices and tune the scheduling parameters, in
this paper we propose a novel real-time application model, called C-DAG,
specifically conceived for heterogeneous platforms. A C-DAG allows to specify
alternative implementations of the same component of an application for
different processing engines to be selected off-line, as well as conditional
branches to model if-then-else statements to be selected at run-time. We also
propose a schedulability analysis for the C-DAG model and a heuristic
allocation algorithm so that all deadlines are respected. Our analysis takes
into account the cost of preempting a task, which can be non-negligible on
certain processors. We demonstrate the effectiveness of our approach on a large
set of synthetic experiments by comparing with state of the art algorithms in
the literature
Simulating a small turboshaft engine in real-time multiprocessor simulator (RTMPS) environment
A Real-Time Multiprocessor Simulator (RTMPS) has been developed at NASA Lewis Research Center. The RTMPS uses parallel microprocessors to achieve computing speeds needed for real-time engine simulation. This report describes the use of the RTMPS system to simulate a small turboshaft engine. The process of programming the engine equations and distributing them over one, two, and four processors is discussed. Steady-state and transient results from the RTMPS simulation are compared with results from a main-frame-based simulation. Processor execution times and the associated execution time savings for the two and four processor cases are presented using actual data obtained from the RTMPS system. Included is a discussion of why the minimum achievable calculation time for the turboshaft engine model was attained using four processors. Finally, future enhancements to the RTMPS system are discussed including the development of a generalized partitioning algorithm to automatically distribute the system equations among the processors in optimum fashion
From Dyson to Hopfield: Processing on hierarchical networks
We consider statistical-mechanical models for spin systems built on
hierarchical structures, which provide a simple example of non-mean-field
framework. We show that the coupling decay with spin distance can give rise to
peculiar features and phase diagrams much richer that their mean-field
counterpart. In particular, we consider the Dyson model, mimicking
ferromagnetism in lattices, and we prove the existence of a number of
meta-stabilities, beyond the ordered state, which get stable in the
thermodynamic limit. Such a feature is retained when the hierarchical structure
is coupled with the Hebb rule for learning, hence mimicking the modular
architecture of neurons, and gives rise to an associative network able to
perform both as a serial processor as well as a parallel processor, depending
crucially on the external stimuli and on the rate of interaction decay with
distance; however, those emergent multitasking features reduce the network
capacity with respect to the mean-field counterpart. The analysis is
accomplished through statistical mechanics, graph theory, signal-to-noise
technique and numerical simulations in full consistency. Our results shed light
on the biological complexity shown by real networks, and suggest future
directions for understanding more realistic models
Macroservers: An Execution Model for DRAM Processor-In-Memory Arrays
The emergence of semiconductor fabrication technology allowing a tight coupling between high-density DRAM and CMOS logic on the same chip has led to the important new class of Processor-In-Memory (PIM) architectures. Newer developments provide powerful parallel processing capabilities on the chip, exploiting the facility to load wide words in single memory accesses and supporting complex address manipulations in the memory. Furthermore, large arrays of PIMs can be arranged into a massively parallel architecture. In this report, we describe an object-based programming model based on the notion of a macroserver. Macroservers encapsulate a set of variables and methods; threads, spawned by the activation of methods, operate asynchronously on the variables' state space. Data distributions provide a mechanism for mapping large data structures across the memory region of a macroserver, while work distributions allow explicit control of bindings between threads and data. Both data and work distributuions are first-class objects of the model, supporting the dynamic management of data and threads in memory. This offers the flexibility required for fully exploiting the processing power and memory bandwidth of a PIM array, in particular for irregular and adaptive applications. Thread synchronization is based on atomic methods, condition variables, and futures. A special type of lightweight macroserver allows the formulation of flexible scheduling strategies for the access to resources, using a monitor-like mechanism
UPPAAL in practice : quantitative verification of a RapidIO network.
Packet switched networks are widely used for interconnecting distributed computing platforms. RapidIO (Rapid Input/Output) is an industry standard for packet switched networks to interconnect multiple processor boards. Key performance metrics for these platforms include average-case and worst-case packet transfer latencies. We focus on verifying such quantitative properties for a RapidIO based multi-processor platform that executes a motion control application. A performance model is available in the Parallel Object-Oriented Specification Language (POOSL) that allows for simulation based estimation results. It is however required to determine the exact worst-case latency as the application is time-critical. A model checking approach has been proposed in our previous work which transforms the POOSL model into an UPPAAL model. However, such an approach only works for a fairly small system. We extend the transformation approach with various heuristics to reduce the underlying state space, thereby providing an effective approximation approach that scales to industrial problems of a reasonable complexity
Self-stabilizing wormhole routing
Parallel and distributed systems are composed of individual processors that communicate with one another by exchanging messages through communication links. When the sender and the receiver of a message are not direct neighbors, intermediate processors must cooperate to ensure proper routing; Wormhole routing is most common in parallel architectures in which messages are sent in small fragments called flits. We assume that each processor will contain a single fixed-size flit buffer for each incoming link. A processor must forward the flit in a given link buffer to another processor before receiving another flit on that link. This permits messages to wind through the entire network from source to destination, resembling a worm. Wormhole routing is a lightweight and efficient method of routing messages between parallel processors; Our purpose is to modify existing wormhole routing algorithms in familiar topologies to make them self-stabilizing. Self-stabilization is a technique that guarantees tolerance to transient faults (e.g. memory corruption or communication hazard) for a given protocol. Transient faults would typically place the network in an illegitimate state, while Self-stabilization guarantees that the network recovers a correct behavior in finite time, without the need for human intervention. Self-stabilization also guarantees the safety property, meaning that once the network is in a legitimate state, it will remain there until another fault occurs; This paper presents self-stabilizing network algorithms in the wormhole routing model, using the unidirectional ring and the two-dimensional mesh topologies. We chose the ring topology to illustrate the numerous difficulties of self-stabilization in a wormhole routing environment, even in one of the most simple network topologies. We then extend the results of the ring topology to a more complex two-dimensional mesh network
- …