Deterministic Computations on a PRAM with Static Processor and Memory Faults.
We consider a Parallel Random Access Machine (PRAM) in which some processors
and memory cells are faulty. The faults considered are static, i.e., once the
machine starts to operate, the operational/faulty status of PRAM components
does not change. We develop a deterministic simulation of a fully operational
PRAM on a similar faulty machine which has constant fractions of faults among
processors and memory cells. The simulating PRAM has processors and
memory cells, and simulates a PRAM with processors and a constant fraction
of memory cells. The simulation is in two phases: it starts with
preprocessing, which is followed by the simulation proper performed in a
step-by-step fashion. Preprocessing is performed in time . The slowdown of a step-by-step part of the simulation is
Models for Parallel Computation in Multi-Core, Heterogeneous, and Ultra Wide-Word Architectures
Multi-core processors have become the dominant processor architecture, with 2, 4, and 8 cores on a chip being widely available and an increasing number of cores predicted for the future. In addition, the decreasing costs and increasing programmability of Graphics Processing Units (GPUs) have made these an accessible source of parallel processing power in general purpose computing. Among the many research challenges that this scenario has raised are the fundamental problems related to theoretical modeling of computation in these architectures. In this thesis we study several aspects of computation in modern parallel architectures, from modeling of computation in multi-cores and heterogeneous platforms, to multi-core cache management strategies, through the proposal of an architecture that exploits bit-parallelism on thousands of bits.
Observing that in practice multi-cores have a small number of cores, we propose a model for low-degree parallelism for these architectures. We argue that assuming a small number of processors (logarithmic in a problem's input size) simplifies the design of parallel algorithms. We show that in this model a large class of divide-and-conquer and dynamic programming algorithms can be parallelized with simple modifications to sequential programs, while achieving optimal parallel speedups. We further explore low-degree-parallelism in computation, providing evidence of fundamental differences in practice and theory between systems with a sublinear and linear number of processors, and suggesting a sharp theoretical gap between the classes of problems that are efficiently parallelizable in each case.
Efficient strategies to manage shared caches play a crucial role in multi-core performance. We propose a model for paging in multi-core shared caches, which extends classical paging to a setting in which several threads share the cache. We show that in this setting traditional cache management policies perform poorly, and that any effective strategy must partition the cache among threads, with a partition that adapts dynamically to the demands of each thread. Inspired by the shared cache setting, we introduce the minimum cache usage problem, an extension to classical sequential paging in which algorithms must account for the amount of cache they use. This cache-aware model seeks algorithms with good performance in terms of faults and the amount of cache used, and has applications in energy efficient caching and in shared cache scenarios.
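The claim that a shared cache under a traditional policy can behave worse than a partitioned one can be illustrated with a toy experiment. The sketch below is not the thesis's model: it assumes a fixed round-robin interleaving of two threads' requests, uses plain LRU, and compares a shared cache against a static (not dynamic) partition. The function names `lru_faults` and `interleave` are hypothetical.

```python
from collections import OrderedDict


def lru_faults(requests, cache_size):
    """Count the page faults LRU incurs on a request sequence."""
    cache = OrderedDict()  # keys ordered from least to most recently used
    faults = 0
    for page in requests:
        if page in cache:
            cache.move_to_end(page)  # hit: mark as most recently used
        else:
            faults += 1
            if len(cache) >= cache_size:
                cache.popitem(last=False)  # evict the least recently used page
            cache[page] = True
    return faults


def interleave(a, b):
    """Round-robin interleaving of two equally long request sequences (an assumption)."""
    merged = []
    for x, y in zip(a, b):
        merged.extend([x, y])
    return merged


# Thread A cycles over two pages; thread B streams over pages it never reuses.
n = 8
thread_a = [("A", i % 2) for i in range(n)]
thread_b = [("B", i) for i in range(n)]

# One LRU cache of 3 pages shared by both threads.
shared = lru_faults(interleave(thread_a, thread_b), 3)

# Same total capacity, statically split: 2 pages for A, 1 page for B.
partitioned = lru_faults(thread_a, 2) + lru_faults(thread_b, 1)

print(shared, partitioned)
```

In this toy instance the streaming thread keeps pushing the other thread's two pages out of the shared cache, so every request faults, while the partition protects the reused pages. A fixed split is only the simplest case; the abstract argues that the partition must adapt to each thread's demands over time.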
The wide availability of GPUs has added to the parallel power of multi-cores; however, most applications underutilize the available resources. We propose a model for hybrid computation in heterogeneous systems with multi-cores and GPUs, and describe strategies for generic parallelization and efficient scheduling of a large class of divide-and-conquer algorithms.
Lastly, we introduce the Ultra-Wide Word architecture and model, an extension of the word-RAM model, that allows for constant time operations on thousands of bits in parallel. We show that a large class of existing algorithms can be implemented in the Ultra-Wide Word model, achieving speedups comparable to those of multi-threaded computations, while avoiding the more difficult aspects of parallel programming.
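To give a feel for bit-parallelism on very wide words, the sketch below runs the classic Shift-And string matching algorithm with the whole automaton state held in a single integer. This is an illustration chosen here, not an algorithm taken from the thesis, and it does not reproduce the Ultra-Wide Word instruction set. Python's arbitrary-precision integers merely emulate a wide word; they do not give constant-time operations. The function name `shift_and_search` is hypothetical.

```python
def shift_and_search(text, pattern):
    """Return the start positions of all occurrences of pattern in text.

    Bit i of `state` is set when pattern[:i + 1] matches the text ending at
    the current position, so one shift and one AND advance all prefixes at once.
    """
    m = len(pattern)
    if m == 0:
        return []
    # masks[ch] has bit i set exactly when pattern[i] == ch.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    accept = 1 << (m - 1)  # bit that signals a full match
    state = 0
    hits = []
    for j, ch in enumerate(text):
        # Extend every partial match by one character, and start a new one.
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            hits.append(j - m + 1)
    return hits


print(shift_and_search("abracadabra", "abra"))
```

On a 64-bit word-RAM this technique handles patterns of up to 64 characters per word; with a word of thousands of bits the same two operations per text character would cover far longer patterns, which is the kind of gain the abstract refers to.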
Parallel and Distributed Computing
The 14 chapters presented in this book cover a wide variety of representative works ranging from hardware design to application development. In particular, the topics addressed are programmable and reconfigurable devices and systems, dependability of GPUs (Graphics Processing Units), network topologies, cache coherence protocols, resource allocation, scheduling algorithms, peer-to-peer networks, large-scale network simulation, and parallel routines and algorithms. In this way, the articles included in this book constitute an excellent reference for engineers and researchers who have particular interests in each of these topics in parallel and distributed computing.
The Parallel Persistent Memory Model
We consider a parallel computational model that consists of processors,
each with a fast local ephemeral memory of limited size, and sharing a large
persistent memory. The model allows for each processor to fault with bounded
probability, and possibly restart. On faulting all processor state and local
ephemeral memory are lost, but the persistent memory remains. This model is
motivated by upcoming non-volatile memories that are as fast as existing random
access memory, are accessible at the granularity of cache lines, and have the
capability of surviving power outages. It is further motivated by the
observation that in large parallel systems, failure of processors and their
caches is not unusual.
Within the model we develop a framework for developing locality efficient
parallel algorithms that are resilient to failures. There are several
challenges, including the need to recover from failures, the desire to do this
in an asynchronous setting (i.e., not blocking other processors when one
fails), and the need for synchronization primitives that are robust to
failures. We describe approaches to solve these challenges based on breaking
computations into what we call capsules, which have certain properties, and
developing a work-stealing scheduler that functions properly within the context
of failures. The scheduler guarantees an expected time bound stated in terms of
the work and depth of the computation (in the absence of failures), the average
number of processors available during the computation, and the
probability that a capsule fails. Within the model and using the proposed
methods, we develop efficient algorithms for parallel sorting and other
primitives.
Comment: This paper is the full version of a paper at SPAA 2018 with the same
name
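The idea of restarting a capsule after a fault can be sketched as follows. The abstract does not spell out the properties capsules must have, so this example assumes one plausible condition: a capsule never writes a persistent location that it also reads, which makes re-execution from the start harmless. The names `run_capsule` and `sum_and_product`, and the fault-injection scheme, are hypothetical and are not the paper's scheduler.

```python
import random


def run_capsule(compute_writes, persistent, fault_prob, rng):
    """Re-run a capsule from its start until one attempt completes.

    compute_writes reads persistent memory and returns (key, value) writes.
    A simulated fault may strike before any write, losing all local state
    but leaving earlier writes in persistent memory. Returns the attempt count.
    """
    attempts = 0
    while True:
        attempts += 1
        writes = compute_writes(persistent)  # ephemeral state, lost on a fault
        faulted = False
        for key, value in writes:
            if rng.random() < fault_prob:
                faulted = True  # processor fault: abandon this attempt
                break
            persistent[key] = value
        if not faulted:
            return attempts


def sum_and_product(mem):
    """Example capsule: reads x and y, writes only to sum and prod (assumed rule)."""
    return [("sum", mem["x"] + mem["y"]), ("prod", mem["x"] * mem["y"])]


memory = {"x": 6, "y": 7}
attempts = run_capsule(sum_and_product, memory, fault_prob=0.5, rng=random.Random(1))
print(attempts, memory["sum"], memory["prod"])
```

Because the inputs are never overwritten, a partially completed attempt leaves behind only values that the next attempt will write again, so the final contents of persistent memory do not depend on how many faults occurred.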
Interconnection Networks Embeddings and Efficient Parallel Computations.
To obtain greater performance, many processors are allowed to cooperate to solve a single problem. These processors communicate via an interconnection network or a bus. The most essential function of the underlying interconnection network is the efficient exchange of messages between processes on different processors. Parallel machines based on the hypercube topology have gained great respect in parallel computation because of their many attractive properties. Many versions of the hypercube have been introduced by researchers, mainly to enhance communication. The twisted hypercube is one of the most attractive versions of the hypercube. It preserves the important features of the hypercube and reduces its diameter by a factor of two. This dissertation investigates relations and transformations between various interconnection networks and the twisted hypercube and explores its efficiency in parallel computation. The capability of the twisted hypercube to simulate complete binary trees, complete quad trees, and rings is demonstrated and compared with the hypercube. Finally, the fault-tolerance of the twisted hypercube is investigated. We present optimal algorithms to simulate rings in a faulty twisted hypercube environment and compare them with the hypercube.
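As a point of reference for what simulating a ring on a hypercube means, the sketch below embeds a ring in the ordinary (untwisted) hypercube using the reflected binary Gray code, a standard textbook construction. It does not cover the twisted hypercube or the fault-tolerant algorithms of the dissertation. The function names `gray_code_ring` and `is_ring_embedding` are hypothetical.

```python
def gray_code_ring(n):
    """Order the 2**n nodes of an n-dimensional hypercube as a ring.

    Position i on the ring is mapped to hypercube node i XOR (i >> 1),
    the reflected binary Gray code.
    """
    return [i ^ (i >> 1) for i in range(1 << n)]


def is_ring_embedding(order, n):
    """Check that order visits every node once and ring neighbours are adjacent."""
    if sorted(order) != list(range(1 << n)):
        return False
    for k in range(len(order)):
        diff = order[k] ^ order[(k + 1) % len(order)]
        # Hypercube nodes are adjacent when their labels differ in exactly one bit.
        if bin(diff).count("1") != 1:
            return False
    return True


ring = gray_code_ring(3)
print(ring, is_ring_embedding(ring, 3))
```

Every ring edge maps to a single hypercube edge, so neighbouring ring processes exchange messages over one link. The natural order 0, 1, 2, ... does not have this property, since for example nodes 1 and 2 differ in two bits.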
Predictive analysis and optimisation of pipelined wavefront applications using reusable analytic models
Pipelined wavefront computations are a ubiquitous class of high performance parallel algorithms
used for the solution of many scientific and engineering applications. In order to aid
the design and optimisation of these applications, and to ensure that during procurement platforms
are chosen best suited to these codes, there has been considerable research in analysing
and evaluating their operational performance.
Wavefront codes exhibit complex computation, communication, and synchronisation patterns,
and as a result there exists a large variety of such codes and possible optimisations. The
problem is compounded by each new generation of high performance computing system,
which has often introduced a previously unexplored architectural trait, requiring previous
performance models to be rewritten and reevaluated.
In this thesis, we address the performance modelling and optimisation of this class of
application, as a whole. This differs from previous studies in which bespoke models are applied
to specific applications. The analytic performance models are generalised and reusable,
and we demonstrate their application to the predictive analysis and optimisation of pipelined
wavefront computations running on modern high performance computing systems.
The performance model is based on the LogGP parameterisation, and uses a small
number of input parameters to specify the particular behaviour of most wavefront codes. The
new parameters and model equations capture the key structural and behavioural differences
among different wavefront application codes, providing a succinct summary of the operations
for each application and insights into alternative wavefront application design.
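The LogGP parameterisation mentioned above describes a machine by latency L, per-message overhead o, gap per message g, gap per byte G, and processor count P. The sketch below shows the standard LogGP cost of one point-to-point message and plugs it into a deliberately simplified pipeline-fill estimate. The fill estimate is an assumption made here for illustration and is not the thesis's model equations; the function names and the parameter `w` (compute time per tile) are hypothetical.

```python
def loggp_message_time(k_bytes, L, o, G):
    """End-to-end time of one k-byte message under LogGP: o + (k - 1) * G + L + o."""
    return o + (k_bytes - 1) * G + L + o


def toy_pipeline_fill_time(px, py, w, t_msg):
    """Toy estimate of when the last processor of a px-by-py grid can start.

    Assumes (for illustration only) that a wavefront starting in one corner
    needs px + py - 2 hops to reach the opposite corner, and that each hop
    costs one tile of computation (w) plus one message (t_msg).
    """
    return (px + py - 2) * (w + t_msg)


t_small = loggp_message_time(1, L=5.0, o=2.0, G=0.01)
t_large = loggp_message_time(1001, L=5.0, o=2.0, G=0.01)
fill = toy_pipeline_fill_time(4, 4, w=10.0, t_msg=t_small)
print(t_small, t_large, fill)
```

Even this toy version shows why a handful of parameters can be informative: the fill cost grows with the processor grid dimensions, while the message term separates fixed costs (L and o) from the size-dependent cost (G).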
The models are applied to three industry-strength wavefront codes and are validated
on several systems including a Cray XT3/XT4 and an InfiniBand commodity cluster. Model
predictions show high quantitative accuracy (less than 20% error) for all high performance
configurations and excellent qualitative accuracy.
The thesis presents applications, projections and insights for optimisations using the
model, which show the utility of reusable analytic models for performance engineering of
high performance computing codes. In particular, we demonstrate the use of the model for:
(1) evaluating application configuration and resulting performance; (2) evaluating hardware
platform issues including platform sizing and configuration; (3) exploring hardware platform design
alternatives and system procurement; and (4) considering possible code and algorithmic
optimisations.