6 research outputs found
Algorithms in the Ultra-Wide Word Model
The effective use of parallel computing resources to speed up algorithms in
current multi-core parallel architectures remains a difficult challenge, with
ease of programming playing a key role in the eventual success of various
parallel architectures. In this paper we consider an alternative view of
parallelism in the form of an ultra-wide word processor. We introduce the
Ultra-Wide Word architecture and model, an extension of the word-RAM model that
allows for constant time operations on thousands of bits in parallel. Word
parallelism as exploited by the word-RAM model does not suffer from the more
difficult aspects of parallel programming, namely synchronization and
concurrency. For the standard word-RAM algorithms, the speedups obtained are
moderate, as they are limited by the word size. We argue that a large class of
word-RAM algorithms can be implemented in the Ultra-Wide Word model, obtaining
speedups comparable to multi-threaded computations while keeping the simplicity
of programming of the sequential RAM model. We show that this is the case by
describing implementations of Ultra-Wide Word algorithms for dynamic
programming and string searching. In addition, we show that the Ultra-Wide Word
model can be used to implement a nonstandard memory architecture, which enables
the sidestepping of lower bounds of important data structure problems such as
priority queues and dynamic prefix sums. While similar ideas about operating on
large words have been mentioned before in the context of multimedia processors
[Thorup 2003], it is only recently that an architecture like the one we propose
has become feasible and that details can be worked out.Comment: 28 pages, 5 figures; minor change
Partial Sums on the Ultra-Wide Word RAM
We consider the classic partial sums problem on the ultra-wide word RAM model
of computation. This model extends the classic -bit word RAM model with
special ultrawords of length bits that support standard arithmetic and
boolean operation and scattered memory access operations that can access
(non-contiguous) locations in memory. The ultra-wide word RAM model captures
(and idealizes) modern vector processor architectures.
Our main result is a new in-place data structure for the partial sum problem
that only stores a constant number of ultraword in addition to the input and
supports operations in doubly logarithmic time. This matches the best known
time bounds for the problem (among polynomial space data structures) while
improving the space from superlinear to a constant number of ultrawords. Our
results are based on a simple and elegant in-place word RAM data structure,
known as the Fenwick tree. Our main technical contribution is a new efficient
parallel ultra-wide word RAM implementation of the Fenwick tree, which is
likely of independent interest.Comment: Extended abstract appeared at TAMC 202
Performance of the Ultra-Wide Word Model
The Ultra-wide word model of computation (UWRAM) is an extension of the Word-RAM model which has an ALU that can operate on w^2 bits at a time, where w is the size in bits of a cell in memory. The purpose of this thesis is to explore the applicability of the UWRAM model, particularly when compared to the PRAM model, from an algorithmic point of view, to determine its potential for common applications.
The work is divided into three sections: First we describe the model, its instruction set, strengths and weaknesses, and provide a few small examples that showcase the functionality of the model and how simple techniques can be used to speed up sequential algorithms.
In the second section, we discuss the problem of sorting and searching, and show that elaborate data structures such as the fusion tree can be easily adapted to the model, allowing the sorting of n integers in O(n (log n/log log n) time with small constant factors.
Lastly, we provide simulations of UWRAM and PRAM programs to solve two problems: subset sum and string matching. In the first case we show how a dynamic programming algorithm can be sped up using bit parallelism where traditional parallelism is difficult to achieve, and in the second, we show that even in a problem that is simple to parallelize traditionally, the UWRAM can perform well when compared to a PRAM
Models for Parallel Computation in Multi-Core, Heterogeneous, and Ultra Wide-Word Architectures
Multi-core processors have become the dominant processor architecture with 2, 4, and 8 cores on a chip being widely available and an increasing number of cores predicted for the future. In addition, the decreasing costs and increasing programmability of Graphic Processing Units (GPUs) have made these an accessible source of parallel processing power in general purpose computing. Among the many research challenges that this scenario has raised are the fundamental problems related to theoretical modeling of computation in these architectures. In this thesis we study several aspects of computation in modern parallel architectures, from modeling of computation in multi-cores and heterogeneous platforms, to multi-core cache management strategies, through the proposal of an architecture that exploits bit-parallelism on thousands of bits.
Observing that in practice multi-cores have a small number of cores, we propose a model for low-degree parallelism for these architectures. We argue that assuming a small number of processors (logarithmic in a problem's input size) simplifies the design of parallel algorithms. We show that in this model a large class of divide-and-conquer and dynamic programming algorithms can be parallelized with simple modifications to sequential programs, while achieving optimal parallel speedups. We further explore low-degree-parallelism in computation, providing evidence of fundamental differences in practice and theory between systems with a sublinear and linear number of processors, and suggesting a sharp theoretical gap between the classes of problems that are efficiently parallelizable in each case.
Efficient strategies to manage shared caches play a crucial role in multi-core performance. We propose a model for paging in multi-core shared caches, which extends classical paging to a setting in which several threads share the cache. We show that in this setting traditional cache management policies perform poorly, and that any effective strategy must partition the cache among threads, with a partition that adapts dynamically to the demands of each thread. Inspired by the shared cache setting,
we introduce the minimum cache usage problem, an extension to classical sequential paging in which algorithms must account for the amount of cache they use.
This cache-aware model seeks algorithms with good performance in terms of faults and the amount of cache used, and has applications in energy efficient caching and in shared cache scenarios.
The wide availability of GPUs has added to the parallel power of multi-cores, however, most applications underutilize the available resources. We propose a model for hybrid computation in heterogeneous systems with multi-cores and GPU, and describe strategies for generic parallelization and efficient scheduling of a large class of divide-and-conquer algorithms.
Lastly, we introduce the Ultra-Wide Word architecture and model, an extension of the word-RAM model, that allows for constant time operations on thousands of bits in parallel. We show that a large class of existing algorithms can be
implemented in the Ultra-Wide Word model, achieving speedups comparable to those of multi-threaded computations, while avoiding the more difficult aspects of parallel programming