A lower bound for linear approximate compaction
The {\em $\lambda$-approximate compaction} problem is: given an input array of values, each either 0 or 1, place each value in an output array so that all the 1's are in the first … array locations, where $k$ denotes the number of 1's in the input and $\lambda$ is an accuracy parameter. This problem is of fundamental importance in parallel computation because of its applications to processor allocation and approximate counting. When $\lambda$ is a constant, the problem is called {\em Linear Approximate Compaction} (LAC). On the CRCW PRAM model, there is an algorithm that solves approximate compaction in $O((\log\log n)^3)$ time for …, using … processors. Our main result shows that this is close to the best possible. Specifically, we prove that LAC requires … time using $O(n)$ processors. We also give a tradeoff between … and the processing time. For … and …, the time required is ….
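To make the problem statement concrete, here is a minimal sequential sketch (ours, not the paper's PRAM algorithm). Since the exact prefix bound did not survive in the abstract, the sketch assumes the common form in which all 1's must land in the first $(1+\lambda)k$ locations:

```python
import math

def approx_compact(bits, lam):
    """Sequential reference for lambda-approximate compaction.

    Places all 1's of a 0/1 array into a prefix of the output; we
    assume the target prefix length (1 + lam) * k, where k is the
    number of 1's. A sequential pass can compact exactly (lam = 0);
    the slack factor is what fast parallel algorithms exploit.
    """
    k = sum(bits)
    out = [0] * len(bits)
    pos = 0
    for b in bits:
        if b == 1:
            out[pos] = 1
            pos += 1
    assert pos <= math.ceil((1 + lam) * k)  # prefix constraint holds
    return out

print(approx_compact([0, 1, 1, 0, 1, 0], lam=0.5))  # [1, 1, 1, 0, 0, 0]
```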
Optimal parallel string algorithms: sorting, merging and computing the minimum
We study fundamental comparison problems on strings of characters, equipped with the usual lexicographical ordering. For each problem studied, we give a parallel algorithm that is optimal with respect to at least one criterion for which no optimal algorithm was previously known. Specifically, our main results are:
\begin{itemize}
\item Two sorted sequences of strings, containing altogether … characters, can be merged in … time using … operations on an EREW PRAM. This is optimal as regards both the running time and the number of operations. (A sequential sketch of this merging problem follows the abstract.)
\item A sequence of strings, containing altogether … characters represented by integers of size polynomial in …, can be sorted in … time using … operations on a CRCW PRAM. The running time is optimal for any polynomial number of processors.
\item The minimum string in a sequence of strings containing altogether … characters can be found using … (expected) operations in constant expected time on a randomized CRCW PRAM, in … time on a deterministic CRCW PRAM with a program depending on …, in … time on a deterministic CRCW PRAM with a program not depending on …, in … expected time on a randomized EREW PRAM, and in … time on a deterministic EREW PRAM. The number of operations is optimal, and the running time is optimal for the randomized algorithms and, if the number of processors is limited to …, for the nonuniform deterministic CRCW PRAM algorithm as well.
\end{itemize}
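The promised sketch of the merging problem from the first item (our illustration; the paper's cost measure counts operations on character comparisons, which a parallel algorithm must spread evenly across processors):

```python
def merge_string_seqs(a, b):
    """Merge two lexicographically sorted sequences of strings.

    Sequential reference for the problem the EREW PRAM algorithm
    solves; total work is dominated by character comparisons.
    """
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:          # lexicographic string comparison
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    out.extend(a[i:]); out.extend(b[j:])
    return out

print(merge_string_seqs(["ab", "bc"], ["aa", "b", "zz"]))
# ['aa', 'ab', 'b', 'bc', 'zz']
```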
Efficient Parallel Path Checking for Linear-Time Temporal Logic With Past and Bounds
Path checking, the special case of the model checking problem where the model
under consideration is a single path, plays an important role in monitoring,
testing, and verification. We prove that for linear-time temporal logic (LTL),
path checking can be efficiently parallelized. In addition to the core logic,
we consider the extensions of LTL with bounded-future (BLTL) and past-time
(LTL+Past) operators. Even though both extensions improve the succinctness of
the logic exponentially, path checking remains efficiently parallelizable: Our
algorithm for LTL, LTL+Past, and BLTL+Past is in AC^1(logDCFL) \subseteq NC.
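To fix ideas, here is a minimal sequential path checker for core LTL on a finite trace (our sketch under finite-trace semantics with a strong "next"; the paper's contribution is showing that this evaluation parallelizes into AC^1(logDCFL)):

```python
def check(formula, trace):
    """Evaluate an LTL formula on a finite trace, given as a list of
    sets of atomic propositions.

    Formulas are tuples: ('ap', 'p'), ('not', f), ('and', f, g),
    ('X', f), ('U', f, g). Plain recursive evaluation; a dynamic
    program over (subformula, position) pairs runs in polynomial time.
    """
    n = len(trace)
    def sat(f, i):
        op = f[0]
        if op == 'ap':  return f[1] in trace[i]
        if op == 'not': return not sat(f[1], i)
        if op == 'and': return sat(f[1], i) and sat(f[2], i)
        if op == 'X':   return i + 1 < n and sat(f[1], i + 1)  # strong next
        if op == 'U':   # f[1] holds until f[2] holds, within the trace
            return any(sat(f[2], j) and all(sat(f[1], k)
                       for k in range(i, j)) for j in range(i, n))
        raise ValueError(op)
    return sat(formula, 0)

trace = [{'p'}, {'p'}, {'q'}]
print(check(('U', ('ap', 'p'), ('ap', 'q')), trace))  # True
```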
Distributed-Memory Breadth-First Search on Massive Graphs
This chapter studies the problem of traversing large graphs using the
breadth-first search order on distributed-memory supercomputers. We consider
both the traditional level-synchronous top-down algorithm as well as the
recently discovered direction optimizing algorithm. We analyze the performance
and scalability trade-offs in using different local data structures such as CSR
and DCSC, enabling in-node multithreading, and graph decompositions such as 1D
and 2D decomposition.
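As a rough illustration of the two traversal directions (a minimal shared-memory sketch of ours; the chapter's implementations operate on distributed CSR/DCSC structures, and the switching threshold `alpha` here is illustrative, not a tuned value):

```python
def bfs_direction_optimizing(adj, source, alpha=14):
    """Level-synchronous BFS that switches between top-down and
    bottom-up expansion, in the spirit of the direction-optimizing
    algorithm discussed in the chapter."""
    n = len(adj)
    parent = [-1] * n
    parent[source] = source
    frontier = {source}
    while frontier:
        nxt = set()
        if len(frontier) * alpha > n:
            # Bottom-up: every unvisited vertex pulls from the frontier.
            for v in range(n):
                if parent[v] == -1:
                    for u in adj[v]:
                        if u in frontier:
                            parent[v] = u
                            nxt.add(v)
                            break
        else:
            # Top-down: frontier vertices push to unvisited neighbors.
            for u in frontier:
                for v in adj[u]:
                    if parent[v] == -1:
                        parent[v] = u
                        nxt.add(v)
        frontier = nxt
    return parent

adj = [[1, 2], [0, 3], [0, 3], [1, 2]]
print(bfs_direction_optimizing(adj, 0))  # [0, 0, 0, 1]
```

The payoff of the bottom-up phase is that a vertex stops scanning its neighbors as soon as it finds one parent, which saves a large fraction of edge inspections when the frontier covers most of the graph.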
Perfectly Oblivious (Parallel) RAM Revisited, and Improved Constructions
Oblivious RAM (ORAM)
is a technique for compiling any RAM program to an oblivious counterpart, i.e.,
one whose access patterns do not leak information about the secret inputs.
Similarly, Oblivious Parallel RAM (OPRAM) compiles a
{\it parallel} RAM program to an oblivious counterpart.
In this paper, we care about ORAM/OPRAM with {\it perfect security}, i.e.,
the access patterns must be {\it identically distributed}
no matter what the program's memory request sequence is.
In the past, two types of perfect ORAMs/OPRAMs
have been considered:
constructions whose performance bounds hold {\it in expectation} (but may occasionally
run more slowly);
and constructions whose performance bounds hold {\it deterministically} (even though
the algorithms themselves are randomized).
In this paper, we revisit the performance metrics for perfect
ORAM/OPRAM, and
show novel constructions that achieve asymptotical improvements
for all performance metrics.
Our first result
is a new perfectly secure OPRAM
scheme with … {\it expected} overhead.
In comparison, prior literature
has been stuck at … for more than a decade.
Next, we show how to construct a perfect ORAM
with … {\it deterministic} simulation overhead. We further show how
to make the scheme parallel, resulting in a perfect OPRAM
with … {\it deterministic} simulation overhead.
For perfect ORAMs/OPRAMs
with deterministic performance bounds, our results achieve
{\it subexponential} improvement over the state-of-the-art.
Specifically, the best known prior scheme
incurs more than … deterministic simulation overhead
(Raskin and Simkin, Asiacrypt'19); moreover, their scheme works
only for the sequential setting and is {\it not} amenable to parallelization.
Finally, we additionally consider perfect ORAMs/OPRAMs
whose performance bounds hold with high probability.
For this new performance metric, we show new constructions
whose simulation overhead is upper bounded by …
except with negligible probability, i.e., we prove
high-probability performance bounds that match the expected
bounds mentioned earlier.
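For intuition about the security notion and the overhead metric, here is the textbook trivial construction (our illustration, not one of the paper's schemes): a linear-scan ORAM touches every cell on every access, so its physical access pattern is fixed, hence trivially identically distributed across request sequences, at a deterministic overhead of $N$ per access. The paper's goal is to drive such overhead down to polylogarithmic.

```python
class LinearScanORAM:
    """Trivial perfectly secure ORAM: every logical access touches
    all N physical cells in the same fixed order, so the physical
    access pattern leaks nothing about which address was requested.
    Deterministic simulation overhead: N physical accesses per
    logical access."""
    def __init__(self, n):
        self.mem = [0] * n

    def access(self, op, addr, value=None):
        result = None
        for i in range(len(self.mem)):      # fixed, input-independent scan
            if i == addr:
                result = self.mem[i]
                if op == 'write':
                    self.mem[i] = value
            else:
                _ = self.mem[i]             # dummy touch for obliviousness
        return result

oram = LinearScanORAM(8)
oram.access('write', 3, 42)
print(oram.access('read', 3))  # 42
```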
Modeling Algorithm Performance on Highly-threaded Many-core Architectures
The rapid growth of data processing required in various arenas of computation over the past decades necessitates extensive use of parallel computing engines. Among these, highly-threaded many-core machines such as GPUs have become increasingly popular for accelerating a diverse range of data-intensive applications. They feature a large number of hardware threads with low-overhead context switches to hide memory access latencies, and therefore provide high computational throughput. However, understanding and harnessing such machines poses great challenges for algorithm designers and performance tuners because of the complex interaction of threads and the hierarchical memory subsystems of these machines. The achieved performance depends jointly on the parallelism exploited by the algorithm, the effectiveness of latency hiding, and the utilization of multiprocessors (occupancy). Contemporary work tries to model the performance of GPUs from various aspects with different emphasis and granularity, but no existing model considers all of these factors together.
This dissertation presents an analytical framework that jointly addresses parallelism, latency hiding, and occupancy for both theoretical and empirical performance analysis of algorithms on highly-threaded many-core machines, so that it can guide both algorithm design and performance tuning. In particular, the framework not only helps to explore and reduce the runtime configuration space for tuning kernel execution on GPUs, but also reveals performance bottlenecks and predicts how the runtime will trend as the problem and other parameters scale. The framework consists of a pair of analytical models: one focuses on higher-level asymptotic algorithm performance on GPUs, and the other emphasizes lower-level details about scheduling and runtime configuration. Based on the two models, we have conducted extensive analysis of a large set of algorithms. The analyses provide interesting results and explain previously unexplained data. In addition, the two models are bridged and combined into a consistent framework that provides an end-to-end methodology for algorithm design, evaluation, comparison, implementation, and fairly accurate prediction of real runtime on GPUs.
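As a flavor of the lower-level model's style of reasoning, here is a toy occupancy calculation (our sketch; the resource limits and the simple bottleneck rule are illustrative, roughly patterned on a recent NVIDIA SM, and are not the dissertation's calibrated model):

```python
def occupancy(threads_per_block, regs_per_thread, smem_per_block,
              max_threads=2048, max_blocks=32,
              regs_per_sm=65536, smem_per_sm=98304):
    """Toy occupancy model: resident blocks per multiprocessor are
    limited by whichever resource runs out first -- thread slots,
    registers, or shared memory. Returns the fraction of peak
    resident threads (occupancy)."""
    by_threads = max_threads // threads_per_block
    by_regs = regs_per_sm // (regs_per_thread * threads_per_block)
    by_smem = smem_per_sm // smem_per_block if smem_per_block else max_blocks
    blocks = min(max_blocks, by_threads, by_regs, by_smem)
    return blocks * threads_per_block / max_threads

print(f"occupancy = {occupancy(256, 40, 8192):.2f}")  # 0.75 (register-bound)
```

Lower occupancy leaves fewer warps to swap in while others wait on memory, which is exactly the latency-hiding interaction the framework models.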
To demonstrate the viability of our methods, the models are validated with data from implementations of a variety of classic algorithms, including hashing, Bloom filters, all-pairs shortest path, matrix multiplication, FFT, merge sort, list ranking, and string matching via suffix tree/array. We evaluate the models' performance across a wide spectrum of parameters, data values, and machines. The results indicate that the models can be effectively used for algorithm performance analysis and runtime prediction on highly-threaded many-core machines.