498 research outputs found

    A lower bound for linear approximate compaction

    Get PDF
    The λ-approximate compaction problem is: given an input array of n values, each either 0 or 1, place each value in an output array so that all the 1's are in the first (1+λ)k array locations, where k is the number of 1's in the input. λ is an accuracy parameter. This problem is of fundamental importance in parallel computation because of its applications to processor allocation and approximate counting. When λ is a constant, the problem is called Linear Approximate Compaction (LAC). On the CRCW PRAM model, there is an algorithm that solves approximate compaction in O((log log n)^3) time for λ = 1/(log log n), using n/(log log n)^3 processors. Our main result shows that this is close to the best possible. Specifically, we prove that LAC requires Ω(log log n) time using O(n) processors. We also give a tradeoff between λ and the processing time. For ε < 1 and λ = n^ε, the time required is Ω(log(1/ε)).
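
    To make the specification concrete, the following is a minimal sequential sketch (in Python; the function name and example values are purely illustrative and not from the paper) of what a valid λ-approximate compaction output must satisfy. The paper's subject is a lower bound for parallel CRCW PRAM algorithms, not this checker.

    import math

    def is_valid_approximate_compaction(input_bits, output, lam):
        """Check the lambda-approximate compaction specification:
        every 1 from the input must be placed within the first
        (1 + lambda) * k positions of the output, where k is the
        number of 1's in the input. Illustrative checker only."""
        k = sum(input_bits)
        window = math.ceil((1 + lam) * k)
        # All k ones must land inside the first (1 + lambda) * k slots.
        return sum(output) == k and sum(output[:window]) == k

    # Example: k = 2 ones, lambda = 0.5, so the ones must occupy
    # the first ceil(1.5 * 2) = 3 output positions.
    print(is_valid_approximate_compaction([0, 1, 0, 0, 1, 0],
                                          [1, 0, 1, 0, 0, 0], 0.5))  # True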

    Optimal parallel string algorithms: sorting, merging and computing the minimum

    No full text
    We study fundamental comparison problems on strings of characters, equipped with the usual lexicographical ordering. For each problem studied, we give a parallel algorithm that is optimal with respect to at least one criterion for which no optimal algorithm was previously known. Specifically, our main results are:
    - Two sorted sequences of strings, containing altogether n characters, can be merged in O(log n) time using O(n) operations on an EREW PRAM. This is optimal as regards both the running time and the number of operations.
    - A sequence of strings, containing altogether n characters represented by integers of size polynomial in n, can be sorted in O(log n / log log n) time using O(n log log n) operations on a CRCW PRAM. The running time is optimal for any polynomial number of processors.
    - The minimum string in a sequence of strings containing altogether n characters can be found using (expected) O(n) operations in constant expected time on a randomized CRCW PRAM, in O(log log n) time on a deterministic CRCW PRAM with a program depending on n, in O((log log n)^3) time on a deterministic CRCW PRAM with a program not depending on n, in O(log n) expected time on a randomized EREW PRAM, and in O(log n log log n) time on a deterministic EREW PRAM. The number of operations is optimal, and the running time is optimal for the randomized algorithms and, if the number of processors is limited to n, for the nonuniform deterministic CRCW PRAM algorithm as well.
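
    For reference, a straightforward sequential merge of two sorted string sequences is sketched below (Python; illustrative only, not the paper's algorithm). Naive comparisons may rescan long shared prefixes, so the total work can exceed O(n) characters in the worst case, which is part of what makes the optimal O(n)-operation EREW PRAM bound nontrivial.

    def merge_sorted_string_sequences(a, b):
        """Merge two lexicographically sorted lists of strings.
        Sequential sketch: each comparison may scan a whole string,
        so total work is not bounded by the number of characters."""
        out, i, j = [], 0, 0
        while i < len(a) and j < len(b):
            if a[i] <= b[j]:
                out.append(a[i]); i += 1
            else:
                out.append(b[j]); j += 1
        out.extend(a[i:])
        out.extend(b[j:])
        return out

    print(merge_sorted_string_sequences(["ab", "abc", "b"], ["aa", "abd", "c"]))
    # ['aa', 'ab', 'abc', 'abd', 'b', 'c']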

    Efficient Parallel Path Checking for Linear-Time Temporal Logic With Past and Bounds

    Full text link
    Path checking, the special case of the model checking problem where the model under consideration is a single path, plays an important role in monitoring, testing, and verification. We prove that for linear-time temporal logic (LTL), path checking can be efficiently parallelized. In addition to the core logic, we consider the extensions of LTL with bounded-future (BLTL) and past-time (LTL+Past) operators. Even though both extensions improve the succinctness of the logic exponentially, path checking remains efficiently parallelizable: our algorithm for LTL, LTL+Past, and BLTL+Past is in AC^1(logDCFL) ⊆ NC.
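
    The underlying path-checking problem is easy to state: given a finite path and an LTL formula, decide whether the path satisfies the formula. The sketch below (Python; the tuple-based formula syntax and the finite-trace convention for X are assumptions made for illustration) evaluates a few core operators by naive recursion over path positions. It does not reflect the paper's contribution, which is showing that this evaluation can be done in AC^1(logDCFL), i.e., by an efficient parallel algorithm.

    def check_ltl_path(formula, path):
        """Evaluate an LTL formula over a finite path (a list of sets of
        atomic propositions). Finite-trace convention: 'X' is false at the
        last position. Formulas are nested tuples, e.g.
        ('U', ('ap', 'p'), ('ap', 'q')). Illustrative sketch only."""
        n = len(path)

        def holds(f, i):
            op = f[0]
            if op == 'ap':
                return f[1] in path[i]
            if op == 'not':
                return not holds(f[1], i)
            if op == 'and':
                return holds(f[1], i) and holds(f[2], i)
            if op == 'X':    # next
                return i + 1 < n and holds(f[1], i + 1)
            if op == 'U':    # f[1] until f[2]
                return any(holds(f[2], j) and all(holds(f[1], k) for k in range(i, j))
                           for j in range(i, n))
            raise ValueError(f"unknown operator {op!r}")

        return holds(formula, 0)

    # Does 'p until q' hold on the path {p} -> {p} -> {q}?
    print(check_ltl_path(('U', ('ap', 'p'), ('ap', 'q')),
                         [{'p'}, {'p'}, {'q'}]))   # True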

    Distributed-Memory Breadth-First Search on Massive Graphs

    Full text link
    This chapter studies the problem of traversing large graphs using the breadth-first search order on distributed-memory supercomputers. We consider both the traditional level-synchronous top-down algorithm as well as the recently discovered direction-optimizing algorithm. We analyze the performance and scalability trade-offs in using different local data structures such as CSR and DCSC, enabling in-node multithreading, and graph decompositions such as 1D and 2D decomposition.
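
    The direction-optimizing idea (introduced by Beamer, Asanović, and Patterson) switches between top-down expansion of the frontier and bottom-up scanning of unvisited vertices once the frontier becomes edge-heavy. Below is a minimal single-node Python sketch of the switch; the threshold parameter and graph representation are illustrative, and the chapter's distributed-memory, 1D/2D-decomposed versions are not captured here.

    def direction_optimizing_bfs(adj, source, alpha=14):
        """Single-node sketch of direction-optimizing BFS.
        Top-down: frontier vertices push to unvisited neighbours.
        Bottom-up: when the frontier is edge-heavy, unvisited vertices
        instead look for any parent in the frontier. The switching
        threshold 'alpha' is a tuning parameter (value is illustrative)."""
        n = len(adj)
        dist = {source: 0}
        frontier = {source}
        level = 0
        while frontier:
            level += 1
            frontier_edges = sum(len(adj[v]) for v in frontier)
            unvisited = [v for v in range(n) if v not in dist]
            unvisited_edges = sum(len(adj[v]) for v in unvisited)
            if frontier_edges * alpha > unvisited_edges:
                # Bottom-up step: unvisited vertices search for a frontier parent.
                next_frontier = {v for v in unvisited
                                 if any(u in frontier for u in adj[v])}
            else:
                # Top-down step: frontier vertices expand their out-edges.
                next_frontier = {u for v in frontier for u in adj[v]
                                 if u not in dist}
            for v in next_frontier:
                dist[v] = level
            frontier = next_frontier
        return dist

    # Small undirected example graph given as adjacency lists.
    graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
    print(direction_optimizing_bfs(graph, 0))  # {0: 0, 1: 1, 2: 1, 3: 2}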

    Perfectly Oblivious (Parallel) RAM Revisited, and Improved Constructions

    Get PDF
    Oblivious RAM (ORAM) is a technique for compiling any RAM program to an oblivious counterpart, i.e., one whose access patterns do not leak information about the secret inputs. Similarly, Oblivious Parallel RAM (OPRAM) compiles a parallel RAM program to an oblivious counterpart. In this paper, we care about ORAM/OPRAM with perfect security, i.e., the access patterns must be identically distributed no matter what the program's memory request sequence is. In the past, two types of perfect ORAMs/OPRAMs have been considered: constructions whose performance bounds hold in expectation (but may occasionally run more slowly), and constructions whose performance bounds hold deterministically (even though the algorithms themselves are randomized). In this paper, we revisit the performance metrics for perfect ORAM/OPRAM and show novel constructions that achieve asymptotic improvements for all performance metrics. Our first result is a new perfectly secure OPRAM scheme with O(log^3 N / log log N) expected overhead. In comparison, prior literature has been stuck at O(log^3 N) for more than a decade. Next, we show how to construct a perfect ORAM with O(log^3 N / log log N) deterministic simulation overhead. We further show how to make the scheme parallel, resulting in a perfect OPRAM with O(log^4 N / log log N) deterministic simulation overhead. For perfect ORAMs/OPRAMs with deterministic performance bounds, our results achieve subexponential improvement over the state of the art. Specifically, the best known prior scheme incurs more than √N deterministic simulation overhead (Raskin and Simkin, Asiacrypt'19); moreover, their scheme works only for the sequential setting and is not amenable to parallelization. Finally, we additionally consider perfect ORAMs/OPRAMs whose performance bounds hold with high probability. For this new performance metric, we show new constructions whose simulation overhead is upper bounded by O(log^3 N / log log N) except with probability negligible in N, i.e., we prove high-probability performance bounds that match the expected bounds mentioned earlier.
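
    For intuition about what obliviousness requires, the toy sketch below (Python; not from the paper) implements the trivial linear-scan ORAM: every logical access touches every physical cell, so the physical access pattern is identical regardless of which address is requested. Its overhead is O(N) per access; the constructions in the paper achieve perfect security with polylogarithmic overhead instead.

    class LinearScanORAM:
        """Toy perfectly oblivious RAM: every logical access scans all N
        physical cells, so the physical access pattern reveals nothing
        about which address was requested. Overhead is O(N) per access."""

        def __init__(self, n):
            self.cells = [0] * n

        def access(self, op, addr, value=None):
            result = None
            for i in range(len(self.cells)):       # touch every cell, always
                if i == addr:
                    result = self.cells[i]
                    if op == 'write':
                        self.cells[i] = value
                    else:
                        self.cells[i] = self.cells[i]   # dummy re-write
                else:
                    self.cells[i] = self.cells[i]       # dummy re-write
            return result

    ram = LinearScanORAM(8)
    ram.access('write', 3, 42)
    print(ram.access('read', 3))   # 42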

    Modeling Algorithm Performance on Highly-threaded Many-core Architectures

    Get PDF
    The rapid growth of data processing required in various arenas of computation over the past decades necessitates extensive use of parallel computing engines. Among those, highly-threaded many-core machines such as GPUs have become increasingly popular for accelerating a diverse range of data-intensive applications. They feature a large number of hardware threads with low-overhead context switches to hide memory access latencies and therefore provide high computational throughput. However, understanding and harnessing such machines places great challenges on algorithm designers and performance tuners due to the complex interaction of threads and the hierarchical memory subsystems of these machines. The achieved performance jointly depends on the parallelism exploited by the algorithm, the effectiveness of latency hiding, and the utilization of multiprocessors (occupancy). Contemporary work tries to model the performance of GPUs from various aspects with different emphasis and granularity, but no model considers all of these factors together at the same time. This dissertation presents an analytical framework that jointly addresses parallelism, latency hiding, and occupancy for both theoretical and empirical performance analysis of algorithms on highly-threaded many-core machines, so that it can guide both algorithm design and performance tuning. In particular, this framework not only helps to explore and reduce the runtime configuration space for tuning kernel execution on GPUs, but also reflects performance bottlenecks and predicts how the runtime will trend as the problem and other parameters scale. The framework consists of a pair of analytical models, one focusing on higher-level asymptotic algorithm performance on GPUs and the other emphasizing lower-level details about scheduling and runtime configuration. Based on the two models, we have conducted extensive analysis of a large set of algorithms. The two analyses provide interesting results and explain previously unexplained data. In addition, the two models are bridged and combined into a consistent framework that provides an end-to-end methodology for algorithm design, evaluation, comparison, implementation, and fairly accurate prediction of real runtime on GPUs. To demonstrate the viability of our methods, the models are validated with data from implementations of a variety of classic algorithms, including hashing, Bloom filters, all-pairs shortest path, matrix multiplication, FFT, merge sort, list ranking, string matching via suffix tree/array, etc. We evaluate the models' performance across a wide spectrum of parameters, data values, and machines. The results indicate that the models can be effectively used for algorithm performance analysis and runtime prediction on highly-threaded many-core machines.