
    Parallel Algorithm and Dynamic Exponent for Diffusion-limited Aggregation

    A parallel algorithm for diffusion-limited aggregation (DLA) is described and analyzed from the perspective of computational complexity. The dynamic exponent z of the algorithm is defined with respect to the probabilistic parallel random-access machine (PRAM) model of parallel computation according to T \sim L^{z}, where L is the cluster size, T is the running time, and the algorithm uses a number of processors polynomial in L. It is argued that z = D - D_2/2, where D is the fractal dimension and D_2 is the second generalized dimension. Simulations of DLA are carried out to measure D_2 and to test scaling assumptions employed in the complexity analysis of the parallel algorithm. It is plausible that the parallel algorithm attains the minimum possible value of the dynamic exponent, in which case z characterizes the intrinsic history dependence of DLA. (Comment: 24 pages, RevTeX, and 2 figures. This version includes a major improvement to the algorithm and a smaller dynamic exponent.)
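
    As a concrete point of reference, the explicit sequential simulation against which the parallel algorithm is measured can be sketched in a few lines. The toy DLA grower below is an illustration only, not the paper's PRAM algorithm; the function name and parameters (launch radius, kill radius, particle count) are our own choices, and the step counter is a crude stand-in for the running time T in T \sim L^{z}.

```python
import math
import random

def grow_dla(n_particles=300):
    """Grow a DLA cluster by sticking random walkers; return (sites, steps)."""
    cluster = {(0, 0)}          # seed particle at the origin
    radius = 1.0                # radius of the cluster grown so far
    steps = 0                   # total walk steps: a rough proxy for T
    moves = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    for _ in range(n_particles):
        stuck = False
        while not stuck:
            # launch the walker on a circle just outside the cluster
            theta = random.uniform(0.0, 2.0 * math.pi)
            x = int(round((radius + 5) * math.cos(theta)))
            y = int(round((radius + 5) * math.sin(theta)))
            while True:
                dx, dy = random.choice(moves)
                x, y = x + dx, y + dy
                steps += 1
                if math.hypot(x, y) > 2 * radius + 20:
                    break       # wandered too far: abandon and relaunch
                if any((x + ax, y + ay) in cluster for ax, ay in moves):
                    cluster.add((x, y))            # stick on first contact
                    radius = max(radius, math.hypot(x, y))
                    stuck = True
                    break
    return cluster, steps

cluster, T = grow_dla()
print(len(cluster), "sites grown in", T, "walk steps")
```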

    Internal Diffusion-Limited Aggregation: Parallel Algorithms and Complexity

    The computational complexity of internal diffusion-limited aggregation (internal DLA) is examined from both a theoretical and a practical point of view. We show that for two or more dimensions, the problem of predicting the cluster from a given set of paths is complete for the complexity class CC, the subset of P characterized by circuits composed of comparator gates. CC-completeness is believed to imply that, in the worst case, growing a cluster of size n requires time polynomial in n even on a parallel computer. A parallel relaxation algorithm is presented that uses the fact that clusters are nearly spherical to guess the cluster from a given set of paths, and then corrects defects in the guessed cluster through a non-local annihilation process. The parallel running time of the relaxation algorithm for two-dimensional internal DLA is studied by simulating it on a serial computer. The numerical results are compatible with a running time that is either polylogarithmic in n or a small power of n. Thus the computational resources needed to grow large clusters are significantly less on average than the worst-case analysis would suggest. For a parallel machine with k processors, we show that random clusters in d dimensions can be generated in O((n/k + log k) n^{2/d}) steps. This is a significant speedup over explicit sequential simulation, which takes O(n^{1+2/d}) time on average. Finally, we show that in one dimension internal DLA can be predicted in O(log n) parallel time, and so is in the complexity class NC.
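
    The explicit sequential simulation whose O(n^{1+2/d}) average running time the abstract cites is simple enough to state directly. A minimal sketch follows (the function name and particle count are illustrative; the parallel relaxation algorithm itself is considerably more involved):

```python
import random

def grow_internal_dla(n=500):
    """Internal DLA: each walker starts at the origin and occupies the
    first vacant lattice site it reaches; clusters come out nearly round."""
    occupied = {(0, 0)}
    moves = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    for _ in range(n - 1):
        x, y = 0, 0
        while (x, y) in occupied:   # walk until the first vacant site
            dx, dy = random.choice(moves)
            x, y = x + dx, y + dy
        occupied.add((x, y))
    return occupied

print(len(grow_internal_dla(500)))  # 500 occupied sites
```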

    Aspects of practical implementations of PRAM algorithms

    The PRAM is a shared-memory model of parallel computation which abstracts away from inessential engineering details. It provides a very simple, architecture-independent model and a good programming environment. Theoretical computer scientists have shown that it is possible to emulate the PRAM model using current technology: solutions have been found for effectively interconnecting processing elements, for routing data on these networks, and for distributing data among memory modules without hotspots. This thesis reviews this emulation and the possibilities it provides for large-scale general-purpose parallel computation. The emulation employs a bridging model which acts as an interface between the actual hardware and the PRAM model. We review the evidence that such a scheme can achieve scalable parallel performance and portable parallel software, and that PRAM algorithms can be optimally implemented on such practical models. In the course of this review we present the following new results: 1. Concerning parallel approximation algorithms, we describe an NC algorithm for finding an approximation to a minimum-weight perfect matching in a complete weighted graph. The algorithm is conceptually very simple, and it is also the first NC approximation algorithm for the task with a sub-linear performance ratio. 2. Concerning graph embedding, we describe dense edge-disjoint embeddings of the complete binary tree with n leaves in the following n-node communication networks: the hypercube, the de Bruijn and shuffle-exchange networks, and the 2-dimensional mesh. In these embeddings the maximum distance from a leaf to the root of the tree is asymptotically optimally short. The embeddings facilitate efficient implementation of many PRAM algorithms on networks employing these graphs as interconnection networks. 3. Concerning bulk-synchronous algorithmics, we describe scalable, transportable algorithms for three commonly required types of computation: balanced tree computations, Fast Fourier Transforms, and matrix multiplications.
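
    To make the "balanced tree computations" named among the bulk-synchronous results concrete, here is a minimal serial emulation of the pattern; each list comprehension plays the role of one synchronous superstep in which all adjacent pairs are combined in parallel. This is our own illustrative sketch, not code from the thesis.

```python
def tree_sum(values):
    """Balanced-tree reduction: n values summed in ceil(log2 n) rounds,
    the shape a PRAM or BSP implementation would use."""
    a = list(values)
    rounds = 0
    while len(a) > 1:
        # one round: every adjacent pair is combined "in parallel"
        a = [a[i] + a[i + 1] if i + 1 < len(a) else a[i]
             for i in range(0, len(a), 2)]
        rounds += 1
    return a[0], rounds

total, depth = tree_sum(range(16))
print(total, depth)  # 120 4
```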

    Some Optimally Adaptive Parallel Graph Algorithms on EREW PRAM Model

    The study of graph algorithms is an important area of research in computer science, since graphs offer useful tools for modeling many real-world situations. The commercial availability of parallel computers has led to the development of efficient parallel graph algorithms. Using an exclusive-read, exclusive-write (EREW) parallel random-access machine (PRAM) with a fixed number of processors as the computation model, we design and analyze parallel algorithms for seven undirected graph problems: connected components, spanning forest, fundamental cycle set, bridges, bipartiteness, assignment problems, and approximate vertex coloring. For all but the last two problems, the input data structure is an unordered list of edges, and divide-and-conquer is the paradigm for designing the algorithms. One of the algorithms for the assignment problem makes use of an appropriate variant of the dynamic programming strategy. An elegant data structure, called the adjacency list matrix, used in a vertex-coloring algorithm avoids the sequential nature of linked adjacency lists. Each of the proposed algorithms achieves optimal speedup by choosing an optimal granularity (thus exploiting maximum parallelism), which depends on the density or the number of vertices of the given graph. The processor-(time)^2 product is identified as a useful parameter for measuring the cost-effectiveness of a parallel algorithm, and we derive a lower bound on this measure for each of our algorithms.
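
    For orientation, the input convention used above, an unordered list of edges, is easy to pin down with a sequential baseline. The sketch below solves connected components with union-find; it is a reference point only (the dissertation's algorithms are parallel and divide-and-conquer), and the names are ours.

```python
def connected_components(n, edges):
    """Label the components of an n-vertex graph given an unordered edge list."""
    parent = list(range(n))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving keeps trees shallow
            v = parent[v]
        return v

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv                # merge the two components

    return [find(v) for v in range(n)]

print(connected_components(5, [(0, 1), (1, 2), (3, 4)]))
# vertices 0-2 share one label, vertices 3-4 another
```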

    A Practical Scalable Shared-Memory Parallel Algorithm for Computing Minimum Spanning Trees


    Oblivious Network RAM and Leveraging Parallelism to Achieve Obliviousness

    Oblivious RAM (ORAM) is a cryptographic primitive that allows a trusted CPU to securely access untrusted memory, such that the access patterns reveal nothing about sensitive data. ORAM is known to have broad applications in secure processor design and secure multi-party computation for big data. Unfortunately, due to a logarithmic lower bound by Goldreich and Ostrovsky (Journal of the ACM, '96), ORAM is bound to incur a moderate cost in practice. In particular, with the latest developments in ORAM constructions, we are quickly approaching this limit, and the room for performance improvement is small. In this paper, we consider new models of computation in which the cost of obliviousness can be fundamentally reduced in comparison with the standard ORAM model. We propose the Oblivious Network RAM model of computation, where a CPU communicates with multiple memory banks, such that the adversary observes only which bank the CPU is communicating with, but not the address offset within each memory bank. In other words, obliviousness within each bank comes for free, either because the architecture prevents a malicious party from observing the address accessed within a bank, or because another solution is used to obfuscate memory accesses within each bank; hence we only need to obfuscate the communication patterns between the CPU and the memory banks. We present new constructions for obliviously simulating general or parallel programs in the Network RAM model, and we describe applications of our new model in secure processor design and in distributed storage applications with a network adversary.
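
    The leakage model is easy to state in code. In the toy sketch below (the bank count, bank size, and all names are illustrative, and a real construction would additionally randomize the mapping of items to banks), an access to a logical address reveals the bank index but never the offset within the bank.

```python
NUM_BANKS = 8
BANK_SIZE = 1024

banks = [[0] * BANK_SIZE for _ in range(NUM_BANKS)]
observed = []  # the trace a network adversary sees

def read(addr):
    bank, offset = divmod(addr, BANK_SIZE)
    observed.append(bank)        # visible: which bank was contacted
    return banks[bank][offset]   # hidden: the offset within that bank

read(3 * BANK_SIZE + 17)
read(3 * BANK_SIZE + 900)
print(observed)  # [3, 3] -- the two distinct offsets are indistinguishable
```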

    Complexity, parallel computation and statistical physics

    The intuition that a long history is required for the emergence of complexity in natural systems is formalized using the notion of depth. The depth of a system is defined in terms of the number of parallel computational steps needed to simulate it. Depth provides an objective, irreducible measure of history applicable to systems of the kind studied in statistical physics. It is argued that physical complexity cannot occur in the absence of substantial depth and that depth is a useful proxy for physical complexity. The ideas are illustrated for a variety of systems in statistical physics. (Comment: 21 pages, 7 figures.)

    Efficient parallel computation on multiprocessors with optical interconnection networks

    This dissertation studies optical interconnection networks: their architecture, addressing schemes, and computation and communication capabilities. We focus on a simple but powerful optical interconnection network model, the Linear Array with a Reconfigurable Pipelined Bus System (LARPBS). We extend the LARPBS model to a simplified higher-dimensional LARPBS and provide a set of basic computation operations. We then study the following two groups of parallel computation problems on both one-dimensional and multi-dimensional LARPBSs: parallel comparison problems, including sorting, merging, and selection; and Boolean matrix multiplication, transitive closure, and their applications to connected component problems. We implement an optimal sorting algorithm on an n-processor LARPBS. With this optimal sorting algorithm at our disposal, we study the sorting problem for higher-dimensional LARPBSs and obtain the following results:
    • An optimal basic Columnsort algorithm on a 2D LARPBS.
    • Two optimal two-way merge sort algorithms on a 2D LARPBS.
    • An optimal multi-way merge sort algorithm on a 2D LARPBS.
    • An optimal generalized column sort algorithm on a 2D LARPBS.
    • An optimal generalized column sort algorithm on a 3D LARPBS.
    • An optimal 5-phase sorting algorithm on a 3D LARPBS.
    Results for selection problems are as follows:
    • A constant-time maximum-finding algorithm on an LARPBS.
    • An optimal maximum-finding algorithm on an LARPBS.
    • An O((log log n)^2)-time parallel selection algorithm on an LARPBS.
    • An O(k(log log n)^2)-time parallel multi-selection algorithm on an LARPBS.
    While studying the computation and communication properties of the LARPBS model, we find that Boolean matrix multiplication and its applications to graph problems are another set of problems that can be solved efficiently on the LARPBS. The results we have obtained in this area are as follows (a serial sketch of the transitive-closure idea appears below):
    • A constant-time Boolean matrix multiplication algorithm.
    • An O(log n)-time transitive closure algorithm.
    • An O(log n)-time connected components algorithm.
    • An O(log n)-time strongly connected components algorithm.
    These results demonstrate the strong computation and communication power of optical interconnection networks.
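
    The route from constant-time Boolean matrix multiplication to O(log n)-time transitive closure is repeated squaring: squaring (I ∨ A) doubles the path length captured, so O(log n) squarings suffice. Below is a serial sketch of that idea; an LARPBS would perform each product in constant time, whereas here each costs O(n^3).

```python
def bool_matmul(A, B):
    """Boolean matrix product (serial O(n^3); constant time on an LARPBS)."""
    n = len(A)
    return [[any(A[i][k] and B[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]

def transitive_closure(adj):
    """Reflexive-transitive closure via O(log n) squarings of (I or A)."""
    n = len(adj)
    M = [[bool(adj[i][j]) or i == j for j in range(n)] for i in range(n)]
    length = 1
    while length < n:
        M = bool_matmul(M, M)  # each squaring doubles the path length covered
        length *= 2
    return M

adj = [[0, 1, 0],
       [0, 0, 1],
       [0, 0, 0]]
print(transitive_closure(adj)[0])  # [True, True, True]: 0 reaches 1 and 2
```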

    Oblivious Hashing Revisited, and Applications to Asymptotically Efficient ORAM and OPRAM

    Oblivious RAM (ORAM) is a powerful cryptographic building block that allows a program to provably hide its access patterns to sensitive data. Since the original proposal of ORAM by Goldreich and Ostrovsky, numerous improvements have been made. To date, the best asymptotic overhead achievable for general block sizes is O(\log^2 N/\log \log N), due to an elegant scheme by Kushilevitz et al., which in turn relies on the oblivious Cuckoo hashing scheme of Goodrich and Mitzenmacher. In this paper, we make the following contributions: we first revisit the prior O(\log^2 N/\log \log N)-overhead ORAM result and demonstrate that it is somewhat incomplete, owing to the subtle incompleteness of a core building block, namely Goodrich and Mitzenmacher's oblivious Cuckoo hashing scheme. Although we show how to patch the prior result so that Goodrich and Mitzenmacher's elegant blueprint for oblivious Cuckoo hashing can be fully realized, it is clear that the extreme complexity of oblivious Cuckoo hashing has made understanding, implementation, and proofs difficult. We show that there is a conceptually simple O(\log^2 N/\log \log N)-overhead ORAM that dispenses with oblivious Cuckoo hashing entirely, and that such a scheme lends itself to further extensions. Specifically, we obtain the first O(\log^2 N/\log \log N) Oblivious Parallel RAM (OPRAM) scheme, thus not only matching the performance of the best known sequential ORAM, but also achieving super-logarithmic improvements over known OPRAM schemes.
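
    For readers unfamiliar with the building block at issue: ordinary (non-oblivious) Cuckoo hashing keeps one candidate slot per key in each of two tables and resolves collisions by evicting along a path. The sketch below shows only this underlying structure; the obliviousness requirement, that the build and lookups reveal nothing about the keys, is exactly what makes the schemes discussed above subtle, and the class and parameter names here are our own.

```python
import random

class CuckooTable:
    """Plain cuckoo hashing: two tables, one candidate slot per key in each."""
    def __init__(self, size=11):
        self.size = size
        self.tables = [[None] * size, [None] * size]
        self.seeds = [random.random(), random.random()]

    def _slot(self, key, t):
        return hash((self.seeds[t], key)) % self.size

    def lookup(self, key):
        return any(self.tables[t][self._slot(key, t)] == key for t in (0, 1))

    def insert(self, key, max_kicks=32):
        t = 0
        for _ in range(max_kicks):
            slot = self._slot(key, t)
            # place the key, evicting whoever occupied the slot
            key, self.tables[t][slot] = self.tables[t][slot], key
            if key is None:
                return True
            t = 1 - t  # the evicted key tries its slot in the other table
        return False   # a full implementation would rehash here

table = CuckooTable()
for k in ("a", "b", "c"):
    table.insert(k)
print(all(table.lookup(k) for k in ("a", "b", "c")))  # True
```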

    Models for Parallel Computation in Multi-Core, Heterogeneous, and Ultra Wide-Word Architectures

    Multi-core processors have become the dominant processor architecture, with 2, 4, and 8 cores on a chip widely available and an increasing number of cores predicted for the future. In addition, the decreasing cost and increasing programmability of Graphics Processing Units (GPUs) have made them an accessible source of parallel processing power in general-purpose computing. Among the many research challenges this scenario has raised are fundamental problems related to the theoretical modeling of computation in these architectures. In this thesis we study several aspects of computation in modern parallel architectures, from the modeling of computation in multi-cores and heterogeneous platforms, to multi-core cache management strategies, through to the proposal of an architecture that exploits bit-parallelism on thousands of bits.

    Observing that in practice multi-cores have a small number of cores, we propose a model of low-degree parallelism for these architectures. We argue that assuming a small number of processors (logarithmic in the problem's input size) simplifies the design of parallel algorithms. We show that in this model a large class of divide-and-conquer and dynamic programming algorithms can be parallelized with simple modifications to sequential programs, while achieving optimal parallel speedups. We further explore low-degree parallelism, providing evidence of fundamental differences in practice and theory between systems with a sublinear and a linear number of processors, and suggesting a sharp theoretical gap between the classes of problems that are efficiently parallelizable in each case.

    Efficient strategies for managing shared caches play a crucial role in multi-core performance. We propose a model for paging in multi-core shared caches which extends classical paging to a setting in which several threads share the cache. We show that in this setting traditional cache management policies perform poorly, and that any effective strategy must partition the cache among threads, with a partition that adapts dynamically to the demands of each thread. Inspired by the shared-cache setting, we introduce the minimum cache usage problem, an extension of classical sequential paging in which algorithms must account for the amount of cache they use. This cache-aware model seeks algorithms with good performance in terms of both faults and the amount of cache used, and has applications in energy-efficient caching and in shared-cache scenarios.

    The wide availability of GPUs has added to the parallel power of multi-cores; however, most applications underutilize the available resources. We propose a model for hybrid computation in heterogeneous systems with multi-cores and GPUs, and describe strategies for generic parallelization and efficient scheduling of a large class of divide-and-conquer algorithms.

    Lastly, we introduce the Ultra-Wide Word architecture and model, an extension of the word-RAM model that allows constant-time operations on thousands of bits in parallel. We show that a large class of existing algorithms can be implemented in the Ultra-Wide Word model, achieving speedups comparable to those of multi-threaded computations while avoiding the more difficult aspects of parallel programming.
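
    The flavor of the Ultra-Wide Word model can be previewed with ordinary bit-parallel code, since Python integers already behave like arbitrarily wide words. The classic shift-and string matcher below advances all pattern positions with a constant number of word operations per character; it is a generic illustration of bit parallelism, not a construction from the thesis.

```python
def bitap_search(text, pattern):
    """Shift-and matching: bit i of `state` records that pattern[:i+1]
    ends at the current text position; one wide-word update per character."""
    m = len(pattern)
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    state, hits = 0, []
    for j, ch in enumerate(text):
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & (1 << (m - 1)):
            hits.append(j - m + 1)   # a full match ends at position j
    return hits

print(bitap_search("abracadabra", "abra"))  # [0, 7]
```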