2,032 research outputs found

    Building Segment Trees in Parallel

    Get PDF
    The segment tree is a simple and important data structure in computational geometry [7,11]. We present an experimental study of parallel algorithms for building segment trees. We analyze the algorithms in the context of both the PRAM (Parallel Random Access Machine) and hypercube architectures. In addition, we present performance data for implementations developed on the Connection Machine. We compare two different parallel alforitms, and we also compare our parallel algorithms to a good sequential algorithm for doing the same job. In this way, we evaluate the overall efficiency of our parallel methods. Our performance results illustrates the problems involved in using popular machine models(PRAM) and analysis techniques (asymptotic efficiency) to predict the performance of parallel algorithms on real machines. We present two different analyses of our algorithms and show that neither is effective in predicting the actual performance numbers that we obtained

    Parallel RAM from Cyclic Circuits

    Full text link
    Known simulations of random access machines (RAMs) or parallel RAMs (PRAMs) by Boolean circuits incur significant polynomial blowup, due to the need to repeatedly simulate accesses to a large main memory. Consider two modifications to Boolean circuits: (1) remove the restriction that circuit graphs are acyclic and (2) enhance AND gates such that they output zero eagerly. If an AND gate has a zero input, it 'short circuits' and outputs zero without waiting for its second input. We call this the cyclic circuit model. Note, circuits in this model remain combinational, as they do not allow wire values to change over time. We simulate a bounded-word-size PRAM via a cyclic circuit, and the blowup from the simulation is only polylogarithmic. Consider a PRAM program PP that on a length nn input uses an arbitrary number of processors to manipulate words of size Θ(logn)\Theta(\log n) bits and then halts within W(n)W(n) work. We construct a size-O(W(n)log4n)O(W(n)\cdot \log^4 n) cyclic circuit that simulates PP. Suppose that on a particular input, PP halts in time TT; our circuit computes the same output within TO(log3n)T \cdot O(\log^3 n) gate delay. This implies theoretical feasibility of powerful parallel machines. Cyclic circuits can be implemented in hardware, and our circuit achieves performance within polylog factors of PRAM. Our simulated PRAM synchronizes processors by simply leveraging logical dependencies between wires

    Parallel Algorithm and Dynamic Exponent for Diffusion-limited Aggregation

    Full text link
    A parallel algorithm for ``diffusion-limited aggregation'' (DLA) is described and analyzed from the perspective of computational complexity. The dynamic exponent z of the algorithm is defined with respect to the probabilistic parallel random-access machine (PRAM) model of parallel computation according to TLzT \sim L^{z}, where L is the cluster size, T is the running time, and the algorithm uses a number of processors polynomial in L\@. It is argued that z=D-D_2/2, where D is the fractal dimension and D_2 is the second generalized dimension. Simulations of DLA are carried out to measure D_2 and to test scaling assumptions employed in the complexity analysis of the parallel algorithm. It is plausible that the parallel algorithm attains the minimum possible value of the dynamic exponent in which case z characterizes the intrinsic history dependence of DLA.Comment: 24 pages Revtex and 2 figures. A major improvement to the algorithm and smaller dynamic exponent in this versio

    Complexity, parallel computation and statistical physics

    Full text link
    The intuition that a long history is required for the emergence of complexity in natural systems is formalized using the notion of depth. The depth of a system is defined in terms of the number of parallel computational steps needed to simulate it. Depth provides an objective, irreducible measure of history applicable to systems of the kind studied in statistical physics. It is argued that physical complexity cannot occur in the absence of substantial depth and that depth is a useful proxy for physical complexity. The ideas are illustrated for a variety of systems in statistical physics.Comment: 21 pages, 7 figure

    Parallel Weighted Random Sampling

    Get PDF
    Data structures for efficient sampling from a set of weighted items are an important building block of many applications. However, few parallel solutions are known. We close many of these gaps both for shared-memory and distributed-memory machines. We give efficient, fast, and practicable algorithms for sampling single items, k items with/without replacement, permutations, subsets, and reservoirs. We also give improved sequential algorithms for alias table construction and for sampling with replacement. Experiments on shared-memory parallel machines with up to 158 threads show near linear speedups both for construction and queries

    Internal Diffusion-Limited Aggregation: Parallel Algorithms and Complexity

    Get PDF
    The computational complexity of internal diffusion-limited aggregation (DLA) is examined from both a theoretical and a practical point of view. We show that for two or more dimensions, the problem of predicting the cluster from a given set of paths is complete for the complexity class CC, the subset of P characterized by circuits composed of comparator gates. CC-completeness is believed to imply that, in the worst case, growing a cluster of size n requires polynomial time in n even on a parallel computer. A parallel relaxation algorithm is presented that uses the fact that clusters are nearly spherical to guess the cluster from a given set of paths, and then corrects defects in the guessed cluster through a non-local annihilation process. The parallel running time of the relaxation algorithm for two-dimensional internal DLA is studied by simulating it on a serial computer. The numerical results are compatible with a running time that is either polylogarithmic in n or a small power of n. Thus the computational resources needed to grow large clusters are significantly less on average than the worst-case analysis would suggest. For a parallel machine with k processors, we show that random clusters in d dimensions can be generated in O((n/k + log k) n^{2/d}) steps. This is a significant speedup over explicit sequential simulation, which takes O(n^{1+2/d}) time on average. Finally, we show that in one dimension internal DLA can be predicted in O(log n) parallel time, and so is in the complexity class NC

    Round Compression for Parallel Matching Algorithms

    Get PDF
    For over a decade now we have been witnessing the success of {\em massive parallel computation} (MPC) frameworks, such as MapReduce, Hadoop, Dryad, or Spark. One of the reasons for their success is the fact that these frameworks are able to accurately capture the nature of large-scale computation. In particular, compared to the classic distributed algorithms or PRAM models, these frameworks allow for much more local computation. The fundamental question that arises in this context is though: can we leverage this additional power to obtain even faster parallel algorithms? A prominent example here is the {\em maximum matching} problem---one of the most classic graph problems. It is well known that in the PRAM model one can compute a 2-approximate maximum matching in O(logn)O(\log{n}) rounds. However, the exact complexity of this problem in the MPC framework is still far from understood. Lattanzi et al. showed that if each machine has n1+Ω(1)n^{1+\Omega(1)} memory, this problem can also be solved 22-approximately in a constant number of rounds. These techniques, as well as the approaches developed in the follow up work, seem though to get stuck in a fundamental way at roughly O(logn)O(\log{n}) rounds once we enter the near-linear memory regime. It is thus entirely possible that in this regime, which captures in particular the case of sparse graph computations, the best MPC round complexity matches what one can already get in the PRAM model, without the need to take advantage of the extra local computation power. In this paper, we finally refute that perplexing possibility. That is, we break the above O(logn)O(\log n) round complexity bound even in the case of {\em slightly sublinear} memory per machine. In fact, our improvement here is {\em almost exponential}: we are able to deliver a (2+ϵ)(2+\epsilon)-approximation to maximum matching, for any fixed constant ϵ>0\epsilon>0, in O((loglogn)2)O((\log \log n)^2) rounds
    corecore