420 research outputs found

    Deterministic Selection on the Mesh and Hypercube

    In this paper we present efficient deterministic algorithms for selection on mesh-connected computers (referred to as the mesh from here on) and the hypercube. Our algorithm on the mesh runs in time O([n/p] log log p + √p log n), where n is the input size and p is the number of processors. The time bound is significantly better than that of the best existing algorithms when n is large. The run time of our algorithm on the hypercube is O([n/p] log log p + Ts/p log n), where Ts/p is the time needed to sort p elements on a p-node hypercube. In fact, the same algorithm runs on any network in time O([n/p] log log p + Ts/p log n), where Ts/p is the time needed for sorting p keys using p processors (assuming that broadcast and prefix computations take time less than or equal to Ts/p)

    Efficient weighted multiselection in parallel architectures

    ©2002 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE. We study parallel solutions to the problem of weighted multiselection to select r elements on given weighted-ranks from a set S of n weighted elements, where an element is on weighted rank k if it is the smallest element such that the aggregated weight of all elements not greater than it in S is not smaller than k. We propose efficient algorithms on two of the most popular parallel architectures, hypercube and mesh. For a hypercube with p < n processors, we present a parallel algorithm running in $O(n^{\varepsilon} \min\{r, \log p\})$ time for $p = n^{1-\varepsilon}$, $0 < \varepsilon < 1$, which is cost optimal when $r \geq p$. Our algorithm on a $\sqrt{p} \times \sqrt{p}$ mesh runs in $O(\sqrt{p} + \frac{n}{p}\log^3 p)$ time, which is the same as multiselection on mesh when $r \geq \log p$, and thus has the same optimality as multiselection in this case
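    The weighted-rank definition in the abstract above can be made concrete with a small sequential sketch. This is only an illustration of the definition, not the paper's parallel hypercube or mesh algorithm; the function name `weighted_multiselect` and the (value, weight) input layout are assumptions introduced here, and weights are assumed to be positive.

```python
from bisect import bisect_left
from itertools import accumulate


def weighted_multiselect(items, ranks):
    """Return, for each k in ranks, the element of weighted rank k.

    items: list of (value, weight) pairs with positive weights (assumed layout).
    An element has weighted rank k if it is the smallest element whose
    cumulative weight (over all elements not greater than it) is at least k.
    """
    ordered = sorted(items)                            # sort by value
    prefix = list(accumulate(w for _, w in ordered))   # cumulative weights
    result = []
    for k in ranks:
        if not prefix or k > prefix[-1]:
            raise ValueError(f"rank {k} exceeds total weight")
        # First position whose cumulative weight is >= k.
        idx = bisect_left(prefix, k)
        result.append(ordered[idx][0])
    return result


if __name__ == "__main__":
    data = [(5, 2), (1, 1), (3, 4)]
    print(weighted_multiselect(data, [1, 2, 5, 6]))
```

    Sorting once and answering every rank by binary search over the prefix sums costs O(n log n + r log n) sequentially; the paper's contribution is doing this efficiently across p processors, which this sketch does not attempt.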

    Sample sort on meshes

    This paper provides an overview of lower and upper bounds for mesh-connected processor networks. Most attention goes to routing and sorting problems, but other problems are mentioned as well. Results from 1977 to 1995 are covered. We provide numerous results, references and open problems. The text is completed with an index. This is a worked-out version of the author's contribution to a joint paper with Grammatikakis, Hsu and Kraetzl on multicomputer routing, submitted to JPDC

    Aspects of practical implementations of PRAM algorithms

    The PRAM is a shared memory model of parallel computation which abstracts away from inessential engineering details. It provides a very simple architecture-independent model and provides a good programming environment. Theoreticians of the computer science community have proved that it is possible to emulate the theoretical PRAM model using current technology. Solutions have been found for effectively interconnecting processing elements, for routing data on these networks and for distributing the data among memory modules without hotspots. This thesis reviews this emulation and the possibilities it provides for large scale general purpose parallel computation. The emulation employs a bridging model which acts as an interface between the actual hardware and the PRAM model. We review the evidence that such a scheme can achieve scalable parallel performance and portable parallel software and that PRAM algorithms can be optimally implemented on such practical models. In the course of this review we present the following new results: 1. Concerning parallel approximation algorithms, we describe an NC algorithm for finding an approximation to a minimum weight perfect matching in a complete weighted graph. The algorithm is conceptually very simple and it is also the first NC-approximation algorithm for the task with a sub-linear performance ratio. 2. Concerning graph embedding, we describe dense edge-disjoint embeddings of the complete binary tree with n leaves in the following n-node communication networks: the hypercube, the de Bruijn and shuffle-exchange networks and the 2-dimensional mesh. In the embeddings the maximum distance from a leaf to the root of the tree is asymptotically optimally short. The embeddings facilitate efficient implementation of many PRAM algorithms on networks employing these graphs as interconnection networks. 3. Concerning bulk synchronous algorithmics, we describe scalable transportable algorithms for the following three commonly required types of computation: balanced tree computations, Fast Fourier Transforms and matrix multiplications

    Parallel Weighted Random Sampling

    Data structures for efficient sampling from a set of weighted items are an important building block of many applications. However, few parallel solutions are known. We close many of these gaps both for shared-memory and distributed-memory machines. We give efficient, fast, and practicable algorithms for sampling single items, k items with/without replacement, permutations, subsets, and reservoirs. We also give improved sequential algorithms for alias table construction and for sampling with replacement. Experiments on shared-memory parallel machines with up to 158 threads show near linear speedups both for construction and queries
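    For readers unfamiliar with the alias tables mentioned in the abstract above, the following is a minimal sequential sketch of the classic alias method (in the style usually attributed to Walker and Vose). It is not the improved or parallel construction from the paper; the function names `build_alias_table` and `sample` are hypothetical, and weights are assumed to be positive.

```python
import random


def build_alias_table(weights):
    """Build (prob, alias) arrays for O(1) weighted sampling.

    Classic sequential alias construction; an illustrative sketch only.
    """
    n = len(weights)
    total = sum(weights)
    scaled = [w * n / total for w in weights]  # average bucket mass is 1
    prob = [1.0] * n
    alias = list(range(n))
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s = small.pop()
        l = large.pop()
        prob[s] = scaled[s]   # bucket s keeps its own mass ...
        alias[s] = l          # ... and is topped up from item l
        scaled[l] -= 1.0 - scaled[s]
        (small if scaled[l] < 1.0 else large).append(l)
    # Leftover buckets keep prob 1.0 (absorbs floating-point rounding).
    return prob, alias


def sample(prob, alias, rng=random):
    """Draw one index: pick a bucket uniformly, then flip a biased coin."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]


if __name__ == "__main__":
    prob, alias = build_alias_table([1, 2, 1])
    rng = random.Random(0)
    print([sample(prob, alias, rng) for _ in range(10)])
```

    Construction takes linear time and each query takes constant time, which is why the alias table is a natural building block for sampling with replacement.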

    Shared memory with hidden latency on a family of mesh-like networks


    Towards better algorithms for parallel backtracking

    Many algorithms in operations research and artificial intelligence are based on depth first search in implicitly defined trees. For parallelizing these algorithms, a load balancing scheme is needed which is able to evenly distribute parts of an irregularly shaped tree over the processors. It should work with minimal interprocessor communication and without prior knowledge of the tree's shape. Previously known load balancing algorithms either require sending a message for each tree node or they only work efficiently for large search trees. This paper introduces new randomized dynamic load balancing algorithms for tree structured computations, a generalization of backtrack search. These algorithms only need to communicate when necessary and have an asymptotically optimal scalability for many important cases. They work on hypercubes, butterflies, meshes and many other architectures

    On the implementation of P-RAM algorithms on feasible SIMD computers

    The P-RAM model of computation has proved to be a very useful theoretical model for exploiting and extracting inherent parallelism in problems and thus for designing parallel algorithms. Therefore, it becomes very important to examine whether results obtained for such a model can be translated onto machines considered to be more realistic in the face of current technological constraints. In this thesis, we show how the implementation of many techniques and algorithms designed for the P-RAM can be achieved on the feasible SIMD class of computers. The first investigation concerns classes of problems solvable on the P-RAM model using the recursive techniques of compression, tree contraction and 'divide and conquer'. For such problems, specific methods are emphasised to achieve efficient implementations on some SIMD architectures. Problems such as list ranking, polynomial and expression evaluation are shown to have efficient solutions on the 2-dimensional mesh-connected computer. The balanced binary tree technique is widely employed to solve many problems in the P-RAM model. By proposing an implicit embedding of the binary tree of size n on a (√n x √n) mesh-connected computer (contrary to using the usual H-tree approach, which requires a mesh of size ≈ (2√n x 2√n)), we show that many of the problems solvable using this technique can be efficiently implemented on this architecture. Two efficient O(√n) algorithms for solving the bracket matching problem are presented. Consequently, the problems of expression evaluation (where the expression is given in an array form), evaluating algebraic expressions with a carrier of constant bounded size and parsing expressions of both bracket and input driven languages are all shown to have efficient solutions on the 2-dimensional mesh-connected computer.
    Dealing with non-tree structured computations, we show that the Eulerian tour problem for a given graph with m edges and maximum vertex degree d can be solved in O(d√n) parallel time on the 2-dimensional mesh-connected computer. A way to increase the processor utilisation on the 2-dimensional mesh-connected computer is also presented. The method suggested consists of pipelining sets of iteratively solvable problems, each of which at each step of its execution uses only a fraction of the available PEs

    Large-Scale Sorting in Uniform Memory Hierarchies

    We present several efficient algorithms for sorting on the uniform memory hierarchy (UMH), introduced by Alpern, Carter, and Feig, and its parallelization P-UMH. We give optimal and nearly-optimal algorithms for a wide range of bandwidth degradations, including a parsimonious algorithm for constant bandwidth. We also develop optimal sorting algorithms for all bandwidths for other versions of UMH and P-UMH, including natural restrictions we introduce called RUMH and P-RUMH, which more closely correspond to current programming languages