939 research outputs found

    Sorting Integers on the AP1000

    Full text link
    Sorting is one of the classic problems of computer science. Whilst well understood on sequential machines, the diversity of architectures amongst parallel systems means that algorithms do not perform uniformly on all platforms. This document describes the implementation of a radix based algorithm for sorting positive integers on a Fujitsu AP1000 Supercomputer, which was constructed as an entry in the Joint Symposium on Parallel Processing (JSPP) 1994 Parallel Software Contest (PSC94). Brief consideration is also given to a full radix sort conducted in parallel across the machine.Comment: 1994 Project Report, 23 page

    Lightweight MPI Communicators with Applications to Perfectly Balanced Quicksort

    Get PDF
    MPI uses the concept of communicators to connect groups of processes. It provides nonblocking collective operations on communicators to overlap communication and computation. Flexible algorithms demand flexible communicators. E.g., a process can work on different subproblems within different process groups simultaneously, new process groups can be created, or the members of a process group can change. Depending on the number of communicators, the time for communicator creation can drastically increase the running time of the algorithm. Furthermore, a new communicator synchronizes all processes as communicator creation routines are blocking collective operations. We present RBC, a communication library based on MPI, that creates range-based communicators in constant time without communication. These RBC communicators support (non)blocking point-to-point communication as well as (non)blocking collective operations. Our experiments show that the library reduces the time to create a new communicator by a factor of more than 400 whereas the running time of collective operations remains about the same. We propose Janus Quicksort, a distributed sorting algorithm that avoids any load imbalances. We improved the performance of this algorithm by a factor of 15 for moderate inputs by using RBC communicators. Finally, we discuss different approaches to bring nonblocking (local) communicator creation of lightweight (range-based) communicators into MPI

    Deterministic Selection on the Mesh and Hypercube

    Get PDF
    In this paper we present efficient deterministic algorithms for selection on the mesh connected computers (referred to as the mesh from hereon) and the hypercube. Our algorithm on the mesh runs in time O([n/p] log logp + √p logn) where n is the input size and p is the number of processors. The time bound is significantly better than that of the best existing algorithms when n is large. The run time of our algorithm on the hypercube is O ([n/p] log log p + Ts/p log nM/em\u3e), where Ts/p is the time needed to sort p element on a p-node hypercube. In fact, the same algorithm runs on an network in time O([n/p] log log p +Ts/p log), where Ts/p is the time needed for sorting p keys using p processors (assuming that broadcast and prefix computations take time less than or equal to Ts/p

    Robust massively parallel sorting

    Get PDF

    Quasirandom Load Balancing

    Full text link
    We propose a simple distributed algorithm for balancing indivisible tokens on graphs. The algorithm is completely deterministic, though it tries to imitate (and enhance) a random algorithm by keeping the accumulated rounding errors as small as possible. Our new algorithm surprisingly closely approximates the idealized process (where the tokens are divisible) on important network topologies. On d-dimensional torus graphs with n nodes it deviates from the idealized process only by an additive constant. In contrast to that, the randomized rounding approach of Friedrich and Sauerwald (2009) can deviate up to Omega(polylog(n)) and the deterministic algorithm of Rabani, Sinclair and Wanka (1998) has a deviation of Omega(n^{1/d}). This makes our quasirandom algorithm the first known algorithm for this setting which is optimal both in time and achieved smoothness. We further show that also on the hypercube our algorithm has a smaller deviation from the idealized process than the previous algorithms.Comment: 25 page

    Efficient weighted multiselection in parallel architectures

    Get PDF
    ©2002 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.We study parallel solutions to the problem of weighted multiselection to select r elements on given weighted-ranks from a set S of n weighted elements, where an element is on weighted rank k if it is the smallest element such that the aggregated weight of all elements not greater than it in S is not smaller than k. We propose efficient algorithms on two of the most popular parallel architectures, hypercube and mesh. For a hypercube with p < n processors, we present a parallel algorithm running in 0(n^\varepsilon \min \{ r,\log p\} ) time for p = n^{1 - \varepsilon } ,0 < \varepsilon < 1 which is cost optimal when r \geqslant p. Our algorithm on \sqrt p \times \sqrt p mesh runs in 0(\sqrt p + \frac{n}{p}\log ^3 p) time which is the same as multiselection on mesh when r \geqslant \log p, and thus has the same optimality as multiselection in this case

    Towards better algorithms for parallel backtracking

    Get PDF
    Many algorithms in operations research and artificial intelligence are based on depth first search in implicitly defined trees. For parallelizing these algorithms, a load balancing scheme is needed which is able to evenly distribute parts of an irregularly shaped tree over the processors. It should work with minimal interprocessor communication and without prior knowledge of the tree\u27s shape. Previously known load balancing algorithms either require sending a message for each tree node or they only work efficiently for large search trees. This paper introduces new randomized dynamic load balancing algorithms for {\em tree structured computations}, a generalization of backtrack search.These algorithms only need to communicate when necessary and have an asymptotically optimal scalability for many important cases. They work work on hypercubes, butterflies, meshes and many other architectures

    Randomized Parallel Selection

    Get PDF
    We show that selection on an input of size N can be performed on a P-node hypercube (P = N/(log N)) in time O(n/P) with high probability, provided each node can process all the incident edges in one unit of time (this model is called the parallel model and has been assumed by previous researchers (e.g.,[17])). This result is important in view of a lower bound of Plaxton that implies selection takes Ω((N/P)loglog P+log P) time on a P-node hypercube if each node can process only one edge at a time (this model is referred to as the sequential model)

    Parallel Architectures for Planetary Exploration Requirements (PAPER)

    Get PDF
    The Parallel Architectures for Planetary Exploration Requirements (PAPER) project is essentially research oriented towards technology insertion issues for NASA's unmanned planetary probes. It was initiated to complement and augment the long-term efforts for space exploration with particular reference to NASA/LaRC's (NASA Langley Research Center) research needs for planetary exploration missions of the mid and late 1990s. The requirements for space missions as given in the somewhat dated Advanced Information Processing Systems (AIPS) requirements document are contrasted with the new requirements from JPL/Caltech involving sensor data capture and scene analysis. It is shown that more stringent requirements have arisen as a result of technological advancements. Two possible architectures, the AIPS Proof of Concept (POC) configuration and the MAX Fault-tolerant dataflow multiprocessor, were evaluated. The main observation was that the AIPS design is biased towards fault tolerance and may not be an ideal architecture for planetary and deep space probes due to high cost and complexity. The MAX concepts appears to be a promising candidate, except that more detailed information is required. The feasibility for adding neural computation capability to this architecture needs to be studied. Key impact issues for architectural design of computing systems meant for planetary missions were also identified
    • …
    corecore