42 research outputs found

    Improved parallel integer sorting without concurrent writing

    No full text
    We show that $n$ integers in the range $1 \ldots n$ can be stably sorted on an EREW PRAM using $O(t)$ time and $O(n(\sqrt{\log n\log\log n}+(\log n)^2/t))$ operations, for arbitrary given $t\ge\log n\log\log n$, and on a CREW PRAM using $O(t)$ time and $O(n(\sqrt{\log n}+\log n/2^{t/\log n}))$ operations, for arbitrary given $t\ge\log n$. In addition, we are able to sort $n$ arbitrary integers on a randomized CREW PRAM within the same resource bounds with high probability. In each case our algorithm is a factor of almost $\Theta(\sqrt{\log n})$ closer to optimality than all previous algorithms for the stated problem in the stated model, and our third result matches the operation count of the best known sequential algorithm. We also show that $n$ integers in the range $1 \ldots m$ can be sorted in $O((\log n)^2)$ time with $O(n)$ operations on an EREW PRAM using a nonstandard word length of $O(\log n\log\log n\log m)$ bits, thereby greatly improving the upper bound on the word length necessary to sort integers with a linear time-processor product, even sequentially. Our algorithms were inspired by, and in one case directly use, the fusion trees of Fredman and Willard.
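    As a point of reference for the operation counts above, here is a minimal sequential sketch of stable counting sort for keys in a known range (assuming 0-based keys; this is the classical linear-work baseline the paper's bounds are measured against, not its fusion-tree-based PRAM algorithm):

```python
# Sequential stable counting sort for integers in the range 0..n-1.
# Only the classical baseline: the paper's EREW/CREW PRAM algorithms
# approach this linear time-processor product in parallel.
def stable_counting_sort(a, n):
    count = [0] * n
    for x in a:                      # histogram of key values
        count[x] += 1
    pos = [0] * n                    # prefix sums give the first output
    for k in range(1, n):            # slot for each key value
        pos[k] = pos[k - 1] + count[k - 1]
    out = [None] * len(a)
    for x in a:                      # equal keys keep their input order,
        out[pos[x]] = x              # which is what "stably" means above
        pos[x] += 1
    return out

print(stable_counting_sort([3, 1, 3, 0, 2], n=4))  # [0, 1, 2, 3, 3]
```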

    Fast integer merging on the EREW PRAM

    Get PDF
    We investigate the complexity of merging sequences of small integers on the EREW PRAM. Our most surprising result is that two sorted sequences of $n$ bits each can be merged in $O(\log\log n)$ time. More generally, we describe an algorithm to merge two sorted sequences of $n$ integers drawn from the set $\{0,\ldots,m-1\}$ in $O(\log\log n+\log m)$ time using an optimal number of processors. No sublogarithmic merging algorithm for this model of computation was previously known. The algorithm not only produces the merged sequence, but also computes the rank of each input element in the merged sequence. On the other hand, we show a lower bound of $\Omega(\log\min\{n,m\})$ on the time needed to merge two sorted sequences of length $n$ each with elements in the set $\{0,\ldots,m-1\}$, implying that our merging algorithm is as fast as possible for $m=(\log n)^{\Omega(1)}$. If we impose an additional stability condition requiring the ranks of each input sequence to form an increasing sequence, then the time complexity of the problem becomes $\Theta(\log n)$, even for $m=2$. Stable merging is thus harder than nonstable merging.
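    The rank-based formulation mentioned above admits a short sequential sketch: each element's output position is its index in its own sequence plus its rank in the other. This shows only the formulation, with ranks computed here by binary search rather than by the paper's doubly-logarithmic EREW procedure:

```python
import bisect

# Merging by cross-ranking: an element's position in the merged output
# is its index in its own sequence plus its rank in the other sequence.
def merge_with_ranks(a, b):
    out = [None] * (len(a) + len(b))
    rank_a = [0] * len(a)
    rank_b = [0] * len(b)
    for i, x in enumerate(a):
        r = bisect.bisect_left(b, x)   # elements of b strictly below x
        rank_a[i] = i + r              # final position of a[i]
        out[i + r] = x
    for j, y in enumerate(b):
        r = bisect.bisect_right(a, y)  # ties resolved in favour of a,
        rank_b[j] = j + r              # so the merge is stable
        out[j + r] = y
    return out, rank_a, rank_b

merged, ra, rb = merge_with_ranks([0, 2, 2, 5], [1, 2, 4, 7])
print(merged)  # [0, 1, 2, 2, 2, 4, 5, 7]
```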

    Some Optimally Adaptive Parallel Graph Algorithms on EREW PRAM Model

    Get PDF
    The study of graph algorithms is an important area of research in computer science, since graphs offer useful tools to model many real-world situations. The commercial availability of parallel computers has led to the development of efficient parallel graph algorithms. Using an exclusive-read and exclusive-write (EREW) parallel random access machine (PRAM) with a fixed number of processors as the computation model, we design and analyze parallel algorithms for seven undirected graph problems: connected components, spanning forest, fundamental cycle set, bridges, bipartiteness, assignment problems, and approximate vertex coloring. For all but the last two problems, the input data structure is an unordered list of edges, and divide-and-conquer is the paradigm for designing algorithms. One of the algorithms for the assignment problem makes use of an appropriate variant of the dynamic programming strategy. An elegant data structure, called the adjacency list matrix, used in a vertex-coloring algorithm, avoids the sequential nature of linked adjacency lists. Each of the proposed algorithms achieves optimal speedup by choosing an optimal granularity (thus exploiting maximum parallelism) that depends on the density or the number of vertices of the given graph. The processor-(time)^2 product has been identified as a useful parameter to measure the cost-effectiveness of a parallel algorithm, and we derive a lower bound on this measure for each of our algorithms.
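    For concreteness, here is a sequential baseline for the first of these problems, connected components over an unordered edge list. Union-find stands in for the dissertation's divide-and-conquer EREW algorithms, which this sketch does not reproduce:

```python
# Connected components from an unordered edge list, using union-find.
def connected_components(n, edges):
    parent = list(range(n))
    def find(v):                       # path-halving find
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v
    for u, v in edges:                 # union the endpoint sets
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
    return [find(v) for v in range(n)] # component label per vertex

print(connected_components(5, [(0, 1), (1, 2), (3, 4)]))
# [2, 2, 2, 4, 4] -- vertices 0-2 form one component, 3-4 another
```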

    Parallel Algorithms for Summing Floating-Point Numbers

    Full text link
    The problem of exactly summing n floating-point numbers is a fundamental problem that has many applications in large-scale simulations and computational geometry. Unfortunately, due to the round-off error in standard floating-point operations, this problem becomes very challenging. Moreover, all existing solutions rely on sequential algorithms, which cannot scale to the huge datasets that need to be processed. In this paper, we provide several efficient parallel algorithms for summing n floating-point numbers, so as to produce a faithfully rounded floating-point representation of the sum. We present algorithms in the PRAM, external-memory, and MapReduce models, and we also provide an experimental analysis of our MapReduce algorithms, due to their simplicity and practical efficiency. Comment: Conference version appears in SPAA 201
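    A classical building block behind faithfully rounded summation is Knuth's error-free transformation TwoSum, which recovers the exact round-off error of a single addition. The paper's contribution lies in organizing such steps in parallel models, which this sketch does not attempt:

```python
# TwoSum: an error-free transformation returning s = fl(a + b) and the
# exact rounding error e, so that a + b == s + e holds exactly.
def two_sum(a: float, b: float):
    s = a + b
    bb = s - a
    e = (a - (s - bb)) + (b - bb)  # exact error of the rounded addition
    return s, e

s, e = two_sum(1e16, 1.0)
print(s, e)  # 1e+16 1.0 -- the 1.0 lost to round-off is recovered in e
```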

    Deterministic Computations on a PRAM with Static Processor and Memory Faults.

    Get PDF
    We consider a Parallel Random Access Machine (PRAM) in which some of the processors and memory cells are faulty. The faults considered are static, i.e., once the machine starts to operate, the operational/faulty status of PRAM components does not change. We develop a deterministic simulation of a fully operational PRAM on a similar faulty machine that has constant fractions of faults among its processors and memory cells. The simulating PRAM has $n$ processors and $m$ memory cells, and simulates a PRAM with $n$ processors and a constant fraction of $m$ memory cells. The simulation is in two phases: it starts with preprocessing, which is followed by the simulation proper, performed in a step-by-step fashion. Preprocessing is performed in time $O((\frac{m}{n}+\log n)\log n)$. The slowdown of the step-by-step part of the simulation is $O(\log m)$.
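    A toy illustration of the remapping idea underlying such a simulation, assuming preprocessing has already identified the faulty cells (the paper's actual deterministic, parallel construction is far more involved than this):

```python
# After preprocessing discovers which memory cells are operational,
# every logical address of the simulated (fault-free) PRAM is remapped
# to a working physical cell. Only the remapping step is shown here.
class RemappedMemory:
    def __init__(self, faulty):           # faulty: bool per physical cell
        m = len(faulty)
        self.table = [i for i in range(m) if not faulty[i]]
        self.cells = [0] * len(self.table)

    def read(self, addr):                 # addr is a *logical* address
        return self.cells[addr]

    def write(self, addr, value):
        self.cells[addr] = value

    def physical(self, addr):             # where the cell really lives
        return self.table[addr]

mem = RemappedMemory([False, True, False, False, True, False])
mem.write(1, 42)                          # logical cell 1 ...
print(mem.read(1), mem.physical(1))       # 42 2  ... is physical cell 2
```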

    Efficient parallel computation on multiprocessors with optical interconnection networks

    Get PDF
    This dissertation studies optical interconnection networks: their architecture, addressing schemes, and computation and communication capabilities. We focus on a simple but powerful optical interconnection network model, the Linear Array with a Reconfigurable Pipelined Bus System (LARPBS). We extend the LARPBS model to a simplified higher-dimensional LARPBS and provide a set of basic computation operations. We then study the following two groups of parallel computation problems on both one-dimensional and multi-dimensional LARPBS's: parallel comparison problems, including sorting, merging, and selection; and Boolean matrix multiplication, transitive closure, and their applications to connected component problems. We implement an optimal sorting algorithm on an n-processor LARPBS. With this optimal sorting algorithm at our disposal, we study the sorting problem for higher-dimensional LARPBS's and obtain the following results:
    • An optimal basic Columnsort algorithm on a 2D LARPBS.
    • Two optimal two-way merge sort algorithms on a 2D LARPBS.
    • An optimal multi-way merge sort algorithm on a 2D LARPBS.
    • An optimal generalized column sort algorithm on a 2D LARPBS.
    • An optimal generalized column sort algorithm on a 3D LARPBS.
    • An optimal 5-phase sorting algorithm on a 3D LARPBS.
    Results for selection problems are as follows:
    • A constant-time maximum-finding algorithm on an LARPBS.
    • An optimal maximum-finding algorithm on an LARPBS.
    • An O((log log n)^2)-time parallel selection algorithm on an LARPBS.
    • An O(k(log log n)^2)-time parallel multi-selection algorithm on an LARPBS.
    While studying the computation and communication properties of the LARPBS model, we find that Boolean matrix multiplication and its applications to graph problems are another set of problems that can be solved efficiently on the LARPBS. Following is a list of results we have obtained in this area:
    • A constant-time Boolean matrix multiplication algorithm.
    • An O(log n)-time transitive closure algorithm.
    • An O(log n)-time connected components algorithm.
    • An O(log n)-time strongly connected components algorithm.
    The results provided in this dissertation show the strong computation and communication power of optical interconnection networks.
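    The connection between the last two items and constant-time Boolean matrix multiplication can be sketched briefly: given a constant-time Boolean product, transitive closure follows from O(log n) repeated squarings of the adjacency matrix. A sequential sketch of that reduction (not the LARPBS implementation itself):

```python
# Transitive closure by repeated Boolean squaring: with self-loops added,
# squaring the adjacency matrix ceil(log2 n) times yields reachability.
# On the LARPBS each Boolean product takes constant time, giving the
# O(log n) bound; here the same closure is computed sequentially.
def bool_mult(a, b):
    n = len(a)
    return [[any(a[i][k] and b[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]

def transitive_closure(adj):
    n = len(adj)
    a = [[adj[i][j] or i == j for j in range(n)] for i in range(n)]
    steps = max(1, (n - 1).bit_length())   # ceil(log2 n) squarings
    for _ in range(steps):
        a = bool_mult(a, a)
    return a

adj = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]    # path 0 -> 1 -> 2
print(transitive_closure(adj)[0])          # [True, True, True]
```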

    Efficient Parallel and Distributed Algorithms for GIS Polygon Overlay Processing

    Get PDF
    Polygon clipping is one of the complex operations in computational geometry. It is used in Geographic Information Systems (GIS), Computer Graphics, and VLSI CAD. For two polygons with n and m vertices, the number of intersections can be O(nm). In this dissertation, we present the first output-sensitive CREW PRAM algorithm, which can perform polygon clipping in O(log n) time using O(n + k + k') processors, where n is the number of vertices, k is the number of intersections, and k' is the number of additional temporary vertices introduced due to the partitioning of polygons. The current best algorithm, by Karinthi, Srinivas, and Almasi, does not handle self-intersecting polygons, is not output-sensitive, and must employ O(n^2) processors to achieve O(log n) time. The second parallel algorithm is an output-sensitive PRAM algorithm based on the Greiner-Hormann algorithm with O(log n) time complexity using O(n + k) processors. This is cost-optimal when compared to the time complexity of the best-known sequential plane-sweep-based algorithm for polygon clipping. For self-intersecting polygons, the time complexity is O(((n + k) log n log log n)/p) using p processors. In addition to these parallel algorithms, the other main contributions in this dissertation are 1) multi-core and many-core implementations for clipping a pair of polygons and 2) MPI-GIS and Hadoop Topology Suite for distributed polygon overlay using a cluster of nodes. Nvidia GPU and CUDA are used for the many-core implementation. The MPI-based system achieves 44X speedup while processing about 600K polygons in two real-world GIS shapefiles, 1) USA Detailed Water Bodies and 2) USA Block Group Boundaries, within 20 seconds on a 32-node (8 cores each) IBM iDataPlex cluster interconnected by InfiniBand technology.
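    At the core of Greiner-Hormann-style clipping is the edge-pair intersection primitive, sketched below sequentially; the parallel algorithms above distribute the O(nm) candidate edge pairs (and the k actual intersections) across processors. A hedged illustration, not the dissertation's implementation:

```python
# Intersection of two polygon edges p1-p2 and q1-q2, expressed via
# parameters t and u along each segment; returns None if the edges are
# parallel or the intersection falls outside either segment.
def segment_intersection(p1, p2, q1, q2):
    (x1, y1), (x2, y2) = p1, p2
    (x3, y3), (x4, y4) = q1, q2
    d = (x2 - x1) * (y4 - y3) - (y2 - y1) * (x4 - x3)
    if d == 0:
        return None                       # parallel or collinear edges
    t = ((x3 - x1) * (y4 - y3) - (y3 - y1) * (x4 - x3)) / d
    u = ((x3 - x1) * (y2 - y1) - (y3 - y1) * (x2 - x1)) / d
    if 0 <= t <= 1 and 0 <= u <= 1:       # lies within both edges
        return (x1 + t * (x2 - x1), y1 + t * (y2 - y1))
    return None

print(segment_intersection((0, 0), (2, 2), (0, 2), (2, 0)))  # (1.0, 1.0)
```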