
    Engineering Parallel String Sorting

    We discuss how string sorting algorithms can be parallelized on modern multi-core shared memory machines. As a synthesis of the best sequential string sorting algorithms and successful parallel sorting algorithms for atomic objects, we first propose string sample sort. The algorithm makes effective use of the memory hierarchy, uses additional word-level parallelism, and largely avoids branch mispredictions. Then we focus on NUMA architectures, and develop parallel multiway LCP-merge and -mergesort to reduce the number of random memory accesses to remote nodes. Additionally, we parallelize variants of multikey quicksort and radix sort that are also useful in certain situations. Comprehensive experiments on five current multi-core platforms are then reported and discussed. The experiments show that our implementations scale very well on real-world inputs and modern machines. Comment: 46 pages, extension of "Parallel String Sample Sort", arXiv:1305.115
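
    To make one of the sequential building blocks concrete: among the algorithms the paper parallelizes is multikey (three-way radix) quicksort. The sketch below is a minimal sequential version for illustration only, not the authors' tuned implementation; names and structure are ours.

```cpp
#include <string>
#include <utility>
#include <vector>

// Character at position d, or -1 past the end (a sentinel smaller than any real byte).
static int char_at(const std::string& s, std::size_t d) {
    return d < s.size() ? static_cast<unsigned char>(s[d]) : -1;
}

// Three-way string quicksort: partition by the character at depth d,
// recurse on the <, =, > groups; the = group advances to depth d + 1.
static void multikey_qsort(std::vector<std::string>& a, long lo, long hi, std::size_t d) {
    if (hi - lo < 1) return;
    int pivot = char_at(a[lo + (hi - lo) / 2], d);
    long lt = lo, gt = hi, i = lo;
    while (i <= gt) {
        int c = char_at(a[i], d);
        if (c < pivot)      std::swap(a[lt++], a[i++]);
        else if (c > pivot) std::swap(a[i], a[gt--]);
        else                ++i;
    }
    multikey_qsort(a, lo, lt - 1, d);
    if (pivot >= 0) multikey_qsort(a, lt, gt, d + 1);   // only recurse deeper if not at string end
    multikey_qsort(a, gt + 1, hi, d);
}

void string_sort(std::vector<std::string>& a) {
    if (!a.empty()) multikey_qsort(a, 0, static_cast<long>(a.size()) - 1, 0);
}
```

    String sample sort generalizes this idea by replacing the single pivot character with a set of sampled splitters, so that independent buckets can be handed to different threads.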

    Improved parallel integer sorting without concurrent writing

    We show that $n$ integers in the range $1..n$ can be stably sorted on an EREW PRAM using $O(t)$ time and $O(n(\sqrt{\log n\,\log\log n}+(\log n)^2/t))$ operations, for arbitrary given $t\ge\log n\log\log n$, and on a CREW PRAM using $O(t)$ time and $O(n(\sqrt{\log n}+\log n/2^{t/\log n}))$ operations, for arbitrary given $t\ge\log n$. In addition, we are able to sort $n$ arbitrary integers on a randomized CREW PRAM within the same resource bounds with high probability. In each case our algorithm is a factor of almost $\Theta(\sqrt{\log n})$ closer to optimality than all previous algorithms for the stated problem in the stated model, and our third result matches the operation count of the best known sequential algorithm. We also show that $n$ integers in the range $1..m$ can be sorted in $O((\log n)^2)$ time with $O(n)$ operations on an EREW PRAM using a nonstandard word length of $O(\log n \log\log n \log m)$ bits, thereby greatly improving the upper bound on the word length necessary to sort integers with a linear time-processor product, even sequentially. Our algorithms were inspired by, and in one case directly use, the fusion trees of Fredman and Willard.
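
    For orientation, the sequential yardstick behind these operation counts is linear-work stable integer sorting, e.g. a single stable counting-sort pass over keys from a bounded range. The sketch below shows only that baseline; the PRAM algorithms themselves rely on parallel prefix sums and, in one case, fusion trees, which are not reproduced here, and the function and parameter names are ours.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Stable counting sort of (key, payload) records with keys in [0, m).
// This is the sequential O(n + m) baseline whose work the PRAM algorithms
// aim to match while also bounding parallel time.
std::vector<std::pair<int, int>>
counting_sort(const std::vector<std::pair<int, int>>& rec, int m) {
    std::vector<std::size_t> count(static_cast<std::size_t>(m) + 1, 0);
    for (const auto& r : rec) ++count[r.first + 1];            // histogram, shifted by one
    for (int i = 0; i < m; ++i) count[i + 1] += count[i];      // exclusive prefix sums -> start offsets
    std::vector<std::pair<int, int>> out(rec.size());
    for (const auto& r : rec) out[count[r.first]++] = r;       // stable: preserves input order per key
    return out;
}
```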

    Space and Time Efficient Parallel Graph Decomposition, Clustering, and Diameter Approximation

    We develop a novel parallel decomposition strategy for unweighted, undirected graphs, based on growing disjoint connected clusters from batches of centers progressively selected from yet uncovered nodes. With respect to similar previous decompositions, our strategy exercises a tighter control on both the number of clusters and their maximum radius. We present two important applications of our parallel graph decomposition: (1) k-center clustering approximation; and (2) diameter approximation. In both cases, we obtain algorithms which feature a polylogarithmic approximation factor and are amenable to a distributed implementation that is geared for massive (long-diameter) graphs. The total space needed for the computation is linear in the problem size, and the parallel depth is substantially sublinear in the diameter for graphs with low doubling dimension. To the best of our knowledge, ours are the first parallel approximations for these problems which achieve sub-diameter parallel time, for a relevant class of graphs, using only linear space. Besides the theoretical guarantees, our algorithms allow for a very simple implementation on clustered architectures: we report on extensive experiments which demonstrate their effectiveness and efficiency on large graphs as compared to alternative known approaches. Comment: 14 pages
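
    The following toy sketch conveys the decomposition idea in a purely sequential setting: centers are picked in batches from the still-uncovered nodes, and disjoint BFS balls are grown around them for a bounded number of layers before the next batch is selected. The batch size, growth radius, and RNG seed are placeholder assumptions; this is not the paper's parallel, carefully scheduled algorithm.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Assigns every node a cluster id by repeatedly (1) sampling up to `batch`
// centers among uncovered nodes and (2) growing their clusters together for
// at most `radius` BFS layers. `adj` is an adjacency list; requires batch >= 1.
std::vector<int> decompose(const std::vector<std::vector<std::size_t>>& adj,
                           std::size_t batch, std::size_t radius) {
    const std::size_t n = adj.size();
    std::vector<int> cluster(n, -1);
    std::mt19937 rng(42);
    int next_id = 0;
    std::size_t covered = 0;
    while (covered < n) {
        // Sample new centers uniformly at random from the uncovered nodes.
        std::vector<std::size_t> uncovered;
        for (std::size_t v = 0; v < n; ++v)
            if (cluster[v] == -1) uncovered.push_back(v);
        std::shuffle(uncovered.begin(), uncovered.end(), rng);

        std::vector<std::size_t> frontier;
        for (std::size_t i = 0; i < std::min(batch, uncovered.size()); ++i) {
            cluster[uncovered[i]] = next_id++;
            frontier.push_back(uncovered[i]);
            ++covered;
        }
        // Grow all clusters of this batch together, one BFS layer per step.
        for (std::size_t layer = 0; layer < radius && !frontier.empty(); ++layer) {
            std::vector<std::size_t> next;
            for (std::size_t u : frontier)
                for (std::size_t w : adj[u])
                    if (cluster[w] == -1) {
                        cluster[w] = cluster[u];
                        next.push_back(w);
                        ++covered;
                    }
            frontier.swap(next);
        }
    }
    return cluster;
}
```

    In the paper, the batch sizes and the number of growth rounds are chosen so that both the number of clusters and the maximum cluster radius stay under control, which is what makes the k-center and diameter approximations possible.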

    High-Throughput SNP Genotyping by SBE/SBH

    Despite much progress over the past decade, current Single Nucleotide Polymorphism (SNP) genotyping technologies still offer an insufficient degree of multiplexing when required to handle user-selected sets of SNPs. In this paper we propose a new genotyping assay architecture combining multiplexed solution-phase single-base extension (SBE) reactions with sequencing by hybridization (SBH) using universal DNA arrays such as all k-mer arrays. In addition to PCR amplification of genomic DNA, SNP genotyping using SBE/SBH assays involves the following steps: (1) Synthesizing primers complementing the genomic sequence immediately preceding SNPs of interest; (2) Hybridizing these primers with the genomic DNA; (3) Extending each primer by a single base using a polymerase enzyme and dideoxynucleotides labeled with 4 different fluorescent dyes; and finally (4) Hybridizing extended primers to a universal DNA array and determining the identity of the bases that extend each primer by hybridization pattern analysis. Our contributions include a study of multiplexing algorithms for SBE/SBH genotyping assays and preliminary experimental results showing the achievable tradeoffs between the number of array probes and primer length on one hand and the number of SNPs that can be assayed simultaneously on the other. Simulation results on datasets both randomly generated and extracted from the NCBI dbSNP database suggest that the SBE/SBH architecture provides a flexible and cost-effective alternative to genotyping assays currently used in the industry, enabling genotyping of up to hundreds of thousands of user-specified SNPs per assay. Comment: 19 pages
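
    As a rough illustration of the multiplexing question, the sketch below greedily assigns primers to pools under a deliberately simplified decodability rule assumed here for exposition: within a pool, the k-mers formed by a primer's last k-1 bases plus each possible extension base must be unique, so a lit probe identifies one primer and one extending base. The paper's actual decoding conditions and algorithms are more involved; the rule, names, and parameters here are ours.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Greedy pooling of SBE primers for an all k-mer array. Assumes every primer
// has length >= k - 1. Returns the pool index assigned to each primer.
std::vector<int> assign_pools(const std::vector<std::string>& primers, std::size_t k) {
    const std::string bases = "ACGT";
    std::vector<std::set<std::string>> pool_kmers;          // k-mers already claimed by each pool
    std::vector<int> pool_of(primers.size(), -1);

    for (std::size_t i = 0; i < primers.size(); ++i) {
        const std::string& p = primers[i];
        std::string suffix = p.substr(p.size() - (k - 1));   // last k-1 bases of the primer
        std::vector<std::string> candidates;                 // k-mers of primer + each extension base
        for (char b : bases) candidates.push_back(suffix + b);

        // Place the primer in the first pool where none of its k-mers collide.
        int chosen = -1;
        for (std::size_t q = 0; q < pool_kmers.size(); ++q) {
            bool ok = true;
            for (const auto& c : candidates)
                if (pool_kmers[q].count(c)) { ok = false; break; }
            if (ok) { chosen = static_cast<int>(q); break; }
        }
        if (chosen == -1) {                                  // no existing pool works: open a new one
            pool_kmers.emplace_back();
            chosen = static_cast<int>(pool_kmers.size()) - 1;
        }
        pool_kmers[chosen].insert(candidates.begin(), candidates.end());
        pool_of[i] = chosen;
    }
    return pool_of;
}
```

    The tradeoff the paper studies appears directly in this toy model: larger k (more array probes) and longer primers reduce k-mer collisions and therefore allow more SNPs per pool.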

    Run-time parallelization and scheduling of loops

    Run-time methods are studied to automatically parallelize and schedule iterations of a do loop in certain cases where compile-time information is inadequate. The methods presented involve execution-time preprocessing of the loop. At compile time, these methods set up the framework for performing a loop dependency analysis. At run time, wavefronts of concurrently executable loop iterations are identified. Using this wavefront information, loop iterations are reordered for increased parallelism. Symbolic transformation rules are used to produce inspector procedures, which perform the execution-time preprocessing, and executors, i.e., transformed versions of the source-code loop structures. These transformed loop structures carry out the calculations planned in the inspector procedures. Performance results are presented from experiments conducted on the Encore Multimax. These results illustrate that run-time reordering of loop indices can have a significant impact on performance. Furthermore, the overheads associated with this type of reordering are amortized when the loop is executed several times with the same dependency structure.
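
    A compact sketch of the inspector/executor pattern for an indirectly indexed loop is shown below (C++ with an OpenMP pragma). The conflict rule is deliberately conservative, treating any two accesses to the same array element as a dependence, which is simpler than the analysis described above; all names are illustrative.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Inspector for a loop of the form: for (i = 0; i < n; ++i) a[w[i]] += a[r[i]];
// where the index arrays w and r are known only at run time. Each iteration is
// assigned to a wavefront; iterations in the same wavefront touch disjoint
// elements and can therefore execute concurrently.
std::vector<std::vector<std::size_t>>
inspector(const std::vector<std::size_t>& w, const std::vector<std::size_t>& r,
          std::size_t array_size) {
    std::vector<int> last_wave(array_size, -1);      // last wavefront touching each element
    std::vector<std::vector<std::size_t>> waves;
    for (std::size_t i = 0; i < w.size(); ++i) {
        int wave = 1 + std::max(last_wave[w[i]], last_wave[r[i]]);
        if (static_cast<std::size_t>(wave) == waves.size()) waves.emplace_back();
        waves[wave].push_back(i);
        last_wave[w[i]] = last_wave[r[i]] = wave;
    }
    return waves;
}

// Executor: run the reordered iterations wavefront by wavefront; iterations
// inside one wavefront are independent and may run in parallel.
void executor(std::vector<double>& a, const std::vector<std::size_t>& w,
              const std::vector<std::size_t>& r,
              const std::vector<std::vector<std::size_t>>& waves) {
    for (const auto& wave : waves) {
        #pragma omp parallel for
        for (long j = 0; j < static_cast<long>(wave.size()); ++j) {
            std::size_t i = wave[j];
            a[w[i]] += a[r[i]];
        }
    }
}
```

    When the loop is executed repeatedly with the same index arrays, the inspector runs once and the executor is reused, which is the amortization effect noted in the abstract.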