381 research outputs found

    A Novel Hybrid Quicksort Algorithm Vectorized using AVX-512 on Intel Skylake

    Full text link
    The modern CPU's design, which is composed of hierarchical memory and SIMD/vectorization capability, governs the potential for algorithms to be transformed into efficient implementations. The release of the AVX-512 changed things radically, and motivated us to search for an efficient sorting algorithm that can take advantage of it. In this paper, we describe the best strategy we have found, which is a novel two parts hybrid sort, based on the well-known Quicksort algorithm. The central partitioning operation is performed by a new algorithm, and small partitions/arrays are sorted using a branch-free Bitonic-based sort. This study is also an illustration of how classical algorithms can be adapted and enhanced by the AVX-512 extension. We evaluate the performance of our approach on a modern Intel Xeon Skylake and assess the different layers of our implementation by sorting/partitioning integers, double floating-point numbers, and key/value pairs of integers. Our results demonstrate that our approach is faster than two libraries of reference: the GNU \emph{C++} sort algorithm by a speedup factor of 4, and the Intel IPP library by a speedup factor of 1.4.Comment: 8 pages, research pape

    Analysing the Performance of Divide-and-Conquer Algorithms on Multicore Processors

    Get PDF
    Multicore systems are widely gaining popularity because of the significant avail-ability and performance increase over the single core systems. Multicore systems have a lesser power consumption and heat generation than that of the multiple single core systems. The different compiler support provided by different vendors also make multicore programming one of the main area of research. The multicore programming utilises the power of multiple cores to parallelise a task. The widely used algorithm paradigms for multicore programming are the Divide and Conquer algorithms. The divide and conquer algorithms are candidate problem for the multicore programming because divide and conquer algorithm divides a problem into sub- problems which can be solved by distributing the sub-problems among the different cores and parallel solve them. A wide range of divide and conquer algorithm has been parallelized. In this paper, we have taken two of the widely used divide and conquer algorithms, quick sort and convex hull, parallel implemented them to analyse their performance gain in compared to the sequential version of the algorithm. The parallel implementations distribute the load onto the multiple cores, parallel work upon the loads and finally merge individual results of the each core. We have also proposed a scheme for efficient merging of the parallel sorted sub-arrays in the quick sort. We have taken the mean and standard deviation theory for efficient merging of the sorted sub-arrays. The OpenMP programming model has been used for the implementation of the programs. The processor architecture used for analysing the behaviour of the algorithm is a shared memory based processo

    Engineering Parallel String Sorting

    Get PDF
    We discuss how string sorting algorithms can be parallelized on modern multi-core shared memory machines. As a synthesis of the best sequential string sorting algorithms and successful parallel sorting algorithms for atomic objects, we first propose string sample sort. The algorithm makes effective use of the memory hierarchy, uses additional word level parallelism, and largely avoids branch mispredictions. Then we focus on NUMA architectures, and develop parallel multiway LCP-merge and -mergesort to reduce the number of random memory accesses to remote nodes. Additionally, we parallelize variants of multikey quicksort and radix sort that are also useful in certain situations. Comprehensive experiments on five current multi-core platforms are then reported and discussed. The experiments show that our implementations scale very well on real-world inputs and modern machines.Comment: 46 pages, extension of "Parallel String Sample Sort" arXiv:1305.115

    Parallel String Sample Sort

    Get PDF
    We discuss how string sorting algorithms can be parallelized on modern multi-core shared memory machines. As a synthesis of the best sequential string sorting algorithms and successful parallel sorting algorithms for atomic objects, we propose string sample sort. The algorithm makes effective use of the memory hierarchy, uses additional word level parallelism, and largely avoids branch mispredictions. Additionally, we parallelize variants of multikey quicksort and radix sort that are also useful in certain situations.Comment: 34 pages, 7 figures and 12 table

    Applications for Multicore System

    Get PDF
    A multi-core processor is a single computing unit with two or more processors (“cores”). These cores are integrated into a single IC for enhanced performance, reduced power consumption and more efficient simultaneous processing of multiple tasks. Homogeneous multi-core systems include only identical cores, whereas heterogeneous multi-core systems have cores that are not identical. Most of the computers and workstations these days have multicore processors. However most software programs are not designed to make use of multi-core processors and hence even though we run these programs on the new machines equipped with multicore processors, we don’t see sizable improvements in application performance. The idea behind improved performance is in parallelizing the code and distributing the work amongst multiple cores, but writing programming logic to achieve this is complex. The conventional model of lock-based parallelism for writing such programs is difficult in use, error-prone and does not always lead to efficient use of the resources but with the help of OpenMP, programmers have enhanced support for parallel programming. In this work I have implemented quicksort algorithm using OpenMP library and analysed the performance in terms of execution time

    LearnedSort as a learning-augmented SampleSort: Analysis and Parallelization

    Full text link
    This work analyzes and parallelizes LearnedSort, the novel algorithm that sorts using machine learning models based on the cumulative distribution function. LearnedSort is analyzed under the lens of algorithms with predictions, and it is argued that LearnedSort is a learning-augmented SampleSort. A parallel LearnedSort algorithm is developed combining LearnedSort with the state-of-the-art SampleSort implementation, IPS4o. Benchmarks on synthetic and real-world datasets demonstrate improved parallel performance for parallel LearnedSort compared to IPS4o and other sorting algorithms.Comment: Published in SSDBM 202

    Counting Linear Extensions in Practice : MCMC versus Exponential Monte Carlo

    Get PDF
    Counting the linear extensions of a given partial order is a #P-complete problem that arises in numerous applications. For polynomial-time approximation, several Markov chain Monte Carlo schemes have been proposed; however, little is known of their efficiency in practice. This work presents an empirical evaluation of the state-of-the-art schemes and investigates a number of ideas to enhance their performance. In addition, we introduce a novel approximation scheme, adaptive relaxation Monte Carlo (ARMC), that leverages exact exponential-time counting algorithms. We show that approximate counting is feasible up to a few hundred elements on various classes of partial orders, and within this range ARMC typically outperforms the other schemes.Peer reviewe

    Distributed String Sorting Algorithms

    Get PDF
    corecore