381 research outputs found
A Novel Hybrid Quicksort Algorithm Vectorized using AVX-512 on Intel Skylake
The modern CPU's design, which is composed of hierarchical memory and
SIMD/vectorization capability, governs the potential for algorithms to be
transformed into efficient implementations. The release of the AVX-512 changed
things radically, and motivated us to search for an efficient sorting algorithm
that can take advantage of it. In this paper, we describe the best strategy we
have found, which is a novel two parts hybrid sort, based on the well-known
Quicksort algorithm. The central partitioning operation is performed by a new
algorithm, and small partitions/arrays are sorted using a branch-free
Bitonic-based sort. This study is also an illustration of how classical
algorithms can be adapted and enhanced by the AVX-512 extension. We evaluate
the performance of our approach on a modern Intel Xeon Skylake and assess the
different layers of our implementation by sorting/partitioning integers, double
floating-point numbers, and key/value pairs of integers. Our results
demonstrate that our approach is faster than two libraries of reference: the
GNU \emph{C++} sort algorithm by a speedup factor of 4, and the Intel IPP
library by a speedup factor of 1.4.Comment: 8 pages, research pape
Analysing the Performance of Divide-and-Conquer Algorithms on Multicore Processors
Multicore systems are widely gaining popularity because of the significant avail-ability and performance increase over the single core systems. Multicore systems have a lesser power consumption and heat generation than that of the multiple single core systems. The different compiler support provided by different vendors also make multicore programming one of the main area of research. The multicore programming utilises the power of multiple cores to parallelise a task. The widely used algorithm paradigms for multicore programming are the Divide and Conquer algorithms. The divide and conquer algorithms are candidate problem for the multicore programming because divide and conquer algorithm divides a problem into sub- problems which can be solved by distributing the sub-problems among the different cores and parallel solve them. A wide range of divide and conquer algorithm has been parallelized. In this paper, we have taken two of the widely used divide and conquer algorithms, quick sort and convex hull, parallel implemented them to analyse their performance gain in compared to the sequential version of the algorithm. The parallel implementations distribute the load onto the multiple cores, parallel work upon the loads and finally merge individual results of the each core. We have also proposed a scheme for efficient merging of the parallel sorted sub-arrays in the quick sort. We have taken the mean and standard deviation theory for efficient merging of the sorted sub-arrays. The OpenMP programming model has been used for the implementation of the programs. The processor architecture used for analysing the behaviour of the algorithm is a shared memory based processo
Engineering Parallel String Sorting
We discuss how string sorting algorithms can be parallelized on modern
multi-core shared memory machines. As a synthesis of the best sequential string
sorting algorithms and successful parallel sorting algorithms for atomic
objects, we first propose string sample sort. The algorithm makes effective use
of the memory hierarchy, uses additional word level parallelism, and largely
avoids branch mispredictions. Then we focus on NUMA architectures, and develop
parallel multiway LCP-merge and -mergesort to reduce the number of random
memory accesses to remote nodes. Additionally, we parallelize variants of
multikey quicksort and radix sort that are also useful in certain situations.
Comprehensive experiments on five current multi-core platforms are then
reported and discussed. The experiments show that our implementations scale
very well on real-world inputs and modern machines.Comment: 46 pages, extension of "Parallel String Sample Sort" arXiv:1305.115
Parallel String Sample Sort
We discuss how string sorting algorithms can be parallelized on modern
multi-core shared memory machines. As a synthesis of the best sequential string
sorting algorithms and successful parallel sorting algorithms for atomic
objects, we propose string sample sort. The algorithm makes effective use of
the memory hierarchy, uses additional word level parallelism, and largely
avoids branch mispredictions. Additionally, we parallelize variants of multikey
quicksort and radix sort that are also useful in certain situations.Comment: 34 pages, 7 figures and 12 table
Applications for Multicore System
A multi-core processor is a single computing unit with two or more processors (“cores”). These cores are integrated into a single IC for enhanced performance, reduced power consumption and more efficient simultaneous processing of multiple tasks. Homogeneous multi-core systems include only identical cores, whereas heterogeneous multi-core systems have cores that are not identical. Most of the computers and workstations these days have multicore processors. However most software programs are not designed to make use of multi-core processors and hence even though we run these programs on the new machines equipped with multicore processors, we don’t see sizable improvements in application performance. The idea behind improved performance is in parallelizing the code and distributing the work amongst multiple cores, but writing programming logic to achieve this is complex. The conventional model of lock-based parallelism for writing such programs is difficult in use, error-prone and does not always lead to efficient use of the resources but with the help of OpenMP, programmers have enhanced support for parallel programming. In this work I have implemented quicksort algorithm using OpenMP library and analysed the performance in terms of execution time
LearnedSort as a learning-augmented SampleSort: Analysis and Parallelization
This work analyzes and parallelizes LearnedSort, the novel algorithm that
sorts using machine learning models based on the cumulative distribution
function. LearnedSort is analyzed under the lens of algorithms with
predictions, and it is argued that LearnedSort is a learning-augmented
SampleSort. A parallel LearnedSort algorithm is developed combining LearnedSort
with the state-of-the-art SampleSort implementation, IPS4o. Benchmarks on
synthetic and real-world datasets demonstrate improved parallel performance for
parallel LearnedSort compared to IPS4o and other sorting algorithms.Comment: Published in SSDBM 202
Counting Linear Extensions in Practice : MCMC versus Exponential Monte Carlo
Counting the linear extensions of a given partial order is a #P-complete problem that arises in numerous applications. For polynomial-time approximation, several Markov chain Monte Carlo schemes have been proposed; however, little is known of their efficiency in practice. This work presents an empirical evaluation of the state-of-the-art schemes and investigates a number of ideas to enhance their performance. In addition, we introduce a novel approximation scheme, adaptive relaxation Monte Carlo (ARMC), that leverages exact exponential-time counting algorithms. We show that approximate counting is feasible up to a few hundred elements on various classes of partial orders, and within this range ARMC typically outperforms the other schemes.Peer reviewe
- …