18,344 research outputs found

    A Novel Hybrid Quicksort Algorithm Vectorized using AVX-512 on Intel Skylake

    The design of modern CPUs, with their hierarchical memory and SIMD/vectorization capabilities, governs the potential for algorithms to be transformed into efficient implementations. The release of AVX-512 changed things radically and motivated us to search for an efficient sorting algorithm that can take advantage of it. In this paper, we describe the best strategy we have found, which is a novel two-part hybrid sort based on the well-known Quicksort algorithm. The central partitioning operation is performed by a new algorithm, and small partitions/arrays are sorted using a branch-free Bitonic-based sort. This study is also an illustration of how classical algorithms can be adapted and enhanced by the AVX-512 extension. We evaluate the performance of our approach on a modern Intel Xeon Skylake and assess the different layers of our implementation by sorting/partitioning integers, double-precision floating-point numbers, and key/value pairs of integers. Our results demonstrate that our approach is faster than two reference libraries: the GNU C++ sort algorithm, by a speedup factor of 4, and the Intel IPP library, by a speedup factor of 1.4. Comment: 8 pages, research paper
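    As a hedged illustration of the technique the abstract describes (and not the paper's actual code), the sketch below shows one AVX-512 partition step in C++: sixteen 32-bit integers are compared against the pivot at once, and the resulting mask drives branch-free compress-stores. The function name and the out-of-place two-buffer layout are assumptions made for brevity; the paper's partitioner works in place.

    #include <immintrin.h>
    #include <cstddef>

    // Partition src[0..n) around pivot into dst_low/dst_high (both assumed large
    // enough). Returns the number of elements written to dst_low.
    static std::size_t avx512_partition(const int* src, std::size_t n, int pivot,
                                        int* dst_low, int* dst_high) {
        const __m512i piv = _mm512_set1_epi32(pivot);
        std::size_t lo = 0, hi = 0, i = 0;
        for (; i + 16 <= n; i += 16) {
            __m512i v = _mm512_loadu_si512(src + i);
            __mmask16 le = _mm512_cmple_epi32_mask(v, piv);         // one bit per lane: v <= pivot
            _mm512_mask_compressstoreu_epi32(dst_low + lo, le, v);               // pack "low" lanes
            _mm512_mask_compressstoreu_epi32(dst_high + hi, (__mmask16)~le, v);  // pack "high" lanes
            std::size_t c = (std::size_t)_mm_popcnt_u32(le);
            lo += c;
            hi += 16 - c;
        }
        for (; i < n; ++i) {                                        // scalar tail
            if (src[i] <= pivot) dst_low[lo++] = src[i];
            else                 dst_high[hi++] = src[i];
        }
        return lo;
    }

    (Compile with -mavx512f -mpopcnt or -march=native. This vectorized partition, combined with a Bitonic network for partitions of at most a few vectors, is the general shape of the hybrid the abstract describes.)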

    RadixInsert, a much faster stable algorithm for sorting floating-point numbers

    The problem addressed in this paper is to sort an array a[] of n floating-point numbers conforming to the IEEE 754 standard, in both the 64-bit double-precision and 32-bit single-precision formats, on a multicore computer with p real cores and shared memory (an ordinary PC). We do this by introducing a new stable sorting algorithm, RadixInsert, in a sequential version and with two parallel implementations. RadixInsert is tested on two different machines, a 2-core laptop and a 4-core desktop, outperforming the non-stable Quicksort-based algorithms from the Java library, both the sequential Arrays.sort() and the merge-based parallel Arrays.parallelSort(), for 500.

    RadixInsert resembles the Shellsort algorithm [1] in many ways. First, the array is pre-sorted to some degree; in Shell's case, insertion sort is used first with long jumps and later shorter jumps along the array, so that small numbers end up near the start of the array and larger ones towards the end. Finally, a full insertion sort is performed on the whole array to ensure correct sorting. RadixInsert instead first uses an ordinary right-to-left LSD radix sort on some left part of the floating-point numbers, treated as integers, and finally, as with Shellsort, performs a full insertion sort on the whole array. This resembles in some ways a proposal by Sedgewick [10] for integer sorting and is commented on later. The IEEE 754 standard was deliberately designed so that positive floating-point numbers can be sorted as integers (in both the 32- and 64-bit formats); the special case of a mix of positive and negative numbers is also handled in RadixInsert.

    Radix sort is well suited to this task for several further reasons. First, the IEEE 754 standard normalizes numbers towards the left side of the 64-bit double or 32-bit float representation, so sorting on the leftmost bits of the n floating-point numbers orders both large and small numbers simultaneously. Second, radix sort is cache-friendly: it reads its arrays left to right with few cache misses, although it writes the elements back to different locations in b[] to do the sorting. Third, radix sort is a fast O(n) algorithm, faster than Quicksort's O(n log n) or Shellsort's O(n^1.5). RadixInsert is O(n) in practice, but, as with Quicksort, it may be possible to construct inputs on which RadixInsert degenerates to an O(n^2) algorithm; no such worst case was found when sorting the seven quite different distributions reported in this paper. Finally, the extra memory used by RadixInsert is n plus some minor arrays, whereas the sequential Quicksort in the Java library needs essentially no extra memory; the merge-based Arrays.parallelSort() in the Java library, however, needs the same n extra memory as RadixInsert.
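    The bit-level property the abstract relies on can be made concrete with a short, hedged sketch (not the paper's code): the standard order-preserving mapping from IEEE 754 doubles to unsigned 64-bit keys, which lets a radix pass over the raw bits respect numeric order. Positive values only need their sign bit set; negative values need all bits flipped so that their order reverses. The function name is an assumption; RadixInsert handles the positive/negative mix in its own way.

    #include <cstdint>
    #include <cstring>

    // Map a double to a 64-bit unsigned key whose integer order matches the
    // numeric order of the doubles (NaNs aside).
    static inline std::uint64_t double_to_radix_key(double d) {
        std::uint64_t bits;
        std::memcpy(&bits, &d, sizeof bits);              // reinterpret the IEEE 754 bit pattern
        return (bits & 0x8000000000000000ULL)             // negative: flip every bit
                   ? ~bits
                   : bits | 0x8000000000000000ULL;        // positive (or +0): set the sign bit
    }

    A few LSD passes over the leftmost bytes of such keys, followed by a full insertion sort, is the overall shape the abstract describes.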

    Sorting Integers on the AP1000

    Sorting is one of the classic problems of computer science. Whilst well understood on sequential machines, the diversity of architectures amongst parallel systems means that algorithms do not perform uniformly on all platforms. This document describes the implementation of a radix-based algorithm for sorting positive integers on a Fujitsu AP1000 Supercomputer, which was constructed as an entry in the Joint Symposium on Parallel Processing (JSPP) 1994 Parallel Software Contest (PSC94). Brief consideration is also given to a full radix sort conducted in parallel across the machine. Comment: 1994 Project Report, 23 pages
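    For readers unfamiliar with the building block, here is a brief, hedged C++ sketch (illustrative only, not the report's AP1000 code) of one stable counting pass over an 8-bit digit, the step that an LSD radix sort of positive integers repeats per digit and that a parallel version distributes across processors.

    #include <cstdint>
    #include <cstddef>
    #include <vector>

    // One stable counting pass on the 8-bit digit selected by `shift`.
    static void counting_pass(const std::vector<std::uint32_t>& in,
                              std::vector<std::uint32_t>& out, int shift) {
        std::size_t count[256] = {};                                // digit histogram
        for (std::uint32_t v : in) ++count[(v >> shift) & 0xFF];
        std::size_t pos = 0;                                        // exclusive prefix sums -> start offsets
        for (std::size_t& c : count) { std::size_t n = c; c = pos; pos += n; }
        for (std::uint32_t v : in) out[count[(v >> shift) & 0xFF]++] = v;   // stable scatter
    }

    // Full LSD radix sort of positive 32-bit integers: four passes, ping-ponging buffers.
    static void lsd_radix_sort(std::vector<std::uint32_t>& a) {
        std::vector<std::uint32_t> tmp(a.size());
        for (int shift = 0; shift < 32; shift += 8) {
            counting_pass(a, tmp, shift);
            a.swap(tmp);
        }
    }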

    Even faster sorting of (not only) integers

    In this paper we introduce RADULS2, the fastest parallel sorter based on the radix algorithm. It is optimized to process huge amounts of data, making use of modern multicore CPUs. The main novelties include an extremely optimized algorithm for handling tiny arrays (up to about a hundred records), which can appear billions of times as subproblems, and improved processing of larger subarrays with better use of non-temporal memory stores.
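    As a hedged illustration of what non-temporal memory stores buy in a radix scatter (a sketch under assumed names and layout, not the RADULS2 source), the fragment below buffers one cache line of 64-bit keys per bucket and flushes it with a streaming store, so the scattered writes bypass the cache instead of evicting the input being read.

    #include <immintrin.h>
    #include <cstdint>

    constexpr int RADIX     = 256;   // 8-bit digit
    constexpr int LINE_KEYS = 8;     // 8 x 64-bit keys = one 64-byte cache line

    struct BucketBuffer {
        alignas(64) std::uint64_t line[LINE_KEYS];
        int fill = 0;
    };

    // Flush one full cache line of keys to dst with a non-temporal store
    // (dst is assumed 64-byte aligned; each bucket's partial tail is written
    // normally at the end of the pass, followed by _mm_sfence()).
    static inline void flush_line(std::uint64_t* dst, const BucketBuffer& b) {
        __m512i v = _mm512_load_si512(reinterpret_cast<const __m512i*>(b.line));
        _mm512_stream_si512(reinterpret_cast<__m512i*>(dst), v);
    }

    // Append one key to its bucket's buffer; flush when a full line has accumulated.
    static inline void scatter_key(std::uint64_t key, int shift,
                                   BucketBuffer* buf, std::uint64_t** bucket_dst) {
        int d = (int)((key >> shift) & (RADIX - 1));
        BucketBuffer& b = buf[d];
        b.line[b.fill++] = key;
        if (b.fill == LINE_KEYS) {
            flush_line(bucket_dst[d], b);
            bucket_dst[d] += LINE_KEYS;   // advancing a whole line keeps dst 64-byte aligned
            b.fill = 0;
        }
    }

    The exact buffer size and layout RADULS2 uses are not given in the abstract; buffering followed by a streaming store is simply the standard way non-temporal stores are exploited in a radix scatter.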