
71 research outputs found

A Novel Hybrid Quicksort Algorithm Vectorized using AVX-512 on Intel Skylake

Author: Bramas Berenger
Publication venue: 'The Science and Information Organization'
Publication date: 01/10/2017

The design of modern CPUs, which combines hierarchical memory with SIMD/vectorization capability, governs the potential for algorithms to be transformed into efficient implementations. The release of AVX-512 changed things radically, and motivated us to search for an efficient sorting algorithm that can take advantage of it. In this paper, we describe the best strategy we have found, which is a novel two-part hybrid sort based on the well-known Quicksort algorithm. The central partitioning operation is performed by a new algorithm, and small partitions/arrays are sorted using a branch-free Bitonic-based sort. This study is also an illustration of how classical algorithms can be adapted and enhanced by the AVX-512 extension. We evaluate the performance of our approach on a modern Intel Xeon Skylake and assess the different layers of our implementation by sorting/partitioning integers, double floating-point numbers, and key/value pairs of integers. Our results demonstrate that our approach is faster than two libraries of reference: the GNU C++ sort algorithm by a speedup factor of 4, and the Intel IPP library by a speedup factor of 1.4. Comment: 8 pages, research pape
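The two-part structure described in this abstract (Quicksort-style partitioning for large inputs, a Bitonic network for small ones) can be sketched in scalar Python. This is only an illustration under stated assumptions, not the paper's AVX-512 implementation: the cutoff `SMALL`, the function names, the out-of-place three-way partition, and the padding with infinity are all hypothetical choices made for readability, and the sketch assumes numeric keys.

```python
SMALL = 16  # hypothetical cutoff below which the bitonic network takes over


def bitonic_sort_small(values):
    """Sort a short list of numbers with a bitonic network.

    The list is padded with +inf up to the next power of two (an assumption
    of this sketch), because a bitonic network needs a power-of-two width.
    """
    n = 1
    while n < len(values):
        n *= 2
    padded = list(values) + [float("inf")] * (n - len(values))
    k = 2
    while k <= n:            # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:        # distance between compared elements
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    lo = min(padded[i], padded[partner])
                    hi = max(padded[i], padded[partner])
                    # The direction depends only on the index, never on the
                    # data, which is what makes the network data-oblivious.
                    if (i & k) == 0:
                        padded[i], padded[partner] = lo, hi
                    else:
                        padded[i], padded[partner] = hi, lo
            j //= 2
        k *= 2
    return padded[:len(values)]


def hybrid_sort(values):
    """Quicksort-style partitioning for large inputs, bitonic sort for small ones."""
    values = list(values)
    if len(values) <= SMALL:
        return bitonic_sort_small(values)
    pivot = values[len(values) // 2]
    # Simple out-of-place three-way partition, standing in for the paper's
    # vectorized partitioning step, which is not reproduced here.
    less = [v for v in values if v < pivot]
    equal = [v for v in values if v == pivot]
    greater = [v for v in values if v > pivot]
    return hybrid_sort(less) + equal + hybrid_sort(greater)
```

Calling `hybrid_sort([5, 3, 9, 1])` goes straight to the bitonic network, while a list longer than `SMALL` is partitioned first. In a vectorized version the compare-exchange pairs of one network stage would be evaluated together with vector min/max instructions rather than in a Python loop.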

arXiv.org e-Print Archive

MPG.PuRe

Fast and Robust Vectorized In-Place Sorting of Primitive Types

Author: Blacher Mark
Giesen Joachim
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 19th International Symposium on Experimental Algorithms (SEA 2021)
Publication date: 01/01/2021

Modern CPUs provide single instruction, multiple data (SIMD) instructions. SIMD instructions process several elements of a primitive data type simultaneously in fixed-size vectors. Classical sorting algorithms are not directly expressible in SIMD instructions. Accelerating sorting algorithms with SIMD instructions is therefore a creative endeavor. A promising approach for sorting with SIMD instructions is to use sorting networks for small arrays and Quicksort for large arrays. In this paper we improve vectorization techniques for sorting networks and Quicksort. In particular, we show how to use the full capacity of vector registers in sorting networks and how to make vectorized Quicksort robust with respect to different key distributions. To demonstrate the performance of our techniques we implement an in-place hybrid sorting algorithm for the data type int with AVX2 intrinsics. Our implementation is at least 30% faster than state-of-the-art high-performance sorting alternatives.

Dagstuhl Research Online Publication Server

Fast and Robust Vectorized In-Place Sorting of Primitive Types

Author: Blacher Mark
Giesen Joachim
Kühne Lars
Publication venue: Schloss Dagstuhl — Leibniz-Zentrum für Informatik
Publication date: 01/06/2021

Institute of Transport Research:Publications

Implementation of a vectorized Quicksort using AVX-512 intrinsics

Author: Blacher Mark
Kühne Lars
Thiemicke Frank
Publication venue
Publication date: 02/08/2021

For decades, improvements in computing speed were achieved by raising the CPU clock frequency. In recent years, physical limits have slowed this mechanism. Modern single-threaded applications must therefore make greater use of CPU features to benefit from the advances of new processor generations. One of these features is vector processing, which handles several data elements simultaneously. This work investigates the use of AVX-512 instructions for sorting 32-bit primitive types. Using vector instructions for sorting is a challenge, because sorting algorithms must first be redesigned to be vectorization-friendly. Fortunately, Mark Blacher has developed an efficient vectorized sorting algorithm based on sorting networks and a non-quadratic Quicksort. In this work, Blacher's AVX2-based vectorized implementation is ported to the more modern AVX-512 instruction set. In a speed comparison, Blacher's AVX2 version beats the AVX-512 implementation developed in this work. The latter is, however, able to outperform the previous AVX-512 sorting algorithm by Bramas.

Institute of Transport Research:Publications

JPEG steganography with particle swarm optimization accelerated by AVX

Author: Bramas B
Clerc M
Cox I
Jongen HT
Katzenbeisser S
Price KV
Raggo M
Wayner P
Publication venue: 'Wiley'
Publication date: 01/01/2020

Digital steganography aims at hiding secret messages in digital data transmitted over insecure channels. The JPEG format is prevalent in digital communication, and images are often used as cover objects in digital steganography. Optimization methods can improve the properties of images with an embedded secret but add computational complexity to their processing. In this work, AVX instructions available in modern CPUs are used to accelerate the data-parallel operations that are part of image steganography with advanced optimizations. Web of Science, 32(8), art. no. e544

Crossref

DSpace at VSB Technical University of Ostrava

Engineering faster sorters for small sets of items

Author: Bingmann Timo
Marianczuk Jasper
Sanders Peter
Publication venue: John Wiley and Sons
Publication date: 14/12/2020

Sorting a set of items is a task that can be useful by itself or as a building block for more complex operations. That is why a lot of effort has been put into finding sorting algorithms that sort large sets as efficiently as possible. But the more sophisticated and complex the algorithms become, the less efficient they are for small sets of items due to large constant factors. A relatively simple sorting algorithm that is often used as a base case sorter is insertion sort, because it has small code size and small constant factors influencing its execution time. We aim to determine if there is a faster way to sort small sets of items to provide an efficient base case sorter. We looked at sorting networks, at how they can improve the speed of sorting few elements, and how to implement them in an efficient manner using conditional moves. Since sorting networks need to be implemented explicitly for each set size, providing networks for larger sizes becomes less efficient due to increased code sizes. To also enable the sorting of slightly larger base cases, we adapted sample sort to Register Sample Sort, to break down those larger sets into sizes that can in turn be sorted by sorting networks. From our experiments we found that when sorting only small sets of integers, the sorting networks outperform insertion sort by a factor of at least 1.76 for any array size between six and 16, and by a factor of 2.72 on average across all machines and array sizes. When integrating sorting networks as a base case sorter into Quicksort, we achieved far less performance improvements over using insertion sort, which is probably due to the networks having a larger code size and cluttering the L1 instruction cache. The same effect occurs when including Register Sample Sort as a base case sorter for IPS4o. 
But for x86 machines that have a larger L1 instruction cache of 64 KiB or more, we obtained speedups of 12.7% when using sorting networks as a base case sorter in std::sort, and of 5%–6% when integrating Register Sample Sort as a base case sorter into IPS4o, each in comparison to using insertion sort as the base case sorter. In conclusion, the desired improvement in speed could only be achieved under special circumstances, but the results clearly show the potential of using conditional moves in the field of sorting algorithms.
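The idea of a sorting network built from compare-exchange steps can be illustrated with a small sketch. It uses the well-known five-comparator network for four inputs. The names `NETWORK_4`, `compare_exchange`, and `sort4` are hypothetical, and Python's `min`/`max` merely stand in for the conditional moves discussed in the abstract; this is not the authors' implementation.

```python
from itertools import permutations

# Fixed comparator sequence for four inputs: the pairs never depend on the data.
NETWORK_4 = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]


def compare_exchange(a, i, j):
    """Place the smaller of a[i], a[j] at index i and the larger at index j.

    Written with min/max instead of an if/else swap to mirror how a compiled
    version could use conditional moves rather than a branch.
    """
    a[i], a[j] = min(a[i], a[j]), max(a[i], a[j])


def sort4(values):
    """Sort exactly four items by running them through NETWORK_4."""
    a = list(values)
    if len(a) != 4:
        raise ValueError("this network is defined for exactly four inputs")
    for i, j in NETWORK_4:
        compare_exchange(a, i, j)
    return a


# Exhaustive check over all 24 orderings of four distinct keys.
all_sorted = all(sort4(p) == [1, 2, 3, 4] for p in permutations([1, 2, 3, 4]))
```

Because each set size needs its own comparator list, the amount of code grows with the supported sizes, which is the code-size cost the abstract refers to.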

KITopen

Engineering Faster Sorters for Small Sets of Items

Author: Bingmann Timo
Marianczuk Jasper
Sanders Peter
Publication venue
Publication date: 02/10/2020

Sorting a set of items is a task that can be useful by itself or as a building block for more complex operations. That is why a lot of effort has been put into finding sorting algorithms that sort large sets as fast as possible. But the more sophisticated and complex the algorithms become, the less efficient they are for small sets of items due to large constant factors. We aim to determine if there is a faster way than insertion sort to sort small sets of items to provide a more efficient base case sorter. We looked at sorting networks, at how they can improve the speed of sorting few elements, and how to implement them in an efficient manner by using conditional moves. Since sorting networks need to be implemented explicitly for each set size, providing networks for larger sizes becomes less efficient due to increased code sizes. To also enable the sorting of slightly larger base cases, we adapted sample sort to Register Sample Sort, to break down those larger sets into sizes that can in turn be sorted by sorting networks. From our experiments we found that when sorting only small sets, the sorting networks outperform insertion sort by a factor of at least 1.76 for any array size between six and sixteen, and by a factor of 2.72 on average across all machines and array sizes. When integrating sorting networks as a base case sorter into Quicksort, we achieved far less performance improvements, which is probably due to the networks having a larger code size and cluttering the L1 instruction cache. But for x86 machines with a larger L1 instruction cache of 64 KiB or more, we obtained speedups of 12.7% when using sorting networks as a base case sorter in std::sort. In conclusion, the desired improvement in speed could only be achieved under special circumstances, but the results clearly show the potential of using conditional moves in the field of sorting algorithms. Comment: arXiv admin note: substantial text overlap with arXiv:1908.0811

arXiv.org e-Print Archive

KITopen

Toward Reliable and Efficient Message Passing Software for HPC Systems: Fault Tolerance and Vector Extension

Author: Zhong Dong
Publication venue: TRACE: Tennessee Research and Creative Exchange
Publication date: 01/08/2021

As the scale of high-performance computing (HPC) systems continues to grow, researchers devote themselves to achieving the best performance when running long computing jobs on these systems. My research focuses on the reliability and efficiency of HPC software. First, as systems become larger, the mean time to failure (MTTF) of these HPC systems is negatively impacted and tends to decrease. Handling system failures becomes a prime challenge. My research aims to present a general design and implementation of an efficient runtime-level failure detection and propagation strategy targeting large-scale, dynamic systems that is able to detect both node and process failures. It uses multiple overlapping topologies to optimize the detection and propagation, minimizing the incurred overheads and guaranteeing the scalability of the entire framework. Results from different machines and benchmarks, compared to related work, show that my design and implementation outperforms non-HPC solutions significantly and is competitive with specialized HPC solutions that can manage only MPI applications. Second, I endeavor to explore instruction-level parallelization to achieve optimal performance. Novel processors support long vector extensions, which enable researchers to exploit the potential peak performance of target architectures. Intel introduced Advanced Vector Extensions (AVX2 and AVX-512) instructions for the x86 Instruction Set Architecture (ISA). Arm introduced the Scalable Vector Extension (SVE) with a new set of A64 instructions. Both enable greater parallelism. My research utilizes long vector reduction instructions to improve the performance of MPI reduction operations. I also use gather and scatter features to speed up the packing and unpacking operations in MPI. The evaluation of the resulting software stack under different scenarios demonstrates that the approach is not only efficient but also generalizable to many vector architectures and efficient
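The element-wise combine step at the heart of a sum reduction can be sketched as a blocked loop with a scalar tail, which is the usual shape of a vectorized kernel. This is only a scalar emulation under assumptions: `LANES` and `reduce_sum_inplace` are hypothetical names, no MPI library is involved, and the sketch does not reproduce the thesis's implementation.

```python
LANES = 16  # hypothetical lane count: sixteen 32-bit values fill one 512-bit register


def reduce_sum_inplace(inout, incoming, lanes=LANES):
    """Add `incoming` into `inout` element-wise, one block of `lanes` values at a time."""
    if len(inout) != len(incoming):
        raise ValueError("buffers must have the same length")
    n = len(inout)
    full = n - n % lanes  # number of elements covered by whole blocks
    for start in range(0, full, lanes):
        # Each block update stands in for one vector load, add, and store.
        inout[start:start + lanes] = [
            a + b
            for a, b in zip(inout[start:start + lanes], incoming[start:start + lanes])
        ]
    for i in range(full, n):
        # Scalar tail for the elements that do not fill a whole block.
        inout[i] += incoming[i]
    return inout
```

For example, combining two 37-element buffers processes two full blocks of 16 and leaves five elements for the scalar tail.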

University of Tennessee, Knoxville: Trace