
    A full parallel Quicksort algorithm for multicore processors

    The problem addressed in this paper is that we want to sort an integer array a[] of length n in parallel on a multicore machine with p cores using Quicksort. Amdahl's law tells us that the inherent sequential part of any algorithm will in the end dominate and limit the speedup we get from parallelisation. This paper introduces ParaQuick, a full parallel Quicksort algorithm for use on an ordinary shared-memory multicore machine that has just a few simple statements in its sequential part. It can be seen as an improvement over the traditional parallelisation of the Quicksort algorithm, where one follows the sequential algorithm and substitutes the recursive calls at the top of the recursion tree with the creation of parallel threads for those calls. The ParaQuick algorithm starts with k parallel threads, where k is a multiple of p (here k = 8*p), in a k-way partition of the original array with the same pivot value, and hence we get 2k partitioned areas in the first pass. We then calculate the pivot index, i.e. where the division between the small and the large elements would have been if this had been an ordinary sequential Quicksort partition. In full parallel we then swap all small elements to the right of this pivot index with the large elements to the left of it; these two 'displaced' sets are by definition of equal size. We can then recurse, with half of the threads on the left part and the other half of the threads on the right part (more details and synchronisation considerations are in the paper). Finally, when there is only one thread left working on one such area, sequential Quicksort and Insertion sort are used, as in the traditional way of doing parallel Quicksort. In the last part of the paper, this new algorithm is empirically tested against two other algorithms and Arrays.sort from the Java library. Five different distributions of the numbers to be sorted and three different machines with p = 2 (4 hyper-threaded), 4 (8) and 32 (64) cores are tested. Finally, conclusions are presented and an explanation is given of why this ParaQuick algorithm, for large values of n and some distributions, is so much faster than a traditional parallelisation.
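
    A minimal Java sketch of such a first pass is given below, under simplifying assumptions (this is my illustration, not the authors' code): k threads each partition one block of a[] around a shared pivot, the global pivot index m is the total count of small elements, and the displaced elements on either side of m are then swapped. The paper performs this swap phase in parallel as well; the fix-up below is sequential for brevity.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.*;

    class ParaQuickSketch {
        // Partition a[lo..hi) around pivot; return index of the first element > pivot.
        static int partitionBlock(int[] a, int lo, int hi, int pivot) {
            int i = lo;
            for (int j = lo; j < hi; j++)
                if (a[j] <= pivot) { int t = a[i]; a[i] = a[j]; a[j] = t; i++; }
            return i;
        }

        // One parallel pass: k-way block partition, then swap the displaced elements.
        static int parallelPartition(int[] a, int k, int pivot) throws Exception {
            int n = a.length, block = (n + k - 1) / k;
            ExecutorService pool = Executors.newFixedThreadPool(k);
            List<Future<Integer>> splits = new ArrayList<>();
            for (int t = 0; t < k; t++) {
                final int lo = Math.min(t * block, n), hi = Math.min(lo + block, n);
                splits.add(pool.submit(() -> partitionBlock(a, lo, hi, pivot)));
            }
            int m = 0;                                   // global pivot index
            for (int t = 0; t < k; t++)
                m += splits.get(t).get() - Math.min(t * block, n);
            pool.shutdown();
            // Fix-up: swap large elements left of m with small elements right of m.
            // The two displaced sets have equal size, so both scans finish together.
            int i = 0, j = m;
            while (true) {
                while (i < m && a[i] <= pivot) i++;
                while (j < n && a[j] > pivot) j++;
                if (i >= m || j >= n) break;
                int t = a[i]; a[i++] = a[j]; a[j++] = t;
            }
            return m;
        }
    }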

    RadixInsert, a much faster stable algorithm for sorting floating-point numbers

    The problem addressed in this paper is that we want to sort an array a[] of n floating-point numbers conforming to the IEEE 754 standard, both in the 64-bit double precision and the 32-bit single precision formats, on a multicore computer with p real cores and shared memory (an ordinary PC). This we do by introducing a new stable sorting algorithm, RadixInsert, both in a sequential version and in two parallel implementations. RadixInsert is tested on two different machines, a 2-core laptop and a 4-core desktop, outperforming the non-stable Quicksort-based algorithms from the Java library, both the sequential Arrays.sort() and the merge-based parallel Arrays.parallelSort(), for 500. The RadixInsert algorithm resembles in many ways the Shellsort algorithm [1]. First, the array is pre-sorted to some degree; in the case of Shellsort, Insertion sort is used first with long jumps and later with shorter jumps along the array, to ensure that small numbers end up near the start of the array and the larger ones towards the end. Finally, a full Insertion sort is performed on the whole array to ensure correct sorting. RadixInsert first uses the ordinary right-to-left LSD Radix sort for sorting on some left part of the floating-point numbers, considered as integers. Finally, as with Shellsort, we perform a full Insertion sort on the whole array. This resembles in some ways a proposal by Sedgewick [10] for integer sorting and will be commented on later. The IEEE 754 standard was deliberately made such that positive floating-point numbers can be sorted as integers (in both the 32- and 64-bit formats). The special case of a mix of positive and negative numbers is also handled in RadixInsert. One main reason why Radix sort is so well suited for this task is that the IEEE 754 standard normalizes numbers to the left side of the representation in a 64-bit double or a 32-bit float. The Radix algorithm, sorting on the leftmost bits of the n floating-point numbers, will then sort both large and small numbers simultaneously. Second, Radix is cache-friendly: it reads all its arrays left to right with few cache misses as a result, although it writes them back to different locations in b[] in order to do the sorting. Third, Radix sort is a fast O(n) algorithm, faster than Quicksort's O(n log n) or Shellsort's O(n^1.5). RadixInsert is in practice O(n), but as with Quicksort it might be possible to construct inputs where RadixInsert degenerates to an O(n^2) algorithm. However, this worst case for RadixInsert was not found when sorting the seven quite different distributions reported in this paper. Finally, the extra memory used by RadixInsert is n plus some minor arrays, whereas the sequential Quicksort in the Java library needs basically no extra memory. However, the merge-based Arrays.parallelSort() in the Java library needs the same n extra memory as RadixInsert.
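
    As a small illustration (mine, not the paper's) of why IEEE 754 numbers can be radix-sorted as integers: for non-negative floating-point numbers the raw bit patterns are already in numeric order, and the standard key transformation below handles negative numbers by flipping all bits of negative values and only the sign bit of non-negative ones, so that unsigned integer order matches numeric order.

    public class FloatKey {
        // Map a double to a 64-bit key whose unsigned order matches numeric order:
        // negative doubles (sign bit set) get all bits flipped, which also reverses
        // their magnitude order; non-negative doubles just get the sign bit set.
        static long sortableKey(double d) {
            long bits = Double.doubleToRawLongBits(d);
            return bits < 0 ? ~bits : bits ^ Long.MIN_VALUE;
        }

        public static void main(String[] args) {
            double[] xs = { -2.5, -0.0, 0.0, 1.0e-300, 3.14, 1.0e300 };
            for (double x : xs)                      // keys print in increasing unsigned order
                System.out.printf("%g -> %016x%n", x, sortableKey(x));
        }
    }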

    A faster all parallel Mergesort algorithm for multicore processors

    The problem addressed in this paper is that we want to sort an integer array a[] of length n in parallel on a multicore machine with p cores using Mergesort. Amdahl's law tells us that the inherent sequential part of any algorithm will in the end dominate and limit the speedup we get from parallelisation. This paper introduces ParaMerge, an all-parallel Mergesort algorithm for use on an ordinary shared-memory multicore machine that has just a few simple statements in its sequential part. The new algorithm is all-parallel in the sense that, by recursive descent, it is two-parallel at the top node, four-parallel on the next level in the recursion, then eight-parallel, until we have started at least one thread for each of the p cores. After parallelisation, each thread then uses sequential recursive Mergesort with a variant of Insertion sort for sorting short subsections at the end. ParaMerge can be seen as an improvement over the traditional parallelisation of the Mergesort algorithm, where one follows the sequential algorithm and substitutes the recursive calls at the top of the recursion tree with the creation of parallel threads. This traditional parallel Mergesort finally does a sequential merging of the two sorted halves of a[]. Only at the next level does it go two-parallel, then four-parallel on the level below that, and so on. After parallelisation, my implementation of this traditional algorithm also uses the same sequential Mergesort and Insertion sort algorithms as the ParaMerge algorithm in each thread. There are two main improvements in ParaMerge. The first is the observation that merging can be done both from the start, left to right, picking the smallest elements of the two sections to be merged, and at the same time from the end of the same sections, right to left, picking the largest elements. The second improvement is that the contract between a node and its two sub-nodes is changed. In a traditional parallelisation, a node is given a section of a[], sorts it by merging the two sorted halves it recursively receives from its own two sub-nodes, and returns it to its mother node. In ParaMerge, a node instead receives from its own two sub-nodes a full sorting of the section it got from its mother node (so that problem is already solved). Every node has a twin node. In parallel, these two twin nodes then merge their two sorted sections, one from the left and the other from the right as described above. The two twin sub-nodes have then sorted the whole section given to their common mother node. This also goes for the top node. We have thus raised the level of parallelisation by a factor of two at each level at the top of the recursion tree. The ParaMerge algorithm also contains other improvements, such as controlled sorting back and forth between a[] and a scratch area b[] of the same size, such that the sorted result always ends up in a[] without any copying, and a special Insertion sort that is central for achieving this copy-free feature. ParaMerge is compared with other published algorithms, and in only one case is one of the 'new' features of ParaMerge found. This other algorithm is described and compared in some detail. Finally, ParaMerge is empirically compared with three other algorithms sorting arrays of length n = 10, 20, ..., 50 million and up to 1000 million when p = 32, demonstrating that it is significantly faster than the two other merge algorithms, the sequential and the traditional parallel algorithm, and Arrays.sort(), a sequential Quicksort algorithm from the Java library.
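
    The two-ended merge idea can be sketched as follows (my illustration, not the paper's code): two twin threads merge the sorted sections a[lo..mid) and a[mid..hi) into b[], one picking the smallest elements from the front, the other picking the largest from the back. Since each thread writes exactly half of the output, and the complementary tie-breaking rules make the two halves disjoint, they need no synchronisation beyond joining at the end.

    class TwinMergeSketch {
        static void twinMerge(int[] a, int[] b, int lo, int mid, int hi)
                throws InterruptedException {
            int half = (hi - lo) / 2;
            Thread front = new Thread(() -> {
                int i = lo, j = mid;
                for (int k = lo; k < lo + half; k++)      // smallest half, left to right
                    b[k] = (j >= hi || (i < mid && a[i] <= a[j])) ? a[i++] : a[j++];
            });
            Thread back = new Thread(() -> {
                int i = mid - 1, j = hi - 1;
                for (int k = hi - 1; k >= lo + half; k--) // largest half, right to left
                    b[k] = (i < lo || (j >= mid && a[j] >= a[i])) ? a[j--] : a[i--];
            });
            front.start(); back.start();
            front.join(); back.join();
        }
    }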

    A Fairness Algorithm for High-speed Networks based on a Resilient Packet Ring Architecture

    IEEE is currently standardizing a spatial-reuse ring topology network called the Resilient Packet Ring (RPR, IEEE P802.17). The goal of the RPR development is to make a LAN/MAN standard, but WANs are also discussed. A ring network needs a fairness algorithm that regulates each station's access to the ring. The RPR fairness algorithm is currently being developed with mostly long distances between stations in mind. In this paper we discuss the feedback aspects of this algorithm and how it needs to be changed in order to give good performance if and when RPR is used for high-speed networks and LANs with shorter distances between stations. We discuss different architectural parameters, including buffer sizes and distances between stations. We suggest the use of triggers instead of timers to meet the response requirements of high-speed networks. We have developed a discrete-event simulator in the programming language Java. The proposed improvements are compared and evaluated using a ring network model that we have built using our simulator. (c) 2002 IEEE. Personal use of this material is permitted.
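
    The paper's simulator itself is not shown here; the following is a minimal Java sketch of the generic discrete-event pattern such simulators are built on (the class and method names are mine): a priority queue of timestamped events processed in time order, where each event's action may schedule further events, such as packet arrivals or fairness-message triggers.

    import java.util.Comparator;
    import java.util.PriorityQueue;

    class DesSketch {
        record Event(double time, Runnable action) {}

        private final PriorityQueue<Event> queue =
                new PriorityQueue<>(Comparator.comparingDouble(Event::time));
        double now = 0.0;

        // Schedule an action to run 'delay' time units from the current time.
        void schedule(double delay, Runnable action) {
            queue.add(new Event(now + delay, action));
        }

        // Process events in timestamp order until the queue empties or endTime passes.
        void run(double endTime) {
            while (!queue.isEmpty() && queue.peek().time() <= endTime) {
                Event e = queue.poll();
                now = e.time();
                e.action().run();
            }
        }
    }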

    Some Faster Algorithms for Finding Large Prime Gaps

    This paper investigates the problem of finding large prime gaps (the difference between two consecutive prime numbers, p_{i+1} - p_i) and the development of a small, efficient program for generating such large prime gaps on a single computer, a laptop or a workstation. In Wikipedia [1], one can find a table of all known record prime gaps less than 2^64; the record is a 20 decimal digit number. We wanted to go beyond 64-bit numbers and demonstrate algorithms that do not need a huge number of computers in a grid to produce useful results. After some preliminary tests, we found that the Sieve of Eratosthenes, SE, from the year 250 BC, was the fastest for finding prime numbers, and it could also be made space efficient. Each odd number is represented by one bit, and when storing 8 odd numbers in a single byte (representing 16 consecutive numbers, ignoring the even numbers), we found that we should not make one long SE table, but instead divide the SE table into segments (called SE segments), each of length 10^8 or 10^9, and dynamically generate the necessary SE segments as needed to find prime numbers. First, we made a basic segment of all prime numbers < 10^8 (in less than a second). We also relied heavily on the old observation [2] that when using SE to find all prime numbers <= N, we only need to cross out multiples of the prime numbers <= sqrt(N), and that the first number crossed off when crossing out for prime number p is p^2. When we want to find prime gaps, we first create one or more consecutive SE segments in that range, say starting at 2^74 and ending with the value M; initially these big segments are crossed out by our first basic set of primes < 10^8. To find all prime numbers in these big segments, we next need the rest of the prime numbers <= sqrt(M). These can all be constructed by using our first set of prime numbers to generate segments of consecutive SE from 10^8 upwards. The primes in these segments are used to cross out in the big SE segments and can then be discarded (each prime is used only once). Our most significant algorithm was a simple formula for using primes from the range 3 to 2^36 to cross out the non-primes in any SE segment without crossing out in all the numbers between 2^36 and 2^72. This leads to an exponential saving in both space and execution time. In addition to this, we created a small package, Int3, to represent numbers > 2^64 by storing 8 decimal digits in each of 3 integer variables, together with the necessary mathematical operations. The Int3 package can handle numbers up to 24 decimal digits and is significantly faster than the BigInteger package in the Java library. We also created a faster algorithm for finding all record prime gaps. The results presented in this paper are some tables of prime gaps for primes significantly larger than 2^64, and data supporting an observation that big prime gaps in these segments are much more frequent than the ones found in the Wikipedia table, where the search starts at prime number 3. Our combined set of algorithms is also sufficiently fast to test every entry in the Wikipedia table in less than 5 minutes. We conclude by reflecting on the use of brute force (more computers) versus smarter algorithms.
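
    The segmented sieve idea the paper builds on can be illustrated as follows (a generic Java sketch, not the authors' bit-packed implementation): base primes up to sqrt(high) are generated once, each segment [low, high) is then sieved independently using only those primes, and the largest gap between consecutive primes inside the segment is reported.

    import java.util.ArrayList;
    import java.util.List;

    class SegmentedSieve {
        // Simple sieve for the base primes <= limit.
        static int[] basePrimes(int limit) {
            boolean[] composite = new boolean[limit + 1];
            List<Integer> primes = new ArrayList<>();
            for (int p = 2; p <= limit; p++) {
                if (composite[p]) continue;
                primes.add(p);
                for (long m = (long) p * p; m <= limit; m += p)
                    composite[(int) m] = true;
            }
            return primes.stream().mapToInt(Integer::intValue).toArray();
        }

        // Sieve the segment [low, high) and report the largest gap found inside it.
        static void sieveSegment(long low, long high, int[] primes) {
            boolean[] composite = new boolean[(int) (high - low)];
            for (int p : primes) {
                // First multiple of p inside the segment, but never below p*p.
                long start = Math.max((long) p * p, (low + p - 1) / p * p);
                for (long m = start; m < high; m += p)
                    composite[(int) (m - low)] = true;
            }
            long prev = -1, bestGap = 0, bestAt = -1;
            for (long x = Math.max(low, 2); x < high; x++) {
                if (composite[(int) (x - low)]) continue;
                if (prev > 0 && x - prev > bestGap) { bestGap = x - prev; bestAt = prev; }
                prev = x;
            }
            System.out.println("largest gap in segment: " + bestGap + " after " + bestAt);
        }

        public static void main(String[] args) {
            long low = 1_000_000_000L, high = low + 10_000_000;
            int[] primes = basePrimes((int) Math.sqrt((double) high) + 1);
            sieveSegment(low, high, primes);
        }
    }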

    A full parallel radix sorting algorithm for multicore processors

    The problem addressed in this paper is that we want to sort an integer array a[] of length n on a multicore machine with k cores. Amdahl's law tells us that the inherent sequential part of any algorithm will in the end dominate and limit the speedup we get from parallelisation of that algorithm. This paper introduces PARL, a parallel left radix sorting algorithm for use on ordinary shared-memory multicore machines, that has just one simple statement in its sequential part. It can be seen as a major rework of the Partitioned Parallel Radix Sort (PPR) that was developed for use on a network of communicating machines with separate memories. The PARL algorithm, which was developed independently of the PPR algorithm, has in principle some of the same phases as PPR, but also many significant differences, as described in this paper. On a 32-core server, a speedup of 5-12 times is achieved compared with the same sequential ARL algorithm when sorting more than 100 000 numbers, and half that speedup on a 4-core PC workstation and on two dual-core laptops. Since the original sequential ARL algorithm is in addition 3-5 times faster than the standard Java Arrays.sort algorithm, this parallelisation translates to a significant speedup of approximately 10 to 30 for ordinary user programs sorting larger arrays. The reason that we don't get better results, i.e. a speedup equal to the number of cores when the number of cores exceeds 2, is chiefly a limited memory bandwidth. This thread-pool implementation of PARL is also user-friendly in that the user calling the PARL algorithm does not have to deal with creating threads; to sort their data, they just create a sorting object and call a thread-safe method in that object.
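
    As a generic illustration of one phase typical of parallel radix sorting (not the PARL implementation itself), the sketch below lets each of k threads build a histogram of the digit values in its own block of a[]; prefix sums over these per-thread histograms then give each thread its write offsets for the permutation step.

    import java.util.concurrent.*;

    class RadixCountSketch {
        // Count, in parallel, the occurrences of each 'bits'-wide digit at position
        // 'shift' in a[]; count[t][d] is the number of elements with digit d in
        // thread t's block. Each thread writes only its own row, so no locking is needed.
        static int[][] parallelCount(int[] a, int shift, int bits, int k)
                throws Exception {
            int buckets = 1 << bits, mask = buckets - 1;
            int n = a.length, block = (n + k - 1) / k;
            int[][] count = new int[k][buckets];
            ExecutorService pool = Executors.newFixedThreadPool(k);
            CountDownLatch done = new CountDownLatch(k);
            for (int t = 0; t < k; t++) {
                final int id = t, lo = Math.min(t * block, n), hi = Math.min(lo + block, n);
                pool.execute(() -> {
                    for (int i = lo; i < hi; i++)
                        count[id][(a[i] >>> shift) & mask]++;
                    done.countDown();
                });
            }
            done.await();
            pool.shutdown();
            return count; // prefix sums over count[][] give each thread's write offsets
        }
    }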

    Making a fast unstable sorting algorithm stable

    This paper demonstrates how an unstable in-place sorting algorithm, the ALR algorithm, can be made stable by temporarily changing the sorting keys during the recursion. At 'the bottom of the recursion', all subsequences of equal-valued elements are then individually sorted with a stable sorting subalgorithm (Insertion sort or Radix). Later, on backtracking, the original keys are restored. This results in a stable sorting of the whole input. Unstable ALR is much faster than Quicksort (which is also unstable). In this paper it is demonstrated that StableALR, which is some 10-30% slower than the original unstable ALR, is still in most cases 20-60% faster than Quicksort. It is also shown to be faster than Flashsort, a new unstable, in-place, bucket-type sorting algorithm. This is demonstrated for five different distributions of integers in arrays of length from 50 to 97 million elements. The StableALR sorting algorithm can be extended to sort floating-point numbers and strings, and to make effective use of a multicore CPU.
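
    The paper's key-modification scheme is specific to ALR; for contrast, the sketch below shows the textbook alternative for making any unstable integer sort stable, namely packing the original index into the low bits of each key before sorting. This costs extra key width and memory, which is what StableALR's temporary in-place key changes avoid.

    import java.util.Arrays;

    class StableByIndex {
        // Stable sort of a[] using an unstable sort on widened keys:
        // value in the high 32 bits, original index in the low 32 bits,
        // so equal values keep their original relative order.
        static void stableSort(int[] a) {
            int n = a.length;
            long[] keyed = new long[n];
            for (int i = 0; i < n; i++)
                keyed[i] = ((long) a[i] << 32) | i;
            Arrays.sort(keyed);                    // ties broken by original index
            for (int i = 0; i < n; i++)
                a[i] = (int) (keyed[i] >> 32);     // arithmetic shift restores sign
        }
    }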