
    Parallel String Sample Sort

    We discuss how string sorting algorithms can be parallelized on modern multi-core shared-memory machines. As a synthesis of the best sequential string sorting algorithms and successful parallel sorting algorithms for atomic objects, we propose string sample sort. The algorithm makes effective use of the memory hierarchy, uses additional word-level parallelism, and largely avoids branch mispredictions. Additionally, we parallelize variants of multikey quicksort and radix sort that are also useful in certain situations. (Comment: 34 pages, 7 figures and 12 tables)
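    The bucketing structure of sample sort applied to strings can be outlined in a few lines. The sketch below is a sequential, scalar simplification — splitters drawn from an evenly spaced sample, buckets found by binary search — rather than the paper's character-based splitter classification or its parallel version:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Sequential sketch of sample sort specialised to strings: take a sorted
// sample of splitters, bucket each input string by binary search over the
// splitters, then sort each bucket and concatenate. In the parallel
// algorithm the buckets are the independent units of work.
std::vector<std::string> string_sample_sort(std::vector<std::string> a,
                                            std::size_t num_buckets = 4) {
    if (a.size() <= num_buckets) {
        std::sort(a.begin(), a.end());
        return a;
    }
    // An evenly spaced sample yields num_buckets - 1 sorted splitters.
    std::vector<std::string> splitters;
    for (std::size_t i = 1; i < num_buckets; ++i)
        splitters.push_back(a[i * a.size() / num_buckets]);
    std::sort(splitters.begin(), splitters.end());

    std::vector<std::vector<std::string>> buckets(num_buckets);
    for (auto& s : a) {
        std::size_t b = std::upper_bound(splitters.begin(), splitters.end(), s)
                        - splitters.begin();
        buckets[b].push_back(std::move(s));
    }
    std::vector<std::string> out;
    for (auto& b : buckets) {
        std::sort(b.begin(), b.end());  // each bucket is sorted independently
        out.insert(out.end(), b.begin(), b.end());
    }
    return out;
}
```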

    Engineering Parallel String Sorting

    We discuss how string sorting algorithms can be parallelized on modern multi-core shared-memory machines. As a synthesis of the best sequential string sorting algorithms and successful parallel sorting algorithms for atomic objects, we first propose string sample sort. The algorithm makes effective use of the memory hierarchy, uses additional word-level parallelism, and largely avoids branch mispredictions. Then we focus on NUMA architectures, and develop parallel multiway LCP-merge and -mergesort to reduce the number of random memory accesses to remote nodes. Additionally, we parallelize variants of multikey quicksort and radix sort that are also useful in certain situations. Comprehensive experiments on five current multi-core platforms are then reported and discussed. The experiments show that our implementations scale very well on real-world inputs and modern machines. (Comment: 46 pages, extension of "Parallel String Sample Sort" arXiv:1305.115)
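    The idea behind LCP-merge can be illustrated with a two-way merge: each head caches the length of its common prefix with the last string written to the output, so the winner is often decided without touching any characters. The sketch below recomputes the cached LCP from the predecessor when a head advances; an engineered version, as in the paper, would carry precomputed LCP arrays instead:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Longest common prefix of x and y, resuming the scan at position `start`.
static std::size_t lcp(const std::string& x, const std::string& y,
                       std::size_t start = 0) {
    std::size_t i = start, n = std::min(x.size(), y.size());
    while (i < n && x[i] == y[i]) ++i;
    return i;
}

// Merge two sorted string sequences. hA/hB cache the LCP of each head with
// the last output string; when they differ, the order is decided without a
// character comparison, and when they tie, comparison resumes at hA.
std::vector<std::string> lcp_merge(const std::vector<std::string>& A,
                                   const std::vector<std::string>& B) {
    std::vector<std::string> out;
    std::size_t i = 0, j = 0, hA = 0, hB = 0;
    while (i < A.size() && j < B.size()) {
        bool takeA;
        if (hA != hB) {
            takeA = hA > hB;              // longer shared prefix sorts first
        } else {
            std::size_t h = lcp(A[i], B[j], hA);
            takeA = h == A[i].size() ||
                    (h < B[j].size() && A[i][h] < B[j][h]);
            (takeA ? hB : hA) = h;        // loser now shares h chars with output
        }
        if (takeA) {
            out.push_back(A[i]); ++i;
            hA = i < A.size() ? lcp(A[i - 1], A[i]) : 0;  // real impl: precomputed
        } else {
            out.push_back(B[j]); ++j;
            hB = j < B.size() ? lcp(B[j - 1], B[j]) : 0;
        }
    }
    out.insert(out.end(), A.begin() + i, A.end());
    out.insert(out.end(), B.begin() + j, B.end());
    return out;
}
```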

    A Novel Hybrid Quicksort Algorithm Vectorized using AVX-512 on Intel Skylake

    The modern CPU's design, which combines hierarchical memory with SIMD/vectorization capability, governs the potential for algorithms to be transformed into efficient implementations. The release of AVX-512 changed things radically, and motivated us to search for an efficient sorting algorithm that can take advantage of it. In this paper, we describe the best strategy we have found, which is a novel two-part hybrid sort based on the well-known Quicksort algorithm. The central partitioning operation is performed by a new algorithm, and small partitions/arrays are sorted using a branch-free Bitonic-based sort. This study is also an illustration of how classical algorithms can be adapted and enhanced by the AVX-512 extension. We evaluate the performance of our approach on a modern Intel Xeon Skylake and assess the different layers of our implementation by sorting/partitioning integers, double floating-point numbers, and key/value pairs of integers. Our results demonstrate that our approach is faster than two libraries of reference: the GNU C++ sort algorithm by a speedup factor of 4, and the Intel IPP library by a speedup factor of 1.4. (Comment: 8 pages, research paper)
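    The two-part structure — quicksort partitioning for large ranges, a branch-free network for small ones — can be sketched in scalar C++. The compare-exchange bitonic network below uses min/max instead of data-dependent branches, standing in for the paper's AVX-512 implementation; the cutoff of 16 is an illustrative choice, not the paper's tuned threshold:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Bitonic sorting network over a power-of-two-sized buffer. Order is
// decided by min/max compare-exchanges, not by data-dependent branches.
void bitonic_sort(std::vector<int>& v) {
    std::size_t n = v.size();  // precondition: n is a power of two
    for (std::size_t k = 2; k <= n; k <<= 1)
        for (std::size_t j = k >> 1; j > 0; j >>= 1)
            for (std::size_t i = 0; i < n; ++i) {
                std::size_t l = i ^ j;
                if (l > i) {
                    bool up = (i & k) == 0;
                    int lo = std::min(v[i], v[l]);
                    int hi = std::max(v[i], v[l]);
                    v[i] = up ? lo : hi;
                    v[l] = up ? hi : lo;
                }
            }
}

// Hybrid quicksort on v[lo, hi): three-way partition for large ranges,
// padded bitonic network for small ones.
void hybrid_sort(std::vector<int>& v, std::size_t lo, std::size_t hi) {
    if (hi - lo <= 16) {
        std::size_t m = 1;
        while (m < hi - lo) m <<= 1;  // pad up to a power of two
        std::vector<int> buf(v.begin() + lo, v.begin() + hi);
        buf.resize(m, std::numeric_limits<int>::max());
        bitonic_sort(buf);
        std::copy(buf.begin(), buf.begin() + (hi - lo), v.begin() + lo);
        return;
    }
    int pivot = v[lo + (hi - lo) / 2];
    auto mid = std::partition(v.begin() + lo, v.begin() + hi,
                              [pivot](int x) { return x < pivot; });
    auto mid2 = std::partition(mid, v.begin() + hi,
                               [pivot](int x) { return x == pivot; });
    hybrid_sort(v, lo, mid - v.begin());
    hybrid_sort(v, mid2 - v.begin(), hi);
}
```

The three-way partition keeps pivot-equal elements out of both recursive calls, so the recursion always shrinks even on inputs with many duplicates.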

    Configurable Strategies for Work-stealing

    Work-stealing systems are typically oblivious to the nature of the tasks they are scheduling. For instance, they do not know or take into account how long a task will take to execute or how many subtasks it will spawn. Moreover, the actual task execution order is typically determined by the underlying task storage data structure, and cannot be changed. There are thus possibilities for optimizing task parallel executions by providing information on specific tasks and their preferred execution order to the scheduling system. We introduce scheduling strategies to enable applications to dynamically provide hints to the task-scheduling system on the nature of specific tasks. Scheduling strategies can be used to independently control both local task execution order and steal order. In contrast to conventional scheduling policies that are normally global in scope, strategies allow the scheduler to apply optimizations on individual tasks. This flexibility greatly improves composability, as it allows the scheduler to apply different, specific scheduling choices for different parts of applications simultaneously. We present a number of benchmarks that highlight diverse, beneficial effects that can be achieved with scheduling strategies. Some benchmarks (branch-and-bound, single-source shortest path) show that prioritization of tasks can reduce the total amount of work compared to standard work-stealing execution order. For other benchmarks (triangle strip generation), qualitatively better results can be achieved in shorter time. Other optimizations, such as dynamic merging of tasks or stealing of half the work, instead of half the tasks, are also shown to improve performance. Composability is demonstrated by examples that combine different strategies, both within the same kernel (prefix sum) as well as when scheduling multiple kernels (prefix sum and unbalanced tree search).
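    A minimal single-worker illustration of the idea, using a made-up `StrategyPool` API (not the paper's interface): tasks carry an application-supplied priority hint, and the local pool drains best-hint-first instead of in LIFO deque order — the mechanism that lets, for example, branch-and-bound prune more work:

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <vector>

// Illustrative strategy-aware task pool (names are assumptions, not the
// paper's API): tasks carry a priority hint supplied by the application,
// and the local pool executes the highest hint first instead of LIFO.
struct Task {
    int priority;                 // hint provided by the application
    std::function<void()> run;
};

struct ByPriority {
    bool operator()(const Task& a, const Task& b) const {
        return a.priority < b.priority;  // max-heap: highest hint on top
    }
};

class StrategyPool {
    std::priority_queue<Task, std::vector<Task>, ByPriority> local_;
public:
    void spawn(int priority, std::function<void()> f) {
        local_.push({priority, std::move(f)});
    }
    void run_all() {              // single-worker drain, best-hint-first
        while (!local_.empty()) {
            Task t = local_.top();
            local_.pop();
            t.run();
        }
    }
};
```

A steal-order strategy would simply be a second, independent comparator applied when a thief removes work from the opposite end of the pool.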

    A hierarchical architecture for increasing efficiency of large photovoltaic plants under non-homogeneous solar irradiation

    Under non-homogeneous solar irradiation, photovoltaic (PV) panels receive different solar irradiance, resulting in a decrease in efficiency of the PV generation system. There are a few technical options to fix this issue, which goes under the name of mismatch. One of these is the reconfiguration of the PV generation system, namely changing the connections of the PV panels from the initial configuration to the optimal one. This technique has so far been limited to small systems, due to the excessive number of switches required. In this paper, the authors propose a new method for increasing the efficiency of large PV systems under non-homogeneous solar irradiation using the Series-Parallel (SP) topology. In the first part of the paper, the authors propose a method with two key points: a switching matrix to change the connection of PV panels based on the SP topology, and a proof that the SP-based reconfiguration method can increase the efficiency of the photovoltaic system by up to 50%. In the second part, the authors extend this method to improve the efficiency of large solar generation systems by means of a two-level architecture that minimizes the fabrication cost of the switching matrix.
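    The mismatch loss, and what reconfiguration recovers, can be shown with a toy model (an assumption of ours, not the paper's circuit model): panel current scales linearly with its irradiance fraction, panel voltage is fixed, and a series string is forced to the current of its most shaded panel:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Toy mismatch model: each panel's current is i_stc scaled by its
// irradiance fraction, the string current is clamped to the minimum, and
// string power = panel voltage * panel count * string current. The
// parameter values (30 V, 8 A) are illustrative assumptions.
double series_string_power(const std::vector<double>& irr,
                           double v_panel = 30.0, double i_stc = 8.0) {
    double i_min = i_stc * *std::min_element(irr.begin(), irr.end());
    return v_panel * irr.size() * i_min;
}
```

Under this model, one mixed string of four panels at irradiance fractions {1.0, 1.0, 0.4, 0.4} delivers 30 V × 4 × 3.2 A = 384 W, while regrouping into two homogeneous strings of two panels yields 480 W + 192 W = 672 W — the kind of gain a switching-matrix reconfiguration targets.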

    Analysing the Performance of Divide-and-Conquer Algorithms on Multicore Processors

    Multicore systems are widely gaining popularity because of their availability and their significant performance increase over single-core systems. Multicore systems also have lower power consumption and heat generation than multiple single-core systems, and the compiler support provided by different vendors makes multicore programming one of the main areas of research. Multicore programming utilises the power of multiple cores to parallelise a task. Divide-and-conquer algorithms are natural candidates for multicore programming, because they divide a problem into sub-problems that can be distributed among the different cores and solved in parallel. A wide range of divide-and-conquer algorithms has been parallelised. In this paper, we take two widely used divide-and-conquer algorithms, quicksort and convex hull, and implement them in parallel to analyse their performance gain compared to the sequential versions. The parallel implementations distribute the load onto multiple cores, work on the loads in parallel, and finally merge the individual results of each core. We also propose a scheme for efficient merging of the parallel-sorted sub-arrays in quicksort, based on the mean and standard deviation of the data. The OpenMP programming model has been used for the implementation of the programs, and the processor architecture used for analysing the behaviour of the algorithms is a shared-memory processor.
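    One plausible reading of the mean/standard-deviation merging idea (an assumption on our part, not necessarily the paper's exact scheme) is to cut the value range at mean − σ, mean, and mean + σ before sorting, so each bucket can be sorted independently — one per core — and the results simply concatenated, with no pairwise merge step at all:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <numeric>
#include <vector>

// Hypothetical stats-based bucketing: split values at mean - sd, mean, and
// mean + sd into four disjoint, ordered value ranges, sort each range
// independently (in a parallel version, one per core), and concatenate.
// Disjoint ordered ranges mean no merge step is required.
std::vector<int> bucket_sort_by_stats(const std::vector<int>& in) {
    if (in.empty()) return {};
    double mean = std::accumulate(in.begin(), in.end(), 0.0) / in.size();
    double var = 0.0;
    for (int x : in) var += (x - mean) * (x - mean);
    double sd = std::sqrt(var / in.size());
    double cuts[3] = {mean - sd, mean, mean + sd};
    std::vector<std::vector<int>> bucket(4);
    for (int x : in) {
        int b = 0;
        while (b < 3 && x >= cuts[b]) ++b;  // pick the value range for x
        bucket[b].push_back(x);
    }
    std::vector<int> out;
    for (auto& bk : bucket) {
        std::sort(bk.begin(), bk.end());    // independent work per bucket
        out.insert(out.end(), bk.begin(), bk.end());
    }
    return out;
}
```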

    Applying inspection to object-oriented software

    The benefits of the object-oriented paradigm are widely cited. At the same time, inspection is deemed to be the most cost-effective means of detecting defects in software products. Why, then, is there no published experience, let alone quantitative data, on the application of inspection to object-oriented systems? We describe the facilities of the object-oriented paradigm and the issues that these raise when inspecting object-oriented code. Several problems are caused by the disparity between the static code structure and its dynamic runtime behaviour. The large number of small methods in object-oriented systems can also cause problems. We then go on to describe three areas which may help mitigate the problems found. Firstly, the use of various programming methods may assist in making object-oriented code easier to inspect. Secondly, improved program documentation can help the inspector understand the code which is under inspection. Finally, tool support can help the inspector to analyse the dynamic behaviour of the code. We conclude that while both the object-oriented paradigm and inspection provide excellent benefits on their own, combining the two may be a difficult exercise, requiring extensive support if it is to be successful.

    Applications for Multicore System

    A multi-core processor is a single computing unit with two or more processors (“cores”). These cores are integrated into a single IC for enhanced performance, reduced power consumption and more efficient simultaneous processing of multiple tasks. Homogeneous multi-core systems include only identical cores, whereas heterogeneous multi-core systems have cores that are not identical. Most computers and workstations these days have multicore processors. However, most software programs are not designed to make use of multiple cores, so even when we run these programs on new machines equipped with multicore processors, we do not see sizable improvements in application performance. The key to improved performance is parallelizing the code and distributing the work amongst multiple cores, but writing the programming logic to achieve this is complex. The conventional model of lock-based parallelism for writing such programs is difficult to use, error-prone, and does not always lead to efficient use of resources. With OpenMP, however, programmers have enhanced support for parallel programming. In this work I have implemented the quicksort algorithm using OpenMP and analysed its performance in terms of execution time.
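    A task-based OpenMP quicksort along these lines might look as follows — a sketch with an illustrative 1024-element task cutoff, not necessarily the implementation evaluated in this work. The pragmas are guarded so that, compiled without -fopenmp, the same code runs sequentially and unchanged:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Task-parallel quicksort on v[lo, hi): the left part is spawned as an
// OpenMP task when it is large enough; the right part is handled by the
// current thread, and taskwait joins children before returning.
void quicksort_omp(std::vector<int>& v, long lo, long hi) {
    if (hi - lo < 2) return;
    int pivot = v[lo + (hi - lo) / 2];
    // Three-way partition: [lo, m1) < pivot, [m1, m2) == pivot, [m2, hi) > pivot.
    auto first = std::partition(v.begin() + lo, v.begin() + hi,
                                [pivot](int x) { return x < pivot; });
    auto last = std::partition(first, v.begin() + hi,
                               [pivot](int x) { return x == pivot; });
    long m1 = first - v.begin(), m2 = last - v.begin();
#ifdef _OPENMP
#pragma omp task shared(v) if (m1 - lo > 1024)  // spawn only sizable parts
#endif
    quicksort_omp(v, lo, m1);
    quicksort_omp(v, m2, hi);
#ifdef _OPENMP
#pragma omp taskwait
#endif
}

void quicksort_parallel(std::vector<int>& v) {
#ifdef _OPENMP
#pragma omp parallel
#pragma omp single nowait
#endif
    quicksort_omp(v, 0, (long)v.size());
}
```

The `single` construct makes one thread seed the recursion while the rest of the team picks up the spawned tasks; the `if` clause keeps tiny ranges from paying task-creation overhead.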