Search CORE

12 research outputs found

Adaptation of multiway-merge sorting algorithm to MIMD architectures with an experimental study

Author: Cantürk Levent
Publication venue: Bilkent University
Publication date: 01/01/2002
Field of study

Ankara : The Department of Computer Engineering and the Institute of Engineering and Science of Bilkent University, 2002.Thesis (Master's) -- Bilkent University, 2002.Includes bibliographical references leaves 73-78.Sorting is perhaps one of the most widely studied problems of computing. Numerous asymptotically optimal sequential algorithms have been discovered. Asymptotically optimal algorithms have been presented for varying parallel models as well. Parallel sorting algorithms have already been proposed for a variety of multiple instruction, multiple data streams (MIMD) architectures. In this thesis, we adapt the multiwaymerge sorting algorithm that is originally designed for product networks, to MIMD architectures. It has good load balancing properties, modest communication needs and well performance. The multiway-merge sort algorithm requires only two all-to-all personalized communication (AAPC) and two one-to-one communications independent from the input size. In addition to evenly distributed load balancing, the algorithm requires only size of 2N/P local memory for each processor in the worst case, where N is the number of items to be sorted and P is the number of processors. We have implemented the algorithm on the PC Cluster that is established at Computer Engineering Department of Bilkent University. To compare the results we have implemented a sample sort algorithm (PSRS Parallel Sorting by Regular Sampling) by X. Liu et all and a parallel quicksort algorithm (HyperQuickSort) on the same cluster. In the experimental studies we have used three different benchmarks namely Uniformly, Gaussian, and Zero distributed inputs. Although the multiwaymerge algorithm did not achieve better results than the other two, which are theoretically cost optimal algorithms, there are some cases that the multiway-merge algorithm outperforms the other two like in Zero distributed input. The results of the experiments are reported in detail. The multiway-merge sort algorithm is not necessarily the best parallel sorting algorithm, but it is expected to achieve acceptable performance on a wide spectrum of MIMD architectures.Cantürk, LeventM.S

Bilkent University Institutional Repository

Experimental analysis of space-bounded schedulers

Author: Aapo Kyrola
Guy E Blelloch
Harsha Vardhan Simhadri
Jeremy T Fineman
Phillip B Gibbons
Publication venue
Publication date: 03/04/2020
Field of study

ABSTRACT The running time of nested parallel programs on shared memory machines depends in significant part on how well the scheduler mapping the program to the machine is optimized for the organization of caches and processors on the machine. Recent work proposed "space-bounded schedulers" for scheduling such programs on the multi-level cache hierarchies of current machines. The main benefit of this class of schedulers is that they provably preserve locality of the program at every level in the hierarchy, resulting (in theory) in fewer cache misses and better use of bandwidth than the popular work-stealing scheduler. On the other hand, compared to work-stealing, space-bounded schedulers are inferior at load balancing and may have greater scheduling overheads, raising the question as to the relative effectiveness of the two schedulers in practice. In this paper, we provide the first experimental study aimed at addressing this question. To facilitate this study, we built a flexible experimental framework with separate interfaces for programs and schedulers. This enables a headto-head comparison of the relative strengths of schedulers in terms of running times and cache miss counts across a range of benchmarks. (The framework is validated by comparisons with the Intel R Cilk TM Plus work-stealing scheduler.) We present experimental results on a 32-core Xeon R 7560 comparing work-stealing, hierarchy-minded work-stealing, and two variants of space-bounded schedulers on both divideand-conquer micro-benchmarks and some popular algorithmic kernels. Our results indicate that space-bounded schedulers reduce the number of L3 cache misses compared to work-stealing schedulers by 25-65% for most of the benchmarks, but incur up to 7% additional scheduler and loadimbalance overhead. Only for memory-intensive benchmarks can the reduction in cache misses overcome the added overhead, resulting in up to a 25% improvement in running time for synthetic benchmarks and about 20% improvement for algorithmic kernels. We also quantify runtime improvements ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the national government of United States. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only. varying the available bandwidth per core (the "bandwidth gap"), and show up to 50% improvements in the running times of kernels as this gap increases 4-fold. As part of our study, we generalize prior definitions of space-bounded schedulers to allow for more practical variants (while still preserving their guarantees), and explore implementation tradeoffs

CiteSeerX

2006 Academic Excellence Showcase Proceedings

Author: Western Oregon University
Publication venue: Digital Commons@WOU
Publication date: 31/05/2006
Field of study

Western Oregon University: Digital Commons@WOU

Recommended from our members

The application of software visualization technology to evolutionary computation: a case study in Genetic Algorithms

Author: Collins Trevor D.
Publication venue
Publication date: 01/01/1998
Field of study

Evolutionary computation is an area within the field of artificial intelligence that is founded upon the principles of biological evolution. Evolution can be defined as the process of gradual development. Evolutionary algorithms are typically applied as a generic problem solving method, searching a problem space in order to locate good solutions. These solutions are found through an iterative evolutionary search that progresses by means of gradual developments. In the majority of cases of evolutionary computation the user is not aware of their algorithm's search behaviour. This causes two problems. First, the user has no way of assuring the quality of any solutions found other than to compare the solutions found by the algorithm with any available benchmark solutions or to re-run the algorithm and check if the results can be repeated or improved upon. Second, because the user is unaware of the algorithm's behaviour they have no way of identifying the contribution of the different components of the algorithm and therefore, no direct way of analyzing the algorithm's design and assigning credit to good algorithm components, or locating and improving ineffective algorithm components. The artificial intelligence and engineering communities have been slow to accept evolutionary computation as a robust problem-solving method because, unlike cased-based systems, rule-based systems or belief networks, they are unable to follow the algorithm's reasoning when locating a set of solutions in the problem space. During an evolutionary algorithm's execution the user may be able to see the results of the search but the search process itself like is a "black box" to the user. It is the search behaviour of evolutionary algorithms that needs to be understood by the user, in order for evolutionary computation to become more accepted within these communities. The aim of software visualization is to help people understand and use computer software. Software visualization technology has been applied successfully to illustrate a variety of heuristic search algorithms, programming languages and data structures. This thesis adopts software visualization as an approach for illustrating the search behaviour of evolutionary algorithms. Genetic Algorithms ("GAs") are used here as a specific case study to illustrate how software visualization may be applied to evolutionary computation. A set of visualization requirements are derived from the findings of a GA user study. A number of search space visualization techniques are examined for illustrating the search behaviour of a GA. "Henson," an extendable framework for developing visualization tools for genetic algorithms is presented. Finally, the application of the Henson framework is illustrated by the development of "Gonzo," a visualization tool designed to enable GA users to explore their algorithm's search behaviour. The contributions made in this thesis extend into the areas of software visualization, evolutionary computation and the psychology of programming. The GA user study presented here is the first and only known study of the working practices of GA users. The search space visualization techniques proposed here have never been applied in this domain before, and the resulting interactive visualizations provide the GA user with a previously unavailable insight into their algorithm's operation

Open Research Online (The Open University)

Design and simulation of an MIMD shared memory multiprocessor with interleaved instruction streams

Author: Stiemerling Thomas R.
Publication venue: The University of Edinburgh
Publication date: 01/01/1991
Field of study

Edinburgh Research Archive

Compiler techniques for scalable performance of stream programs on multicore architectures

Author: Gordon Michael I. (Michael Ian), 1978-
Publication venue: Massachusetts Institute of Technology
Publication date: 01/01/2010
Field of study

Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2010.Cataloged from PDF version of thesis.Includes bibliographical references (p. 211-222).Given the ubiquity of multicore processors, there is an acute need to enable the development of scalable parallel applications without unduly burdening programmers. Currently, programmers are asked not only to explicitly expose parallelism but also concern themselves with issues of granularity, load-balancing, synchronization, and communication. This thesis demonstrates that when algorithmic parallelism is expressed in the form of a stream program, a compiler can effectively and automatically manage the parallelism. Our compiler assumes responsibility for low-level architectural details, transforming implicit algorithmic parallelism into a mapping that achieves scalable parallel performance for a given multicore target. Stream programming is characterized by regular processing of sequences of data, and it is a natural expression of algorithms in the areas of audio, video, digital signal processing, networking, and encryption. Streaming computation is represented as a graph of independent computation nodes that communicate explicitly over data channels. Our techniques operate on contiguous regions of the stream graph where the input and output rates of the nodes are statically determinable. Within a static region, the compiler first automatically adjusts the granularity and then exploits data, task, and pipeline parallelism in a holistic fashion. We introduce techniques that data-parallelize nodes that operate on overlapping sliding windows of their input, translating serializing state into minimal and parametrized inter-core communication. Finally, for nodes that cannot be data-parallelized due to state, we are the first to automatically apply software-pipelining techniques at a coarse granularity to exploit pipeline parallelism between stateful nodes. Our framework is evaluated in the context of the StreamIt programming language. StreamIt is a high-level stream programming language that has been shown to improve programmer productivity in implementing streaming algorithms. We employ the StreamIt Core benchmark suite of 12 real-world applications to demonstrate the effectiveness of our techniques for varying multicore architectures. For a 16-core distributed memory multicore, we achieve a 14.9x mean speedup. For benchmarks that include sliding-window computation, our sliding-window data-parallelization techniques are required to enable scalable performance for a 16-core SMP multicore (14x mean speedup) and a 64-core distributed shared memory multicore (52x mean speedup).by Michael I. Gordon.Ph.D

DSpace@MIT

Pertanika Journal of Science & Technology

Author: Universiti Putra Malaysia Press
Publication venue: Universiti Putra Malaysia Press
Publication date: 01/01/2016
Field of study

Universiti Putra Malaysia Institutional Repository

Pertanika Journal of Science & Technology

Author: Universiti Putra Malaysia Press
Publication venue: Universiti Putra Malaysia Press
Publication date: 01/01/2016
Field of study

Universiti Putra Malaysia Institutional Repository

Refinement of Parallel Algorithms down to LLVM

Author: Lammich Peter
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 13th International Conference on Interactive Theorem Proving (ITP 2022)
Publication date: 01/01/2022
Field of study

We present a stepwise refinement approach to develop verified parallel algorithms, down to efficient LLVM code. The resulting algorithms\u27 performance is competitive with their counterparts implemented in C/C++. Our approach is backwards compatible with the Isabelle Refinement Framework, such that existing sequential formalizations can easily be adapted or re-used. As case study, we verify a parallel quicksort algorithm, and show that it performs on par with its C++ implementation, and is competitive to state-of-the-art parallel sorting algorithms

Dagstuhl Research Online Publication Server

University of Twente Research Information