1,569 research outputs found

    SOLUTIONS FOR OPTIMIZING THE DATA PARALLEL PREFIX SUM ALGORITHM USING THE COMPUTE UNIFIED DEVICE ARCHITECTURE

    Get PDF
    In this paper, we analyze solutions for optimizing the data parallel prefix sum function using the Compute Unified Device Architecture (CUDA) that provides a viable solution for accelerating a broad class of applications. The parallel prefix sum function is an essential building block for many data mining algorithms, and therefore its optimization facilitates the whole data mining process. Finally, we benchmark and evaluate the performance of the optimized parallel prefix sum building block in CUDA.CUDA, threads, GPGPU, parallel prefix sum, parallel processing, task synchronization, warp

    A Sound and Complete Abstraction for Reasoning about Parallel Prefix Sums

    Get PDF
    Prefix sums are key building blocks in the implementation of many concurrent software applications, and recently much work has gone into efficiently implementing prefix sums to run on massively par-allel graphics processing units (GPUs). Because they lie at the heart of many GPU-accelerated applications, the correctness of prefix sum implementations is of prime importance. We introduce a novel abstraction, the interval of summations, that allows scalable reasoning about implementations of prefix sums. We present this abstraction as a monoid, and prove a sound-ness and completeness result showing that a generic sequential pre-fix sum implementation is correct for an array of length n if and only if it computes the correct result for a specific test case when instantiated with the interval of summations monoid. This allows correctness to be established by running a single test where the in

    Parallel Sort-Based Matching for Data Distribution Management on Shared-Memory Multiprocessors

    Full text link
    In this paper we consider the problem of identifying intersections between two sets of d-dimensional axis-parallel rectangles. This is a common problem that arises in many agent-based simulation studies, and is of central importance in the context of High Level Architecture (HLA), where it is at the core of the Data Distribution Management (DDM) service. Several realizations of the DDM service have been proposed; however, many of them are either inefficient or inherently sequential. These are serious limitations since multicore processors are now ubiquitous, and DDM algorithms -- being CPU-intensive -- could benefit from additional computing power. We propose a parallel version of the Sort-Based Matching algorithm for shared-memory multiprocessors. Sort-Based Matching is one of the most efficient serial algorithms for the DDM problem, but is quite difficult to parallelize due to data dependencies. We describe the algorithm and compute its asymptotic running time; we complete the analysis by assessing its performance and scalability through extensive experiments on two commodity multicore systems based on a dual socket Intel Xeon processor, and a single socket Intel Core i7 processor.Comment: Proceedings of the 21-th ACM/IEEE International Symposium on Distributed Simulation and Real Time Applications (DS-RT 2017). Best Paper Award @DS-RT 201

    Decentralized Online Scheduling of Malleable NP-hard Jobs

    Get PDF
    In this work, we address an online job scheduling problem in a large distributed computing environment. Each job has a priority and a demand of resources, takes an unknown amount of time, and is malleable, i.e., the number of allotted workers can fluctuate during its execution. We subdivide the problem into (a) determining a fair amount of resources for each job and (b) assigning each job to an according number of processing elements. Our approach is fully decentralized, uses lightweight communication, and arranges each job as a binary tree of workers which can grow and shrink as necessary. Using the NP-complete problem of propositional satisfiability (SAT) as a case study, we experimentally show on up to 128 machines (6144 cores) that our approach leads to near-optimal utilization, imposes minimal computational overhead, and performs fair scheduling of incoming jobs within a few milliseconds

    Optimizing for a Many-Core Architecture without Compromising Ease-of-Programming

    Get PDF
    Faced with nearly stagnant clock speed advances, chip manufacturers have turned to parallelism as the source for continuing performance improvements. But even though numerous parallel architectures have already been brought to market, a universally accepted methodology for programming them for general purpose applications has yet to emerge. Existing solutions tend to be hardware-specific, rendering them difficult to use for the majority of application programmers and domain experts, and not providing scalability guarantees for future generations of the hardware. This dissertation advances the validation of the following thesis: it is possible to develop efficient general-purpose programs for a many-core platform using a model recognized for its simplicity. To prove this thesis, we refer to the eXplicit Multi-Threading (XMT) architecture designed and built at the University of Maryland. XMT is an attempt at re-inventing parallel computing with a solid theoretical foundation and an aggressive scalable design. Algorithmically, XMT is inspired by the PRAM (Parallel Random Access Machine) model and the architecture design is focused on reducing inter-task communication and synchronization overheads and providing an easy-to-program parallel model. This thesis builds upon the existing XMT infrastructure to improve support for efficient execution with a focus on ease-of-programming. Our contributions aim at reducing the programmer's effort in developing XMT applications and improving the overall performance. More concretely, we: (1) present a work-flow guiding programmers to produce efficient parallel solutions starting from a high-level problem; (2) introduce an analytical performance model for XMT programs and provide a methodology to project running time from an implementation; (3) propose and evaluate RAP -- an improved resource-aware compiler loop prefetching algorithm targeted at fine-grained many-core architectures; we demonstrate performance improvements of up to 34.79% on average over the GCC loop prefetching implementation and up to 24.61% on average over a simple hardware prefetching scheme; and (4) implement a number of parallel benchmarks and evaluate the overall performance of XMT relative to existing serial and parallel solutions, showing speedups of up to 13.89x vs.~ a serial processor and 8.10x vs.~parallel code optimized for an existing many-core (GPU). We also discuss the implementation and optimization of the Max-Flow algorithm on XMT, a problem which is among the more advanced in terms of complexity, benchmarking and research interest in the parallel algorithms community. We demonstrate better speed-ups compared to a best serial solution than previous attempts on other parallel platforms
    corecore