9,317 research outputs found

    An Efficient Multiway Mergesort for GPU Architectures

    Full text link
    Sorting is a primitive operation that is a building block for countless algorithms. As such, it is important to design sorting algorithms that approach peak performance on a range of hardware architectures. Graphics Processing Units (GPUs) are particularly attractive architectures as they provides massive parallelism and computing power. However, the intricacies of their compute and memory hierarchies make designing GPU-efficient algorithms challenging. In this work we present GPU Multiway Mergesort (MMS), a new GPU-efficient multiway mergesort algorithm. MMS employs a new partitioning technique that exposes the parallelism needed by modern GPU architectures. To the best of our knowledge, MMS is the first sorting algorithm for the GPU that is asymptotically optimal in terms of global memory accesses and that is completely free of shared memory bank conflicts. We realize an initial implementation of MMS, evaluate its performance on three modern GPU architectures, and compare it to competitive implementations available in state-of-the-art GPU libraries. Despite these implementations being highly optimized, MMS compares favorably, achieving performance improvements for most random inputs. Furthermore, unlike MMS, state-of-the-art algorithms are susceptible to bank conflicts. We find that for certain inputs that cause these algorithms to incur large numbers of bank conflicts, MMS can achieve up to a 37.6% speedup over its fastest competitor. Overall, even though its current implementation is not fully optimized, due to its efficient use of the memory hierarchy, MMS outperforms the fastest comparison-based sorting implementations available to date

    Parallel processing and expert systems

    Get PDF
    Whether it be monitoring the thermal subsystem of Space Station Freedom, or controlling the navigation of the autonomous rover on Mars, NASA missions in the 1990s cannot enjoy an increased level of autonomy without the efficient implementation of expert systems. Merely increasing the computational speed of uniprocessors may not be able to guarantee that real-time demands are met for larger systems. Speedup via parallel processing must be pursued alongside the optimization of sequential implementations. Prototypes of parallel expert systems have been built at universities and industrial laboratories in the U.S. and Japan. The state-of-the-art research in progress related to parallel execution of expert systems is surveyed. The survey discusses multiprocessors for expert systems, parallel languages for symbolic computations, and mapping expert systems to multiprocessors. Results to date indicate that the parallelism achieved for these systems is small. The main reasons are (1) the body of knowledge applicable in any given situation and the amount of computation executed by each rule firing are small, (2) dividing the problem solving process into relatively independent partitions is difficult, and (3) implementation decisions that enable expert systems to be incrementally refined hamper compile-time optimization. In order to obtain greater speedups, data parallelism and application parallelism must be exploited

    Enhanced molecular dynamics performance with a programmable graphics processor

    Full text link
    Design considerations for molecular dynamics algorithms capable of taking advantage of the computational power of a graphics processing unit (GPU) are described. Accommodating the constraints of scalable streaming-multiprocessor hardware necessitates a reformulation of the underlying algorithm. Performance measurements demonstrate the considerable benefit and cost-effectiveness of such an approach, which produces a factor of 2.5 speed improvement over previous work for the case of the soft-sphere potential.Comment: 20 pages (v2: minor additions and changes; v3: corrected typos

    A counterexample to Thiagarajan's conjecture on regular event structures

    Full text link
    We provide a counterexample to a conjecture by Thiagarajan (1996 and 2002) that regular event structures correspond exactly to event structures obtained as unfoldings of finite 1-safe Petri nets. The same counterexample is used to disprove a closely related conjecture by Badouel, Darondeau, and Raoult (1999) that domains of regular event structures with bounded \natural-cliques are recognizable by finite trace automata. Event structures, trace automata, and Petri nets are fundamental models in concurrency theory. There exist nice interpretations of these structures as combinatorial and geometric objects. Namely, from a graph theoretical point of view, the domains of prime event structures correspond exactly to median graphs; from a geometric point of view, these domains are in bijection with CAT(0) cube complexes. A necessary condition for both conjectures to be true is that domains of regular event structures (with bounded \natural-cliques) admit a regular nice labeling. To disprove these conjectures, we describe a regular event domain (with bounded \natural-cliques) that does not admit a regular nice labeling. Our counterexample is derived from an example by Wise (1996 and 2007) of a nonpositively curved square complex whose universal cover is a CAT(0) square complex containing a particular plane with an aperiodic tiling. We prove that other counterexamples to Thiagarajan's conjecture arise from aperiodic 4-way deterministic tile sets of Kari and Papasoglu (1999) and Lukkarila (2009). On the positive side, using breakthrough results by Agol (2013) and Haglund and Wise (2008, 2012) from geometric group theory, we prove that Thiagarajan's conjecture is true for regular event structures whose domains occur as principal filters of hyperbolic CAT(0) cube complexes which are universal covers of finite nonpositively curved cube complexes

    The Parallel Persistent Memory Model

    Full text link
    We consider a parallel computational model that consists of PP processors, each with a fast local ephemeral memory of limited size, and sharing a large persistent memory. The model allows for each processor to fault with bounded probability, and possibly restart. On faulting all processor state and local ephemeral memory are lost, but the persistent memory remains. This model is motivated by upcoming non-volatile memories that are as fast as existing random access memory, are accessible at the granularity of cache lines, and have the capability of surviving power outages. It is further motivated by the observation that in large parallel systems, failure of processors and their caches is not unusual. Within the model we develop a framework for developing locality efficient parallel algorithms that are resilient to failures. There are several challenges, including the need to recover from failures, the desire to do this in an asynchronous setting (i.e., not blocking other processors when one fails), and the need for synchronization primitives that are robust to failures. We describe approaches to solve these challenges based on breaking computations into what we call capsules, which have certain properties, and developing a work-stealing scheduler that functions properly within the context of failures. The scheduler guarantees a time bound of O(W/PA+D(P/PA)log1/fW)O(W/P_A + D(P/P_A) \lceil\log_{1/f} W\rceil) in expectation, where WW and DD are the work and depth of the computation (in the absence of failures), PAP_A is the average number of processors available during the computation, and f1/2f \le 1/2 is the probability that a capsule fails. Within the model and using the proposed methods, we develop efficient algorithms for parallel sorting and other primitives.Comment: This paper is the full version of a paper at SPAA 2018 with the same nam
    corecore