227 research outputs found

    Array languages and the challenges of modern computer architecture

    Get PDF
    There has always been a close relationship between programming language design and computer design. Electronic computers and programming languages are both 'computers' in Turing's sense. They are systems which allow the performance of bounded universal computation. Each allows any computable function to be evaluated, up to some memory limit. This equivalence has been understood since the 30s' when Turing machines (Turing 1937) were shown to be of the same computational power as the λ calculus

    Theoretically Efficient Parallel Graph Algorithms Can Be Fast and Scalable

    Full text link
    There has been significant recent interest in parallel graph processing due to the need to quickly analyze the large graphs available today. Many graph codes have been designed for distributed memory or external memory. However, today even the largest publicly-available real-world graph (the Hyperlink Web graph with over 3.5 billion vertices and 128 billion edges) can fit in the memory of a single commodity multicore server. Nevertheless, most experimental work in the literature report results on much smaller graphs, and the ones for the Hyperlink graph use distributed or external memory. Therefore, it is natural to ask whether we can efficiently solve a broad class of graph problems on this graph in memory. This paper shows that theoretically-efficient parallel graph algorithms can scale to the largest publicly-available graphs using a single machine with a terabyte of RAM, processing them in minutes. We give implementations of theoretically-efficient parallel algorithms for 20 important graph problems. We also present the optimizations and techniques that we used in our implementations, which were crucial in enabling us to process these large graphs quickly. We show that the running times of our implementations outperform existing state-of-the-art implementations on the largest real-world graphs. For many of the problems that we consider, this is the first time they have been solved on graphs at this scale. We have made the implementations developed in this work publicly-available as the Graph-Based Benchmark Suite (GBBS).Comment: This is the full version of the paper appearing in the ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 201

    A Sound and Complete Abstraction for Reasoning about Parallel Prefix Sums

    Get PDF
    Prefix sums are key building blocks in the implementation of many concurrent software applications, and recently much work has gone into efficiently implementing prefix sums to run on massively par-allel graphics processing units (GPUs). Because they lie at the heart of many GPU-accelerated applications, the correctness of prefix sum implementations is of prime importance. We introduce a novel abstraction, the interval of summations, that allows scalable reasoning about implementations of prefix sums. We present this abstraction as a monoid, and prove a sound-ness and completeness result showing that a generic sequential pre-fix sum implementation is correct for an array of length n if and only if it computes the correct result for a specific test case when instantiated with the interval of summations monoid. This allows correctness to be established by running a single test where the in

    Efficient computation of hashes

    Get PDF
    The sequential computation of hashes at the core of many distributed storage systems and found, for example, in grid services can hinder efficiency in service quality and even pose security challenges that can only be addressed by the use of parallel hash tree modes. The main contributions of this paper are, first, the identification of several efficiency and security challenges posed by the use of sequential hash computation based on the Merkle-Damgard engine. In addition, alternatives for the parallel computation of hash trees are discussed, and a prototype for a new parallel implementation of the Keccak function, the SHA-3 winner, is introduced

    On the Streets of San Francisco: Highlights from the ISSCR Annual Meeting 2010

    Get PDF
    The 2010 Annual Meeting of the International Society for Stem Cell Research (ISSCR) was held in San Francisco in June with an exciting program covering a wealth of stem cell research from basic science to clinical research

    Optimal (Randomized) Parallel Algorithms in the Binary-Forking Model

    Full text link
    In this paper we develop optimal algorithms in the binary-forking model for a variety of fundamental problems, including sorting, semisorting, list ranking, tree contraction, range minima, and ordered set union, intersection and difference. In the binary-forking model, tasks can only fork into two child tasks, but can do so recursively and asynchronously. The tasks share memory, supporting reads, writes and test-and-sets. Costs are measured in terms of work (total number of instructions), and span (longest dependence chain). The binary-forking model is meant to capture both algorithm performance and algorithm-design considerations on many existing multithreaded languages, which are also asynchronous and rely on binary forks either explicitly or under the covers. In contrast to the widely studied PRAM model, it does not assume arbitrary-way forks nor synchronous operations, both of which are hard to implement in modern hardware. While optimal PRAM algorithms are known for the problems studied herein, it turns out that arbitrary-way forking and strict synchronization are powerful, if unrealistic, capabilities. Natural simulations of these PRAM algorithms in the binary-forking model (i.e., implementations in existing parallel languages) incur an Ω(logn)\Omega(\log n) overhead in span. This paper explores techniques for designing optimal algorithms when limited to binary forking and assuming asynchrony. All algorithms described in this paper are the first algorithms with optimal work and span in the binary-forking model. Most of the algorithms are simple. Many are randomized

    The Parallel Persistent Memory Model

    Full text link
    We consider a parallel computational model that consists of PP processors, each with a fast local ephemeral memory of limited size, and sharing a large persistent memory. The model allows for each processor to fault with bounded probability, and possibly restart. On faulting all processor state and local ephemeral memory are lost, but the persistent memory remains. This model is motivated by upcoming non-volatile memories that are as fast as existing random access memory, are accessible at the granularity of cache lines, and have the capability of surviving power outages. It is further motivated by the observation that in large parallel systems, failure of processors and their caches is not unusual. Within the model we develop a framework for developing locality efficient parallel algorithms that are resilient to failures. There are several challenges, including the need to recover from failures, the desire to do this in an asynchronous setting (i.e., not blocking other processors when one fails), and the need for synchronization primitives that are robust to failures. We describe approaches to solve these challenges based on breaking computations into what we call capsules, which have certain properties, and developing a work-stealing scheduler that functions properly within the context of failures. The scheduler guarantees a time bound of O(W/PA+D(P/PA)log1/fW)O(W/P_A + D(P/P_A) \lceil\log_{1/f} W\rceil) in expectation, where WW and DD are the work and depth of the computation (in the absence of failures), PAP_A is the average number of processors available during the computation, and f1/2f \le 1/2 is the probability that a capsule fails. Within the model and using the proposed methods, we develop efficient algorithms for parallel sorting and other primitives.Comment: This paper is the full version of a paper at SPAA 2018 with the same nam

    Parallel Unsmoothed Aggregation Algebraic Multigrid Algorithms on GPUs

    Full text link
    We design and implement a parallel algebraic multigrid method for isotropic graph Laplacian problems on multicore Graphical Processing Units (GPUs). The proposed AMG method is based on the aggregation framework. The setup phase of the algorithm uses a parallel maximal independent set algorithm in forming aggregates and the resulting coarse level hierarchy is then used in a K-cycle iteration solve phase with a 1\ell^1-Jacobi smoother. Numerical tests of a parallel implementation of the method for graphics processors are presented to demonstrate its effectiveness.Comment: 18 pages, 3 figure

    Validation of Methods to Predict Vibration of a Panel in the Near Field of a Hot Supersonic Rocket Plume

    Get PDF
    This paper describes the measurement and analysis of surface fluctuating pressure level (FPL) data and vibration data from a plume impingement aero-acoustic and vibration (PIAAV) test to validate NASA s physics-based modeling methods for prediction of panel vibration in the near field of a hot supersonic rocket plume. For this test - reported more fully in a companion paper by Osterholt & Knox at 26th Aerospace Testing Seminar, 2011 - the flexible panel was located 2.4 nozzle diameters from the plume centerline and 4.3 nozzle diameters downstream from the nozzle exit. The FPL loading is analyzed in terms of its auto spectrum, its cross spectrum, its spatial correlation parameters and its statistical properties. The panel vibration data is used to estimate the in-situ damping under plume FPL loading conditions and to validate both finite element analysis (FEA) and statistical energy analysis (SEA) methods for prediction of panel response. An assessment is also made of the effects of non-linearity in the panel elasticity

    The impact of microRNAs on transcriptional heterogeneity and gene co-expression across single embryonic stem cells

    Get PDF
    MicroRNAs act posttranscriptionally to suppress multiple target genes within a cell population. To what extent this multi-target suppression occurs in individual cells and how it impacts transcriptional heterogeneity and gene co-expression remains unknown. Here we used single-cell sequencing combined with introduction of individual microRNAs. miR-294 and let-7c were introduced into otherwise microRNA-deficient Dgcr8 knockout mouse embryonic stem cells. Both microRNAs induce suppression and correlated expression of their respective gene targets. The two microRNAs had opposing effects on transcriptional heterogeneity within the cell population, with let-7c increasing and miR-294 decreasing the heterogeneity between cells. Furthermore, let-7c promotes, whereas miR-294 suppresses, the phasing of cell cycle genes. These results show at the individual cell level how a microRNA simultaneously has impacts on its many targets and how that in turn can influence a population of cells. The findings have important implications in the understanding of how microRNAs influence the co-expression of genes and pathways, and thus ultimately cell fate
    corecore