37 research outputs found

    Improving the scalability of parallel N-body applications with an event driven constraint based execution model

    Full text link
    The scalability and efficiency of graph applications are significantly constrained by conventional systems and their supporting programming models. Technology trends like multicore, manycore, and heterogeneous system architectures are introducing further challenges and possibilities for emerging application domains such as graph applications. This paper explores the space of effective parallel execution of ephemeral graphs that are dynamically generated using the Barnes-Hut algorithm to exemplify dynamic workloads. The workloads are expressed using the semantics of an Exascale computing execution model called ParalleX. For comparison, results using conventional execution model semantics are also presented. We find improved load balancing during runtime and automatic parallelism discovery improving efficiency using the advanced semantics for Exascale computing.Comment: 11 figure

    First-Class Functions for First-Order Database Engines

    Full text link
    We describe Query Defunctionalization which enables off-the-shelf first-order database engines to process queries over first-class functions. Support for first-class functions is characterized by the ability to treat functions like regular data items that can be constructed at query runtime, passed to or returned from other (higher-order) functions, assigned to variables, and stored in persistent data structures. Query defunctionalization is a non-invasive approach that transforms such function-centric queries into the data-centric operations implemented by common query processors. Experiments with XQuery and PL/SQL database systems demonstrate that first-order database engines can faithfully and efficiently support the expressive "functions as data" paradigm.Comment: Proceedings of the 14th International Symposium on Database Programming Languages (DBPL 2013), August 30, 2013, Riva del Garda, Trento, Ital

    On One-Pass CPS Transformations

    Get PDF
    We bridge two distinct approaches to one-pass CPS transformations, i.e., CPS transformations that reduce administrative redexes at transformation time instead of in a post-processing phase. One approach is compositional and higher-order, and is due to Appel, Danvy and Filinski, and Wand, building on Plotkin's seminal work. The other is non-compositional and based on a syntactic theory of the lambda-calculus, and is due to Sabry and Felleisen. To relate the two approaches, we use Church encoding, Reynolds's defunctionalization, and an implementation technique for syntactic theories, refocusing, developed in the second author's PhD thesis

    Programs with continuations and linear logic

    Get PDF
    AbstractThe purpose of this paper is to investigate the programs with continuations in the framework of classical linear logic which can also be regarded as improvement of traditional classical logic. First, simply typed λ-calculus λ→c with continuation primitives is introduced. Second, a translation from λ→c to linear logic is given. Last, a correspondence between them is presented

    GraphStep: A System Architecture for Sparse-Graph Algorithms

    Get PDF
    Many important applications are organized around long-lived, irregular sparse graphs (e.g., data and knowledge bases, CAD optimization, numerical problems, simulations). The graph structures are large, and the applications need regular access to a large, data-dependent portion of the graph for each operation (e.g., the algorithm may need to walk the graph, visiting all nodes, or propagate changes through many nodes in the graph). On conventional microprocessors, the graph structures exceed on-chip cache capacities, making main-memory bandwidth and latency the key performance limiters. To avoid this “memory wall,” we introduce a concurrent system architecture for sparse graph algorithms that places graph nodes in small distributed memories paired with specialized graph processing nodes interconnected by a lightweight network. This gives us a scalable way to map these applications so that they can exploit the high-bandwidth and low-latency capabilities of embedded memories (e.g., FPGA Block RAMs). On typical spreading activation queries on the ConceptNet Knowledge Base, a sample application, this translates into an order of magnitude speedup per FPGA compared to a state-of-the-art Pentium processor

    On One-Pass CPS Transformations

    Get PDF
    We bridge two distinct approaches to one-pass CPS transformations, i.e., CPS transformations that reduce administrative redexes at transformation time instead of in a post-processing phase. One approach is compositional and higher-order, and is independently due to Appel, Danvy and Filinski, and Wand, building on Plotkin's seminal work. The other is non-compositional and based on a reduction semantics for the lambda-calculus, and is due to Sabry and Felleisen. To relate the two approaches, we use three tools: Reynolds's defunctionalization and its left inverse, refunctionalization; a special case of fold-unfold fusion due to Ohori and Sasano, fixed-point promotion; and an implementation technique for reduction semantics due to Danvy and Nielsen, refocusing. This work is directly applicable to transforming programs into monadic normal form

    The Parallel Persistent Memory Model

    Full text link
    We consider a parallel computational model that consists of PP processors, each with a fast local ephemeral memory of limited size, and sharing a large persistent memory. The model allows for each processor to fault with bounded probability, and possibly restart. On faulting all processor state and local ephemeral memory are lost, but the persistent memory remains. This model is motivated by upcoming non-volatile memories that are as fast as existing random access memory, are accessible at the granularity of cache lines, and have the capability of surviving power outages. It is further motivated by the observation that in large parallel systems, failure of processors and their caches is not unusual. Within the model we develop a framework for developing locality efficient parallel algorithms that are resilient to failures. There are several challenges, including the need to recover from failures, the desire to do this in an asynchronous setting (i.e., not blocking other processors when one fails), and the need for synchronization primitives that are robust to failures. We describe approaches to solve these challenges based on breaking computations into what we call capsules, which have certain properties, and developing a work-stealing scheduler that functions properly within the context of failures. The scheduler guarantees a time bound of O(W/PA+D(P/PA)log1/fW)O(W/P_A + D(P/P_A) \lceil\log_{1/f} W\rceil) in expectation, where WW and DD are the work and depth of the computation (in the absence of failures), PAP_A is the average number of processors available during the computation, and f1/2f \le 1/2 is the probability that a capsule fails. Within the model and using the proposed methods, we develop efficient algorithms for parallel sorting and other primitives.Comment: This paper is the full version of a paper at SPAA 2018 with the same nam
    corecore