178 research outputs found

    Communication Lower Bounds for Distributed-Memory Computations

    Get PDF
    In this paper we propose a new approach to the study of the communication requirements of distributed computations, which advocates for the removal of the restrictive assumptions under which earlier results were derived. We illustrate our approach by giving tight lower bounds on the communication complexity required to solve several computational problems in a distributed-memory parallel machine, namely standard matrix multiplication, stencil computations, comparison sorting, and the Fast Fourier Transform. Our bounds rely only on a mild assumption on work distribution, and significantly strengthen previous results which require either the computation to be balanced among the processors, or specific initial distributions of the input data, or an upper bound on the size of processors\u27 local memories

    Network-Oblivious Algorithms

    Get PDF
    A framework is proposed for the design and analysis of network-oblivious algorithms, namely algorithms that can run unchanged, yet efficiently, on a variety of machines characterized by different degrees of parallelism and communication capabilities. The framework prescribes that a network-oblivious algorithm be specified on a parallel model of computation where the only parameter is the problem\u2019s input size, and then evaluated on a model with two parameters, capturing parallelism granularity and communication latency. It is shown that for a wide class of network-oblivious algorithms, optimality in the latter model implies optimality in the decomposable bulk synchronous parallel model, which is known to effectively describe a wide and significant class of parallel platforms. The proposed framework can be regarded as an attempt to port the notion of obliviousness, well established in the context of cache hierarchies, to the realm of parallel computation. Its effectiveness is illustrated by providing optimal network-oblivious algorithms for a number of key problems. Some limitations of the oblivious approach are also discussed

    Flexible Management on BSP Process Rescheduling: Offering Migration at Middleware and Application Levels

    Get PDF
    This article describes the rationales for developing jMigBSP - a Java programming library that offers object rescheduling. It was designed to work on grid computing environments and offers an interface that follows the BSP (Bulk Synchronous Parallel) style. jMigBSP’s main contribution focuses on the rescheduling facility in two different ways: (i) by using migration directives on the application coded irectly and (ii) through automatic load balancing at middleware level. Especially, this second idea is feasible thanks to the Java’s inheritance feature, in which transforms a simple jMigBSP application in amigratable one only by changing a single line of code. In addition, the presented library makes the object interaction easier by providing one-sided message passing directives and hides network latency through asynchronous communications. Finally, we developed three BSP applications: (i) Prefix Sum; (ii) Fractal Image Compression (FIC) and; (iii) Fast Fourier Transform (FFT).They show our library as viable solution to offer load balancing on BSP applications. Specially, the FIC results present gains up to 37% when applying migration directives inside the code. Finally, the FFT tests emphasize strength of jMigBSP. In this situation, it outperforms a native library denoted BSPlib when migration facilities take place.Keywords: Bulk Synchronous Parallel, rescheduling, Java, adaptation, object migration, grid computing

    On Characterizing the Data Access Complexity of Programs

    Full text link
    Technology trends will cause data movement to account for the majority of energy expenditure and execution time on emerging computers. Therefore, computational complexity will no longer be a sufficient metric for comparing algorithms, and a fundamental characterization of data access complexity will be increasingly important. The problem of developing lower bounds for data access complexity has been modeled using the formalism of Hong & Kung's red/blue pebble game for computational directed acyclic graphs (CDAGs). However, previously developed approaches to lower bounds analysis for the red/blue pebble game are very limited in effectiveness when applied to CDAGs of real programs, with computations comprised of multiple sub-computations with differing DAG structure. We address this problem by developing an approach for effectively composing lower bounds based on graph decomposition. We also develop a static analysis algorithm to derive the asymptotic data-access lower bounds of programs, as a function of the problem size and cache size

    Real-Time Processing Library for Open-Source Hardware Biomedical Sensors

    Get PDF
    Applications involving data acquisition from sensors need samples at a preset frequency rate, the filtering out of noise and/or analysis of certain frequency components. We propose a novel software architecture based on open-software hardware platforms which allows programmers to create data streams from input channels and easily implement filters and frequency analysis objects. The performances of the different classes given in the size of memory allocated and execution time (number of clock cycles) were analyzed in the low-cost platform Arduino Genuino. In addition, 11 people took part in an experiment in which they had to implement several exercises and complete a usability test. Sampling rates under 250 Hz (typical for many biomedical applications) makes it feasible to implement filters, sliding windows and Fourier analysis, operating in real time. Participants rated software usability at 70.2 out of 100 and the ease of use when implementing several signal processing applications was rated at just over 4.4 out of 5. Participants showed their intention of using this software because it was percieved as useful and very easy to use. The performances of the library showed that it may be appropriate for implementing small biomedical real-time applications or for human movement monitoring, even in a simple open-source hardware device like Arduino Genuino. The general perception about this library is that it is easy to use and intuitive

    Partial aggregation for collective communication in distributed memory machines

    Get PDF
    High Performance Computing (HPC) systems interconnect a large number of Processing Elements (PEs) in high-bandwidth networks to simulate complex scientific problems. The increasing scale of HPC systems poses great challenges on algorithm designers. As the average distance between PEs increases, data movement across hierarchical memory subsystems introduces high latency. Minimizing latency is particularly challenging in collective communications, where many PEs may interact in complex communication patterns. Although collective communications can be optimized for network-level parallelism, occasional synchronization delays due to dependencies in the communication pattern degrade application performance. To reduce the performance impact of communication and synchronization costs, parallel algorithms are designed with sophisticated latency hiding techniques. The principle is to interleave computation with asynchronous communication, which increases the overall occupancy of compute cores. However, collective communication primitives abstract parallelism which limits the integration of latency hiding techniques. Approaches to work around these limitations either modify the algorithmic structure of application codes, or replace collective primitives with verbose low-level communication calls. While these approaches give fine-grained control for latency hiding, implementing collective communication algorithms is challenging and requires expertise knowledge about HPC network topologies. A collective communication pattern is commonly described as a Directed Acyclic Graph (DAG) where a set of PEs, represented as vertices, resolve data dependencies through communication along the edges. Our approach improves latency hiding in collective communication through partial aggregation. Based on mathematical rules of binary operations and homomorphism, we expose data parallelism in a respective DAG to overlap computation with communication. The proposed concepts are implemented and evaluated with a subset of collective primitives in the Message Passing Interface (MPI), an established communication standard in scientific computing. An experimental analysis with communication-bound microbenchmarks shows considerable performance benefits for the evaluated collective primitives. A detailed case study with a large-scale distributed sort algorithm demonstrates, how partial aggregation significantly improves performance in data-intensive scenarios. Besides better latency hiding capabilities with collective communication primitives, our approach enables further optimizations of their implementations within MPI libraries. The vast amount of asynchronous programming models, which are actively studied in the HPC community, benefit from partial aggregation in collective communication patterns. Future work can utilize partial aggregation to improve the interaction of MPI collectives with acclerator architectures, and to design more efficient communication algorithms

    Early Prediction of MPP Performance: The SP2, T and Paragon Experiences

    Get PDF
    • …
    corecore