
    An experimental validation of the PRO model for parallel and distributed computation

    The Parallel Resource-Optimal (PRO) computation model was introduced by Gebremedhin et al. [2002] as a framework for the design and analysis of efficient parallel algorithms. The key features of the PRO model that distinguish it from previous parallel computation models are the full integration of resource-optimality into the design process and the use of a granularity function as a parameter for measuring quality. In this paper we present experimental results on parallel algorithms, designed using the PRO model, for two representative problems: list ranking and sorting. The algorithms are implemented using SSCRAP, our environment for developing coarse-grained algorithms. The experimental performance results agree well with analytical predictions made using the PRO model. Moreover, by running our experiments on different platforms, we have been able to provide an integrated view of the modeling of an underlying architecture and the design and implementation of scalable parallel algorithms.

    A Partition-centric Distributed Algorithm for Identifying Euler Circuits in Large Graphs

    Finding the Eulerian circuit in a graph is a classic problem, but one inadequately explored for parallel computation. With such circuits finding use in neuroscience and the Internet of Things, where graphs are large, designing a distributed algorithm for finding the Euler circuit is important. Existing parallel algorithms are impractical for commodity clusters and Clouds. We propose a novel partition-centric algorithm to find the Euler circuit over large graphs partitioned across distributed machines and executed iteratively using a Bulk Synchronous Parallel (BSP) model. The algorithm finds partial paths and cycles within each partition, and refines these into longer paths by recursively merging the partitions. We describe the algorithm, analyze its complexity, validate it on Apache Spark for large graphs, and offer experimental results. We also identify memory bottlenecks in the algorithm and propose an enhanced design to address them. To appear in the Proceedings of the 5th IEEE International Workshop on High-Performance Big Data, Deep Learning, and Cloud Computing, held in conjunction with the 33rd IEEE International Parallel and Distributed Processing Symposium (IPDPS 2019), Rio de Janeiro, Brazil, May 20th, 2019.
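    The paper's distributed algorithm is not reproduced here, but the classic sequential procedure that partition-centric schemes decompose, Hierholzer's method of growing a tour and splicing in sub-cycles as they close, can be sketched compactly. The toy multigraph and fixed array bounds below are illustrative assumptions, not the paper's Spark implementation.

```c
#include <stdio.h>

#define NE 5   /* undirected edges in the toy multigraph (assumed) */

/* All vertex degrees are even, so an Euler circuit exists. */
static const int edge[NE][2] = {{0,1},{1,2},{2,0},{0,3},{3,0}};
static int used[NE];            /* edge already consumed by the tour? */
static int circuit[NE + 1];     /* vertices of the tour, built backwards */
static int top;

/* Hierholzer's method: walk unused edges from v, splicing sub-cycles
 * into the tour as they close. */
static void tour(int v) {
    for (int e = 0; e < NE; e++) {
        if (!used[e] && (edge[e][0] == v || edge[e][1] == v)) {
            used[e] = 1;
            tour(edge[e][0] == v ? edge[e][1] : edge[e][0]);
        }
    }
    circuit[top++] = v;          /* emit v once its edges are exhausted */
}

int main(void) {
    tour(0);
    for (int i = top - 1; i >= 0; i--)
        printf("%d%s", circuit[i], i ? " -> " : "\n");
    return 0;
}
```

    The partition-centric algorithm in the paper effectively runs this kind of partial-cycle discovery independently per partition, then merges the resulting fragments across BSP supersteps.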

    List Ranking on a Coarse Grained Multiprocessor

    We present a deterministic algorithm for the List Ranking Problem on a Coarse Grained p-Multiprocessor (CGM) that is only a factor of log*(p) away from optimality. This holds for the number of communication rounds, where it achieves O(log(p) log*(p)), as well as for the required communication cost and total computation time, where it achieves O(n log*(p)). We report on experimental studies of this algorithm on a variety of platforms that confirm the validity of the chosen CGM model and show the possible gains and limits of such an algorithm. Finally, we suggest extending the CGM model with a communication blow-up parameter to allow better a priori predictions of the communication costs of algorithms.
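    As background for what coarse-grained algorithms like this one refine, the following sketch simulates classic pointer jumping for list ranking: each round, every node adds its successor's rank to its own and doubles the reach of its successor pointer, so ranks are complete after about log(n) rounds. The example list and fixed sizes are assumptions for illustration, not the paper's code.

```c
#include <stdio.h>
#include <string.h>

#define N 8

int main(void) {
    /* succ[i]: successor of node i; the tail (node 3) points to itself.
     * List order: 2, 0, 5, 1, 7, 4, 6, 3. */
    int succ[N] = {5, 7, 0, 3, 6, 1, 3, 4};
    int rank[N];
    for (int i = 0; i < N; i++)
        rank[i] = (succ[i] == i) ? 0 : 1;   /* one hop, 0 at the tail */

    /* Each round halves the remaining distance to the tail: rank
     * accumulates the skipped hop counts while succ doubles its reach.
     * Reads use the previous round's snapshot (synchronous step). */
    int ns[N], nr[N];
    for (int round = 0; round < 3; round++) {   /* ceil(log2(N)) = 3 */
        for (int i = 0; i < N; i++) {
            nr[i] = rank[i] + rank[succ[i]];
            ns[i] = succ[succ[i]];
        }
        memcpy(rank, nr, sizeof rank);
        memcpy(succ, ns, sizeof succ);
    }
    for (int i = 0; i < N; i++)
        printf("node %d: rank %d\n", i, rank[i]);
    return 0;
}
```

    Run naively on p processors this costs O(log(n)) communication rounds; the CGM algorithm above reduces the round count to depend on p rather than n.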

    List Ranking on PC Clusters

    We present two algorithms for the List Ranking Problem in the Coarse Grained Multicomputer model (CGM for short): if p is the number of processors and n the size of the list, then we give a deterministic one that achieves O(log p log* p) communication rounds and O(n log* p) for the required communication cost and total computation time, and a randomized one that requires O(log p) communication rounds and O(n) for the required communication cost and total computation time. We report on experimental studies of these algorithms on a PC cluster interconnected by a Myrinet network. To our knowledge, this is the first portable code for this problem that runs on a cluster. With these experimental studies, we examine the validity of the chosen CGM model and show the possible gains and limits of such algorithms for PC clusters.
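    Randomized list-ranking algorithms with O(log p) rounds commonly rest on a random-mate splicing step; the abstract does not say whether this paper uses exactly that technique, so the sketch below is a generic illustration of one such round: a node whose coin lands heads bridges over a tails successor, shrinking the list by a constant fraction in expectation. The data and seed are assumed.

```c
#include <stdio.h>
#include <stdlib.h>

#define N 8

int main(void) {
    int succ[N] = {5, 7, 0, 3, 6, 1, 3, 4};  /* tail 3 points to itself */
    int hop[N];                              /* hops spanned by succ[i] */
    int coin[N];
    for (int i = 0; i < N; i++)
        hop[i] = (succ[i] == i) ? 0 : 1;

    srand(42);
    for (int i = 0; i < N; i++) coin[i] = rand() & 1;

    /* Splice out successor s of i when coin[i] is heads and coin[s] is
     * tails (never the tail itself); since s's coin is tails, s is not
     * a splicer, so no two updates conflict. */
    for (int i = 0; i < N; i++) {
        int s = succ[i];
        if (s != i && succ[s] != s && coin[i] && !coin[s]) {
            hop[i] += hop[s];
            succ[i] = succ[s];
        }
    }
    for (int i = 0; i < N; i++)
        printf("node %d -> %d (spanning %d hops)\n", i, succ[i], hop[i]);
    return 0;
}
```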

    Parallel Query Processing on 2D Mesh and Linear Array Architectures

    As the size of the web grows, it is necessary to parallelize the process of retrieving information from it. Incorporating parallelism in search engines is one approach towards achieving this aim. This paper presents an algorithm for query processing on the 2D mesh architecture and two algorithms for linear array architectures. We attempt to exploit the arrangement of processors and the communication pattern in both 2D mesh and linear array architectures to attain high speedup and efficiency for query-keyword comparisons. A cost model is presented for each algorithm based on both processing and communication cost. The proposed algorithms are evaluated using the speedup and efficiency performance metrics. For the same number of processors, 2D Mesh_QP outperforms both linear array algorithms (LA_QPAKP and LA_QPKE). Keywords: 2D Mesh, Linear Arrays, Parallel computing, Query processing
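    To make the speedup and efficiency metrics concrete, here is a small sketch of the kind of cost model the paper describes, charging per-comparison computation plus a communication term that grows with processor count on a linear array. Every constant in it is an assumed placeholder, not a measured value from the paper.

```c
#include <stdio.h>

int main(void) {
    const double comparisons = 1e8;  /* query-keyword comparisons (assumed) */
    const double t_comp = 1e-9;      /* seconds per comparison (assumed) */
    const double t_comm = 1e-4;      /* seconds per communication step (assumed) */
    double t_seq = comparisons * t_comp;

    for (int p = 1; p <= 64; p *= 2) {
        /* Linear array: partial results are combined along the chain,
         * so communication steps grow linearly with p. */
        double t_par = comparisons / p * t_comp + (p - 1) * t_comm;
        double speedup = t_seq / t_par;
        printf("p=%2d  speedup=%6.2f  efficiency=%5.2f\n",
               p, speedup, speedup / p);
    }
    return 0;
}
```

    A 2D mesh replaces the (p - 1) combining term with one proportional to the mesh diameter, which is why it overtakes the linear array variants as p grows.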

    Practical Parallel External Memory Algorithms via Simulation of Parallel Algorithms

    This thesis introduces PEMS2, an improvement to PEMS (Parallel External Memory System). PEMS executes Bulk-Synchronous Parallel (BSP) algorithms in an External Memory (EM) context, enabling computation with very large data sets which exceed the size of main memory. Many parallel algorithms have been designed and implemented for Bulk-Synchronous Parallel models of computation. Such algorithms generally assume that the entire data set is stored in main memory at once. PEMS overcomes this limitation without requiring any modification to the algorithm by using disk space as memory for additional "virtual processors". Previous work has shown this to be a promising approach which scales well as computational resources (i.e., processors and disks) are added. However, the technique incurs significant overhead when compared with purpose-built EM algorithms. PEMS2 introduces refinements to the simulation process intended to reduce this overhead as well as the amount of disk space required to run the simulation. New functionality is also introduced, including asynchronous I/O and support for multi-core processors. Experimental results show that these changes significantly improve the runtime of the simulation. PEMS2 narrows the performance gap between simulated BSP algorithms and their hand-crafted EM counterparts, providing a practical system for using BSP algorithms with data sets which exceed the size of RAM.
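    The core PEMS idea can be sketched in a few lines: each virtual processor's context lives on disk and is loaded into RAM only while its share of a superstep runs, so total state can exceed main memory. The file naming, context size, and superstep body below are assumptions for illustration; real PEMS also routes the BSP message exchange through external memory.

```c
#include <stdio.h>
#include <stdlib.h>

#define VPROCS 4                /* virtual processors on one real core */
#define CTX_WORDS 1024          /* per-virtual-processor state (assumed) */

static void superstep(int vp, long *ctx) {
    for (int i = 0; i < CTX_WORDS; i++)
        ctx[i] += vp;           /* placeholder for the algorithm's local work */
}

int main(void) {
    long *ctx = malloc(CTX_WORDS * sizeof *ctx);
    char name[32];
    if (!ctx) return 1;

    /* Create each virtual processor's on-disk context. */
    for (int vp = 0; vp < VPROCS; vp++) {
        snprintf(name, sizeof name, "vp%03d.ctx", vp);
        FILE *f = fopen(name, "wb");
        if (!f) return 1;
        for (int i = 0; i < CTX_WORDS; i++) ctx[i] = 0;
        fwrite(ctx, sizeof *ctx, CTX_WORDS, f);
        fclose(f);
    }

    /* One superstep: swap each context in, run it, swap it back out. */
    for (int vp = 0; vp < VPROCS; vp++) {
        snprintf(name, sizeof name, "vp%03d.ctx", vp);
        FILE *f = fopen(name, "rb+");
        if (!f) return 1;
        fread(ctx, sizeof *ctx, CTX_WORDS, f);
        superstep(vp, ctx);
        rewind(f);
        fwrite(ctx, sizeof *ctx, CTX_WORDS, f);
        fclose(f);
    }
    free(ctx);
    return 0;
}
```

    PEMS2's refinements, asynchronous I/O among them, target exactly the swap-in/swap-out traffic this loop makes explicit.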

    iC2mpi: A Platform for Parallel Execution of Graph-Structured Iterative Computations

    Parallelization of sequential programs is often daunting because of the substantial development cost involved. Various solutions have been proposed to address this concern, including directive-based approaches and parallelization platforms. These solutions have not always been successful, in part because many try to address all types of applications. We propose a platform for parallelization of a class of applications that have similar computational structure, namely graph-structured iterative applications. iC2mpi is a unique proof-of-concept prototype platform that provides relatively easy parallelization of existing sequential programs and facilitates experimentation with static partitioning and dynamic load balancing schemes. We demonstrate with various generic application graph topologies and an existing application, namely a time-stepped battlefield management simulation, that our platform can produce good performance with very little effort.
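    The computational structure iC2mpi targets can be illustrated generically: an iterative computation in which every node recomputes its value from its neighbors' previous values until the values settle, which is what makes graph partitioning across processors natural. The graph, the neighbor-averaging update rule, and the convergence threshold below are assumed for illustration and are not tied to iC2mpi's API.

```c
#include <stdio.h>
#include <math.h>

#define N 4

int main(void) {
    /* Adjacency matrix of a small undirected example graph. */
    const int adj[N][N] = {{0,1,1,0},{1,0,1,1},{1,1,0,1},{0,1,1,0}};
    double val[N] = {1.0, 0.0, 0.0, 0.0}, next[N];

    for (int iter = 0; iter < 100; iter++) {
        double delta = 0.0;
        for (int i = 0; i < N; i++) {        /* one partition's work */
            double sum = 0.0;
            int deg = 0;
            for (int j = 0; j < N; j++)
                if (adj[i][j]) { sum += val[j]; deg++; }
            next[i] = sum / deg;             /* neighbor averaging */
            delta += fabs(next[i] - val[i]);
        }
        for (int i = 0; i < N; i++) val[i] = next[i];
        if (delta < 1e-9) { printf("converged at iteration %d\n", iter); break; }
    }
    for (int i = 0; i < N; i++)
        printf("node %d: %.6f\n", i, val[i]);
    return 0;
}
```

    In a platform like iC2mpi, the inner loop over nodes is split across processors by a static partitioning of the graph, with boundary values exchanged each iteration.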

    Overlapping of Communication and Computation and Early Binding: Fundamental Mechanisms for Improving Parallel Performance on Clusters of Workstations

    This study considers software techniques for improving performance on clusters of workstations and approaches for designing message-passing middleware that facilitate scalable, parallel processing. Early binding and overlapping of communication and computation are identified as fundamental approaches for improving parallel performance and scalability on clusters. Currently, cluster computers using the Message-Passing Interface for interprocess communication are the predominant choice for building high-performance computing facilities, which makes the findings of this work relevant to a wide audience from the areas of high-performance computing and parallel processing. The performance-enhancing techniques studied in this work are presently underutilized in practice because of the lack of adequate support by existing message-passing libraries; they are also rarely considered by parallel algorithm designers. Furthermore, commonly accepted methods for performance analysis and evaluation of parallel systems omit these techniques and focus primarily on more obvious communication characteristics such as latency and bandwidth. This study provides a theoretical framework for describing early binding and overlapping of communication and computation in models for parallel programming. This framework defines four new performance metrics that facilitate new approaches for performance analysis of parallel systems and algorithms. This dissertation provides experimental data that validate the correctness and accuracy of the performance analysis based on the new framework. The theoretical results of this performance analysis can be used by designers of parallel system and application software for assessing the quality of their implementations and for predicting the effective performance benefits of early binding and overlapping. This work presents MPI/Pro, a new MPI implementation that is specifically optimized for clusters of workstations interconnected with high-speed networks. This MPI implementation emphasizes features such as persistent communication, asynchronous processing, low processor overhead, and independent message progress. These features are identified as critical for delivering maximum performance to applications. The experimental section of this dissertation demonstrates the capability of MPI/Pro to facilitate software techniques that result in significant application performance improvements. Specific demonstrations with Virtual Interface Architecture and TCP/IP over Ethernet are offered.
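    The two central techniques are directly expressible with standard MPI: persistent requests bind buffers, peers, and tags to a communication channel once, before the loop (early binding), and independent computation placed between MPI_Startall and MPI_Waitall can proceed while messages are in flight (overlapping). The ring exchange below is a minimal generic sketch using the standard MPI API, not MPI/Pro-specific code.

```c
#include <mpi.h>

#define N 1024
#define STEPS 10

int main(int argc, char **argv) {
    int rank, size;
    double sbuf[N], rbuf[N], local[N] = {0};
    MPI_Request req[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int next = (rank + 1) % size;            /* ring pattern, illustrative */
    int prev = (rank + size - 1) % size;

    /* Early binding: buffers, peers, and tags are bound to requests
     * once, so per-iteration setup cost disappears from the loop. */
    MPI_Send_init(sbuf, N, MPI_DOUBLE, next, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Recv_init(rbuf, N, MPI_DOUBLE, prev, 0, MPI_COMM_WORLD, &req[1]);

    for (int step = 0; step < STEPS; step++) {
        for (int i = 0; i < N; i++) sbuf[i] = rank + step;  /* fill message */
        MPI_Startall(2, req);                /* kick off both transfers */

        /* Overlap: independent computation while messages progress. */
        for (int i = 0; i < N; i++) local[i] = 0.5 * local[i] + step;

        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    }

    MPI_Request_free(&req[0]);
    MPI_Request_free(&req[1]);
    MPI_Finalize();
    return 0;
}
```

    How much of the transfer time the compute loop actually hides depends on the library making independent message progress, which is precisely one of the MPI/Pro features the dissertation emphasizes.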