
    Solving the At-Most-Once Problem with Nearly Optimal Effectiveness

    We present and analyze a wait-free deterministic algorithm for solving the at-most-once problem: how m shared-memory fail-prone processes asynchronously perform n jobs at most once. Our algorithmic strategy provides, for the first time, nearly optimal effectiveness, a measure that expresses the total number of jobs completed in the worst case. The effectiveness of our algorithm equals $n-2m+2$. This is within an additive factor of m of the known effectiveness upper bound $n-m+1$ over all possible algorithms, and improves on the previously best known deterministic solutions, which have effectiveness only $n - \log m \cdot o(n)$. We also present an iterative version of our algorithm that is both effectiveness-optimal and work-optimal for any $m = O\left(\sqrt[3+\epsilon]{n/\log n}\right)$ and any constant $\epsilon > 0$. We then employ this algorithm to provide a new algorithmic solution for the Write-All problem which is work-optimal for any $m = O\left(\sqrt[3+\epsilon]{n/\log n}\right)$. Comment: Updated version. A Brief Announcement was published in PODC 2011. An Extended Abstract was published in the proceedings of ICDCN 2012. A full version was published in Theoretical Computer Science, Volume 496, 22 July 2013, Pages 69 - 8
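
    As a quick check of the figures above (a restatement of the abstract's bounds, not a new result), the gap between the algorithm's effectiveness and the general upper bound is

        $E_{\mathrm{alg}} = n - 2m + 2, \qquad E_{\mathrm{ub}} = n - m + 1, \qquad E_{\mathrm{ub}} - E_{\mathrm{alg}} = (n - m + 1) - (n - 2m + 2) = m - 1,$

    which is indeed within an additive factor of m.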

    Aspects of practical implementations of PRAM algorithms

    The PRAM is a shared memory model of parallel computation which abstracts away from inessential engineering details. It provides a very simple, architecture-independent model and a good programming environment. Theoreticians of the computer science community have proved that it is possible to emulate the theoretical PRAM model using current technology. Solutions have been found for effectively interconnecting processing elements, for routing data on these networks and for distributing the data among memory modules without hotspots. This thesis reviews this emulation and the possibilities it provides for large scale general purpose parallel computation. The emulation employs a bridging model which acts as an interface between the actual hardware and the PRAM model. We review the evidence that such a scheme can achieve scalable parallel performance and portable parallel software and that PRAM algorithms can be optimally implemented on such practical models. In the course of this review we present the following new results: 1. Concerning parallel approximation algorithms, we describe an NC algorithm for finding an approximation to a minimum weight perfect matching in a complete weighted graph. The algorithm is conceptually very simple and it is also the first NC-approximation algorithm for the task with a sub-linear performance ratio. 2. Concerning graph embedding, we describe dense edge-disjoint embeddings of the complete binary tree with n leaves in the following n-node communication networks: the hypercube, the de Bruijn and shuffle-exchange networks and the 2-dimensional mesh. In the embeddings the maximum distance from a leaf to the root of the tree is asymptotically optimally short. The embeddings facilitate efficient implementation of many PRAM algorithms on networks employing these graphs as interconnection networks. 3. Concerning bulk synchronous algorithmics, we describe scalable, transportable algorithms for three commonly required types of computation: balanced tree computations, Fast Fourier Transforms and matrix multiplication
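
    The bulk-synchronous "balanced tree computations" mentioned in the third result are not spelled out in the abstract; the following is a minimal sketch of the underlying idea, written as a sequential simulation of a tree reduction over per-processor blocks. All names and the superstep structure are illustrative assumptions, not code from the thesis.

        # Sequential simulation of a bulk-synchronous balanced-tree reduction.
        # Illustrative only; names and structure are not taken from the thesis.
        def bsp_tree_reduce(blocks, op):
            """blocks: one local data block per (simulated) processor."""
            # Superstep 1: each processor reduces its own block locally.
            partial = []
            for block in blocks:
                acc = block[0]
                for x in block[1:]:
                    acc = op(acc, x)
                partial.append(acc)
            # Later supersteps: combine pairwise up a balanced binary tree,
            # halving the number of active processors each round (log p rounds).
            active = partial
            while len(active) > 1:
                nxt = [op(active[i], active[i + 1])
                       for i in range(0, len(active) - 1, 2)]
                if len(active) % 2 == 1:      # odd processor carries its value over
                    nxt.append(active[-1])
                active = nxt                  # barrier between supersteps
            return active[0]

        data = [[1, 2, 3], [4, 5], [6], [7, 8, 9, 10]]        # 4 simulated processors
        print(bsp_tree_reduce(data, lambda a, b: a + b))      # 55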

    Theoretically Efficient Parallel Graph Algorithms Can Be Fast and Scalable

    There has been significant recent interest in parallel graph processing due to the need to quickly analyze the large graphs available today. Many graph codes have been designed for distributed memory or external memory. However, today even the largest publicly-available real-world graph (the Hyperlink Web graph with over 3.5 billion vertices and 128 billion edges) can fit in the memory of a single commodity multicore server. Nevertheless, most experimental work in the literature reports results on much smaller graphs, and the work on the Hyperlink graph uses distributed or external memory. Therefore, it is natural to ask whether we can efficiently solve a broad class of graph problems on this graph in memory. This paper shows that theoretically-efficient parallel graph algorithms can scale to the largest publicly-available graphs using a single machine with a terabyte of RAM, processing them in minutes. We give implementations of theoretically-efficient parallel algorithms for 20 important graph problems. We also present the optimizations and techniques that we used in our implementations, which were crucial in enabling us to process these large graphs quickly. We show that the running times of our implementations outperform existing state-of-the-art implementations on the largest real-world graphs. For many of the problems that we consider, this is the first time they have been solved on graphs at this scale. We have made the implementations developed in this work publicly available as the Graph-Based Benchmark Suite (GBBS). Comment: This is the full version of the paper appearing in the ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 201
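
    The claim that the Hyperlink Web graph fits in a terabyte of RAM can be sanity-checked with a back-of-envelope estimate for a plain compressed-sparse-row layout; the byte widths below are assumptions for illustration, not the compressed encoding the paper actually uses.

        # Back-of-envelope CSR memory estimate for the Hyperlink Web graph.
        # 8-byte offsets and 4-byte neighbour ids are illustrative assumptions
        # (3.5B vertex ids still fit in 32 bits), not the paper's encoding.
        vertices = 3_500_000_000
        edges    = 128_000_000_000

        offset_bytes   = 8 * (vertices + 1)   # one 64-bit offset per vertex
        neighbor_bytes = 4 * edges            # one 32-bit id per directed edge

        total_gb = (offset_bytes + neighbor_bytes) / 1e9
        print(f"~{total_gb:.0f} GB for one direction of the edges")   # ~540 GB

    Storing both edge directions or per-vertex algorithm state on top of this is what makes compact representations attractive at this scale.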

    Self-distributing computation

    Thesis (M.Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2002. Includes bibliographical references (leaves 54-55). In this thesis, I propose a new model for distributing computational work in a parallel or distributed system. This model relies on exposing the topology and performance characteristics of the underlying architecture to the application. Responsibility for task distribution is divided between a run-time system, which determines when tasks should be distributed or consolidated, and the application, which specifies to the run-time system its first-choice distribution based on a representation of the current state of the underlying architecture. Discussing my experience in implementing this model as a Java-based simulator, I argue for the advantages of this approach as they relate to performance on changing architectures and ease of programming. By Thomas R. Woodfin. M.Eng
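
    The division of responsibility the abstract describes (the run-time system decides when to redistribute, the application decides where) can be pictured with a small hypothetical interface; this is an illustrative sketch, not the thesis's Java-based simulator.

        # Hypothetical sketch of the runtime/application split described above.
        from dataclasses import dataclass

        @dataclass
        class Node:
            name: str
            load: float                     # current utilisation, 0.0 - 1.0

        class Runtime:
            """Exposes topology; decides *when* to redistribute tasks."""
            def __init__(self, nodes):
                self.nodes = nodes
            def topology(self):
                return list(self.nodes)
            def rebalance(self, app, tasks):
                # Ask the application for its first-choice distribution.
                return app.preferred_distribution(tasks, self.topology())

        class MyApp:
            """Decides *where* tasks should go, given the exposed topology."""
            def preferred_distribution(self, tasks, nodes):
                least_loaded = min(nodes, key=lambda n: n.load)
                return {t: least_loaded.name for t in tasks}

        rt = Runtime([Node("a", 0.9), Node("b", 0.2)])
        print(rt.rebalance(MyApp(), ["t1", "t2"]))    # {'t1': 'b', 't2': 'b'}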

    Shared memory with hidden latency on a family of mesh-like networks


    Asynchronous Teams and Tasks in a Message Passing Environment

    As the discipline of scientific computing grows, so too does the "skills gap" between increasingly complex scientific applications and the efficient algorithms they require. Increasing demand for computational power on the march towards exascale requires innovative approaches. Closing the skills gap avoids the many pitfalls that lead to poor utilisation of resources and wasted investment. This thesis tackles two challenges: asynchronous algorithms for parallel computing and fault tolerance. First, I present a novel asynchronous task invocation methodology for Discontinuous Galerkin codes called enclave tasking. The approach modifies the parallel ordering of tasks, allowing for efficient scaling on dynamic meshes up to 756 cores. It ensures high levels of concurrency and intermixes tasks of different computational properties. Critical tasks along domain boundaries are prioritised to overlap computation and communication. The second contribution is the teaMPI library, which forms teams of MPI processes that exchange consistency data through an asynchronous "heartbeat". In contrast to previous approaches, teaMPI operates fully asynchronously with reduced overhead. It is also capable of detecting individually slow or failing ranks and inconsistent data among replicas. Finally, I provide an outlook on how asynchronous teams using enclave tasking can be combined into an advanced team-based diffusive load balancing scheme. Both concepts are integrated into and contribute towards the ExaHyPE project, a next-generation code that solves hyperbolic equation systems on dynamically adaptive Cartesian grids
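
    The asynchronous "heartbeat" between replica ranks can be illustrated with a short mpi4py sketch; this mirrors the idea described above but is not teaMPI's actual API, and it assumes an even number of ranks paired as replicas.

        # Asynchronous heartbeat between paired replica ranks (illustration only,
        # not the teaMPI API). Run with, e.g., mpirun -n 2 python heartbeat.py
        from mpi4py import MPI
        import time

        comm = MPI.COMM_WORLD
        rank = comm.Get_rank()
        partner = rank ^ 1                 # assumes an even number of ranks
        TAG = 99

        recv_req = comm.irecv(source=partner, tag=TAG)
        my_beat, partner_beat = 0, 0

        for step in range(100):
            # ... application work for this time step would go here ...
            my_beat += 1
            comm.isend(my_beat, dest=partner, tag=TAG)  # a real implementation would track/free these requests

            # Drain any heartbeats that have already arrived, without blocking.
            done, msg = recv_req.test()
            while done:
                partner_beat = msg
                recv_req = comm.irecv(source=partner, tag=TAG)
                done, msg = recv_req.test()

            if my_beat - partner_beat > 10:   # arbitrary lag threshold
                print(f"rank {rank}: replica {partner} looks slow or has failed")
            time.sleep(0.01)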

    Throughput-Efficient Network-on-Chip Router Design with STT-MRAM

    As the number of processor cores on a chip increases with the advance of CMOS technology, there has been a growing need for more efficient Network-on-Chip (NoC) design, since communication delay has become a major bottleneck in large-scale multicore systems. In designing efficient input buffers of NoC routers for better performance and power efficiency, Spin-Torque Transfer Magnetic RAM (STT-MRAM) is regarded as a promising solution due to its high density and near-zero leakage power. Previous work that adopts STT-MRAM for NoC router input buffers falls short of minimizing the power-consumption overhead, even though it succeeds to some degree in achieving high network throughput by using SRAM to hide the long write latency of STT-MRAM. In this thesis, we propose a novel input buffer design that relies solely on STT-MRAM, without the need for SRAM, to maximize the benefits of low leakage power and area efficiency inherent in STT-MRAM. In addition, we introduce power-efficient buffer refreshing schemes synergized with age-based switch arbitration that gives higher priority to older flits, removing unnecessary refreshing operations. On average, we observe throughput improvements of 16% on synthetic workloads and benchmarks
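
    The age-based switch arbitration mentioned above can be sketched in a few lines: among flits competing for the same output port, the oldest one (earliest injection cycle) wins. The field names are assumptions for illustration, not the thesis's hardware design.

        # Sketch of age-based arbitration for one output port (illustration only).
        from dataclasses import dataclass

        @dataclass
        class Flit:
            src: int
            dst_port: int
            injection_cycle: int            # smaller value = older flit

        def arbitrate(requests):
            """requests: flits competing for the same output port this cycle."""
            if not requests:
                return None
            # Oldest flit wins; tie-break on source id for determinism.
            return min(requests, key=lambda f: (f.injection_cycle, f.src))

        flits = [Flit(0, 3, 120), Flit(1, 3, 95), Flit(2, 3, 95)]
        print(arbitrate(flits))   # Flit(src=1, dst_port=3, injection_cycle=95)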

    Parallel and Distributed Algorithms for a Class of Graph-Related Computational Problems.

    There exist at least two models of parallel computing, namely, shared-memory and message-passing. This research addresses problems in both these types of systems, and proposes efficient parallel (shared-memory) and distributed (message-passing) algorithms for a variety of graph-related computational problems. In part I, we design algorithms for three generic problems in distributed systems: set manipulation, network structure recognition and facility placement. We present optimal distributed algorithms for recognizing rectangular-mesh networks. The time and message complexity of our algorithm is linear in the number of nodes in the network. We also lay the foundation for the recognition of 2-reducible, outer-planar and cactus graphs. These algorithms have a message complexity of O(kn), where k is the number of isolated two-degree nodes in the network. We introduce the problem of reliable r-domination and design unified optimal distributed algorithms for the total, reliable and independent r-domination on trees. The time and message complexity of our algorithm is O(n), where n is the number of nodes in the tree. In the domain of set manipulation we design optimal algorithms for determining the intersection of sets in a distributed environment, where each processor is assumed to have its own set. The time and message complexity of our set intersection algorithm is O(mn), where m is the cardinality of the smallest set. In part II of our research we design optimal algorithms for r-domination and efficient parallel algorithms for the p-center problem on trees. We also present an optimal algorithm for computing the maximum independent set of intervals in the EREW-PRAM model. The r-domination problem on trees can now be solved in O(log n) time with O(n/log n) processors in the EREW-PRAM model. A parallel algorithm for range searching is developed using the concept of distributed data structures. We show that O(log n) search time can be achieved for a range search on n 3-dimensional points using (2 log^2 n - 14 log n + 8) processors. Our algorithm can easily be generalized to d-dimensional range search. (Abstract shortened with permission of author.)
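
    The PRAM algorithms themselves are not reproduced in the abstract; to pin down one of the problems, the sketch below gives the standard sequential greedy for maximum independent set of intervals (the problem the dissertation solves optimally in the EREW-PRAM model), treating intervals that only share an endpoint as disjoint.

        # Sequential greedy for maximum independent set of intervals.
        # Shown only to fix the problem statement; the dissertation's
        # O(log n)-time EREW-PRAM algorithm is not reproduced here.
        def max_independent_intervals(intervals):
            """intervals: list of (start, end); returns a largest disjoint subset."""
            chosen, last_end = [], float("-inf")
            for s, e in sorted(intervals, key=lambda iv: iv[1]):  # by right endpoint
                if s >= last_end:         # does not overlap the last chosen interval
                    chosen.append((s, e))
                    last_end = e
            return chosen

        print(max_independent_intervals([(1, 4), (2, 3), (3, 5), (6, 7)]))
        # [(2, 3), (3, 5), (6, 7)]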

    High Performance Web Servers: A Study In Concurrent Programming Models

    With the advent of commodity large-scale multi-core computers, the performance of software running on these computers has become a challenge to researchers and enterprise developers. While academic research and industrial products have moved in the direction of writing scalable and highly available services using distributed computing, single-machine performance remains an active domain, one which is far from saturated. This thesis selects an archetypal software example and workload in this domain, and describes software characteristics affecting performance. The example is highly-parallel web-servers processing a static workload. Specifically, this work examines concurrent programming models in the context of high-performance web-servers across different architectures — threaded (Apache, Go and μKnot), event-driven (Nginx, μServer) and staged (WatPipe) — compared on two static workloads in two different domains. The two workloads are a Zipf distribution of file sizes representing a user session pulling an assortment of many small and a few large files, and a 50KB file representing chunked streaming of a large audio or video file. Significant effort is made to fairly compare eight web-servers by carefully tuning each via their adjustment parameters. Tuning plays a significant role in workload-specific performance. The two domains are no disk I/O (in-memory file set) and medium disk I/O. The domains are created by lowering the amount of RAM available to the web-server from 4GB to 2GB, forcing files to be evicted from the file-system cache. Both domains are also restricted to 4 CPUs. The primary goal of this thesis is to examine fundamental performance differences between threaded and event-driven concurrency models, with particular emphasis on user-level threading models. Additionally, a secondary goal of the work is to examine high-performance software under restricted hardware environments. Over-provisioned hardware environments can mask architectural and implementation shortcomings in software – the hypothesis in this work is that restricting resources stresses the application, bringing out important performance characteristics and properties. Experimental results for the given workload show that memory pressure is one of the most significant factors in the degradation of web-server performance, because it forces both the onset and amount of disk I/O. With an ever-increasing need to support more content at faster rates, a web-server relies heavily on in-memory caching of files and related content. In fact, personal and small-business web-servers are even run on minimal hardware, like the Raspberry Pi, with only 1GB of RAM and a small SD card for the file system. Therefore, understanding behaviour and performance in restricted contexts should be a normal aspect of testing a web-server (and other software systems)
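
    The Zipf static workload described above can be approximated with a short generator; the exponent, base size, and file count below are illustrative choices, not the thesis's exact configuration.

        # Sketch of a Zipf-style static file set: a few large files and many
        # small ones, from which a user session pulls an assortment.
        import random

        def zipf_file_sizes(n_files, base_kb=50_000, s=1.0):
            # size of the rank-k file is proportional to 1 / k^s
            return [max(1, int(base_kb / (k ** s))) for k in range(1, n_files + 1)]

        def user_session(sizes, n_requests=50, seed=1):
            rng = random.Random(seed)
            return [rng.choice(sizes) for _ in range(n_requests)]

        sizes = zipf_file_sizes(10_000)
        session = user_session(sizes)
        print(f"largest file: {max(sizes)} KB, "
              f"median requested size: {sorted(session)[len(session) // 2]} KB")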