
    Architecture and Performance of the Mether Network Shared Memory

    Mether is a Network Shared Memory (NSM). It allows applications on autonomous computers connected by a network to share a segment of memory. NSMs offer the attraction of a simple abstraction for shared state, i.e., shared memory. NSMs have a potential performance problem in the cost of remote references, which is typically mitigated by grouping memory into larger units such as pages and caching those pages. While Mether employs grouping and caching to reduce the average memory reference delay, it also removes the need for many remote references (page faults) by providing a facility with relaxed consistency requirements. Applications ported from a multiprocessor supercomputer with shared memory to a 16-workstation Mether configuration showed a cost/performance advantage of over 300 in favor of the Mether system. While Mether is currently implemented for Sun-3 and Sun-4 systems connected via Ethernet, its other characteristics (such as a choice of page sizes and a semaphore-like access mode useful for process synchronization) should suit it to a wide variety of networks. A reimplementation for an alternate configuration employing packet-switched networks is in progress.
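
    The sketch below illustrates, in C, the usage pattern that such a relaxed-consistency facility enables: an application polls shared state through a cheap, possibly stale mapping and pays for a remote reference only when it explicitly synchronizes. The nsm_map/nsm_sync names, the mode flags, and the stub bodies are hypothetical stand-ins for illustration, not Mether's actual interface.

        /* Minimal sketch of a network-shared-memory usage pattern. The nsm_* names,
         * the mode flags, and the stub bodies are hypothetical stand-ins, not
         * Mether's real interface; the stubs back the "segment" with ordinary local
         * memory so the example compiles and runs on its own. */
        #include <stdint.h>
        #include <stdio.h>
        #include <stdlib.h>

        #define NSM_CONSISTENT 0x1  /* every reference may fault and fetch the page remotely */
        #define NSM_RELAXED    0x2  /* reads may see stale cached data; no per-reference fault */

        void *nsm_map(const char *segment, size_t len, int mode)
        {
            (void)segment; (void)mode;       /* a real NSM would contact a page server here */
            return calloc(1, len);
        }

        void nsm_sync(void *addr, size_t len)
        {
            (void)addr; (void)len;           /* a real NSM would refresh the cached pages here */
        }

        int main(void)
        {
            /* Poll shared state through a relaxed mapping: cheap, but possibly stale. */
            volatile uint64_t *progress = nsm_map("/demo/progress", 4096, NSM_RELAXED);

            /* Pay for one explicit synchronization only when an exact value matters,
             * instead of taking a remote page fault on every reference. */
            nsm_sync((void *)progress, 4096);
            printf("tasks completed: %llu\n", (unsigned long long)progress[0]);

            free((void *)progress);
            return 0;
        }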

    Optimizing for a Many-Core Architecture without Compromising Ease-of-Programming

    Faced with nearly stagnant clock speed advances, chip manufacturers have turned to parallelism as the source for continuing performance improvements. But even though numerous parallel architectures have already been brought to market, a universally accepted methodology for programming them for general purpose applications has yet to emerge. Existing solutions tend to be hardware-specific, rendering them difficult to use for the majority of application programmers and domain experts, and not providing scalability guarantees for future generations of the hardware. This dissertation advances the validation of the following thesis: it is possible to develop efficient general-purpose programs for a many-core platform using a model recognized for its simplicity. To prove this thesis, we refer to the eXplicit Multi-Threading (XMT) architecture designed and built at the University of Maryland. XMT is an attempt at re-inventing parallel computing with a solid theoretical foundation and an aggressively scalable design. Algorithmically, XMT is inspired by the PRAM (Parallel Random Access Machine) model, and the architecture design is focused on reducing inter-task communication and synchronization overheads and on providing an easy-to-program parallel model. This thesis builds upon the existing XMT infrastructure to improve support for efficient execution, with a focus on ease-of-programming. Our contributions aim at reducing the programmer's effort in developing XMT applications and improving the overall performance. More concretely, we: (1) present a work-flow guiding programmers to produce efficient parallel solutions starting from a high-level problem; (2) introduce an analytical performance model for XMT programs and provide a methodology to project running time from an implementation; (3) propose and evaluate RAP -- an improved resource-aware compiler loop prefetching algorithm targeted at fine-grained many-core architectures; we demonstrate performance improvements of up to 34.79% on average over the GCC loop prefetching implementation and up to 24.61% on average over a simple hardware prefetching scheme; and (4) implement a number of parallel benchmarks and evaluate the overall performance of XMT relative to existing serial and parallel solutions, showing speedups of up to 13.89x vs. a serial processor and 8.10x vs. parallel code optimized for an existing many-core architecture (GPU). We also discuss the implementation and optimization of the Max-Flow algorithm on XMT, a problem which is among the more advanced in terms of complexity, benchmarking, and research interest in the parallel algorithms community. We demonstrate better speedups relative to the best serial solution than previous attempts on other parallel platforms have achieved.
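
    As a concrete illustration of the kind of transformation a loop-prefetching pass performs, the C sketch below inserts explicit prefetches ahead of a streaming loop using GCC's __builtin_prefetch. The fixed distance of 16 elements and the function itself are arbitrary assumptions made here for illustration; they are not the RAP algorithm or values taken from the dissertation.

        /* Software loop prefetching in the spirit of what a compiler pass inserts:
         * request cache lines a fixed number of iterations ahead of their use.
         * The distance of 16 elements is an arbitrary choice for this sketch. */
        #include <stddef.h>

        void scale(float *dst, const float *src, size_t n, float k)
        {
            const size_t dist = 16;                        /* prefetch distance in elements */
            for (size_t i = 0; i < n; i++) {
                if (i + dist < n) {
                    __builtin_prefetch(&src[i + dist], 0, 1);  /* prefetch for read  */
                    __builtin_prefetch(&dst[i + dist], 1, 1);  /* prefetch for write */
                }
                dst[i] = k * src[i];
            }
        }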

    Synchronization Costs in Parallel Programs and Concurrent Data Structures

    To use the computational power of modern computing machines, we have to deal with concurrent programs. Writing efficient concurrent programs is notoriously difficult, primarily due to the need to harness synchronization costs. In this thesis, we focus on synchronization costs in parallel programs and concurrent data structures. First, we present a novel granularity control technique for parallel programs designed for the dynamic multithreading environment. Then, in the context of concurrent data structures, we consider the notion of concurrency-optimality and propose the first implementation of a concurrency-optimal binary search tree that, intuitively, accepts a concurrent schedule if and only if the schedule is correct. Also, we propose parallel combining, a technique that enables efficient implementations of concurrent data structures from their parallel batched counterparts. We validate the proposed techniques via experimental evaluations showing superior or comparable performance with respect to state-of-the-art algorithms. From a more formal perspective, we consider the phenomenon of helping in concurrent data structures. Intuitively, helping is observed when the order of some operation in a linearization is fixed by a step of another process. We show that no wait-free linearizable implementation of a stack using read, write, compare&swap and fetch&add primitives can be help-free, correcting a mistake in an earlier proof by Censor-Hillel et al. Finally, we propose a simple way to analytically predict the throughput of data structures based on coarse-grained locking.
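
    For context, the C sketch below shows the textbook coarse-grained-locking pattern targeted by the last contribution: a stack protected by a single global lock, so every operation serializes through one critical section. It is an illustrative sketch under that assumption, not the thesis's implementation or its throughput model.

        /* A stack protected by one global lock: the coarse-grained locking pattern.
         * Illustrative sketch only; not the thesis's implementation. */
        #include <pthread.h>
        #include <stdlib.h>

        struct node  { int value; struct node *next; };
        struct stack { struct node *top; pthread_mutex_t lock; };

        void stack_init(struct stack *s)
        {
            s->top = NULL;
            pthread_mutex_init(&s->lock, NULL);
        }

        int stack_push(struct stack *s, int value)
        {
            struct node *n = malloc(sizeof *n);
            if (!n) return -1;
            n->value = value;
            pthread_mutex_lock(&s->lock);    /* the whole operation is one critical section */
            n->next = s->top;
            s->top  = n;
            pthread_mutex_unlock(&s->lock);
            return 0;
        }

        int stack_pop(struct stack *s, int *out)
        {
            pthread_mutex_lock(&s->lock);
            struct node *n = s->top;
            if (!n) { pthread_mutex_unlock(&s->lock); return -1; }
            s->top = n->next;
            pthread_mutex_unlock(&s->lock);
            *out = n->value;
            free(n);
            return 0;
        }

    Because every operation serializes on the same lock, throughput is essentially bounded by the critical-section length plus the lock hand-off cost, which is what makes such coarse-grained structures amenable to simple analytical prediction.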

    Computer Science I Like: proceedings of a miniconference on 4.11.2011

    • 
