12 research outputs found

    Consistencia de ejecuci贸n: una propuesta no cache coherente

    Get PDF
    La presencia de uno o varios niveles de memoria cache en los procesadores modernos, cuyo objetivo es reducir el tiempo efectivo de acceso a memoria, adquiere especial relevancia en un ambiente multiprocesador del tipo DSM dado el mucho mayor costo de las referencias a memoria en m贸dulos remotos. Claramente, el protocolo de coherencia de cache debe responder al modelo de consistencia de memoria adoptado. El modelo secuencial SC, aceptado generalmente como el m谩s natural, junto a una serie de modelos m谩s relajados como consistencia de procesador PC, release RC, y m谩s recientemente Java, asumen coherencia de cache. Existen, aunque en proporci贸n mucho menor, otros modelos como el Dag y el location consistency LC que prescinden del requerimiento de coherencia. En este trabajo, analizadas las limitaciones que impone a nivel de hardware y software la coherencia, formulamos un nuevo modelo no cache coherente y un protocolo eficiente de cache para soportarlo. Este modelo, al cual referiremos como consistencia de ejecuci贸n EC, permite una ejecuci贸n secuencialmente consistente con programas paralelos libre de carrera, data race free, y en los casos de operaciones asincr贸nicas posibilita un comportamiento asimilable al del modelo Slow, lo cual lo tornar铆a v谩lido para aplicaciones no sincronizadasVI Workshop de Procesamiento Distribuido y Paralelo (WPDP)Red de Universidades con Carreras en Inform谩tica (RedUNCI

    Scheduling threads for constructive cache sharing on CMPs

    Get PDF
    In chip multiprocessors (CMPs), limiting the number of offchip cache misses is crucial for good performance. Many multithreaded programs provide opportunities for constructive cache sharing, in which concurrently scheduled threads share a largely overlapping working set. In this paper, we compare the performance of two state-of-the-art schedulers proposed for fine-grained multithreaded programs: Parallel Depth First (PDF), which is specifically designed for constructive cache sharing, and Work Stealing (WS), which is a more traditional design. Our experimental results indicate that PDF scheduling yields a 1.3 - 1.6X performance improvement relative to WS for several fine- grain parallel benchmarks on projected future CMP configurations; we also report several issues that may limit the advantage of PDF in certain applications. These results also indicate that PDF more effectively utilizes off-chip bandwidth, making it possible to trade-off on-chip cache for a larger number of cores. Moreover, we find that task granularity plays a key role in cache performance. Therefore, we present an automatic approach for selecting effective grain sizes, based on a new working set profiling algorithm that is an order of magnitude faster than previous approaches. This is the first paper demonstrating the effectiveness of PDF on real benchmarks, providing a direct comparison between PDF and WS, revealing the limiting factors for PDF in practice, and presenting an approach for overcoming these factors. Copyright 2007 ACM

    Experimental analysis of space-bounded schedulers

    Get PDF
    ABSTRACT The running time of nested parallel programs on shared memory machines depends in significant part on how well the scheduler mapping the program to the machine is optimized for the organization of caches and processors on the machine. Recent work proposed "space-bounded schedulers" for scheduling such programs on the multi-level cache hierarchies of current machines. The main benefit of this class of schedulers is that they provably preserve locality of the program at every level in the hierarchy, resulting (in theory) in fewer cache misses and better use of bandwidth than the popular work-stealing scheduler. On the other hand, compared to work-stealing, space-bounded schedulers are inferior at load balancing and may have greater scheduling overheads, raising the question as to the relative effectiveness of the two schedulers in practice. In this paper, we provide the first experimental study aimed at addressing this question. To facilitate this study, we built a flexible experimental framework with separate interfaces for programs and schedulers. This enables a headto-head comparison of the relative strengths of schedulers in terms of running times and cache miss counts across a range of benchmarks. (The framework is validated by comparisons with the Intel R Cilk TM Plus work-stealing scheduler.) We present experimental results on a 32-core Xeon R 7560 comparing work-stealing, hierarchy-minded work-stealing, and two variants of space-bounded schedulers on both divideand-conquer micro-benchmarks and some popular algorithmic kernels. Our results indicate that space-bounded schedulers reduce the number of L3 cache misses compared to work-stealing schedulers by 25-65% for most of the benchmarks, but incur up to 7% additional scheduler and loadimbalance overhead. Only for memory-intensive benchmarks can the reduction in cache misses overcome the added overhead, resulting in up to a 25% improvement in running time for synthetic benchmarks and about 20% improvement for algorithmic kernels. We also quantify runtime improvements ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the national government of United States. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only. varying the available bandwidth per core (the "bandwidth gap"), and show up to 50% improvements in the running times of kernels as this gap increases 4-fold. As part of our study, we generalize prior definitions of space-bounded schedulers to allow for more practical variants (while still preserving their guarantees), and explore implementation tradeoffs

    An Analysis of Dag-Consistent Distributed Shared-Memory Algorithms

    No full text
    In this paper, we analyze the performance of parallel multithreaded algorithms that use dag-consistent distributed shared memory. Specifically, we analyze execution time, page faults, and space requirements for multithreaded algorithms executed by a workstealing thread scheduler and the BACKER algorithm for maintaining dag consistency. We prove that if the accesses to the backing store are random and independent (the BACKER algorithm actually uses hashing), the expected execution time T P (C) of a "fully strict" multithreaded computation on P processors, each with a LRU cache of C pages, is O(T 1 (C)=P+mCT ), where T 1 (C) is the total work of the computation including page faults, T is its critical-path length excluding page faults, and m is the minimum page transfer time. As a corollary to this theorem, we show that the expected number F P (C) of page faults incurred by a computation executed on P processors can be related to the number F 1 (C) of serial page faults by the formula..

    Cilk : efficient multithreaded computing

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1998.Includes bibliographical references (p. 170-179).by Keith H. Randall.Ph.D

    Portable high-performance programs

    Get PDF
    Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1999.Includes bibliographical references (p. 159-169).by Matteo Frigo.Ph.D
    corecore