
12 research outputs found

Consistencia de ejecución: una propuesta no cache coherente [Execution consistency: a non-cache-coherent proposal]

Author: Ardenghi Jorge Raúl
García Rafael B.
Publication date: 30/10/2012

The presence of one or more levels of cache memory in modern processors, whose purpose is to reduce the effective memory access time, becomes especially relevant in a DSM-type multiprocessor environment, given the much higher cost of memory references to remote modules. Clearly, the cache coherence protocol must conform to the memory consistency model adopted. The sequential consistency model (SC), generally accepted as the most natural, together with a series of more relaxed models such as processor consistency (PC), release consistency (RC), and more recently Java, assume cache coherence. There are, though far fewer, other models such as dag consistency and location consistency (LC) that dispense with the coherence requirement. In this work, after analyzing the limitations that coherence imposes at the hardware and software levels, we formulate a new non-cache-coherent model and an efficient cache protocol to support it. This model, which we refer to as execution consistency (EC), allows sequentially consistent execution of data-race-free parallel programs and, for asynchronous operations, enables behavior comparable to that of the Slow model, which would make it valid for unsynchronized applications. VI Workshop de Procesamiento Distribuido y Paralelo (WPDP). Red de Universidades con Carreras en Informática (RedUNCI).

Servicio de Difusión de la Creación Intelectual

Scheduling threads for constructive cache sharing on CMPs

Author: Ailamaki Anastassia
Blelloch Guy E.
Chen Shimin
Falsafi Babak
Fix Limor
Gibbons Phillip B.
Hardavellas Nikos
Kozuch Michael
Liaskovitis Vasileios
Mowry Todd C.
Wilkerson Chris
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 06/04/2009

In chip multiprocessors (CMPs), limiting the number of off-chip cache misses is crucial for good performance. Many multithreaded programs provide opportunities for constructive cache sharing, in which concurrently scheduled threads share a largely overlapping working set. In this paper, we compare the performance of two state-of-the-art schedulers proposed for fine-grained multithreaded programs: Parallel Depth First (PDF), which is specifically designed for constructive cache sharing, and Work Stealing (WS), which is a more traditional design. Our experimental results indicate that PDF scheduling yields a 1.3-1.6X performance improvement relative to WS for several fine-grained parallel benchmarks on projected future CMP configurations; we also report several issues that may limit the advantage of PDF in certain applications. These results also indicate that PDF more effectively utilizes off-chip bandwidth, making it possible to trade off on-chip cache for a larger number of cores. Moreover, we find that task granularity plays a key role in cache performance. Therefore, we present an automatic approach for selecting effective grain sizes, based on a new working set profiling algorithm that is an order of magnitude faster than previous approaches. This is the first paper demonstrating the effectiveness of PDF on real benchmarks, providing a direct comparison between PDF and WS, revealing the limiting factors for PDF in practice, and presenting an approach for overcoming these factors.

Infoscience - École polytechnique fédérale de Lausanne

Experimental analysis of space-bounded schedulers

Author: Aapo Kyrola
Guy E Blelloch
Harsha Vardhan Simhadri
Jeremy T Fineman
Phillip B Gibbons
Publication date: 03/04/2020

The running time of nested parallel programs on shared memory machines depends in significant part on how well the scheduler mapping the program to the machine is optimized for the organization of caches and processors on the machine. Recent work proposed "space-bounded schedulers" for scheduling such programs on the multi-level cache hierarchies of current machines. The main benefit of this class of schedulers is that they provably preserve locality of the program at every level in the hierarchy, resulting (in theory) in fewer cache misses and better use of bandwidth than the popular work-stealing scheduler. On the other hand, compared to work-stealing, space-bounded schedulers are inferior at load balancing and may have greater scheduling overheads, raising the question as to the relative effectiveness of the two schedulers in practice. In this paper, we provide the first experimental study aimed at addressing this question. To facilitate this study, we built a flexible experimental framework with separate interfaces for programs and schedulers. This enables a head-to-head comparison of the relative strengths of schedulers in terms of running times and cache miss counts across a range of benchmarks. (The framework is validated by comparisons with the Intel® Cilk™ Plus work-stealing scheduler.) We present experimental results on a 32-core Xeon® 7560 comparing work-stealing, hierarchy-minded work-stealing, and two variants of space-bounded schedulers on both divide-and-conquer micro-benchmarks and some popular algorithmic kernels. Our results indicate that space-bounded schedulers reduce the number of L3 cache misses compared to work-stealing schedulers by 25-65% for most of the benchmarks, but incur up to 7% additional scheduler and load-imbalance overhead. Only for memory-intensive benchmarks can the reduction in cache misses overcome the added overhead, resulting in up to a 25% improvement in running time for synthetic benchmarks and about 20% improvement for algorithmic kernels. We also quantify runtime improvements varying the available bandwidth per core (the "bandwidth gap"), and show up to 50% improvements in the running times of kernels as this gap increases 4-fold. As part of our study, we generalize prior definitions of space-bounded schedulers to allow for more practical variants (while still preserving their guarantees), and explore implementation tradeoffs.

CiteSeerX

An Analysis of Dag-Consistent Distributed Shared-Memory Algorithms

Author: Charles E Leiserson
Christopher F Joerg
Keith H Randall
Matteo Frigo
Robert Blumofe
Publication date: 01/01/1996

In this paper, we analyze the performance of parallel multithreaded algorithms that use dag-consistent distributed shared memory. Specifically, we analyze execution time, page faults, and space requirements for multithreaded algorithms executed by a work-stealing thread scheduler and the BACKER algorithm for maintaining dag consistency. We prove that if the accesses to the backing store are random and independent (the BACKER algorithm actually uses hashing), the expected execution time $T_P(C)$ of a "fully strict" multithreaded computation on $P$ processors, each with an LRU cache of $C$ pages, is $O(T_1(C)/P + mCT_\infty)$, where $T_1(C)$ is the total work of the computation including page faults, $T_\infty$ is its critical-path length excluding page faults, and $m$ is the minimum page transfer time. As a corollary to this theorem, we show that the expected number $F_P(C)$ of page faults incurred by a computation executed on $P$ processors can be related to the number $F_1(C)$ of serial page faults by the formula..

CiteSeerX

Crossref

Cilk: efficient multithreaded computing

Author: Randall Keith H. (Keith Harold)
Publication venue: Massachusetts Institute of Technology
Publication date: 01/01/1998

Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1998. Includes bibliographical references (p. 170-179). By Keith H. Randall. Ph.D.

DSpace@MIT

Portable high-performance programs

Author: Frigo Matteo, 1968-
Publication venue: Massachusetts Institute of Technology
Publication date: 01/01/1999

Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1999. Includes bibliographical references (p. 159-169). By Matteo Frigo. Ph.D.

CiteSeerX

DSpace@MIT