9 research outputs found

    Replication Techniques for Speeding up Parallel Applications on Distributed Systems

    Get PDF
    This paper discusses the design choices involved in replicating objects and their effect on performance. Important issues are: how to maintain consistency among different copies of an object; how to implement changes to objects; and which strategy for object replication to use. We have implemented several options to determine which ones are most efficient

    Cache coherence requirements for interprocess rendezvous

    Full text link
    Multiprocessors in which a shared bus is used by the processor to communicate with common memory are an emerging class of machines where there is a need to support parallel programming languages. A language construct that is found in a number of parallel programming languages to support synchronization and communication in the interprocess rendezvous. Shared-bus multiprocessor require a protocol to keep the date in their caches coherent. There are two major categories of these protocols: invalidation and write-boadcast. This paper examines the requirements for cache coherence protocols to support efficient interprocessor rendezvous. The approach taken is to examine the memory referencing patterns to the run-time data structures during rendezvous execution. The appropriate coherence protocol is shown to be a function of the processor scheduling strategy used by the run-time system at synchronzation points during the rendezvous. When processes migrate freely as a result of the scheduling strategy, invalidation protocols are found to be more efficient. When migration is restricted by the scheduler, write-broadcast protocols are more efficient.Peer Reviewedhttp://deepblue.lib.umich.edu/bitstream/2027.42/44571/1/10766_2005_Article_BF01407863.pd

    Directory Based Cache Coherency Protocols for Shared Memory Multiprocessors

    Get PDF
    Directory based cache coherency protocols can be used to build large scale, weakly ordered, shared memory multiprocessors. The salient feature of these protocols is that they are interconnection network independent, making them more scaleable than snoopy bus protocols. The major criticisms of previously defined directory protocols point to the size of memory heeded to store the directory and the amount of communication across the interconnection network required to maintain coherence. This thesis tries solving these problems by changing the entry format of the global table, altering the architecture of the global table, and developing new protocols. Some alternative directory entry formats are described, including a special entry format for implementing queueing semaphores. Evaluation of the various entry formats is done with probabilistic models of shared cache blocks and software simulation. A variable length global table organization is presented which can be used to reduce the size of the global table, regardless of the entry format. Its performance is analyzed using software simulation. A protocol which maintains a linked list of processors which have a particular block cached is presented. Several variations of this protocol induce less interconnection network traffic than traditional protocols

    An efficient buffer management methods for Cachet

    Get PDF
    Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2004.Includes bibliographical references (leaves 83-85).(cont.) running simulation. Third, since BCachet is almost linear under the assumption of a reasonable number of sites, BCachet is very scalable. Therefore, it can be used to for very large scale multiprocessor systems.We propose an efficient buffer management method for Cachet [7], called BCachet. Cachet is an adaptive cache coherence protocol based on a mechanism-oriented memory model called Commit-Reconcile & Fences (CRF) [1]. Although Cachet is theoretically proved to be sound and live, a direct implementation of Cachet is not feasible because it requires too expensive hardware. We greatly reduced the hardware cost for buffer management in BCachet without changing the memory model and the adaptive nature of Cachet. Hardware cost for the incoming message buffer of the memory site is greatly reduced from PxN FIFOs to two FIFOs in BCachet where P is the number of sites and N is the number of address lines in a memory unit. We also reduced the minimum size of suspended message buffer per memory site from (log₂ P+V) xPx(rq[max]), to log₂ P where V is the size of a memory block in terms of bits and rqma is the maximum number of request messages per cache. BCachet has three architectural merits. First, BCachet separates buffer management units for deadlock avoidance and those units for livelock avoidance so that a designer has an option in the liveness level and the corresponding hardware cost: (1) allows a livelock with an extremely low probability and saves hardware cost for fairness control. (2) does not allow a livelock at all and accept hardware cost for fairness control. Second, a designer can easily parameterize the sizes of buffer units to explore the cost-performance curves without affecting the soundness and the liveness. Because usual sizes of buffer management units are much larger than the minimum sizes of those units that guarantee the liveness and soundness of the system, a designer can easily find optimum trade-off point for those units by changing size parameters andby Byungsub Kim.S.M

    Modeling Out-of-Order Superscalar Processor Performance Quickly and Accurately with Traces

    Get PDF
    Fast and accurate processor simulation is essential in processor design. Trace-driven simulation is a widely practiced fast simulation method. However, serious accuracy issues arise when an out-of-order superscalar processor is considered. In this thesis, trace-driven simulation methods are suggested to quickly and accurately model out-of-order superscalar processor performance with reduced traces. The approaches abstract the processor core and focus on the processor's uncore events rather than the processor's internal events. As a result, fast simulation speed is achieved while maintaining fairly small error compared with an execution-driven simulator. Traces can be generated either by a cycle-accurate simulator or an abstract timing model on top of a simple functional simulator. Simulation results are more accurate with the method using traces generated from a cycle-accurate simulator. Faster trace generation speed is achieved with the abstract timing model. The methods determine how to treat a cache miss with respect to other cache misses recorded in the trace by dynamically reconstructing the reorder buffer state during simulation and honoring the dependencies between the trace items. This approach preserves a processor's dynamic uncore access patterns and accurately predicts the relative performance change when the processor's uncore-level parameters are changed. The methods are attractive especially in the early design stages due to its fast simulation speed

    Modello dei costi delle tecniche di cache coherence nelle architetture multiprocessor

    Get PDF
    In questa tesi viene affrontato il problema della Cache Coherence nelle architetture multiprocessor. Dopo aver introdotto le principali tecniche messe a disposizione dal livello hardware-firmware di questi sistemi e i principali protocolli di cache coherence sviluppati per la risoluzione di questo problema, vengono descritte le due principali strategie di implementazione. Da una parte abbiamo la soluzione snoopy-based che fa uso di un bus come punto di centralizzazione a livello firmware, dall’altra la soluzione adottata nei sistemi a più alto grado di parallelismo basata sul concetto di directory. Una soluzione alternativa è l’approccio algorithm-dependent che permette di affrontare la risoluzione del problema della cache coherence nella progettazione del supporto della mutua esclusione. Nella tesi è stato sviluppato un modello dei costi che usa come punto di partenza la valutazione delle prestazioni delle architetture multiprocessor basata sulla teoria delle code. Il modello presentato permette di valutare l’impatto che questi due approcci hanno sulle prestazioni dei programmi paralleli, mettendo in evidenza come l’approccio algorithm-dependent riesca a minimizzare il numero di trasferimenti di blocchi di cache e di comunicazioni interprocessor rispetto alle soluzioni automatiche

    Simulation models of shared-memory multiprocessor systems

    Get PDF

    Structured Parallel Programming and Cache Coherence in Multicore Architectures

    Get PDF
    It is clear that multicore processors have become the building blocks of today’s high-performance computing platforms. The advent of massively parallel single-chip microprocessors further emphasizes the gap that exists between parallel architectures and parallel programming maturity. Our research group, starting from the experiences on distributed and shared memory multiprocessor, was one of the first to propose a Structured Parallel Programming approach to bridge this gap. In this scenario, one of the biggest problems is that an application’s performance is often affected by the sharing pattern of data and its impact on Cache Coherence. Currently multicore platforms rely on hardware or automatic cache coherence techniques that allow programmers to develop programs without taking into account the problem. It is well known that standard coherency protocols are inefficient for certain data communication patterns and these inefficiencies will be amplified by the increased core number and the complex memory hierarchies. Following a structured parallelism approach, our methodology to attack these problems is based on two interrelated issues: structured parallelism paradigms and cost models (or performance models). Evaluating the performance of a program, although widely studied, is still an open problem in the research community and, notably, specific cost models to de- scribe multicores are missing. For this reason in this thesis, we define an abstract model for cache coherent architectures, which is able to capture the essential elements and the qualitative behaviors of multicore-based systems. Furthermore, we show how this abstract model combined with well known performance modelling techniques, such as analytical modelling (e.g., queueing models and stochastic process algebras) or simulations, provide an application- and architecture-dependent cost model to predict structured parallel applications performances. Starting out from the behavior and performance predictability of structured parallelism schemes, in this thesis we address the issue of cache coherence in multicore architectures, following an algorithm-dependent approach, a particular kind of software cache coherence solution characterized by explicit cache management strategies, which are specific of the algorithm to be executed. Notably, we ensure parallel correctness by exploiting architecture-specific mechanisms and by defining proper data structures in order to “emulate” cache coherence solutions in an efficient way for each computation. Algorithm-dependent cache coherence can be efficiently implemented at the support level of structured parallelism paradigms, with absolute transparency with respect to the application programmer. Moreover, by using the cost model, in this thesis we study and compare different algorithm-dependent implementations, such as those based on automatic cache coherence with respect to an original, non-automatic and lock-free solution based on interprocessor communications. Notably, with this latter implementation, in some cases, we are able to reduce the number of memory accesses, cache transfers and synchronizations and increasing computation parallelism with respect to the use of automatic cache coherence. Current architectures do not usually allow disabling automatic cache coherence. However, the emergence of many-core architectures somewhat changed the scenario, so that some architectures, such as the Tilera TilePro64, allow to control and disable the automatic cache coherence facilities. For this reason, in this thesis we finally apply our methodology to TilePro64 platform in order provide a further validation of the results obtained by our cost model
    corecore