24 research outputs found

    Evaluating Cache Coherent Shared Virtual Memory for Heterogeneous Multicore Chips

    Full text link
    The trend in industry is towards heterogeneous multicore processors (HMCs), including chips with CPUs and massively-threaded throughput-oriented processors (MTTOPs) such as GPUs. Although current homogeneous chips tightly couple the cores with cache-coherent shared virtual memory (CCSVM), this is not the communication paradigm used by any current HMC. In this paper, we present a CCSVM design for a CPU/MTTOP chip, as well as an extension of the pthreads programming model, called xthreads, for programming this HMC. Our goal is to evaluate the potential performance benefits of tightly coupling heterogeneous cores with CCSVM

    Cache coherence requirements for interprocess rendezvous

    Full text link
    Multiprocessors in which a shared bus is used by the processor to communicate with common memory are an emerging class of machines where there is a need to support parallel programming languages. A language construct that is found in a number of parallel programming languages to support synchronization and communication in the interprocess rendezvous. Shared-bus multiprocessor require a protocol to keep the date in their caches coherent. There are two major categories of these protocols: invalidation and write-boadcast. This paper examines the requirements for cache coherence protocols to support efficient interprocessor rendezvous. The approach taken is to examine the memory referencing patterns to the run-time data structures during rendezvous execution. The appropriate coherence protocol is shown to be a function of the processor scheduling strategy used by the run-time system at synchronzation points during the rendezvous. When processes migrate freely as a result of the scheduling strategy, invalidation protocols are found to be more efficient. When migration is restricted by the scheduler, write-broadcast protocols are more efficient.Peer Reviewedhttp://deepblue.lib.umich.edu/bitstream/2027.42/44571/1/10766_2005_Article_BF01407863.pd

    Design of Large-Scale Symmetric Multiprocessors (SMPs) using Parallel Optical Interconnects

    Get PDF
    Abstract In this paper, we address the primary limitation of bandwidth demands for address transaction in future cache coherent symmetric multiprocessors (SMPs

    Extending magny-cours cache coherence

    Full text link
    One cost-effective way to meet the increasing demand for larger high-performance shared-memory servers is to build clusters with off-the-shelf processors connected with low-latency point-to-point interconnections like HyperTransport. Unfortunately, HyperTransport addressing limitations prevent building systems with more than eight nodes. While the recent High-Node Count HyperTransport specification overcomes this limitation, recently launched twelve-core Magny-Cours processors have already inherited it and provide only 3 bits to encode the pointers used by the directory cache which they include to increase the scalability of their coherence protocol. In this work, we propose and develop an external device to extend the coherence domain of Magny-Cours processors beyond the 8-node limit while maintaining the advantages provided by the directory cache. Evaluation results for systems with up to 32 nodes show that the performance offered by our solution scales with the number of nodes, enhancing the directory cache effectiveness by filtering additional messages. Particularly, we reduce execution time by 47 percent in a 32-die system with respect to the 8-die Magny-Cours configuration.This work was supported by the Spanish MICINN, Consolider Programme and Plan E funds, as well as European Commission FEDER funds, under Grants CSD2006-00046 and TIN2009-14475-C04-01/03. It was also partly supported by (PROMETEO from Generalitat Valenciana (GVA) under Grant PROMETEO/2008/060).Ros Bardisa, A.; Cuesta Sáez, BA.; Fernández-Pascual, R.; Gómez Requena, ME.; Acacio Sánchez, ME.; Robles Martínez, A.; García Carrasco, JM.... (2012). Extending magny-cours cache coherence. IEEE Transactions on Computers. 61(5):593-606. https://doi.org/10.1109/TC.2011.65S59360661

    Efficient and scalable starvation prevention mechanism for token coherence

    Full text link
    [EN] Token Coherence is a cache coherence protocol that simultaneously captures the best attributes of the traditional approximations to coherence: direct communication between processors (like snooping-based protocols) and no reliance on bus-like interconnects (like directory-based protocols). This is possible thanks to a class of unordered requests that usually succeed in resolving the cache misses. The problem of the unordered requests is that they can cause protocol races, which prevent some misses from being resolved. To eliminate races and ensure the completion of the unresolved misses, Token Coherence uses a starvation prevention mechanism named persistent requests. This mechanism is extremely inefficient and, besides, it endangers the scalability of Token Coherence since it requires storage structures (at each node) whose size grows proportionally to the system size. While multiprocessors continue including an increasingly number of nodes, both the performance and scalability of cache coherence protocols will continue to be key aspects. In this work, we propose an alternative starvation prevention mechanism, named priority requests, that outperforms the persistent request one. This mechanism is able to reduce the application runtime more than 20 percent (on average) in a 64-processor system. Furthermore, thanks to the flexibility shown by priority requests, it is possible to drastically minimize its storage requirements, thereby improving the whole scalability of Token Coherence. Although this is achieved at the expense of a slight performance degradation, priority requests still outperform persistent requests significantly.This work was partially supported by the Spanish MEC and MICINN, as well as European Commission FEDER funds, under Grants CSD2006-00046 and TIN2009-14475-C04-01. Antonio Robles is taking a sabbatical granted by the Universidad Politecnica de Valencia for updating his teaching and research activities.Cuesta Sáez, BA.; Robles Martínez, A.; Duato Marín, JF. (2011). Efficient and scalable starvation prevention mechanism for token coherence. IEEE Transactions on Parallel and Distributed Systems. 22(10):1610-1623. doi:10.1109/TPDS.2011.30S16101623221

    Hybrid Caching for Chip Multiprocessors Using Compiler-Based Data Classification

    Get PDF
    The high performance delivered by modern computer system keeps scaling with an increasingnumber of processors connected using distributed network on-chip. As a result, memory accesslatency, largely dominated by remote data cache access and inter-processor communication, is becoming a critical performance bottleneck. To release this problem, it is necessary to localize data access as much as possible while keep efficient on-chip cache memory utilization. Achieving this however, is application dependent and needs a keen insight into the memory access characteristics of the applications. This thesis demonstrates how using fairly simple thus inexpensive compiler analysis memory accesses can be classified into private data access and shared data access. In addition, we introduce a third classification named probably private access and demonstrate the impact of this category compared to traditional private and shared memory classification. The memory access classification information from the compiler analysis is then provided to the runtime system through a modified memory allocator and page table to facilitate a hybrid private-shared caching technique. The hybrid cache mechanism is aware of different data access classification and adopts appropriate placement and search policies accordingly to improve performance. Our analysis demonstrates that many applications have a significant amount of both private and shared data and that compiler analysis can identify the private data effectively for many applications. Experimentsresults show that the implemented hybrid caching scheme achieves 4.03% performance improvement over state of the art NUCA-base caching

    Coherent network interfaces for fine-grain communication

    Get PDF
    Using coherence can improve performance by facilitating burst transfers of whole cache blocks and reducing control overheads. This paper describes an attempt to explore network interfaces that use coherence, i.e., coherent network interfaces (CNIs), to improve communication performance. First, it reports on the development and optimization of two mechanisms that CNIs use to communicate with processors. A taxonomy and comparison of four CNIs with a more conventional NI are then presented
    corecore