1,771 research outputs found

    Evaluating Cache Coherent Shared Virtual Memory for Heterogeneous Multicore Chips

    Full text link
    The trend in industry is towards heterogeneous multicore processors (HMCs), including chips with CPUs and massively-threaded throughput-oriented processors (MTTOPs) such as GPUs. Although current homogeneous chips tightly couple the cores with cache-coherent shared virtual memory (CCSVM), this is not the communication paradigm used by any current HMC. In this paper, we present a CCSVM design for a CPU/MTTOP chip, as well as an extension of the pthreads programming model, called xthreads, for programming this HMC. Our goal is to evaluate the potential performance benefits of tightly coupling heterogeneous cores with CCSVM

    An Implementation of a Dual-Processor System on FPGA

    Full text link
    In recent years, Field-Programmable Gate Arrays (FPGA) have evolved rapidly paving the way for a whole new range of computing paradigms. On the other hand, computer applications are evolving. There is a rising demand for a system that is general-purpose and yet has the processing abilities to accommodate current trends in application processing. This work proposes a design and implementation of a tightly-coupled FPGA-based dual-processor platform. We architect a platform that optimizes the utilization of FPGA resources and allows for the investigation of practical implementation issues such as cache design. The performance of the proposed prototype is then evaluated, as different configurations of a uniprocessor and a dual-processor system are studied and compared against each other and against published results for common industry-standard CPU platforms. The proposed implementation utilizes the Nios II 32-bit embedded soft-core processor architecture designed for the Altera Cyclone III family of FPGAs

    Avalanche: A communication and memory architecture for scalable parallel computing

    Get PDF
    technical reportAs the gap between processor and memory speeds widens?? system designers will inevitably incorpo rate increasingly deep memory hierarchies to maintain the balance between processor and memory system performance At the same time?? most communication subsystems are permitted access only to main memory and not a processor s top level cache As memory latencies increase?? this lack of integration between the memory and communication systems will seriously impede interprocessor communication performance and limit e ective scalability In the Avalanche project we are re designing the memory architecture of a commercial RISC multiprocessor?? the HP PA RISC ?? to include a new multi level context sensitive cache that is tightly coupled to the communication fabric The primary goal of Avalanche s integrated cache and communication controller is attack ing end to end communication latency in all of its forms This includes cache misses induced by excessive invalidations and reloading of shared data by write invalidate coherence protocols and cache misses induced by depositing incoming message data in main memory and faulting it into the cache An execution driven simulation study of Avalanche s architecture indicates that it can reduce cache stalls by and overall execution times b

    Avalanche: A communication and memory architecture for scalable parallel computing

    Get PDF
    technical reportAs the gap between processor and memory speeds widens, system designers will inevitably incorporate increasingly deep memory hierarchies to maintain the balance between processor and memory system performance. At the same time, most communication subsystems are permitted access only to main memory and not a processor's top level cache. As memory latencies increase, this lack of integration between the memory and communication systems will seriously impede interprocessor communication performance and limit effective scalability. In the Avalanche project we are redesigning the memory architecture of a commercial RISC multiprocessor, the HP PA-RISC 7100, to include a new multi-level context sensitive cache that is tightly coupled to the communication fabric. The primary goal of Avalanche's integrated cache and communication controller is attacking end to end communication latency in all of its forms. This includes cache misses induced by excessive invalidations and reloading of shared data by write-invalidate coherence protocols and cache misses induced by depositing incoming message data in main memory and faulting it into the cache. An execution-driven simulation study of Avalanche's architecture indicates that it can reduce cache stalls by 5-60% and overall execution times by 10-28%

    Reducing consistency traffic and cache misses in the avalanche multiprocessor

    Get PDF
    Journal ArticleFor a parallel architecture to scale effectively, communication latency between processors must be avoided. We have found that the source of a large number of avoidable cache misses is the use of hardwired write-invalidate coherency protocols, which often exhibit high cache miss rates due to excessive invalidations and subsequent reloading of shared data. In the Avalanche project at the University of Utah, we are building a 64-node multiprocessor designed to reduce the end-to-end communication latency of both shared memory and message passing programs. As part of our design efforts, we are evaluating the potential performance benefits and implementation complexity of providing hardware support for multiple coherency protocols. Using a detailed architecture simulation of Avalanche, we have found that support for multiple consistency protocols can reduce the time parallel applications spend stalled on memory operations by up to 66% and overall execution time by up to 31%. Most of this reduction in memory stall time is due to a novel release-consistent multiple-writer write-update protocol implemented using a write state buffer

    Memory performance of and-parallel prolog on shared-memory architectures

    Get PDF
    The goal of the RAP-WAM AND-parallel Prolog abstract architecture is to provide inference speeds significantly beyond those of sequential systems, while supporting Prolog semantics and preserving sequential performance and storage efficiency. This paper presents simulation results supporting these claims with special emphasis on memory performance on a two-level sharedmemory multiprocessor organization. Several solutions to the cache coherency problem are analyzed. It is shown that RAP-WAM offers good locality and storage efficiency and that it can effectively take advantage of broadcast caches. It is argued that speeds in excess of 2 ML IPS on real applications exhibiting medium parallelism can be attained with current technology
    • …
    corecore