5 research outputs found

    Active memory controller

    Full text link
    Inability to hide main memory latency has been increasingly limiting the performance of modern processors. The problem is worse in large-scale shared memory systems, where remote memory latencies are hundreds, and soon thousands, of processor cycles. To mitigate this problem, we propose an intelligent memory and cache coherence controller (AMC) that can execute Active Memory Operations (AMOs). AMOs are select operations sent to and executed on the home memory controller of data. AMOs can eliminate a significant number of coherence messages, minimize intranode and internode memory traffic, and create opportunities for parallelism. Our implementation of AMOs is cache-coherent and requires no changes to the processor core or DRAM chips. In this paper, we present the microarchitecture design of AMC, and the programming model of AMOs. We compare AMOs\u27 performance to that of several other memory architectures on a variety of scientific and commercial benchmarks. Through simulation, we show that AMOs offer dramatic performance improvements for an important set of data-intensive operations, e.g., up to 50x faster barriers, 12x faster spinlocks, 8.5x-15x faster stream/array operations, and 3x faster database queries. We also present an analytical model that can predict the performance benefits of using AMOs with decent accuracy. The silicon cost required to support AMOs is less than 1% of the die area of a typical high performance processor, based on a standard cell implementation

    TRADEOFFS IN THE DESIGN OF SINGLE-CHIP MULTIPROCESSORS

    No full text
    : By the end of the decade, as VLSI integration levels continue to increase, building a multiprocessor system on a single chip will become feasible. In this paper, we propose to analyze the tradeoffs involved in designing such a chip, and specifically address whether to allocate available chip area to larger caches or to large numbers of processors. Using the dimensions of the Alpha 21064 microprocessor as a basis, we determine several candidate configurations which vary in cache size and number of processors, and evaluate them in terms of both processing power and cycle time. We then investigate fine tuning the architecture in order to further improve performance, by trading off the number of processors for a larger TLB size. Our results show that for a coarse-grain execution environment, adding processors at the expense of cache size improves performance up to a point. We then show that increasing TLB size at the expense of the number of processors can further improve performance. K..

    STATS: A framework for microprocessor and system-level design space exploration

    No full text
    As microprocessor-based systems grow in complexity, and the processor-memory speed gap widens further, more emphasis needs to be placed on early design space exploration in order to produce the highest performance systems with minimal schedule impact. We discuss the critical issues associated with architectural evaluation of complex microprocessor-based systems, and present a methodology for the comprehensive and semi-automatic evaluation of processor, cache hierarchy, system interconnect, and main memory architectural and technological alternatives. We discuss the implementation of the methodology, and describe how it can be used in early design space exploration. The unique aspects of the methodology are further illustrated through two architectural investigations performed using the toolset. 1 Introduction The design of commercial microprocessor computer systems typically begins with the comparative evaluation of architectural alternatives, resulting in architectural specification..

    A mean value analysis multiprocessor model incorporating superscalar processors and latency tolerating techniques

    No full text
    : Several approximate Mean Value Analysis (MVA) shared memory multiprocessor models have been developed and used to evaluate a number of system architectures. In recent years, the use of superscalar processors, multilevel cache hierarchies, and latency tolerating techniques has significantly increased the complexity of multiprocessor system modeling. We present an analytical performance model which extends previous multiprocessor MVA models by incorporating these new features and in addition, increases the level of modeling detail to improve flexibility and accuracy. The extensions required to analyze the impact of these new features are described in detail. We then use the model to demonstrate some of the tradeoffs involved in designing modern multiprocessors, including the impact of highly superscalar architectures on the scalability of multiprocessor systems. Key Words: Shared memory multiprocessors, Mean Value Analysis, performance evaluation, latency tolerating techniques, supersc..

    An Architectural And Circuit-Level Approach To Improving The Energy Efficiency Of Microprocessor Memory Structures

    No full text
    We present a combined architectural and circuit technique for reducing the energy dissipation of microprocessor memory structures. This approach exploits the subarray partitioning of high speed memories and varying application requirements to dynamically disable partitions during appropriate execution periods. When applied to 4-way set associative caches, trading off a 2% performance degradation yields a combined 40% reduction in L1 Dcache and L2 cache energy dissipation. 1. INTRODUCTION The continuing microprocessor performance gains afforded by advances in semiconductor technology have come at the cost of increased power consumption. Each new high performance microprocessor generation brings additional on-chip functionality, and thus an increase in switching capacitance, as well as increased clock speeds over the previous generation. For example, both transistor count and clock speed have roughly doubled in the three years separating the Alpha 21164 microprocessor [6, 11] and the..
    corecore