3,206 research outputs found

    Improving processor efficiency by exploiting common-case behaviors of memory instructions

    Get PDF
    Processor efficiency can be described with the help of a number of  desirable effects or metrics, for example, performance, power, area, design complexity and access latency. These metrics serve as valuable tools used in designing new processors and they also act as  effective standards for comparing current processors. Various factors impact the efficiency of modern out-of-order processors and one important factor is the manner in which instructions are processed through the processor pipeline. In this dissertation research, we study the impact of load and store instructions (collectively known as memory instructions) on processor efficiency,  and show how to improve efficiency by exploiting common-case or  predictable patterns in the behavior of memory instructions. The memory behavior patterns that we focus on in our research are the predictability of memory dependences, the predictability in data forwarding patterns,   predictability in instruction criticality and conservativeness in resource allocation and deallocation policies. We first design a scalable  and high-performance memory dependence predictor and then apply accurate memory dependence prediction to improve the efficiency of the fetch engine of a simultaneous multi-threaded processor. We then use predictable data forwarding patterns to eliminate power-hungry  hardware in the processor with no loss in performance.  We then move to  studying instruction criticality to improve  processor efficiency. We study the behavior of critical load instructions  and propose applications that can be optimized using  predictable, load-criticality  information. Finally, we explore conventional techniques for allocation and deallocation  of critical structures that process memory instructions and propose new techniques to optimize the same.  Our new designs have the potential to reduce  the power and the area required by processors significantly without losing  performance, which lead to efficient designs of processors.Ph.D.Committee Chair: Loh, Gabriel H.; Committee Member: Clark, Nathan; Committee Member: Jaleel, Aamer; Committee Member: Kim, Hyesoon; Committee Member: Lee, Hsien-Hsin S.; Committee Member: Prvulovic, Milo

    Memory disambiguation hardware: a review

    Get PDF
    One of the main challenges of modern processor designs is the implementation of scalable and efficient mechanisms to detect memory access order violations as a result of out-of-order execution. Conventional structures performing this task are complex, inefficient and power-hungry. This fact has generated a large body of work on optimizing address-based memory disambiguation logic, namely the load-store queue. In this paper we review the most significant proposals in this research field, focusing on our own contributions.Facultad de Informátic

    NoSQ: Store-Load Communication without a Store Queue

    Get PDF
    This paper presents NoSQ (short for No Store Queue), a microarchitecture that performs store-load communication without a store queue and without executing stores in the out-of-order engine. NoSQ implements store-load communication using speculative memory bypassing (SMB), the dynamic short-circuiting of DEF-store-load-USE chains to DEF-USE chains. Whereas previous proposals used SMB as an opportunistic complement to conventional store queue-based forwarding, NoSQ uses SMB as a store queue replacement. NoSQ relies on two supporting mechanisms. The first is an advanced store-load bypassing predictor that for a given dynamic load can predict whether that load will bypass and the identity of the communicating store. The second is an efficient verification mechanism for both bypassed and non-bypassed loads using in-order load re-execution with an SMB-aware store vulnerability window (SVW) filter. The primary benefit of NoSQ is a simple, fast datapath that does not contain store-load forwarding hardware; all loads get their values either from the data cache or from the register file. Experiments show that this simpler design - despite being more speculative - slightly outperforms a conventional store-queue based design on most benchmarks (by 2% on average)

    Memory disambiguation hardware: a review

    Get PDF
    One of the main challenges of modern processor designs is the implementation of scalable and efficient mechanisms to detect memory access order violations as a result of out-of-order execution. Conventional structures performing this task are complex, inefficient and power-hungry. This fact has generated a large body of work on optimizing address-based memory disambiguation logic, namely the load-store queue. In this paper we review the most significant proposals in this research field, focusing on our own contributions.Facultad de Informátic

    RingScalar: A Complexity-Effective Out-of-Order Superscalar Microarchitecture

    Get PDF
    RingScalar is a complexity-effective microarchitecture for out-of-order superscalar processors, that reduces the area, latency, and power of all major structures in the instruction flow. The design divides an N-way superscalar into N columns connected in a unidirectional ring, where each column contains a portion of the instruction window, a bank of the register file, and an ALU. The design exploits the fact that most decoded instructions are waiting on just one operand to use only a single tag per issue window entry, and to restrict instruction wakeup and value bypass to only communicate with the neighboring column. Detailed simulations of four-issue single-threaded machines running SPECint2000 show that RingScalar has IPC only 13% lower than an idealized superscalar, while providing large reductions in area, power, and circuit latency

    Static and Dynamic Scheduling for Effective Use of Multicore Systems

    Get PDF
    Multicore systems have increasingly gained importance in high performance computers. Compared to the traditional microarchitectures, multicore architectures have a simpler design, higher performance-to-area ratio, and improved power efficiency. Although the multicore architecture has various advantages, traditional parallel programming techniques do not apply to the new architecture efficiently. This dissertation addresses how to determine optimized thread schedules to improve data reuse on shared-memory multicore systems and how to seek a scalable solution to designing parallel software on both shared-memory and distributed-memory multicore systems. We propose an analytical cache model to predict the number of cache misses on the time-sharing L2 cache on a multicore processor. The model provides an insight into the impact of cache sharing and cache contention between threads. Inspired by the model, we build the framework of affinity based thread scheduling to determine optimized thread schedules to improve data reuse on all the levels in a complex memory hierarchy. The affinity based thread scheduling framework includes a model to estimate the cost of a thread schedule, which consists of three submodels: an affinity graph submodel, a memory hierarchy submodel, and a cost submodel. Based on the model, we design a hierarchical graph partitioning algorithm to determine near-optimal solutions. We have also extended the algorithm to support threads with data dependences. The algorithms are implemented and incorporated into a feedback directed optimization prototype system. The prototype system builds upon a binary instrumentation tool and can improve program performance greatly on shared-memory multicore architectures. We also study the dynamic data-availability driven scheduling approach to designing new parallel software on distributed-memory multicore architectures. We have implemented a decentralized dynamic runtime system. The design of the runtime system is focused on the scalability metric. At any time only a small portion of a task graph exists in memory. We propose an algorithm to solve data dependences without process cooperation in a distributed manner. Our experimental results demonstrate the scalability and practicality of the approach for both shared-memory and distributed-memory multicore systems. Finally, we present a scalable nonblocking topology-aware multicast scheme for distributed DAG scheduling applications

    Reducción de consumo en la caché de datos de nivel 1 utilizando un predictor de forwarding

    Get PDF
    En la mayoría de los diseños de los procesadores actuales, el acceso a la caché de datos de nivel 1 (L1D) se ha convertido en uno de los componentes de mayor consumo debido a su incremento de tamaño y elevadas frecuencias de acceso. Para reducir este consumo, proponemos una sencilla técnica de filtrado. Nuestra idea se basa en un predictor de forwarding de alta precisión que determina si una instrucción de load tomará su dato vía forwarding a través de la LSQ –evitando en este caso el acceso a la L1D- o si debe ir a por él a la caché de datos. Nuestros resultados de simulación muestran que podemos ahorrar de media un 35% del consumo de la L1D, con una degradación despreciable de rendimient
    corecore