
    Reducing the LSQ and L1 data cache power consumption

    In most modern processor designs, the hardware dedicated to storing data and instructions (the memory hierarchy) has become a major consumer of power. To reduce this power consumption, we propose two techniques: one filters accesses to the LSQ (Load-Store Queue) based on both timing and address information, and the other filters accesses to the first-level data cache based on a forwarding predictor. Our simulation results show that power consumption decreases by 30-40% in each structure, with a negligible performance penalty of less than 0.1%. Presented at the V Workshop Arquitectura, Redes y Sistemas Operativos (WARSO). Red de Universidades con Carreras en Informática (RedUNCI).
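    The abstract gives no code; as a rough, hypothetical sketch of the second technique, the snippet below models a small PC-indexed table of saturating counters that predicts whether a load will receive its value through store-to-load forwarding, in which case the L1 data-cache lookup is skipped. The table size, indexing, and fall-back-on-misprediction policy are illustrative assumptions, not the paper's actual design.

        class ForwardingPredictor:
            def __init__(self, entries=1024):
                self.entries = entries
                self.counters = [0] * entries          # 2-bit saturating counters

            def _index(self, load_pc):
                return (load_pc >> 2) % self.entries

            def predict_forwarded(self, load_pc):
                # Skip the L1 lookup only when the counter is strongly confident.
                return self.counters[self._index(load_pc)] >= 2

            def update(self, load_pc, was_forwarded):
                i = self._index(load_pc)
                if was_forwarded:
                    self.counters[i] = min(3, self.counters[i] + 1)
                else:
                    self.counters[i] = max(0, self.counters[i] - 1)

        def issue_load(pred, load_pc, forwarded_by_store_queue):
            # Returns the structures accessed by this load under the filtering scheme.
            if pred.predict_forwarded(load_pc):
                accessed = ["LSQ"]                     # L1 access filtered out
                if not forwarded_by_store_queue:
                    accessed.append("L1D")             # misprediction: fall back to the L1
            else:
                accessed = ["LSQ", "L1D"]
            pred.update(load_pc, forwarded_by_store_queue)
            return accessed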

    A low-power cache system for high-performance processors

    Degree system: new; report number: Kou 3439; degree type: Doctor of Engineering; conferral date: 12-Sep-2011; Waseda University degree record number: Shin 576

    Enlarging instruction streams

    The stream fetch engine is a high-performance fetch architecture based on the concept of an instruction stream. We call a sequence of instructions from the target of a taken branch to the next taken branch, potentially containing multiple basic blocks, a stream. Because instruction streams are long, the stream fetch engine can provide high fetch bandwidth and hide the branch predictor access latency, leading to performance close to that of a trace cache at lower implementation cost and complexity. Therefore, enlarging instruction streams is an excellent way to improve the stream fetch engine. In this paper, we present several hardware and software mechanisms focused on enlarging those streams that terminate at particular branch types. However, our results show that focusing on particular branch types is not a good strategy due to Amdahl's law. Consequently, we propose the multiple-stream predictor, a novel mechanism that deals with all branch types by combining single streams into long virtual streams. This proposal tolerates the prediction table access latency without requiring the complexity of additional hardware mechanisms such as prediction overriding. Moreover, it provides performance comparable to state-of-the-art fetch architectures with a simpler design that consumes less energy.
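    To make the stream terminology concrete, the sketch below shows one plausible shape for a stream prediction and its predictor table: each entry describes a full stream by its start PC, its length in instructions, and the predicted start of the following stream, so a single prediction can feed the fetch engine for several cycles. This is an illustrative assumption about the data layout, not the paper's implementation.

        from dataclasses import dataclass

        @dataclass
        class StreamPrediction:
            start_pc: int        # target of the previous taken branch
            length: int          # instructions until the next taken branch
            next_start_pc: int   # predicted target of that next taken branch

        class NextStreamPredictor:
            def __init__(self):
                self.table = {}  # keyed by stream start PC (untagged, for brevity)

            def predict(self, start_pc):
                # On a miss, fall back to a short sequential stream.
                return self.table.get(start_pc,
                                      StreamPrediction(start_pc, 8, start_pc + 32))

            def train(self, start_pc, length, next_start_pc):
                self.table[start_pc] = StreamPrediction(start_pc, length, next_start_pc)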

    Improving processor efficiency by exploiting common-case behaviors of memory instructions

    Processor efficiency can be described in terms of a number of desirable metrics, for example performance, power, area, design complexity, and access latency. These metrics serve as valuable tools for designing new processors and as effective standards for comparing existing ones. Various factors impact the efficiency of modern out-of-order processors, and one important factor is the manner in which instructions are processed through the pipeline. In this dissertation research, we study the impact of load and store instructions (collectively known as memory instructions) on processor efficiency, and show how to improve efficiency by exploiting common-case or predictable patterns in their behavior. The memory behavior patterns we focus on are the predictability of memory dependences, the predictability of data forwarding patterns, the predictability of instruction criticality, and the conservativeness of resource allocation and deallocation policies. We first design a scalable, high-performance memory dependence predictor and then apply accurate memory dependence prediction to improve the efficiency of the fetch engine of a simultaneous multi-threaded processor. We then use predictable data forwarding patterns to eliminate power-hungry hardware in the processor with no loss in performance. Next, we study instruction criticality: we examine the behavior of critical load instructions and propose applications that can be optimized using predictable load-criticality information. Finally, we explore conventional techniques for the allocation and deallocation of critical structures that process memory instructions and propose new techniques to optimize them. Our new designs have the potential to significantly reduce the power and area required by processors without losing performance, leading to more efficient processor designs. Ph.D. Committee Chair: Loh, Gabriel H.; Committee Members: Clark, Nathan; Jaleel, Aamer; Kim, Hyesoon; Lee, Hsien-Hsin S.; Prvulovic, Milo.
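    The dissertation's own memory dependence predictor is not reproduced here; as a reminder of what memory dependence prediction does, the following is a minimal sketch of the classic store-set approach (Chrysos and Emer), in which a load that once violated a dependence is grouped with the offending store and subsequently waits for in-flight stores from the same set. The plain dictionaries are a simplification for illustration, not the thesis's design.

        class StoreSetPredictor:
            def __init__(self):
                self.ssit = {}   # Store Set ID Table: instruction PC -> store-set id
                self.next_id = 0

            def _get_or_make_set(self, pc):
                if pc not in self.ssit:
                    self.ssit[pc] = self.next_id
                    self.next_id += 1
                return self.ssit[pc]

            def record_violation(self, load_pc, store_pc):
                # The load executed before an older store to the same address:
                # place both instructions in the same store set.
                self.ssit[load_pc] = self._get_or_make_set(store_pc)

            def must_wait(self, load_pc, inflight_store_pcs):
                # Delay the load only if an in-flight store shares its store set.
                sid = self.ssit.get(load_pc)
                if sid is None:
                    return False
                return any(self.ssit.get(pc) == sid for pc in inflight_store_pcs)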

    Predicting multiple streams per cycle

    The next stream predictor is an accurate branch predictor that provides stream-level sequencing. Every stream prediction contains a full stream of instructions, that is, a sequence of instructions from the target of a taken branch to the next taken branch, potentially containing multiple basic blocks. Because instruction streams are long, the stream predictor can provide high fetch bandwidth and tolerate the prediction table access latency. Therefore, an excellent way to improve the behavior of the next stream predictor is to enlarge instruction streams. In this paper, we provide a comprehensive analysis of dynamic instruction streams, showing that focusing on particular kinds of streams is not a good strategy due to Amdahl's law. Consequently, we propose the multiple stream predictor, a novel mechanism that deals with all kinds of streams by combining single streams into long virtual streams. We show that our multiple stream predictor is able to tolerate the prediction table access latency without requiring the complexity of additional hardware mechanisms such as prediction overriding, while also reducing the overall branch predictor energy consumption.
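    As an illustration of the virtual-stream idea, the sketch below concatenates consecutive stream predictions until the combined length is large enough to cover the prediction table access latency. It assumes a predictor object with a predict(start_pc) method returning a (start, length, next start) record, like the sketch shown earlier; the length threshold and the limit on combined streams are made-up parameters, not values from the paper.

        MIN_VIRTUAL_LENGTH = 16      # assumed threshold, purely illustrative

        def build_virtual_stream(predictor, start_pc, max_streams=4):
            # Concatenate consecutive stream predictions until the virtual stream
            # is long enough to hide the predictor's own access latency.
            parts, total, pc = [], 0, start_pc
            for _ in range(max_streams):
                s = predictor.predict(pc)        # any object with a predict(pc) method
                parts.append(s)
                total += s.length
                pc = s.next_start_pc
                if total >= MIN_VIRTUAL_LENGTH:
                    break
            return parts, total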

    Die-Stacked DRAM Caches for Servers: Hit Ratio, Latency, or Bandwidth? Have It All with Footprint Cache

    Recent research advocates using large die-stacked DRAM caches to break the memory bandwidth wall. Existing DRAM cache designs fall into one of two categories: block-based and page-based. The former organize data in conventional blocks (e.g., 64B), ensuring low off-chip bandwidth utilization, but co-locate tags and data in the stacked DRAM, incurring high lookup latency. Furthermore, such designs suffer from low hit ratios due to poor temporal locality. In contrast, page-based caches, which manage data at larger granularity (e.g., 4KB pages), allow for reduced tag array overhead and fast lookup, and leverage high spatial locality at the cost of moving large amounts of data on and off the chip. This paper introduces Footprint Cache, an efficient die-stacked DRAM cache design for server processors. Footprint Cache allocates data at the granularity of pages, but identifies and fetches only those blocks within a page that will be touched during the page's residency in the cache, i.e., the page's footprint. In doing so, Footprint Cache eliminates the excessive off-chip traffic associated with page-based designs, while preserving their high hit ratio, small tag array overhead, and low lookup latency. Cycle-accurate simulation results of a 16-core server with up to 512MB of Footprint Cache indicate a 57% performance improvement over a baseline chip without a die-stacked cache. Compared to a state-of-the-art block-based design, our design improves performance by 13% while reducing the dynamic energy of the stacked DRAM by 24%.
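    A minimal sketch of the footprint mechanism, under assumed parameters (2KB pages of 32 blocks) and not the authors' implementation: on a page miss, a table indexed by the triggering access supplies a bit vector of the blocks expected to be touched, and only those blocks are fetched; the footprint observed during residency later retrains the entry.

        PAGE_BLOCKS = 32   # e.g., a 2KB page holding 32 blocks of 64B (assumed)

        class FootprintPredictor:
            def __init__(self):
                # (PC of the access that triggered the page miss, its block offset)
                # -> bit vector of blocks expected to be touched during residency.
                self.table = {}

            def predict(self, trigger_pc, trigger_offset):
                # Default on a cold miss: fetch only the triggering block.
                return self.table.get((trigger_pc, trigger_offset), 1 << trigger_offset)

            def train(self, trigger_pc, trigger_offset, observed_footprint):
                self.table[(trigger_pc, trigger_offset)] = observed_footprint

        def blocks_to_fetch(footprint_bits):
            return [b for b in range(PAGE_BLOCKS) if footprint_bits & (1 << b)]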

    Multi-Gigabyte On-Chip DRAM Caches for Servers

    While DRAM latency has long been recognized as a major bottleneck in servers, DRAM bandwidth is emerging as an important bottleneck as server processors shift to many-core architectures to allow for sustainable throughput improvements. The rapid expansion of the digital universe, increasingly stored in memory, also pushes the need for higher DRAM density. Emerging die-stacked DRAM technology dramatically improves the three major DRAM properties: latency, bandwidth, and density. Recent advancements in die-stacking technology have made it possible to integrate a sizeable amount of DRAM directly on top of the processor. While the feasible on-chip DRAM capacities are insufficient to satisfy the memory needs of modern servers, architecting on-chip DRAM as a high-capacity, low-latency, high-bandwidth cache has the potential to significantly reduce both off-chip memory traffic and average memory access latency. We make the observation that high-capacity on-chip DRAM caches expose abundant spatial locality present in server applications and a modest amount of temporal data reuse. As a consequence, DRAM caches that manage and fetch data at a coarser granularity, e.g., in 2KB pages, exhibit overall superior properties compared to caches that do fine-grain management using 64B blocks. These properties include substantially higher hit rates, smaller tag storage, higher energy efficiency, and higher set-associativity. Unfortunately, naive employment of page-based caches results in excessive data overfetch and capacity waste, as some of the fetched and allocated blocks are never accessed prior to their eviction. We demonstrate that if the cache is organized in pages, then page footprints, i.e., the set of blocks that are touched while the page is in the cache, are highly predictable using well-established code-correlation techniques. Accurately predicting access patterns within a page can eliminate most of the bandwidth overhead and capacity waste that page-based caches suffer from.
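    Complementing the predictor sketch above, the following hypothetical bookkeeping shows the code-correlation idea from the training side: the blocks touched while a page is resident are recorded against the PC that triggered its allocation, on the assumption that the same code will touch a similar set of blocks the next time it allocates a page. It assumes a predictor object with a train() method like the earlier sketch; the class and field names are illustrative only.

        class ResidentPage:
            def __init__(self, page_addr, trigger_pc, trigger_offset):
                self.page_addr = page_addr
                self.trigger_pc = trigger_pc           # code that caused the allocation
                self.trigger_offset = trigger_offset
                self.touched = 1 << trigger_offset     # one bit per block, trigger included

            def access(self, block_offset):
                self.touched |= 1 << block_offset

        def evict(page, predictor):
            # On eviction, the observed footprint trains the entry that will be
            # consulted the next time the same code triggers a page allocation.
            predictor.train(page.trigger_pc, page.trigger_offset, page.touched)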

    Ultra low power cooperative branch prediction

    Branch prediction is a key task in the operation of a high-performance processor. An inaccurate branch predictor results in increased program run-time and a rise in energy consumption. The drive towards processors with limited die-space and tighter energy requirements will continue to intensify over the coming years, as will the shift towards increasingly multicore processors. Both trends make it increasingly important, and increasingly difficult, to find effective and efficient branch predictor designs. This thesis presents savings in energy and die-space through more efficient cooperative branch predictors, achieved through novel branch prediction designs. The first contribution is a new take on the problem of a hybrid dynamic-static branch predictor allocating branches to one of its sub-predictors. A new bias parameter is introduced as a mechanism for trading off a small amount of performance for savings in die-space and energy. This is achieved by predicting more branches with the static predictor, ensuring that only the branches that will most benefit from the dynamic predictor's resources are predicted dynamically. This reduces pressure on the dynamic predictor's resources, allowing a smaller predictor to achieve very high accuracy. An improvement in run-time of 7-8% over the baseline BTFN predictor is observed, at a branch predictor storage budget of well under 1KB. Next, a novel approach to branch prediction for multicore data-parallel applications is presented. The Peloton branch prediction scheme uses a pack of cyclists as an illustration of how a group of processors running similar tasks can share branch predictions to improve accuracy and reduce runtime. The results show that sharing updates for conditional branches across the existing interconnect for I-cache and D-cache updates reduces mispredictions by up to 25% and run-time by up to 6%. McPAT is used to build an energy model that suggests these savings are achieved at little to no increase in energy. The technique is then extended to architectures where the size of the branch predictors may differ between cores. The results show that such heterogeneity can dramatically reduce the die-space required for an accurate branch predictor while having little impact on performance, with up to 9% energy savings. The approach can be combined with the Peloton branch prediction scheme for a reduction in branch mispredictions of up to 5%.
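    The thesis's exact allocation mechanism is not shown here; the sketch below illustrates one plausible reading of the bias parameter: a profile-guided pass assigns strongly biased branches to the static predictor, and raising the bias widens that band so more branches are handled statically, freeing dynamic-predictor capacity. The 0.9 threshold, the profile format, and the example PCs are assumptions for illustration only.

        def assign_predictors(profile, bias=0.05):
            # profile: dict mapping branch PC -> fraction of executions taken.
            assignment = {}
            for pc, taken_rate in profile.items():
                # A branch that almost always resolves one way is cheap to predict
                # statically; raising the bias widens that band, pushing more
                # branches to the static predictor.
                strongly_biased = max(taken_rate, 1.0 - taken_rate) >= (0.9 - bias)
                assignment[pc] = "static" if strongly_biased else "dynamic"
            return assignment

        # With bias=0.05, a branch taken 87% of the time is handled statically,
        # reserving the dynamic tables for the genuinely hard-to-predict branches.
        print(assign_predictors({0x400a10: 0.87, 0x400a40: 0.55}, bias=0.05))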