    The Design of Advanced Value Predictors

    Load Value Approximation: Approaching the Ideal Memory Access Latency

    Approximate computing recognizes that many applications can tolerate inexactness. These applications, which range from multimedia processing to machine learning, operate on inherently noisy and imprecise data. As a result, we can trade off some loss in output value integrity for improved processor performance and energy efficiency. In this paper, we introduce load value approximation. In modern processors, upon a load miss in the private cache, the data must be retrieved from main memory or from the higher-level caches. These data accesses are costly both in terms of latency and energy. We implement load value approximators, which are hardware structures that learn value patterns and generate approximations of the data. The processor can then use these approximate data values to continue executing without incurring the high cost of accessing memory. We show that load value approximators can achieve high coverage while maintaining very low error in the application’s output. By exploiting the approximate nature of applications, we can draw closer to the ideal memory access latency.
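
The idea of a structure that learns value patterns and returns an approximation on a miss can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the design from the paper: the class name `LoadValueApproximator`, the per-load-PC table, the averaging of a short value history, and the relative-error tolerance used to build confidence are all hypothetical choices.

```python
from collections import deque


class LoadValueApproximator:
    """Toy approximator: one short value history per load PC (hypothetical design)."""

    def __init__(self, history_len=4, tolerance=0.10, threshold=2, max_conf=3):
        self.history_len = history_len
        self.tolerance = tolerance    # relative error still counted as "close enough"
        self.threshold = threshold    # confidence required before approximating
        self.max_conf = max_conf      # saturation point of the confidence counter
        self.table = {}               # pc -> (deque of recent values, confidence)

    def approximate(self, pc):
        """Return an approximate value on a miss, or None to wait for memory."""
        entry = self.table.get(pc)
        if entry is None:
            return None
        history, confidence = entry
        if not history or confidence < self.threshold:
            return None
        return sum(history) / len(history)

    def train(self, pc, actual):
        """Update history and confidence once the real value arrives."""
        history, confidence = self.table.get(
            pc, (deque(maxlen=self.history_len), 0)
        )
        if history:
            guess = sum(history) / len(history)
            close = abs(guess - actual) <= self.tolerance * max(abs(actual), 1e-9)
            if close:
                confidence = min(confidence + 1, self.max_conf)
            else:
                confidence = max(confidence - 1, 0)
        history.append(actual)
        self.table[pc] = (history, confidence)
```

In this sketch, a load whose recent values stay within the tolerance of their running average becomes eligible for approximation after a few training steps, while a load with erratic values keeps returning `None` and simply waits for memory.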

    Fluid Stochastic Petri Nets: From Fluid Atoms in ILP Processor Pipelines to Fluid Atoms in P2P Streaming Networks

    © 2012 Mitrevski and Kotevski, licensee InTech. This is an open access chapter distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

    EOLE: Paving the Way for an Effective Implementation of Value Prediction

    Published at the International Symposium on Computer Architecture (ISCA) 2014. Link: http://people.irisa.fr/Arthur.Perais/data/ISCA%2714_EOLE.pdf Even in the multicore era, there is a continuous demand to increase the performance of single-threaded applications. However, the conventional path of increasing both issue width and instruction window size inevitably leads to the power wall. Value prediction (VP) was proposed in the mid-1990s as an alternative path to further enhance the performance of wide-issue superscalar processors. Still, it was considered until recently that a performance-effective implementation of Value Prediction would add tremendous complexity and power consumption in almost every stage of the pipeline. Nonetheless, recent work in the field of VP has shown that given an efficient confidence estimation mechanism, prediction validation could be removed from the out-of-order engine and delayed until commit time. As a result, recovering from mispredictions via selective replay can be avoided and a much simpler mechanism, pipeline squashing, can be used, while the out-of-order engine remains mostly unmodified. However, VP and validation at commit time entail strong constraints on the Physical Register File. Write ports are needed to write predicted results and read ports are needed in order to validate them at commit time, potentially rendering the overall number of ports unbearable. Fortunately, VP also implies that many single-cycle ALU instructions have their operands predicted in the front-end and can be executed in-place and in-order. Similarly, the execution of single-cycle instructions whose result has been predicted can be delayed until just before commit since predictions are validated at commit time. 
Consequently, a significant number of instructions (10% to 60% in our experiments) can bypass the out-of-order engine, allowing the reduction of the issue width, which is a major contributor to both out-of-order engine complexity and register file port requirement. This reduction paves the way for a truly practical implementation of Value Prediction. Furthermore, since Value Prediction in itself usually increases performance, our resulting {Early | Out-of-Order | Late} Execution architecture (EOLE) is often more efficient than a baseline VP-augmented 6-issue superscalar while having a significantly narrower 4-issue out-of-order engine.
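
The routing of instructions around the out-of-order engine described in this abstract can be sketched as a simple classification. This is a hedged illustration only: the `Instr` fields, the `route` function, and the three labels are hypothetical names, and the two conditions are a simplified reading of the abstract rather than the paper's actual pipeline logic.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    single_cycle_alu: bool            # simple ALU operation with one-cycle latency
    operands_known_in_frontend: bool  # operands were predicted in the front-end
    result_predicted: bool            # the instruction's own result was predicted


def route(instr):
    """Pick an execution stage for one instruction (simplified assumption)."""
    if instr.single_cycle_alu and instr.operands_known_in_frontend:
        return "early"   # executed in-place and in-order in the front-end
    if instr.single_cycle_alu and instr.result_predicted:
        return "late"    # executed just before commit, where validation happens
    return "ooo"         # everything else goes through the out-of-order engine


def bypass_fraction(instrs):
    """Fraction of instructions that never enter the out-of-order engine."""
    bypassed = sum(1 for i in instrs if route(i) != "ooo")
    return bypassed / len(instrs)
```

The fraction returned by `bypass_fraction` corresponds to the kind of quantity the abstract reports as 10% to 60%; the higher it is, the narrower the out-of-order issue width can be made.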

    DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks

    Data movement between the CPU and main memory is a first-order obstacle against improving performance, scalability, and energy efficiency in modern systems. Computer systems employ a range of techniques to reduce overheads tied to data movement, spanning from traditional mechanisms (e.g., deep multi-level cache hierarchies, aggressive hardware prefetchers) to emerging techniques such as Near-Data Processing (NDP), where some computation is moved close to memory. Our goal is to methodically identify potential sources of data movement over a broad set of applications and to comprehensively compare traditional compute-centric data movement mitigation techniques to more memory-centric techniques, thereby developing a rigorous understanding of the best techniques to mitigate each source of data movement. With this goal in mind, we perform the first large-scale characterization of a wide variety of applications, across a wide range of application domains, to identify fundamental program properties that lead to data movement to/from main memory. We develop the first systematic methodology to classify applications based on the sources contributing to data movement bottlenecks. From our large-scale characterization of 77K functions across 345 applications, we select 144 functions to form the first open-source benchmark suite (DAMOV) for main memory data movement studies. We select a diverse range of functions that (1) represent different types of data movement bottlenecks, and (2) come from a wide range of application domains. Using NDP as a case study, we identify new insights about the different data movement bottlenecks and use these insights to determine the most suitable data movement mitigation mechanism for a particular application. We open-source DAMOV and the complete source code for our new characterization methodology at https://github.com/CMU-SAFARI/DAMOV.

    Mitigating the impact of decompression latency in L1 compressed data caches via prefetching

    Expanding cache size is a common approach for reducing cache miss rates and increasing performance in processors. This approach, however, comes at a cost of increased static and dynamic power consumption by the cache. Static power scales with the number of transistors in the design, while dynamic power increases with the number of transistors being switched and the effective operating frequency of the cache. Cache compression is a technique that can increase the effective capacity of cache memory without experiencing the same gains in static and dynamic power consumption. Alternatively, this technique can reduce the physical size and therefore the static and dynamic energy usage of the cache while maintaining reasonable effective cache capacity. A drawback of compression is that a delay, or decompression latency, is experienced when accessing the compressed data, which affects the critical execution path of the processor. This latency can have a noticeable impact on processor performance, especially when implemented in first level caches. Cache prefetching techniques have been used to hide the latency of lower level memory accesses. This work aims to investigate the combination of current prefetching techniques and cache compression techniques to reduce the effect of decompression latency and therefore improve the feasibility of power reduction via compression in high level caches. We propose an architecture that combines L1 data cache compression with table-based prefetching to predict which cache lines will require decompression. The architecture then performs decompression in parallel, moving the delay due to decompression off the critical path of the processor. The architecture is verified using 90nm CMOS technology simulations in a new branch of SimpleScalar, using Wattch as a baseline, and cache model inputs from CACTI. Compression and decompression hardware are synthesized using the 90nm Cadence GPDK and verified at the register-transfer level. 
Our results demonstrate that Base-Delta-Immediate (BΔI) compression combined with Last Outcome (LO), Stride (S), and Two-Level (2L) prefetch methods, or with hybrid combinations of these methods (S/LO or 2L/S), improves performance over BΔI compression alone in the L1 data cache. On average, across the SPEC CPU 2000 benchmarks tested, BΔI compression alone results in a slowdown of 3.6%. Implementing a 1K-set Last Outcome prefetch mechanism reduces the slowdown to 2.1% and reduces the energy consumption of the L1 data cache by 21% versus a baseline scheme with no compression.
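
To illustrate how a table-based predictor might decide which cache line to decompress ahead of time, here is a minimal sketch. It is an assumption-laden toy, not the architecture from the paper: the `StridePredictor` class, its per-PC table of (last line, stride), and the `covered_accesses` helper are hypothetical. With a stride of zero the predictor behaves like a last-outcome scheme.

```python
class StridePredictor:
    """Per-PC table predicting the next cache line index (hypothetical sketch)."""

    def __init__(self):
        self.table = {}  # pc -> (last_line, stride)

    def predict(self, pc):
        """Return the line expected next for this PC, or None if unseen."""
        entry = self.table.get(pc)
        if entry is None:
            return None
        last_line, stride = entry
        return last_line + stride

    def update(self, pc, line):
        """Record the line actually accessed and the stride that led to it."""
        entry = self.table.get(pc)
        stride = line - entry[0] if entry is not None else 0
        self.table[pc] = (line, stride)


def covered_accesses(trace):
    """Count accesses whose line was predicted, so decompression could start early."""
    predictor = StridePredictor()
    covered = 0
    for pc, line in trace:
        if predictor.predict(pc) == line:
            covered += 1
        predictor.update(pc, line)
    return covered
```

Each covered access is one where decompression could, in principle, have been started before the load needed the data, which is how the delay moves off the critical path.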

    Understanding and Improving the Latency of DRAM-Based Memory Systems

    Over the past two decades, the storage capacity and access bandwidth of main memory have improved tremendously, by 128x and 20x, respectively. These improvements are mainly due to the continuous technology scaling of DRAM (dynamic random-access memory), which has been used as the physical substrate for main memory. In stark contrast with capacity and bandwidth, DRAM latency has remained almost constant, reducing by only 1.3x in the same time frame. Therefore, long DRAM latency continues to be a critical performance bottleneck in modern systems. Increasing core counts, and the emergence of increasingly more data-intensive and latency-critical applications further stress the importance of providing low-latency memory access. In this dissertation, we identify three main problems that contribute significantly to long latency of DRAM accesses. To address these problems, we present a series of new techniques. Our new techniques significantly improve both system performance and energy efficiency. We also examine the critical relationship between supply voltage and latency in modern DRAM chips and develop new mechanisms that exploit this voltage-latency trade-off to improve energy efficiency. The key conclusion of this dissertation is that augmenting DRAM architecture with simple and low-cost features, and developing a better understanding of manufactured DRAM chips together lead to significant memory latency reduction as well as energy efficiency improvement. We hope and believe that the proposed architectural techniques and the detailed experimental data and observations on real commodity DRAM chips presented in this dissertation will enable development of other new mechanisms to improve the performance, energy efficiency, or reliability of future memory systems.

    Applying Perceptrons to Speculation in Computer Architecture

    Speculation plays an ever-increasing role in optimizing the execution of programs in computer architecture. Speculative decision-makers are typically required to have high speed and small size, thus limiting their complexity and capability. Because of these restrictions, predictors often consider only a small subset of the available data in making decisions, and consequently do not realize their potential accuracy. Perceptrons, or simple neural networks, can be highly useful in speculation for their ability to examine larger quantities of available data, and identify which data lead to accurate results. Recent research has demonstrated that perceptrons can operate successfully within the strict size and latency restrictions of speculation in computer architecture. This dissertation first studies how perceptrons can be made to predict accurately when they directly replace the traditional pattern table predictor. Several weight training methods and multiple-bit perceptron topologies are modeled and evaluated in their ability to learn data patterns that pattern tables can learn. The effects of interference between past data on perceptrons are evaluated, and different interference reduction strategies are explored. Perceptrons are then applied to two speculative applications: data value prediction and dataflow critical path prediction. Several new perceptron value predictors are proposed that can consider longer or more varied data histories than existing table-based value predictors. These include a global-based local predictor that uses global correlations between data values to predict past local values, a global-based global predictor that uses global correlations to predict past global values, and a bitwise predictor that can use global correlations to generate new data values. 
Several new perceptron criticality predictors are proposed that use global correlations between instruction behaviors to accurately determine whether instructions lie on the critical path. These predictors are evaluated against local table-based approaches on a custom cycle-accurate processor simulator, and are shown on average to have both superior accuracy and higher instruction-per-cycle performance. Finally, the perceptron predictors are simulated using the different weight training approaches and multiple-bit topologies. It is shown that for these applications, perceptron topologies and training approaches must be selected that respond well to highly imbalanced and poorly correlated past data patterns.
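
A perceptron that learns which history bits correlate with an outcome can be sketched in a few lines. This is a generic single-output perceptron with threshold-based training, offered as an assumption about the general technique rather than any of the specific predictors proposed in the dissertation; the class name, the `theta` default, and the encoding of history and outcome as +1/-1 are illustrative choices.

```python
class PerceptronPredictor:
    """Single perceptron over a +1/-1 history vector (illustrative sketch)."""

    def __init__(self, history_len, theta=4):
        self.weights = [0] * (history_len + 1)  # weights[0] is the bias
        self.theta = theta                      # keep training while |output| <= theta

    def output(self, history):
        """Dot product of the weights with the history, plus the bias."""
        return self.weights[0] + sum(
            w * x for w, x in zip(self.weights[1:], history)
        )

    def predict(self, history):
        """Return +1 or -1 (for example: critical or not critical)."""
        return 1 if self.output(history) >= 0 else -1

    def train(self, history, outcome):
        """Adjust weights on a misprediction or a low-confidence output."""
        y = self.output(history)
        predicted = 1 if y >= 0 else -1
        if predicted != outcome or abs(y) <= self.theta:
            self.weights[0] += outcome
            for i, x in enumerate(history):
                self.weights[i + 1] += outcome * x
```

After training, weights with large magnitude mark the history positions that correlate strongly with the outcome, while weights near zero mark positions the predictor has learned to ignore. This is the property that lets a perceptron consider a longer history than a pattern table of comparable size.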