11 research outputs found

    The Design of Advanced Value Predictors

    Get PDF
    蚈畫線號NSC90-2213-E032-019研究期間200108~200207研究經èČ»ïŒš357,000[[sponsorship]]èĄŒæ”żé™ąćœ‹ćź¶ç§‘ć­žć§”ć“Ą

    Load Value Approximation: Approaching the Ideal Memory Access Latency

    Get PDF
    Approximate computing recognizes that many applications can tolerate inexactness. These applications, which range from multimedia processing to machine learning, operate on inherently noisy and imprecise data. As a result, we can trade off some loss in output value integrity for improved processor performance and energy efficiency. In this paper, we introduce load value approximation. In modern processors, upon a load miss in the private cache, the data must be retrieved from main memory or from the higher-level caches. These accesses are costly in terms of both latency and energy. We implement load value approximators, hardware structures that learn value patterns and generate approximations of the data. The processor can then use these approximate values to continue executing without incurring the high cost of accessing memory. We show that load value approximators can achieve high coverage while maintaining very low error in the application's output. By exploiting the approximate nature of applications, we can draw closer to the ideal memory access latency.
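    As a hypothetical sketch of the idea (not the approximator design from the paper), the code below keeps a small table indexed by load PC that learns a last-value-plus-stride pattern and only supplies an approximate value once a confidence counter saturates. Table size, thresholds, and the update policy are assumptions.

# Illustrative load value approximator: a small table, indexed by load PC,
# tracks the last loaded value and its stride. Sizes and thresholds are
# assumptions, not the structure evaluated in the paper.

class LoadValueApproximator:
    def __init__(self, entries=256, conf_threshold=2, conf_max=3):
        self.entries = entries
        self.conf_threshold = conf_threshold
        self.conf_max = conf_max
        self.table = {}  # index -> [last_value, stride, confidence]

    def _index(self, pc):
        return pc % self.entries

    def approximate(self, pc):
        """On an L1 load miss, return an approximate value if confident, else None."""
        entry = self.table.get(self._index(pc))
        if entry and entry[2] >= self.conf_threshold:
            return entry[0] + entry[1]  # last value plus learned stride
        return None

    def train(self, pc, actual_value):
        """Called once the real value arrives from memory or a higher-level cache."""
        idx = self._index(pc)
        entry = self.table.get(idx)
        if entry is None:
            self.table[idx] = [actual_value, 0, 0]
            return
        last, stride, conf = entry
        new_stride = actual_value - last
        if new_stride == stride:
            conf = min(conf + 1, self.conf_max)  # pattern holds: gain confidence
        else:
            conf = max(conf - 1, 0)              # pattern broke: lose confidence
        self.table[idx] = [actual_value, new_stride, conf]

    Because the supplied value is only an approximation, a structure like this would be restricted to loads whose consumers can tolerate imprecision, which is the premise of the approximate-computing setting described above.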

    Fluid Stochastic Petri Nets: From Fluid Atoms in ILP Processor Pipelines to Fluid Atoms in P2P Streaming Networks

    Get PDF
    © 2012 Mitrevski and Kotevski, licensee InTech. This is an open access chapter distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

    EOLE: Paving the Way for an Effective Implementation of Value Prediction

    Get PDF
    Published at the International Symposium on Computer Architecture (ISCA) 2014. Link: http://people.irisa.fr/Arthur.Perais/data/ISCA%2714_EOLE.pdf
    Even in the multicore era, there is a continuous demand to increase the performance of single-threaded applications. However, the conventional path of increasing both issue width and instruction window size inevitably leads to the power wall. Value prediction (VP) was proposed in the mid-1990s as an alternative path to further enhance the performance of wide-issue superscalar processors. Still, until recently a performance-effective implementation of value prediction was considered to add tremendous complexity and power consumption to almost every stage of the pipeline. Recent work in the field of VP has shown that, given an efficient confidence estimation mechanism, prediction validation can be removed from the out-of-order engine and delayed until commit time. As a result, recovering from mispredictions via selective replay can be avoided and a much simpler mechanism, pipeline squashing, can be used, while the out-of-order engine remains mostly unmodified. Nonetheless, VP with validation at commit time entails strong constraints on the physical register file: write ports are needed to write predicted results, and read ports are needed to validate them at commit time, potentially rendering the overall number of ports impractical. Fortunately, VP also implies that many single-cycle ALU instructions have their operands predicted in the front-end and can be executed in-place and in-order. Similarly, the execution of single-cycle instructions whose result has been predicted can be delayed until just before commit, since predictions are validated at commit time. Consequently, a significant number of instructions (10% to 60% in our experiments) can bypass the out-of-order engine, allowing a reduction of the issue width, which is a major contributor to both out-of-order engine complexity and register file port requirements. This reduction paves the way for a truly practical implementation of value prediction. Furthermore, since value prediction in itself usually increases performance, the resulting {Early | Out-of-Order | Late} Execution (EOLE) architecture is often more efficient than a baseline VP-augmented 6-issue superscalar while having a significantly narrower 4-issue out-of-order engine.
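    As a rough, hypothetical illustration of the flow described above (not the hardware organization proposed in the paper), the sketch below shows a confidence-gated value predictor that supplies a value in the front-end, with validation deferred to commit time and a full pipeline squash on a misprediction instead of selective replay. Table size, confidence thresholds, and the squash hook are assumptions.

# Hypothetical sketch of confidence-gated value prediction with validation
# deferred to commit time; recovery is a pipeline squash, not selective replay.
from dataclasses import dataclass
from typing import Optional

CONF_THRESHOLD = 3
CONF_MAX = 7

@dataclass
class Insn:
    pc: int
    result: int                     # architecturally correct value, known at commit
    predicted_value: Optional[int]  # value injected in the front-end, or None

class ValuePredictor:
    def __init__(self):
        self.table = {}  # pc -> (last_value, confidence)

    def predict(self, pc):
        """Front-end: hand out a prediction only when confidence is high."""
        value, conf = self.table.get(pc, (0, 0))
        return value if conf >= CONF_THRESHOLD else None

    def update(self, pc, actual):
        """Commit: train with the correct value; reset confidence on a change."""
        value, conf = self.table.get(pc, (0, 0))
        if actual == value:
            conf = min(conf + 1, CONF_MAX)
        else:
            value, conf = actual, 0
        self.table[pc] = (value, conf)

def commit(insn, predictor, squash_pipeline):
    """Validate at commit; a wrong prediction squashes younger instructions."""
    predictor.update(insn.pc, insn.result)
    if insn.predicted_value is not None and insn.predicted_value != insn.result:
        squash_pipeline(insn)

    Gating predictions on a saturated confidence counter is what makes commit-time validation viable: squashes are expensive, so the predictor stays silent unless it has repeatedly been right for that instruction.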

    DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks

    Full text link
    Data movement between the CPU and main memory is a first-order obstacle to improving performance, scalability, and energy efficiency in modern systems. Computer systems employ a range of techniques to reduce the overheads tied to data movement, spanning from traditional mechanisms (e.g., deep multi-level cache hierarchies, aggressive hardware prefetchers) to emerging techniques such as Near-Data Processing (NDP), where some computation is moved close to memory. Our goal is to methodically identify potential sources of data movement over a broad set of applications and to comprehensively compare traditional compute-centric data movement mitigation techniques to more memory-centric techniques, thereby developing a rigorous understanding of the best techniques to mitigate each source of data movement. With this goal in mind, we perform the first large-scale characterization of a wide variety of applications, across a wide range of application domains, to identify fundamental program properties that lead to data movement to/from main memory. We develop the first systematic methodology to classify applications based on the sources contributing to data movement bottlenecks. From our large-scale characterization of 77K functions across 345 applications, we select 144 functions to form the first open-source benchmark suite (DAMOV) for main memory data movement studies. We select a diverse range of functions that (1) represent different types of data movement bottlenecks, and (2) come from a wide range of application domains. Using NDP as a case study, we identify new insights about the different data movement bottlenecks and use these insights to determine the most suitable data movement mitigation mechanism for a particular application. We open-source DAMOV and the complete source code for our new characterization methodology at https://github.com/CMU-SAFARI/DAMOV.
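    The abstract does not spell out the classification criteria; purely as an illustrative sketch, profiled functions could be bucketed by simple memory-boundedness metrics such as last-level-cache misses per kilo-instruction (MPKI) and arithmetic intensity. The metric names and thresholds below are assumptions, not DAMOV's actual methodology.

# Purely illustrative bucketing of profiled functions by coarse
# memory-boundedness metrics; not DAMOV's methodology or thresholds.

def classify_function(llc_mpki, flops_per_byte):
    """Return a coarse data-movement bottleneck class for one function."""
    if llc_mpki > 10 and flops_per_byte < 0.25:
        return "memory-bound (candidate for near-data processing)"
    if llc_mpki > 1:
        return "cache-sensitive (deeper hierarchies / prefetching may help)"
    return "compute-bound (data movement is not the bottleneck)"

# Hypothetical profile data: function -> (LLC MPKI, flops per byte)
profile = {"kernelA": (22.0, 0.05), "kernelB": (3.1, 0.6), "kernelC": (0.2, 4.0)}
for name, (mpki, intensity) in profile.items():
    print(f"{name}: {classify_function(mpki, intensity)}")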

    Mitigating the impact of decompression latency in L1 compressed data caches via prefetching

    Get PDF
    Expanding cache size is a common approach for reducing cache miss rates and increasing performance in processors. This approach, however, comes at the cost of increased static and dynamic power consumption by the cache. Static power scales with the number of transistors in the design, while dynamic power increases with the number of transistors being switched and the effective operating frequency of the cache. Cache compression is a technique that can increase the effective capacity of cache memory without a commensurate increase in static and dynamic power consumption. Alternatively, the technique can reduce the physical size, and therefore the static and dynamic energy usage, of the cache while maintaining a reasonable effective cache capacity. A drawback of compression is that a delay, the decompression latency, is incurred when accessing compressed data, which affects the critical execution path of the processor. This latency can have a noticeable impact on processor performance, especially when compression is implemented in first-level caches. Cache prefetching techniques have long been used to hide the latency of lower-level memory accesses. This work investigates combining current prefetching techniques with cache compression to reduce the effect of decompression latency and therefore improve the feasibility of power reduction via compression in first-level caches. We propose an architecture that combines L1 data cache compression with table-based prefetching to predict which cache lines will require decompression. The architecture then performs decompression in parallel, moving the delay due to decompression off the critical path of the processor. The architecture is verified using 90nm CMOS technology simulations in a new branch of SimpleScalar, using Wattch as a baseline and cache model inputs from CACTI. Compression and decompression hardware are synthesized using the 90nm Cadence GPDK and verified at the register-transfer level. Our results demonstrate that using Base-Delta-Immediate (BΔI) compression in combination with Last Outcome (LO), Stride (S), and Two-Level (2L) prefetch methods, or hybrid combinations of these methods (S/LO or 2L/S), improves performance over Base-Delta-Immediate (BΔI) compression alone in the L1 data cache. On average, across the SPEC CPU 2000 benchmarks tested, Base-Delta-Immediate (BΔI) compression results in a slowdown of 3.6%. Adding a 1K-set Last Outcome prefetch mechanism reduces the slowdown to 2.1% and reduces the energy consumption of the L1 data cache by 21% versus a baseline scheme with no compression.
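    For context, Base-Delta-Immediate (BΔI) compression represents a cache line as one base value plus narrow per-word deltas whenever the line's words are numerically close. The sketch below checks a single base-plus-delta configuration and is a simplified illustration, not the synthesized hardware from this work; real BΔI evaluates several base and delta widths, plus zero and repeated-value patterns, in parallel.

# Simplified base+delta check in the spirit of BΔI compression:
# a line compresses if every word fits in a narrow signed delta from a base.

def compress_base_delta(words, delta_bytes):
    """Return (base, deltas) if the line is compressible, else None."""
    base = words[0]
    limit = 1 << (8 * delta_bytes - 1)  # signed range for delta_bytes-byte deltas
    deltas = [w - base for w in words]
    if all(-limit <= d < limit for d in deltas):
        return base, deltas             # store the base once plus small deltas
    return None

line = [0x1000, 0x1008, 0x1010, 0x1018, 0x1020, 0x1028, 0x1030, 0x1038]
print(compress_base_delta(line, delta_bytes=1))                  # compressible
print(compress_base_delta([0x1000, 0xFFFF0000], delta_bytes=1))  # not compressible

    The prefetching idea in this work is orthogonal to the compression algorithm itself: a table-based prefetcher predicts which compressed lines are about to be accessed so that their decompression can start early, off the critical path.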

    Understanding and Improving the Latency of DRAM-Based Memory Systems

    Full text link
    Over the past two decades, the storage capacity and access bandwidth of main memory have improved tremendously, by 128x and 20x, respectively. These improvements are mainly due to the continuous technology scaling of DRAM (dynamic random-access memory), which has been used as the physical substrate for main memory. In stark contrast with capacity and bandwidth, DRAM latency has remained almost constant, decreasing by only 1.3x in the same time frame. Therefore, long DRAM latency continues to be a critical performance bottleneck in modern systems. Increasing core counts and the emergence of increasingly data-intensive and latency-critical applications further stress the importance of providing low-latency memory access. In this dissertation, we identify three main problems that contribute significantly to the long latency of DRAM accesses. To address these problems, we present a series of new techniques. Our new techniques significantly improve both system performance and energy efficiency. We also examine the critical relationship between supply voltage and latency in modern DRAM chips and develop new mechanisms that exploit this voltage-latency trade-off to improve energy efficiency. The key conclusion of this dissertation is that augmenting the DRAM architecture with simple and low-cost features, and developing a better understanding of manufactured DRAM chips, together lead to significant memory latency reduction as well as energy efficiency improvement. We hope and believe that the proposed architectural techniques and the detailed experimental data and observations on real commodity DRAM chips presented in this dissertation will enable the development of other new mechanisms to improve the performance, energy efficiency, or reliability of future memory systems. (PhD dissertation)

    Applying Perceptrons to Speculation in Computer Architecture

    Get PDF
    Speculation plays an ever-increasing role in optimizing the execution of programs in computer architecture. Speculative decision-makers are typically required to have high speed and small size, thus limiting their complexity and capability. Because of these restrictions, predictors often consider only a small subset of the available data in making decisions, and consequently do not realize their potential accuracy. Perceptrons, or simple neural networks, can be highly useful in speculation for their ability to examine larger quantities of available data and identify which data lead to accurate results. Recent research has demonstrated that perceptrons can operate successfully within the strict size and latency restrictions of speculation in computer architecture. This dissertation first studies how perceptrons can be made to predict accurately when they directly replace the traditional pattern-table predictor. Several weight training methods and multiple-bit perceptron topologies are modeled and evaluated in their ability to learn data patterns that pattern tables can learn. The effects of interference between past data on perceptrons are evaluated, and different interference reduction strategies are explored. Perceptrons are then applied to two speculative applications: data value prediction and dataflow critical path prediction. Several new perceptron value predictors are proposed that can consider longer or more varied data histories than existing table-based value predictors. These include a global-based local predictor that uses global correlations between data values to predict past local values, a global-based global predictor that uses global correlations to predict past global values, and a bitwise predictor that can use global correlations to generate new data values. Several new perceptron criticality predictors are proposed that use global correlations between instruction behaviors to accurately determine whether instructions lie on the critical path. These predictors are evaluated against local table-based approaches on a custom cycle-accurate processor simulator, and are shown on average to have both superior accuracy and higher instructions-per-cycle (IPC) performance. Finally, the perceptron predictors are simulated using the different weight training approaches and multiple-bit topologies. It is shown that for these applications, perceptron topologies and training approaches must be selected that respond well to highly imbalanced and poorly correlated past data patterns.
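    A minimal sketch of the underlying mechanism (a perceptron predictor in the style of Jimenez and Lin, not the dissertation's specific value or criticality predictors): the prediction is the sign of a dot product between signed weights and a +1/-1 outcome history, and weights are trained only on a misprediction or when the output magnitude is below a threshold. History length, table size, and the threshold formula are illustrative assumptions.

# Illustrative perceptron predictor: predict from the sign of a weighted sum
# over a +/-1 history; train on mispredicts or low-magnitude outputs.

class PerceptronPredictor:
    def __init__(self, num_perceptrons=1024, history_len=32, threshold=None):
        self.num = num_perceptrons
        self.history_len = history_len
        # One weight vector per entry; index 0 is the bias weight.
        self.weights = [[0] * (history_len + 1) for _ in range(num_perceptrons)]
        self.history = [1] * history_len  # +1 = outcome was true, -1 = false
        # Commonly cited rule of thumb for the training threshold.
        self.threshold = threshold if threshold is not None else int(1.93 * history_len + 14)

    def _output(self, pc):
        w = self.weights[pc % self.num]
        return w[0] + sum(wi * hi for wi, hi in zip(w[1:], self.history))

    def predict(self, pc):
        return self._output(pc) >= 0  # True = predict the positive outcome

    def train(self, pc, outcome):
        """Update weights with the actual boolean outcome, then shift the history."""
        y = self._output(pc)
        t = 1 if outcome else -1
        w = self.weights[pc % self.num]
        if (y >= 0) != outcome or abs(y) <= self.threshold:
            w[0] += t
            for i, hi in enumerate(self.history, start=1):
                w[i] += t * hi
        self.history = [t] + self.history[:-1]

    The threshold-gated update stops the weights from growing without bound once the perceptron already classifies the observed history patterns with sufficient margin.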